[00:05] jrwr, not that it helps, but I'm sending all my sympathy and good thoughts. [00:05] Im going to hack up some cgi-bin [00:05] and compile php7 on this box [00:05] wouldnt be the first time I've worked around shit like this (LOOKING AT YOUR FERAL) [00:56] so I turned on file based cache [00:57] it should help /some/ [00:57] I'm working with apache some to get this working its being a PITA, I suspect an apache module doing this [01:08] *** BnAboyZ has joined #archiveteam-bs [01:27] BTW, regarding WARC uploads going into the Wayback Machine -- I've now gotten confirmation that it is still a trusted-uploaders-only process (which isn't surprising). [01:28] JAA is trusted, and ivan as well, presumably. [01:42] So SketchCow, I'm stuck, the PHP is too old to update mediawiki, Apache is not behaving with the CGI override due to mod_security being forced on, and overall the entire account is limited to 7 (confirmed) connections concurrent (thats whats causing the resource limit pages currently) [01:42] I've added the static file cache and it is helping [01:44] There'll be some roughness as we figure out what to do. [01:44] Ya [01:44] its using 2.6 as its kernel.... [01:44] But if I have intelligent requests for the host, I'm sure they can help. [01:46] Ok, So the main one is can I have my limits increased for the number of CGI scripts run at one time. I keep getting resource limit errors on top of this error log: [Wed Dec 20 20:41:37 2017] [error] mod_hostinglimits:Error on LVE enter: LVE(527) HANDLER(application/x-httpd-php5) HOSTNAME(archiveteam.org) URL(/index.php) TID(318310) errno (7) Read more: http://e.cloudlinux.com/MHL-E2BIG min_uid (0) [01:46] Well, assemble them all in one place for me. [01:47] I mean after a day of looking it COMPLETELY over [01:47] And then I'll bring it to TQ and see what they thing [01:47] think [01:47] Ok [01:47] No sense in piecemealing [01:47] Also, let me glance at the cpanel [01:49] Ok [01:50] Ya, its pretty much those two issues I have with it, I'm compiling them into a google sheet for tracking [01:58] Wow. I was having problems compiling WARC files in javascript and was going to ask if there was a preexisting API for something like that, but I can barely even read what you guys are saying. [01:58] Im poking the poor wiki very hard [01:59] whats up jacketcha [01:59] WARC reading in javascript, hrm [01:59] yeah [01:59] Well, the WARC standard is pretty simple overall [01:59] its all about the indexing and lookups that make it fast and JavaScript is not that great at it. [02:00] https://www.npmjs.com/package/node-warc [02:00] have some node [02:00] its /javascript/ [02:00] thanks [02:01] I was planning to add it to my chrome extension [02:01] ah [02:02] im logging off for now SketchCow, See ya in the morning [02:13] *** robink has quit IRC (Ping timeout: 246 seconds) [02:15] ok, so can somebody explain to me how warc files work [02:15] sorry for being dumb [02:15] i whonestly have no idea [02:15] *honestly [02:16] do you have a more specific question? [02:20] How is the data structured? I am going to assume that it isn't just copying in the HTML source code after the headers are added. [02:20] It stores the full response headers and body [02:21] That includes responses containing binary data, HTML, CSS, plain text, whatever [02:22] Is there any specific order to that? [02:24] Records can be in any order AFAIK [02:25] Great, so it'll work just fine with asynchronous saving. Thanks, that was actually really helpful.
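A minimal sketch of the record structure described above: a WARC file is a flat sequence of records, each carrying its own WARC headers followed by the raw HTTP headers and body. This assumes the Python warcio library and a hypothetical file name; the chat itself only mentions node-warc:

    from warcio.archiveiterator import ArchiveIterator

    # Stream through a (possibly gzipped) WARC and print each captured URL.
    with open('example.warc.gz', 'rb') as stream:      # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':          # skip request/metadata records
                url = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.get_statuscode()
                body = record.content_stream().read()  # full response body, as bytes
                print(status, url, len(body))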
[02:28] http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/index.html [02:28] thanks! [02:29] there's WARC 1.1 (which is the latest version) linked there too [02:38] *** bithippo has joined #archiveteam-bs [02:43] Thinking about grabbing Imgur. All of it. Anything I should keep in mind prior to putting it in cold storage? [02:44] (iterating over every permutation of image urls based on how Imgur generates image urls) [02:47] Is the way Imgur generates urls known? [02:47] Better question, is the source of data for the RNG Imgur uses known? [02:48] *** robink has joined #archiveteam-bs [02:50] https://blog.imgur.com/2013/01/18/more-characters-in-filenames/ [02:50] "Choosing 5 characters from 26 lowercase letters + 26 uppercase letters + 10 numerical digits, leaves us with 916,132,832 possible combinations (62^5). Upgrading to 7 characters gives us 3,521,614,606,208 (3.52 trillion) possibilities." [02:53] 404->check back in the future, 200->WARC gz [02:53] Etag header on a request is the MD5 of the image [02:53] That is still 3 trillion web requests [02:53] ¯\_(ツ)_/¯ [02:54] Alternatives? Besides waiting until Imgur runs out of runway and then its more pressing :/ [02:55] (ie Twitpic 2.0) [02:56] My only frustration is that the URL isn't deterministic from a hash of the image, so it's possible an image exists, is deleted, and then replaced without any way to know [02:58] Unless it was already archived [02:58] Look, it's a good idea in practice, but here's the thing [02:58] Ahh, truth [02:58] Imgur gets around 17.3611111111 new images per second [02:59] That would place it at around 2687000000 images today [03:01] Le sigh. [03:01] It gets worse [03:02] That means you have a roughly 0.076300228743465619423705658937702969914635168416766807608280491104220976146849283303277891127907878357259164917744009861455363906114286574925748247085571170136781627322552461470824865159611513732687195% chance of getting an image every time you send a request [03:03] Don't be fooled by the high precision, the accuracy of your plan is very low. [03:03] But, there is a way to make it higher. [03:04] Much higher, in fact [03:04] If you can figure out the source of the data that Imgur uses for its random number generation algorithms, you can at least grab the newest images [03:06] Sounds workable using their API to get latest image paths and then working backwards [03:07] possibly [03:08] But if it is randomly generated, even pseudorandomly generated, you're still screwed. [03:09] or [03:09] you know [03:09] you could just email them [03:09] and ask [03:10] "Hi. I will take one Imgur pls. Will be over with hard drives shortly." [03:10] Appreciate the input! [03:11] No problem [03:19] hold up [03:20] If a Warrior or ArchiveBot finds a WARC file, is it added to the collection of WARC files or is it added into the WARC file of the site it is located on?
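A quick back-of-the-envelope check of the Imgur figures quoted above (a sketch only; the image count and ID length are the estimates from the conversation, not verified numbers):

    # 7-character IDs drawn from [a-zA-Z0-9] give 62**7 possible URLs.
    id_space = 62 ** 7                       # 3,521,614,606,208
    estimated_images = 2687000000            # figure quoted above
    hit_rate = estimated_images / id_space   # chance a random ID resolves
    print(id_space, '{:.4%}'.format(hit_rate))   # ~0.0763% per blind request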
[03:23] i'm up to 18k items this month [03:23] this year has been slower then last year [03:23] i just hope i can get it past 100k for the year [03:24] https://archive.org/details/@chris85?&and[]=addeddate:2017 [03:24] it 97,682 so far [03:24] woah [03:24] maybe I should start counting mine [03:24] and putting it in actual warc files [03:33] *** pizzaiolo has quit IRC (Remote host closed the connection) [03:45] *** robink has quit IRC (Ping timeout: 246 seconds) [03:48] jacketcha: if a HTTP request returns a WARC file, and that HTTP request and response is being stored into a WARC file, [03:48] then, yes, you'll have nested WARC-formatted data [03:48] AFAIK, no WARC-recording tool will automatically un-nest it (and that would probably not be a good idea in any case) [03:55] wait [03:56] so that means that there possibly could be an archive of the entire internet floating around the wayback machine somewhere, but nobody would ever know because it was nested. [04:01] This is..... [04:01] See, this is one of the things [04:01] You are asking... well, you're asking for a college course in how WARC works [04:01] It's sort of on topic and sort of off [04:01] It's certainly sucking all the air out of the room [04:01] It's nice to see people talking [04:01] so nested warc files are basically politics [04:02] got it [04:11] *** bithippo has quit IRC (Ping timeout: 260 seconds) [04:14] No. [04:14] You're wandering into a welding shop going "So.... why cold welds" [04:16] that seems very accurate [04:20] *** bithippo has joined #archiveteam-bs [04:33] *** kyounko has joined #archiveteam-bs [04:55] *** qw3rty117 has joined #archiveteam-bs [05:01] *** qw3rty116 has quit IRC (Read error: Operation timed out) [05:22] *** bithippo has quit IRC (Quit: Page closed) [05:29] *** Stiletto has quit IRC (Read error: Operation timed out) [05:34] *** Stilett0 has joined #archiveteam-bs [05:35] *** BlueMaxim has quit IRC (Read error: Operation timed out) [05:36] *** BlueMaxim has joined #archiveteam-bs [06:02] *** wp494 has quit IRC (Ping timeout: 250 seconds) [06:02] *** wp494 has joined #archiveteam-bs [06:12] *** zgrant has left [06:15] *** wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES) [06:16] *** wp494 has joined #archiveteam-bs [06:27] *** kimmer1 has joined #archiveteam-bs [06:29] *** midas2 has quit IRC (Ping timeout: 1212 seconds) [06:33] *** kimmer12 has quit IRC (Ping timeout: 633 seconds) [06:47] *** midas2 has joined #archiveteam-bs [07:24] *** wp494_ has joined #archiveteam-bs [07:29] *** ZexaronS has quit IRC (Read error: Connection reset by peer) [07:30] *** ZexaronS has joined #archiveteam-bs [07:31] *** wp494 has quit IRC (Read error: Operation timed out) [07:39] *** wp494_ has quit IRC (Ping timeout: 633 seconds) [07:41] *** odemg has quit IRC (Read error: Operation timed out) [07:46] *** odemg has joined #archiveteam-bs [07:53] *** wp494 has joined #archiveteam-bs [08:03] *** robink has joined #archiveteam-bs [08:06] I wonder how many times jquery has been archived [08:07] By this point, there must be at least a hundred copies made of it each week [09:09] *** jacketcha has quit IRC (Read error: Connection reset by peer) [09:09] *** jacketcha has joined #archiveteam-bs [09:16] *** Mateon1 has quit IRC (Ping timeout: 245 seconds) [09:16] *** Mateon1 has joined #archiveteam-bs [09:35] SWEET BABY JESUS [09:35] someone, I got php7 to run [09:35] on this holy shit old host [09:39] Is this a dedicated server, jrwr? 
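On the nested-WARC point above: a crawler that downloads a .warc file simply stores it as the payload of an ordinary response record, so nothing is un-nested automatically. A rough sketch of spotting such records, again assuming warcio and a hypothetical input file:

    from warcio.archiveiterator import ArchiveIterator

    # Flag response records whose payload itself looks like WARC data.
    with open('crawl.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            head = record.content_stream().read(16)
            if 'application/warc' in ctype or head.startswith(b'WARC/'):
                print('nested WARC at', record.rec_headers.get_header('WARC-Target-URI'))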
[09:39] not even close [09:40] its a shared host running linux 2.6 on an old cpanel [09:40] running a god old apache + php 5.3 [09:40] I override the mod_security and mod_suphp all to fux to get PHP scripts to run with a custom statically linked php binary I made [09:41] Wtf? 2.6 EOL’d years ago. [09:43] I'm making do with what I have [09:43] its where its staying [09:43] I'm making its own little world on this webhost [09:43] * jrwr compiles memcached [09:45] now [09:45] comes the fun part [09:45] I'm going to update mediawiki [09:46] good luck [09:47] i can't even update windows without doing a ritual to please the tech gods [09:50] this is dark magic [09:50] php does not like doing this [09:52] I have a plan [09:52] to compile php with memcached, and then run a little memcached server so mediawiki can cache objects [09:53] nah [09:53] you don't even need php [09:54] just do what I do and use 437 IFTTT applets as your server [09:54] with a touch of github pages [09:54] lol [09:54] this is the archiveteam wiki I'm working on [09:55] Is the ArchiveTeam wiki archived? [09:55] Mostly :p [09:57] You know what I want to try? My school has unlimited storage on all google accounts under its organization. I wonder how far they would let me push that. [09:58] Its staying where it is for now [09:58] for ~reasons~ [09:58] is it because you missed a semicolon somewhere but there isn't a really good php linter yet [10:00] oh no [10:01] i just remembered i have midterms [10:01] gn [10:19] oh man [10:19] thats a ton better [10:19] *** jacketcha has quit IRC (Remote host closed the connection) [10:19] Archive team is now running on mediawiki 1.30.0 [10:21] *** jacketcha has joined #archiveteam-bs [10:24] *** jacketcha has quit IRC (Remote host closed the connection) [10:29] *** jacket has joined #archiveteam-bs [10:43] *** fie has joined #archiveteam-bs [10:48] *** fie has quit IRC (Read error: Connection reset by peer) [11:07] *** pizzaiolo has joined #archiveteam-bs [11:14] Igloo: better huh? [11:16] jrwr: miles and miles [11:16] Ya [11:17] Response times are sub 200ms [11:17] Before they were 1400ms [11:20] jrwr: Well done! Much, much better. <3 [11:20] Thanks [11:21] Somebody2: Ah, makes sense. Thanks for checking. [11:21] I woke up from a strange dream at 3am (flying an airplane and somewhat crashing it) [11:22] And then had a brainwave on how to get php working correctly [11:22] Been up since then, work is going to be hell today [12:08] *** BlueMaxim has quit IRC (Leaving) [12:40] Ok, So it looks like we can iterate through the numbers [12:41] For user profiles, maybe. For characters, no way. [12:41] 212 million users? Unlikely [12:42] But it should be fairly simple to scrape them from https://www.saintsrow.com/community/characters/mostrecent [12:42] The question is, how do we get the actual characters (not just the images)? [12:42] is there a log of what has gone before? [12:42] I have an 'archiveteam' account registered along with my personal one [12:42] not sure why [12:42] maybe i just suggested scraping for SR3 [12:46] https://www.saintsrow.com/users/show/212300001 appears to be the lowest. https://www.saintsrow.com/users/show/213056573 latest [12:46] 2300001 3056573 ~705,000 user profiles?
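Iterating the Saints Row profile IDs mentioned above is plain URL generation over a numeric range. A sketch using the ID bounds from the chat; the fetching part is illustrative only, not an agreed crawl plan:

    import itertools
    import requests   # assumes the requests library is available

    FIRST_ID = 212300001   # lowest profile observed above
    LAST_ID = 213056573    # most recent profile observed above

    def profile_urls():
        for uid in range(FIRST_ID, LAST_ID + 1):
            yield 'https://www.saintsrow.com/users/show/{}'.format(uid)

    # Example: probe only the first few IDs, one at a time.
    for url in itertools.islice(profile_urls(), 5):
        print(requests.get(url, timeout=30).status_code, url)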
[12:47] Those are easy [13:22] SketchCow: email fixed [13:22] confirmed working with password resets being sent to a gmail account [13:23] Great [13:32] https://usercontent.irccloud-cdn.com/file/brVInVWJ/image.png [13:32] you can see when I dropped in the php changes [13:52] joepie91: it doesnt work like that [13:52] this whole box is from 2011 [13:53] ah, just an old cpanel then that doesn't support it, or? [14:02] ya [14:02] I have methods and apis [14:02] im patching it in [14:05] *** icedice has joined #archiveteam-bs [14:13] *** icedice has quit IRC (Ping timeout: 250 seconds) [14:13] jrwr: LOL, that graph is beautiful! [14:28] Thanks [14:28] its going up on my wall [15:18] JAA: Igloo [15:18] guess what [15:18] SSL BITCHES [15:19] Yiss [15:19] https://www.ssllabs.com/ssltest/analyze.html?d=archiveteam.org [15:19] fucking A rating! [15:20] :D :D :D https://i.imgur.com/CloHYLR.png [15:20] with Strict Transport Security (HSTS) on (left it pretty short just in case) [15:20] Just need a redirect now ;-) [15:20] ^ [15:21] na, not going to enforce it [15:21] HSTS is enough [15:23] *** zgrant has joined #archiveteam-bs [15:24] fuck it [15:24] done [15:24] SketchCow: SSL is now installed [15:24] anything else? [15:24] :) [15:24] hehe, all my home stuff with LE gets A rating too [15:24] Which is bonza [15:26] I think that's all I can think of [15:26] Someone proposed some sort of theme upgrade [15:27] But it all seems just fine to me now. [15:31] ah [15:31] its fine [15:32] I /might/ get bored and add in a new editor but the new editor requires all kinds of crazy [15:32] If people come up with things, we'll consider them now that it's possible [15:32] Generally, someone complaining they can't work on the Wiki because they miss a gimgaw is focused on the wrong things. [15:33] Ya [15:33] I am using the file based cache built into mw [15:33] so bots and stuff all get served static pages [15:37] I feel like I just refurbed my 1984 Chrysler lebaron convertible (I own one) https://drive.google.com/file/d/1AQqXNiluKTk5xuCYStfVexiH1LLUOYaLLQ/view?usp=sharing [15:43] Nice car [15:45] 900$ [15:45] runs great, and talks to you [15:46] https://www.youtube.com/watch?v=nGuRS-L2BN0 [15:53] I love the old DEC speech Synths [15:53] sound better than software ones [16:28] so another box of tapes i bought is shipped [16:50] *** dd0a13f37 has joined #archiveteam-bs [16:53] so this happened http://mashable.com/2017/12/20/sesame-street-irc-macarthur-grant-refugee-middle-east/#fIS9la5_bSq7 [17:18] *** schbirid has joined #archiveteam-bs [17:21] *** jacket has quit IRC (Read error: Connection reset by peer) [17:22] *** jacket has joined #archiveteam-bs [17:23] aria2c is a mystery [17:23] if I have it use 1 connection or 10, I still get about 2r/s [17:24] if I split it up across 6 command windows, 12r/s [17:24] might have to do with the fact that it's split across multiple IPs though [17:25] Anyone know a good tool to do this automatically? Split up http requests over multiple proxies? [17:59] *** bithippo has joined #archiveteam-bs [18:42] is there some sort of code available to look at on how IA get urls from warcs? [18:43] or does it convert first, somehow, then do _that_ ? [18:45] i was kind of expecting warcs to be a kind of archive with an index, not all data being a single file :/ [18:46] e.g containing something i could open in gedit etc..
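On the question just above about how URLs get pulled out of WARCs: the file itself is sequential, so readers typically build a separate CDX-style index that maps each URL to its byte offset in the archive. A rough sketch of that idea, assuming warcio's offset helper and a hypothetical file name:

    from warcio.archiveiterator import ArchiveIterator

    # Tiny CDX-like index: URL -> byte offset of the record inside the WARC.
    index = {}
    with open('crawl.warc.gz', 'rb') as stream:        # hypothetical file
        records = ArchiveIterator(stream)
        for record in records:
            if record.rec_type == 'response':
                url = record.rec_headers.get_header('WARC-Target-URI')
                index[url] = records.get_record_offset()

    # A reader can later seek straight to index[url] instead of scanning the whole file.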
[18:50] iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/index.html [18:51] https://github.com/recrm/ArchiveTools/wiki/warc-extractor [18:52] Can someone help me? [18:53] I'm trying to archive a site with aria2c. When using multiple concurrent connections, I get about 2r/s. The speed is around 50bkit, despite my internet connection being much faster. [18:53] When using multiple instances over multiple IPs, it's not much better. The individual speeds for some drop down to 5kbit. [18:54] Recommend using https://github.com/ludios/grab-site to archive instead [18:54] I am currently running 50 instances of aria2 with 20 concurrent connections each. I get 5r/s total. [18:54] If you need a WARC file, request/response headers, etc. [18:54] This is abysmally low, what gives? [18:54] You are most likely being throttled by IP [18:54] But I'm spreading it over 50 different IPs. [18:54] (even if distributing across multiple IPs) [18:55] Sliding window of bytes in the webserver config. Initial requests are fast, subsequent requests slow down if you try to firehose [18:55] What's the hostname? [18:55] ratsit.se [18:55] Or my hostname? [18:55] Nope, site hostname. Checking something. [18:55] So how can it throttle them to 2-5 KiB/s when using 50 different IPs, but 50 KiB/s when using 1? [18:57] Are these all anonymous web requests? Or are you signed in/setting a cookie to be logged in to fetch data? [18:58] These are all anonymous requests from tor exit nodes. No cookies are stored. [18:58] Could be throttling by tor IPs. I did that at my last gig on our Nginx servers. [18:58] (Tor requests were notoriously bad scraping actors in our case) [18:58] But I used tor IPs before too. And those were at 50KiB/s, not 2-5KiB/s. [18:59] I don't have a good answer unfortunately :/ Lots of variables that could be causing it. What's the purpose of using Tor to perform the requests? [18:59] The only logical explanation is my connection being the bottleneck, but that would put it at around 2 mbit, which is way too slow [19:00] Because I don't want to get in any trouble for the scraping, and they could ban my IP [19:01] Now it jumped up to 12 resp/s, which was my previous peak when using 6 different IPs. [19:01] Is a cloud provider VM out of the question with a slow concurrency rate? [19:01] 2 requests per second, say. [19:01] It could be on their end too. [19:02] I could just leave the computer on over night, but I would prefer not to pay any money. [19:02] Could it be they just serve 12 connections at the same time? [19:03] Now down to 7 again. Sure is a mystery what is going on... [19:06] And now back up to 14. At this speed, it will take 6 hours, which is slow but acceptable. [19:07] I can rip it for you and provide a torrent file when I'm done. [19:08] It went up to 34 now. [19:08] How? With grab-site? [19:08] They might block the IP being used in that case [19:09] 10 second wait between requests [19:09] It'll take a while, but it'll finish eventually. [19:09] Have to head out, leave me a note here if that's a plan [19:11] 10 seconds would take half a year for the whole site, and it would change during the time, so I don't think that's a good idea [19:12] But if it continues to be this fast then it should be done in a few hours, which is good. [19:21] Well, the only logical explanation is some advanced throttling algorithm in place. I can't find any other explanation for why it's so slow. 
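On the earlier question about splitting requests over multiple proxies: with plain requests you can rotate through a proxy pool yourself. A minimal sketch, assuming the requests library; the proxy addresses are placeholders:

    import itertools
    import requests

    PROXIES = [                      # placeholders -- substitute real proxies
        'http://proxy1.example:3128',
        'http://proxy2.example:3128',
    ]
    proxy_cycle = itertools.cycle(PROXIES)

    def fetch(url):
        proxy = next(proxy_cycle)    # round-robin over the pool
        return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)

    # Usage: r = fetch('https://example.com/'); print(r.status_code)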
[19:27] https://pastebin.com/VkCS1yJ1 It apparently got faster over time, with a peak of 46 resp/s, before slowing down. [19:45] any sqlite geninouses savvy to very basic sqlite reational database structure in here who wouldn't mind if ask some questions? [19:45] relational* [19:46] I have a basic knowledge, shoot [19:47] dd0a13f37: thanks. if you please take a look at the SQL here, (just page search for 'sqlqueries') https://github.com/DuckHP/twario-warrior-tool/blob/master/src/twario/sqlitetwario.py [19:48] i'm sure that sql could be done better, and i think you'll agree. Sadly my sql is quite shit [19:49] to optimize storage etc..i mean [19:49] and speed etc [19:50] The schema? [19:50] aye [19:50] You can add a constraint for TweetUserId so it has to have a corresponding entry in Users [19:51] And Users should either have id INTEGER PRIMARY KEY, or TweetUserID should be a username [19:51] yeah been thinking that so i made a 'users' table [19:51] ok [19:52] Display name isn't stable, but it might be overkill to provision for that [19:52] search for foreign key constraint [19:52] aye, i'm not even sure yet if 'tweep' reads display names [19:52] https://sqlite.org/foreignkeys.html http://www.sqlitetutorial.net/sqlite-foreign-key/ [19:53] Well, if you want to archive avatars etc it might be neat to have. You could have three tables, but it might be overkill [19:53] tweets - tweet text, date, username [19:53] users - username (not unique), date, avatar, displayname [19:53] or wait, that makes two [19:54] *** Valentine has quit IRC (Read error: Connection reset by peer) [19:54] avatar might be doable [19:54] And then just do SELECT * FROM users WHERE username = ... LIMIT 1 [19:54] ty [19:55] not sure about the syntax [19:58] the requests sql i can figure out i think, but i suck at schema/structure :/ [19:58] *** BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io) [19:59] i'll check out that link you posted. thanks [20:00] Well, have one users table that for each username can have multiple entries (e.g. if they change their avatar you get a new entry with same username) [20:00] And one tweets table, since they are immutable [20:00] *** Valentine has joined #archiveteam-bs [20:03] e.g if a tweet is identical, just have a 'content' table perhaps, and refereance that? [20:03] If you're ever building your own scraper, it seems like mobile.twitter.com is more pleasant to work with [20:03] view-source:https://twitter.com/jack view-source:https://mobile.twitter.com/jack [20:03] i'm just "re-doing" a tool called 'tweep' [20:04] As in, modifying it? [20:05] aye, this: https://github.com/haccer/tweep ..it seems to work quite well, but could use some tweaking [20:05] If you want a complete archive, you could probably crawl pretty nicely. Start off by the timeline, then see what accounts and hashtags you find. Then traverse those accounts and hashtags, see what accounts and hashtags you find. [20:06] >The --fruit feature will display Tweets that might contain sensitive info [20:06] uh [20:07] aye, have not tested that yet, but i've been thinking of removing it [20:07] basically 'user' and 'search words' is my focus [20:07] not exactly too keen on archiving 'doxing' tweets [20:08] Well, why bother taking it out? Just don't use it, or remove all documentation references to it if you're really concerned. [20:08] aye [20:09] I think mobile.twitter.com is better. 
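A sketch of the two-table layout discussed above: users keyed by a non-unique username so that avatar or display-name changes become new rows, and tweets stored once since they are immutable. Column names are illustrative, not the actual twario schema:

    import sqlite3

    conn = sqlite3.connect('twario.db')        # hypothetical database file
    conn.executescript('''
    CREATE TABLE IF NOT EXISTS users (
        id          INTEGER PRIMARY KEY,
        username    TEXT NOT NULL,             -- not unique: new row per avatar/display-name change
        displayname TEXT,
        avatar_url  TEXT,
        captured_at TEXT NOT NULL              -- local capture time
    );
    CREATE TABLE IF NOT EXISTS tweets (
        tweet_id    INTEGER PRIMARY KEY,       -- Twitter's own id
        username    TEXT NOT NULL,
        content     TEXT NOT NULL,
        tweeted_at  TEXT,
        captured_at TEXT NOT NULL
    );
    ''')

    # Most recent profile snapshot for one user, as suggested above:
    row = conn.execute(
        "SELECT * FROM users WHERE username = ? ORDER BY captured_at DESC LIMIT 1",
        ('jack',)).fetchone()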
It shows 30 tweets/page instead of 20, and the pages are faster to download [20:10] JAA: Are you still grabbing the Catalonia cameras that update every 5 minutes or so? [20:10] dd0a13f37: it seems to require a signin/account [20:10] hook54321: Yeah [20:10] I think so, at least. [20:10] Let me check. [20:10] lol [20:10] dd0a13f37: i deliberaly made myself banned from twitter :/ [20:10] I need to start recording the cameras I was recording again [20:11] Yep, it's still grabbing... something. [20:11] mobile.twitter.com doesn't need an account [20:11] Haven't looked at the content in a long time though. [20:11] https://mobile.twitter.com/jack?max_id=938593014343024639 works just fine for me [20:12] dd0a13f37: so it's just because i'm using desktop browser then? [20:12] dd0a13f37: that link worked btw [20:13] dd0a13f37: doh, i got a "join today to see it all" when scrolling [20:13] JAA: Should I grab this whole youtube channel? https://www.youtube.com/user/gencat [20:14] I am using tor browser with JS disabled. [20:16] dd0a13f37: i think if 'twario/tweep' is made a bit less agressive, it wouldn't need to be 'torified' [20:16] https://mobile.twitter.com/jack?max_id=743833014343024639 I can go quite a bit back [20:17] Why castrate your perfectly working tweet scraping tool? Requests can use proxies, or multiple. [20:17] dd0a13f37: with original 'tweep' it seemed to stop at half a year or so back in time at search word [20:18] they could, but it would eventually get noticed i think if's running continously :/ [20:18] differnt users go differently far back https://mobile.twitter.com/realDonaldTrump?max_id=793833014343024639 [20:19] i don't mean users, but e.g one word [20:19] There are many tor exit nodes. [20:21] how could a python script be _fully_ torifyed? If it could be done without using a virtual machine, that would be cool :D [20:21] torsocks python ./myscript [20:21] ty [20:22] Or you can just have requests use a proxy [20:22] torsocks -i for guaranteed fresh ip [20:24] *** BnAboyZ has joined #archiveteam-bs [20:27] What collection should I upload that channel to? There's like 400 videos.... [20:28] dd0a13f37: will definetly test that. And i'm guessing just the tiny bit of extra time storing to an sqlitedb counts as tiny bit of it being nice-ifyed :D [20:29] The time the request takes will, unless you're using twisted/multithreading [20:30] dd0a13f37: the reason 'local capture time' column is in tweets i think i put in for exactly that purpose, since JAA pointed out that 'tweep' itself does not seem to be correct at keeping times [20:30] aye [20:31] The mobile search url query string is ... interesting... 
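Besides torsocks, a Python scraper can point requests at the local Tor SOCKS port directly (this assumes the requests[socks] extra is installed; 9050 is the default Tor daemon port, 9150 the Tor Browser one). A sketch combining that with the max_id paging shown above:

    import requests

    TOR_PROXY = 'socks5h://127.0.0.1:9050'   # socks5h: resolve DNS through Tor as well

    session = requests.Session()
    session.proxies = {'http': TOR_PROXY, 'https': TOR_PROXY}

    # Page backwards through a timeline via the max_id cursor, as in the links above.
    resp = session.get('https://mobile.twitter.com/jack',
                       params={'max_id': 938593014343024639})
    print(resp.status_code, len(resp.text))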
[20:31] https://mobile.twitter.com/hashtag/EU?src=hash [20:31] https://mobile.twitter.com/search?q=EU&next_cursor=TWEET-943937901217370114-943937901217370114-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWQABAAAAIAAAAAAAAQgAAAAAAAJAAAAAAAAAAABAAAAQAAAAAAAAAAiAAAQAAAAAAABAAAAAAAAAACBAAIAAAAAQAAAAAAIAAAAAACAAIAAAAAAAAAAAAAAACAAAhAAAAAAAAACAgAAAAAAAAAAAIAAAAAAAAAAAAAAAAgAAIAAAAAAFAAIIAAACCAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAEAAAAAACgCAAAAgAABwAAABAAAAAAAAAAIAAAAARAAEAAAAAAAA [20:31] AAAAAAIAAAAgAAAAAAAAAAAAAAAAACAAAABAAAAABAAAAAAAAQAAAAQAAEAABAAAEAAEAQAAAAAgAAAAAAAAAAAAAwACAAAAAAAAAAAAAAAAABQAAAAAAAAAAAAAAAAAACQAACAAAAAAAAAIAAQACAAAAFABAAAAAAAQkAAAEAAAAAAAAoAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAIAAAAAICAAAAAAAAAAAAAAEAAAAAAEAAACAAAAAAAAAEAAAAAAAAAgAAAAAAQAEAAQAAAAAAAAABAUAAAEAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAIQAAAACACAQAAQAAIAAAAIAAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQBAAAAwAAAAAAAAAAAAAAAAAA [20:31] AAAAAQAAAAAAAAAAAgAAAAEAAAAAAACABAAAAAAAAAAAAAAEAAQAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAQAAAAAAAAAAEAAAAAAAAAQAAAAAABABAAAAAAACAAAQAAAAAAAAAAAAAAAAAgAAAAAIAABACIAAAAAAAAAIAAAQAAAAAAAAA%3D%3D-R-0 [20:31] (that was one link) [20:31] oh dear [20:31] the base64 encoded part is some kind of bitmask [20:31] my eyes! [20:32] i think i've seen that garbled shit before, at tweep crashing :/ [20:33] when adding 'loggin' module, that looks exactly like the output given on the line where it stopped [20:33] logging* [20:34] hmm, strange [20:34] because there is no base64 encoding or anything of the sort in tweep [20:34] maybe i have the output..one sec [20:36] seems i've deleted the log, I'll risk trying to run the same command one more time. brb [20:38] good news [20:39] I enabled the new editor toolbar in the wiki (cc SketchCow ) [20:40] dd0a13f37: could it be compressed stuff, like in 'header: gz' crap? [20:41] No, it's base64 [20:42] run base64 -d | xxd, then paste it in [20:42] You'll see most of the bytes only have one bit set [20:42] Hurrah [20:42] Since it doesn't do anything if you change the numbers at the beginning (max id), the max_id parameter is in there too [20:42] dd0a13f37: all i know that mess of "AAAAAAAA" was the end of the log line when i last tested tweep. And also where it apprently failed. [20:43] Not at the beginning, since that's the same across requests [20:43] i'm running the same command now, and will pastebin (when) it fails [20:44] SketchCow: its snazzy [20:45] dd0a13f37: for all i know it might've been some nasty character(s) that did it [20:46] makes it a /little/ simpler to edit pages [20:48] *** bithippo has quit IRC (Quit: Page closed) [20:54] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [20:59] Oh, regular twitter has that same AAAAAAAAAAA mess, just not as the requested URL [21:01] I think it does. Load a page in your browser (with JS enabled), enable dev console, scroll to the bottom, check out the requests that happen in the background. [21:02] I think tweep just tries to imitate what the browser would do. [21:02] hmm [21:04] it has not crashed yet here like last time, but seems like a lot of people love to tweet 'netneutrality' these days. So it's not even done with this month. 
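A Python version of the "base64 -d | xxd" inspection suggested above. Pass it just the base64-looking tail of the next_cursor value (after stripping the leading TWEET-<id>-<id>- part, the trailing -R-0, and the URL-encoding); the helper name is made up for illustration:

    import base64
    import binascii

    def dump_cursor(cursor):
        # Pad to a multiple of 4 before decoding; stray characters are discarded.
        data = base64.b64decode(cursor + '=' * (-len(cursor) % 4))
        print(binascii.hexlify(data).decode())
        # Count set bits to see how sparse the bitmask is.
        print(sum(bin(b).count('1') for b in data), 'bits set of', len(data) * 8)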
I think last time it crashed at about this years month of may tweets [21:05] holy shit people have been tweeting 'netneutrality' lol [21:05] Twitter gets 6k tweets/sec, with 20 tweets/request archiving this is in the realm of possibility [21:06] dd0a13f37: aye :D and using webarchive.io, or wget/curl with requests to web.archive.org/save/ is quite futile :D [21:07] You would need to do a few hundred requests per second. The problem is archiving all those avatars, if you saturate a 1gbit/s line you can afford to archive 20kbit avatars assuming no overhead or IP bans [21:07] But the avatars aren't very important, are they? [21:07] reconstructing the links to the tweets is more important [21:08] That's possible too, all the info is in the HTML [21:08] and 'tweep' captures the id [21:08] aye [21:08] Does tweep have a mode where it can just show you all the tweets being done? [21:09] it does by default [21:09] Without narrowing down to a hashtag? Does it get 100%? [21:10] i don't know. I does a fuck of a lot of tweets though :D [21:11] if why it stopp(ed), could be worked out, i bet it could do 100% [21:12] for the wikidump nerds At 04:00 on Friday. a copy of the wiki's XML + Images are uploaded to the IA [21:12] for good measure [21:12] right now i'm just doing "python tweep -s 'netneutrality' > tweets.txt" ..to see if it eventually stops like last time. For all i know, piping to a textfile is what did it. [21:13] But can you just run python tweep > t.txt? [21:14] AND THE WINNER OF THE "DO YOU WANT THIS" SWEEPSTAKES FOR DECEMBER 21 IS [21:14] with '-s ' , yes [21:14] ...hundreds of gigs of funeral recordings in mp3 [21:14] *** BartoCH has joined #archiveteam-bs [21:14] But without -s parameter? [21:15] You could cheat and just use the X most common words, but that's not a nice solution [21:15] then it asks for parameters i think .. I think it's either '-u (user)' or '-s (word(s))' possible [21:16] either one of those are required.. _i think_ [21:16] Then a full scrape is difficult, or at least harder [21:16] SketchCow: your collection never ceases to amaze me [21:16] jrwr: Yay, finally. The last such dumps were uploaded in 2014 or 15. [21:17] they get dumped here after processing https://www.archiveteam.org/dumps/ [21:17] dd0a13f37: with a 'users' table i could be easier though perhaps..or :D [21:17] only keeps one [21:17] dd0a13f37: it* [21:17] the backup log for it is in https://www.archiveteam.org/backup.log [21:18] Well, there's still a few users who are never mentioned by others, never use certain hashtags, and never use certain words [21:18] dd0a13f37: yeah [21:18] Neat [21:19] dd0a13f37: not to mentioned banned, yet mentioned, and private..etc. i guess [21:20] dd0a13f37: i don't have much experience using tweep, so i don't even know how it behaves on finding disabled accounts, or banned users :/ [21:22] If you're scraping in realtime, that doesn't matter.. it would be one hell of a tweet to get banned in under 5 milliseconds [21:22] dd0a13f37: i think it just goes back in time from point of start [21:23] You'll never keep up, better to go in realtime [21:23] dd0a13f37: that would need some genoius at python threads i think..and perhaps faster bandwitch than mine :D [21:24] bandwidth* [21:24] twisted-http is fast, no? 
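The bandwidth estimate above works out roughly as follows (a sketch using the figures quoted in the chat):

    tweets_per_sec = 6000         # site-wide rate quoted above
    tweets_per_page = 20          # one search results page
    page_bytes = 8500             # ~8.5 kB gzipped per page

    pages_per_sec = tweets_per_sec / tweets_per_page          # 300 requests/s
    mbit_per_sec = pages_per_sec * page_bytes * 8 / 1e6
    print(round(pages_per_sec), round(mbit_per_sec, 1), 'Mbit/s')   # 300, ~20.4 Mbit/s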
[21:24] *** jacketcha has joined #archiveteam-bs [21:25] *** pizzaiolo has quit IRC (Read error: Operation timed out) [21:25] dd0a13f37: i do not know..tweep uses 'request' / 'urllib(3?)' i think [21:26] one results page is 8.5k gzipped, contains 20 tweets, at 6k tweets/sec this gives 20 mbit/s [21:26] dd0a13f37: it wouldn't surprise me if in certain senarios a hashtag were quicker than i could process [21:26] Yeah, requests without threading. [21:27] *** jacket has quit IRC (Ping timeout: 248 seconds) [21:27] But I think caching will make such attempts impossible, if you do the same query multiple times you'll get the same result [21:27] when using crontab wget, i had to cut time from 5 minutes to 3 minutes between each web.archive.org/save/ request..just to have a chance [21:29] You can do those in the background. Fire and forget. But IA won't like it [21:29] dd0a13f37: going "upwards" in time in a twitter feed is most likely the best solution. But my grasp of how to do that..is weak :D [21:30] I think archiving twitter is an insanity project anyway, better to just wait for library of congress to get their shit together [21:30] dd0a13f37: i just focous on hastags, like netneutrality :D [21:30] that's probably possible [21:31] dd0a13f37: entire twitter, or twitter by even years or months..yeah, some congress would've have to do that :D [21:31] and now I rest from poking the wiki really hard over the last 24hr [21:33] *** jacketcha has quit IRC (Read error: Operation timed out) [21:33] dd0a13f37: tweets containing 'netneutrality' been scrolling on my screen for 'since i said i started the command' , and i'm still on 2017-12-19 :/ [21:34] dd0a13f37: though i expect it will speed up when getting past the 14th a bit [21:34] You could archive faster if you modify it to use twisted [21:35] just by the protocol stuff or using threading? [21:36] What? [21:36] >Twisted is an event-driven networking engine [21:37] https://twistedmatrix.com/documents/current/api/twisted.web.client.html [21:37] so its beatifulsoup that's bottleneck, or? [21:38] No, requests [21:38] and that it's not using requests with threads [21:40] could it "re-use" already established connections? because that is one thing that pisses me off about tweep. It seems to do one connection per damn tweet [21:40] yeah [21:40] ..or at least try, like wget [21:40] ty [21:41] anyway JAA I figured once a week is a good backup for a low traffic wiki [21:42] or apparently asyncio is recommended [21:42] Yeah, sounds reasonable. [21:43] ola_norsk, dd0a13f37: It would probably be easiest to reimplement the whole thing based on aiohttp or similar. [21:44] * ola_norsk taking notes of all :D [21:44] I've written scrapers with aiohttp before, it's really nice. [21:45] got git? :D [21:45] HTTP/2 support would be even better. [21:45] No, haven't shared it yet. [21:45] It's on my list for the holidays, uploading all my grabs and the corresponding code. [21:46] JAA: feel free to punch in some stuff :) https://github.com/DuckHP/twario-warrior-tool [21:46] i have to the get database thingy working first i guess, before i do anything else :/ [21:47] (that, and making sure it doesn't freeze) [21:47] http://www.sqlalchemy.org/ [21:48] What did you change so far? [21:48] Also, port to Python 3 please. [21:48] JAA: i've barely (not really) touched tweep itself so far :/ [21:49] Ah ok [21:49] bah...I've not python'ed in years..2.7 is new to me :D [21:50] Oh please, Python 3 was released in 2008. 
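A minimal sketch of the aiohttp approach mentioned above, issuing several page fetches concurrently instead of one blocking request at a time (the URL list is a placeholder):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as resp:
            return url, resp.status, await resp.text()

    async def main(urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    urls = ['https://mobile.twitter.com/hashtag/netneutrality']   # placeholder list
    results = asyncio.get_event_loop().run_until_complete(main(urls))
    for url, status, body in results:
        print(status, url, len(body))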
:-P [21:51] why is there not a python script that converts e.g 'print "shit"' ? [21:51] 2to3? [21:51] 2to3 [21:51] That should handle the most obvious stuff. [21:52] good [21:53] *** icedice has joined #archiveteam-bs [21:54] the need of porting an interpreted language...that's a travesty by itself :( [21:54] Well, it's necessary because they cleaned up a ton of poorly designed stuff in Python 3. [21:55] it truly boggles the mind [21:55] yet c can remain source compatible for 28 years and counting [21:55] Yeah, let's compare C to Python... [21:56] And I doubt that C was as stable in the early stages of development. [21:56] Let's discuss that again when Python is 45 years old. [21:56] :D [21:56] But python is old by now. The 2 to 3 migration was a complete catastrophe. [21:57] Well yeah, many of those things (e.g. string vs. unicode distinction) should've been fixed earlier. [21:57] But they waited and accumulated all those things and then made one big backwards-incompatible release. [21:58] Which makes sense, otherwise you'd have to keep changing the code all the time. [21:58] Should these videos be uploaded to community video, or a different collection? https://www.youtube.com/user/gencat [21:58] Anyway, this is getting way too offtopic for this channel. [21:58] C was standardized in 1989, and k&r was released in 1978 - 11 years [21:58] :D [21:58] python was released in 1991 [21:58] it was not standardized by 2002 [21:59] *** JAA changes topic to: Lengthy Archive Team and archive discussions here | Offtopic: #archiveteam-ot | SketchCow: your porn tapes are getting digitized right now [21:59] *** schbirid has quit IRC (Quit: Leaving) [21:59] Or, to be fair, python2 was released in 2000, and it wasn't standardized by 2011 [22:00] -> #archiveteam-ot [22:01] I didn't know we had an offtopic channel for the offtopic channel [22:01] This isn't the offtopic channel, it was always about lengthy discussions (because #archiveteam is limited to announcements). [22:01] And -ot is new, just opened last week I think. [22:02] oh great another channel [22:02] lol [22:04] #archiveteam-ot-bs when? [22:06] *** icedice2 has joined #archiveteam-bs [22:08] hook54321: Community video sounds reasonable to me. Are you uploading each video as its own item? If so, you should probably ask info@ to create a collection of all of them in the end. [22:08] *** icedice has quit IRC (Ping timeout: 250 seconds) [22:09] Each of them there own item yeah. I'm using tubeup to do it. I'll email info@ when it's done I guess. [22:10] *** ola_norsk has quit IRC (R.I.P dear known Python :( https://youtu.be/uy9Mc_ozoP4) [22:12] is -bs not the off topic channel tho?! [22:12] * Smiley so confuse. fuck that. [22:14] When do Igloo's pipelines upload? As part of [22:14] Archiveteam: Archivebot GO Pack? [22:15] Yes [22:15] All pipelines do, except astrid's and FalconK's. [22:16] So why can't I find a certain !ao job in it? [22:16] Let's go to #archivebot. 
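For context on the 2to3 mention above: it rewrites the obvious Python 2 constructs in place (for example, "2to3 -w tweep.py" edits the file directly), along these lines:

    # Two of the automatic fixes 2to3 applies:
    #   Python 2: print "hello"         ->  Python 3: print("hello")
    #   Python 2: for i in xrange(10):  ->  Python 3: for i in range(10):
    print("hello")
    for i in range(10):
        pass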
[22:19] *** pizzaiolo has joined #archiveteam-bs [22:46] *** icedice has joined #archiveteam-bs [22:48] *** icedice2 has quit IRC (Ping timeout: 245 seconds) [22:51] *** icedice2 has joined #archiveteam-bs [22:53] *** icedice has quit IRC (Ping timeout: 245 seconds) [22:55] *** icedice2 has quit IRC (Client Quit) [22:55] *** kristian_ has joined #archiveteam-bs [22:56] *** icedice has joined #archiveteam-bs [23:06] *** icedice2 has joined #archiveteam-bs [23:08] *** icedice has quit IRC (Ping timeout: 245 seconds) [23:48] *** jacketcha has joined #archiveteam-bs [23:56] Hey, does anybody know if there is a node.js implementation of the Warrior program?