[00:00] https://github.com/HarryC145/PythonBits/blob/master/al_uploader.py there we go
[00:00] Grumble grumble kids these days grumble grumble when we used to want to share code we had to break out the carbon paper grumble grumble get off my lawn :p
[00:01] Haha. Copy and paste is a glorious thing
[00:02] Feel free to tell me how rubbish I am at Python etc
[00:07] *** tomwsmf-a (~tomwsmfa@[redacted]) has joined #archiveteam-bs
[00:10] *** VADemon has quit (Read error: Operation timed out)
[00:28] *** SimpBrain has quit (Read error: Operation timed out)
[00:45] *** RichardG (richardg86@[redacted]) has joined #archiveteam-bs
[00:50] I've just realised, Al Jazeera use AWS, so they are paying through the teeth for my video downloads
[01:18] S'ok
[01:18] The king can take it
[01:20] Are you sure their videos are hosted on AWS?
[01:20] Like, Netflix uses AWS for most of their infrastructure, but they don't really host video streaming from it.
[01:26] *** icedice (MangaReade@[redacted]) has joined #archiveteam-bs
[01:32] Yeah, their video cdn is
[01:33] grab-site users who run multi-month crawls are encouraged to upgrade for really convoluted reasons involving removing a port listener in the future
[01:34] Can grab-site crawls be paused and resumed? I know wpull supports resumption and grab-site builds on it
[01:35] they cannot. does wpull resumption work for you? I tried it once on a big crawl; took 10h30m to resume
[01:35] Hmm... I've never tried it on anything big
[01:35] maybe it's better now, I forgot if that was before or after the sqlalchemy fix
[01:36] I have a really terrible pause/resume solution in https://github.com/ludios/grab-site/issues/58#issuecomment-186730028
[01:36] you have to fight a program called CRIU until it stops spitting errors
[01:37] Isn't there the script that pauses it on low disk space?
[01:37] https://github.com/ludios/grab-site/blob/master/extra_docs/pause_resume_grab_sites.sh
[01:38] I thought you were talking about being able to resume a crawl after rebooting
[01:38] Ah yes. I was playing with that earlier, but I run it all on a different disk
[01:39] Should be a simple change
[01:39] yes
[01:55] *** SimpBrain (~SimpleBra@[redacted]) has joined #archiveteam-bs
[02:05] *** Start_ is now known as Start
[02:09] *** icedice has quit (Quit: Leaving)
[02:28] *** JesseW (~jesse@[redacted]) has joined #archiveteam-bs
[02:36] *** JesseW has quit (Quit: Leaving.)
[02:38] *** dashcloud has quit (Read error: Operation timed out)
[02:40] *** JesseW (~jesse@[redacted]) has joined #archiveteam-bs
[02:41] *** JesseW has quit (Client Quit)
[02:49] *** dashcloud (~quassel@[redacted]) has joined #archiveteam-bs
[03:31] *** SimpBrain has quit (Read error: Operation timed out)
[03:48] *** FalconK (nobody@[redacted]) has joined #archiveteam-bs
[04:01] *** dashcloud has quit (Read error: Operation timed out)
[04:04] *** dashcloud (~quassel@[redacted]) has joined #archiveteam-bs
[04:06] *** JesseW (~jesse@[redacted]) has joined #archiveteam-bs
[04:30] opencongress.org shutting down March 1st, per here: https://twitter.com/knowladgeispwr/status/702671846609539072
[04:36] Thanks, put it in Archivebot
[04:36] *I put it
[04:38] K, TYVM.
[04:42] *** tomwsmf-a has quit (Ping timeout: 258 seconds)
[04:46] Wow, that site is *huge*
[04:46] I don't know if ArchiveBot would be able to finish it in even a month, let alone less than a week
[04:47] Can someone with a standalone grab-site instance also hit it? Grab-site runs like 10x faster than ArchiveBot
[04:48] *** bwn has quit (Read error: Operation timed out)
[04:51] Ugh, that might require more than grab-site to get in six days. I didn't realize quite how extensive it is...
[05:00] *** SimpBrain (~SimpleBra@[redacted]) has joined #archiveteam-bs
[05:01] *** JetBalsa has quit (Read error: Connection reset by peer)
[05:03] I suggest emailing the operators of that site
[05:03] they likely have a much faster way to get data
[05:04] *** vitzli (~vitzli@[redacted]) has joined #archiveteam-bs
[05:17] *** vitzli has quit (Quit: Leaving)
[05:25] *** Sk1d has quit (Ping timeout: 250 seconds)
[05:26] *** karen has quit (Quit: leaving)
[05:33] I've emailed the Sunlight Foundation, the folks who run OpenCongress.org, asking them to keep the site up long enough for it to be copied into the Wayback Machine, and pointing them at the #archiveteam channel if they have questions. Hopefully I'll get a useful response.
[05:35] *** Sk1d (~Sk1d@[redacted]) has joined #archiveteam-bs
[05:35] I'm scraping the site with wpull at home, so far it's found over 60k URLs
[05:53] Now over 80k from 10k retrieved URLs
[06:24] *** logan2 has quit (Read error: Connection reset by peer)
[06:25] *** logan (~a@[redacted]) has joined #archiveteam-bs
[06:26] *** metalcamp (~metalcamp@[redacted]) has joined #archiveteam-bs
[07:44] *** JesseW is examining 3,458 IA identifiers that aren't darked, but that I didn't get data from in the census.
[07:54] *** bwn (~bwn@[redacted]) has joined #archiveteam-bs
[07:58] MrRadar: got a response from Clayton at the Sunlight Foundation, about OpenCongress.org!
[07:59] They are very willing to facilitate a scrape for the Wayback Machine. They said "We don't plan to complete the shut down until early-mid March." and asked how to request a scrape.
[07:59] Could you get in contact with them and discuss things further?
[08:00] *** Sk2d (~Sk1d@[redacted]) has joined #archiveteam-bs
[08:01] *** metalcamp has quit (Ping timeout: 252 seconds)
[08:05] *** Sk1d has quit (hub.se irc.du.se)
[08:12] Interesting -- http://cryptobin.org appears to be down
[08:13] https://www.riskbasedsecurity.com/2016/02/cryptobin-down-after-dhs-fbi-leaks/
[08:14] *** kvieta has quit (Read error: Operation timed out)
[08:14] MrRadar: here's what I wrote back to the Sunlight Foundation person: https://paste.ubuntu.com/15195515/
[08:14] *** rduser has quit (Read error: Operation timed out)
[08:15] *** mr-b has quit (Read error: Operation timed out)
[08:15] *** beardicus has quit (Read error: Operation timed out)
[08:16] *** botpie91 has quit (Read error: Operation timed out)
[08:16] *** closure has quit (Read error: Operation timed out)
[08:16] *** kvieta (~kvieta@[redacted]) has joined #archiveteam-bs
[08:16] *** botpie91 (~botpie91@[redacted]) has joined #archiveteam-bs
[08:16] *** remsen has quit (Read error: Operation timed out)
[08:17] *** closure (~lambda@[redacted]) has joined #archiveteam-bs
[08:17] *** beardicus (~beardicus@[redacted]) has joined #archiveteam-bs
[08:19] *** remsen (~remsen@[redacted]) has joined #archiveteam-bs
[08:19] Ah, I'd gotten cryptobin and 0bin confused...
[08:19] 0bin is still doing just fine, apparently.
[08:20] *** Sk2d is now known as Sk1d
[08:21] *** mr-b (~mr-b@[redacted]) has joined #archiveteam-bs
[08:29] *** JesseW has quit (Ping timeout: 252 seconds)
[08:29] *** rduser (~rduser@[redacted]) has joined #archiveteam-bs
[08:29] *** schbirid (~schbirid4@[redacted]) has joined #archiveteam-bs
[08:30] *** toad1 has quit (Read error: Operation timed out)
[08:39] *** toad1 (~toad@[redacted]) has joined #archiveteam-bs
[08:50] *** kvieta has quit (Read error: Operation timed out)
[08:50] *** mr-b has quit (Read error: Operation timed out)
[08:52] *** beardicus has quit (Read error: Operation timed out)
[08:52] *** botpie91 has quit (Read error: Operation timed out)
[08:52] *** remsen has quit (Read error: Operation timed out)
[08:53] *** closure has quit (Read error: Operation timed out)
[08:54] *** botpie91 (~botpie91@[redacted]) has joined #archiveteam-bs
[08:56] *** closure (~lambda@[redacted]) has joined #archiveteam-bs
[08:56] *** kvieta (~kvieta@[redacted]) has joined #archiveteam-bs
[08:57] *** beardicus (~beardicus@[redacted]) has joined #archiveteam-bs
[08:59] *** remsen (~remsen@[redacted]) has joined #archiveteam-bs
[09:00] *** mr-b (~mr-b@[redacted]) has joined #archiveteam-bs
[09:02] *** godane has quit (Quit: Leaving.)
[09:29] *** snape has quit (Remote host closed the connection)
[09:45] *** bwn has quit (Ping timeout: 499 seconds)
[10:27] *** godane (~slacker@[redacted]) has joined #archiveteam-bs
[10:40] *** bwn (~bwn@[redacted]) has joined #archiveteam-bs
[10:42] *** bwn_ (~bwn@[redacted]) has joined #archiveteam-bs
[10:53] *** bwn has quit (Read error: Operation timed out)
[10:59] *** xmc has quit (Read error: Operation timed out)
[11:01] *** achip has quit (Ping timeout: 258 seconds)
[11:01] *** godane has quit (Ping timeout: 258 seconds)
[11:01] *** schbirid has quit (Ping timeout: 258 seconds)
[11:03] *** schbirid (~schbirid4@[redacted]) has joined #archiveteam-bs
[11:04] *** vtyl has quit (Read error: Connection reset by peer)
[11:04] *** xmc (~chronomex@[redacted]) has joined #archiveteam-bs
[11:05] *** swebb gives channel operator status to xmc
[11:05] *** espes___ has quit (Remote host closed the connection)
[11:06] *** godane (~slacker@[redacted]) has joined #archiveteam-bs
[11:08] *** achip (~thechip@[redacted]) has joined #archiveteam-bs
[11:18] *** espes__ (~espes@[redacted]) has joined #archiveteam-bs
[11:19] *** lytv (~lytv@[redacted]) has joined #archiveteam-bs
[13:51] *** vitzli (~vitzli@[redacted]) has joined #archiveteam-bs
[13:51] and it only took 10 minutes to connect to EFNet :) SlySoft.com closed, but forum is still up
[13:55] *** VADemon (~VADemon@[redacted]) has joined #archiveteam-bs
[14:40] SketchCow: this is the one with an mp3 at 72+ hours: https://archive.org/details/kpfa-archives-radio-podcast-2009-09-29
[15:19] *** VADemon has quit (Read error: Connection reset by peer)
[15:31] *** brayden has quit (Quit: Leaving)
[15:34] *** snape (~snape@[redacted]) has joined #archiveteam-bs
[16:00] *** bauruine has quit (Ping timeout: 260 seconds)
[16:04] *** bauruine (~bauruine@[redacted]) has joined #archiveteam-bs
[16:10] *** brayden (~brayden@[redacted]) has joined #archiveteam-bs
[16:10] *** swebb gives channel operator status to brayden
[16:26] *** JesseW (~jesse@[redacted]) has joined #archiveteam-bs
[16:29] *** midas has quit (Quit: WeeChat 1.3)
[16:30] *** midas (~midas@[redacted]) has joined #archiveteam-bs
[16:43] *** JesseW has quit (Quit: Leaving.)
[16:45] *** xXx_ndidd (~Nathan@[redacted]) has joined #archiveteam-bs
[16:45] *** vitzli has quit (Quit: Leaving)
[16:47] *** logchfoo1 starts logging #archiveteam-bs at Thu Feb 25 16:47:23 2016
[16:47] *** logchfoo1 has joined #archiveteam-bs
[16:47] *** bwn_ has quit IRC (Read error: Operation timed out)
[16:49] *** metalcamp has joined #archiveteam-bs
[16:58] *** ndiddy has quit IRC (Read error: Operation timed out)
[17:01] *** tomwsmf-a has joined #archiveteam-bs
[17:24] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[17:29] *** mismatch_ has quit IRC (Ping timeout: 260 seconds)
[17:36] *** mismatch_ has joined #archiveteam-bs
[17:53] *** matthusby has joined #archiveteam-bs
[17:54] *** SadDM has joined #archiveteam-bs
[17:54] *** swebb sets mode: +o SadDM
[17:56] *** jspiros has joined #archiveteam-bs
[18:12] *** xmc is now known as butts
[18:13] *** butts is now known as xmc
[18:59] *** lytv has quit IRC (Read error: Operation timed out)
[19:02] *** lytv has joined #archiveteam-bs
[19:03] *** JW_work1 has quit IRC (Read error: Operation timed out)
[19:20] *** JW_work has joined #archiveteam-bs
[20:12] *** Protab has joined #archiveteam-bs
[20:51] *** xXx_ndidd has quit IRC (Ping timeout: 252 seconds)
[21:13] SIGH https://mta.openssl.org/pipermail/openssl-announce/2016-February/000063.html
[21:32] *** JW_work1 has joined #archiveteam-bs
[21:33] *** JW_work has quit IRC (Read error: Operation timed out)
[21:37] *** Famicoma1 has joined #archiveteam-bs
[21:37] *** Famicoma1 has quit IRC (Client Quit)
[21:38] *** bwn has joined #archiveteam-bs
[21:40] *** JW_work1 has quit IRC (Ping timeout: 362 seconds)
[21:44] The last year has been a good year not running on OpenSSL based SSL.
[21:46] the changelog for 1.0.2 on that release is light on the vuln fixes unless it's just not up to date yet
[21:55] *** metalcamp has quit IRC (Ping timeout: 252 seconds)
[21:57] 67 hours in one mp3: https://archive.org/details/kpfa-archives-radio-podcast-2009-10-09
[21:58] it's because it was another fund drive special
[21:58] https://kpfa.org/archives/2009/10/9/
[22:05] *** schbirid has quit IRC (Quit: Leaving)
[22:10] *** Famicoma1 has joined #archiveteam-bs
[22:15] SketchCow: i'm up to 2009-10-31 with kpfa
[22:17] *** ndiddy has joined #archiveteam-bs
[22:25] *** Boppen has joined #archiveteam-bs
[22:30] *** JW_work has joined #archiveteam-bs
[22:31] *** JW_work1 has joined #archiveteam-bs
[22:36] *** JW_work has quit IRC (Ping timeout: 362 seconds)
[22:50] *** Chorca1 has quit IRC (Read error: Operation timed out)
[22:51] this feels weird to say, but gitlab releases too fast for me :?
[22:51] *** tomwsmf-a has joined #archiveteam-bs
[22:52] like I installed 8.5 not too long ago and now they have a patch release
[22:52] I mean I really like the fact that they're on the ball
[22:52] it's just like "wat"
[22:52] er 8.4 that is
[22:54] *** Famicoma1 has quit IRC (Quit: leaving)
[22:54] *** Famicoma1 has joined #archiveteam-bs
[22:54] *** Chorca has joined #archiveteam-bs
[23:08] *** xXx_ndidd has joined #archiveteam-bs
[23:10] swebb: just an FYI - Al Jazeera seem to be throttling me, I can't get more than 4 videos at once, and then they are also limiting me to a max of 60Mbps outbound
[23:10] what monsters
[23:11] there is a lot of content to get
[23:11] SketchCow, did you see the stuff this morning about OpenCongress? Goes away in a week, folks who run it were contacted and I guess they're quite amenable to enduring a deep crawl for posterity. Looks like it might be 200-300k URLs, all told. Not sure who's got what planned, if anything, though...
[23:12] oh yeah for those of you who were panicking last week, IA brought 2.5 PB of new disk online https://archive.org/~tracey/mrtg/df.html
[23:13] right on cue
[23:16] Well, have at
[23:20] *** ndiddy has quit IRC (Read error: Operation timed out)
[23:23] snape, SketchCow: yeah, I suggested they run a crawler locally, then upload it to IA, and if so, to contact SketchCow about getting it into the Wayback Machine. Otherwise, I warned them about archivebot working on it, but being slow.
[23:36] SketchCow: i'm uploading some higher res copies of EGM
[23:37] for example issue 025 is only 60 MB in your copy
[23:37] but i have one that's close to 230 MB