[00:10] *** BlueMaxim has joined #archiveteam
[00:12] *** tomwsmf-a has joined #archiveteam
[00:18] *** anjacks0n has joined #archiveteam
[00:21] *** anjacks0n has quit IRC (Ping timeout: 190 seconds)
[00:29] *** j08nY has quit IRC (Ping timeout: 633 seconds)
[00:31] *** JesseW has joined #archiveteam
[00:56] *** _ris has quit IRC ()
[00:57] *** DoomTay has joined #archiveteam
[00:57] Sorry about that. What'd I miss?
[00:58] *** kken has joined #archiveteam
[00:58] *** kken has quit IRC (Client Quit)
[01:01] *** Pudsey has joined #archiveteam
[01:03] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[01:03] Anyone used the wayback API? I get one result via the website but none via the API. I've tried encoding the URL but still no
[01:04] What's the URL?
[01:11] https://blip.tv/file/get/ActorAJ-DeerLakesEnding728.wmv one result from 2014 but even if you pass the timestamp in the API request it still doesn't know about it
[01:13] ....yeah...
[01:13] http://web.archive.org/web/*/https://blip.tv/file/get/ActorAJ-DeerLakesEnding728.wmv
[01:15] Well that's odd I just got one result a few minutes ago.
[01:16] The weird thing is accessing the robots.txt file itself yields a 523
[01:27] *** kristian_ has quit IRC (Leaving)
[01:31] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[01:40] *** Pudsey has quit IRC (Remote host closed the connection)
[01:57] *** VADemon has quit IRC (Quit: left4dead)
[01:58] *** redlob has quit IRC (Read error: Operation timed out)
[02:00] *** schbirid has quit IRC (Read error: Connection refused)
[02:02] *** Stiletto has joined #archiveteam
[02:10] I am getting hammered on coursera
[02:10] arkiver: we need all gawker and all related properties like io9
[02:10] *** redlob has joined #archiveteam
[02:11] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[02:13] Maybe assemble a warrior project for the gawker business?
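Note on the Wayback API question at 01:03: the log does not say which endpoint was used. As a minimal sketch, assuming it was the public availability endpoint at https://archive.org/wayback/available, the request and the "no result" case look roughly like this. The helper names (availability_url, closest_snapshot, lookup) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the "wayback API" in the log is the availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(target, timestamp=None):
    """Build the query URL; urlencode percent-encodes the target URL."""
    params = {"url": target}
    if timestamp:
        # Optional: ask for the snapshot closest to this timestamp.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot URL, or None if the API reports nothing."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(target, timestamp=None):
    """Hypothetical convenience wrapper; performs a real network request."""
    with urlopen(availability_url(target, timestamp)) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

An empty "archived_snapshots" object is how this endpoint signals that it found nothing, so a None from closest_snapshot matches the behaviour described at 01:11.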
[02:13] *** schbirid has joined #archiveteam
[02:33] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[02:40] *** BartoCH has joined #archiveteam
[02:46] i'm still doing the sitemap grabs
[03:05] i'm starting to upload my collection of LDS Magazines i found
[03:06] they have pdfs going back to 2001
[03:08] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[03:12] *** Emcy has quit IRC (Read error: Operation timed out)
[03:32] Arto project is about to run out of jobs but there are 25,000 out. Does someone start expiring older 'out' jobs at this point?
[03:45] What do you mean "run out of jobs"?
[03:50] *** bugcat has joined #archiveteam
[03:50] *** bugcat has quit IRC (Client Quit)
[03:58] Whopper: yes
[04:21] cool
[04:22] DoomTay: in the to-do part there are only < 200 jobs but in the out part there are 25,588. so there could be lots of clients sitting around doing nothing while valid jobs are assigned to inactive clients
[04:25] *** Coderjoe has quit IRC (Read error: Operation timed out)
[04:34] *** Coderjoe has joined #archiveteam
[04:48] *** devrandom has joined #archiveteam
[04:49] Is anyone working on downloading Coursera? They are going to delete 472 courses on June 30.
[04:49] http://makemeflow.org/advice/2016/06/how-to-download-courseras-courses-before-theyre-gone-forever/
[04:53] *** devrandom has quit IRC (Client Quit)
[04:54] ....
[04:54] Did he just pop in, drop a message, then leave?
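Note on the to-do/out exchange at 03:32 and 04:22: the idea of expiring old 'out' jobs can be sketched as below. This is an illustration only, not the tracker's actual code; the function name, the data layout (a dict of claim times and a to-do list), and the max_age threshold are all assumptions.

```python
def requeue_stale(out, todo, now, max_age):
    """Move items claimed more than max_age seconds ago back to the to-do list.

    out:  dict mapping item name -> time the item was handed to a client
    todo: list of item names waiting to be handed out
    Returns the list of item names that were requeued.
    """
    stale = [item for item, claimed_at in out.items() if now - claimed_at > max_age]
    for item in stale:
        del out[item]      # the inactive client loses its claim
        todo.append(item)  # an idle client can pick the item up again
    return stale
```

For example, with now=1000 and max_age=600, an item claimed at time 100 goes back to to-do while one claimed at time 900 stays out.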
[04:57] *** DoomTay has quit IRC (Quit: Page closed)
[05:08] *** DoomTay has joined #archiveteam
[05:10] *** redlob has quit IRC (Read error: Operation timed out)
[05:11] *** redlob has joined #archiveteam
[05:20] *** BlueMaxim has joined #archiveteam
[05:23] *** anjacks0n has joined #archiveteam
[05:25] http://www.openculture.com/2016/06/a-handy-guide-on-how-to-download-old-coursera-courses-before-they-disappear.html
[05:25] well, I see someone else has mentioned it already
[05:26] *** anjacks0n has quit IRC (Ping timeout: 190 seconds)
[05:26] DoomTay: well, he did wait 5 minutes between joining and leaving. I guess he expects that channels with 199 people will have more activity.
[05:27] That's pretty much how I felt when I first joined
[05:39] *** JesseW has joined #archiveteam
[06:07] *** DoomTay has quit IRC (Quit: Page closed)
[06:36] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:44] we know
[06:44] you're impatient and not very understanding
[06:45] *** anjacks0n has joined #archiveteam
[06:48] *** anjacks0n has quit IRC (Ping timeout: 190 seconds)
[07:05] *** anjacks0n has joined #archiveteam
[07:08] *** Honno_ has joined #archiveteam
[07:33] *** anjacks0n has quit IRC (anjacks0n)
[07:54] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[08:34] *** SDragon has joined #archiveteam
[08:34] hello #archiveteam, 2 queries:
[08:36] warning: query has timed out
[08:36] 1, I have an internal corp wiki, that is essentially the better half of my brain, started writing it in 2004, expanded considerably in 2011, contains ~1.2MB of plaintext, and ~15K URLs. I have been browsing ancient pages of these, and to my dismay, ~40% of the links are nowhere to be found; 50% of which isn't even on archive.org . This is very not good. What tool would you recommend, to
[08:36] which I can plug a list of URLs, and it will archive all the pages as comprehensively as possible?
[08:37] have tried so far: curl, httrack, wkhtmltopdf
[08:37] curl won't pull all resources; httrack does, but configuring it to limit to one specific page is tricky; wkhtmltopdf crashes on some pages with ads
[08:39] *** tomwsmf-a has joined #archiveteam
[08:44] https://github.com/ludios/grab-site
[08:44] should be perfect
[08:46] dxrt, okay. How can I open a page saved in WARC format?
[08:47] You can use something like webarchiveplayer
[08:48] and you can also extract the contents of WARCs using various tools
[08:48] http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
[08:49] okay, yeah, I now know what my weekend will look like
[08:50] #2:
[08:51] There exists ~20 groups, and ~15 persons of high interest (to me), who are publishing original content to facebook. That is, content not accessible anywhere else on the Internet. This is very, very not good, for all the obvious reasons archiveteam exists (and I contribute to). What tool would you recommend, to which I can plug a list of Facebook pages / FBIDs (and prolly my own auth token,
[08:51] as they're "shared to friends only"), and it will archive all of the posts belonging to the group / person?
[08:53] have tried so far: fb's "export" tool (only pulls my own), archiving via tools above (except WARC; none of them can do paging)
[08:53] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[08:56] grab-site supports phantom-js. Not sure if that’s enough to grab Facebook though.
[09:00] Otherwise use a WARC recording HTTP proxy and scroll through the site manually.
[09:01] that..... might take a while
[09:01] also, it's not automagic.
Which means it will have a high false-negative rate ( stuff needs to be pulled, but won't, because it's manual)
[09:26] *** Emcy has joined #archiveteam
[09:30] SketchCow: yeah, we're being spammed about coursera
[09:30] I plan on setting that up this weekend so the grab can be done next week
[09:31] we need a bot that picks up any mention of "Coursera" and just goes "We're on it. Quit bugging us"
[09:32] Okay, I'll set that up
[09:32] xD
[09:33] YES! that was my #3 as well
[09:34] .torrent balls of coursera, or even better: pirate edx instances of their courses would get my infomorph salivating
[09:35] We'll get it as WARCs. After that they'll be uploaded as items to IA too
[09:43] *** PurpleSym changes topic to: Archive Team: We're not archive.org | http://archiveteam.org/ | Coursera grab starting soon | lengthy/off-topic in #archiveteam-bs
[09:43] HCross: ^
[09:44] :)
[09:48] *** coursebug has joined #archiveteam
[09:48] coursera
[09:48] arkiver: We're working it.
[09:50] HCross ^
[09:51] haha. awesome :)
[09:55] *** _ris has joined #archiveteam
[10:42] *** _ris is now known as ris
[10:43] * ris ponders the practicality of backing up mapillary's content http://www.mapillary.com (it is CC licensed)
[10:43] it would be *huge* though
[10:50] iirc they would be supportive
[10:55] it would just be gigs & gigs & gigs & gigs
[11:00] *** Honno__ has joined #archiveteam
[11:00] (i guess it depends how "supportive" they are - if they're supportive enough would they just mail an hd to archive.org?)
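Note on the bot requested at 09:31: the matching logic can be sketched in a few lines. This is not the code behind coursebug, which the log does not show; the function name and the choice to address the reply to the sender are assumptions, and the IRC connection handling is left out entirely.

```python
import re

# Case-insensitive match on the keyword quoted in the request at 09:31.
KEYWORD = re.compile(r"coursera", re.IGNORECASE)
REPLY = "We're on it. Quit bugging us"


def auto_reply(sender, message, own_nick="coursebug"):
    """Return a reply line if the message mentions the keyword, else None."""
    if sender == own_nick:
        # Never answer our own messages, or the bot would loop on itself.
        return None
    if KEYWORD.search(message):
        return f"{sender}: {REPLY}"
    return None
```

A real bot would probably also rate-limit replies per nick so that an ongoing discussion of the grab does not trigger it on every line.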
[11:03] *** Honno_ has quit IRC (Ping timeout: 492 seconds)
[11:06] *** metalcamp has joined #archiveteam
[11:22] *** xhdr has quit IRC (Quit: ZNC - http://znc.in)
[11:25] *** xhdr has joined #archiveteam
[11:25] *** xhdr has quit IRC (Excess Flood)
[11:28] *** xhdr has joined #archiveteam
[11:28] *** xhdr has quit IRC (Excess Flood)
[11:30] *** xhdr has joined #archiveteam
[11:30] *** xhdr has quit IRC (Excess Flood)
[11:33] *** xhdr has joined #archiveteam
[11:33] *** xhdr has quit IRC (Excess Flood)
[11:36] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:38] *** xhdr has joined #archiveteam
[11:38] *** xhdr has quit IRC (Excess Flood)
[11:42] *** xhdr has joined #archiveteam
[11:42] *** xhdr has quit IRC (Excess Flood)
[11:47] *** xhdr has joined #archiveteam
[11:47] *** xhdr has quit IRC (Excess Flood)
[11:51] *** xhdr has joined #archiveteam
[11:51] *** xhdr has quit IRC (Excess Flood)
[11:56] *** xhdr has joined #archiveteam
[11:56] *** xhdr has quit IRC (Excess Flood)
[12:00] *** xhdr has joined #archiveteam
[12:00] *** xhdr has quit IRC (Excess Flood)
[12:05] *** xhdr has joined #archiveteam
[12:05] *** xhdr has quit IRC (Excess Flood)
[12:07] *** j08nY has joined #archiveteam
[12:09] *** xhdr has joined #archiveteam
[12:09] *** xhdr has quit IRC (Excess Flood)
[12:14] *** xhdr has joined #archiveteam
[12:14] *** xhdr has quit IRC (Excess Flood)
[12:15] *** PurpleSym sets mode: +b *!~xhdr@static.182.114.9.176.clients.your-server.de
[12:17] *** j08nY has quit IRC (Quit: Leaving)
[12:35] *** j08nY has joined #archiveteam
[12:51] *** metalcamp has quit IRC (Read error: Connection reset by peer)
[13:03] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:06] *** dashcloud has joined #archiveteam
[13:08] *** Atom-- has joined #archiveteam
[13:13] *** Atom__ has quit IRC (Read error: Operation timed out)
[13:30] *** coursebug has quit IRC (Remote host closed the connection)
[13:52] *** ris has quit IRC (Read error: Operation timed out)
[14:35] I have
dragged my feet long enough, time to complete Gamefront. I sent you a mail with the proposed config changes for processing arkiver.
[14:49] *** WinterFox has quit IRC (Remote host closed the connection)
[14:49] *** DigDug has joined #archiveteam
[14:53] gj zino
[15:08] does archiveteam maintain a repo of their own, or is everything going to archive.org, and private repos only?
[15:09] containing what?
[15:19] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[15:21] *** RichardG has joined #archiveteam
[15:38] *** ris has joined #archiveteam
[15:56] *** Legionof7 has joined #archiveteam
[15:57] I hope feelthebern.org gets archived.
[15:59] Legionof7: it does now
[15:59] :p
[15:59] Wait wait
[15:59] Joepie
[15:59] I think I know you from somewhere
[15:59] very possible
[16:00] Have you ever gone onto anonops?
[16:00] or #reddit ?
[16:00] Legionof7: yes, anonops, but let's move to #archiveteam-bs for off-topic discussion :P
[16:26] *** Coderjoe has quit IRC (Read error: Connection reset by peer)
[16:26] *** Coderjoe has joined #archiveteam
[16:27] *** Legionof7 has quit IRC (Quit: Page closed)
[16:43] *** kristian_ has joined #archiveteam
[16:44] *** Froggypwn has quit IRC (Quit: ~ Trillian Astra - www.trillian.im ~)
[16:51] *** mr-b has quit IRC (Ping timeout: 246 seconds)
[16:58] *** mr-b has joined #archiveteam
[17:11] *** VADemon has joined #archiveteam
[17:14] *** JesseW has joined #archiveteam
[17:27] *** anjacks0n has joined #archiveteam
[17:28] *** anjacks0n has quit IRC (Client Quit)
[17:29] *** ris has quit IRC ()
[17:29] *** DoomTay has joined #archiveteam
[17:45] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[17:47] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[17:47] *** dashcloud has joined #archiveteam
[18:23] *** Honno__ has quit IRC (Ping timeout: 492 seconds)
[18:30] *** ris has joined #archiveteam
[18:40] *** JesseW has joined #archiveteam
[19:00] SDragon: By "a repo" do you mean a copy of the content we
grab, or a version-control repository of programs we use to grab it? We do have multiple version-control repos with various programs in them, but no, we don't have a complete 3rd copy of the content we grab -- it's just at archive.org (and random pieces in various other places).
[19:08] ris: regarding mapillary -- please join #archiveteam-bs to discuss this further
[19:18] *** MMovie has joined #archiveteam
[19:19] *** anjacks0n has joined #archiveteam
[19:20] *** Aranje has joined #archiveteam
[19:21] *** kris33 has joined #archiveteam
[19:22] *** bwn has quit IRC (Ping timeout: 244 seconds)
[19:23] I know you're aware of the issue with Coursera right now, but did you know someone made a script to help with downloading?
[19:23] https://github.com/Chillee/coursera-dl-all
[19:25] DoomTay: yes, yes we are. :-)
[19:26] DoomTay: btw, this channel *is* logged (and the logs are searchable) -- you can check this sort of thing. :-)
[19:27] hm, it does look like *that* particular script hadn't been mentioned before, though
[19:28] hmm
[19:28] oh, bot was down
[19:29] Who wants to run the Coursera 'We're on it' bot?
[19:29] Maybe make a project page on it?
[19:29] please do
[19:29] go ahead
[19:29] *** coursebug has joined #archiveteam
[19:30] *** kris33 has quit IRC (Textual IRC Client: www.textualapp.com)
[19:30] *** anjacks0n has quit IRC (anjacks0n)
[19:32] *** bwn has joined #archiveteam
[19:40] *** tomwsmf-a has joined #archiveteam
[19:48] Well, I got a basic page going
[19:48] I think I better log off for a bit.
It's getting stormy here
[19:48] *** DoomTay has quit IRC (Quit: Page closed)
[19:54] *** kristian_ has quit IRC (Leaving)
[19:55] *** DoomTay has joined #archiveteam
[19:57] *** PurpleSym sets mode: -b *!~xhdr@static.182.114.9.176.clients.your-server.de
[19:57] *** xhdr has joined #archiveteam
[20:03] *** nwf has joined #archiveteam
[20:06] *** metalcamp has joined #archiveteam
[20:13] *** zxtx has joined #archiveteam
[20:36] *** JesseW has quit IRC (Read error: Operation timed out)
[20:47] *** JesseW has joined #archiveteam
[20:47] *** DoomTay has quit IRC (Quit: Page closed)
[20:56] *** DoomTay has joined #archiveteam
[21:08] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[21:24] *** Rye has quit IRC (Quit: ZNC - http://znc.in)
[21:25] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[21:27] *** Rye has joined #archiveteam
[22:04] *** tomwsmf-a has quit IRC (Read error: Connection reset by peer)
[22:11] *** DoomTay has quit IRC (Quit: Page closed)
[22:18] *** DoomTay has joined #archiveteam
[22:21] *** tomwsmf-a has joined #archiveteam
[22:24] *** dashcloud has quit IRC (Remote host closed the connection)
[22:26] *** Pudsey has joined #archiveteam
[22:28] *** Pudsey has quit IRC (Remote host closed the connection)
[22:34] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[22:34] *** dashcloud has joined #archiveteam
[22:41] *** BartoCH has joined #archiveteam
[22:48] *** ohhdemgir has joined #archiveteam
[23:17] *** mutoso_ has joined #archiveteam
[23:19] *** mutoso has quit IRC (Read error: Operation timed out)
[23:41] *** ariscop has quit IRC (Quit: Leaving)
[23:50] *** ohhdemgir has quit IRC (Read error: Operation timed out)
[23:58] *** BlueMaxim has joined #archiveteam
[23:59] *** ohhdemgir has joined #archiveteam