[00:06] *** logchfoo2 starts logging #archiveteam at Thu Mar 03 00:06:50 2016
[00:06] *** logchfoo2 has joined #archiveteam
[00:07] *** Simpbrai1 has joined #archiveteam
[00:07] *** antomatic has joined #archiveteam
[00:07] *** swebb sets mode: +o antomatic
[00:07] *** vtyl has joined #archiveteam
[00:07] *** lbft_ has joined #archiveteam
[00:09] *** ersi has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** lytv has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** tomwsmf-a has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** winr4r has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** SketchCow has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** Simpbrai_ has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** bai has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** db48x has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** Morbus has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** antomati_ has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** logchfoo1 has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** lbft has quit IRC (hub.efnet.us irc.Prison.NET)
[00:09] *** achip has quit IRC (hub.efnet.us irc.Prison.NET)
[00:10] *** rossdylan has joined #archiveteam
[00:11] *** bai_ has joined #archiveteam
[00:13] *** winr5r has joined #archiveteam
[00:15] *** Swizzle__ has quit IRC (Read error: Operation timed out)
[00:18] *** SirCmpwn has quit IRC (Ping timeout: 260 seconds)
[00:21] *** MMovie has quit IRC (Read error: Operation timed out)
[00:22] *** Mayonaise has joined #archiveteam
[00:22] *** MMovie has joined #archiveteam
[00:22] *** mr-b has joined #archiveteam
[00:27] *** SirCmpwn has joined #archiveteam
[00:37] *** MMovie has quit IRC (Read error: Operation timed out)
[00:39] *** MMovie has joined #archiveteam
[00:54] *** MMovie has quit IRC (Read error: Operation timed out)
[00:55] *** MMovie has joined #archiveteam
[01:10] *** MMovie has quit IRC (Read error: Operation timed out)
[01:12] *** MMovie has joined #archiveteam
[01:26] *** MMovie has quit IRC (Read error: Operation timed out)
[01:27] *** MMovie has joined #archiveteam
[01:42] *** MMovie has quit IRC (Read error: Operation timed out)
[01:42] *** MMovie has joined #archiveteam
[01:51] *** bwn_ has joined #archiveteam
[02:04] *** bwn__ has quit IRC (Read error: Operation timed out)
[02:05] *** philpem has quit IRC (Ping timeout: 260 seconds)
[02:09] *** JesseW has joined #archiveteam
[02:13] *** HCross has quit IRC (Ping timeout: 362 seconds)
[02:14] *** hawc145 has joined #archiveteam
[02:14] *** wp494 has joined #archiveteam
[02:15] *** wp494_ has quit IRC (Read error: Operation timed out)
[02:20] *** concerned has joined #archiveteam
[02:20] *** concerned has quit IRC (Client Quit)
[02:21] *** bsmith093 has quit IRC (Ping timeout: 190 seconds)
[02:23] *** kcaj has quit IRC (Read error: Operation timed out)
[02:24] *** kcaj has joined #archiveteam
[02:54] *** MMovie has quit IRC (Read error: Operation timed out)
[02:56] *** MMovie has joined #archiveteam
[03:02] *** achip has joined #archiveteam
[03:10] *** MMovie has quit IRC (Read error: Operation timed out)
[03:11] *** MMovie has joined #archiveteam
[03:13] *** ersi has joined #archiveteam
[03:13] *** swebb sets mode: +o ersi
[03:17] *** bsmith093 has joined #archiveteam
[03:32] *** MMovie has quit IRC (Read error: Operation timed out)
[03:32] *** ndiddy has joined #archiveteam
[03:33] *** MMovie has joined #archiveteam
[03:42] *** achip has quit IRC ()
[03:43] *** davidar has joined #archiveteam
[03:44] arkiver: JesseW suggested I talk to you about an open-access pdf archiving project
[03:52] yes, I did. :-) welcome!
[03:54] davidar: BTW, this channel is logged in public here (if you don't get a response while you are in the channel): http://archive.fart.website/bin/irclogger_log/archiveteam
[03:57] JesseW: thanks
[04:02] basically, the problem is that we have a long list (~20M) of links to pdfs scattered over the web, and want to get them archived (ideally before some of them start disappearing :)
[04:08] *** achip has joined #archiveteam
[04:13] davidar, do you want them all in one big list or viewable from web.archive.org/original.url.com?
[04:15] Fletcher: a big list so that they can also be loaded into http://citeseerx.ist.psu.edu
[04:16] but yes, with some kind of mapping from original urls to the files
[04:22] *** MMovie has quit IRC (Read error: Operation timed out)
[04:22] *** MMovie has joined #archiveteam
[04:23] That shouldn't be too difficult
[04:27] *** SmileyG has joined #archiveteam
[04:28] davidar, do you have an approximate average size for the pdfs?
[04:28] (getting an idea of time/resources required)
[04:32] Fletcher: I'm not exactly sure, but I think ~1MB would be a reasonable guess
[04:33] yeah 20-30TB is fairly doable
[04:33] the pdfs are mostly OCR'd text?
[04:33] or scanned text?
[04:33] or images?
[04:33] or a mix?
[04:34] (I'll let arkiver go through the technical details with you, these were just questions that would need answers at some point)
[04:34] they should be textual pdfs (plus images) for the most part I would say
[04:34] so like prepared documents that were printed to or saved as PDFs?
[04:35] yeah
[04:35] yeah distilled text is pretty small, the images would ramp up the file size averages though
[04:36] yeah, I imagine most would be less than 1MB, but occasionally you get some unusually huge ones
[04:37] *** Smiley has quit IRC (Read error: Operation timed out)
[04:37] I can try to get some more specific stats if that would help
[04:38] If you've got the resources to spare it may be a good idea, a sample of ~1000 I guess
[04:39] Also, do you have a point of contact you could leave with us? (you can private message if you don't want it to appear in the logs etc)
[04:42] my email is on https://github.com/davidar, I also tend to idle on freenode most of the time (same nick), or you can get JesseW to nudge me :)
[04:43] perfect
[04:43] *** MMovie has quit IRC (Read error: Operation timed out)
[04:43] :-)
[04:45] *** MMovie has joined #archiveteam
[04:52] *** espes__ has quit IRC (Read error: Operation timed out)
[04:52] *** Smiley has joined #archiveteam
[04:52] *** bwn_ has quit IRC (Read error: Operation timed out)
[04:53] *** SmileyG has quit IRC (Read error: Operation timed out)
[04:53] *** wacky has quit IRC (Read error: Operation timed out)
[04:53] *** DFJustin has quit IRC (Read error: Operation timed out)
[04:54] *** bsmith094 has quit IRC (Read error: Operation timed out)
[04:54] *** mutoso has quit IRC (Read error: Operation timed out)
[04:54] *** Kenshin has quit IRC (Read error: Operation timed out)
[04:55] *** Kenshin has joined #archiveteam
[04:56] *** bsmith094 has joined #archiveteam
[04:57] *** w0rp has quit IRC (Read error: Operation timed out)
[04:57] *** [phire] has quit IRC (Read error: Operation timed out)
[04:57] *** brayden_ has quit IRC (Read error: Operation timed out)
[04:58] *** w0rp has joined #archiveteam
[04:58] *** SilSte has quit IRC (Read error: Operation timed out)
[04:59] *** ivan` has quit IRC (Ping timeout: 633 seconds)
[04:59] *** RedType has quit IRC (Read error: Connection reset by peer)
[05:00] *** lysobit has quit IRC (Read error: Operation timed out)
[05:00] *** lukeman_ has quit IRC (Ping timeout: 633 seconds)
[05:00] *** lukeman has joined #archiveteam
[05:01] *** Fusl has quit IRC (Ping timeout: 633 seconds)
[05:01] *** mutoso has joined #archiveteam
[05:01] *** Apathy has quit IRC (Read error: Operation timed out)
[05:02] *** Apathy has joined #archiveteam
[05:02] *** Fusl has joined #archiveteam
[05:03] *** cadbury has quit IRC (Ping timeout: 633 seconds)
[05:03] *** xXx_ndidd has joined #archiveteam
[05:03] *** lysobit has joined #archiveteam
[05:04] *** marvinw has joined #archiveteam
[05:04] *** espes__ has joined #archiveteam
[05:04] *** Atom has quit IRC (Ping timeout: 633 seconds)
[05:06] *** MMovie has quit IRC (Read error: Operation timed out)
[05:07] *** RedType has joined #archiveteam
[05:07] *** DFJustin has joined #archiveteam
[05:07] *** swebb sets mode: +o DFJustin
[05:07] *** MMovie has joined #archiveteam
[05:07] *** Famicoman has quit IRC (Read error: Connection reset by peer)
[05:08] *** xmc has quit IRC (Ping timeout: 633 seconds)
[05:09] *** chfoo has quit IRC (Ping timeout: 633 seconds)
[05:09] *** is-_ has joined #archiveteam
[05:09] *** is- has quit IRC (Read error: Connection reset by peer)
[05:10] *** [phire] has joined #archiveteam
[05:10] *** SilSte has joined #archiveteam
[05:10] *** oli has quit IRC (Read error: Connection reset by peer)
[05:10] *** chfoo has joined #archiveteam
[05:11] *** xmc has joined #archiveteam
[05:11] *** swebb sets mode: +o xmc
[05:13] *** oli has joined #archiveteam
[05:16] *** ndiddy has quit IRC (Read error: Operation timed out)
[05:18] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:19] *** cadbury has joined #archiveteam
[05:25] *** Sk1d has joined #archiveteam
[05:25] *** SketchCow has joined #archiveteam
[05:25] *** swebb sets mode: +o SketchCow
[05:29] *** cadbury_ has joined #archiveteam
[05:29] *** cadbury has quit IRC (Read error: Connection reset by peer)
[05:37] *** cadbury_ has quit IRC (Read error: Connection reset by peer)
[05:37] *** cadbury has joined #archiveteam
[05:40] *** MMovie has quit IRC (Read error: Operation timed out)
[05:42] *** MMovie has joined #archiveteam
[05:48] *** cadbury has quit IRC (Read error: Operation timed out)
[05:55] *** bai_ is now known as bai
[05:57] *** ndizzle has joined #archiveteam
[05:59] *** cadbury has joined #archiveteam
[06:09] *** xXx_ndidd has quit IRC (Read error: Operation timed out)
[06:11] *** MMovie has quit IRC (Read error: Operation timed out)
[06:12] *** MMovie has joined #archiveteam
[06:18] *** cadbury has quit IRC (Read error: Connection reset by peer)
[06:22] *** cadbury has joined #archiveteam
[06:28] *** MMovie has quit IRC (Read error: Operation timed out)
[06:31] *** MMovie has joined #archiveteam
[06:38] *** WinterFox has joined #archiveteam
[06:48] *** MMovie has quit IRC (Read error: Operation timed out)
[06:50] *** MMovie has joined #archiveteam
[06:52] *** Famicoman has joined #archiveteam
[06:52] *** vitzli has joined #archiveteam
[07:05] *** phuzion has quit IRC (Read error: Operation timed out)
[07:11] *** MMovie has quit IRC (Read error: Operation timed out)
[07:13] *** MMovie has joined #archiveteam
[07:14] *** brayden has joined #archiveteam
[07:14] *** swebb sets mode: +o brayden
[07:27] *** robink has quit IRC (Ping timeout: 260 seconds)
[07:28] *** DopefishJ has joined #archiveteam
[07:28] *** swebb sets mode: +o DopefishJ
[07:29] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[07:38] *** robink has joined #archiveteam
[07:41] *** megaminxw has joined #archiveteam
[07:44] *** megaminxw has left
[07:44] *** is-_ is now known as is-
[07:50] *** JesseW has quit IRC (Quit: Leaving.)
[07:57] *** MMovie has quit IRC (Read error: Operation timed out)
[07:59] *** MMovie has joined #archiveteam
[08:13] *** logchfoo4 starts logging #archiveteam at Thu Mar 03 08:13:44 2016
[08:13] *** logchfoo4 has joined #archiveteam
[08:16] *** Atom has joined #archiveteam
[08:18] *** SirCmpwn has joined #archiveteam
[08:24] *** MMovie has quit IRC (Read error: Operation timed out)
[08:24] *** MMovie has joined #archiveteam
[08:33] *** schbirid has joined #archiveteam
[08:53] *** MMovie has quit IRC (Read error: Operation timed out)
[08:54] *** MMovie has joined #archiveteam
[09:10] *** MMovie has quit IRC (Read error: Operation timed out)
[09:11] *** MMovie has joined #archiveteam
[09:12] *** atomotic has joined #archiveteam
[09:15] *** jut has joined #archiveteam
[09:26] *** hawc145 is now known as HCross
[09:30] *** Swizzle__ has joined #archiveteam
[09:34] *** RichardG has quit IRC (Read error: Connection reset by peer)
[09:46] *** Swizzle__ has quit IRC (Read error: Operation timed out)
[10:07] *** MMovie has quit IRC (Read error: Operation timed out)
[10:08] *** MMovie has joined #archiveteam
[10:11] Hi davidar
[10:11] Would you like to have the PDF files as normal PDF files or as WARC (WebARChive) files?
[10:11] hi arkiver
[10:12] um, not sure - just normal pdfs probably
[10:12] Ok
[10:13] I can create a warrior project to grab all of the PDFs and send them to an rsync target
[10:13] Can you please send me the list of URLs here or privately?
[10:14] arkiver: that would be very cool, thanks. I don't have the list with me at the moment, but I'll send it through to you asap
[10:15] Thank you!
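Editor's sketch: the sizing discussed earlier in the log (a list of ~20M PDF links, a guessed average of ~1MB per file, and a suggested sample of ~1000 files) comes down to a simple extrapolation from a sample mean. The Python below is a minimal illustration under those assumptions, not anything the channel is known to have run; the function name and the sample values are hypothetical.

```python
def estimate_total_terabytes(sample_sizes_bytes, total_files):
    """Extrapolate total storage (decimal TB) from a sample of file sizes."""
    if not sample_sizes_bytes:
        raise ValueError("need at least one sampled file size")
    # Mean size of the sampled files, in bytes.
    mean_size = sum(sample_sizes_bytes) / len(sample_sizes_bytes)
    # Scale the mean up to the full list and convert bytes to terabytes.
    return mean_size * total_files / 1e12


# Hypothetical sample: 1000 files of exactly 1 MB each (the guess from the log).
sample = [1_000_000] * 1000
print(estimate_total_terabytes(sample, 20_000_000))
```

A few unusually large files in a real sample would pull the mean up, so an actual measurement could land above this 1MB-per-file baseline.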
[10:24] *** MMovie has quit IRC (Read error: Operation timed out)
[10:25] *** MMovie has joined #archiveteam
[10:29] *** bwn has quit IRC (Read error: Operation timed out)
[10:41] *** bwn has joined #archiveteam
[10:56] *** RichardG has joined #archiveteam
[11:16] *** MMovie has quit IRC (Read error: Operation timed out)
[11:17] *** Swizzle__ has joined #archiveteam
[11:17] *** MMovie has joined #archiveteam
[11:26] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[11:29] *** brayden has quit IRC (Ping timeout: 633 seconds)
[11:40] *** brayden has joined #archiveteam
[11:40] *** swebb sets mode: +o brayden
[11:56] *** Swizzle__ has quit IRC (Read error: Operation timed out)
[11:58] *** xXx_ndidd has joined #archiveteam
[11:58] *** MMovie has quit IRC (Read error: Operation timed out)
[11:59] *** MMovie has joined #archiveteam
[12:11] *** ndizzle has quit IRC (Read error: Operation timed out)
[12:16] *** MMovie has quit IRC (Read error: Operation timed out)
[12:17] *** MMovie has joined #archiveteam
[12:25] *** WinterFox has quit IRC (Remote host closed the connection)
[12:34] *** MMovie has quit IRC (Read error: Operation timed out)
[12:34] *** MMovie has joined #archiveteam
[12:35] *** atomotic has joined #archiveteam
[12:50] *** VADemon has joined #archiveteam
[12:51] *** MMovie has quit IRC (Read error: Operation timed out)
[12:51] *** MMovie has joined #archiveteam
[13:00] *** atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[13:01] *** atomotic has joined #archiveteam
[13:02] *** phuzion has joined #archiveteam
[13:17] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[13:17] *** Zei-Pii has joined #archiveteam
[13:17] *** MMovie has quit IRC (Read error: Operation timed out)
[13:19] *** MMovie has joined #archiveteam
[13:33] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:40] *** dashcloud has joined #archiveteam
[13:43] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:43] *** Zei-Pii has quit IRC (Read error: Operation timed out)
[13:44] *** Zei-Pii has joined #archiveteam
[13:50] *** dashcloud has joined #archiveteam
[13:53] *** MMovie has quit IRC (Read error: Operation timed out)
[13:54] *** atomotic has joined #archiveteam
[13:55] *** MMovie has joined #archiveteam
[14:08] ohhh snap, fotolog is past its expected possible unavail date >.<
[14:29] *** MMovie has quit IRC (Read error: Operation timed out)
[14:31] *** MMovie has joined #archiveteam
[14:48] *** MMovie has quit IRC (Read error: Operation timed out)
[14:49] *** MMovie has joined #archiveteam
[14:59] *** Start has quit IRC (Quit: Disconnected.)
[15:09] *** VADemon has quit IRC (Quit: left4dead)
[15:18] *** MMovie has quit IRC (Read error: Operation timed out)
[15:20] *** MMovie has joined #archiveteam
[15:30] *** Start has joined #archiveteam
[15:33] *** MMovie has quit IRC (Read error: Operation timed out)
[15:35] *** MMovie has joined #archiveteam
[15:36] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:40] *** Start has quit IRC (Quit: Disconnected.)
[15:42] *** RichardG has joined #archiveteam
[15:42] *** Start has joined #archiveteam
[15:53] *** Jonimus has joined #archiveteam
[15:53] *** swebb sets mode: +o Jonimus
[15:55] *** Froggypwn has quit IRC (Ping timeout: 258 seconds)
[15:58] *** Froggypwn has joined #archiveteam
[16:01] *** vitzli has quit IRC (Quit: Leaving)
[16:09] *** MMovie has quit IRC (Read error: Operation timed out)
[16:10] *** vitzli has joined #archiveteam
[16:11] *** MMovie has joined #archiveteam
[16:23] *** vitzli has quit IRC (Quit: Leaving)
[16:26] *** JesseW has joined #archiveteam
[16:30] *** ohhdemgir has quit IRC (Read error: Operation timed out)
[16:38] *** JesseW has quit IRC (Quit: Leaving.)
[16:41] *** MMovie has quit IRC (Read error: Operation timed out)
[16:43] *** MMovie has joined #archiveteam
[17:00] *** MMovie has quit IRC (Read error: Operation timed out)
[17:01] *** MMovie has joined #archiveteam
[17:07] *** Start has quit IRC (Quit: Disconnected.)
[17:11] *** Swizzle__ has joined #archiveteam
[17:31] *** MMovie has quit IRC (Read error: Operation timed out)
[17:34] *** MMovie has joined #archiveteam
[17:40] *** hive-mind has quit IRC (Remote host closed the connection)
[17:42] *** hive-mind has joined #archiveteam
[17:50] *** philpem has joined #archiveteam
[18:01] *** MMovie has quit IRC (Read error: Operation timed out)
[18:10] *** MMovie has joined #archiveteam
[18:28] *** MMovie has quit IRC (Read error: Operation timed out)
[18:30] *** MMovie has joined #archiveteam
[18:31] 200GB in 1 day
[18:31] neat
[18:40] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[18:45] *** jut has quit IRC (Ping timeout: 250 seconds)
[18:49] *** MMovie has quit IRC (Read error: Operation timed out)
[18:50] *** MMovie has joined #archiveteam
[19:01] *** BubuAnabe has joined #archiveteam
[19:06] *** MMovie has quit IRC (Read error: Operation timed out)
[19:08] *** MMovie has joined #archiveteam
[19:22] *** Start has joined #archiveteam
[19:27] *** Start has quit IRC (Client Quit)
[19:33] *** bwn has quit IRC (Read error: Operation timed out)
[19:38] *** MMovie has quit IRC (Read error: Operation timed out)
[19:40] *** MMovie has joined #archiveteam
[19:44] *** BnA-Robin has quit IRC (Ping timeout: 362 seconds)
[19:44] *** BnA-Robin has joined #archiveteam
[19:49] *** Ghost_of_ has joined #archiveteam
[19:50] *** bwn has joined #archiveteam
[19:55] *** BnA-Robin has quit IRC (Ping timeout: 633 seconds)
[20:10] *** MMovie has quit IRC (Read error: Operation timed out)
[20:11] *** MMovie has joined #archiveteam
[20:29] *** MMovie has quit IRC (Read error: Operation timed out)
[20:29] *** Start has joined #archiveteam
[20:30] *** MMovie has joined #archiveteam
[20:35] *** Zei-Pii has quit IRC (Read error: Connection reset by peer)
[20:45] *** Start has quit IRC (Quit: Disconnected.)
[21:05] *** MMovie has quit IRC (Read error: Operation timed out)
[21:06] *** MMovie has joined #archiveteam
[21:20] *** Ghost_of_ has quit IRC (Quit: Leaving)
[21:26] Does someone have a small rsync target available for some testing of the fotolog project?
[21:27] We think the rsync target for the discovery files might be holding up the grab
[21:27] So we'd like to test with a different rsync target
[21:27] It only needs a few GB
[21:27] mine from a while ago might still be open, cant remember
[21:27] Also, snape ^
[21:28] *** Boppen has joined #archiveteam
[21:32] *** brabo has joined #archiveteam
[21:32] *** matthusb- has joined #archiveteam
[21:34] *** SadDM has joined #archiveteam
[21:34] *** swebb sets mode: +o SadDM
[21:34] *** schbirid has quit IRC (Quit: Leaving)
[21:34] *** jspiros has joined #archiveteam
[21:36] *** bsmith093 has quit IRC (Ping timeout: 190 seconds)
[21:44] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[21:48] *** MMovie has quit IRC (Read error: Operation timed out)
[21:49] *** MMovie has joined #archiveteam
[21:52] https://www.youtube.com/c/paramountvault
[21:53] Smiley, want me to grab it?
[21:55] HCross: if you can
[21:55] US only apparently
[21:56] Hmm
[21:57] I might grab a US VPS, get it all down the pipe, then rsync it over to France and handle it from there
[21:57] :D
[22:00] the clips work in the UK, the main films dont
[22:12] *** MMovie has quit IRC (Read error: Operation timed out)
[22:13] *** MMovie has joined #archiveteam
[22:18] *** BubuAnabe has quit IRC (Ping timeout: 362 seconds)
[22:29] *** MMovie has quit IRC (Read error: Operation timed out)
[22:31] HCross: so this link isn't a full movie for you? https://www.youtube.com/watch?v=k3l4Wfrxf4c
[22:31] *** MMovie has joined #archiveteam
[22:31] This video is not available.
[22:31] that sucks
[22:32] both from the UK and from France
[22:34] I have no problem grabbing it with youtube-dl from my home connection in the US
[22:43] *** MMovie has quit IRC (Read error: Operation timed out)
[22:46] *** MMovie has joined #archiveteam
[23:03] *** MMovie has quit IRC (Read error: Operation timed out)
[23:04] *** MMovie has joined #archiveteam
[23:11] *** Start has joined #archiveteam
[23:13] *** nox has quit IRC (Read error: Operation timed out)
[23:13] *** nox_ has joined #archiveteam
[23:17] *** nox_ is now known as nox
[23:23] arkiver: I have something set up for your ftpgrab. See PM.
[23:32] *** MMovie has quit IRC (Read error: Operation timed out)
[23:34] *** MMovie has joined #archiveteam
[23:41] thanks Smiley!
[23:51] *** MMovie has quit IRC (Read error: Operation timed out)
[23:51] *** Stiletto has quit IRC (Ping timeout: 260 seconds)
[23:52] *** MMovie has joined #archiveteam