[04:30] SketchCow: yeah, ftp.
[04:30] I'm in bumfuck utah this week
[09:32] SketchCow: stupid question but would like to make sure, do we have a grab of: https://www.aclu.org/nsa-documents-search ?
[10:06] SketchCow: i see that you're sorting through my stuff
[10:53] and with 3790 pages of threads in just that subforum, that's going to slow things down a lot.
[10:55] is it possible to change the omits on a running archive project?
[11:01] you mean in archivebot?
[12:35] ftp.netscape.com
[12:35] probably should be archived
[12:52] ok balrog
[12:52] grabbing it now
[13:07] http://www.irishtimes.com/culture/books/crowds-retrieve-100-000-books-dumped-in-skip-1.1827142
[13:07] midas, got it
[13:07] https://archive.org/details/ftp.netscape.com
[13:11] lol, cancelling mine
[13:16] sorry ;)
[13:31] Hey, everybody. Saw a Defcon presentation for this project, and I absolutely love the work being done. Just moved into a new place, so my workshop isn't set up yet, but I think I may get in and throw in my efforts as well. Awesome work.
[13:32] JohnnyJac: welcome :)
[13:33] JohnnyJac: quick note; off-topic conversations and lengthy discussions generally take place in #archiveteam-bs, so as to not clog up this channel... it's not unusual for people to come in, shout that something's going down, and leave again
[13:33] and keeping off-topic separate makes it easier to keep track of that
[13:35] Noted, and changing IRC client accordingly.
[13:38] :)
[14:20] ftp.netscape.com has been grabbed three times.
[14:43] only three?
[14:43] i grabbed it once too
[14:48] 18:01:30 midas you mean in archivebot?
[14:48] yes
[14:49] etesp: #archivebot
[15:18] I just see two, with the other one being some different old incarnation https://archive.org/search.php?query=%22ftp.netscape.com%22
[15:19] I mean, keep grabbing it, sure.
[17:09] did http://archivebot.at.ninjawedding.org:4567/#/histories/http://forums.spacebattles.com/forums/vs-debates.4/ get all the thread contents, or just the pages of threads?
[17:09] Looking at the other spacebattles archiving, i'm not seeing any topics being saved
[17:10] threads have urls like http://forums.spacebattles.com/threads/the-hundred-lives-of-the-dragon.301712/
[17:10] the threads are in a different folder so no
[17:10] ah
[17:11] could it be set to include links to threads?
[17:11] or is the forum too large for that?
[17:14] they're mostly sequential, the text between /threads/ and .number/ is irrelevant, it redirects you to the right place
[17:14] if that helps
[17:40] SketchCow, asie https://archive.org/details/ftp_netscape_com_2013_04 says "Captured on April 2013. Contains the FTP.NETSCAPE.COM site which shut down in 2005." and is 2.4GB; the last one I did was https://archive.org/details/ftp.netscape.com and is 18GB larger.. heh, still up, we do need to start checking though I guess
[17:40] oh, that one
[17:40] midas, did you get back into your box?
[18:09] ohhdemgir: yep
[18:10] and then i cancelled them all
[18:10] wut wut
[18:16] with the amount of money i save i can get a bigger and faster box with a different ISP
[18:38] Hmmm, it appears to be taking Internet Archive a while to transfer this tiny 683GB file to their servers
[18:40] just 683GB?
[18:41] Yeah, a pittance
[18:41] rubbish service I say, apple provided me with 10,000 petabytes on a mobile phone.
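[Editor's aside, re the spacebattles thread URLs at [17:14] above: a sketch, not from the log. Since thread IDs are mostly sequential and the forum redirects any slug between /threads/ and the trailing ID to the canonical page, a URL list for a bulk grab could be generated with a shell loop like this. The "x" slug and the upper bound of 310000 are placeholders, not confirmed values.]

    # Enumerate candidate thread URLs; the slug "x" is arbitrary because
    # the forum redirects it, and 310000 is a guessed upper bound based on
    # the thread ID 301712 seen above.
    for i in $(seq 1 310000); do
        echo "http://forums.spacebattles.com/threads/x.$i/"
    done > spacebattles-threads.txt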
[18:43] hey SketchCow, what about a tour of IA streamed on justin.tv
[18:43] ha ha
[18:44] make sure to turn archiving off
[18:44] I think that's the default
[18:44] Everywhere
[18:44] but seriously, i'm up for a tour, i'm just about 10 hours away.. would be cool if there was an internet tour of some sort
[20:45] does anyone know if Cameron Kaiser (of tenfourfox/classilla) is on twitter? Classilla throws an SSL error on https://archive.org
[22:49] so, if I want wget to grab every page/subdomain on a site, but never go to any external domains, what commands do I need?
[23:05] is the section "Creating a WARC with Wget" here: http://www.archiveteam.org/index.php?title=Wget as close to a canonical/recommended Wget WARC command as exists currently?
[23:13] dashcloud: https://github.com/ArchiveTeam/ArchiveBot/commit/4f7e460add8a3d56debf7062674cece81fd6818e#diff-5016cb0693c5048fba29437c7301f0a8L372
[23:13] That's what ArchiveBot used when it was still wget-based
[23:17] so, I understand most of that, but why are output-document and truncate-output used?
[23:24] dashcloud: output-document and truncate-output were used to avoid generating temporary files
[23:24] by default, wget in recursive mode will mirror (as best as possible) the URL structure in the directory structure
[23:24] we don't really care about that when generating a WARC
[23:24] so I do in fact want those if I'm planning to do a large grab then
[23:24] not if you're writing a WARC
[23:25] or if you can afford the 2x disk space requirement
[23:25] I'll leave them out then
[23:25] keep in mind too that a recursive grab is problematic with bizarro directory names
[23:25] or really fucking huge URLs
[23:25] writing as WARC records sidesteps these problems
[23:39] is there something special I need to know about inserting a warc header with wget?
[23:44] taking the recommendations from all of the pages, here's the command I'm using: http://paste.archivingyoursh.it/kulekemini.mel wget complains about the second warc header, saying it is invalid; does every warc header need to be like the first warc header?
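[Editor's aside, re the wget questions from [22:49] onward: a sketch, not the wiki's canonical command. The flags below are stock GNU wget (1.14 or later for WARC support); example.com and the header values are placeholders. --span-hosts plus --domains keeps the crawl on a site and its subdomains while refusing external hosts, and each --warc-header argument must itself be a "name: value" pair, which is likely why wget rejected the second header above as invalid.]

    # Recursive grab of example.com and its subdomains only, written to a WARC.
    # --span-hosts allows leaving the start host; --domains restricts where to.
    wget --recursive --span-hosts --domains=example.com \
         --page-requisites \
         --warc-file=example.com-grab \
         --warc-header="operator: yournick" \
         --warc-header="downloaded-by: archiveteam" \
         --delete-after \
         http://example.com/

    # --delete-after discards the mirrored directory tree once each response is
    # in the WARC; the --output-document/--truncate-output pair discussed at
    # 23:24 appears to come from ArchiveBot's patched wget, not stock GNU wget.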