#archiveteam 2014-06-11,Wed


Time Nickname Message
04:30 🔗 exmic SketchCow: yeah, ftp.
04:30 🔗 exmic I'm in bumfuck utah this week
09:32 🔗 tephra SketchCow: stupid question but would like to make sure, do we have a grab of: https://www.aclu.org/nsa-documents-search ?
10:06 🔗 godane SketchCow: i see that you're sorting through my stuff
10:53 🔗 etesp and with 3790 pages of threads in just that subforum, that's going to slow things down a lot.
10:55 🔗 etesp is it possible to change the omits on a running archive project?
11:01 🔗 midas you mean in archivebot?
12:35 🔗 balrog ftp.netscape.com
12:35 🔗 balrog probably should be archived
12:52 🔗 midas ok balrog
12:52 🔗 midas grabbing it now
13:07 🔗 ohhdemgir http://www.irishtimes.com/culture/books/crowds-retrieve-100-000-books-dumped-in-skip-1.1827142
13:07 🔗 ohhdemgir midas, got it
13:07 🔗 ohhdemgir https://archive.org/details/ftp.netscape.com
13:11 🔗 midas lol, cancelling mine
13:16 🔗 balrog sorry ;)
13:31 🔗 JohnnyJac Hey, everybody. Saw a Defcon presentation for this project, and I absolutely love the work being done. Just moved into a new place, so my workshop isn't set up yet, but I think I may get in and throw in my efforts as well. Awesome work.
13:32 🔗 joepie91 JohnnyJac: welcome :)
13:33 🔗 joepie91 JohnnyJac: quick note; off-topic conversations and lengthy discussions generally take place in #archiveteam-bs, so as to not clog up this channel... it's not unusual for people to come in, shout that something's going down, and leave again
13:33 🔗 joepie91 and keeping off-topic separate makes it easier to keep track of that
13:35 🔗 JohnnyJac Noted, and changing IRC client accordingly.
13:38 🔗 joepie91 :)
14:20 🔗 SketchCow ftp.netscape.com has been grabbed three times.
14:43 🔗 asie only three?
14:43 🔗 asie i grabbed it once too
14:48 🔗 etesp 18:01:30 midas you mean in archivebot?
14:48 🔗 etesp yes
14:49 🔗 midas etesp: #archivebot
15:18 🔗 DFJustin I just see two, with the other one being some different old incarnation https://archive.org/search.php?query=%22ftp.netscape.com%22
15:19 🔗 SketchCow I mean, keep grabbing it, sure.
17:09 🔗 etesp did http://archivebot.at.ninjawedding.org:4567/#/histories/http://forums.spacebattles.com/forums/vs-debates.4/ get all the thread contents, or just the pages of threads?
17:09 🔗 etesp Looking at the other spacebattles archiving, i'm not seeing any topics being saved
17:10 🔗 etesp threads have urls like http://forums.spacebattles.com/threads/the-hundred-lives-of-the-dragon.301712/
17:10 🔗 DFJustin the threads are in a different folder so no
17:10 🔗 etesp ah
17:11 🔗 etesp could it be set to include links to threads?
17:11 🔗 etesp or too large a forum for that?
17:14 🔗 etesp they're mostly sequential, the text between /threads/ and .number/ is irrelevant, it redirects you to the right place
17:14 🔗 etesp if that helps
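
[Editor's sketch] etesp's observation suggests thread URLs could be enumerated by numeric ID alone. The following is a minimal sketch of that idea, assuming (as etesp reports) that the site redirects any slug to the right thread. The placeholder slug "x", the function name, and the ID range are hypothetical, and since the IDs are only "mostly sequential", some generated URLs will not resolve.

```python
from typing import Iterator

# Thread URL prefix taken from the example URL quoted in the channel.
BASE = "http://forums.spacebattles.com/threads/"


def thread_urls(first_id: int, last_id: int, slug: str = "x") -> Iterator[str]:
    """Yield one candidate thread URL per numeric ID, inclusive.

    The slug is a placeholder: per the discussion, the text between
    /threads/ and .number/ is ignored and the site redirects.
    """
    for thread_id in range(first_id, last_id + 1):
        yield f"{BASE}{slug}.{thread_id}/"


if __name__ == "__main__":
    # Hypothetical ID range, for illustration only.
    for url in thread_urls(301712, 301714):
        print(url)
```

A list produced this way could be handed to a crawler as a plain URL file instead of relying on link discovery from the forum index pages.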
17:40 🔗 ohhdemgir SketchCow, asie https://archive.org/details/ftp_netscape_com_2013_04 says "Captured on April 2013. Contains the FTP.NETSCAPE.COM site which shut down in 2005." and is 2.4GB, the last I did was https://archive.org/details/ftp.netscape.com and is 18GB larger.. heh, still up, we do need to start checking though I guess
17:40 🔗 asie oh, that one
17:40 🔗 ohhdemgir midas, did you get back into your box?
18:09 🔗 midas ohhdemgir: yep
18:10 🔗 midas and then i cancelled them all
18:10 🔗 ohhdemgir wut wut
18:16 🔗 midas with the amount of money i save i can get a bigger and faster box with a different ISP
18:38 🔗 SketchCow Hmmm, it appears to be taking Internet Archive a while to transfer this tiny 683gb file to their servers
18:40 🔗 midas just 683GB?
18:41 🔗 SketchCow Yeah, a pittance
18:41 🔗 midas rubbish service I say, apple provided me with 10.000 petabyte on a mobile phone.
18:43 🔗 midas hey SketchCow, what about a tour of IA streamed on justin.tv
18:43 🔗 SketchCow ha ha
18:44 🔗 yipdw make sure to turn archiving off
18:44 🔗 SketchCow I think that's the default
18:44 🔗 SketchCow Everywhere
18:44 🔗 midas but seriously, i'm up for a tour, i'm just about 10 hours away.. would be cool if there was an internet tour of some sort
20:45 🔗 balrog does anyone know if Cameron Kaiser (of tenfourfox/classilla) is on twitter? Classilla throws an SSL error on https://archive.org
22:49 🔗 dashcloud so, if I want wget to grab every page/subdomain on a site, but never go to any external domains, what commands do I need?
23:05 🔗 dashcloud is the section "Creating a WARC with Wget" here: http://www.archiveteam.org/index.php?title=Wget as close to a canonical/recommended Wget WARC command as exists currently?
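
[Editor's sketch] One way to approach dashcloud's first question is to let wget cross hosts but restrict it to a domain suffix. This is a sketch, not the channel's recommended command: it builds the argument list in Python using the stock GNU Wget options --recursive, --level, --span-hosts and --domains. The function name and example.com are placeholders.

```python
import shlex
from typing import List


def recursive_wget_args(start_url: str, domain: str) -> List[str]:
    """Build a wget argv that recurses across subdomains of `domain` only."""
    return [
        "wget",
        "--recursive",         # follow links
        "--level=inf",         # no depth limit
        "--span-hosts",        # allow leaving the starting host...
        f"--domains={domain}", # ...but only to hosts matching this domain
        start_url,
    ]


if __name__ == "__main__":
    # example.com is a placeholder site.
    print(shlex.join(recursive_wget_args("http://example.com/", "example.com")))
```

Without --span-hosts, wget stays on the starting host and never visits other subdomains; without --domains, --span-hosts would follow links to any external site.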
23:13 🔗 nico dashcloud: https://github.com/ArchiveTeam/ArchiveBot/commit/4f7e460add8a3d56debf7062674cece81fd6818e#diff-5016cb0693c5048fba29437c7301f0a8L372
23:13 🔗 nico That's what Archivebot used when it was still wget-based
23:17 🔗 dashcloud so, I understand most of that, but why is output-document and truncate output used?
23:24 🔗 yipdw dashcloud: output-document and truncate-output were used to avoid generating temporary files
23:24 🔗 yipdw by default, wget in recursive mode will mirror (as best as possible) the URL structure in the directory structure
23:24 🔗 yipdw we don't really care about that when generating a WARC
23:24 🔗 dashcloud so I do in fact want those if I'm planning to do a large grab then
23:24 🔗 yipdw not if you're writing a WARC
23:25 🔗 yipdw or you can afford the 2x disk space requirement
23:25 🔗 dashcloud I'll leave them out then
23:25 🔗 yipdw keep in mind too that recursive grab is problematic on bizarro directory names
23:25 🔗 yipdw or really fucking huge URLs
23:25 🔗 yipdw writing as WARC records sidesteps these problems
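
[Editor's sketch] yipdw's explanation can be illustrated by building the two variants of the command side by side. This is a sketch under assumptions: --warc-file and --output-document are stock GNU Wget options, while --truncate-output is taken from the conversation and, as far as I know, exists only in ArchiveTeam's patched wget build, not in stock Wget. The function name, file names and URL are hypothetical.

```python
from typing import List, Optional


def warc_wget_args(start_url: str, warc_name: str,
                   tmp_file: Optional[str] = None) -> List[str]:
    """Build a recursive wget argv that writes a WARC.

    If tmp_file is given, all downloaded bodies are sent to that single
    file instead of a mirrored directory tree, as described in the chat.
    """
    args = ["wget", "--recursive", "--level=inf", f"--warc-file={warc_name}"]
    if tmp_file is not None:
        args += [
            f"--output-document={tmp_file}",
            # Assumption: only available in a wget build that has this patch.
            "--truncate-output",
        ]
    args.append(start_url)
    return args


if __name__ == "__main__":
    # Mirrors the URL structure on disk in addition to writing the WARC.
    print(warc_wget_args("http://example.com/", "example"))
    # Sends bodies to one scratch file, so odd or very long paths never hit the filesystem.
    print(warc_wget_args("http://example.com/", "example", tmp_file="wget.tmp"))
```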
23:39 🔗 dashcloud is there something special I need to know about inserting a warc header for wget?
23:44 🔗 dashcloud taking the recommendations from all of the pages, here's the command I'm using: http://paste.archivingyoursh.it/kulekemini.mel . wget complains about the second warc header, saying it is invalid. does every warc header need to be like the first warc header?
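
[Editor's sketch] The pasted command is not reproduced in the log, so the actual cause of the error is unknown. As far as I know, wget expects each --warc-header value in "Name: value" form: a colon preceded by a non-empty name with no whitespace, and no newlines. The check below is a Python approximation of that rule, not wget's own code; the function name and sample values are illustrative.

```python
def looks_like_valid_warc_header(value: str) -> bool:
    """Approximate the 'Name: value' shape wget is assumed to require."""
    if "\n" in value:
        return False
    name, sep, _rest = value.partition(":")
    if not sep or not name:
        # No colon at all, or nothing before the colon.
        return False
    # The field name itself must not contain whitespace.
    return not any(ch.isspace() for ch in name)


if __name__ == "__main__":
    for candidate in ("operator: Archive Team", "operator Archive Team", ": no name"):
        print(repr(candidate), looks_like_valid_warc_header(candidate))
```

Under this assumption, every --warc-header argument needs the same shape as a working one, with a field name, a colon, then the value.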
