#archiveteam 2014-06-11,Wed


Time	Nickname	Message
04:30 ^🔗	exmic	SketchCow: yeah, ftp.
04:30 ^🔗	exmic	I'm in bumfuck utah this week
09:32 ^🔗	tephra	SketchCow: stupid question but would like to make sure, do we have a grab of: https://www.aclu.org/nsa-documents-search ?
10:06 ^🔗	godane	SketchCow: i see that you're sorting through my stuff
10:53 ^🔗	etesp	and with 3790 pages of threads in just that subforum, that's going to slow things down a lot.
10:55 ^🔗	etesp	is it possible to change the omits on a running archive project?
11:01 ^🔗	midas	you mean in archivebot?
12:35 ^🔗	balrog	ftp.netscape.com
12:35 ^🔗	balrog	probably should be archived
12:52 ^🔗	midas	ok balrog
12:52 ^🔗	midas	grabbing it now
13:07 ^🔗	ohhdemgir	http://www.irishtimes.com/culture/books/crowds-retrieve-100-000-books-dumped-in-skip-1.1827142
13:07 ^🔗	ohhdemgir	midas, got it
13:07 ^🔗	ohhdemgir	https://archive.org/details/ftp.netscape.com
13:11 ^🔗	midas	lol, cancelling mine
13:16 ^🔗	balrog	sorry ;)
13:31 ^🔗	JohnnyJac	Hey, everybody. Saw a Defcon presentation for this project, and I absolutely love the work being done. Just moved into a new place, so my workshop isn't set up yet, but I think I may get in and throw in my efforts as well. Awesome work.
13:32 ^🔗	joepie91	JohnnyJac: welcome :)
13:33 ^🔗	joepie91	JohnnyJac: quick note; off-topic conversations and lengthy discussions generally take place in #archiveteam-bs, so as to not clog up this channel... it's not unusual for people to come in, shout that something's going down, and leave again
13:33 ^🔗	joepie91	and keeping off-topic separate makes it easier to keep track of that
13:35 ^🔗	JohnnyJac	Noted, and changing IRC client accordingly.
13:38 ^🔗	joepie91	:)
14:20 ^🔗	SketchCow	ftp.netscape.com has been grabbed three times.
14:43 ^🔗	asie	only three?
14:43 ^🔗	asie	i grabbed it once too
14:48 ^🔗	etesp	18:01:30 midas you mean in archivebot?
14:48 ^🔗	etesp	yes
14:49 ^🔗	midas	etesp: #archivebot
15:18 ^🔗	DFJustin	I just see two, with the other one being some different old incarnation https://archive.org/search.php?query=%22ftp.netscape.com%22
15:19 ^🔗	SketchCow	I mean, keep grabbing it, sure.
17:09 ^🔗	etesp	did http://archivebot.at.ninjawedding.org:4567/#/histories/http://forums.spacebattles.com/forums/vs-debates.4/ get all the thread contents, or just the pages of threads?
17:09 ^🔗	etesp	Looking at the other spacebattles archiving, i'm not seeing any topics being saved
17:10 ^🔗	etesp	threads have urls like http://forums.spacebattles.com/threads/the-hundred-lives-of-the-dragon.301712/
17:10 ^🔗	DFJustin	the threads are in a different folder so no
17:10 ^🔗	etesp	ah
17:11 ^🔗	etesp	could it be set to include links to threads?
17:11 ^🔗	etesp	or too large a forum for that?
17:14 ^🔗	etesp	they're mostly sequential, the text between /threads/ and .number/ is irrelevant, it redirects you to the right place
17:14 ^🔗	etesp	if that helps
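
The URL pattern etesp describes can be sketched as follows. This is only an illustration, assuming (as stated above) that the server ignores the slug and redirects on the numeric ID alone; the placeholder slug `x` and the helper names are hypothetical and were not part of the conversation.

```python
# Sketch: enumerate thread URLs by numeric ID, assuming the slug is ignored
# by the server and the request is redirected to the canonical thread URL.
BASE = "http://forums.spacebattles.com/threads"


def thread_url(thread_id: int, slug: str = "x") -> str:
    """Build a thread URL; `slug` is a placeholder (hypothetical default)."""
    return f"{BASE}/{slug}.{thread_id}/"


def thread_urls(first_id: int, last_id: int) -> list[str]:
    """Return URLs for every ID in the inclusive range first_id..last_id."""
    return [thread_url(i) for i in range(first_id, last_id + 1)]


if __name__ == "__main__":
    # IDs are only "mostly sequential", so some of these may not exist.
    for url in thread_urls(301710, 301712):
        print(url)
```

A list like this could be fed to a crawler as seed URLs instead of relying on links discovered from the subforum index pages.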
17:40 ^🔗	ohhdemgir	SketchCow, asie https://archive.org/details/ftp_netscape_com_2013_04 says "Captured on April 2013. Contains the FTP.NETSCAPE.COM site which shut down in 2005." and is 2.4GB, the last I did was https://archive.org/details/ftp.netscape.com and is 18GB larger.. heh, still up, we do need to start checking though I guess
17:40 ^🔗	asie	oh, that one
17:40 ^🔗	ohhdemgir	midas, did you get back into your box?
18:09 ^🔗	midas	ohhdemgir: yep
18:10 ^🔗	midas	and then i cancelled them all
18:10 ^🔗	ohhdemgir	wut wut
18:16 ^🔗	midas	with the amount of money i save i can get a bigger and faster box with a different ISP
18:38 ^🔗	SketchCow	Hmmm, it appears to be taking Internet Archive a while to transfer this tiny 683gb file to their servers
18:40 ^🔗	midas	just 683GB?
18:41 ^🔗	SketchCow	Yeah, a pittance
18:41 ^🔗	midas	rubbish service I say, apple provided me with 10.000 petabyte on a mobile phone.
18:43 ^🔗	midas	hey SketchCow, what about a tour of IA streamed on justin.tv
18:43 ^🔗	SketchCow	ha ha
18:44 ^🔗	yipdw	make sure to turn archiving off
18:44 ^🔗	SketchCow	I think that's the default
18:44 ^🔗	SketchCow	Everywhere
18:44 ^🔗	midas	but seriously, i'm up for a tour, i'm just about 10 hours away.. would be cool if there was an internet tour of some sort
20:45 ^🔗	balrog	does anyone know if Cameron Kaiser (of tenfourfox/classilla) is on twitter? Classilla throws an SSL error on https://archive.org
22:49 ^🔗	dashcloud	so, if I want wget to grab every page/subdomain on a site, but never go to any external domains, what commands do I need?
23:05 ^🔗	dashcloud	is the section "Creating a WARC with Wget" here: http://www.archiveteam.org/index.php?title=Wget as close to a canonical/recommended Wget WARC command as exists currently?
23:13 ^🔗	nico	dashcloud: https://github.com/ArchiveTeam/ArchiveBot/commit/4f7e460add8a3d56debf7062674cece81fd6818e#diff-5016cb0693c5048fba29437c7301f0a8L372
23:13 ^🔗	nico	That's what Archivebot used when it was still wget-based
23:17 ^🔗	dashcloud	so, I understand most of that, but why is output-document and truncate output used?
23:24 ^🔗	yipdw	dashcloud: output-document and truncate-output were used to avoid generating temporary files
23:24 ^🔗	yipdw	by default, wget in recursive mode will mirror (as best as possible) the URL structure in the directory structure
23:24 ^🔗	yipdw	we don't really care about that when generating a WARC
23:24 ^🔗	dashcloud	so I do in fact want those if I'm planning to do a large grab then
23:24 ^🔗	yipdw	not if you're writing a WARC
23:25 ^🔗	yipdw	or you can afford the 2x disk space requirement
23:25 ^🔗	dashcloud	I'll leave them out then
23:25 ^🔗	yipdw	keep in mind too that recursive grab is problematic on bizarro directory names
23:25 ^🔗	yipdw	or really fucking huge URLs
23:25 ^🔗	yipdw	writing as WARC records sidesteps these problems
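
A rough sketch of the kind of command discussed above: a recursive grab that follows subdomains but stays inside one domain and writes a WARC. The function name, `example.com`, and the file names are hypothetical. `--recursive`, `--level`, `--span-hosts`, `--domains` and `--warc-file` are standard wget options; `--truncate-output` is taken from the conversation and is assumed here to exist only in ArchiveTeam's patched wget build, so it is off by default.

```python
# Sketch: assemble a wget argument list for a domain-restricted WARC grab.
# Nothing is executed here; pass the list to subprocess.run() to run it.


def build_wget_warc_command(start_url: str, domain: str, warc_name: str,
                            discard_files: bool = False) -> list[str]:
    cmd = [
        "wget",
        "--recursive",
        "--level=inf",               # no depth limit
        "--span-hosts",              # allow other hosts (subdomains) ...
        f"--domains={domain}",       # ... but only those under this domain
        f"--warc-file={warc_name}",  # wget appends the .warc(.gz) extension
    ]
    if discard_files:
        # As described by yipdw: avoid building a mirrored directory tree.
        # --truncate-output is assumed to need ArchiveTeam's patched wget.
        cmd += ["--output-document=wget.tmp", "--truncate-output"]
    cmd.append(start_url)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_wget_warc_command(
        "http://example.com/", "example.com", "example.com-grab")))
```

Leaving `discard_files` off matches dashcloud's choice above, at the cost of keeping the mirrored files on disk next to the WARC.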
23:39 ^🔗	dashcloud	is there something special I need to know about inserting a warc header for wget?
23:44 ^🔗	dashcloud	taking the recommendations from all of the pages, here's the command I'm using: http://paste.archivingyoursh.it/kulekemini.mel and wget complains about the second warc header, saying it is invalid. Does every warc header need to be like the first warc header?
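
The pasted command is not shown in the log, so the cause of the error is unknown. One plausible explanation, assuming wget checks `--warc-header` values the same way it checks user-supplied HTTP headers, is that each value must have the form `Name: value`: a non-empty name with no whitespace, followed by a colon, and no newlines anywhere. The helper below is a hypothetical pre-flight check written under that assumption; it is not wget's own code.

```python
# Sketch: check that a --warc-header value has a "Name: value" shape.
# The rules are an assumption about wget's validation, not taken from the log.


def looks_like_valid_header(header: str) -> bool:
    name, colon, _value = header.partition(":")
    if not colon or not name:
        return False  # missing colon, or nothing before it
    if any(ch.isspace() for ch in name):
        return False  # whitespace inside the header name
    return "\n" not in header  # newlines are not allowed


if __name__ == "__main__":
    for candidate in ("operator: Archive Team", "operator Archive Team"):
        print(repr(candidate), looks_like_valid_header(candidate))
```

When building the command in a shell, each header also needs quoting (for example `--warc-header="operator: Archive Team"`) so that the space does not split it into separate arguments.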
