#archiveteam 2012-12-04,Tue


Time	Nickname	Message
00:20 ^🔗	SmileyG	urgh, my cousin in law was once on a telly program called crazy cottage
00:20 ^🔗	SmileyG	need to try and see if i can find it ¬_¬
01:53 ^🔗	godane	uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror
02:03 ^🔗	godane	uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror
02:05 ^🔗	godane	1998 to 2004 is not much bigger than the full 2005 article mirror
02:37 ^🔗	godane	uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror
05:55 ^🔗	godane	uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror
05:55 ^🔗	godane	uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror
08:03 ^🔗	hiker1	Hi. What is the easiest way to access .warc file contents on Windows?
08:06 ^🔗	SketchCow	Never ask if you should archive something. Archive it and ask if any of us assholes want a copy
08:06 ^🔗	SketchCow	and then keep it yourself
08:08 ^🔗	Coderjoe	SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB
08:09 ^🔗	SketchCow	Duh
08:10 ^🔗	SketchCow	What was it?
08:10 ^🔗	Coderjoe	the game developer that made games like Total Annihilation
08:12 ^🔗	Coderjoe	(that is, Total Annihilation, and another similar RTS with a more medieval theme)
08:13 ^🔗	SketchCow	Approved.
08:13 ^🔗	Coderjoe	it also includes updates for their parent company, Humongous Entertainment
08:13 ^🔗	SketchCow	Do you have a place for me to download from or do I need to give you a slot?
08:15 ^🔗	Coderjoe	I can set up a download in a moment
08:20 ^🔗	godane	uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror
08:25 ^🔗	nova	archiving feels so good
08:25 ^🔗	nova	especially when the original disappears
08:26 ^🔗	hiker1	Can anyone help me access the contents of a .warc file on Windows?
08:43 ^🔗	ersi	Coderjoe: Oooh, I want that as well
08:48 ^🔗	Coderjoe	SketchCow: rsync path sent via PM
08:48 ^🔗	Coderjoe	ersi: I only have so much upstream bandwidth :-\
08:49 ^🔗	Coderjoe	looks like I probably did the last mirroring pass in 11/2005
09:02 ^🔗	godane	so i'm starting to do image grabs for each of my arstechnica dumps
09:09 ^🔗	hiker1	godane: What programs are you using for the archival process?
09:14 ^🔗	godane	wget
09:16 ^🔗	hiker1	thanks
09:17 ^🔗	ersi	wget-1.14 (the latest) has support for writing to WARC files, thanks to alard
09:18 ^🔗	hiker1	I'm still trying to get stuff out of the WARC files.
09:19 ^🔗	ersi	Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment
09:20 ^🔗	ersi	hiker1: http://warctozip.archive.org/
09:20 ^🔗	hiker1	What are non-Windows users using?
09:22 ^🔗	hiker1	ersi: That website requires you upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB.
09:22 ^🔗	Coderjoe	you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server
09:23 ^🔗	hiker1	But isn't there a reason stuff is stored using warc instead of zip?
09:23 ^🔗	ersi	I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (ie tef)
09:24 ^🔗	hiker1	How else do you read the contents of archived websites?
09:24 ^🔗	ersi	https://github.com/tef/warctools
09:24 ^🔗	ersi	The reason for WARC is that; Metadata. You'll know from WHERE and WHEN the data was downloaded. Because you have the HTTP Headers for both the Request and Response
09:24 ^🔗	Coderjoe	there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc
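
To make the metadata point concrete, here is a minimal sketch of reading records from an uncompressed WARC stream with only the Python standard library. It assumes the plain layout of a version line, `Name: value` header lines, a blank line, then `Content-Length` bytes of payload. The function name `read_warc_records` and the sample record are made up for illustration, and it ignores edge cases such as folded header lines. It is not how warctools or the IA warc library are implemented.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            # Split on the first colon only, so URLs and dates stay intact.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical sample record: an HTTP response wrapped in WARC headers.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n",
    b"Content-Length: " + str(len(http_response)).encode("ascii") + b"\r\n",
    b"\r\n",
    http_response,
    b"\r\n\r\n",
])

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    # WHERE and WHEN the data was downloaded, straight from the record headers.
    print(headers["WARC-Target-URI"], headers["WARC-Date"], len(payload))
```

For a `.warc.gz`, opening the file with `gzip.open(path, "rb")` and passing that to the same function should work, since the `gzip` module reads concatenated members as one stream.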
09:25 ^🔗	*	ersi nods
09:25 ^🔗	ersi	The most common interface to actually view WARCs is, the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;)
09:25 ^🔗	Coderjoe	if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc)
09:25 ^🔗	Coderjoe	well, there is the open-source wayback codebase
09:25 ^🔗	ersi	true, but it's a pain in the ass to setup
09:25 ^🔗	Coderjoe	iirc, yipdw had an instance of that set up
09:26 ^🔗	hiker1	I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion.
09:26 ^🔗	hiker1	I'm trying out IA's warc library for python https://github.com/internetarchive/warc
09:26 ^🔗	Coderjoe	that's the wayback
09:27 ^🔗	ersi	IMO warc-tools from tef is better than IA warc
09:27 ^🔗	SketchCow	Coderjoe: Absorbing your cavedog as we speak.
09:27 ^🔗	ersi	Om nom nom
09:27 ^🔗	hiker1	Can you use the wayback machine to view warc's that are uploaded to IA?
09:27 ^🔗	SketchCow	After I get this, ersi, it'll be on archive.org in seconds. No worries.
09:28 ^🔗	ersi	SketchCow: I'll nom on it then
09:28 ^🔗	Coderjoe	I noticed (but only because I was watching the log)
09:28 ^🔗	Coderjoe	mmm
09:28 ^🔗	Coderjoe	traffic shaping and prioritizing really takes the pain away
09:40 ^🔗	hiker1	So is wget outputting to warc the preferred method for archiving sites? Not HTTrack?
09:40 ^🔗	Coderjoe	depends
09:41 ^🔗	Coderjoe	though I don't know about httrack
09:41 ^🔗	Coderjoe	IA has a tool called Heritrix for their normal crawls
09:42 ^🔗	Coderjoe	we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site.
09:42 ^🔗	chronomex	yup
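
As a rough illustration of the wget approach, the sketch below only builds a command line and does not run it. It assumes wget 1.14 or later for `--warc-file`. The helper name `build_wget_warc_command`, the URL and the WARC name are hypothetical. The Lua scripting mentioned above needs a patched wget build and is not shown here.

```python
import shlex


def build_wget_warc_command(url, warc_name):
    """Return an argument list for a recursive wget grab that also writes a WARC."""
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch images, CSS and scripts
        "-e", "robots=off",           # ignore robots.txt
        "--warc-file=" + warc_name,   # wget appends the .warc.gz extension
        url,
    ]


cmd = build_wget_warc_command("http://example.com/", "example.com-20121204")
# Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments as a list avoids shell quoting problems if the command is later handed to `subprocess.run`.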
09:58 ^🔗	alard	Or try viewing the warc with this: https://github.com/alard/warc-proxy :)
09:59 ^🔗	hiker1	That looks a lot closer to what I want
10:00 ^🔗	ersi	Oops, thought about writing that one out as well :)
10:00 ^🔗	hiker1	alard: Why does it use a proxy instead of just running a web server?
10:02 ^🔗	alard	hiker1: The thing is the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls.
10:02 ^🔗	alard	The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done.
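
The difference between the two approaches can be sketched in a few lines. This is a toy model, not the code of warc-proxy or the wayback software: `ARCHIVE`, `proxy_lookup`, `rewrite_links` and the replay prefix are all invented for the example. A browser configured to use an HTTP proxy sends the full URL in the request line, so the proxy can look it up directly; a replay web server instead has to rewrite every link in the pages it serves.

```python
import re

# Hypothetical archive contents, keyed by the original URL.
ARCHIVE = {
    "http://example.com/": '<a href="http://example.com/about">About</a>',
    "http://example.com/about": "<p>About page</p>",
}


def proxy_lookup(request_line):
    """Proxy style: the request line already carries the absolute URL."""
    _method, url, _version = request_line.split(" ", 2)
    return ARCHIVE.get(url)


def rewrite_links(html, prefix):
    """Rewrite style: point absolute href/src links back at the replay server."""
    def repl(match):
        return '%s="%s%s"' % (match.group(1), prefix, match.group(2))
    return re.sub(r'(href|src)="(https?://[^"]+)"', repl, html)


# Proxy: no changes to the stored page are needed.
page = proxy_lookup("GET http://example.com/about HTTP/1.1")

# Rewriting: every served page has to be modified first.
rewritten = rewrite_links(ARCHIVE["http://example.com/"], "http://localhost:8080/web/")
print(page)
print(rewritten)
```

A real rewriter also has to deal with relative links, CSS and JavaScript, which is a large part of why the rewriting route is more work.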
10:03 ^🔗	hiker1	well, the nice thing about rewriting is then you can serve the files to other people through the web.
10:04 ^🔗	hiker1	with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access
10:04 ^🔗	ersi	the non-nice thing is that it's a pain in the ass
10:04 ^🔗	alard	Yes, but that's not what this tool is for. If you want to do that there's the wayback tool.
10:04 ^🔗	hiker1	What is the wayback tool?
10:04 ^🔗	ersi	Wayback Machine
10:04 ^🔗	hiker1	but that won't serve private warc files
10:04 ^🔗	alard	https://github.com/internetarchive/wayback
10:04 ^🔗	ersi	https://github.com/internetarchive/wayback
10:04 ^🔗	ersi	damn it
10:04 ^🔗	alard	Heh.
10:05 ^🔗	alard	But as you can see it's much harder to get that running than the warc-proxy + firefox addon.
10:05 ^🔗	hiker1	does warcproxy just grab whatever .warc files it sees?
10:06 ^🔗	hiker1	ah, nvm, it has a neat interface!
10:07 ^🔗	hiker1	wow, this is really impressive work
10:13 ^🔗	ersi	+1 alard
10:16 ^🔗	norbert79	alard: Holy-moly, this goes to my favourites
10:22 ^🔗	godane	alard: the urls in menu for warc-proxy don't work for me for some reason
10:22 ^🔗	godane	it doesn't take in the baseurl
10:22 ^🔗	hiker1	The base url didn't work for me, but the other ones did
10:23 ^🔗	godane	so it will go to folder/file instead of example.com/folder/file or something like that
10:23 ^🔗	godane	and so it would error
10:23 ^🔗	alard	That's strange.
10:24 ^🔗	godane	also when testing my eff.org grab it would just go to real site
10:24 ^🔗	alard	(Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.)
10:25 ^🔗	alard	godane: Is that an https site?
10:25 ^🔗	godane	yes
10:33 ^🔗	alard	godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2
10:44 ^🔗	ats	is there an Internet Archive IRC channel somewhere, or is this the best bet?
10:45 ^🔗	ersi	#internetarchive unofficial/semi-official channel
10:45 ^🔗	ats	cheers :)
10:45 ^🔗	ersi	mostly just to get IA shizzle out of this channel :)
10:46 ^🔗	chronomex	yes, same people here and there mostly
11:38 ^🔗	SketchCow	More hugs here
11:41 ^🔗	SketchCow	Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition".
11:41 ^🔗	SketchCow	And he stopped it.
11:41 ^🔗	SketchCow	Any ideas?
11:42 ^🔗	ersi	scrap and start it again?
12:12 ^🔗	SmileyG	did you give it like a 10tb partition for /data?
12:24 ^🔗	tuabkiet	10TB???
12:48 ^🔗	hiker1	tuabkiet: You don't have 10 TB of RAID space lying around?
12:49 ^🔗	tuabkiet	I don't use RAID, and my hard disk is 10 times smaller
12:53 ^🔗	hiker1	How do I get wget 1.14?
12:58 ^🔗	ersi	hiker1: It's not in many repositories. You'll probably have to compile it yourself
12:59 ^🔗	hiker1	damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam?
13:00 ^🔗	ersi	No, but I can probably help
13:00 ^🔗	hiker1	how long are you going to be on? I'm still downloading the Mint dvd.
13:01 ^🔗	ersi	debian testing has wget 1.13.4-3
13:01 ^🔗	hiker1	How did you find that out?
13:01 ^🔗	hiker1	I was looking for a package listing but couldn't find one
13:01 ^🔗	ersi	debian sid has wget 1.14
13:01 ^🔗	ersi	http://packages.debian.org bro
13:01 ^🔗	hiker1	they hid it on their packages subdomain! those sneaky...
13:02 ^🔗	ersi	you can probably install that .deb and everything will be fine
13:02 ^🔗	hiker1	I think there was an aptosid...
13:03 ^🔗	ersi	you can probably just dpkg -i the .deb if you're inclined
13:03 ^🔗	hiker1	ersi: Do you use a linux distro? if so, which?
13:04 ^🔗	ersi	Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian
13:04 ^🔗	hiker1	oh.
13:04 ^🔗	hiker1	no mint?
13:04 ^🔗	ersi	nope. But it's just another Debian derivative
13:06 ^🔗	SketchCow	http://archive.org/details/ftp_cavedog.com now up
13:07 ^🔗	hiker1	Where are archives of known dead sites kept?
13:07 ^🔗	hiker1	I only saw the just in time captures
13:09 ^🔗	hiker1	SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB.
13:10 ^🔗	ersi	Most are up on archive.org
13:10 ^🔗	ersi	SketchCow: thx~
13:11 ^🔗	hiker1	Does http://archive.org/details/archiveteam-fire include known dead sites?
13:11 ^🔗	alard	hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/
13:12 ^🔗	alard	(a slash at the end of the .tar usually gives you an index)
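
The trailing-slash trick can be written as a tiny helper. The function name `tar_index_url` is made up; the URL pattern is simply the one from the link above, and whether an index is actually served depends on archive.org.

```python
def tar_index_url(item, filename):
    """Build the URL that usually shows a file listing for a tar inside an item."""
    # The trailing slash after the tar filename is what asks for the index.
    return "http://archive.org/download/%s/%s/" % (item, filename)


url = tar_index_url("ftp_cavedog.com", "ftp.cavedog.com.tar")
print(url)
```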
13:12 ^🔗	hiker1	alard: oh, wow, that is handy. Thank you.
13:12 ^🔗	SketchCow	Also, you should trust me
13:12 ^🔗	SketchCow	Everything I upload is awesome
13:12 ^🔗	hiker1	hah
13:13 ^🔗	ersi	Indeed
13:26 ^🔗	hiker1	I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam?
13:28 ^🔗	hiker1	The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB.
13:43 ^🔗	tuabkiet	hiker1: Up it to Internet Archive NOW!
13:43 ^🔗	hiker1	I am not sure how
13:44 ^🔗	ersi	Create an account first and foremost
16:44 ^🔗	SketchCow	Bagger 288! Bagger 288!
16:46 ^🔗	soultcer	SketchCow: Did you find the two Dailybooth warc files I asked for?
16:47 ^🔗	schbiridi	the tracker thing eg used at http://tracker.archiveteam.org/webshots/ could use a link "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior"
16:48 ^🔗	SketchCow	Agreed on wanna join.
16:48 ^🔗	SketchCow	soultcer: No, I've been working on my presentation.
16:48 ^🔗	SketchCow	E-mail me. jason@textfiles.com.
16:48 ^🔗	soultcer	Will do
19:00 ^🔗	alard	Has someone saved the http://blog.webshots.com/ ?
23:27 ^🔗	Nemo_bis	slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204
23:28 ^🔗	Nemo_bis	now 5704 wikis beginning with "a" vs. 872 in previous snapshot
23:29 ^🔗	Nemo_bis	still, looks like dumps are not generated for 80 % of wikis they have even if requested
23:39 ^🔗	alard	---------------------------------------------------------------------------
23:39 ^🔗	alard	Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks!
23:39 ^🔗	alard	It's available on the projects tab of your warrior.
23:39 ^🔗	alard	Next station: DailyBooth.com, closing at the end of the year.
23:39 ^🔗	alard	If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab
23:39 ^🔗	alard	(All very similar to WebShots and previous projects.)
23:39 ^🔗	alard	Join #dailybooth for more detailed discussions.
23:39 ^🔗	alard	---------------------------------------------------------------------------
