[00:20] <SmileyG> urgh, my cousin in law was once on a telly program called crazy cottage
[00:20] <SmileyG> need to try and see if i can find it ¬_¬
[01:53] <godane> uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror
[02:03] <godane> uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror
[02:05] <godane> 1998 to 2004 is not much bigger than the full 2005 article mirror
[02:37] <godane> uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror
[05:55] <godane> uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror
[05:55] <godane> uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror
[08:03] <hiker1> Hi. What is the easiest way to access .warc file contents on Windows?
[08:06] <SketchCow> Never ask if you should archive something. Archive it and ask if any of us assholes want a copy
[08:06] <SketchCow> and then keep it yourself
[08:08] <Coderjoe> SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB
[08:09] <SketchCow> Duh
[08:10] <SketchCow> What was it?
[08:10] <Coderjoe> the game developer that made games like Total Annihilation
[08:12] <Coderjoe> (that is, Total Annihilation, and another similar RTS with a more medieval theme)
[08:13] <SketchCow> Approved.
[08:13] <Coderjoe> it also includes updates for their parent company, Humongous Entertainment
[08:13] <SketchCow> Do you have a place for me to download from or do I need to give you a slot?
[08:15] <Coderjoe> I can set up a download in a moment
[08:20] <godane> uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror
[08:25] <nova> archiving feels so good
[08:25] <nova> especially when the original disappears
[08:26] <hiker1> Can anyone help me access the contents of a .warc file on Windows?
[08:43] <ersi> Coderjoe: Oooh, I want that as well
[08:48] <Coderjoe> SketchCow: rsync path sent via PM
[08:48] <Coderjoe> ersi: I only have so much upstream bandwidth :-\
[08:49] <Coderjoe> looks like I probably did the last mirroring pass in 11/2005
[09:02] <godane> so i'm starting to do image grabs for each of my arstechnica dumps
[09:09] <hiker1> godane: What programs are you using for the archival process?
[09:14] <godane> wget
[09:16] <hiker1> thanks
[09:17] <ersi> wget-1.14 (the latest) has support for writing to WARC files, thanks to alard
[09:18] <hiker1> I'm still trying to get stuff out of the WARC files.
[09:19] <ersi> Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment
[09:20] <ersi> hiker1: http://warctozip.archive.org/
[09:20] <hiker1> What are non-Windows users using?
[09:22] <hiker1> ersi: That website requires you to upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB.
[09:22] <Coderjoe> you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server
[09:23] <hiker1> But isn't there a reason stuff is stored using warc instead of zip?
[09:23] <ersi> I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (i.e. tef)
[09:24] <hiker1> How else do you read the contents of archived websites?
[09:24] <ersi> https://github.com/tef/warctools
[09:24] <ersi> The reason for WARC is metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response
[09:24] <Coderjoe> there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc
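The metadata described above sits in plain-text headers at the top of each WARC record, which is why `less` is enough to read it. A minimal sketch, assuming an uncompressed WARC/1.0 stream; the helper names and the sample record are made up for illustration and are not taken from warctools or IA's warc library:

```python
import io


def read_warc_headers(stream):
    """Read one record's header block; return a dict, or None at end of stream."""
    line = stream.readline()
    # Skip the blank lines that separate consecutive records.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None
    headers = {"version": line.strip().decode("utf-8")}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    return headers


def iter_warc_headers(stream):
    """Yield the header dict of every record, skipping the record bodies."""
    while True:
        headers = read_warc_headers(stream)
        if headers is None:
            return
        # The body length is given by the Content-Length header.
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers


# Hypothetical single-record WARC, built in memory for the example.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

for headers in iter_warc_headers(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], headers["WARC-Date"])
```

For a real file, the same functions should work on `open(path, "rb")`, or on `gzip.open(path, "rb")` for a `.warc.gz`, though this sketch does no error handling for malformed records.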
[09:25] * ersi nods
[09:25] <ersi> The most common interface to actually view WARCs is the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;)
[09:25] <Coderjoe> if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc)
[09:25] <Coderjoe> well, there is the open-source wayback codebase
[09:25] <ersi> true, but it's a pain in the ass to setup
[09:25] <Coderjoe> iirc, yipdw had an instance of that set up
[09:26] <hiker1> I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion.
[09:26] <hiker1> I'm trying out IA's warc library for python https://github.com/internetarchive/warc
[09:26] <Coderjoe> that's the wayback
[09:27] <ersi> IMO warc-tools from tef is better than IA warc
[09:27] <SketchCow> Coderjoe: Absorbing your cavedog as we speak.
[09:27] <ersi> Om nom nom
[09:27] <hiker1> Can you use the wayback machine to view warc's that are uploaded to IA?
[09:27] <SketchCow> After I get this, ersi, it'll be on archive.org in seconds. No worries.
[09:28] <ersi> SketchCow: I'll nom on it then
[09:28] <Coderjoe> I noticed (but only because I was watching the log)
[09:28] <Coderjoe> mmm
[09:28] <Coderjoe> traffic shaping and prioritizing really takes the pain away
[09:40] <hiker1> So is wget outputting to warc the preferred method for archiving sites? Not HTTrack?
[09:40] <Coderjoe> depends
[09:41] <Coderjoe> though I don't know about httrack
[09:41] <Coderjoe> IA has a tool called Heritrix for their normal crawls
[09:42] <Coderjoe> we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site.
[09:42] <chronomex> yup
[09:58] <alard> Or try viewing the warc with this: https://github.com/alard/warc-proxy :)
[09:59] <hiker1> That looks a lot closer to what I want
[10:00] <ersi> Oops, thought about writing that one out as well :)
[10:00] <hiker1> alard: Why does it use a proxy instead of just running a web server?
[10:02] <alard> hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls.
[10:02] <alard> The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done.
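To illustrate the rewriting a wayback-style interface has to do, and that a proxy avoids, here is a rough sketch. It is not the actual Wayback code: `rewrite_links`, the regex, and the `http://localhost:8080/archive/` prefix are all hypothetical, and a real rewriter would also need to handle CSS, JavaScript, `srcset`, and so on.

```python
import re
from urllib.parse import urljoin

# Matches href="..." and src="..." attributes (single or double quoted).
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def rewrite_links(html, page_url, prefix):
    """Point every href/src at the archive server instead of the live site."""

    def repl(match):
        # Resolve relative links against the original page URL first.
        absolute = urljoin(page_url, match.group("url"))
        quote = match.group("q")
        return f'{match.group("attr")}{quote}{prefix}{absolute}{quote}'

    return ATTR_RE.sub(repl, html)


page = '<a href="/about">About</a> <img src="http://example.com/logo.png">'
print(rewrite_links(page, "http://example.com/index.html",
                    "http://localhost:8080/archive/"))
```

With a proxy, none of this is needed: the browser asks the proxy for the original URL and gets the archived response back unchanged.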
[10:03] <hiker1> well, the nice thing about rewriting is then you can serve the files to other people through the web.
[10:04] <hiker1> with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access
[10:04] <ersi> the non-nice thing is that it's a pain in the ass
[10:04] <alard> Yes, but that's not what this tool is for. If you want to do that there's the wayback tool.
[10:04] <hiker1> What is the wayback tool?
[10:04] <ersi> Wayback Machine
[10:04] <hiker1> but that won't serve private warc files
[10:04] <alard> https://github.com/internetarchive/wayback
[10:04] <ersi> https://github.com/internetarchive/wayback
[10:04] <ersi> damn it
[10:04] <alard> Heh.
[10:05] <alard> But as you can see it's much harder to get that running than the warc-proxy + firefox addon.
[10:05] <hiker1> does warcproxy just grab whatever .warc files it sees?
[10:06] <hiker1> ah, nvm, it has a neat interface!
[10:07] <hiker1> wow, this is really impressive work
[10:13] <ersi> +1 alard
[10:16] <norbert79> alard: Holy-moly, this goes to my favourites
[10:22] <godane> alard: the urls in menu for warc-proxy don't work for me for some reason
[10:22] <godane> it doesn't take in the baseurl
[10:22] <hiker1> The base url didn't work for me, but the other ones did
[10:23] <godane> so it will go to folder/file instead of example.com/folder/file or something like that
[10:23] <godane> and so it would error
[10:23] <alard> That's strange.
[10:24] <godane> also when testing my eff.org grab it would just go to real site
[10:24] <alard> (Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.)
[10:25] <alard> godane: Is that an https site?
[10:25] <godane> yes
[10:33] <alard> godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2
[10:44] <ats> is there an Internet Archive IRC channel somewhere, or is this the best bet?
[10:45] <ersi> #internetarchive unofficial/semi-official channel
[10:45] <ats> cheers :)
[10:45] <ersi> mostly just to get IA shizzle out of this channel :)
[10:46] <chronomex> yes, same people here and there mostly
[11:38] <SketchCow> More hugs here
[11:41] <SketchCow> Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition".
[11:41] <SketchCow> And he stopped it.
[11:41] <SketchCow> Any ideas?
[11:42] <ersi> scrap and start it again?
[12:12] <SmileyG> did you give it like a 10tb partition for /data?
[12:24] <tuabkiet> 10TB???
[12:48] <hiker1> tuabkiet: You don't have 10 TB of RAID space lying around?
[12:49] <tuabkiet> I don't use RAID, and my hard disk is 10 times smaller
[12:53] <hiker1> How do I get wget 1.14?
[12:58] <ersi> hiker1: It's not in many repositories. You'll probably have to compile it yourself
[12:59] <hiker1> damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam?
[13:00] <ersi> No, but I can probably help
[13:00] <hiker1> how long are you going to be on? I'm still downloading the Mint dvd.
[13:01] <ersi> debian testing has wget 1.13.4-3
[13:01] <hiker1> How did you find that out?
[13:01] <hiker1> I was looking for a package listing but couldn't find one
[13:01] <ersi> debian sid has wget 1.14
[13:01] <ersi> http://packages.debian.org bro
[13:01] <hiker1> they hid it on their packages subdomain! those sneaky...
[13:02] <ersi> you can probably install that .deb and everything will be fine
[13:02] <hiker1> I think there was an aptosid...
[13:03] <ersi> you can probably just dpkg -i the .deb if you're inclined
[13:03] <hiker1> ersi: Do you use a linux distro? if so, which?
[13:04] <ersi> Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian
[13:04] <hiker1> oh.
[13:04] <hiker1> no mint?
[13:04] <ersi> nope. But it's just another Debian derivative
[13:06] <SketchCow> http://archive.org/details/ftp_cavedog.com now up
[13:07] <hiker1> Where are archives of known dead sites kept?
[13:07] <hiker1> I only saw the just in time captures
[13:09] <hiker1> SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB.
[13:10] <ersi> Most are up on archive.org
[13:10] <ersi> SketchCow: thx~
[13:11] <hiker1> Does http://archive.org/details/archiveteam-fire include known dead sites?
[13:11] <alard> hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/
[13:12] <alard> (a slash at the end of the .tar usually gives you an index)
[13:12] <hiker1> alard: oh, wow, that is handy. Thank you.
[13:12] <SketchCow> Also, you should trust me
[13:12] <SketchCow> Everything I upload is awesome
[13:12] <hiker1> hah
[13:13] <ersi> Indeed
[13:26] <hiker1> I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam?
[13:28] <hiker1> The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB.
[13:43] <tuabkiet> hiker1: Up it to Internet Archive NOW!
[13:43] <hiker1> I am not sure how
[13:44] <ersi> Create an account first and foremost
[16:44] <SketchCow> Bagger 288! Bagger 288!
[16:46] <soultcer> SketchCow: Did you find the two Dailybooth warc files I asked for?
[16:47] <schbiridi> the tracker thing eg used at http://tracker.archiveteam.org/webshots/ could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link
[16:48] <SketchCow> Agreed on wanna join.
[16:48] <SketchCow> soultcer: No, I've been working on my presentation.
[16:48] <SketchCow> E-mail me. jason@textfiles.com.
[16:48] <soultcer> Will do
[19:00] <alard> Has someone saved the http://blog.webshots.com/ ?
[23:27] <Nemo_bis> slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204
[23:28] <Nemo_bis> now 5704 wikis beginning by "a" vs. 872 in previous snapshot
[23:29] <Nemo_bis> still, looks like dumps are not generated for 80 % of wikis they have even if requested
[23:39] <alard> ---------------------------------------------------------------------------
[23:39] <alard> Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks!
[23:39] <alard> It's available on the projects tab of your warrior.
[23:39] <alard> Next station: DailyBooth.com, closing at the end of the year.
[23:39] <alard> If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab
[23:39] <alard> (All very similar to WebShots and previous projects.)
[23:39] <alard> Join #dailybooth for more detailed discussions.
[23:39] <alard> ---------------------------------------------------------------------------