[00:20] urgh, my cousin in law was once on a telly program called crazy cottage
[00:20] need to try and see if i can find it ¬_¬
[01:53] uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror
[02:03] uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror
[02:05] 1998 to 2004 is not much bigger than the full 2005 article mirror
[02:37] uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror
[05:55] uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror
[05:55] uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror
[08:03] Hi. What is the easiest way to access .warc file contents on Windows?
[08:06] Never ask if you should archive something. Archive it and ask if any of us assholes want a copy
[08:06] and then keep it yourself
[08:08] SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB
[08:09] Duh
[08:10] What was it?
[08:10] the game developer that made games like Total Annihilation
[08:12] (that is, Total Annihilation, and another similar RTS with a more medieval theme)
[08:13] Approved.
[08:13] it also includes updates for their parent company, Humongous Entertainment
[08:13] Do you have a place for me to download from or do I need to give you a slot?
[08:15] I can set up a download in a moment
[08:20] uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror
[08:25] archiving feels so good
[08:25] especially when the original disappears
[08:26] Can anyone help me access the contents of a .warc file on Windows?
[08:43] Coderjoe: Oooh, I want that as well
[08:48] SketchCow: rsync path sent via PM
[08:48] ersi: I only have so much upstream bandwidth :-\
[08:49] looks like I probably did the last mirroring pass in 11/2005
[09:02] so i'm starting to do image grabs for each of my arstechnica dumps
[09:09] godane: What programs are you using for the archival process?
[09:14] wget
[09:16] thanks
[09:17] wget-1.14 (the latest) has support for writing to WARC files, thanks to alard
[09:18] I'm still trying to get stuff out of the WARC files.
[09:19] Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment
[09:20] hiker1: http://warctozip.archive.org/
[09:20] What are non-Windows users using?
[09:22] ersi: That website requires you to upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB.
[09:22] you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server
[09:23] But isn't there a reason stuff is stored using warc instead of zip?
[09:23] I never really need to open up WARCs. When I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (ie tef)
[09:24] How else do you read the contents of archived websites?
[09:24] https://github.com/tef/warctools
[09:24] The reason for WARC is metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response
[09:24] there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc
[09:25] * ersi nods
[09:25] The most common interface to actually view WARCs is the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;)
[09:25] if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc)
[09:25] well, there is the open-source wayback codebase
[09:25] true, but it's a pain in the ass to set up
[09:25] iirc, yipdw had an instance of that set up
[09:26] I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion.
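For reading a WARC directly without converting it, the metadata described above (original URL, date, request and response headers) can be pulled out with a few lines of standard-library Python. This is only a rough sketch, not how warctools or the IA warc library work internally. It assumes well-formed records whose headers include `Content-Length`, and the function names `iter_warc_records`, `open_warc` and `list_responses` are made up for illustration.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def open_warc(path):
    """Open a .warc or .warc.gz file for reading bytes."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def list_responses(path):
    """Return (URL, capture date) for every response record in the file."""
    with open_warc(path) as stream:
        return [
            (h.get("WARC-Target-URI"), h.get("WARC-Date"))
            for h, _ in iter_warc_records(stream)
            if h.get("WARC-Type") == "response"
        ]
```

Calling `list_responses("site.warc.gz")` (the file name is a placeholder) gives a quick listing of what was captured and when. For a response record the body still contains the raw HTTP response, headers included, so getting the actual file out means splitting the body at the first empty line.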
[09:26] I'm trying out IA's warc library for python https://github.com/internetarchive/warc
[09:26] that's the wayback
[09:27] IMO warc-tools from tef is better than IA warc
[09:27] Coderjoe: Absorbing your cavedog as we speak.
[09:27] Om nom nom
[09:27] Can you use the wayback machine to view warcs that are uploaded to IA?
[09:27] After I get this, ersi, it'll be on archive.org in seconds. No worries.
[09:28] SketchCow: I'll nom on it then
[09:28] I noticed (but only because I was watching the log)
[09:28] mmm
[09:28] traffic shaping and prioritizing really takes the pain away
[09:40] So is wget outputting to warc the preferred method for archiving sites? Not HTTrack?
[09:40] depends
[09:41] though I don't know about httrack
[09:41] IA has a tool called Heritrix for their normal crawls
[09:42] we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site.
[09:42] yup
[09:58] Or try viewing the warc with this: https://github.com/alard/warc-proxy :)
[09:59] That looks a lot closer to what I want
[10:00] Oops, thought about writing that one out as well :)
[10:00] alard: Why does it use a proxy instead of just running a web server?
[10:02] hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls.
[10:02] The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done.
[10:03] well, the nice thing about rewriting is then you can serve the files to other people through the web.
[10:04] with the proxy method, only a local user can access them, unless you make the proxy public, which would not be easy for most users to access
[10:04] the non-nice thing is that it's a pain in the ass
[10:04] Yes, but that's not what this tool is for. If you want to do that there's the wayback tool.
[10:04] What is the wayback tool?
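To show why rewriting is the painful option, here is a toy version of what a rewriting replay server has to do to every page it serves. It is not the wayback code. `REPLAY_PREFIX` and `rewrite_links` are hypothetical names, and a real implementation would also have to handle CSS, JavaScript, relative URLs and more. A proxy skips all of this because the browser already sends the original URL to it.

```python
import re

# Hypothetical address of a local replay server, for illustration only.
REPLAY_PREFIX = "http://localhost:8080/replay/"

# Matches href="..." or src="..." attributes holding an absolute http(s) URL.
_ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["'])(?P<url>https?://[^"']+)(?P=q)''',
    re.IGNORECASE,
)


def rewrite_links(html, prefix=REPLAY_PREFIX):
    """Point absolute links in an HTML string at the replay server."""
    def repl(match):
        quote = match.group("q")
        return f"{match.group('attr')}{quote}{prefix}{match.group('url')}{quote}"
    return _ATTR.sub(repl, html)
```

With this, `rewrite_links('<a href="http://example.com/a">x</a>')` points the link at the replay prefix, while relative links such as `/b.png` pass through unchanged.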
[10:04] Wayback Machine
[10:04] but that won't serve private warc files
[10:04] https://github.com/internetarchive/wayback
[10:04] https://github.com/internetarchive/wayback
[10:04] damn it
[10:04] Heh.
[10:05] But as you can see it's much harder to get that running than the warc-proxy + firefox addon.
[10:05] does warcproxy just grab whatever .warc files it sees?
[10:06] ah, nvm, it has a neat interface!
[10:07] wow, this is really impressive work
[10:13] +1 alard
[10:16] alard: Holy-moly, this goes to my favourites
[10:22] alard: the urls in the menu for warc-proxy don't work for me for some reason
[10:22] it doesn't take in the baseurl
[10:22] The base url didn't work for me, but the other ones did
[10:23] so it will go to folder/file instead of example.com/folder/file or something like that
[10:23] and so it would error
[10:23] That's strange.
[10:24] also when testing my eff.org grab it would just go to the real site
[10:24] (Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.)
[10:25] godane: Is that an https site?
[10:25] yes
[10:33] godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2
[10:44] is there an Internet Archive IRC channel somewhere, or is this the best bet?
[10:45] #internetarchive unofficial/semi-official channel
[10:45] cheers :)
[10:45] mostly just to get IA shizzle out of this channel :)
[10:46] yes, same people here and there mostly
[11:38] More hugs here
[11:41] Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition".
[11:41] And he stopped it.
[11:41] Any ideas?
[11:42] scrap and start it again?
[12:12] did you give it like a 10tb partition for /data?
[12:24] 10TB???
[12:48] tuabkiet: You don't have 10 TB of RAID space lying around?
[12:49] I don't use RAID, and my hard disk is 10 times smaller
[12:53] How do I get wget 1.14?
[12:58] hiker1: It's not in many repositories. You'll probably have to compile it yourself
[12:59] damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam?
[13:00] No, but I can probably help
[13:00] how long are you going to be on? I'm still downloading the Mint dvd.
[13:01] debian testing has wget 1.13.4-3
[13:01] How did you find that out?
[13:01] I was looking for a package listing but couldn't find one
[13:01] debian sid has wget 1.14
[13:01] http://packages.debian.org bro
[13:01] they hid it on their packages subdomain! those sneaky...
[13:02] you can probably install that .deb and everything will be fine
[13:02] I think there was an aptosid...
[13:03] you can probably just dpkg -i the .deb if you're inclined
[13:03] ersi: Do you use a linux distro? if so, which?
[13:04] Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian
[13:04] oh.
[13:04] no mint?
[13:04] nope. But it's just another Debian derivative
[13:06] http://archive.org/details/ftp_cavedog.com now up
[13:07] Where are archives of known dead sites kept?
[13:07] I only saw the just in time captures
[13:09] SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB.
[13:10] Most are up on archive.org
[13:10] SketchCow: thx~
[13:11] Does http://archive.org/details/archiveteam-fire include known dead sites?
[13:11] hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/
[13:12] (a slash at the end of the .tar usually gives you an index)
[13:12] alard: oh, wow, that is handy. Thank you.
[13:12] Also, you should trust me
[13:12] Everything I upload is awesome
[13:12] hah
[13:13] Indeed
[13:26] I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam?
[13:28] The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB.
[13:43] hiker1: Up it to Internet Archive NOW!
[13:43] I am not sure how
[13:44] Create an account first and foremost
[16:44] Bagger 288! Bagger 288!
[16:46] SketchCow: Did you find the two Dailybooth warc files I asked for?
[16:47] the tracker thing, e.g. as used at http://tracker.archiveteam.org/webshots/, could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link
[16:48] Agreed on wanna join.
[16:48] soultcer: No, I've been working on my presentation.
[16:48] E-mail me. jason@textfiles.com.
[16:48] Will do
[19:00] Has someone saved http://blog.webshots.com/ ?
[23:27] slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204
[23:28] now 5704 wikis beginning with "a" vs. 872 in previous snapshot
[23:29] still, looks like dumps are not generated for 80 % of the wikis they have, even if requested
[23:39] ---------------------------------------------------------------------------
[23:39] Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks!
[23:39] It's available on the projects tab of your warrior.
[23:39] Next station: DailyBooth.com, closing at the end of the year.
[23:39] If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab
[23:39] (All very similar to WebShots and previous projects.)
[23:39] Join #dailybooth for more detailed discussions.
[23:39] ---------------------------------------------------------------------------