#archiveteam 2012-12-04,Tue

Time Nickname Message
00:20 🔗 SmileyG urgh, my cousin in law was once on a telly program called crazy cottage
00:20 🔗 SmileyG need to try and see if i can find it ¬_¬
01:53 🔗 godane uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror
02:03 🔗 godane uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror
02:05 🔗 godane 1998 to 2004 is not much bigger than the full 2005 article mirror
02:37 🔗 godane uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror
05:55 🔗 godane uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror
05:55 🔗 godane uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror
08:03 🔗 hiker1 Hi. What is the easiest way to access .warc file contents on Windows?
08:06 🔗 SketchCow Never ask if you should archive something. Archive it and ask if any of us assholes want a copy
08:06 🔗 SketchCow and then keep it yourself
08:08 🔗 Coderjoe SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB
08:09 🔗 SketchCow Duh
08:10 🔗 SketchCow What was it?
08:10 🔗 Coderjoe the game developer that made games like Total Annihilation
08:12 🔗 Coderjoe (that is, Total Annihilation, and another similar RTS with a more medieval theme)
08:13 🔗 SketchCow Approved.
08:13 🔗 Coderjoe it also includes updates for their parent company, Humongous Entertainment
08:13 🔗 SketchCow Do you have a place for me to download from or do I need to give you a slot?
08:15 🔗 Coderjoe I can set up a download in a moment
08:20 🔗 godane uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror
08:25 🔗 nova archiving feels so good
08:25 🔗 nova especially when the original disappears
08:26 🔗 hiker1 Can anyone help me access the contents of a .warc file on Windows?
08:43 🔗 ersi Coderjoe: Oooh, I want that as well
08:48 🔗 Coderjoe SketchCow: rsync path sent via PM
08:48 🔗 Coderjoe ersi: I only have so much upstream bandwidth :-\
08:49 🔗 Coderjoe looks like I probably did the last mirroring pass in 11/2005
09:02 🔗 godane so i'm starting to do image grabs for each of my arstechnica dumps
09:09 🔗 hiker1 godane: What programs are you using for the archival process?
09:14 🔗 godane wget
09:16 🔗 hiker1 thanks
09:17 🔗 ersi wget-1.14 (the latest) has support for writing to WARC files, thanks to alard
09:18 🔗 hiker1 I'm still trying to get stuff out of the WARC files.
09:19 🔗 ersi Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment
09:20 🔗 ersi hiker1: http://warctozip.archive.org/
09:20 🔗 hiker1 What are non-Windows users using?
09:22 🔗 hiker1 ersi: That website requires you to upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB.
09:22 🔗 Coderjoe you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server
09:23 🔗 hiker1 But isn't there a reason stuff is stored using warc instead of zip?
09:23 🔗 ersi I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (i.e. tef)
09:24 🔗 hiker1 How else do you read the contents of archived websites?
09:24 🔗 ersi https://github.com/tef/warctools
09:24 🔗 ersi The reason for WARC is this: metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response
09:24 🔗 Coderjoe there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc
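[Editor's sketch] The metadata described above sits in plain-text headers at the start of each WARC record, which is why reading the file with `less` works at all. The following is a minimal reader for an uncompressed WARC stream, assuming well-formed records; `read_warc_records`, `HTTP_RESPONSE` and `SAMPLE` are names made up for this illustration and are not part of warc-tools or the IA warc library.

```python
import io

def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says how many bytes of payload follow the headers.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# A hand-made record for illustration only (not taken from a real crawl).
HTTP_RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
SAMPLE = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n",
    b"Content-Length: %d\r\n" % len(HTTP_RESPONSE),
    b"\r\n",
    HTTP_RESPONSE,
    b"\r\n\r\n",
])

for headers, payload in read_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], headers["WARC-Date"])
```

For a gzip-compressed file, passing the object returned by `gzip.open(path, "rb")` instead of the `BytesIO` should work the same way, though that is untested here.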
09:25 🔗 * ersi nods
09:25 🔗 ersi The most common interface to actually view WARCs is the Internet Archive Wayback Machine. But you can't use that for your own WARCs ;)
09:25 🔗 Coderjoe if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc)
09:25 🔗 Coderjoe well, there is the open-source wayback codebase
09:25 🔗 ersi true, but it's a pain in the ass to set up
09:25 🔗 Coderjoe iirc, yipdw had an instance of that set up
09:26 🔗 hiker1 I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion.
09:26 🔗 hiker1 I'm trying out IA's warc library for python https://github.com/internetarchive/warc
09:26 🔗 Coderjoe that's the wayback
09:27 🔗 ersi IMO warc-tools from tef is better than IA warc
09:27 🔗 SketchCow Coderjoe: Absorbing your cavedog as we speak.
09:27 🔗 ersi Om nom nom
09:27 🔗 hiker1 Can you use the wayback machine to view warc's that are uploaded to IA?
09:27 🔗 SketchCow After I get this, ersi, it'll be on archive.org in seconds. No worries.
09:28 🔗 ersi SketchCow: I'll nom on it then
09:28 🔗 Coderjoe I noticed (but only because I was watching the log)
09:28 🔗 Coderjoe mmm
09:28 🔗 Coderjoe traffic shaping and prioritizing really takes the pain away
09:40 🔗 hiker1 So is wget outputting to warc the preferred method for archiving sites? Not HTTrack?
09:40 🔗 Coderjoe depends
09:41 🔗 Coderjoe though I don't know about httrack
09:41 🔗 Coderjoe IA has a tool called Heritrix for their normal crawls
09:42 🔗 Coderjoe we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site.
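[Editor's sketch] A rough idea of what a wget 1.14 invocation with WARC output and robots.txt disabled could look like, built as an argument list in Python. `build_wget_command` is a hypothetical helper, and the option set (`--mirror`, `--page-requisites`) is an assumption chosen for illustration; it is not ArchiveTeam's actual command line and leaves out the Lua scripting mentioned above.

```python
import shlex

def build_wget_command(url, warc_name, ignore_robots=True):
    """Build a wget argument list that mirrors a site and writes a WARC."""
    cmd = [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch images, CSS and the like
        f"--warc-file={warc_name}",   # WARC output, available since wget 1.14
    ]
    if ignore_robots:
        cmd += ["-e", "robots=off"]   # run the wgetrc command "robots=off"
    cmd.append(url)
    return cmd

command = build_wget_command("http://example.com/", "example.com")
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command)` once a wget build with WARC support is on the path.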
09:42 🔗 chronomex yup
09:58 🔗 alard Or try viewing the warc with this: https://github.com/alard/warc-proxy :)
09:59 🔗 hiker1 That looks a lot closer to what I want
10:00 🔗 ersi Oops, thought about writing that one out as well :)
10:00 🔗 hiker1 alard: Why does it use a proxy instead of just running a web server?
10:02 🔗 alard hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls.
10:02 🔗 alard The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done.
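[Editor's sketch] Why the proxy approach avoids URL rewriting: a browser configured to use an HTTP proxy sends the full original URL in its request line, so the server can look the page up by that URL and return the archived bytes unchanged. The code below is a toy illustration of that lookup, not code from warc-proxy; `target_url`, `serve_from_archive` and `ARCHIVE_INDEX` are invented names, and the index is a plain dict standing in for whatever a real tool builds from the WARC.

```python
def target_url(request_line):
    """Return the URL from a proxy-style request line.

    Example input: 'GET http://example.com/page HTTP/1.1'
    """
    _method, target, _version = request_line.split()
    if not target.startswith("http://"):
        # A request sent directly to a web server carries only a path.
        raise ValueError("not a proxy-style request (no absolute URL)")
    return target

def serve_from_archive(index, request_line):
    """Look up the archived body for the requested URL, or None if absent."""
    return index.get(target_url(request_line))

# Hypothetical index mapping original URLs to archived response bodies.
ARCHIVE_INDEX = {
    "http://example.com/folder/file": b"<html>archived copy</html>",
}

body = serve_from_archive(ARCHIVE_INDEX, "GET http://example.com/folder/file HTTP/1.1")
print(body)
```

Links inside the archived page still point at their original addresses, and the browser sends those through the proxy too, so nothing in the page has to be edited.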
10:03 🔗 hiker1 well, the nice thing about rewriting is then you can serve the files to other people through the web.
10:04 🔗 hiker1 with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access
10:04 🔗 ersi the non-nice thing is that it's a pain in the ass
10:04 🔗 alard Yes, but that's not what this tool is for. If you want to do that there's the wayback tool.
10:04 🔗 hiker1 What is the wayback tool?
10:04 🔗 ersi Wayback Machine
10:04 🔗 hiker1 but that won't serve private warc files
10:04 🔗 alard https://github.com/internetarchive/wayback
10:04 🔗 ersi https://github.com/internetarchive/wayback
10:04 🔗 ersi damn it
10:04 🔗 alard Heh.
10:05 🔗 alard But as you can see it's much harder to get that running than the warc-proxy + firefox addon.
10:05 🔗 hiker1 does warcproxy just grab whatever .warc files it sees?
10:06 🔗 hiker1 ah, nvm, it has a neat interface!
10:07 🔗 hiker1 wow, this is really impressive work
10:13 🔗 ersi +1 alard
10:16 🔗 norbert79 alard: Holy-moly, this goes to my favourites
10:22 🔗 godane alard: the urls in menu for warc-proxy don't work for me for some reason
10:22 🔗 godane it doesn't take in the baseurl
10:22 🔗 hiker1 The base url didn't work for me, but the other ones did
10:23 🔗 godane so it will go to folder/file instead of example.com/folder/file or something like that
10:23 🔗 godane and so it would error
10:23 🔗 alard That's strange.
10:24 🔗 godane also when testing my eff.org grab it would just go to real site
10:24 🔗 alard (Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.)
10:25 🔗 alard godane: Is that an https site?
10:25 🔗 godane yes
10:33 🔗 alard godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2
10:44 🔗 ats is there an Internet Archive IRC channel somewhere, or is this the best bet?
10:45 🔗 ersi #internetarchive is the unofficial/semi-official channel
10:45 🔗 ats cheers :)
10:45 🔗 ersi mostly just to get IA shizzle out of this channel :)
10:46 🔗 chronomex yes, same people here and there mostly
11:38 🔗 SketchCow More hugs here
11:41 🔗 SketchCow Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition".
11:41 🔗 SketchCow And he stopped it.
11:41 🔗 SketchCow Any ideas?
11:42 🔗 ersi scrap and start it again?
12:12 🔗 SmileyG did you give it like a 10tb partition for /data?
12:24 🔗 tuabkiet 10TB???
12:48 🔗 hiker1 tuabkiet: You don't have 10 TB of RAID space lying around?
12:49 🔗 tuabkiet I don't use RAID, and my hard disk is 10 times smaller
12:53 🔗 hiker1 How do I get wget 1.14?
12:58 🔗 ersi hiker1: It's not in many repositories. You'll probably have to compile it yourself
12:59 🔗 hiker1 damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam?
13:00 🔗 ersi No, but I can probably help
13:00 🔗 hiker1 how long are you going to be on? I'm still downloading the Mint dvd.
13:01 🔗 ersi debian testing has wget 1.13.4-3
13:01 🔗 hiker1 How did you find that out?
13:01 🔗 hiker1 I was looking for a package listing but couldn't find one
13:01 🔗 ersi debian sid has wget 1.14
13:01 🔗 ersi http://packages.debian.org bro
13:01 🔗 hiker1 they hid it on their packages subdomain! those sneaky...
13:02 🔗 ersi you can probably install that .deb and everything will be fine
13:02 🔗 hiker1 I think there was an aptosid...
13:03 🔗 ersi you can probably just dpkg -i the .deb if you're inclined
13:03 🔗 hiker1 ersi: Do you use a linux distro? if so, which?
13:04 🔗 ersi Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian
13:04 🔗 hiker1 oh.
13:04 🔗 hiker1 no mint?
13:04 🔗 ersi nope. But it's just another Debian derivative
13:06 🔗 SketchCow http://archive.org/details/ftp_cavedog.com now up
13:07 🔗 hiker1 Where are archives of known dead sites kept?
13:07 🔗 hiker1 I only saw the just in time captures
13:09 🔗 hiker1 SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB.
13:10 🔗 ersi Most are up on archive.org
13:10 🔗 ersi SketchCow: thx~
13:11 🔗 hiker1 Does http://archive.org/details/archiveteam-fire include known dead sites?
13:11 🔗 alard hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/
13:12 🔗 alard (a slash at the end of the .tar usually gives you an index)
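[Editor's sketch] The trailing-slash trick above can be expressed as a small URL builder. `tar_listing_url` is a made-up helper; the URL pattern is taken from the link alard posted, and whether an index is actually served depends on archive.org, as the "usually" above suggests.

```python
from urllib.parse import quote

def tar_listing_url(item, tar_name):
    """Build the archive.org URL that usually lists the contents of a tar file."""
    # The trailing slash after the .tar name is what asks for the index.
    return f"http://archive.org/download/{quote(item)}/{quote(tar_name)}/"

print(tar_listing_url("ftp_cavedog.com", "ftp.cavedog.com.tar"))
```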
13:12 🔗 hiker1 alard: oh, wow, that is handy. Thank you.
13:12 🔗 SketchCow Also, you should trust me
13:12 🔗 SketchCow Everything I upload is awesome
13:12 🔗 hiker1 hah
13:13 🔗 ersi Indeed
13:26 🔗 hiker1 I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam?
13:28 🔗 hiker1 The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB.
13:43 🔗 tuabkiet hiker1: Up it to Internet Archive NOW!
13:43 🔗 hiker1 I am not sure how
13:44 🔗 ersi Create an account first and foremost
16:44 🔗 SketchCow Bagger 288! Bagger 288!
16:46 🔗 soultcer SketchCow: Did you find the two Dailybooth warc files I asked for?
16:47 🔗 schbiridi the tracker thing, e.g. as used at http://tracker.archiveteam.org/webshots/, could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link
16:48 🔗 SketchCow Agreed on wanna join.
16:48 🔗 SketchCow soultcer: No, I've been working on my presentation.
16:48 🔗 SketchCow E-mail me. jason@textfiles.com.
16:48 🔗 soultcer Will do
19:00 🔗 alard Has someone saved the http://blog.webshots.com/ ?
23:27 🔗 Nemo_bis slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204
23:28 🔗 Nemo_bis now 5704 wikis beginning with "a" vs. 872 in previous snapshot
23:29 🔗 Nemo_bis still, looks like dumps are not generated for 80% of the wikis they have, even if requested
23:39 🔗 alard ---------------------------------------------------------------------------
23:39 🔗 alard Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks!
23:39 🔗 alard It's available on the projects tab of your warrior.
23:39 🔗 alard Next station: DailyBooth.com, closing at the end of the year.
23:39 🔗 alard If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab
23:39 🔗 alard (All very similar to WebShots and previous projects.)
23:39 🔗 alard Join #dailybooth for more detailed discussions.
23:39 🔗 alard ---------------------------------------------------------------------------
