[04:25] Is there an easy way to convert a warc to a warc.gz?
[04:29] you want to compress it with gzip?
[04:40] I think warc.gz technically compresses the individual blocks, and is not simply a compressed file.
[04:45] Correct.
[04:57] Are mirrors classified as panic grabs even if there isn't any real worry the site will go down?
[05:03] hiker2: i'm always doing panic grabs of sites that are not going down
[05:04] Where do I upload them to?
[05:04] i normally upload my stuff to texts
[05:05] when it's warc.gz dumps
[05:11] Wouldn't the files compress a lot better if you compressed them after saving them to e.g. a 7z file?
[05:13] yes but then the wayback machine's software wouldn't be able to ingest them directly
[05:13] Is anyone actually using the wayback machine software to view warcs besides IA?
[05:14] Remember: warc.gz isn't just a compressed warc file
[05:14] It compresses the chunks, but not the headers and hashes
[05:14] or something
[05:14] and that's why its compression suffers
[05:14] there's the IA partners under the Archive-It umbrella http://archive-it.org/
[05:14] GLaDOS: It appears to compress the entire Record, including headers.
[05:15] DFJustin: but no one here uses it. And the WARCs generated here are not being used with the wayback machine.
[05:15] that second point is incorrect
[05:16] jason has worked with the IA guys to ingest everything we've done into the new beta wayback machine
[05:16] except for the very latest new grabs
[05:16] Can the beta be accessed from anywhere?
[05:16] yes
[05:16] http://wayback-beta.archive.org/
[05:17] DFJustin: could you give me an example of a site that #archiveteam saved and is now available through the beta?
[05:18] look at Nov/Dec here http://web-beta.archive.org/web/20110701000000*/http://www.splinder.com/
[05:18] the regular wayback machine was doing spotty crawls but the huge spike is us
[05:19] Do they load for you?
[05:19] ah, it loaded
[05:20] neat.
I didn't realize this stuff was being pooled together somewhere
[05:20] How quickly are sites added to the machine after being grabbed on here?
[05:21] updating the wayback machine has so far been a manual process done irregularly at multi-month intervals, I think the plan with the new one is to do it more often but I don't know the details
[05:26] here's the spreadsheet jason was using, blue means go for wayback ingestion https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
[05:28] so not actually everything because mobileme is too freaking huge for right now
[06:03] there must have been older pdfs on computerpoweruser.com
[06:06] nevermind
[06:06] it was a dead link
[06:23] If the wayback machine already has a good archive, should I bother archiving a site?
[06:23] i'm grabbing ftp://ftp.futurenet.com
[06:23] hiker2: i say archive it again
[06:23] sometimes wayback machine can't get stuff cause of robots.txt
[06:24] some of the wayback machine grabs don't properly archive external images either
[06:25] The first real spider I wrote now grabs all the urls in a sitemap.xml. Seems to work well for blogspot sites, so you can just download the sitemap and feed it into the spider.
[06:45] godane: How can I tell which uploads are yours?
[06:59] hiker2: https://archive.org/search.php?query=uploader%3A%22slaxemulator%40gmail.com%22
[07:00] Do individual items not show who uploaded them?
[07:36] they do but you have to look at the meta.xml
[07:37] unless you're a collection admin
[07:37] i'm grabbing www.polymicrosystems.com/files/ but am NOT using warc... not even sure HOW to use warc
[07:38] http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[08:09] i'm now uploading the official xbox magazine web archive
[08:23] looks like the wayback machine has all of the xbox podcasts from dl.oxmonline.com
[08:41] LilyLivingstone: 5 megabyte hard drive from 1956, being loaded via forklift onto plane.
http://t.co/Cop9kR0l
[09:43] is anyone mirroring g4tv videos?
[10:17] hiker2: The warcproxy also depends on the per-record gzip compression.
[10:17] Is this archived somewhere? http://torcache.net/
[10:33] i think this needs to be backed up: ftp://ftp.download.packardbell.com/
[10:34] it has manuals and drivers, packardbell or hp stuff
[10:52] godane: on it
[10:54] the NATO FTP is still downloading... at 30 KiB/s now
[10:54] nice
[10:54] packard bell is similarly 90s-bound
[11:09] godane: do you also try eMule to grab stuff?
[11:10] in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
[11:20] Do you know any examples off the top of your head?
[11:24] chazchaz: examples of what?
[11:38] 03:10:23 <@Nemo_bis> in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
[11:38] countries, I'd wager
[11:42] Nemo_bis: i did try to find the techtv music wars special on emule
[11:43] but turns out that the server was called razor
[11:44] it was raided in feb 2006
[11:51] chronomex: I'm sure of Italy and Spain, for instance
[11:51] godane: eMule has been serverless for ages, it uses KAD
[11:52] i'm on amule since i run linux
[11:52] so?
[11:52] i have searched techtv and i'm still not finding it
[11:53] I have to use eMule on wine because I don't have a public IP, Fastweb uses NAT
[11:53] and only MorphXT has decent support for it
[11:54] godane: KAD needs some time
[11:55] you can try and add to downloads other techtv things and find more noes
[11:55] *nodes
[11:56] that said, it might be the wrong thing to search there, dunno
[12:37] http://www.introni.it/marzaglia.html
[12:38] godane: What exactly from the official xbox mag are you archiving?
[12:44] hiker2: everything here: http://www.oxmonline.com/secretstash
[12:48] godane: Are those the full issues?
[12:48] chronomex: interested in archiving these?
http://www.tubebooks.org/technical_books_online.htm
[12:48] wow. Why do they offer them for free? Most sites would charge for them.
[12:50] godane: Do you delete the archives from your computer after you upload them to IA?
[12:54] hiker2: no
[12:54] What do you do with them?
[12:54] i mostly burn them to bluray when i need space
[12:54] wow. you are serious about archiving!
[12:55] living with dialup made me serious about archiving
[12:55] I had dialup as well.. But you don't have it anymore I assume
[13:56] Nemo_bis: do you know how to get better search results in emule?
[14:13] godane: you have to know the nodes closer to those who have that stuff
[14:13] when you've been downloading/uploading some things for a while, you're more likely to find similar things
[14:14] you also have to try all possible combinations and orders for your keywords in KAD searches, because they're a bit silly
[14:15] if your first keyword is full of noise, subsequent keywords usually will not help narrow the search
[14:15] but if it's too specific you may not find anything
[14:16] of course it's better if you have a "high id", which needs a public ip, properly configured firewall etc.
[15:27] Nemo_bis: Examples of niche things one might only find on eMule.
[15:36] chazchaz: every network/service/community might have things you can find nowhere else
[15:38] I know, I was just curious about examples for eMule
[15:59] there are tons of trails on future publishing ftp
[15:59] *game trailers
[17:19] chazchaz: I can't provide examples, one just has to try... and the results also depend on "where" one is, I suspect, nor do I have a good setup to have all possible search results
[18:25] I've grabbed some shareware cds off emule
[18:26] as Nemo_bis says it's more popular with italians so e.g. I found https://archive.org/details/cdrom-hackers-magazine-57
[18:30] Nice. :)
[18:30] If you keep it in the shared files, you'll later find more similar stuff.
[18:32] That's the only ISO I see, too.
I'm downloading a couple PDFs though
[18:32] yeah lots of ebooks
[18:41] also all this stuff came from packs on emule https://archive.org/details/firearmsmanuals https://archive.org/details/manuals-apple https://archive.org/details/printer-manuals https://archive.org/details/yamaha_bike_manuals
[19:04] I see a bunch of photoshop magazine CDs, grabbing those
[19:41] Are there any WARC guis?
[20:03] This guy claims to have scanned 10 000 magazines: http://www.blogdopicco.blogspot.com/
[20:03] And he has uploaded only a tiny fraction of them.
[20:04] People like to exaggerate their claims.
[22:30] <_obscure_> Mr Sketch: I found a site you might like, it's an attempt at an archive of the Wisconsin punk rock scene band recordings from 1970-2000. It's a really interesting thing and it's right up your alley. http://www.mkepunk.com/
[23:24] I believe my WarcMiddleware is sufficiently advanced that it could be used to archive websites now: https://github.com/iramari/WarcMiddleware
[23:25] I have successfully used it to do so at least.
[23:45] nice
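
Note on the [04:25] question (converting a warc to a warc.gz): the replies say a warc.gz compresses each record separately, headers included, rather than gzipping the file as a whole. Below is a minimal Python sketch of that idea, not the tool anyone in the channel used. The function names are made up for illustration, and the sketch assumes a well-formed, uncompressed WARC in which every record header carries a Content-Length field.

```python
import gzip


def iter_warc_records(stream):
    """Yield each raw WARC record (header + block + trailing CRLFs) as bytes.

    Hypothetical helper: assumes an uncompressed WARC read in binary mode.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length")
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("truncated record block")
        yield b"".join(header) + block + b"\r\n\r\n"


def recompress_per_record(src, dst):
    """Write every record from src to dst as its own gzip member.

    Returns the number of records written.
    """
    count = 0
    for record in iter_warc_records(src):
        # gzip.compress() emits one complete gzip member per call, so the
        # output is a plain concatenation of independent members.
        dst.write(gzip.compress(record))
        count += 1
    return count
```

Assumed usage, with placeholder file names: `with open("site.warc", "rb") as src, open("site.warc.gz", "wb") as dst: recompress_per_record(src, dst)`. Because each record is compressed on its own, a reader can seek to a record's offset and decompress only that record, which fits the comments above that ingestion tools depend on per-record gzip. The cost is the weaker overall compression ratio mentioned at [05:14].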