#archiveteam 2013-01-06,Sun

Time Nickname Message
04:25 🔗 hiker2 Is there an easy way to convert a warc to a warc.gz?
04:29 🔗 no2pencil you want to compress it with gzip?
04:40 🔗 hiker2 I think warc.gz technically compresses the individual blocks, and is not simply a compressed file.
04:45 🔗 GLaDOS Correct.
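
A minimal sketch of the per-record conversion asked about above, assuming an uncompressed WARC in which every record carries a Content-Length header. The names `iter_warc_records` and `warc_to_warc_gz` are hypothetical and do not come from any existing tool. Each record is written as its own gzip member, so the concatenated output is still a valid gzip stream.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each raw WARC record (headers + block + trailing CRLF CRLF) as bytes."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if line in (b"\r\n", b"\n"):
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:20])
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("truncated record block")
        yield b"".join(header) + block + b"\r\n\r\n"


def warc_to_warc_gz(src, dst):
    """Write every record from src to dst as a separate gzip member."""
    count = 0
    for record in iter_warc_records(src):
        dst.write(gzip.compress(record))
        count += 1
    return count


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names): python convert.py in.warc out.warc.gz
    with open(sys.argv[1], "rb") as src, open(sys.argv[2], "wb") as dst:
        print(warc_to_warc_gz(src, dst), "records written")
```

Running plain `gzip` over the whole file would instead produce a single member, which is the difference being discussed here.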
04:57 🔗 hiker2 Are mirrors classified as panic grabs even if there isn't any real worry the site will go down?
05:03 🔗 godane hiker2: i'm always doing panic grabs of sites that are not going down
05:04 🔗 hiker2 Where do I upload them to?
05:04 🔗 godane i normally upload my stuff to texts
05:05 🔗 godane when it's warc.gz dumps
05:11 🔗 hiker2 Wouldn't the files compress a lot better if you compressed them after saving them to e.g. a 7z file?
05:13 🔗 DFJustin yes but then the wayback machine's software wouldn't be able to ingest them directly
05:13 🔗 hiker2 Is anyone actually using the wayback machine software to view warcs besides IA?
05:14 🔗 GLaDOS Remember: warc.gz isn't just a compressed warc file
05:14 🔗 GLaDOS It compresses the chunks, but not the headers and hashes
05:14 🔗 GLaDOS or something
05:14 🔗 hiker2 and that's why its compression suffers
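
A rough way to see the trade-off described above is to compress the same records both ways and compare sizes. This is only a sketch on made-up data: `make_records` and `compare_sizes` are hypothetical helpers, the fake records are simplified, and `lzma.compress` writes an .xz stream rather than a real 7z container (it stands in for the LZMA family that 7z uses by default).

```python
import gzip
import lzma


def make_records(n):
    """Build n small, highly similar fake records (illustrative data only)."""
    template = (
        "WARC/1.0\r\nWARC-Type: response\r\n"
        "WARC-Target-URI: http://example.com/page/{i}\r\n\r\n"
        "<html><head><title>Page {i}</title></head>"
        "<body><p>The same boilerplate appears on every page.</p></body></html>"
        "\r\n\r\n"
    )
    return [template.format(i=i).encode("ascii") for i in range(n)]


def compare_sizes(records):
    """Return byte counts for per-record gzip versus whole-file compression."""
    raw = b"".join(records)
    return {
        "raw": len(raw),
        # warc.gz style: every record is its own gzip member
        "per_record_gzip": sum(len(gzip.compress(r)) for r in records),
        # the whole file compressed as one stream
        "whole_gzip": len(gzip.compress(raw)),
        "whole_lzma": len(lzma.compress(raw)),
    }


if __name__ == "__main__":
    for name, size in compare_sizes(make_records(500)).items():
        print(name, size)
```

Per-record compression cannot reuse redundancy across records and pays the gzip header and trailer on every record, which is why many small, similar records compress much better as one stream.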
05:14 🔗 DFJustin there's the IA partners under the Archive-It umbrella http://archive-it.org/
05:14 🔗 hiker2 GLaDOS: It appears to compress the entire Record, including headers.
05:15 🔗 hiker2 DFJustin: but no one here uses it. And the WARCs generated here are not being used with the wayback machine.
05:15 🔗 DFJustin that second point is incorrect
05:16 🔗 DFJustin jason has worked with the IA guys to ingest everything we've done into the new beta wayback machine
05:16 🔗 DFJustin except for the very latest new grabs
05:16 🔗 hiker2 Can the beta be accessed from anywhere?
05:16 🔗 DFJustin yes
05:16 🔗 DFJustin http://wayback-beta.archive.org/
05:17 🔗 hiker2 DFJustin: could you give me an example of a site that #archiveteam saved and is now available through the beta?
05:18 🔗 DFJustin look at Nov/Dec here http://web-beta.archive.org/web/20110701000000*/http://www.splinder.com/
05:18 🔗 DFJustin the regular wayback machine was doing spotty crawls but the huge spike is us
05:19 🔗 hiker2 Do they load for you?
05:19 🔗 hiker2 ah, it loaded
05:20 🔗 hiker2 neat. I didn't realize this stuff was being pooled together somewhere
05:20 🔗 hiker2 How quickly are sites added to the machine after being grabbed on here?
05:21 🔗 DFJustin updating the wayback machine has so far been a manual process done irregularly at multi-month intervals, I think the plan with the new one is to do it more often but I don't know the details
05:26 🔗 DFJustin here's the spreadsheet jason was using, blue means go for wayback ingestion https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
05:28 🔗 DFJustin so not actually everything because mobileme is too freaking huge for right now
06:03 🔗 godane there must have been older pdfs on computerpoweruser.com
06:06 🔗 godane nevermind
06:06 🔗 godane it was a dead link
06:23 🔗 hiker2 If the wayback machine already has a good archive, should I bother archiving a site?
06:23 🔗 godane i'm grabbing ftp://ftp.futurenet.com
06:23 🔗 godane hiker2: i say archive it again
06:23 🔗 godane sometimes wayback machine can't get stuff cause of robots.txt
06:24 🔗 hiker2 some of the waybackmachine grabs don't properly archive external images either
06:25 🔗 hiker2 The first real spider I wrote now grabs all the urls in a sitemap.xml. Seems to work well for blogspot sites, so you can just download the sitemap and feed it into the spider.
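
The sitemap approach can be sketched as follows. This is not hiker2's spider, only an illustration of pulling the URL list out of a sitemap.xml that has already been downloaded; `urls_from_sitemap` is a hypothetical name, and the code assumes the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org-style sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_bytes):
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text and loc.text.strip()
    ]


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python sitemap_urls.py sitemap.xml > urls.txt
    with open(sys.argv[1], "rb") as fh:
        for url in urls_from_sitemap(fh.read()):
            print(url)
```

The resulting list can then be fed to whatever does the fetching. Note that a sitemap index file also uses `<loc>`, in which case the entries are further sitemaps rather than pages.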
06:45 🔗 hiker2 godane: How can I tell which uploads are yours?
06:59 🔗 godane hiker2: https://archive.org/search.php?query=uploader%3A%22slaxemulator%40gmail.com%22
07:00 🔗 hiker2 Do individual items not show who uploaded them?
07:36 🔗 DFJustin they do but you have to look at the meta.xml
07:37 🔗 DFJustin unless you're a collection admin
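
The search link posted above is just a percent-encoded `uploader:"..."` query. A small sketch of building such a link for any uploader address follows; `uploader_search_url` is a hypothetical helper and the address in the usage example is a placeholder.

```python
from urllib.parse import urlencode


def uploader_search_url(email):
    """Build an archive.org search URL for items uploaded by the given address."""
    query = 'uploader:"{}"'.format(email)
    # urlencode percent-encodes the colon, quotes and @ sign.
    return "https://archive.org/search.php?" + urlencode({"query": query})


if __name__ == "__main__":
    print(uploader_search_url("someone@example.com"))
```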
07:37 🔗 Lord_Nigh i'm grabbing www.polymicrosystems.com/files/ but am NOT using warc... not even sure HOW to use warc
07:38 🔗 DFJustin http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
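
For orientation, a sketch of the kind of wget invocation the linked page is about, assuming a wget build with WARC support (1.14 or later). The helper `wget_warc_command` and the output name are made up for illustration, and the exact option set is only one reasonable choice, not the wiki's recommendation.

```python
import shlex


def wget_warc_command(url, warc_name):
    """Return an argv list that mirrors url and records the traffic to a WARC."""
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--no-parent",                # stay below the starting directory
        "--page-requisites",          # also fetch images, CSS and similar files
        "--warc-file=" + warc_name,   # wget appends .warc.gz to this name
        "--warc-cdx",                 # also write a CDX index of the records
        url,
    ]


if __name__ == "__main__":
    cmd = wget_warc_command(
        "http://www.polymicrosystems.com/files/", "polymicrosystems-files"
    )
    # Print the command for review; to run it, use subprocess.run(cmd, check=True).
    print(" ".join(shlex.quote(part) for part in cmd))
```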
08:09 🔗 godane i'm now uploading the official xbox magazine web archive
08:23 🔗 godane looks like the wayback machine has all of the xbox podcasts from dl.oxmonline.com
08:41 🔗 lemonkey LilyLivingstone: 5 megabyte hard drive from 1956, being loaded via forklift onto plane. http://t.co/Cop9kR0l
09:43 🔗 godane is anyone mirroring g4tv videos?
10:17 🔗 alard hiker2: The warcproxy also depends on the per-record gzip compression.
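
Why replay tools depend on per-record compression can be sketched like this: when each record is its own gzip member, a reader that knows a record's byte offset can decompress just that record without touching the rest of the file. The helpers `write_members` and `read_record_at` are hypothetical and work on in-memory bytes; this is not how warcproxy or the Wayback Machine are implemented.

```python
import gzip
import zlib


def write_members(records):
    """Concatenate records as separate gzip members and note each start offset."""
    out = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(out))
        out += gzip.compress(record)
    return bytes(out), offsets


def read_record_at(data, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; the bytes of any
    # following members are left in decomp.unused_data.
    return decomp.decompress(data[offset:])


if __name__ == "__main__":
    data, offsets = write_members([b"first", b"second", b"third"])
    print(offsets, read_record_at(data, offsets[1]))
```

With a single gzip stream (or a 7z archive) over the whole file, reaching the same record would mean decompressing everything before it.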
10:17 🔗 Nemo_bis Is this archived somewhere? http://torcache.net/
10:33 🔗 godane i think this needs to be backed up: ftp://ftp.download.packardbell.com/
10:34 🔗 godane it has manuals and drivers for packardbell or hp stuff
10:52 🔗 chronomex godane: on it
10:54 🔗 Nemo_bis the NATO FTP is still downloading... at 30 KiB/s now
10:54 🔗 chronomex nice
10:54 🔗 chronomex packard bell is similarly 90s-bound
11:09 🔗 Nemo_bis godane: do you also try eMule to grab stuff?
11:10 🔗 Nemo_bis in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
11:20 🔗 chazchaz Do you know any examples off the top of your head?
11:24 🔗 Nemo_bis chazchaz: examples of what?
11:38 🔗 chronomex 03:10:23 <@Nemo_bis> in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems
11:38 🔗 chronomex countries, I'd wager
11:42 🔗 godane Nemo_bis: i did try to find the techtv music wars special on emule
11:43 🔗 godane but it turns out that the server it was on was called razor
11:44 🔗 godane it was raided in feb 2006
11:51 🔗 Nemo_bis chronomex: I'm sure of Italy and Spain, for instance
11:51 🔗 Nemo_bis godane: eMule has been serverless for ages, it uses KAD
11:52 🔗 godane i'm on amule since i run linux
11:52 🔗 Nemo_bis so?
11:52 🔗 godane i have searched techtv and i'm still not finding it
11:53 🔗 Nemo_bis I have to use eMule on wine because I don't have a public IP, Fastweb uses NAT
11:53 🔗 Nemo_bis and only MorphXT has decent support for it
11:54 🔗 Nemo_bis godane: KAD needs some time
11:55 🔗 Nemo_bis you can try and add to downloads other techtv things and find more noes
11:55 🔗 Nemo_bis *nodes
11:56 🔗 Nemo_bis that said, it might be the wrong thing to search there, dunno
12:37 🔗 Nemo_bis http://www.introni.it/marzaglia.html
12:38 🔗 hiker2 godane: What exactly from the official xbox mag are you archiving?
12:44 🔗 godane hiker2: everything here: http://www.oxmonline.com/secretstash
12:48 🔗 hiker2 godane: Are those the full issues?
12:48 🔗 Nemo_bis chronomex: interested in archiving these? http://www.tubebooks.org/technical_books_online.htm
12:48 🔗 hiker2 wow. Why do they offer them for free? Most sites would charge for them.
12:50 🔗 hiker2 godane: Do you delete the archives from your computer after you upload them to IA?
12:54 🔗 godane hiker2: no
12:54 🔗 hiker2 What do you do with them?
12:54 🔗 godane i mostly burn them to bluray when i need space
12:54 🔗 hiker2 wow. you are serious about archiving!
12:55 🔗 godane living with dialup made me serious about archiving
12:55 🔗 hiker2 I had dialup as well.. But you don't have it anymore I assume
13:56 🔗 godane Nemo_bis: do you know how to get better search results in emule?
14:13 🔗 Nemo_bis godane: you have to know the nodes closer to those who have that stuff
14:13 🔗 Nemo_bis when you've been downloading/uploading some things for a while, you're more likely to find similar things
14:14 🔗 Nemo_bis you also have to try all possible combinations and orders for your keywords in KAD searches, because they're a bit silly
14:15 🔗 Nemo_bis if your first keyword is full of noise, subsequent keywords usually will not help narrowing the search
14:15 🔗 Nemo_bis but if it's too specific you may not find anything
14:16 🔗 Nemo_bis of course it's better if you have a "high id", which needs a public ip, properly configured firewall etc.
15:27 🔗 chazchaz Nemo_bis: Examples of niche things one might only find on eMule.
15:36 🔗 schbiridi chazchaz: every network/service/community might have things you can find nowhere else
15:38 🔗 chazchaz I know, I was just curious about examples for eMule
15:59 🔗 godane there are tons of trails on future publishing ftp
15:59 🔗 godane *game trailers
17:19 🔗 Nemo_bis chazchaz: I can't provide examples, one just has to try... and the results also depend on "where" one is, I suspect, nor do I have a good setup to have all possible search results
18:25 🔗 DFJustin I've grabbed some shareware cds off emule
18:26 🔗 DFJustin as Nemo_bis says it's more popular with italians so e.g. I found https://archive.org/details/cdrom-hackers-magazine-57
18:30 🔗 Nemo_bis Nice. :)
18:30 🔗 Nemo_bis If you keep it in the shared files, you'll later find more similar stuff.
18:32 🔗 Nemo_bis That's the only ISO I see, too. I'm downloading a couple PDFs though
18:32 🔗 DFJustin yeah lots of ebooks
18:41 🔗 DFJustin also all this stuff came from packs on emule https://archive.org/details/firearmsmanuals https://archive.org/details/manuals-apple https://archive.org/details/printer-manuals https://archive.org/details/yamaha_bike_manuals
19:04 🔗 DFJustin I see a bunch of photoshop magazine CDs, grabbing those
19:41 🔗 hiker1 Are there any WARC guis?
20:03 🔗 Nemo_bis This guy claims to have scanned 10 000 magazines: http://www.blogdopicco.blogspot.com/
20:03 🔗 Nemo_bis And he has uploaded only a tiny fraction of them.
20:04 🔗 hiker1 People like to exaggerate their claims.
22:30 🔗 _obscure_ Mr Sketch: I found a site you might like, it's an attempt at an archive of the Wisconsin punk rock scene band recordings from the 1970-2000. It's a really interesting thing and it's right up your alley. http://www.mkepunk.com/
23:24 🔗 hiker1 I believe my WarcMiddleware is sufficiently advanced that it could be used to archive websites now: https://github.com/iramari/WarcMiddleware
23:25 🔗 hiker1 I have successfully used it to do so at least.
23:45 🔗 chronomex nice
