Time |
Nickname |
Message |
04:25
🔗
|
hiker2 |
Is there an easy way to convert a warc to a warc.gz? |
04:29
🔗
|
no2pencil |
you want to compress it with gzip? |
04:40
🔗
|
hiker2 |
I think warc.gz technically compresses the individual blocks, and is not simply a compressed file. |
04:45
🔗
|
GLaDOS |
Correct. |
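A rough sketch of the per-record conversion hiker2 asked about, assuming an uncompressed WARC in which every record carries a Content-Length header. Each record is gzipped as its own member and the members are concatenated, which is what separates a warc.gz from a plain gzip of the whole file. The names iter_warc_records and warc_to_warc_gz are made up for illustration; this is not how any particular tool does it.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield each raw WARC record (headers + block + CRLF CRLF) as bytes."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:20])
        header = [line]
        length = None
        while True:
            line = stream.readline()
            if not line:
                raise ValueError("truncated record header")
            header.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length")
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("truncated record block")
        yield b"".join(header) + block + b"\r\n\r\n"


def warc_to_warc_gz(src, dst):
    """Write every record of src to dst as its own gzip member."""
    count = 0
    for record in iter_warc_records(src):
        dst.write(gzip.compress(record))
        count += 1
    return count
```

Usage would be something like opening both files in binary mode and calling warc_to_warc_gz(open("site.warc", "rb"), open("site.warc.gz", "wb")); the file names are placeholders.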
04:57
🔗
|
hiker2 |
Are mirrors classified as panic grabs even if there isn't any real worry the site will go down? |
05:03
🔗
|
godane |
hiker2: i'm always doing panic grabs of sites that are not going down |
05:04
🔗
|
hiker2 |
Where do I upload them to? |
05:04
🔗
|
godane |
i normally upload my stuff to texts |
05:05
🔗
|
godane |
when it's warc.gz dumps |
05:11
🔗
|
hiker2 |
Wouldn't the files compress a lot better if you compressed them after saving them to e.g. a 7z file? |
05:13
🔗
|
DFJustin |
yes but then the wayback machine's software wouldn't be able to ingest them directly |
05:13
🔗
|
hiker2 |
Is anyone actually using the wayback machine software to view warcs besides IA? |
05:14
🔗
|
GLaDOS |
Remember: warc.gz isn't just a compressed warc file |
05:14
🔗
|
GLaDOS |
It compresses the chunks, but not the headers and hashes |
05:14
🔗
|
GLaDOS |
or something |
05:14
🔗
|
hiker2 |
and that's why its compression suffers |
05:14
🔗
|
DFJustin |
there's the IA partners under the Archive-It umbrella http://archive-it.org/ |
05:14
🔗
|
hiker2 |
GLaDOS: It appears to compress the entire Record, including headers. |
05:15
🔗
|
hiker2 |
DFJustin: but no one here uses it. And the WARCs generated here are not being used with the wayback machine. |
05:15
🔗
|
DFJustin |
that second point is incorrect |
05:16
🔗
|
DFJustin |
jason has worked with the IA guys to ingest everything we've done into the new beta wayback machine |
05:16
🔗
|
DFJustin |
except for the very latest new grabs |
05:16
🔗
|
hiker2 |
Can the beta be accessed from anywhere? |
05:16
🔗
|
DFJustin |
yes |
05:16
🔗
|
DFJustin |
http://wayback-beta.archive.org/ |
05:17
🔗
|
hiker2 |
DFJustin: could you give me an example of a site that #archiveteam saved and is now available through the beta? |
05:18
🔗
|
DFJustin |
look at Nov/Dec here http://web-beta.archive.org/web/20110701000000*/http://www.splinder.com/ |
05:18
🔗
|
DFJustin |
the regular wayback machine was doing spotty crawls but the huge spike is us |
05:19
🔗
|
hiker2 |
Do they load for you? |
05:19
🔗
|
hiker2 |
ah, it loaded |
05:20
🔗
|
hiker2 |
neat. I didn't realize this stuff was being pooled together somewhere |
05:20
🔗
|
hiker2 |
How quickly are sites added to the machine after being grabbed on here? |
05:21
🔗
|
DFJustin |
updating the wayback machine has so far been a manual process done irregularly at multi-month intervals, I think the plan with the new one is to do it more often but I don't know the details |
05:26
🔗
|
DFJustin |
here's the spreadsheet jason was using, blue means go for wayback ingestion https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0 |
05:28
🔗
|
DFJustin |
so not actually everything because mobileme is too freaking huge for right now |
06:03
🔗
|
godane |
there must have been older pdfs on computerpoweruser.com |
06:06
🔗
|
godane |
nevermind |
06:06
🔗
|
godane |
it was a dead link |
06:23
🔗
|
hiker2 |
If the wayback machine already has a good archive, should I bother archiving a site? |
06:23
🔗
|
godane |
i'm grabbing ftp://ftp.futurenet.com |
06:23
🔗
|
godane |
hiker2: i say archive it again |
06:23
🔗
|
godane |
sometimes wayback machine can't get stuff cause of robots.txt |
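To see whether a robots.txt would block a given crawler, something like the following could be used. It is only a sketch built on Python's standard urllib.robotparser. The user agent string "ia_archiver" and the example.com URLs are assumptions for illustration, not something stated in this log.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that shuts out one crawler and nobody else.
example_robots = "User-agent: ia_archiver\nDisallow: /\n"
blocked = not allowed(example_robots, "ia_archiver", "http://example.com/page")
others_ok = allowed(example_robots, "Wget/1.14", "http://example.com/page")
```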
06:24
🔗
|
hiker2 |
some of the waybackmachine grabs don't properly archive external images either |
06:25
🔗
|
hiker2 |
The first real spider I wrote now grabs all the urls in a sitemap.xml. Seems to work well for blogspot sites, so you can just download the sitemap and feed it into the spider. |
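The sitemap step can be sketched in a few lines, assuming the sitemap follows the usual sitemaps.org layout with one loc element per URL. The function name sitemap_urls and the sample document are made up; this is not hiker2's spider. A sitemap index also uses loc, so the returned URLs may themselves point at further sitemaps.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_bytes):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sitemap used only to show the call.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b"  <url><loc>http://example.blogspot.com/2012/01/a.html</loc></url>\n"
    b"  <url><loc>http://example.blogspot.com/2012/02/b.html</loc></url>\n"
    b"</urlset>\n"
)
found = sitemap_urls(sample)
```

The resulting list can then be written out one URL per line and fed to whatever fetcher is in use.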
06:45
🔗
|
hiker2 |
godane: How can I tell which uploads are yours? |
06:59
🔗
|
godane |
hiker2: https://archive.org/search.php?query=uploader%3A%22slaxemulator%40gmail.com%22 |
07:00
🔗
|
hiker2 |
Do individual items not show who uploaded them? |
07:36
🔗
|
DFJustin |
they do but you have to look at the meta.xml |
07:37
🔗
|
DFJustin |
unless you're a collection admin |
07:37
🔗
|
Lord_Nigh |
i'm grabbing www.polymicrosystems.com/files/ but am NOT using warc... not even sure HOW to use warc |
07:38
🔗
|
DFJustin |
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output |
08:09
🔗
|
godane |
i'm now uploading the official xbox magazine web archive |
08:23
🔗
|
godane |
looks like the wayback machine has all of the xbox podcasts from dl.oxmonline.com |
08:41
🔗
|
lemonkey |
LilyLivingstone: 5 megabyte hard drive from 1956, being loaded via forklift onto plane. http://t.co/Cop9kR0l |
09:43
🔗
|
godane |
is anyone mirroring g4tv videos? |
10:17
🔗
|
alard |
hiker2: The warcproxy also depends on the per-record gzip compression. |
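One likely reason tools lean on per-record gzip: a reader that already knows a record's byte offset (for example from an index) can seek there and decompress just that one member instead of the whole file. A minimal sketch, assuming the offset is known in advance; read_member_at is a hypothetical helper and not part of warcproxy.

```python
import gzip
import io
import zlib


def read_member_at(fileobj, offset, chunk_size=4096):
    """Decompress the single gzip member that starts at offset."""
    fileobj.seek(offset)
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pieces = []
    while not decompressor.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise ValueError("truncated gzip member")
        pieces.append(decompressor.decompress(chunk))
    return b"".join(pieces)


# Two stand-in "records", each compressed as its own member.
first = gzip.compress(b"record one")
second = gzip.compress(b"record two")
archive = io.BytesIO(first + second)
```

With a whole-file gzip (or a 7z), the second record could not be reached without decompressing everything before it.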
10:17
🔗
|
Nemo_bis |
Is this archived somewhere? http://torcache.net/ |
10:33
🔗
|
godane |
i think this needs to be backed up: ftp://ftp.download.packardbell.com/ |
10:34
🔗
|
godane |
it has manuals and drivers for packardbell or hp stuff |
10:52
🔗
|
chronomex |
godane: on it |
10:54
🔗
|
Nemo_bis |
the NATO FTP is still downloading... at 30 KiB/s now |
10:54
🔗
|
chronomex |
nice |
10:54
🔗
|
chronomex |
packard bell is similarly 90s-bound |
11:09
🔗
|
Nemo_bis |
godane: do you also try eMule to grab stuff? |
11:10
🔗
|
Nemo_bis |
in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems |
11:20
🔗
|
chazchaz |
Do you know any examples off the top of your head? |
11:24
🔗
|
Nemo_bis |
chazchaz: examples of what? |
11:38
🔗
|
chronomex |
03:10:23 <@Nemo_bis> in some countries/languages/niches it's still widely used and some stuff never reaches torrents or other sharing systems |
11:38
🔗
|
chronomex |
countries, I'd wager |
11:42
🔗
|
godane |
Nemo_bis: i did try to find the techtv music wars special on emule |
11:43
🔗
|
godane |
but it turns out that the server was called razor |
11:44
🔗
|
godane |
it was raided in feb 2006 |
11:51
🔗
|
Nemo_bis |
chronomex: I'm sure of Italy and Spain, for instance |
11:51
🔗
|
Nemo_bis |
godane: eMule has been serverless for ages, it uses KAD |
11:52
🔗
|
godane |
i'm on amule since i run linux |
11:52
🔗
|
Nemo_bis |
so? |
11:52
🔗
|
godane |
i have searched techtv and i'm still not finding it |
11:53
🔗
|
Nemo_bis |
I have to use eMule on wine because I don't have a public IP, Fastweb uses NAT |
11:53
🔗
|
Nemo_bis |
and only MorphXT has decent support for it |
11:54
🔗
|
Nemo_bis |
godane: KAD needs some time |
11:55
🔗
|
Nemo_bis |
you can try and add to downloads other techtv things and find more noes |
11:55
🔗
|
Nemo_bis |
*nodes |
11:56
🔗
|
Nemo_bis |
that said, it might be the wrong thing to search there, dunno |
12:37
🔗
|
Nemo_bis |
http://www.introni.it/marzaglia.html |
12:38
🔗
|
hiker2 |
godane: What exactly from the official xbox mag are you archiving? |
12:44
🔗
|
godane |
hiker2: everything here: http://www.oxmonline.com/secretstash |
12:48
🔗
|
hiker2 |
godane: Are those the full issues? |
12:48
🔗
|
Nemo_bis |
chronomex: interested in archiving these? http://www.tubebooks.org/technical_books_online.htm |
12:48
🔗
|
hiker2 |
wow. Why do they offer them for free? Most sites would charge for them. |
12:50
🔗
|
hiker2 |
godane: Do you delete the archives from your computer after you upload them to IA? |
12:54
🔗
|
godane |
hiker2: no |
12:54
🔗
|
hiker2 |
What do you do with them? |
12:54
🔗
|
godane |
i mostly burn them to bluray when i need space |
12:54
🔗
|
hiker2 |
wow. you are serious about archiving! |
12:55
🔗
|
godane |
living with dialup made me serious about archiving |
12:55
🔗
|
hiker2 |
I had dialup as well.. But you don't have it anymore I assume |
13:56
🔗
|
godane |
Nemo_bis: do you know how to get better search results in emule? |
14:13
🔗
|
Nemo_bis |
godane: you have to know the nodes closer to those who have that stuff |
14:13
🔗
|
Nemo_bis |
when you've been downloading/uploading some things for a while, you're more likely to find similar things |
14:14
🔗
|
Nemo_bis |
you also have to try all possible combinations and orders for your keywords in KAD searches, because they're a bit silly |
14:15
🔗
|
Nemo_bis |
if your first keyword is full of noise, subsequent keywords usually will not help narrow the search |
14:15
🔗
|
Nemo_bis |
but if it's too specific you may not find anything |
14:16
🔗
|
Nemo_bis |
of course it's better if you have a "high id", which needs a public ip, properly configured firewall etc. |
15:27
🔗
|
chazchaz |
Nemo_bis: Examples of niche things one might only find on eMule. |
15:36
🔗
|
schbiridi |
chazchaz: every network/service/community might have things you can find nowhere else |
15:38
🔗
|
chazchaz |
I know, I was just curious about examples for eMule |
15:59
🔗
|
godane |
there are tons of trails on future publishing ftp |
15:59
🔗
|
godane |
*game trailers |
17:19
🔗
|
Nemo_bis |
chazchaz: I can't provide examples, one just has to try... and the results also depend on "where" one is, I suspect, nor do I have a good setup to have all possible search results |
18:25
🔗
|
DFJustin |
I've grabbed some shareware cds off emule |
18:26
🔗
|
DFJustin |
as Nemo_bis says it's more popular with italians so e.g. I found https://archive.org/details/cdrom-hackers-magazine-57 |
18:30
🔗
|
Nemo_bis |
Nice. :) |
18:30
🔗
|
Nemo_bis |
If you keep it in the shared files, you'll later find more similar stuff. |
18:32
🔗
|
Nemo_bis |
That's the only ISO I see, too. I'm downloading a couple PDFs though |
18:32
🔗
|
DFJustin |
yeah lots of ebooks |
18:41
🔗
|
DFJustin |
also all this stuff came from packs on emule https://archive.org/details/firearmsmanuals https://archive.org/details/manuals-apple https://archive.org/details/printer-manuals https://archive.org/details/yamaha_bike_manuals |
19:04
🔗
|
DFJustin |
I see a bunch of photoshop magazine CDs, grabbing those |
19:41
🔗
|
hiker1 |
Are there any WARC guis? |
20:03
🔗
|
Nemo_bis |
This guy claims to have scanned 10 000 magazines: http://www.blogdopicco.blogspot.com/ |
20:03
🔗
|
Nemo_bis |
And he has uploaded only a tiny fraction of them. |
20:04
🔗
|
hiker1 |
People like to exaggerate their claims. |
22:30
🔗
|
_obscure_ |
Mr Sketch: I found a site you might like, it's an attempt at an archive of the Wisconsin punk rock scene band recordings from the 1970-2000. It's a really interesting thing and it's right up your alley. http://www.mkepunk.com/ |
23:24
🔗
|
hiker1 |
I believe my WarcMiddleware is sufficiently advanced that it could be used to archive websites now: https://github.com/iramari/WarcMiddleware |
23:25
🔗
|
hiker1 |
I have successfully used it to do so at least. |
23:45
🔗
|
chronomex |
nice |