Time |
Nickname |
Message |
00:20
🔗
|
SmileyG |
urgh, my cousin in law was once on a telly program called crazy cottage |
00:20
🔗
|
SmileyG |
need to try and see if i can find it ¬_¬ |
01:53
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror |
02:03
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror |
02:05
🔗
|
godane |
1998 to 2004 is not much bigger than the full 2005 article mirror |
02:37
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror |
05:55
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror |
05:55
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror |
08:03
🔗
|
hiker1 |
Hi. What is the easiest way to access .warc file contents on Windows? |
08:06
🔗
|
SketchCow |
Never ask if you should archive something. Archive it and ask if any of us assholes want a copy |
08:06
🔗
|
SketchCow |
and then keep it yourself |
08:08
🔗
|
Coderjoe |
SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB |
08:09
🔗
|
SketchCow |
Duh |
08:10
🔗
|
SketchCow |
What was it? |
08:10
🔗
|
Coderjoe |
the game developer that made games like Total Annihilation |
08:12
🔗
|
Coderjoe |
(that is, Total Annihilation, and another similar RTS with a more medieval theme) |
08:13
🔗
|
SketchCow |
Approved. |
08:13
🔗
|
Coderjoe |
it also includes updates for their parent company, Humongous Entertainment |
08:13
🔗
|
SketchCow |
Do you have a place for me to download from or do I need to give you a slot? |
08:15
🔗
|
Coderjoe |
I can set up a download in a moment |
08:20
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror |
08:25
🔗
|
nova |
archiving feels so good |
08:25
🔗
|
nova |
especially when the original disappears |
08:26
🔗
|
hiker1 |
Can anyone help me access the contents of a .warc file on Windows? |
08:43
🔗
|
ersi |
Coderjoe: Oooh, I want that as well |
08:48
🔗
|
Coderjoe |
SketchCow: rsync path sent via PM |
08:48
🔗
|
Coderjoe |
ersi: I only have so much upstream bandwidth :-\ |
08:49
🔗
|
Coderjoe |
looks like I probably did the last mirroring pass in 11/2005 |
09:02
🔗
|
godane |
so i'm starting to do image grabs for each of my arstechnica dumps |
09:09
🔗
|
hiker1 |
godane: What programs are you using for the archival process? |
09:14
🔗
|
godane |
wget |
09:16
🔗
|
hiker1 |
thanks |
09:17
🔗
|
ersi |
wget-1.14 (the latest) has support for writing to WARC files, thanks to alard |
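
For readers following along, here is a minimal sketch of how a WARC-writing wget crawl might be assembled, assuming wget 1.14 or newer is on the PATH. The helper name `build_wget_warc_command` and the example URL and file name are hypothetical; the flags themselves (`--mirror`, `--page-requisites`, `--warc-file`, `--warc-cdx`) are standard wget options. The function only builds the command line and does not run it.

```python
import shlex


def build_wget_warc_command(url, warc_name):
    """Return an argv list for a wget crawl that also writes a WARC file.

    Hypothetical helper for illustration; requires wget >= 1.14 to run.
    """
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS, and scripts
        "--warc-file=" + warc_name,  # wget adds the .warc.gz extension itself
        "--warc-cdx",                # also write a CDX index next to the WARC
        url,
    ]


if __name__ == "__main__":
    argv = build_wget_warc_command("http://example.com/", "example.com-mirror")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` if you actually want to start the crawl.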
09:18
🔗
|
hiker1 |
I'm still trying to get stuff out of the WARC files. |
09:19
🔗
|
ersi |
Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment |
09:20
🔗
|
ersi |
hiker1: http://warctozip.archive.org/ |
09:20
🔗
|
hiker1 |
What are non-Windows users using? |
09:22
🔗
|
hiker1 |
ersi: That website requires you upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB. |
09:22
🔗
|
Coderjoe |
you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server |
09:23
🔗
|
hiker1 |
But isn't there a reason stuff is stored using warc instead of zip? |
09:23
🔗
|
ersi |
I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (ie tef) |
09:24
🔗
|
hiker1 |
How else do you read the contents of archived websites? |
09:24
🔗
|
ersi |
https://github.com/tef/warctools |
09:24
🔗
|
ersi |
The reason for WARC is this: metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response |
09:24
🔗
|
Coderjoe |
there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc |
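
To make the metadata point concrete, here is a rough sketch that reads WARC records with nothing but the Python standard library. It is a simplified illustration and not a replacement for warctools or the IA warc library: the helpers `make_record` and `read_records` are hypothetical, the sample record is synthetic, and the parser assumes uncompressed input with one `Name: value` header per line.

```python
import io


def make_record(uri, date, http_payload):
    """Build one synthetic, uncompressed WARC response record (illustration only)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLF pairs after the content block.
    return header + http_payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (warc_headers, content_block) for each record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.strip():
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


if __name__ == "__main__":
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
    data = make_record("http://example.com/", "2012-12-04T09:24:00Z", payload)
    for headers, body in read_records(io.BytesIO(data)):
        # WHERE and WHEN the capture happened, plus the size of the HTTP response.
        print(headers["WARC-Target-URI"], headers["WARC-Date"], len(body))
```

For a `.warc.gz` file, wrapping the file in `gzip.open(path, "rb")` before passing it to `read_records` should work the same way, since the content block still carries the full HTTP response headers followed by the page body.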
09:25
🔗
|
* |
ersi nods |
09:25
🔗
|
ersi |
The most common interface to actually view WARCs is, the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;) |
09:25
🔗
|
Coderjoe |
if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc) |
09:25
🔗
|
Coderjoe |
well, there is the open-source wayback codebase |
09:25
🔗
|
ersi |
true, but it's a pain in the ass to set up |
09:25
🔗
|
Coderjoe |
iirc, yipdw had an instance of that set up |
09:26
🔗
|
hiker1 |
I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion. |
09:26
🔗
|
hiker1 |
I'm trying out IA's warc library for python https://github.com/internetarchive/warc |
09:26
🔗
|
Coderjoe |
that's the wayback |
09:27
🔗
|
ersi |
IMO warc-tools from tef is better than IA warc |
09:27
🔗
|
SketchCow |
Coderjoe: Absorbing your cavedog as we speak. |
09:27
🔗
|
ersi |
Om nom nom |
09:27
🔗
|
hiker1 |
Can you use the wayback machine to view warc's that are uploaded to IA? |
09:27
🔗
|
SketchCow |
After I get this, ersi, it'll be on archive.org in seconds. No worries. |
09:28
🔗
|
ersi |
SketchCow: I'll nom on it then |
09:28
🔗
|
Coderjoe |
I noticed (but only because I was watching the log) |
09:28
🔗
|
Coderjoe |
mmm |
09:28
🔗
|
Coderjoe |
traffic shaping and prioritizing really takes the pain away |
09:40
🔗
|
hiker1 |
So is wget outputting to warc the preferred method for archiving sites? Not HTTrack? |
09:40
🔗
|
Coderjoe |
depends |
09:41
🔗
|
Coderjoe |
though I don't know about httrack |
09:41
🔗
|
Coderjoe |
IA has a tool called Heritrix for their normal crawls |
09:42
🔗
|
Coderjoe |
we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site. |
09:42
🔗
|
chronomex |
yup |
09:58
🔗
|
alard |
Or try viewing the warc with this: https://github.com/alard/warc-proxy :) |
09:59
🔗
|
hiker1 |
That looks a lot closer to what I want |
10:00
🔗
|
ersi |
Oops, thought about writing that one out as well :) |
10:00
🔗
|
hiker1 |
alard: Why does it use a proxy instead of just running a web server? |
10:02
🔗
|
alard |
hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls. |
10:02
🔗
|
alard |
The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done. |
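
The difference between the two approaches can be sketched in a few lines. This is a toy model, not how warc-proxy or the wayback code is actually written: the `ARCHIVE` dictionary, both function names, and the `http://localhost:8080/web/` prefix are made up for illustration, and a real rewriter also has to handle relative links, CSS, and JavaScript.

```python
import re

# Toy archive keyed by the original URL, standing in for records in a WARC.
ARCHIVE = {
    "http://example.com/": '<a href="http://example.com/about">About</a>',
    "http://example.com/about": "<p>About page</p>",
}


def serve_as_proxy(requested_url):
    """Proxy style: the browser asks for the original absolute URL,
    so the stored page can be returned byte-for-byte."""
    return ARCHIVE.get(requested_url)


def serve_with_rewriting(requested_url, prefix="http://localhost:8080/web/"):
    """Rewriting style: absolute links are pointed back at the replay server
    so that the next click stays inside the archive."""
    page = ARCHIVE.get(requested_url)
    if page is None:
        return None
    return re.sub(
        r'(href|src)="(https?://[^"]+)"',
        lambda m: f'{m.group(1)}="{prefix}{m.group(2)}"',
        page,
    )


if __name__ == "__main__":
    print(serve_as_proxy("http://example.com/"))
    print(serve_with_rewriting("http://example.com/"))
```

With the proxy approach the archived page is untouched; with rewriting every served page has to be modified first, which is the extra work being described here.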
10:03
🔗
|
hiker1 |
well, the nice thing about rewriting is then you can serve the files to other people through the web. |
10:04
🔗
|
hiker1 |
with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access |
10:04
🔗
|
ersi |
the non-nice thing is that it's a pain in the ass |
10:04
🔗
|
alard |
Yes, but that's not what this tool is for. If you want to do that there's the wayback tool. |
10:04
🔗
|
hiker1 |
What is the wayback tool? |
10:04
🔗
|
ersi |
Wayback Machine |
10:04
🔗
|
hiker1 |
but that won't serve private warc files |
10:04
🔗
|
alard |
https://github.com/internetarchive/wayback |
10:04
🔗
|
ersi |
https://github.com/internetarchive/wayback |
10:04
🔗
|
ersi |
damn it |
10:04
🔗
|
alard |
Heh. |
10:05
🔗
|
alard |
But as you can see it's much harder to get that running than the warc-proxy + firefox addon. |
10:05
🔗
|
hiker1 |
does warcproxy just grab whatever .warc files it sees? |
10:06
🔗
|
hiker1 |
ah, nvm, it has a neat interface! |
10:07
🔗
|
hiker1 |
wow, this is really impressive work |
10:13
🔗
|
ersi |
+1 alard |
10:16
🔗
|
norbert79 |
alard: Holy-moly, this goes to my favourites |
10:22
🔗
|
godane |
alard: the urls in menu for warc-proxy don't work for me for some reason |
10:22
🔗
|
godane |
it doesn't take in the baseurl |
10:22
🔗
|
hiker1 |
The base url didn't work for me, but the other ones did |
10:23
🔗
|
godane |
so it will go to folder/file instead of example.com/folder/file or something like that |
10:23
🔗
|
godane |
and so it would error |
10:23
🔗
|
alard |
That's strange. |
10:24
🔗
|
godane |
also when testing my eff.org grab it would just go to real site |
10:24
🔗
|
alard |
(Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.) |
10:25
🔗
|
alard |
godane: Is that an https site? |
10:25
🔗
|
godane |
yes |
10:33
🔗
|
alard |
godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2 |
10:44
🔗
|
ats |
is there an Internet Archive IRC channel somewhere, or is this the best bet? |
10:45
🔗
|
ersi |
#internetarchive unofficial/semi-official channel |
10:45
🔗
|
ats |
cheers :) |
10:45
🔗
|
ersi |
mostly just to get IA shizzle out of this channel :) |
10:46
🔗
|
chronomex |
yes, same people here and there mostly |
11:38
🔗
|
SketchCow |
More hugs here |
11:41
🔗
|
SketchCow |
Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition". |
11:41
🔗
|
SketchCow |
And he stopped it. |
11:41
🔗
|
SketchCow |
Any ideas? |
11:42
🔗
|
ersi |
scrap and start it again? |
12:12
🔗
|
SmileyG |
did you give it like a 10tb partition for /data? |
12:24
🔗
|
tuabkiet |
10TB??? |
12:48
🔗
|
hiker1 |
tuabkiet: You don't have 10 TB of RAID space lying around? |
12:49
🔗
|
tuabkiet |
I don't use RAID, and my hard disk is 10 times smaller |
12:53
🔗
|
hiker1 |
How do I get wget 1.14? |
12:58
🔗
|
ersi |
hiker1: It's not in many repositories. You'll probably have to compile it yourself |
12:59
🔗
|
hiker1 |
damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam? |
13:00
🔗
|
ersi |
No, but I can probably help |
13:00
🔗
|
hiker1 |
how long are you going to be on? I'm still downloading the Mint dvd. |
13:01
🔗
|
ersi |
debian testing has wget 1.13.4-3 |
13:01
🔗
|
hiker1 |
How did you find that out? |
13:01
🔗
|
hiker1 |
I was looking for a package listing but couldn't find one |
13:01
🔗
|
ersi |
debian sid has wget 1.14 |
13:01
🔗
|
ersi |
http://packages.debian.org bro |
13:01
🔗
|
hiker1 |
they hid it on their packages subdomain! those sneaky... |
13:02
🔗
|
ersi |
you can probably install that .deb and everything will be fine |
13:02
🔗
|
hiker1 |
I think there was an aptosid... |
13:03
🔗
|
ersi |
you can probably just dpkg -i the .deb if you're inclined |
13:03
🔗
|
hiker1 |
ersi: Do you use a linux distro? if so, which? |
13:04
🔗
|
ersi |
Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian |
13:04
🔗
|
hiker1 |
oh. |
13:04
🔗
|
hiker1 |
no mint? |
13:04
🔗
|
ersi |
nope. But it's just another Debian derivative |
13:06
🔗
|
SketchCow |
http://archive.org/details/ftp_cavedog.com now up |
13:07
🔗
|
hiker1 |
Where are archives of known dead sites kept? |
13:07
🔗
|
hiker1 |
I only saw the just in time captures |
13:09
🔗
|
hiker1 |
SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB. |
13:10
🔗
|
ersi |
Most are up on archive.org |
13:10
🔗
|
ersi |
SketchCow: thx~ |
13:11
🔗
|
hiker1 |
Does http://archive.org/details/archiveteam-fire include known dead sites? |
13:11
🔗
|
alard |
hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/ |
13:12
🔗
|
alard |
(a slash at the end of the .tar usually gives you an index) |
13:12
🔗
|
hiker1 |
alard: oh, wow, that is handy. Thank you. |
13:12
🔗
|
SketchCow |
Also, you should trust me |
13:12
🔗
|
SketchCow |
Everything I upload is awesome |
13:12
🔗
|
hiker1 |
hah |
13:13
🔗
|
ersi |
Indeed |
13:26
🔗
|
hiker1 |
I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam? |
13:28
🔗
|
hiker1 |
The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB. |
13:43
🔗
|
tuabkiet |
hiker1: Up it to Internet Archive NOW! |
13:43
🔗
|
hiker1 |
I am not sure how |
13:44
🔗
|
ersi |
Create an account first and foremost |
16:44
🔗
|
SketchCow |
Bagger 288! Bagger 288! |
16:46
🔗
|
soultcer |
SketchCow: Did you find the two Dailybooth warc files I asked for? |
16:47
🔗
|
schbiridi |
the tracker thing, e.g. as used at http://tracker.archiveteam.org/webshots/, could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link |
16:48
🔗
|
SketchCow |
Agreed on wanna join. |
16:48
🔗
|
SketchCow |
soultcer: No, I've been working on my presentation. |
16:48
🔗
|
SketchCow |
E-mail me. jason@textfiles.com. |
16:48
🔗
|
soultcer |
Will do |
19:00
🔗
|
alard |
Has someone saved the http://blog.webshots.com/ ? |
23:27
🔗
|
Nemo_bis |
slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204 |
23:28
🔗
|
Nemo_bis |
now 5704 wikis beginning with "a" vs. 872 in previous snapshot |
23:29
🔗
|
Nemo_bis |
still, looks like dumps are not generated for 80 % of wikis they have even if requested |
23:39
🔗
|
alard |
--------------------------------------------------------------------------- |
23:39
🔗
|
alard |
Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks! |
23:39
🔗
|
alard |
It's available on the projects tab of your warrior. |
23:39
🔗
|
alard |
Next station: DailyBooth.com, closing at the end of the year. |
23:39
🔗
|
alard |
If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab |
23:39
🔗
|
alard |
(All very similar to WebShots and previous projects.) |
23:39
🔗
|
alard |
Join #dailybooth for more detailed discussions. |
23:39
🔗
|
alard |
--------------------------------------------------------------------------- |