| Time |
Nickname |
Message |
|
00:20
🔗
|
SmileyG |
urgh, my cousin in law was once on a telly program called crazy cottage |
|
00:20
🔗
|
SmileyG |
need to try and see if i can find it ¬_¬ |
|
01:53
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-1998-2004-mirror |
|
02:03
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2005-mirror |
|
02:05
🔗
|
godane |
1998 to 2004 is not much bigger than the full 2005 article mirror |
|
02:37
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2006-mirror |
|
05:55
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2007-mirror |
|
05:55
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2008-mirror |
|
08:03
🔗
|
hiker1 |
Hi. What is the easiest way to access .warc file contents on Windows? |
|
08:06
🔗
|
SketchCow |
Never ask if you should archive something. Archive it and ask if any of us assholes want a copy |
|
08:06
🔗
|
SketchCow |
and then keep it yourself |
|
08:08
🔗
|
Coderjoe |
SketchCow: want a copy of ftp.cavedog.com? uncompressed tarball is 1.6GB, xz-compressed is 1GB |
|
08:09
🔗
|
SketchCow |
Duh |
|
08:10
🔗
|
SketchCow |
What was it? |
|
08:10
🔗
|
Coderjoe |
the game developer that made games like Total Annihilation |
|
08:12
🔗
|
Coderjoe |
(that is, Total Annihilation, and another similar RTS with a more medieval theme) |
|
08:13
🔗
|
SketchCow |
Approved. |
|
08:13
🔗
|
Coderjoe |
it also includes updates for their parent company, Humongous Entertainment |
|
08:13
🔗
|
SketchCow |
Do you have a place for me to download from or do I need to give you a slot? |
|
08:15
🔗
|
Coderjoe |
I can set up a download in a moment |
|
08:20
🔗
|
godane |
uploaded: http://archive.org/details/arstechnica.com-articles-2009-mirror |
|
08:25
🔗
|
nova |
archiving feels so good |
|
08:25
🔗
|
nova |
especially when the original disappears |
|
08:26
🔗
|
hiker1 |
Can anyone help me access the contents of a .warc file on Windows? |
|
08:43
🔗
|
ersi |
Coderjoe: Oooh, I want that as well |
|
08:48
🔗
|
Coderjoe |
SketchCow: rsync path sent via PM |
|
08:48
🔗
|
Coderjoe |
ersi: I only have so much upstream bandwidth :-\ |
|
08:49
🔗
|
Coderjoe |
looks like I probably did the last mirroring pass in 11/2005 |
|
09:02
🔗
|
godane |
so i'm starting to do image grabs for each of my arstechnica dumps |
|
09:09
🔗
|
hiker1 |
godane: What programs are you using for the archival process? |
|
09:14
🔗
|
godane |
wget |
|
09:16
🔗
|
hiker1 |
thanks |
|
09:17
🔗
|
ersi |
wget-1.14 (the latest) has support for writing to WARC files, thanks to alard |
|
09:18
🔗
|
hiker1 |
I'm still trying to get stuff out of the WARC files. |
|
09:19
🔗
|
ersi |
Hmm, there's warc2zip, that might help you since you're on windows - hold on a moment |
|
09:20
🔗
|
ersi |
hiker1: http://warctozip.archive.org/ |
|
09:20
🔗
|
hiker1 |
What are non-Windows users using? |
|
09:22
🔗
|
hiker1 |
ersi: That website requires you to upload the entire warc file to the server. In some cases that's hundreds and hundreds of MB. |
|
09:22
🔗
|
Coderjoe |
you know what would be really crazy? implementing that warctozip using javascript, so you don't actually need to upload the file to a server |
|
09:23
🔗
|
hiker1 |
But isn't there a reason stuff is stored using warc instead of zip? |
|
09:23
🔗
|
ersi |
I never really need to open up WARCs; when I do, I just `less` or `zless` them and read them straight. But there's a bunch of tools, like warc-tools from hanzoarchives (i.e. tef) |
|
09:24
🔗
|
hiker1 |
How else do you read the contents of archived websites? |
|
09:24
🔗
|
ersi |
https://github.com/tef/warctools |
|
09:24
🔗
|
ersi |
The reason for WARC is metadata. You'll know from WHERE and WHEN the data was downloaded, because you have the HTTP headers for both the request and the response |
|
09:24
🔗
|
Coderjoe |
there is, for archives. the warc includes metadata like the original URL, request headers, response headers, date and time of the request, etc |
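To make the metadata point concrete, here is a minimal Python sketch that lists the WARC header fields of each record using only the standard library. It assumes the usual WARC layout (a `WARC/…` version line, `Name: value` fields, a blank line, then `Content-Length` bytes of payload); the function names and the sample record are illustrative, not part of any tool mentioned in this log.

```python
import gzip
import io


def open_warc(path):
    """Open a WARC file for reading, transparently handling .warc.gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_warc_headers(stream):
    """Yield the WARC header fields of each record as a dict."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the payload so the next iteration starts at the next record.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


# Hand-written sample record (an assumption for the demo, not real capture data).
PAYLOAD = b"HTTP/1.1 200 OK\r\n\r\n"
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2012-12-04T09:24:00Z\r\n"
    b"Content-Length: " + str(len(PAYLOAD)).encode() + b"\r\n"
    b"\r\n" + PAYLOAD + b"\r\n\r\n"
)

if __name__ == "__main__":
    for record in iter_warc_headers(io.BytesIO(SAMPLE)):
        print(record.get("WARC-Type"), record.get("WARC-Target-URI"),
              record.get("WARC-Date"))
```

For a real file, `iter_warc_headers(open_warc("some.warc.gz"))` should walk the records the same way, since Python's `gzip` module reads the per-record compressed members as one continuous stream.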
|
09:25
🔗
|
* |
ersi nods |
|
09:25
🔗
|
ersi |
The most common interface to actually view WARCs is, the Internet Archive Wayback Machine. But you can't use that for your own WARCs though ;) |
|
09:25
🔗
|
Coderjoe |
if you just need to get files out of it, converting it to zip is fine (provided you don't delete the warc) |
|
09:25
🔗
|
Coderjoe |
well, there is the open-source wayback codebase |
|
09:25
🔗
|
ersi |
true, but it's a pain in the ass to set up |
|
09:25
🔗
|
Coderjoe |
iirc, yipdw had an instance of that set up |
|
09:26
🔗
|
hiker1 |
I would have thought there would be a program which hosts the warc file on a web server, or directly explores the contents without requiring a conversion. |
|
09:26
🔗
|
hiker1 |
I'm trying out IA's warc library for python https://github.com/internetarchive/warc |
|
09:26
🔗
|
Coderjoe |
that's the wayback |
|
09:27
🔗
|
ersi |
IMO warc-tools from tef is better than IA warc |
|
09:27
🔗
|
SketchCow |
Coderjoe: Absorbing your cavedog as we speak. |
|
09:27
🔗
|
ersi |
Om nom nom |
|
09:27
🔗
|
hiker1 |
Can you use the wayback machine to view warc's that are uploaded to IA? |
|
09:27
🔗
|
SketchCow |
After I get this, ersi, it'll be on archive.org in seconds. No worries. |
|
09:28
🔗
|
ersi |
SketchCow: I'll nom on it then |
|
09:28
🔗
|
Coderjoe |
I noticed (but only because I was watching the log) |
|
09:28
🔗
|
Coderjoe |
mmm |
|
09:28
🔗
|
Coderjoe |
traffic shaping and prioritizing really takes the pain away |
|
09:40
🔗
|
hiker1 |
So is wget outputting to warc the preferred method for archiving sites? Not HTTrack? |
|
09:40
🔗
|
Coderjoe |
depends |
|
09:41
🔗
|
Coderjoe |
though I don't know about httrack |
|
09:41
🔗
|
Coderjoe |
IA has a tool called Heritrix for their normal crawls |
|
09:42
🔗
|
Coderjoe |
we use wget here because we can make it ignore robots.txt. and with the lua scripting, we can specialize things for each site. |
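A rough sketch of what such a wget invocation might look like, built as an argument list in Python. The flags shown are standard wget 1.14 options; the helper name, URL, and WARC base name are placeholders, and the Lua scripting mentioned above is not covered here.

```python
import shlex


def build_wget_warc_command(url, warc_name):
    """Return an argv list for a WARC-writing wget mirror of `url`."""
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS, and scripts
        "-e", "robots=off",          # do not honor robots.txt
        "--warc-file=" + warc_name,  # write the crawl to <warc_name>.warc.gz
        "--warc-cdx",                # also write a CDX index of the records
        url,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is downloaded.
    print(shlex.join(build_wget_warc_command("http://example.com/", "example")))
```

Passing the list to `subprocess.run` would start the crawl; printing it first makes it easy to review the options before pointing it at a real site.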
|
09:42
🔗
|
chronomex |
yup |
|
09:58
🔗
|
alard |
Or try viewing the warc with this: https://github.com/alard/warc-proxy :) |
|
09:59
🔗
|
hiker1 |
That looks a lot closer to what I want |
|
10:00
🔗
|
ersi |
Oops, thought about writing that one out as well :) |
|
10:00
🔗
|
hiker1 |
alard: Why does it use a proxy instead of just running a web server? |
|
10:02
🔗
|
alard |
hiker1: The thing *is* the proxy. That's the easiest way to do it -- from a technical perspective, that is -- since you don't have to rewrite any urls. |
|
10:02
🔗
|
alard |
The wayback web interface has to replace the URLs in every web page it serves. The warc-proxy addon just configures its little web server as a proxy, and it's done. |
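The proxy idea can be illustrated with a small Python sketch. This is not warc-proxy's code: the `ARCHIVE` dictionary is a hypothetical stand-in for responses pulled out of a WARC file, and all names are invented for the example. The point it shows is that a client configured to use a proxy sends the full original URL in its request line, so archived pages can be looked up by that URL and served with their links untouched.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for responses extracted from a WARC file,
# keyed by the original absolute URL.
ARCHIVE = {
    "http://example.com/": b"<a href='/about'>About</a>",
    "http://example.com/about": b"Archived about page",
}


class ArchiveProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # In proxy mode the request line carries the full URL, so
        # self.path is already the archived page's original address.
        body = ARCHIVE.get(self.path)
        if body is None:
            self.send_error(404, "Not in archive")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_through_proxy(url, proxy_port):
    """Fetch `url` via the local proxy instead of the live site."""
    proxy = urllib.request.ProxyHandler(
        {"http": "http://127.0.0.1:%d" % proxy_port})
    opener = urllib.request.build_opener(proxy)
    with opener.open(url, timeout=5) as response:
        return response.read()


def demo():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), ArchiveProxyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_through_proxy("http://example.com/about",
                                   server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

A rewriting front end like the wayback interface would instead have to change the `/about` link inside the first page to point back at itself, which is the extra work the proxy approach avoids.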
|
10:03
🔗
|
hiker1 |
well, the nice thing about rewriting is then you can serve the files to other people through the web. |
|
10:04
🔗
|
hiker1 |
with the proxy method, only a local user can access them, unless you make the proxy public which would not be easy for most users to access |
|
10:04
🔗
|
ersi |
the non-nice thing is that it's a pain in the ass |
|
10:04
🔗
|
alard |
Yes, but that's not what this tool is for. If you want to do that there's the wayback tool. |
|
10:04
🔗
|
hiker1 |
What is the wayback tool? |
|
10:04
🔗
|
ersi |
Wayback Machine |
|
10:04
🔗
|
hiker1 |
but that won't serve private warc files |
|
10:04
🔗
|
alard |
https://github.com/internetarchive/wayback |
|
10:04
🔗
|
ersi |
https://github.com/internetarchive/wayback |
|
10:04
🔗
|
ersi |
damn it |
|
10:04
🔗
|
alard |
Heh. |
|
10:05
🔗
|
alard |
But as you can see it's much harder to get that running than the warc-proxy + firefox addon. |
|
10:05
🔗
|
hiker1 |
does warcproxy just grab whatever .warc files it sees? |
|
10:06
🔗
|
hiker1 |
ah, nvm, it has a neat interface! |
|
10:07
🔗
|
hiker1 |
wow, this is really impressive work |
|
10:13
🔗
|
ersi |
+1 alard |
|
10:16
🔗
|
norbert79 |
alard: Holy-moly, this goes to my favourites |
|
10:22
🔗
|
godane |
alard: the urls in menu for warc-proxy don't work for me for some reason |
|
10:22
🔗
|
godane |
it doesn't take in the baseurl |
|
10:22
🔗
|
hiker1 |
The base url didn't work for me, but the other ones did |
|
10:23
🔗
|
godane |
so it will go to folder/file instead of example.com/folder/file or something like that |
|
10:23
🔗
|
godane |
and so it would error |
|
10:23
🔗
|
alard |
That's strange. |
|
10:24
🔗
|
godane |
also when testing my eff.org grab it would just go to real site |
|
10:24
🔗
|
alard |
(Whether the base url works depends on the contents of your warc file. If the base url isn't in there it won't be visible.) |
|
10:25
🔗
|
alard |
godane: Is that an https site? |
|
10:25
🔗
|
godane |
yes |
|
10:33
🔗
|
alard |
godane: The https doesn't work yet. For some reason those requests aren't proxied. I've added it to the list: https://github.com/alard/warc-proxy/issues/2 |
|
10:44
🔗
|
ats |
is there an Internet Archive IRC channel somewhere, or is this the best bet? |
|
10:45
🔗
|
ersi |
#internetarchive is the unofficial/semi-official channel |
|
10:45
🔗
|
ats |
cheers :) |
|
10:45
🔗
|
ersi |
mostly just to get IA shizzle out of this channel :) |
|
10:46
🔗
|
chronomex |
yes, same people here and there mostly |
|
11:38
🔗
|
SketchCow |
More hugs here |
|
11:41
🔗
|
SketchCow |
Hey, someone's using the warrior, it spent 45 minutes on "setting up data partition". |
|
11:41
🔗
|
SketchCow |
And he stopped it. |
|
11:41
🔗
|
SketchCow |
Any ideas? |
|
11:42
🔗
|
ersi |
scrap and start it again? |
|
12:12
🔗
|
SmileyG |
did you give it like a 10tb partition for /data? |
|
12:24
🔗
|
tuabkiet |
10TB??? |
|
12:48
🔗
|
hiker1 |
tuabkiet: You don't have 10 TB of RAID space lying around? |
|
12:49
🔗
|
tuabkiet |
I don't use RAID, and my hard disk is 10 times smaller |
|
12:53
🔗
|
hiker1 |
How do I get wget 1.14? |
|
12:58
🔗
|
ersi |
hiker1: It's not in many repositories. You'll probably have to compile it yourself |
|
12:59
🔗
|
hiker1 |
damn. I'm downloading Linux Mint Debian Edition which uses Debian Testing. I hope it's in there... Is there a compile guide by ArchiveTeam? |
|
13:00
🔗
|
ersi |
No, but I can probably help |
|
13:00
🔗
|
hiker1 |
how long are you going to be on? I'm still downloading the Mint dvd. |
|
13:01
🔗
|
ersi |
debian testing has wget 1.13.4-3 |
|
13:01
🔗
|
hiker1 |
How did you find that out? |
|
13:01
🔗
|
hiker1 |
I was looking for a package listing but couldn't find one |
|
13:01
🔗
|
ersi |
debian sid has wget 1.14 |
|
13:01
🔗
|
ersi |
http://packages.debian.org bro |
|
13:01
🔗
|
hiker1 |
they hid it on their packages subdomain! those sneaky... |
|
13:02
🔗
|
ersi |
you can probably install that .deb and everything will be fine |
|
13:02
🔗
|
hiker1 |
I think there was an aptosid... |
|
13:03
🔗
|
ersi |
you can probably just dpkg -i the .deb if you're inclined |
|
13:03
🔗
|
hiker1 |
ersi: Do you use a linux distro? if so, which? |
|
13:04
🔗
|
ersi |
Ubuntu, Red Hat Enterprise Server, Gentoo, crappy version of SuSE and I've used Debian |
|
13:04
🔗
|
hiker1 |
oh. |
|
13:04
🔗
|
hiker1 |
no mint? |
|
13:04
🔗
|
ersi |
nope. But it's just another Debian derivative |
|
13:06
🔗
|
SketchCow |
http://archive.org/details/ftp_cavedog.com now up |
|
13:07
🔗
|
hiker1 |
Where are archives of known dead sites kept? |
|
13:07
🔗
|
hiker1 |
I only saw the just in time captures |
|
13:09
🔗
|
hiker1 |
SketchCow: Any chance you could post a file listing along with the FTP Snapshot? It would be nice to know what I'm getting before grabbing 1.5 GB. |
|
13:10
🔗
|
ersi |
Most are up on archive.org |
|
13:10
🔗
|
ersi |
SketchCow: thx~ |
|
13:11
🔗
|
hiker1 |
Does http://archive.org/details/archiveteam-fire include known dead sites? |
|
13:11
🔗
|
alard |
hiker1: http://archive.org/download/ftp_cavedog.com/ftp.cavedog.com.tar/ |
|
13:12
🔗
|
alard |
(a slash at the end of the .tar usually gives you an index) |
|
13:12
🔗
|
hiker1 |
alard: oh, wow, that is handy. Thank you. |
|
13:12
🔗
|
SketchCow |
Also, you should trust me |
|
13:12
🔗
|
SketchCow |
Everything I upload is awesome |
|
13:12
🔗
|
hiker1 |
hah |
|
13:13
🔗
|
ersi |
Indeed |
|
13:26
🔗
|
hiker1 |
I downloaded a forum about 3 years ago. The place is gone now. IA has some of the forum archived, but I'm pretty sure my archive has everything. Can I distribute it through ArchiveTeam? |
|
13:28
🔗
|
hiker1 |
The forum had a few thousand posts. It was the official forum for a video game called Lord of the Rings Online TCG. The whole archive is only 11 MB. |
|
13:43
🔗
|
tuabkiet |
hiker1: Up it to Internet Archive NOW! |
|
13:43
🔗
|
hiker1 |
I am not sure how |
|
13:44
🔗
|
ersi |
Create an account first and foremost |
|
16:44
🔗
|
SketchCow |
Bagger 288! Bagger 288! |
|
16:46
🔗
|
soultcer |
SketchCow: Did you find the two Dailybooth warc files I asked for? |
|
16:47
🔗
|
schbiridi |
the tracker thing, e.g. as used at http://tracker.archiveteam.org/webshots/, could use a "Wanna join? http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior" link |
|
16:48
🔗
|
SketchCow |
Agreed on wanna join. |
|
16:48
🔗
|
SketchCow |
soultcer: No, I've been working on my presentation. |
|
16:48
🔗
|
SketchCow |
E-mail me. jason@textfiles.com. |
|
16:48
🔗
|
soultcer |
Will do |
|
19:00
🔗
|
alard |
Has someone saved the http://blog.webshots.com/ ? |
|
23:27
🔗
|
Nemo_bis |
slowly redoing wikia dumps mirror: https://archive.org/details/wikia_dump_20121204 |
|
23:28
🔗
|
Nemo_bis |
now 5704 wikis beginning with "a" vs. 872 in previous snapshot |
|
23:29
🔗
|
Nemo_bis |
still, looks like dumps are not generated for 80% of the wikis they have, even if requested |
|
23:39
🔗
|
alard |
--------------------------------------------------------------------------- |
|
23:39
🔗
|
alard |
Hi all. Webshots is done. 109 TB saved by 134 downloaders. Thanks! |
|
23:39
🔗
|
alard |
It's available on the projects tab of your warrior. |
|
23:39
🔗
|
alard |
Next station: DailyBooth.com, closing at the end of the year. |
|
23:39
🔗
|
alard |
If you want to run it yourself: https://github.com/ArchiveTeam/dailybooth-grab |
|
23:39
🔗
|
alard |
(All very similar to WebShots and previous projects.) |
|
23:39
🔗
|
alard |
Join #dailybooth for more detailed discussions. |
|
23:39
🔗
|
alard |
--------------------------------------------------------------------------- |