Time | Nickname | Message
00:37 | yipdw | there's some funny stuff on amkon.net
00:37 | yipdw | e.g. http://amkon.net/showthread.php?54867-As-usual-Calvin-overrules-that-wimpy-communist-sympathiser-Christ
00:55 | dashcloud | DFJustin: funny you should ask- I've mostly de-duped all of the lists except for the Bing one here: http://paste.archivingyoursh.it/paqorogimi.avrasm
00:56 | dashcloud | I've taken out the obvious duplications (exact or nearly exact except for trailing slashes or extra characters that couldn't be part of the address)
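A minimal sketch of that kind of de-duplication, assuming plain text lists with one URL per line; the exact rules dashcloud applied aren't given here, so the normalization below (drop stray trailing punctuation, then trailing slashes) is illustrative only:

```python
def normalize(url):
    """Collapse near-exact duplicates: drop whitespace, trailing characters
    that can't be part of the address (quotes, brackets, punctuation picked
    up from surrounding text), and trailing slashes."""
    return url.strip().rstrip("'\">),.]").rstrip("/")

def dedupe(list_files):
    seen, kept = set(), []
    for path in list_files:          # e.g. ["msntv_urls.txt"] (hypothetical filename)
        with open(path) as f:
            for line in f:
                url = normalize(line)
                if url and url not in seen:
                    seen.add(url)
                    kept.append(url)
    return kept
```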
00:56 | SketchCow | Do we need to move forward on webtv?
00:57 | dashcloud | yep- if the archiveteam wiki is right, they close on the 30th of this month
01:00 | dashcloud | so, apparently WebTV/MSNTV hosted newsgroups at some point:
01:00 | dashcloud | http://news.webtv.net/charters/alt.discuss.webtvkillaz.html
01:00 | dashcloud | http://news.webtv.net/club-charters/alt.discuss.clubs.public.seniors.misc.ivyhalls.html
01:28 | godane | SketchCow: I'm starting to mirror amkon.net forums
01:37 | SketchCow | OK. We have someone else doing it too.
01:37 | SketchCow | I am all for two grabs.
01:37 | yipdw | me too, especially if the bot fucks out for some unforeseen reason
01:49 | godane | so how do you bypass a read error to make it continue?
01:49 | godane | *wget continue
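No answer is recorded here; for reference, these are the standard wget retry options usually reached for when a grab dies on read errors. The URL and the Python wrapper are placeholders, and whether these flags cover godane's particular failure isn't clear from the context:

```python
import subprocess

# Keep retrying instead of giving up on the first read error, and resume
# any partially-downloaded files rather than starting over.
subprocess.run([
    "wget",
    "--continue",           # resume partial downloads
    "--tries=10",           # retry each URL up to 10 times
    "--waitretry=5",        # back off (up to 5s) between retries
    "--read-timeout=30",    # treat a stalled read as a failure and retry
    "--retry-connrefused",  # retry even when the connection is refused
    "http://example.com/big-file",   # placeholder URL
], check=False)
```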
01:51 | SketchCow | root@teamarchive0:/1/FRIENDSTER# du -sh .
01:51 | SketchCow | 5.4M .
01:51 | SketchCow | OH THANK GOD
01:52 | BlueMax | ?
01:53 | SketchCow | I've been uploading many many 100gb+ Friendster Grabs
01:55 | dashcloud | so you're just about done with Friendster uploads now?
01:55 | SketchCow | Maybe.
01:55 | SketchCow | Probably.
01:55 | SketchCow | 920G HACKERCONS
01:55 | SketchCow | That's a new one
01:59 | dashcloud | so, I'm downloading the list of MSNTV/WebTV urls I posted earlier now- if someone else wants to get that list or the Bing list, that would be great (also, is there a general purpose script for making WARCs of these kinds of downloads? someone in here made the one that I'm using currently)
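The script dashcloud is using isn't shown here; a minimal sketch of the usual approach is to hand the URL list to wget's native WARC support. The filenames, wait time, and the Python wrapper are assumptions:

```python
import subprocess

# urls.txt: one URL per line (hypothetically, the de-duplicated MSNTV/WebTV list).
# wget writes every request/response pair into msntv.warc.gz plus a CDX index,
# alongside its normal on-disk mirror.
subprocess.run([
    "wget",
    "--input-file=urls.txt",
    "--warc-file=msntv",          # produces msntv.warc.gz
    "--warc-cdx",                 # also write an index of the WARC contents
    "--page-requisites",          # pull in the images/CSS each page needs
    "--wait=1", "--random-wait",  # be gentle with a server that's on its way out
    "--no-verbose",
], check=False)
```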
02:49 | dashcloud | so, a more specialized tool that doesn't mirror the same item hundreds of times because you have example.com/user/ and example.com/user/index.html would be nice
02:51 | SketchCow | Agreed.
02:51 | SketchCow | Over time, it'll be nice to improve the archivebot to do the right things.
02:51 | SketchCow | Save us time.
02:52 | SketchCow | I want us out of the business of team members being tied up with "shitbag.com is going down, 200 URLs and 45 image files"
03:00 | yipdw | dashcloud: those aren't the same in general
03:02 | yipdw | dashcloud: or did you mean a specialized tool for the cases where they are the same
03:06 | dashcloud | probably the second
03:07 | dashcloud | can't think of any cases really where you've got a pile of URLs and you won't be duplicating downloads (if you could crawl the whole site easily, you wouldn't be bothering with piles of urls- just one).
03:08 | yipdw | yeah, that happens a lot
03:08 | yipdw | actually, that happened in the patch.com grab
03:08 | yipdw | (which I need to shut down before it keeps costing me money, heh)
03:08 | yipdw | but there, I'm sure that the propwash junction patch advert was grabbed 100,000 times
03:09 | yipdw | it *is* possible to avoid that with e.g. wget-lua and a central URL database
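The core of that idea, stripped of the wget-lua plumbing: every worker checks a shared set of already-seen URLs before fetching, so a widely-linked asset is only grabbed once. The Redis backend and key name are assumptions for illustration:

```python
import redis  # assumption: one shared Redis instance acts as the central URL database

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

def should_fetch(url):
    """Atomically record the URL and report whether it was new.
    SADD returns 1 only the first time a member is added, so each
    normalized URL gets fetched at most once across all workers."""
    return r.sadd("seen-urls", url.rstrip("/")) == 1

# In a wget-lua setup this check would sit in the Lua hook that decides
# whether to queue a discovered URL; a plain fetch loop can call it directly.
if should_fetch("http://example.com/ads/banner.gif"):  # hypothetical URL
    print("fetch it")
else:
    print("already grabbed elsewhere, skip")
```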
03:09 | dashcloud | actually, I can think of one instance where you would be in that instance- if you have a pre-existing list of URLs that are from many different sites, you could have case 1 without case 2
03:10 | dashcloud | like the 505 unbelievably stupid web p@ges book I have sitting over my desk- a list of 500 urls, and little to no duplicates there
03:12 | dashcloud | SketchCow: is there some way IA can be queried by someone for every url from a certain domain? (like if we ever had to do angelfire or tripod) (not for MSNTV, but in general)
03:12 | omf_ | use the cdx search
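The CDX search omf_ points at is the Wayback Machine's CDX API; a small example of asking it for every capture under one domain (the domain is a placeholder, and large domains need the API's paging options rather than a single request like this):

```python
import json
import urllib.request

# Query the Wayback CDX API for everything captured under a domain.
# matchType=domain includes subdomains; collapse=urlkey keeps one row per unique URL.
query = (
    "http://web.archive.org/cdx/search/cdx"
    "?url=example.com"         # placeholder domain
    "&matchType=domain"
    "&collapse=urlkey"
    "&fl=timestamp,original"
    "&output=json"
)

with urllib.request.urlopen(query) as resp:
    rows = json.load(resp)

# With output=json the first row is the list of field names.
for timestamp, original in rows[1:]:
    print(timestamp, original)
```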
03:15 | yipdw | I wonder how much memory you'd need to store all of the wayback machine's URLs in memory
03:15 | yipdw | assuming zero overhead for object representations, etc.
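A back-of-the-envelope version of that question; both numbers below are made-up round figures, not Wayback Machine statistics from the conversation:

```python
# Assume 4e11 captured URLs at ~80 bytes each, with the zero object
# overhead yipdw stipulates.
urls = 4e11
bytes_per_url = 80
print(urls * bytes_per_url / 1e12, "TB of raw URL text")  # -> 32.0 TB
```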
03:28 | SketchCow | Bi,]
03:28 | SketchCow | Ahem.
03:28 | SketchCow | We're terrible with that.
04:59 | DFJustin | dashcloud: it is possible and underscor has done it for us before
06:09 | chfoo | i've created http://archiveteam.org/index.php?title=Yahoo!_Blog . i can't find the vietnam archives and or archives of the full shutdown notices.
20:36 | SketchCow | I run the site http://www.shakodo.com/ , a free resource/forum/QA site
20:36 | SketchCow | for photographers to learn about the business of photography and
20:36 | SketchCow | especially pricing for photographers.
20:36 | SketchCow | There are some (ha!) priceless information on the site, with no
20:36 | SketchCow | advertisements etc. But unfortunately I have to shut the site
20:36 | SketchCow | down from December 8th, 2013 and it would be great if you
20:36 | SketchCow | could crawl and archive the entire site before this date.
20:36 | SketchCow | There are no pictures on it, so storage requirements are minimal.
20:36 | SketchCow | It would be fantastic if you could do that or somehow update the
20:36 | SketchCow | schedule of your Waybackmachine crawler, so that it gets a few
20:36 | SketchCow | more snapshots before it closes.
20:36 | SketchCow | ...
20:36 | SketchCow | and THAT is how you shut down
20:45 | ersi | nice!
20:45 | ersi | or well, shame! but nicely done!
21:13 | mistym | That is fantastic. More site operators need to be like that
21:40 | BiggieJon | only thing better would be a site owner offering to send a hard drive with the last full backup of the site the day it closes
22:30 | godane | Shakodo will shut down on December 8th, 2013, to the day 3 years after it was launched. The content on this site is invaluable, so I will search for a place which will archive it in perpetuity, so that future generations of photographers also can look back at the history of pricing information for photographers.
23:39 | robbiet4- | hi all
23:39 | robbiet4- | for some reason, i own the ArchiveTeam org on GitHub still
23:39 | robbiet4- | who wants it? :p
23:39 | robbiet4- | the email addresses need to be changed too
23:40 | robbiet4- | also, i retract my previous statement, it seems I am one of the owners. don't want to leave until another email is in there though. still trying to get bukkit notifications to stop