[00:37] there's some funny stuff on amkon.net
[00:37] e.g. http://amkon.net/showthread.php?54867-As-usual-Calvin-overrules-that-wimpy-communist-sympathiser-Christ
[00:55] DFJustin: funny you should ask- I've mostly de-duped all of the lists except for the Bing one here: http://paste.archivingyoursh.it/paqorogimi.avrasm
[00:56] I've taken out the obvious duplications (exact or nearly exact except for trailing slashes or extra characters that couldn't be part of the address)
[00:56] Do we need to move forward on webtv?
[00:57] yep- if the archiveteam wiki is right, they close on the 30th of this month
[01:00] so, apparently WebTV/MSNTV hosted newsgroups at some point:
[01:00] http://news.webtv.net/charters/alt.discuss.webtvkillaz.html
[01:00] http://news.webtv.net/club-charters/alt.discuss.clubs.public.seniors.misc.ivyhalls.html
[01:28] SketchCow: I'm starting to mirror amkon.net forums
[01:37] OK. We have someone else doing it too.
[01:37] I am all for two grabs.
[01:37] me too, especially if the bot fucks out for some unforeseen reason
[01:49] so how do you bypass a read error to make it continue?
[01:49] *wget continue
[01:51] root@teamarchive0:/1/FRIENDSTER# du -sh .
[01:51] 5.4M .
[01:51] OH THANK GOD
[01:52] ?
[01:53] I've been uploading many many 100gb+ Friendster Grabs
[01:55] so you're just about done with Friendster uploads now?
[01:55] Maybe.
[01:55] Probably.
[01:55] 920G HACKERCONS
[01:55] That's a new one
[01:59] so, I'm downloading the list of MSNTV/WebTV urls I posted earlier now- if someone else wants to get that list or the Bing list, that would be great (also, is there a general purpose script for making WARCs of these kinds of downloads? someone in here made the one that I'm using currently)
[02:49] so, a more specialized tool that doesn't mirror the same item hundreds of times because you have example.com/user/ and example.com/user/index.html would be nice
[02:51] Agreed.
[02:51] Over time, it'll be nice to improve the archivebot to do the right things.
[02:51] Save us time.
[02:52] I want us out of the business of team members being tied up with "shitbag.com is going down, 200 URLs and 45 image files"
[03:00] dashcloud: those aren't the same in general
[03:02] dashcloud: or did you mean a specialized tool for the cases where they are the same
[03:06] probably the second
[03:07] can't think of any cases really where you've got a pile of URLs and you won't be duplicating downloads (if you could crawl the whole site easily, you wouldn't be bothering with piles of urls- just one).
[03:08] yeah, that happens a lot
[03:08] actually, that happened in the patch.com grab
[03:08] (which I need to shut down before it keeps costing me money, heh)
[03:08] but there, I'm sure that the propwash junction patch advert was grabbed 100,000 times
[03:09] it *is* possible to avoid that with e.g. wget-lua and a central URL database
[03:09] actually, I can think of one instance where you would be in that situation- if you have a pre-existing list of URLs that are from many different sites, you could have case 1 without case 2
[03:10] like the 505 unbelievably stupid web p@ges book I have sitting over my desk- a list of 500 urls, and little to no duplicates there
[03:12] SketchCow: is there some way IA can be queried by someone for every url from a certain domain? (like if we ever had to do angelfire or tripod) (not for MSNTV, but in general)
[03:12] use the cdx search
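A minimal sketch of that kind of lookup, assuming the public Wayback CDX server endpoint and its documented url/matchType/fl/collapse/limit parameters; the function name and the tripod.com example are illustrative only, not an ArchiveTeam tool:

    # Sketch: list captured URLs for a whole domain via the Wayback Machine's
    # CDX server. Parameters follow the public wayback-cdx-server API; adjust
    # if IA's interface differs from this assumption.
    import urllib.parse
    import urllib.request

    def cdx_urls(domain, limit=1000):
        params = urllib.parse.urlencode({
            "url": domain,
            "matchType": "domain",   # match the host and all of its subdomains
            "fl": "original",        # return only the original-URL column
            "collapse": "urlkey",    # one row per unique URL
            "limit": str(limit),     # keep the response small while testing
        })
        query = "http://web.archive.org/cdx/search/cdx?" + params
        with urllib.request.urlopen(query) as resp:
            for line in resp:
                yield line.decode("utf-8", "replace").strip()

    if __name__ == "__main__":
        for url in cdx_urls("tripod.com"):
            print(url)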
[03:15] I wonder how much memory you'd need to store all of the wayback machine's URLs in memory
[03:15] assuming zero overhead for object representations, etc.
[03:28] Bi,]
[03:28] Ahem.
[03:28] We're terrible with that.
[04:59] dashcloud: it is possible and underscor has done it for us before
[06:09] i've created http://archiveteam.org/index.php?title=Yahoo!_Blog . i can't find the vietnam archives and/or archives of the full shutdown notices.
[20:36] I run the site http://www.shakodo.com/ , a free resource/forum/QA site
[20:36] for photographers to learn about the business of photography and
[20:36] especially pricing for photographers.
[20:36] There is some (ha!) priceless information on the site, with no
[20:36] advertisements etc. But unfortunately I have to shut the site
[20:36] down from December 8th, 2013 and it would be great if you
[20:36] could crawl and archive the entire site before this date.
[20:36] There are no pictures on it, so storage requirements are minimal.
[20:36] It would be fantastic if you could do that or somehow update the
[20:36] schedule of your Waybackmachine crawler, so that it gets a few
[20:36] more snapshots before it closes.
[20:36] ...
[20:36] and THAT is how you shut down
[20:45] nice!
[20:45] or well, shame! but nicely done!
[21:13] That is fantastic. More site operators need to be like that
[21:40] only thing better would be a site owner offering to send a hard drive with the last full backup of the site the day it closes
[22:30] Shakodo will shut down on December 8th, 2013, to the day 3 years after it was launched. The content on this site is invaluable, so I will search for a place which will archive it in perpetuity, so that future generations of photographers also can look back at the history of pricing information for photographers.
[23:39] hi all
[23:39] for some reason, i own the ArchiveTeam org on GitHub still
[23:39] who wants it? :p
[23:39] the email addresses need to be changed too
[23:40] also, i retract my previous statement, it seems I am one of the owners. don't want to leave until another email is in there though. still trying to get bukkit notifications to stop
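For reference, the kind of URL normalisation discussed earlier in the log (trailing slashes, /index.html vs. the bare directory, stray fragments) could be sketched roughly like this; the rules and names are illustrative, not what any of the scripts mentioned above actually do:

    # Rough sketch of normalising a URL list before a grab so that
    # near-duplicates collapse to a single entry. Rules are illustrative only.
    import sys
    from urllib.parse import urlsplit, urlunsplit

    def normalize(url):
        parts = urlsplit(url.strip())
        path = parts.path
        if path.endswith("/index.html") or path.endswith("/index.htm"):
            path = path[: path.rfind("/") + 1]   # keep only the directory form
        if not path:
            path = "/"
        # lower-case scheme and host, drop fragments, keep query strings
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           path.rstrip("/") or "/", parts.query, ""))

    if __name__ == "__main__":
        seen = set()
        for line in sys.stdin:
            url = normalize(line)
            if url not in seen:
                seen.add(url)
                print(url)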