#archiveteam 2013-09-25,Wed


Time Nickname Message
00:37 🔗 yipdw there's some funny stuff on amkon.net
00:37 🔗 yipdw e.g. http://amkon.net/showthread.php?54867-As-usual-Calvin-overrules-that-wimpy-communist-sympathiser-Christ
00:55 🔗 dashcloud DFJustin: funny you should ask- I've mostly de-duped all of the lists except for the Bing one here: http://paste.archivingyoursh.it/paqorogimi.avrasm
00:56 🔗 dashcloud I've taken out the obvious duplications (exact or nearly exact except for trailing slashes or extra characters that couldn't be part of the address)
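
[Editor's note: the kind of de-duplication dashcloud describes, treating URLs as the same when they differ only by a trailing slash, could be sketched roughly as below. This is not the script dashcloud used; the function names `normalize` and `dedupe` are hypothetical, and lowercasing the scheme and host and dropping fragments are extra assumptions added for illustration.]

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Build a comparison key: lowercase scheme/host, no trailing slash, no fragment.

    Assumption: two URLs that differ only in these ways point at the same page.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe(urls):
    """Keep the first URL seen for each normalized key, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            kept.append(url.strip())
    return kept
```

[The original spelling of each kept URL is preserved, so the list that goes to the downloader is unchanged apart from the dropped duplicates.]
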
00:56 🔗 SketchCow Do we need to move forward on webtv?
00:57 🔗 dashcloud yep- if the archiveteam wiki is right, they close on the 30th of this month
01:00 🔗 dashcloud so, apparently WebTV/MSNTV hosted newsgroups at some point:
01:00 🔗 dashcloud http://news.webtv.net/charters/alt.discuss.webtvkillaz.html
01:00 🔗 dashcloud http://news.webtv.net/club-charters/alt.discuss.clubs.public.seniors.misc.ivyhalls.html
01:28 🔗 godane SketchCow: I'm starting to mirror amkon.net forums
01:37 🔗 SketchCow OK. We have someone else doing it too.
01:37 🔗 SketchCow I am all for two grabs.
01:37 🔗 yipdw me too, especially if the bot fucks out for some unforeseen reason
01:49 🔗 godane so how do you bypass a read error to make it continue?
01:49 🔗 godane *wget continue
01:51 🔗 SketchCow root@teamarchive0:/1/FRIENDSTER# du -sh .
01:51 🔗 SketchCow 5.4M .
01:51 🔗 SketchCow OH THANK GOD
01:52 🔗 BlueMax ?
01:53 🔗 SketchCow I've been uploading many many 100gb+ Friendster Grabs
01:55 🔗 dashcloud so you're just about done with Friendster uploads now?
01:55 🔗 SketchCow Maybe.
01:55 🔗 SketchCow Probably.
01:55 🔗 SketchCow 920G HACKERCONS
01:55 🔗 SketchCow That's a new one
01:59 🔗 dashcloud so, I'm downloading the list of MSNTV/WebTV urls I posted earlier now- if someone else wants to get that list or the Bing list, that would be great (also, is there a general purpose script for making WARCs of these kinds of downloads? someone in here made the one that I'm using currently)
02:49 🔗 dashcloud so, a more specialized tool that doesn't mirror the same item hundreds of times because you have example.com/user/ and example.com/user/index.html would be nice
02:51 🔗 SketchCow Agreed.
02:51 🔗 SketchCow Over time, it'll be nice to improve the archivebot to do the right things.
02:51 🔗 SketchCow Save us time.
02:52 🔗 SketchCow I want us out of the business of team members being tied up with "shitbag.com is going down, 200 URLs and 45 image files"
03:00 🔗 yipdw dashcloud: those aren't the same in general
03:02 🔗 yipdw dashcloud: or did you mean a specialized tool for the cases where they are the same
03:06 🔗 dashcloud probably the second
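
[Editor's note: a minimal sketch of the "specialized tool" case, where example.com/user/ and example.com/user/index.html are known to be the same page. As yipdw points out, that is not true in general, so this is only safe for sites where it has been checked. The `INDEX_NAMES` set and the `collapse_index` name are hypothetical.]

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names are served as the directory index on the target site.
INDEX_NAMES = {"index.html", "index.htm"}


def collapse_index(url):
    """Rewrite .../index.html to .../ so both forms map to one download."""
    parts = urlsplit(url)
    head, _, last = parts.path.rpartition("/")
    path = head + "/" if last in INDEX_NAMES else parts.path
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

[Running a URL list through `collapse_index` before de-duplicating would fold both spellings into one entry.]
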
03:07 🔗 dashcloud can't think of any cases really where you've got a pile of URLs and you won't be duplicating downloads (if you could crawl the whole site easily, you wouldn't be bothering with piles of urls- just one).
03:08 🔗 yipdw yeah, that happens a lot
03:08 🔗 yipdw actually, that happened in the patch.com grab
03:08 🔗 yipdw (which I need to shut down before it keeps costing me money, heh)
03:08 🔗 yipdw but there, I'm sure that the propwash junction patch advert was grabbed 100,000 times
03:09 🔗 yipdw it *is* possible to avoid that with e.g. wget-lua and a central URL database
03:09 🔗 dashcloud actually, I can think of one instance where you would be in that instance- if you have a pre-existing list of URLs that are from many different sites, you could have case 1 without case 2
03:10 🔗 dashcloud like the 505 unbelievably stupid web p@ges book I have sitting over my desk- a list of 500 urls, and little to no duplicates there
03:12 🔗 dashcloud SketchCow: is there some way IA can be queried by someone for every url from a certain domain? (like if we ever had to do angelfire or tripod) (not for MSNTV, but in general)
03:12 🔗 omf_ use the cdx search
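
[Editor's note: assuming omf_ means the Wayback Machine's public CDX server, a query for every captured URL under a domain could be built roughly as below. The endpoint constant and the `cdx_query_url` helper are illustrative, not something posted in the channel; the query parameters (`url`, `matchType`, `fl`, `collapse`, `limit`) are from the CDX server's documented interface.]

```python
from urllib.parse import urlencode

# Assumption: the public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain, limit=None):
    """Build a CDX query listing original URLs captured under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",   # include subdomains of the given domain
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # one row per distinct URL, not per capture
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)
```

[Fetching the resulting URL should return one captured URL per line; for a domain the size of angelfire or tripod the response would be very large, so a `limit` is worth setting while testing.]
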
03:15 🔗 yipdw I wonder how much memory you'd need to store all of the wayback machine's URLs in memory
03:15 🔗 yipdw assuming zero overhead for object representations, etc.
03:28 🔗 SketchCow Bi,]
03:28 🔗 SketchCow Ahem.
03:28 🔗 SketchCow We're terrible with that.
04:59 🔗 DFJustin dashcloud: it is possible and underscor has done it for us before
06:09 🔗 chfoo i've created http://archiveteam.org/index.php?title=Yahoo!_Blog . i can't find the vietnam archives and or archives of the full shutdown notices.
20:36 🔗 SketchCow I run the site http://www.shakodo.com/ , a free resource/forum/QA site
20:36 🔗 SketchCow for photographers to learn about the business of photography and
20:36 🔗 SketchCow especially pricing for photographers.
20:36 🔗 SketchCow There are some (ha!) priceless information on the site, with no
20:36 🔗 SketchCow advertisements etc. But unfortunately I have to shut the site
20:36 🔗 SketchCow down from December 8th, 2013 and it would be great if you
20:36 🔗 SketchCow could crawl and archive the entire site before this date.
20:36 🔗 SketchCow There are no pictures on it, so storage requirements are minimal.
20:36 🔗 SketchCow It would be fantastic if you could do that or somehow update the
20:36 🔗 SketchCow schedule of your Waybackmachine crawler, so that it gets a few
20:36 🔗 SketchCow more snapshots before it closes.
20:36 🔗 SketchCow ...
20:36 🔗 SketchCow and THAT is how you shut down
20:45 🔗 ersi nice!
20:45 🔗 ersi or well, shame! but nicely done!
21:13 🔗 mistym That is fantastic. More site operators need to be like that
21:40 🔗 BiggieJon only thing better would be a site owner offering to send a hard drive with the last full backup of the site the day it closes
22:30 🔗 godane Shakodo will shut down on December 8th, 2013, to the day 3 years after it was launched. The content on this site is invaluable, so I will search for a place which will it archive it in perpetuity, so that future generations of photographers also can look back at the history of pricing information for photographers.
23:39 🔗 robbiet4- hi all
23:39 🔗 robbiet4- for some reason, i own the ArchiveTeam org on GitHub still
23:39 🔗 robbiet4- who wants it? :p
23:39 🔗 robbiet4- the email addresses need to be changed too
23:40 🔗 robbiet4- also, i retract my previous statement, it seems I am one of the owners. don't want to leave until another email is in there though. still trying to get bukkit notifications to stop
