Time | Nickname | Message
00:37 | yipdw | there's some funny stuff on amkon.net
00:37 | yipdw | e.g. http://amkon.net/showthread.php?54867-As-usual-Calvin-overrules-that-wimpy-communist-sympathiser-Christ
00:55 | dashcloud | DFJustin: funny you should ask- I've mostly de-duped all of the lists except for the Bing one here: http://paste.archivingyoursh.it/paqorogimi.avrasm
00:56 | dashcloud | I've taken out the obvious duplications (exact or nearly exact except for trailing slashes or extra characters that couldn't be part of the address)
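A minimal sketch of that kind of de-duplication, assuming plain text lists with one URL per line; the exact rules dashcloud applied aren't given here, so the normalization below (drop stray trailing punctuation, then trailing slashes) is illustrative only:

```python
def normalize(url):
    """Collapse near-exact duplicates: drop whitespace, trailing characters
    that can't be part of the address (quotes, brackets, punctuation picked
    up from surrounding text), and trailing slashes."""
    return url.strip().rstrip("'\">),.]").rstrip("/")

def dedupe(list_files):
    seen, kept = set(), []
    for path in list_files:          # e.g. ["msntv_urls.txt"] (hypothetical filename)
        with open(path) as f:
            for line in f:
                url = normalize(line)
                if url and url not in seen:
                    seen.add(url)
                    kept.append(url)
    return kept
```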
00:56 | SketchCow | Do we need to move forward on webtv?
00:57 | dashcloud | yep- if the archiveteam wiki is right, they close on the 30th of this month
01:00 | dashcloud | so, apparently WebTV/MSNTV hosted newsgroups at some point:
01:00 | dashcloud | http://news.webtv.net/charters/alt.discuss.webtvkillaz.html
01:00 | dashcloud | http://news.webtv.net/club-charters/alt.discuss.clubs.public.seniors.misc.ivyhalls.html
01:28 | godane | SketchCow: I'm starting to mirror amkon.net forums
01:37 | SketchCow | OK. We have someone else doing it too.
01:37 | SketchCow | I am all for two grabs.
01:37 | yipdw | me too, especially if the bot fucks out for some unforeseen reason
01:49 | godane | so how do you bypass a read error to make it continue?
01:49 | godane | *wget continue
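No answer is recorded here; for reference, these are the standard wget retry options usually reached for when a grab dies on read errors. The URL and the Python wrapper are placeholders, and whether these flags cover godane's particular failure isn't clear from the context:

```python
import subprocess

# Keep retrying instead of giving up on the first read error, and resume
# any partially-downloaded files rather than starting over.
subprocess.run([
    "wget",
    "--continue",           # resume partial downloads
    "--tries=10",           # retry each URL up to 10 times
    "--waitretry=5",        # back off (up to 5s) between retries
    "--read-timeout=30",    # treat a stalled read as a failure and retry
    "--retry-connrefused",  # retry even when the connection is refused
    "http://example.com/big-file",   # placeholder URL
], check=False)
```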
01:51 | SketchCow | root@teamarchive0:/1/FRIENDSTER# du -sh .
01:51 | SketchCow | 5.4M .
01:51 | SketchCow | OH THANK GOD
01:52 | BlueMax | ?
01:53 | SketchCow | I've been uploading many many 100gb+ Friendster Grabs
01:55 | dashcloud | so you're just about done with Friendster uploads now?
01:55 | SketchCow | Maybe.
01:55 | SketchCow | Probably.
01:55 | SketchCow | 920G HACKERCONS
01:55 | SketchCow | That's a new one
01:59 | dashcloud | so, I'm downloading the list of MSNTV/WebTV urls I posted earlier now- if someone else wants to get that list or the Bing list, that would be great (also, is there a general purpose script for making WARCs of these kinds of downloads? someone in here made the one that I'm using currently)
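The script dashcloud is using isn't shown here; a minimal sketch of the usual approach is to hand the URL list to wget's native WARC support. The filenames, wait time, and the Python wrapper are assumptions:

```python
import subprocess

# urls.txt: one URL per line (hypothetically, the de-duplicated MSNTV/WebTV list).
# wget writes every request/response pair into msntv.warc.gz plus a CDX index,
# alongside its normal on-disk mirror.
subprocess.run([
    "wget",
    "--input-file=urls.txt",
    "--warc-file=msntv",          # produces msntv.warc.gz
    "--warc-cdx",                 # also write an index of the WARC contents
    "--page-requisites",          # pull in the images/CSS each page needs
    "--wait=1", "--random-wait",  # be gentle with a server that's on its way out
    "--no-verbose",
], check=False)
```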
02:49 | dashcloud | so, a more specialized tool that doesn't mirror the same item hundreds of times because you have example.com/user/ and example.com/user/index.html would be nice
02:51 | SketchCow | Agreed.
02:51 | SketchCow | Over time, it'll be nice to improve the archivebot to do the right things.
02:51 | SketchCow | Save us time.
02:52 | SketchCow | I want us out of the business of team members being tied up with "shitbag.com is going down, 200 URLs and 45 image files"
03:00 | yipdw | dashcloud: those aren't the same in general
03:02 | yipdw | dashcloud: or did you mean a specialized tool for the cases where they are the same
03:06 | dashcloud | probably the second
03:07 | dashcloud | can't think of any cases really where you've got a pile of URLs and you won't be duplicating downloads (if you could crawl the whole site easily, you wouldn't be bothering with piles of urls- just one).
03:08 | yipdw | yeah, that happens a lot
03:08 | yipdw | actually, that happened in the patch.com grab
03:08 | yipdw | (which I need to shut down before it keeps costing me money, heh)
03:08 | yipdw | but there, I'm sure that the propwash junction patch advert was grabbed 100,000 times
03:09 | yipdw | it *is* possible to avoid that with e.g. wget-lua and a central URL database
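The core of that idea, stripped of the wget-lua plumbing: every worker checks a shared set of already-seen URLs before fetching, so a widely-linked asset is only grabbed once. The Redis backend and key name are assumptions for illustration:

```python
import redis  # assumption: one shared Redis instance acts as the central URL database

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

def should_fetch(url):
    """Atomically record the URL and report whether it was new.
    SADD returns 1 only the first time a member is added, so each
    normalized URL gets fetched at most once across all workers."""
    return r.sadd("seen-urls", url.rstrip("/")) == 1

# In a wget-lua setup this check would sit in the Lua hook that decides
# whether to queue a discovered URL; a plain fetch loop can call it directly.
if should_fetch("http://example.com/ads/banner.gif"):  # hypothetical URL
    print("fetch it")
else:
    print("already grabbed elsewhere, skip")
```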
03:09 | dashcloud | actually, I can think of one instance where you would be in that instance- if you have a pre-existing list of URLs that are from many different sites, you could have case 1 without case 2
03:10 | dashcloud | like the 505 unbelievably stupid web p@ges book I have sitting over my desk- a list of 500 urls, and little to no duplicates there
03:12 | dashcloud | SketchCow: is there some way IA can be queried by someone for every url from a certain domain? (like if we ever had to do angelfire or tripod) (not for MSNTV, but in general)
03:12 | omf_ | use the cdx search
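The CDX search omf_ points at is the Wayback Machine's CDX API; a small example of asking it for every capture under one domain (the domain is a placeholder, and large domains need the API's paging options rather than a single request like this):

```python
import json
import urllib.request

# Query the Wayback CDX API for everything captured under a domain.
# matchType=domain includes subdomains; collapse=urlkey keeps one row per unique URL.
query = (
    "http://web.archive.org/cdx/search/cdx"
    "?url=example.com"         # placeholder domain
    "&matchType=domain"
    "&collapse=urlkey"
    "&fl=timestamp,original"
    "&output=json"
)

with urllib.request.urlopen(query) as resp:
    rows = json.load(resp)

# With output=json the first row is the list of field names.
for timestamp, original in rows[1:]:
    print(timestamp, original)
```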
03:15 | yipdw | I wonder how much memory you'd need to store all of the wayback machine's URLs in memory
03:15 | yipdw | assuming zero overhead for object representations, etc.
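A back-of-the-envelope version of that question; both numbers below are made-up round figures, not Wayback Machine statistics from the conversation:

```python
# Assume 4e11 captured URLs at ~80 bytes each, with the zero object
# overhead yipdw stipulates.
urls = 4e11
bytes_per_url = 80
print(urls * bytes_per_url / 1e12, "TB of raw URL text")  # -> 32.0 TB
```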
03:28 | SketchCow | Bi,]
03:28 | SketchCow | Ahem.
03:28 | SketchCow | We're terrible with that.
04:59 | DFJustin | dashcloud: it is possible and underscor has done it for us before
06:09 | chfoo | i've created http://archiveteam.org/index.php?title=Yahoo!_Blog . i can't find the vietnam archives and or archives of the full shutdown notices.
20:36 | SketchCow | I run the site http://www.shakodo.com/ , a free resource/forum/QA site
20:36 | SketchCow | for photographers to learn about the business of photography and
20:36 | SketchCow | especially pricing for photographers.
20:36 | SketchCow | There are some (ha!) priceless information on the site, with no
20:36 | SketchCow | advertisements etc. But unfortunately I have to shut the site
20:36 | SketchCow | down from December 8th, 2013 and it would be great if you
20:36 | SketchCow | could crawl and archive the entire site before this date.
20:36 | SketchCow | There are no pictures on it, so storage requirements are minimal.
20:36 | SketchCow | It would be fantastic if you could do that or somehow update the
20:36 | SketchCow | schedule of your Waybackmachine crawler, so that it gets a few
20:36 | SketchCow | more snapshots before it closes.
20:36 | SketchCow | ...
20:36 | SketchCow | and THAT is how you shut down
20:45 | ersi | nice!
20:45 | ersi | or well, shame! but nicely done!
21:13 | mistym | That is fantastic. More site operators need to be like that
21:40 | BiggieJon | only thing better would be a site owner offering to send a hard drive with the last full backup of the site the day it closes
22:30 | godane | Shakodo will shut down on December 8th, 2013, to the day 3 years after it was launched. The content on this site is invaluable, so I will search for a place which will archive it in perpetuity, so that future generations of photographers also can look back at the history of pricing information for photographers.
23:39 | robbiet4- | hi all
23:39 | robbiet4- | for some reason, i own the ArchiveTeam org on GitHub still
23:39 | robbiet4- | who wants it? :p
23:39 | robbiet4- | the email addresses need to be changed too
23:40 | robbiet4- | also, i retract my previous statement, it seems I am one of the owners. don't want to leave until another email is in there though. still trying to get bukkit notifications to stop