Time | Nickname | Message
00:16 | omf_ | Okay so far I got it from 8033 domains to 3735
00:16 | omf_ | I expect a few hundred more to fall off before I am done
00:45 | omf_ | alard, I am going to look into setting up a warrior server myself at a later date. I want to collect stats data that we can publish as CC0 so there is more research out there
00:46 | alard | omf_: Excellent.
00:48 | omf_ | I have been a web developer for 17 years now and we only recently got large-scale open data
00:48 | omf_ | google, yahoo, craigslist
00:48 | omf_ | they all started giving bits out
00:49 | omf_ | that led to others and more community projects. I just see the next logical step being stats from millions of pages at a time
00:49 | omf_ | so more of the high-scalability end, so beginners have something to learn from
10:51 | alard | omf_: Let's continue here.
10:51 | omf_ | The links between all these sites and inside these sites are a mess
10:52 | omf_ | a full premap would make this process easy, but with no way of knowing when shit is turned off, we do not have that kind of time
10:52 | alard | The universal-tracker system works best if you can split your task into small, but not too small, subtasks.
10:53 | alard | But you have a small number of very large sites, is that correct?
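For an id-numbered site like the one discussed below, splitting by id range is one natural way to get subtasks of that size. A minimal sketch of the idea, assuming a 500-id chunk and a made-up item naming scheme (the real universal-tracker item format is not described in this log):

```python
# Sketch only: split a large id-based site into tracker work items of a
# "small, but not too small" size. The 500-id chunk and the site:start-end
# naming scheme are assumptions, not the real universal-tracker format.
def make_items(site, max_id, chunk=500):
    items = []
    for start in range(1, max_id + 1, chunk):
        end = min(start + chunk - 1, max_id)
        items.append("%s:%d-%d" % (site, start, end))  # one item per id range
    return items

print(make_items("planetquake-potd", 4222)[:3])
# ['planetquake-potd:1-500', 'planetquake-potd:501-1000', 'planetquake-potd:1001-1500']
```

Chunks of a few hundred ids keep a failed item cheap to retry while still amortizing the per-item tracker overhead.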
10:53 | omf_ | looks that way
10:54 | omf_ | planetquake, the forums and a few others
10:54 | omf_ | I am still trying to figure out why the few attempts that were made completed without getting nearly everything
10:55 | omf_ | take this url for instance
10:55 | omf_ | http://planetquake.gamespy.com/View.php?view=POTD.Detail&id=4222
10:55 | omf_ | now all I have to do is subtract one from the number on the end and I get the previous page
10:56 | omf_ | there is also a list page with 167 pages of results
10:56 | omf_ | and yet wget got none of it
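Because each detail page is reachable just by changing the id parameter, a seed list can be generated up front instead of relying on link discovery. A minimal sketch, assuming the ids run contiguously from 1 up to the 4222 seen above (gaps would simply come back as 404s):

```python
# Sketch only: enumerate the POTD detail pages by counting the id down
# from the known-good 4222. Contiguous ids starting at 1 are an assumption.
BASE = "http://planetquake.gamespy.com/View.php?view=POTD.Detail&id=%d"

with open("potd-urls.txt", "w") as f:
    for page_id in range(4222, 0, -1):
        f.write(BASE % page_id + "\n")
```

The resulting file can then be handed to Wget with -i potd-urls.txt.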
10:56 | omf_ | part of that I know is the cross-domain image fetching bs
10:56 | omf_ | How do we bake that in?
10:57 | omf_ | all images have this kind of url http://pnmedia.gamespy.com/planetquake.gamespy.com/fms/images/potd/4199/1323262539_fullres.jpg
11:00 | alard | That depends on your Wget parameters, probably.
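One plausible culprit: by default Wget never leaves the host it started on, so the pnmedia.gamespy.com images get skipped unless host spanning is turned on. A sketch of the kind of flag set involved, not the one the project actually settled on:

```python
import subprocess

# Sketch only: fetch one POTD page together with its cross-domain images.
# --span-hosts lets Wget leave planetquake.gamespy.com for the image host,
# and --domains keeps it from wandering off gamespy.com entirely.
subprocess.run([
    "wget",
    "--page-requisites",      # also fetch the images/CSS the page needs
    "--span-hosts",           # allow hosts other than the starting one
    "--domains=gamespy.com",  # ...but only gamespy.com subdomains
    "http://planetquake.gamespy.com/View.php?view=POTD.Detail&id=4222",
], check=True)
```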
11:00 | omf_ | Yeah, I'll figure out what they need to be
11:00 | omf_ | I will have a few people test it
11:01 | alard | Is there a list of the sites you want to save on the wiki?
11:02 | omf_ | I don't have permission to create a wiki page to post it.
11:02 | alard | You don't have an account?
11:03 | omf_ | I have an account, it just cannot create pages
11:03 | omf_ | just update and edit
11:03 | omf_ | never really had the need
11:03 | alard | That's weird. Should I create a page that you can then edit?
11:04 | alard | Since I have the impression that you're trying to save a lot of very different sites.
11:04 | omf_ | basically everything under 1up, gamespy, ugo and ign
11:04 | omf_ | it is all getting turned off
11:05 | alard | What do I call the page?
11:06 | alard | http://archiveteam.org/index.php?title=IGN
11:06 | omf_ | we called the chat room ispygames
11:06 | alard | Ah.