Time | Nickname | Message
00:16 | omf_ | Okay so far I got it from 8033 domains to 3735
00:16 | omf_ | I expect a few hundred more to fall off before I am done
00:45 | omf_ | alard, I am going to look into setting up a warrior server myself at a later date. I want to collect stats data that we can publish as CC0 so there is more research out there
00:46 | alard | omf_: Excellent.
00:48 | omf_ | I have been a web developer for 17 years now and we only recently got large-scale open data
00:48 | omf_ | google, yahoo, craigslist
00:48 | omf_ | they all started giving bits out
00:49 | omf_ | that led to others and more community projects. I just see the next logical step being stats from millions of pages at a time
00:49 | omf_ | so more of the high-scalability end, so beginners have something to learn from
10:51 | alard | omf_: Let's continue here.
10:51 | omf_ | The links between all these sites and inside these sites are a mess
10:52 | omf_ | a full premap would make this process easy, but with no way of knowing when shit is turned off, we do not have that kind of time
10:52 | alard | The universal-tracker system works best if you can split your task into small, but not too small, subtasks.
10:53 | alard | But you have a small number of very large sites, is that correct?
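For an id-numbered site like the one discussed below, splitting by id range is one natural way to get subtasks of that size. A minimal sketch of the idea, assuming a 500-id chunk and a made-up item naming scheme (the real universal-tracker item format is not described in this log):

```python
# Sketch only: split a large id-based site into tracker work items of a
# "small, but not too small" size. The 500-id chunk and the site:start-end
# naming scheme are assumptions, not the real universal-tracker format.
def make_items(site, max_id, chunk=500):
    items = []
    for start in range(1, max_id + 1, chunk):
        end = min(start + chunk - 1, max_id)
        items.append("%s:%d-%d" % (site, start, end))  # one item per id range
    return items

print(make_items("planetquake-potd", 4222)[:3])
# ['planetquake-potd:1-500', 'planetquake-potd:501-1000', 'planetquake-potd:1001-1500']
```

Chunks of a few hundred ids keep a failed item cheap to retry while still amortizing the per-item tracker overhead.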
10:53 | omf_ | looks that way
10:54 | omf_ | planetquake, the forums and a few others
10:54 | omf_ | I am still trying to figure out why the few attempts that were made completed without getting nearly everything
10:55 | omf_ | take this url for instance
10:55 | omf_ | http://planetquake.gamespy.com/View.php?view=POTD.Detail&id=4222
10:55 | omf_ | now all I have to do is subtract one from the number on the end and I get the previous page
10:56 | omf_ | there is also a list page with 167 pages of results
10:56 | omf_ | and yet wget got none of it
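Because each detail page is reachable just by changing the id parameter, a seed list can be generated up front instead of relying on link discovery. A minimal sketch, assuming the ids run contiguously from 1 up to the 4222 seen above (gaps would simply come back as 404s):

```python
# Sketch only: enumerate the POTD detail pages by counting the id down
# from the known-good 4222. Contiguous ids starting at 1 are an assumption.
BASE = "http://planetquake.gamespy.com/View.php?view=POTD.Detail&id=%d"

with open("potd-urls.txt", "w") as f:
    for page_id in range(4222, 0, -1):
        f.write(BASE % page_id + "\n")
```

The resulting file can then be handed to Wget with -i potd-urls.txt.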
10:56 | omf_ | part of that I know is the cross-domain image fetching bs
10:56 | omf_ | How do we bake that in?
10:57 | omf_ | all images have this kind of url http://pnmedia.gamespy.com/planetquake.gamespy.com/fms/images/potd/4199/1323262539_fullres.jpg
11:00 | alard | That depends on your Wget parameters, probably.
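One plausible culprit: by default Wget never leaves the host it started on, so the pnmedia.gamespy.com images get skipped unless host spanning is turned on. A sketch of the kind of flag set involved, not the one the project actually settled on:

```python
import subprocess

# Sketch only: fetch one POTD page together with its cross-domain images.
# --span-hosts lets Wget leave planetquake.gamespy.com for the image host,
# and --domains keeps it from wandering off gamespy.com entirely.
subprocess.run([
    "wget",
    "--page-requisites",      # also fetch the images/CSS the page needs
    "--span-hosts",           # allow hosts other than the starting one
    "--domains=gamespy.com",  # ...but only gamespy.com subdomains
    "http://planetquake.gamespy.com/View.php?view=POTD.Detail&id=4222",
], check=True)
```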
11:00 | omf_ | Yeah, I'll figure out what they need to be
11:00 | omf_ | I will have a few people test it
11:01 | alard | Is there a list of the sites you want to save on the wiki?
11:02 | omf_ | I don't have permission to create a wiki page to post it.
11:02 | alard | You don't have an account?
11:03 | omf_ | I have an account, it just cannot create pages
11:03 | omf_ | just update and edit
11:03 | omf_ | never really had the need
11:03 | alard | That's weird. Should I create a page that you can then edit?
11:04 | alard | Since I have the impression that you're trying to save a lot of very different sites.
11:04 | omf_ | basically everything under 1up, gamespy, ugo and ign
11:04 | omf_ | it is all getting turned off
11:05 | alard | What do I call the page?
11:06 | alard | http://archiveteam.org/index.php?title=IGN
11:06 | omf_ | we called the chat room ispygames
11:06 | alard | Ah.