Time | Nickname | Message
04:20 | xk_id | Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite.
06:45 | Nemo_bis | norbert79: http://sourceforge.net/projects/xowa
08:06 | norbert79 | Nemo_bis: Thanks! While I don't quite understand how this might help me yet, it's a good start. Thanks again :)
08:16 | Nemo_bis | norbert79: looks like it parses the wikitext on demand
08:24 | norbert79 | Nemo_bis: Yes, I realized what you were trying to tell me with it. Might be useful indeed
10:54 | xk_id | Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request?
10:55 | turnkit | Lord_Nigh if you are dumping mixed mode, try imgburn, as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable.
10:56 | Lord_Nigh | does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on gibberish that appears in the 'right' place, it's likely a drive firmware off-by-1 bug in the sector count
10:57 | Lord_Nigh | seems to affect the entire line of usb dvd drives made by that manufacturer
10:59 | turnkit | (scratches head)
10:59 | Lord_Nigh | offsets meaning when digitally reading the audio areas
11:00 | Lord_Nigh | there's a block of gibberish which appears before the audio starts
11:00 | Lord_Nigh | and that has to be cut out
11:01 | turnkit | http://forum.imgburn.com/index.php?showtopic=5974
11:06 | turnkit | I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is a two-pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989
11:06 | turnkit | But I am not sure... do you think it's necessary? Will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do much like the idea of a perfect clone if possible though.
11:10 | alard | xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?)
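[A minimal sketch of the loop alard describes, in Python; the process() function and the URL list are placeholders, not anything from an actual crawler. Measured this way, the delay runs from the end of one response to the start of the next request, which is the simple answer to xk_id's earlier question.]

    import time
    import urllib.request

    def process(body):
        # placeholder for whatever parsing/storing the crawler does
        print(len(body), "bytes")

    def polite_crawl(urls, delay=1.0):
        # alard's loop: send request -> process response -> sleep -> repeat
        for url in urls:
            with urllib.request.urlopen(url) as response:
                process(response.read())
            time.sleep(delay)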
11:10 | turnkit | Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ?
11:11 | xk_id | alard: simplest maybe, but not most convenient :P
11:11 | xk_id | alard: need to speed things up a bit here...
11:11 | xk_id | alard: so I'm capturing when each request is made and delaying based on that
11:11 | alard | In that case, why wait?
11:11 | ersi | xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that
11:12 | xk_id | alard: 'cause, morals
11:12 | ersi | And it was like, a total of 4M+ requests to be done, right?
11:12 | xk_id | ersi: I found that in their robots.txt they ask one spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s
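[For reference, the Python 3.6+ standard library can read per-agent Crawl-delay values like the ones xk_id found; a small sketch, with made-up agent names:]

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://gather.com/robots.txt")
    rp.read()
    # crawl_delay() returns the Crawl-delay set for a given user-agent,
    # or None if robots.txt does not specify one for it.
    print(rp.crawl_delay("SomeSpider"))   # hypothetical agent name
    print(rp.crawl_delay("OtherSpider"))  # hypothetical agent name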
11:12
🔗
|
* |
xk_id nods |
11:12
🔗
|
xk_id |
maximum 4M. between 3 and 4M |
11:12
🔗
|
alard |
Is it a small site? Or one with more than enough capacity? |
11:13
🔗
|
xk_id |
It's an online social network. |
11:13
🔗
|
xk_id |
http://gather.com |
11:13
🔗
|
ersi |
is freshness important? (Like, if it's suspecteble to.. fall over and die) If not, 1-2s sounds pretty fair |
11:14
🔗
|
xk_id |
I need to finish quickly. less than a week. |
11:14
🔗
|
ersi |
ouch |
11:14
🔗
|
ersi |
with 1s wait, it'll take 46 days |
11:14
🔗
|
xk_id |
I'm doing it distributely |
11:15
🔗
|
ersi |
(46 days basing on single thread - wait 1s between request) |
11:15
🔗
|
* |
xk_id nods |
11:15
🔗
|
alard |
But, erm, if you're running a number of threads, why wait at all? |
11:15
🔗
|
xk_id |
good question... I suppose one reason would be to avoid being identified as an attacher |
11:15
🔗
|
alard |
For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread. |
11:16
🔗
|
alard |
Are you using a disposable IP address? |
11:16
🔗
|
xk_id |
EC2 |
11:16
🔗
|
alard |
You could just try without a delay and see what happens. |
11:16
🔗
|
ersi |
And keep a watch, for when the banhammer falls and switch over to some new ones :) |
11:16
🔗
|
alard |
Switch to more threads with a delay if they block you. |
11:17
🔗
|
ersi |
Maybe randomise your useragent a little as well |
11:17
🔗
|
xk_id |
I thought of randomising the user agent :D |
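[A sketch of that idea; the user-agent strings below are invented examples for illustration, not a vetted list:]

    import random

    # A few example Chrome user-agent strings (versions invented for
    # illustration; in practice you'd collect real ones).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    ]

    def random_headers():
        # pick a different user-agent per request to look less uniform
        return {"User-Agent": random.choice(USER_AGENTS)}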
11:17 | ersi | goodie
11:17 | alard | So you've got rid of the morals? :)
11:18 | * | ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around
11:18 | xk_id | I haven't yet decided how to go about this.
11:20 | xk_id | There are several factors; morals are one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, the many scholarly articles on crawling that I've reviewed seem very careless about this.
11:20 | xk_id | but I also cannot go completely mental about this
11:22 | xk_id | and finish my crawl in a day :P
11:24 | xk_id | so basically: a) be polite. b) not be identified as an attacker. c) finish quickly.
11:26 | xk_id | from the POV of the website, it's better to have 5 machines each delaying their requests (individually) than not delaying their requests, right?
11:26 | alard | Delaying is better than not delaying?
11:27 | xk_id | well, certainly, if I only had one machine.
11:27 | ersi | The "Be polite" coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time"
11:28 | alard | I'm not sure that 5 machines is better than 3 if the total requests/second is equal.
11:28 | alard | The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.)
11:28 | ersi | maybe only US hosting, seems slow for me as well
11:29 | ersi | and maybe crummy infra
11:30 | xk_id | it's a bit slow for me too
11:30 | xk_id | so is it pointless to delay between requests if I'm using more than one machine?
11:31 | xk_id | probably not, right?
11:31
🔗
|
alard |
It may be useful to avoid detection, but otherwise I think it's the number of requests that count. |
11:32
🔗
|
xk_id |
detection-wise, yes. I was curious politeness-wise.. |
11:32
🔗
|
xk_id |
hmm |
11:33
🔗
|
alard |
Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS. |
11:33
🔗
|
xk_id |
hah |
11:34
🔗
|
alard |
Given that it's slow already, it's probably good to keep an eye on the site while you're crawling. |
11:34
🔗
|
xk_id |
do you think I could actually harm it? |
11:34
🔗
|
alard |
Yes. |
11:35
🔗
|
xk_id |
what would follow to that? |
11:35
🔗
|
alard |
You're sending a lot of difficult questions. |
11:35
🔗
|
xk_id |
hmm |
11:36
🔗
|
alard |
It's also hard on their cache, since you're asking for a different page each time. |
11:36
🔗
|
xk_id |
I think I'll do some delays across the cluster as well. |
11:37
🔗
|
alard |
Can you pause and resume easily? |
11:37
🔗
|
xk_id |
and perhaps I should do it during US-time night, I think most of their userbase is from there. |
11:37
🔗
|
xk_id |
yes, I know the job queue in advance. |
11:37
🔗
|
xk_id |
and I will work with that |
11:38
🔗
|
alard |
Start slow and then increase your speed if you find that you can? |
11:39
🔗
|
xk_id |
ok |
11:44
🔗
|
xk_id |
You said that: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal.". However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong? |
11:45
🔗
|
alard |
No, I think you're right. 5 workers with delays is more polite than 5 workers without delays. |
11:45
🔗
|
xk_id |
This is an interesting Maths problem. |
11:45
🔗
|
xk_id |
even if the delays don't take into account the operations of the other workers? |
11:46
🔗
|
alard |
Of course: adding a delay means each worker is sending fewer requests per second. |
11:46
🔗
|
xk_id |
but does this decrease the requests per second of the entire cluster? :) |
11:47
🔗
|
xk_id |
hmm |
11:47
🔗
|
alard |
Yes. So you can use a smaller cluster to get the same result. |
11:47
🔗
|
alard |
number of workers * (1 / (request length + delay)) = total requests per second |
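[Plugging made-up numbers into alard's formula:]

    def total_rps(workers, request_time, delay):
        # number of workers * (1 / (request length + delay))
        return workers * (1.0 / (request_time + delay))

    # Illustrative values only: 0.5 s per response, 1 s delay.
    print(total_rps(5, 0.5, 1.0))  # ~3.33 requests/s with delays
    print(total_rps(5, 0.5, 0.0))  # 10.0 requests/s without delays

[So under these assumed numbers, five delayed workers (~3.3 requests/s) put less load on the site than two undelayed ones (4 requests/s).]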
11:48 | xk_id | but what matters here is also the rate between two consecutive requests, regardless of where they come from inside the cluster.
11:48 | xk_id | *rate/time
11:49 | alard | Why does that matter?
11:49 | xk_id | because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second
11:50 | xk_id | no?
11:50 | alard | That is true, but it's unlikely that your workers remain synchronized.
11:50 | alard | (Unless you're implementing something difficult to ensure that they are.)
11:52 | xk_id | will adding delays in the code of each worker improve this, or will it remain practically neutral?
11:52 | alard | I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon.
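[A toy simulation of alard's point, with invented response-time jitter; even workers that start in lockstep drift apart within a few requests:]

    import random

    def request_times(n_requests, delay=1.0, seed=None):
        # one worker: each response takes a jittery 0.3-0.8 s (made-up
        # numbers), then the worker sleeps `delay` before the next request
        rng = random.Random(seed)
        t, times = 0.0, []
        for _ in range(n_requests):
            times.append(t)
            t += rng.uniform(0.3, 0.8) + delay
        return times

    # five workers all starting at t=0: their timestamps diverge quickly
    for worker in range(5):
        print([round(t, 2) for t in request_times(5, seed=worker)])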
11:52
🔗
|
xk_id |
So, it is not possible to say in advance if delays in the code will improve things in respect to this issue, if I don't synchronise the workers. |
11:53
🔗
|
alard |
I don't think delays will help. |
11:53
🔗
|
xk_id |
yes. I think you're right |
11:53
🔗
|
alard |
They'll only increase your cost, because you need a larger cluster because workers are sleeping part of the time. |
11:54
🔗
|
xk_id |
let's just hope my worker's won't randomly synchronise haha |
11:54
🔗
|
xk_id |
:P |
11:55
🔗
|
xk_id |
but, interesting situation. |
11:56
🔗
|
xk_id |
alard: thank you |
12:00
🔗
|
alard |
xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :) |
12:22
🔗
|
Smiley |
http://wikemacs.org/wiki/Main_Page - going away shortly |
12:22
🔗
|
ersi |
mediawikigrab that shizzle |
12:22
🔗
|
Smiley |
yeah I'm looking how now. |
12:23
🔗
|
Smiley |
errrr |
12:23
🔗
|
Smiley |
yeah how ? :S |
12:23
🔗
|
* |
Smiley needs to go to lunch :< |
12:23
🔗
|
Smiley |
http://code.google.com/p/wikiteam/ |
12:25
🔗
|
Smiley |
Is that sstill up to date? |
12:26
🔗
|
Smiley |
it doesn't say anything about warc :/ |
12:29
🔗
|
ersi |
It's not doing WARC at all |
12:29
🔗
|
ersi |
If I'm not mistaken, emijrp hacked it together |
12:31
🔗
|
Smiley |
hmmm there is no index.php :S |
12:32
🔗
|
Smiley |
./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3 |
12:32
🔗
|
Smiley |
Checking index.php... http://wikemacs.org/index.php |
12:32
🔗
|
Smiley |
Error in index.php, please, provide a correct path to index.php |
12:32
🔗
|
Smiley |
:< |
12:32
🔗
|
Smiley |
tried with /wiki/main_page too |
12:33
🔗
|
alard |
http://wikemacs.org/w/index.php ? |
12:37
🔗
|
ersi |
You need to find the api.php I think. I know the infoz is on the wikiteam page |
12:47
🔗
|
Smiley |
http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server |
13:04
🔗
|
Nemo_bis |
yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images |
13:06 | Smiley | woah :/
13:08 | Smiley | what's with the /w/ directory then?
13:24 | Smiley | yey, running now :)
13:33 | GLaDOS | Smiley: they use it if they're using rewrite rules.
13:45 | Smiley | ah ok
13:45 | Smiley | ok so I have the dump.... heh
13:53 | Smiley | What's next ¬_¬
13:59 | db48x | Smiley: keep it safe
13:59 | Smiley | :)
14:00 | Smiley | should I tar it or something and upload it to the archive?
14:02 | db48x | that'd be one way to keep it safe
14:02 | * | Smiley ponders what to name it
14:03 | Smiley | wikemacsorg....
14:03 | ersi | domaintld-yearmonthdate.something
14:03 | Smiley | wikemacsorg29012013.tgz
14:06 | Smiley | awww close :D
14:12 | Smiley | And upload it to the archive under the same name?
14:25 | Smiley | going up now
14:25 | Nemo_bis | Smiley: there's also an uploader.py
15:50 | Smiley | doh! :D
15:50 | schbiridi | German video site http://de.sevenload.com/ will delete all user "generated" videos on 28.02.2013
15:50 | Smiley | :<
15:50 | Smiley | "We are sorry but sevenload doesn't offer its service in your country."
15:55 | Smiley | herp.
15:55
🔗
|
xk_id |
what follows, if I crash the webserver I'm crawling? |
15:56
🔗
|
alard |
You won't be able to get more data from it. |
15:57
🔗
|
xk_id |
ever? |
15:57
🔗
|
xk_id |
hmm. |
15:57
🔗
|
alard |
No, as long as it's down. |
15:58
🔗
|
xk_id |
oh, but surely it will resume after short a while. |
15:58
🔗
|
alard |
And you're more likely to get blocked and cause frowns. |
15:58
🔗
|
xk_id |
will there be serious frowns? |
15:59
🔗
|
alard |
Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster. |
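[One way to read that advice in code, as a hedged sketch; the batch size, worker cap, and fetch function are all invented:]

    import concurrent.futures
    import urllib.request

    def fetch(url):
        # return True on success, False on any error
        try:
            urllib.request.urlopen(url).read()
            return True
        except Exception:
            return False

    def crawl_with_rampup(urls, batch_size=100, max_workers=5):
        # start with one thread; add one more after each clean batch,
        # and drop back to one at the first sign of visible trouble
        workers = 1
        for start in range(0, len(urls), batch_size):
            batch = urls[start:start + batch_size]
            with concurrent.futures.ThreadPoolExecutor(workers) as pool:
                results = list(pool.map(fetch, batch))
            if all(results) and workers < max_workers:
                workers += 1
            elif not all(results):
                workers = 1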
16:00
🔗
|
xk_id |
good idea |
16:49
🔗
|
alard |
SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue? |
16:51
🔗
|
SketchCow |
COntinue like download it? |
16:51
🔗
|
SketchCow |
Not download it. |
16:51
🔗
|
SketchCow |
But a mapping of all the people is somthing we should upload to archive.org |
16:52
🔗
|
SketchCow |
And we should have their crawlers put it on the roster |
16:55
🔗
|
balrog_ |
just exactly how big was mobileme? |
16:57
🔗
|
alard |
mobileme was 200TB. |
16:58
🔗
|
alard |
The archive.org crawlers probably won't download the audio and video, by the way. |
17:00
🔗
|
alard |
The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before) |
17:01
🔗
|
balrog_ |
how much has the trial downloaded? |
17:02
🔗
|
alard |
81GB. Part of that was without logging in and with earlier versions of the script. |
17:02
🔗
|
DFJustin |
"one file per user" or one line per user |
17:03 | alard | Hmm. As you may have noticed from the item name, this is the dyslexia edition.
19:11 | chronomex | lol
19:25 | godane | i'm uploading a 35min interview with the guy that made Mario
19:27 | godane | filename: E3Interview_Miyamoto_G4700_flv.flv
19:35 | omf__ | godane, do you know what year that video came from?
19:36 | Smiley | hmmm
19:36 | Smiley | what's in it?
19:36 | * | Smiley bets 2007
19:41 | DFJustin | if you like that, this may be worth a watch: https://archive.org/details/history-of-zelda-courtesy-of-zentendo
19:52 | godane | omf__: i think 2005
19:53 | godane | also found an interview with Bruce Campbell
19:53 | godane | problem is, 11 mins into the clip it goes very slow with no sound
19:54 | godane | it's very sad that it was not fixed
19:55 | godane | it's also 619MB
19:55 | godane | this will be archived anyways, even though they screwed it up
19:59 | godane | http://abandonware-magazines.org/