[04:20] Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite.
[06:45] norbert79: http://sourceforge.net/projects/xowa
[08:06] Nemo_bis: Thanks, I don't quite understand how this might help me yet, but it's a good start! Thanks again :)
[08:16] norbert79: looks like it parses the wikitext on demand
[08:24] Nemo_bis: Yes, I realized what you are trying to tell me with it. Might be useful indeed
[10:54] Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request?
[10:55] Lord_Nigh: if you are dumping mixed mode, try imgburn as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable.
[10:56] does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on gibberish that appears in the 'right' place it's likely a drive firmware off-by-1 bug in the sector count thing
[10:57] seems to affect the entire line of usb dvd drives made by that manufacturer
[10:59] (scratches head)
[10:59] offsets meaning when digitally reading the audio areas
[11:00] there's a block of gibberish which appears before the audio starts
[11:00] and that has to be cut out
[11:01] http://forum.imgburn.com/index.php?showtopic=5974
[11:06] I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is a two-pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989
[11:06] But I am not sure... do you think it's necessary... will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do much like the idea of a perfect clone if possible though.
[11:10] xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?)
[11:10] Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ?
[11:11] alard: simplest maybe, but not most convenient :P
[11:11] alard: need to speed things up a bit here...
[11:11] alard: so I'm capturing when each request is being made and delaying based on that
[11:11] In that case, why wait?
[11:11] xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that
[11:11] alard: cause, morals
[11:12] And it was like, a total of 4M~+ requests to be done, right?
[11:12] ersi: I found that in their robots.txt they allow a spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s
[11:12] * xk_id nods
[11:12] maximum 4M. between 3 and 4M
[11:12] Is it a small site? Or one with more than enough capacity?
[11:13] It's an online social network.
[11:13] http://gather.com
[11:13] is freshness important? (Like, if it's susceptible to.. fall over and die) If not, 1-2s sounds pretty fair
[11:14] I need to finish quickly. less than a week.
[11:14] ouch
[11:14] with a 1s wait, it'll take 46 days
[11:14] I'm doing it in a distributed way
[11:15] (46 days based on a single thread with a 1s wait between requests)
[11:15] * xk_id nods
[11:15] But, erm, if you're running a number of threads, why wait at all?
[11:15] good question... I suppose one reason would be to avoid being identified as an attacker
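A minimal sketch, in Python, of the two pacing options discussed above: sleeping a fixed second after each response (alard's suggestion), or measuring the delay from the start of the previous request (xk_id's approach). The crawl() function, the DELAY value and the page list are assumptions for illustration, not code from the actual crawler.

    import time
    import urllib.request

    DELAY = 1.0  # assumed politeness delay, in seconds

    def crawl(urls):
        for url in urls:
            started = time.time()
            with urllib.request.urlopen(url) as response:
                body = response.read()
            # ...parse/store body here...
            # Option A (alard): sleep a full DELAY after processing the response.
            time.sleep(DELAY)
            # Option B (xk_id): pace on request start times instead, sleeping only
            # whatever part of DELAY the request itself hasn't already used up:
            # time.sleep(max(0.0, DELAY - (time.time() - started)))

Option A guarantees a full second of quiet between one response and the next request; Option B keeps roughly one request start per second, so it runs a little faster when responses are slow.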
[11:15] For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread.
[11:16] Are you using a disposable IP address?
[11:16] EC2
[11:16] You could just try without a delay and see what happens.
[11:16] And keep watch for when the banhammer falls, and switch over to some new ones :)
[11:16] Switch to more threads with a delay if they block you.
[11:17] Maybe randomise your useragent a little as well
[11:17] I thought of randomising the user agent :D
[11:17] goodie
[11:17] So you've got rid of the morals? :)
[11:17] * ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around
[11:18] I haven't yet decided how to go about this.
[11:18] There are several factors. Morals are one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, the scholarly articles on crawling that I've reviewed seem very careless about this.
[11:20] but I also cannot go completely mental about this
[11:20] and finish my crawl in a day :P
[11:22] so basically: a) be polite. b) not be identified as an attacker. c) finish quickly.
[11:24] from the pov of the website, it's better to have 5 machines each delaying their requests (individually), than not delaying their requests, right?
[11:26] Delaying is better than not delaying?
[11:26] well, certainly, if I only had one machine.
[11:27] The "Be polite"-coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time"
[11:27] I'm not sure that 5 machines is better than 3 if the total requests/second is equal.
[11:28] The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.)
[11:28] maybe only US hosting, seems slow for me as well
[11:28] and maybe crummy infra
[11:29] it's a bit slow for me too
[11:30] so is it pointless to delay between requests if I'm using more than one machine?
[11:30] probably not, right?
[11:31] It may be useful to avoid detection, but otherwise I think it's the number of requests that counts.
[11:32] detection-wise, yes. I was curious politeness-wise..
[11:32] hmm
[11:33] Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS.
[11:33] hah
[11:34] Given that it's slow already, it's probably good to keep an eye on the site while you're crawling.
[11:34] do you think I could actually harm it?
[11:34] Yes.
[11:35] what would follow from that?
[11:35] You're sending a lot of difficult questions.
[11:35] hmm
[11:36] It's also hard on their cache, since you're asking for a different page each time.
[11:36] I think I'll do some delays across the cluster as well.
[11:37] Can you pause and resume easily?
[11:37] and perhaps I should do it during US-time night, I think most of their userbase is from there.
[11:37] yes, I know the job queue in advance.
[11:37] and I will work with that
[11:38] Start slow and then increase your speed if you find that you can?
[11:39] ok
[11:44] You said: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal." However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong?
[11:45] No, I think you're right. 5 workers with delays is more polite than 5 workers without delays.
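One hedged way the randomised user agent plus per-worker delay discussed above might look, again in Python; the USER_AGENTS list, the fetch() helper and the one-second default are illustrative assumptions rather than anything from the real crawler.

    import random
    import time
    import urllib.request

    # A few example desktop browser strings; any realistic, rotating set would do.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.56 Safari/537.17",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    ]

    def fetch(url, delay=1.0):
        # Pick a different browser-like User-Agent for each request.
        request = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        with urllib.request.urlopen(request) as response:
            body = response.read()
        time.sleep(delay)  # each worker still pauses between its own requests
        return body

Rotating the user agent only helps against fairly naive detection; the per-IP request rate is what a site is more likely to notice.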
[11:45] This is an interesting Maths problem.
[11:45] even if the delays don't take into account the operations of the other workers?
[11:46] Of course: adding a delay means each worker is sending fewer requests per second.
[11:46] but does this decrease the requests per second of the entire cluster? :)
[11:47] hmm
[11:47] Yes. So you can use a smaller cluster to get the same result.
[11:47] number of workers * (1 / (request length + delay)) = total requests per second
[11:48] but what matters here is also the rate between two consecutive requests, regardless of where they come from inside the cluster.
[11:48] *rate/time
[11:48] Why does that matter?
[11:49] because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second
[11:49] no?
[11:50] That is true, but it's unlikely that your workers remain synchronized.
[11:50] (Unless you're implementing something difficult to ensure that they are.)
[11:50] will adding delays in the code of each worker improve this, or will it remain practically neutral?
[11:52] I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon.
[11:52] So, it is not possible to say in advance if delays in the code will improve things with respect to this issue, if I don't synchronise the workers.
[11:53] I don't think delays will help.
[11:53] yes. I think you're right
[11:53] They'll only increase your cost, because you need a larger cluster because workers are sleeping part of the time.
[11:54] let's just hope my workers won't randomly synchronise haha
[11:54] :P
[11:55] but, interesting situation.
[11:56] alard: thank you
[12:00] xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :)
[12:22] http://wikemacs.org/wiki/Main_Page - going away shortly
[12:22] mediawikigrab that shizzle
[12:22] yeah, I'm looking at how now.
[12:23] errrr
[12:23] yeah how ? :S
[12:23] * Smiley needs to go to lunch :<
[12:23] http://code.google.com/p/wikiteam/
[12:25] Is that still up to date?
[12:26] it doesn't say anything about warc :/
[12:29] It's not doing WARC at all
[12:29] If I'm not mistaken, emijrp hacked it together
[12:31] hmmm there is no index.php :S
[12:32] ./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3
[12:32] Checking index.php... http://wikemacs.org/index.php
[12:32] Error in index.php, please, provide a correct path to index.php
[12:32] :<
[12:32] tried with /wiki/main_page too
[12:33] http://wikemacs.org/w/index.php ?
[12:37] You need to find the api.php I think. I know the infoz is on the wikiteam page
[12:47] http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server
[13:04] yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images
[13:06] woah :/
[13:06] what's with the /w/ directory then?
[13:08] yey running now :)
[13:24] Smiley: they use it if they're using rewrite rules.
[13:33] ah ok
[13:45] ok so I have the dump.... heh
[13:45] What's next ¬_¬
[13:53] Smiley: keep it safe
[13:59] :)
[13:59] should I tar it or something and upload it to the archive?
[14:00] that'd be one way to keep it safe
[14:02] * Smiley ponders what to name it
[14:02] wikemacsorg....
[14:03] domaintld-yearmonthdate.something
[14:03] wikemacsorg29012013.tgz
[14:03] awww close :D
[14:06] And upload it to the archive under the same name?
[14:12] going up now
[14:25] Smiley: there's also an uploader.py
[14:25] doh! :D
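For a sense of scale, the requests-per-second formula quoted earlier in the log can be worked through with some assumed numbers (5 workers, 1 s per response, 1 s delay); only the roughly 4M page count and the 46-day single-thread estimate come from the conversation itself.

    # total requests per second = workers * (1 / (request length + delay))
    workers = 5            # assumed cluster size
    request_length = 1.0   # assumed server response time, seconds
    delay = 1.0            # per-worker politeness sleep, seconds
    total_pages = 4000000

    total_rps = workers * (1.0 / (request_length + delay))
    eta_days = total_pages / total_rps / 86400
    print(round(total_rps, 2), "req/s, about", round(eta_days, 1), "days")  # 2.5 req/s, about 18.5 days
    # The 46-day figure mentioned earlier is the single-worker case at about
    # one request per second: 4,000,000 s / 86,400 ≈ 46.3 days.

With these made-up numbers the same five workers without any delay would finish in roughly nine days, which is alard's point that the delay mostly translates into paying for a larger cluster.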
[15:50] German video site http://de.sevenload.com/ will delete all user "generated" videos on 28.02.2013
[15:50] :<
[15:50] "We are sorry but sevenload doesn't offer its service in your country."
[15:50] herp.
[15:55] what follows if I crash the webserver I'm crawling?
[15:56] You won't be able to get more data from it.
[15:57] ever?
[15:57] hmm.
[15:57] No, as long as it's down.
[15:58] oh, but surely it will resume after a short while.
[15:58] And you're more likely to get blocked and cause frowns.
[15:58] will there be serious frowns?
[15:59] Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster.
[16:00] good idea
[16:49] SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue?
[16:51] Continue like download it?
[16:51] Not download it.
[16:51] But a mapping of all the people is something we should upload to archive.org
[16:52] And we should have their crawlers put it on the roster
[16:55] just exactly how big was mobileme?
[16:57] mobileme was 200TB.
[16:58] The archive.org crawlers probably won't download the audio and video, by the way.
[17:00] The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before)
[17:01] how much has the trial downloaded?
[17:02] 81GB. Part of that was without logging in and with earlier versions of the script.
[17:02] "one file per user" or one line per user?
[17:03] Hmm. As you may have noticed from the item name, this is the dyslexia edition.
[19:11] lol
[19:25] i'm uploading a 35min interview of the guy that made Mario
[19:27] filename: E3Interview_Miyamoto_G4700_flv.flv
[19:35] godane, do you know what year that video came from?
[19:36] hmmm
[19:36] what's in it?
[19:36] * Smiley bets 2007
[19:41] if you like that, this may be worth a watch: https://archive.org/details/history-of-zelda-courtesy-of-zentendo
[19:52] omf__: i think 2005
[19:53] also found an interview with bruce campbell
[19:53] problem is, 11 mins into the clip it goes very slow with no sound
[19:54] it's very sad that it was not fixed
[19:55] it's also 619MB
[19:55] this will be archived anyways, even though they screwed it up
[19:59] http://abandonware-magazines.org/
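Going back to alard's "start slow, then add more if it goes well" advice earlier in the log, one cautious single-threaded version might ramp up like this; crawl_gently(), the fetch() callable and every delay value here are purely illustrative assumptions.

    import time

    def crawl_gently(urls, fetch, start_delay=2.0, min_delay=0.5):
        # Start with a generous delay and shorten it only while requests keep
        # succeeding; on any failure, return to the slow pace and take a break.
        delay = start_delay
        for url in urls:
            try:
                fetch(url)
                delay = max(min_delay, delay * 0.99)  # going well: shave the delay a little
            except Exception:
                delay = start_delay                   # trouble: back right off
                time.sleep(30)                        # and give the site a breather
            time.sleep(delay)

Treating any exception as a signal to back off is deliberately conservative, in the spirit of the "keep an eye on the site while you're crawling" remark above.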