#archiveteam 2013-01-29,Tue


Time Nickname Message
04:20 🔗 xk_id Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite.
06:45 🔗 Nemo_bis norbert79: http://sourceforge.net/projects/xowa
08:06 🔗 norbert79 Nemo_bis: Thanks! While I don't quite understand how this might help me, it's a good start. Thanks again :)
08:16 🔗 Nemo_bis norbert79: looks like it parses the wikitext on demand
08:24 🔗 norbert79 Nemo_bis: Yes, I realized what you are trying to tell me with it. Might be useful indeed
10:54 🔗 xk_id Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request?
10:55 🔗 turnkit Lord_Nigh if you are dumping mixed mode, try imgburn as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable.
10:56 🔗 Lord_Nigh does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on gibberish that appears in the 'right' place it's likely a drive firmware off-by-one bug in the sector count thing
10:57 🔗 Lord_Nigh seems to affect the entire line of usb dvd drives made by that manufacturer
10:59 🔗 turnkit (scratches head)
10:59 🔗 Lord_Nigh offsets meaning when digitally reading the audio areas
11:00 🔗 Lord_Nigh there's a block of gibberish which appears before the audio starts
11:00 🔗 Lord_Nigh and that has to be cut out
11:01 🔗 turnkit http://forum.imgburn.com/index.php?showtopic=5974
11:06 🔗 turnkit I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is two pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989
11:06 🔗 turnkit But I am not sure... do you think it's necessary... will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do much like the idea of a perfect clone if possible though.
11:10 🔗 alard xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?)
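A minimal sketch of the loop alard describes (send request, process response, sleep 1, repeat), in Python; the delay value and the choice to just keep response bodies in memory are placeholders, not anything from the actual crawler:

    import time
    import urllib.request

    def crawl(urls, delay=1.0):
        """Fetch each URL in turn, sleeping `delay` seconds after each response."""
        pages = {}
        for url in urls:
            with urllib.request.urlopen(url, timeout=30) as resp:
                pages[url] = resp.read()   # "process the response" (here: just keep it)
            time.sleep(delay)              # then wait before sending the next request
        return pages

Measured this way, the delay runs from the end of one response to the start of the next request, which is the simpler of the two options xk_id asked about.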
11:10 🔗 turnkit Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ?
11:11 🔗 xk_id alard: simplest maybe, but not most convenient :P
11:11 🔗 xk_id alard: need to speed things up a bit here...
11:11 🔗 xk_id alard: so I'm capturing when each request is made and delaying based on that
11:11 🔗 alard In that case, why wait?
11:11 🔗 ersi xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that
11:11 🔗 xk_id alard: cause, morals
11:12 🔗 ersi And it was like, a total of 4M~+ requests to be done, right?
11:12 🔗 xk_id ersi: I found that in their robots.txt they allow a spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s
11:12 🔗 * xk_id nods
11:12 🔗 xk_id maximum 4M. between 3 and 4M
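A sketch of reading the per-agent Crawl-delay xk_id found, using Python 3's urllib.robotparser (crawl_delay() needs Python 3.6+); the agent names in the loop are illustrative, not taken from gather.com's actual robots.txt:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://gather.com/robots.txt")
    rp.read()

    # Crawl-delay can differ per user-agent, as xk_id observed (0.5s for one
    # spider, 4s for another); crawl_delay() returns None if none is set.
    for agent in ("Googlebot", "Slurp", "*"):   # illustrative agent names
        print(agent, rp.crawl_delay(agent))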
11:12 🔗 alard Is it a small site? Or one with more than enough capacity?
11:13 🔗 xk_id It's an online social network.
11:13 🔗 xk_id http://gather.com
11:13 🔗 ersi is freshness important? (Like, if it's susceptible to... falling over and dying) If not, 1-2s sounds pretty fair
11:14 🔗 xk_id I need to finish quickly. less than a week.
11:14 🔗 ersi ouch
11:14 🔗 ersi with 1s wait, it'll take 46 days
11:14 🔗 xk_id I'm doing it in a distributed way
11:15 🔗 ersi (46 days based on a single thread - wait 1s between requests)
11:15 🔗 * xk_id nods
11:15 🔗 alard But, erm, if you're running a number of threads, why wait at all?
11:15 🔗 xk_id good question... I suppose one reason would be to avoid being identified as an attacker
11:15 🔗 alard For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread.
11:16 🔗 alard Are you using a disposable IP address?
11:16 🔗 xk_id EC2
11:16 🔗 alard You could just try without a delay and see what happens.
11:16 🔗 ersi And keep a watch, for when the banhammer falls and switch over to some new ones :)
11:16 🔗 alard Switch to more threads with a delay if they block you.
11:17 🔗 ersi Maybe randomise your useragent a little as well
11:17 🔗 xk_id I thought of randomising the user agent :D
11:17 🔗 ersi goodie
11:17 🔗 alard So you've got rid of the morals? :)
11:17 🔗 * ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around
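A sketch of the user-agent randomisation ersi and xk_id are talking about; the strings in the pool are just examples of Chrome variants of the era, not anything the site expects:

    import random
    import urllib.request

    # A small pool of browser user-agent strings (examples only).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    ]

    def fetch(url):
        # Attach a randomly chosen user-agent header to each request.
        req = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()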
11:18 🔗 xk_id I'm not yet decided how to go about this.
11:18 🔗 xk_id There are several factors; morals is one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, the many scholarly articles concerned with crawling that I've reviewed seem very careless about this.
11:20 🔗 xk_id but I also cannot go completely mental about this
11:20 🔗 xk_id and finish my crawl in a day :P
11:22 🔗 xk_id so basically: a) be polite. b) not be identified as an attacker. c) finish quickly.
11:24 🔗 xk_id from the pov of the website, it's better to have 5 machines each delaying their requests (individually) than not delaying their requests, right?
11:26 🔗 alard Delaying is better than not delaying?
11:26 🔗 xk_id well, certainly, if I only had one machine.
11:27 🔗 ersi The "Be polite"-coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time"
11:27 🔗 alard I'm not sure that 5 machines is better than 3 if the total requests/second is equal.
11:28 🔗 alard The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.)
11:28 🔗 ersi maybe only US hosting, seems slow for me as well
11:28 🔗 ersi and maybe crummy infra
11:29 🔗 xk_id it's a bit slow for me too
11:30 🔗 xk_id so is it pointless to delay between requests if I'm using more than one machine?
11:30 🔗 xk_id probably not, right?
11:31 🔗 alard It may be useful to avoid detection, but otherwise I think it's the number of requests that count.
11:32 🔗 xk_id detection-wise, yes. I was curious politeness-wise..
11:32 🔗 xk_id hmm
11:33 🔗 alard Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS.
11:33 🔗 xk_id hah
11:34 🔗 alard Given that it's slow already, it's probably good to keep an eye on the site while you're crawling.
11:34 🔗 xk_id do you think I could actually harm it?
11:34 🔗 alard Yes.
11:35 🔗 xk_id what would follow from that?
11:35 🔗 alard You're sending a lot of difficult questions.
11:35 🔗 xk_id hmm
11:36 🔗 alard It's also hard on their cache, since you're asking for a different page each time.
11:36 🔗 xk_id I think I'll do some delays across the cluster as well.
11:37 🔗 alard Can you pause and resume easily?
11:37 🔗 xk_id and perhaps I should do it during US-time night, I think most of their userbase is from there.
11:37 🔗 xk_id yes, I know the job queue in advance.
11:37 🔗 xk_id and I will work with that
11:38 🔗 alard Start slow and then increase your speed if you find that you can?
11:39 🔗 xk_id ok
11:44 🔗 xk_id You said that: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal.". However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong?
11:45 🔗 alard No, I think you're right. 5 workers with delays is more polite than 5 workers without delays.
11:45 🔗 xk_id This is an interesting Maths problem.
11:45 🔗 xk_id even if the delays don't take into account the operations of the other workers?
11:46 🔗 alard Of course: adding a delay means each worker is sending fewer requests per second.
11:46 🔗 xk_id but does this decrease the requests per second of the entire cluster? :)
11:47 🔗 xk_id hmm
11:47 🔗 alard Yes. So you can use a smaller cluster to get the same result.
11:47 🔗 alard number of workers * (1 / (request length + delay)) = total requests per second
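To put illustrative numbers on that formula (the ~0.5 s response time is an assumption, not a measurement of gather.com): 5 workers * (1 / (0.5 s + 1.0 s delay)) ≈ 3.3 requests/s, so 4,000,000 requests take about 14 days; with no delay, 5 * (1 / 0.5 s) = 10 requests/s, about 4.6 days. ersi's earlier 46-day figure is the single-worker case at one request per second: 4,000,000 s / 86,400 s per day ≈ 46 days.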
11:48 🔗 xk_id but what matters here is also the rate between two consecutive requests, regardless of where they come from inside the cluster.
11:48 🔗 xk_id *rate/time
11:48 🔗 alard Why does that matter?
11:49 🔗 xk_id because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second
11:49 🔗 xk_id no?
11:50 🔗 alard That is true, but it's unlikely that your workers remain synchronized.
11:50 🔗 alard (Unless you're implementing something difficult to ensure that they are.)
11:50 🔗 xk_id will adding delays in the code of each worker improve this, or will it remain practically neutral?
11:52 🔗 alard I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon.
11:52 🔗 xk_id So, it is not possible to say in advance if delays in the code will improve things in respect to this issue, if I don't synchronise the workers.
11:53 🔗 alard I don't think delays will help.
11:53 🔗 xk_id yes. I think you're right
11:53 🔗 alard They'll only increase your cost: because workers are sleeping part of the time, you need a larger cluster.
11:54 🔗 xk_id let's just hope my workers won't randomly synchronise haha
11:54 🔗 xk_id :P
11:55 🔗 xk_id but, interesting situation.
11:56 🔗 xk_id alard: thank you
12:00 🔗 alard xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :)
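If the workers ever did need help staying out of sync, the usual trick is a little random jitter on each delay rather than any coordination; a sketch, with arbitrary base and jitter values:

    import random
    import time

    def polite_sleep(base=1.0, jitter=0.5):
        # Sleep for the base delay plus a random extra, so workers that happen
        # to start in lockstep drift apart instead of hitting the site together.
        time.sleep(base + random.uniform(0, jitter))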
12:22 🔗 Smiley http://wikemacs.org/wiki/Main_Page - going away shortly
12:22 🔗 ersi mediawikigrab that shizzle
12:22 🔗 Smiley yeah I'm looking how now.
12:23 🔗 Smiley errrr
12:23 🔗 Smiley yeah how ? :S
12:23 🔗 * Smiley needs to go to lunch :<
12:23 🔗 Smiley http://code.google.com/p/wikiteam/
12:25 🔗 Smiley Is that still up to date?
12:26 🔗 Smiley it doesn't say anything about warc :/
12:29 🔗 ersi It's not doing WARC at all
12:29 🔗 ersi If I'm not mistaken, emijrp hacked it together
12:31 🔗 Smiley hmmm there is no index.php :S
12:32 🔗 Smiley ./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3
12:32 🔗 Smiley Checking index.php... http://wikemacs.org/index.php
12:32 🔗 Smiley Error in index.php, please, provide a correct path to index.php
12:32 🔗 Smiley :<
12:32 🔗 Smiley tried with /wiki/main_page too
12:33 🔗 alard http://wikemacs.org/w/index.php ?
12:37 🔗 ersi You need to find the api.php I think. I know the infoz is on the wikiteam page
12:47 🔗 Smiley http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server
13:04 🔗 Nemo_bis yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images
13:06 🔗 Smiley woah :/
13:06 🔗 Smiley whats with the /w/ directory then?
13:08 🔗 Smiley yey running now :)
13:24 🔗 GLaDOS Smiley: they use it if they're using rewrite rules.
13:33 🔗 Smiley ah ok
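One way to confirm an api.php path before starting a dump is the standard MediaWiki siteinfo query; a sketch against the /w/ path that worked above (any other wiki would just swap the URL):

    import json
    import urllib.request

    API = "http://wikemacs.org/w/api.php"

    # action=query&meta=siteinfo is a standard MediaWiki API call; a working
    # endpoint answers with JSON containing a "query" -> "general" section.
    url = API + "?action=query&meta=siteinfo&format=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        info = json.loads(resp.read().decode("utf-8"))
    print(info["query"]["general"]["sitename"])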
13:45 🔗 Smiley ok so I have the dump.... heh
13:45 🔗 Smiley Whats next ¬_¬
13:53 🔗 db48x Smiley: keep it safe
13:59 🔗 Smiley :)
13:59 🔗 Smiley should I tar it or something and upload it to the archive?
14:00 🔗 db48x that'd be one way to keep it safe
14:02 🔗 * Smiley ponders what to name it
14:02 🔗 Smiley wikemacsorg....
14:03 🔗 ersi domaintld-yearmonthdate.something
14:03 🔗 Smiley wikemacsorg29012013.tgz
14:03 🔗 Smiley awww close :D
14:06 🔗 Smiley And upload it to the archive under the same name?
14:12 🔗 Smiley going up now
14:25 🔗 Nemo_bis Smiley: there's also an uploader.py
14:25 🔗 Smiley doh! :D
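A sketch of the domain-plus-date naming ersi suggested, building the tarball from Python; the dump directory name is a placeholder for wherever dumpgenerator.py wrote its output:

    import tarfile
    import time

    dump_dir = "wikemacsorg-dump"   # placeholder path
    name = "wikemacsorg-%s.tar.gz" % time.strftime("%Y%m%d")   # e.g. wikemacsorg-20130129.tar.gz

    with tarfile.open(name, "w:gz") as tar:
        tar.add(dump_dir)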
15:50 🔗 schbiridi german video site http://de.sevenload.com/ will delete all user "generated" videos on 28.02.2013
15:50 🔗 Smiley :<
15:50 🔗 Smiley "We are sorry but sevenload doesn't offer its service in your country."
15:50 🔗 Smiley herp.
15:55 🔗 xk_id what follows, if I crash the webserver I'm crawling?
15:56 🔗 alard You won't be able to get more data from it.
15:57 🔗 xk_id ever?
15:57 🔗 xk_id hmm.
15:57 🔗 alard No, as long as it's down.
15:58 🔗 xk_id oh, but surely it will resume after a short while.
15:58 🔗 alard And you're more likely to get blocked and cause frowns.
15:58 🔗 xk_id will there be serious frowns?
15:59 🔗 alard Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster.
16:00 🔗 xk_id good idea
16:49 🔗 alard SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue?
16:51 🔗 SketchCow Continue like download it?
16:51 🔗 SketchCow Not download it.
16:51 🔗 SketchCow But a mapping of all the people is something we should upload to archive.org
16:52 🔗 SketchCow And we should have their crawlers put it on the roster
16:55 🔗 balrog_ just exactly how big was mobileme?
16:57 🔗 alard mobileme was 200TB.
16:58 🔗 alard The archive.org crawlers probably won't download the audio and video, by the way.
17:00 🔗 alard The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before)
17:01 🔗 balrog_ how much has the trial downloaded?
17:02 🔗 alard 81GB. Part of that was without logging in and with earlier versions of the script.
17:02 🔗 DFJustin "one file per user" or one line per user
17:03 🔗 alard Hmm. As you may have noticed from the item name, this is the dyslexia edition.
19:11 🔗 chronomex lol
19:25 🔗 godane i'm uploading a 35min interview of the guy that made Mario
19:27 🔗 godane filename: E3Interview_Miyamoto_G4700_flv.flv
19:35 🔗 omf__ godane, do you know what year that video came from?
19:36 🔗 Smiley hmmm
19:36 🔗 Smiley whats in it?
19:36 🔗 * Smiley bets 2007
19:41 🔗 DFJustin if you like that this may be worth a watch https://archive.org/details/history-of-zelda-courtesy-of-zentendo
19:52 🔗 godane omf__: i think 2005
19:53 🔗 godane also found an interview with Bruce Campbell
19:53 🔗 godane problem is 11 mins into the clip it goes very slow with no sound
19:54 🔗 godane it's very sad that it was not fixed
19:55 🔗 godane it's also 619mb
19:55 🔗 godane this will be archived even though it's screwed up anyways
19:59 🔗 godane http://abandonware-magazines.org/
