#archiveteam 2013-01-29,Tue


Time Nickname Message
04:20 🔗 xk_id Hmm... I suppose doing a reduce with a set of pages from the same server is not very polite.
06:45 🔗 Nemo_bis norbert79: http://sourceforge.net/projects/xowa
08:06 🔗 norbert79 Nemo_bis: Thanks! While I don't quite understand how this might help me, it's a good start. Thanks again :)
08:16 🔗 Nemo_bis norbert79: looks like it parses the wikitext on demand
08:24 🔗 norbert79 Nemo_bis: Yes, I realized what you are trying to tell me with it. Might be useful indeed
10:54 🔗 xk_id Somebody here recommended that I wait 1s between two requests to the same host. Shall I measure the delay between the requests, or between the reply and the next request?
10:55 🔗 turnkit Lord_Nigh if you are dumping mixed mode, try imgburn as you can set it to generate a .cue, .ccd, and .mds file -- I think doing so will make it more easily re-burnable.
10:56 🔗 Lord_Nigh does imgburn know how to deal with offsets on drives? the particular cd drive i have has a LARGE offset which actually ends up an entire sector away from where the audio data starts; based on gibberish that appears in the 'right' place it's likely a drive firmware off-by-one bug in the sector count thing
10:57 🔗 Lord_Nigh seems to affect the entire line of usb dvd drives made by that manufacturer
10:59 🔗 turnkit (scratches head)
10:59 🔗 Lord_Nigh offsets meaning when digitally reading the audio areas
11:00 🔗 Lord_Nigh there's a block of gibberish which appears before the audio starts
11:00 🔗 Lord_Nigh and that has to be cut out
11:01 🔗 turnkit http://forum.imgburn.com/index.php?showtopic=5974
11:06 🔗 turnkit I guess the answer is "no" w/ imgburn... the suggestion here for 'exact' duping (?) is two pass burn... http://www.hydrogenaudio.org/forums/index.php?showtopic=31989
11:06 🔗 turnkit But I am not sure... do you think it's necessary... will the difference in ripping make any difference -- i.e. would anyone be aware of the difference? I do much like the idea of a perfect clone if possible though.
11:10 🔗 alard xk_id: I'd choose the simplest solution. (send request -> process response -> sleep 1 -> repeat, perhaps?)
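A minimal sketch of the loop alard describes (send request, process response, sleep 1, repeat), in Python; the delay value and the choice to just keep response bodies in memory are placeholders, not anything from the actual crawler:

    import time
    import urllib.request

    def crawl(urls, delay=1.0):
        """Fetch each URL in turn, sleeping `delay` seconds after each response."""
        pages = {}
        for url in urls:
            with urllib.request.urlopen(url, timeout=30) as resp:
                pages[url] = resp.read()   # "process the response" (here: just keep it)
            time.sleep(delay)              # then wait before sending the next request
        return pages

Measured this way, the delay runs from the end of one response to the start of the next request, which is the simpler of the two options xk_id asked about.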
11:10 🔗 turnkit Linux tool for multisession CD-EXTRA discs... http://www.phong.org/extricate/ ?
11:11 🔗 xk_id alard: simplest maybe, but not most convenient :P
11:11 🔗 xk_id alard: need to speed things up a bit here...
11:11 🔗 xk_id alard: so I'm capturing when each request is made and delaying based on that
11:11 🔗 alard In that case, why wait?
11:11 🔗 ersi xk_id: Depending on how nice you want to be - the longer the wait the better - since it was like, 4M requests you needed to do? I usually opt for "race to the finish", but not with that amount. So I guess 1s, if you're okay with that
11:11 🔗 xk_id alard: cause, morals
11:12 🔗 ersi And it was like, a total of 4M~+ requests to be done, right?
11:12 🔗 xk_id ersi: I found that in their robots.txt they allow a spider to wait 0.5s between requests. I'm basing mine on that. On the other hand, they ask a different spider to wait 4s
11:12 🔗 * xk_id nods
11:12 🔗 xk_id maximum 4M. between 3 and 4M
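A sketch of reading the per-agent Crawl-delay xk_id found, using Python 3's urllib.robotparser (crawl_delay() needs Python 3.6+); the agent names in the loop are illustrative, not taken from gather.com's actual robots.txt:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://gather.com/robots.txt")
    rp.read()

    # Crawl-delay can differ per user-agent, as xk_id observed (0.5s for one
    # spider, 4s for another); crawl_delay() returns None if none is set.
    for agent in ("Googlebot", "Slurp", "*"):   # illustrative agent names
        print(agent, rp.crawl_delay(agent))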
11:12 🔗 alard Is it a small site? Or one with more than enough capacity?
11:13 🔗 xk_id It's an online social network.
11:13 🔗 xk_id http://gather.com
11:13 🔗 ersi is freshness important? (Like, if it's susceptible to... falling over and dying) If not, 1-2s sounds pretty fair
11:14 🔗 xk_id I need to finish quickly. less than a week.
11:14 🔗 ersi ouch
11:14 🔗 ersi with 1s wait, it'll take 46 days
11:14 🔗 xk_id I'm doing it in a distributed way
11:15 🔗 ersi (46 days based on a single thread - wait 1s between requests)
11:15 🔗 * xk_id nods
11:15 🔗 alard But, erm, if you're running a number of threads, why wait at all?
11:15 🔗 xk_id good question... I suppose one reason would be to avoid being identified as an attacker
11:15 🔗 alard For the moral dimension, it's probably the number of requests that you send to the site that counts, not the number of requests per thread.
11:16 🔗 alard Are you using a disposable IP address?
11:16 🔗 xk_id EC2
11:16 🔗 alard You could just try without a delay and see what happens.
11:16 🔗 ersi And keep a watch, for when the banhammer falls and switch over to some new ones :)
11:16 🔗 alard Switch to more threads with a delay if they block you.
11:17 🔗 ersi Maybe randomise your useragent a little as well
11:17 🔗 xk_id I thought of randomising the user agent :D
11:17 🔗 ersi goodie
11:17 🔗 alard So you've got rid of the morals? :)
11:17 🔗 * ersi thanks chrome for updating all the freakin' time - plenty of useragent variants to go around
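A sketch of the user-agent randomisation ersi and xk_id are talking about; the strings in the pool are just examples of Chrome variants of the era, not anything the site expects:

    import random
    import urllib.request

    # A small pool of browser user-agent strings (examples only).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11",
    ]

    def fetch(url):
        # Attach a randomly chosen user-agent header to each request.
        req = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()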
11:18 🔗 xk_id I'm not yet decided how to go about this.
11:18 🔗 xk_id There are several factors; morals is one of them. Another is the fact that it's part of a dissertation project, so I actually need to have morals. On the other hand, the many scholarly articles concerned with crawling that I've reviewed seem very careless about this.
11:20 🔗 xk_id but I also cannot go completely mental about this
11:20 🔗 xk_id and finish my crawl in a day :P
11:22 🔗 xk_id so basically: a) be polite. b) not be identified as an attacker. c) finish quickly.
11:24 🔗 xk_id from the pov of the website, it's better to have 5 machines each delaying their requests (individually) than not delaying their requests, right?
11:26 🔗 alard Delaying is better than not delaying?
11:26 🔗 xk_id well, certainly, if I only had one machine.
11:27 🔗 ersi The "Be polite"-coin has two sides. Side A: "Finish as fast as possible" and Side B: "Load as little as possible, over a long time"
11:27 🔗 alard I'm not sure that 5 machines is better than 3 if the total requests/second is equal.
11:28 🔗 alard The site doesn't seem very fast, by the way. At least the groups pages take a long time to come up. (But that could be my location.)
11:28 🔗 ersi maybe only US hosting, seems slow for me as well
11:28 🔗 ersi and maybe crummy infra
11:29 🔗 xk_id it's a bit slow for me too
11:30 🔗 xk_id so is it pointless to delay between requests if I'm using more than one machine?
11:30 🔗 xk_id probably not, right?
11:31 🔗 alard It may be useful to avoid detection, but otherwise I think it's the number of requests that count.
11:32 🔗 xk_id detection-wise, yes. I was curious politeness-wise..
11:32 🔗 xk_id hmm
11:33 🔗 alard Politeness-wise I'd say it's the difference between the politeness of a DOS and a DDOS.
11:33 🔗 xk_id hah
11:34 🔗 alard Given that it's slow already, it's probably good to keep an eye on the site while you're crawling.
11:34 🔗 xk_id do you think I could actually harm it?
11:34 🔗 alard Yes.
11:35 🔗 xk_id what would follow from that?
11:35 🔗 alard You're sending a lot of difficult questions.
11:35 🔗 xk_id hmm
11:36 🔗 alard It's also hard on their cache, since you're asking for a different page each time.
11:36 🔗 xk_id I think I'll do some delays across the cluster as well.
11:37 🔗 alard Can you pause and resume easily?
11:37 🔗 xk_id and perhaps I should do it during US-time night, I think most of their userbase is from there.
11:37 🔗 xk_id yes, I know the job queue in advance.
11:37 🔗 xk_id and I will work with that
11:38 🔗 alard Start slow and then increase your speed if you find that you can?
11:39 🔗 xk_id ok
11:44 🔗 xk_id You said that: "I'm not sure that 5 machines is better than 3 if the total requests/second is equal.". However, I reckon that if I implement delays in the code for each worker, I may achieve a more polite crawl than otherwise. Am I wrong?
11:45 🔗 alard No, I think you're right. 5 workers with delays is more polite than 5 workers without delays.
11:45 🔗 xk_id This is an interesting Maths problem.
11:45 🔗 xk_id even if the delays don't take into account the operations of the other workers?
11:46 🔗 alard Of course: adding a delay means each worker is sending fewer requests per second.
11:46 🔗 xk_id but does this decrease the requests per second of the entire cluster? :)
11:47 🔗 xk_id hmm
11:47 🔗 alard Yes. So you can use a smaller cluster to get the same result.
11:47 🔗 alard number of workers * (1 / (request length + delay)) = total requests per second
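To put illustrative numbers on that formula (the ~0.5 s response time is an assumption, not a measurement of gather.com): 5 workers * (1 / (0.5 s + 1.0 s delay)) ≈ 3.3 requests/s, so 4,000,000 requests take about 14 days; with no delay, 5 * (1 / 0.5 s) = 10 requests/s, about 4.6 days. ersi's earlier 46-day figure is the single-worker case at one request per second: 4,000,000 s / 86,400 s per day ≈ 46 days.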
11:48 🔗 xk_id but what matters here is also the rate between two consecutive requests, regardless of where they come from inside the cluster.
11:48 🔗 xk_id *rate/time
11:48 🔗 alard Why does that matter?
11:49 🔗 xk_id because if I do 5 requests in 5 seconds, all five in the first second, that is more stressful than if I do one request per second
11:49 🔗 xk_id no?
11:50 🔗 alard That is true, but it's unlikely that your workers remain synchronized.
11:50 🔗 alard (Unless you're implementing something difficult to ensure that they are.)
11:50 🔗 xk_id will adding delays in the code of each worker improve this, or will it remain practically neutral?
11:52 🔗 alard I think that if each worker sends a request, waits for the response, then sends the next request, the workers will get out of sync pretty soon.
11:52 🔗 xk_id So, it is not possible to say in advance if delays in the code will improve things in respect to this issue, if I don't synchronise the workers.
11:53 🔗 alard I don't think delays will help.
11:53 🔗 xk_id yes. I think you're right
11:53 🔗 alard They'll only increase your cost: because workers are sleeping part of the time, you need a larger cluster.
11:54 🔗 xk_id let's just hope my workers won't randomly synchronise haha
11:54 🔗 xk_id :P
11:55 🔗 xk_id but, interesting situation.
11:56 🔗 xk_id alard: thank you
12:00 🔗 alard xk_id: Good luck. (They will synchronize if the site synchronizes its responses, of course. :)
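If the workers ever did need help staying out of sync, the usual trick is a little random jitter on each delay rather than any coordination; a sketch, with arbitrary base and jitter values:

    import random
    import time

    def polite_sleep(base=1.0, jitter=0.5):
        # Sleep for the base delay plus a random extra, so workers that happen
        # to start in lockstep drift apart instead of hitting the site together.
        time.sleep(base + random.uniform(0, jitter))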
12:22 🔗 Smiley http://wikemacs.org/wiki/Main_Page - going away shortly
12:22 🔗 ersi mediawikigrab that shizzle
12:22 🔗 Smiley yeah I'm looking how now.
12:23 🔗 Smiley errrr
12:23 🔗 Smiley yeah how ? :S
12:23 🔗 * Smiley needs to go to lunch :<
12:23 🔗 Smiley http://code.google.com/p/wikiteam/
12:25 🔗 Smiley Is that still up to date?
12:26 🔗 Smiley it doesn't say anything about warc :/
12:29 🔗 ersi It's not doing WARC at all
12:29 🔗 ersi If I'm not mistaken, emijrp hacked it together
12:31 🔗 Smiley hmmm there is no index.php :S
12:32 🔗 Smiley ./dumpgenerator.py --index=http://wikemacs.org/index.php --xml --images --delay=3
12:32 🔗 Smiley Checking index.php... http://wikemacs.org/index.php
12:32 🔗 Smiley Error in index.php, please, provide a correct path to index.php
12:32 🔗 Smiley :<
12:32 🔗 Smiley tried with /wiki/main_page too
12:33 🔗 alard http://wikemacs.org/w/index.php ?
12:37 🔗 ersi You need to find the api.php I think. I know the infoz is on the wikiteam page
12:47 🔗 Smiley http://code.google.com/p/wikiteam/wiki/NewTutorial#I_have_no_shell_access_to_server
13:04 🔗 Nemo_bis yes, python dumpgenerator.py --api=http://wikemacs.org/w/api.php --xml --images
13:06 🔗 Smiley woah :/
13:06 🔗 Smiley whats with the /w/ directory then?
13:08 🔗 Smiley yey running now :)
13:24 🔗 GLaDOS Smiley: they use it if they're using rewrite rules.
13:33 🔗 Smiley ah ok
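One way to confirm an api.php path before starting a dump is the standard MediaWiki siteinfo query; a sketch against the /w/ path that worked above (any other wiki would just swap the URL):

    import json
    import urllib.request

    API = "http://wikemacs.org/w/api.php"

    # action=query&meta=siteinfo is a standard MediaWiki API call; a working
    # endpoint answers with JSON containing a "query" -> "general" section.
    url = API + "?action=query&meta=siteinfo&format=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        info = json.loads(resp.read().decode("utf-8"))
    print(info["query"]["general"]["sitename"])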
13:45 🔗 Smiley ok so I have the dump.... heh
13:45 🔗 Smiley Whats next ¬_¬
13:53 🔗 db48x Smiley: keep it safe
13:59 🔗 Smiley :)
13:59 🔗 Smiley should I tar it or something and upload it to the archive?
14:00 🔗 db48x that'd be one way to keep it safe
14:02 🔗 * Smiley ponders what to name it
14:02 🔗 Smiley wikemacsorg....
14:03 🔗 ersi domaintld-yearmonthdate.something
14:03 🔗 Smiley wikemacsorg29012013.tgz
14:03 🔗 Smiley awww close :D
14:06 🔗 Smiley And upload it to the archive under the same name?
14:12 🔗 Smiley going up now
14:25 🔗 Nemo_bis Smiley: there's also an uploader.py
14:25 🔗 Smiley doh! :D
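A sketch of the domain-plus-date naming ersi suggested, building the tarball from Python; the dump directory name is a placeholder for wherever dumpgenerator.py wrote its output:

    import tarfile
    import time

    dump_dir = "wikemacsorg-dump"   # placeholder path
    name = "wikemacsorg-%s.tar.gz" % time.strftime("%Y%m%d")   # e.g. wikemacsorg-20130129.tar.gz

    with tarfile.open(name, "w:gz") as tar:
        tar.add(dump_dir)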
15:50 🔗 schbiridi german video site http://de.sevenload.com/ will delete all user "generated" videos on 28.02.2013
15:50 🔗 Smiley :<
15:50 🔗 Smiley "We are sorry but sevenload doesn't offer its service in your country."
15:50 🔗 Smiley herp.
15:55 🔗 xk_id what follows, if I crash the webserver I'm crawling?
15:56 🔗 alard You won't be able to get more data from it.
15:57 🔗 xk_id ever?
15:57 🔗 xk_id hmm.
15:57 🔗 alard No, as long as it's down.
15:58 🔗 xk_id oh, but surely it will resume after a short while.
15:58 🔗 alard And you're more likely to get blocked and cause frowns.
15:58 🔗 xk_id will there be serious frowns?
15:59 🔗 alard Heh. Who knows? Look, if I were you I would just crawl as fast as I could without causing any visible trouble. Start slow -- with one thread, for example -- then add more if it goes well and you want to go a little faster.
16:00 🔗 xk_id good idea
16:49 🔗 alard SketchCow: The Xanga trial has run out of usernames. Current estimate is still 35TB for everything. Do you want to continue?
16:51 🔗 SketchCow Continue like download it?
16:51 🔗 SketchCow Not download it.
16:51 🔗 SketchCow But a mapping of all the people is something we should upload to archive.org
16:52 🔗 SketchCow And we should have their crawlers put it on the roster
16:55 🔗 balrog_ just exactly how big was mobileme?
16:57 🔗 alard mobileme was 200TB.
16:58 🔗 alard The archive.org crawlers probably won't download the audio and video, by the way.
17:00 🔗 alard The list of users is here, http://archive.org/details/archiveteam-xanga-userlist-20130142 (not sure if I had linked that before)
17:01 🔗 balrog_ how much has the trial downloaded?
17:02 🔗 alard 81GB. Part of that was without logging in and with earlier versions of the script.
17:02 🔗 DFJustin "one file per user" or one line per user
17:03 🔗 alard Hmm. As you may have noticed from the item name, this is the dyslexia edition.
19:11 🔗 chronomex lol
19:25 🔗 godane i'm uploading a 35min interview of the guy that made Mario
19:27 🔗 godane filename: E3Interview_Miyamoto_G4700_flv.flv
19:35 🔗 omf__ godane, do you know what year that video came from?
19:36 🔗 Smiley hmmm
19:36 🔗 Smiley whats in it?
19:36 🔗 * Smiley bets 2007
19:41 🔗 DFJustin if you like that this may be worth a watch https://archive.org/details/history-of-zelda-courtesy-of-zentendo
19:52 🔗 godane omf__: i think 2005
19:53 🔗 godane also found an interview with Bruce Campbell
19:53 🔗 godane problem is 11 mins into the clip it goes very slow with no sound
19:54 🔗 godane it's very sad that it was not fixed
19:55 🔗 godane it's also 619mb
19:55 🔗 godane this will be archived even though it's screwed up anyways
19:59 🔗 godane http://abandonware-magazines.org/
