#warrior 2013-02-26,Tue

Time	Nickname	Message
06:09	omf_	Do we have any sites we use for testing spiders and software
06:10	omf_	I know we use wget now but I know we will need more advanced stuff later. I am working on converting some stuff I wrote for my own data mining to be more generalized so everyone can use them
06:10	omf_	Right now I just spider the wiki as a test
08:27	ersi	omf_: Not really, that I know of. We just prod sites we're gonna fetch afaik
11:33	soultcer	alard: What's the reason for limiting the concurrent tasks to 6?
12:25	ewook	besides the official reason, being nice to the site should be one :).
12:39	ersi	I would imagine it being related to not loading the connection or disk of the VM too much. </imagination>
18:52	alard	soultcer: Mainly memory (and disk, but that's less critical). And the limit prevents people from starting hundreds of tasks and then failing to complete any of them.
19:11	soultcer	alard: Well with all the ec2 machines running there will still be a lot of abandoned tasks. I assume the tracker will reassign them eventually, right?
19:15	soultcer	Oh
19:16	soultcer	What will happen with uploads that are incomplete, i.e. where the client shut off during the upload?
19:17	soultcer	On rsync, they go into a partial dir, which can be ignored when creating a pack
19:17	soultcer	But what happens on curlupload?
20:09	alard	soultcer: The abandoned tasks will be reassigned if someone clicks the button.
20:09	alard	HTTP uploads, that depends on the web server. The Nginx server we used so far kept uploads in a temporary directory, so that's fine.
20:10	soultcer	Good
