Time | Nickname | Message |
06:09 | omf_ | Do we have any sites we use for testing spiders and software? |
06:10 | omf_ | I know we use wget now but I know we will need more advanced stuff later. I am working on converting some stuff I wrote for my own data mining to be more generalized so everyone can use them |
06:10 | omf_ | Right now I just spider the wiki as a test |
08:27 | ersi | omf_: Not really, that I know of. We just prod sites we're gonna fetch afaik |
11:33 | soultcer | alard: What's the reason for limiting the concurrent tasks to 6? |
12:25 | ewook | besides the official reason, being nice to the site should be one :). |
12:39 | ersi | I would imagine it being related to not loading the connection or disk of the VM too much. </imagination> |
18:52 | alard | soultcer: Mainly memory (and disk, but that's less critical). And the limit prevents people from starting hundreds of tasks and then failing to complete any of them. |
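For readers following along, here is a minimal sketch of what a cap on concurrent tasks looks like in code. It is not the actual client code: `run_limited` and its parameters are hypothetical names, and only the limit of 6 comes from the discussion above. The thread pool never runs more than `limit` tasks at once, which bounds memory and disk use and keeps a single client from starting more work than it can finish.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 6  # the limit mentioned in the chat; all other names are hypothetical


def run_limited(items, work, limit=MAX_CONCURRENT):
    """Run work(item) for every item with at most `limit` tasks in flight.

    Returns (results, peak), where peak is the highest number of tasks
    observed running at the same moment.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def wrapped(item):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return work(item)
        finally:
            with lock:
                running -= 1

    # The pool size is the cap: extra items wait until a worker is free.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = list(pool.map(wrapped, items))
    return results, peak
```

Calling `run_limited(range(20), some_task)` returns the results in input order, and the reported peak never exceeds the limit.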
19:11 | soultcer | alard: Well with all the ec2 machines running there will still be a lot of abandoned tasks. I assume the tracker will reassign them eventually, right? |
19:15 | soultcer | Oh |
19:16 | soultcer | What will happen with uploads that are incomplete, i.e. where the client shut off during the upload? |
19:17 | soultcer | On rsync, they go into a partial dir, which can be ignored when creating a pack |
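As an illustration of ignoring the partial dir when creating a pack, the sketch below lists the files under an upload directory while skipping any directory that holds rsync partials. The directory name `.rsync-partial` is an assumption (it would be whatever was passed to rsync's `--partial-dir` option), and `files_to_pack` is a hypothetical helper, not part of any existing packing script.

```python
import os

# Assumed name; use whatever directory was given to rsync --partial-dir.
PARTIAL_DIR = ".rsync-partial"


def files_to_pack(root, partial_dir=PARTIAL_DIR):
    """Return sorted paths (relative to root) of files that are safe to pack."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into a partial dir.
        dirnames[:] = [d for d in dirnames if d != partial_dir]
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)
```

Anything still sitting in a partial dir is an unfinished transfer, so it is simply left out of the list.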
19:17 | soultcer | But what happens on curlupload? |
20:09 | alard | soultcer: The abandoned tasks will be reassigned if someone clicks the button. |
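A rough sketch of what reassigning abandoned tasks could look like on the tracker side follows. The real tracker's data model is not shown in this log, so the `Tracker` class, its methods, and the age threshold are all hypothetical. `requeue_stale` plays the role of "the button": it is called explicitly and moves claims older than `max_age` seconds back into the queue.

```python
import time
from collections import deque


class Tracker:
    """Toy in-memory tracker; hypothetical, for illustration only."""

    def __init__(self, items):
        self.todo = deque(items)
        self.claimed = {}  # item -> (worker, time the item was claimed)

    def claim(self, worker, now=None):
        """Hand the next queued item to a worker, or None if the queue is empty."""
        if not self.todo:
            return None
        item = self.todo.popleft()
        self.claimed[item] = (worker, time.time() if now is None else now)
        return item

    def complete(self, item):
        """Mark an item as done so it can no longer be requeued."""
        self.claimed.pop(item, None)

    def requeue_stale(self, max_age, now=None):
        """Put items claimed more than max_age seconds ago back in the queue."""
        now = time.time() if now is None else now
        stale = [i for i, (_, t) in self.claimed.items() if now - t > max_age]
        for item in stale:
            del self.claimed[item]
            self.todo.append(item)
        return stale
```

Requeued items go to the back of the queue, so fresh work is handed out first and abandoned items are retried afterwards.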
20:09 | alard | HTTP uploads, that depends on the web server. The Nginx server we used so far kept uploads in a temporary directory, so that's fine. |
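The reason a temporary directory makes incomplete HTTP uploads harmless can be sketched as follows. This is not how Nginx is implemented; it only illustrates the general pattern of writing to a temporary file and moving it into place once the whole body has arrived. `store_upload` and its parameters are hypothetical, and the move is only atomic when both directories are on the same filesystem.

```python
import os
import tempfile


def store_upload(chunks, final_dir, name, tmp_dir):
    """Write an upload to tmp_dir and move it to final_dir only when complete.

    `chunks` is any iterable of bytes. If it raises part-way through (for
    example because the client disconnected), the temporary file is removed
    and nothing appears in final_dir.
    """
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        final_path = os.path.join(final_dir, name)
        # Atomic rename, assuming tmp_dir and final_dir share a filesystem.
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        # Clean up the partial file, then let the caller see the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, anything that lists `final_dir` only ever sees complete files, which is the property the rsync partial dir provides as well.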
20:10 | soultcer | Good |