Time |
Nickname |
Message |
00:17
|
|
Jusque has quit IRC (Read error: Operation timed out) |
01:51
|
|
Jusque has joined #warrior |
06:19
|
|
nertzy3 has quit IRC (Read error: Operation timed out) |
07:55
|
|
Flashfire has joined #warrior |
07:55
|
|
Flashfire has quit IRC (Client Quit) |
07:57
|
|
Flashfire has joined #warrior |
10:48
|
|
Flashfire has quit IRC (Quit: Bye) |
12:11
|
|
Jusque_ has joined #warrior |
12:11
|
|
Jusque has quit IRC (Read error: Connection reset by peer) |
12:11
|
|
Jusque_ is now known as Jusque |
12:33
|
|
SmileyG has joined #warrior |
12:35
|
|
Smiley has quit IRC (Ping timeout: 252 seconds) |
18:09
|
|
Phoen1x_ is now known as Phoen1x |
18:47
|
|
midas3 has joined #warrior |
21:52
|
|
redlizard has joined #warrior |
21:53
|
arkiver |
pasting for the logs: |
21:53
|
arkiver |
<redlizard>arkiver: I wonder whether there would be value in having the tracker support combining jobs on the fly. Such that the tracker on job reservation can always return a list of atomic jobs, and the warrior will automatically iterate over that list and not report back until it has the whole set. That way, you never really have to think hard about job size again, because it's a value you can tune on the fly, and you just always store |
21:53
|
arkiver |
<redlizard>atomic jobs in the tracker. |
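The batching idea above could be sketched roughly as follows. This is a toy in-memory model, not how the real tracker or warrior works: `Tracker`, `reserve`, `report`, `run_warrior`, and `batch_size` are hypothetical names made up for illustration. The tracker only ever stores atomic jobs and hands out a list of them per reservation; the warrior works through the list and reports back once for the whole set.

```python
from collections import deque
from typing import Callable, Deque, List


class Tracker:
    """Toy in-memory tracker that stores only atomic jobs (hypothetical)."""

    def __init__(self, jobs: List[str], batch_size: int = 1) -> None:
        self.todo: Deque[str] = deque(jobs)
        self.done: List[str] = []
        self.batch_size = batch_size  # can be changed while the project runs

    def reserve(self) -> List[str]:
        """Hand out up to batch_size atomic jobs as one reservation."""
        count = min(self.batch_size, len(self.todo))
        return [self.todo.popleft() for _ in range(count)]

    def report(self, batch: List[str]) -> None:
        """Accept a single report covering a whole reservation."""
        self.done.extend(batch)


def run_warrior(tracker: Tracker, work: Callable[[str], None]) -> int:
    """Process reservations until the queue is empty; return the report count."""
    reports = 0
    while True:
        batch = tracker.reserve()
        if not batch:
            return reports
        for job in batch:  # iterate over the atomic jobs in this reservation
            work(job)
        tracker.report(batch)  # report only after the whole set is finished
        reports += 1
```

With five jobs and `batch_size=2`, the warrior in this sketch would make three reservations instead of five; raising `batch_size` later changes the effective job size without touching the stored jobs.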
21:53
|
arkiver |
<arkiver>Well that might be handy, but to be honest I think it might be a lot of work for a problem that's not really big currently. |
21:53
|
arkiver |
<redlizard>Because tracker performance is approximately never the bottleneck in an archival project anyway? |
21:53
|
arkiver |
<arkiver>Something else would be to support selecting multiple projects to work on. If a warrior hits the items/min limit, it will continue with a different project, etc. |
21:53
|
arkiver |
<arkiver>redlizard: not really, no |
21:53
|
arkiver |
<arkiver>targets are sometimes, but that's only on very big projects, just depends on the number of targets and their capacity then |
21:53
|
arkiver |
<arkiver>This project will be ~1 TB, not big |
21:53
|
arkiver |
<redlizard>Right. |
21:53
|
arkiver |
<arkiver>Or for example an optional URLs/min limit, where permission for a certain number of URLs would be requested by the warrior. |
21:53
|
arkiver |
<arkiver>That has been a problem in some projects, since we currently can't control the URL rate, only the item rate, and the number of URLs per item differs most of the time. |
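One way to picture the optional URLs/min limit described above is a token bucket that counts URLs rather than items. The sketch below is only an illustration under assumptions: `UrlBudget` and `request` are hypothetical names, nothing here is part of the existing tracker, and time is passed in explicitly so the behaviour is easy to follow. A warrior would ask for permission for some number of URLs and get back how many it may fetch right now.

```python
class UrlBudget:
    """Token bucket that meters URLs per minute (hypothetical sketch)."""

    def __init__(self, urls_per_minute: float, now: float = 0.0) -> None:
        self.capacity = urls_per_minute
        self.rate = urls_per_minute / 60.0  # URLs refilled per second
        self.tokens = urls_per_minute       # start with a full bucket
        self.last = now                     # time of the last refill, in seconds

    def request(self, wanted: int, now: float) -> int:
        """Return how many of the wanted URLs may be fetched at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted
```

In this model an item with a hundred thousand URLs would simply be granted its URLs in slices over time, instead of counting as one item against an items/min limit.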
21:53
|
arkiver |
<redlizard>And wildly unpredictable numbers, too. |
21:53
|
arkiver |
<arkiver>Well over time it becomes kind of predictable per items, but it can be a problem. |
21:53
|
arkiver |
<arkiver>item* |
21:53
|
arkiver |
<redlizard>The puresilence project had lots and lots of bands with 20 urls, and then the occasional band with actual fans with a hundred thousand urls. |
21:53
|
arkiver |
<arkiver>Yep, though purevolume is then again a little more exceptional when it comes to URLs/item. |
21:53
|
arkiver |
<redlizard>Fair enough. |
21:53
|
arkiver |
<arkiver>For example this project has very different numbers too, but more or less predictable over time. |
21:54
|
arkiver |
<arkiver>Not as extreme as purevolume |
21:54
|
arkiver |
<redlizard>Yeah, I don't see any four-digit numbers here. |
21:54
|
arkiver |
<arkiver>Yeah |
21:54
|
arkiver |
<arkiver>You might want to ask chfoo too for an opinion on this, but I think an optional URL rate could be a really good improvement. |
21:54
|
arkiver |
<arkiver>But chfoo might know of other more pressing issues/improvements. |
21:54
|
arkiver |
<redlizard>Is there a bug tracker? |
21:54
|
arkiver |
<Jens>Since we're talking about the tracker. Items should have a timeout, so if they're not done in, say 24 hours, they get requeued. |
21:54
|
arkiver |
<Jens>Right now we have to pester people with tracker access. |
21:54
|
arkiver |
<redlizard>Jens: Better to just requeue things when the queue runs out. That way you don't have to guess an expiry date. |
21:54
|
arkiver |
<Jens>That also works. Except when remaining jobs go low enough, you've got an infinite loop kind of situation. |
21:54
|
arkiver |
<redlizard>If that means the last 0.1% of items get done by 5 people simultaneously, so be it. |
21:54
|
arkiver |
<redlizard>Yeah, there definitely needs to be some sanity checks in this process. |
21:54
|
arkiver |
<arkiver>Should also be optional I think, but could be a nice improvement. |
21:54
|
arkiver |
<arkiver>Little box saying 'requeue automatically' or something if you know what I mean. |
21:54
|
arkiver |
<redlizard>Yeah. |
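The "requeue when the queue runs out" idea discussed above, with a sanity check against the infinite-loop situation and an optional switch, might look something like this. It is a minimal sketch under assumptions, not the real tracker: `RequeueingTracker`, `max_claims`, and `auto_requeue` are hypothetical names. Once the to-do queue is empty, unfinished items are handed out again, least-claimed first, but never to more than `max_claims` workers at once.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Set


class RequeueingTracker:
    """Toy tracker that re-issues unfinished items once the queue is empty."""

    def __init__(self, items: Iterable[str], max_claims: int = 5,
                 auto_requeue: bool = True) -> None:
        self.todo: Deque[str] = deque(items)
        self.claims: Dict[str, int] = {}  # unfinished item -> times handed out
        self.done: Set[str] = set()
        self.max_claims = max_claims      # sanity limit on duplicate work
        self.auto_requeue = auto_requeue  # the 'requeue automatically' box

    def reserve(self) -> Optional[str]:
        """Return the next item to work on, or None if nothing may be issued."""
        if self.todo:
            item = self.todo.popleft()
        elif self.auto_requeue:
            # Queue ran out: pick the outstanding item claimed the fewest times.
            candidates = [i for i, n in self.claims.items() if n < self.max_claims]
            if not candidates:
                return None
            item = min(candidates, key=lambda i: self.claims[i])
        else:
            return None
        self.claims[item] = self.claims.get(item, 0) + 1
        return item

    def complete(self, item: str) -> None:
        """Mark an item as finished; late duplicate reports are harmless."""
        self.claims.pop(item, None)
        self.done.add(item)
```

No expiry date has to be guessed in this variant; the cost is that the last few items may be done by several workers at the same time, bounded by `max_claims`.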
21:54
|
arkiver |
wooh spamming, sorry people |
21:54
|
JAA |
From #zetatheplank, for the record. |
21:56
|
redlizard |
So is there in fact a bugtracker for AT infrastructure improvements? |
21:57
|
arkiver |
tracker is here https://github.com/ArchiveTeam/universal-tracker |
21:57
|
arkiver |
Some enhancement ideas there I see |
21:58
|
arkiver |
Looks like plenty of stuff to do :), some nice ideas there |
21:59
|
|
sep332 has quit IRC (Read error: Operation timed out) |
22:02
|
redlizard |
Indeed. |
22:03
|
redlizard |
Clearly we should use the tracker itself to store feature improvements. Anyone wanting to do development can just get a ticket assigned to them by the tracker :) |
22:04
|
arkiver |
Hah :) |
22:07
|
redlizard |
By the way, I was wondering. |
22:07
|
redlizard |
Do we ever do enumeration and item enumeration in parallel? |
22:07
|
redlizard |
Or is it a strictly sequential pipeline? |
22:08
|
arkiver |
We make the lists of items, and load them in the tracker. |
22:08
|
arkiver |
Did you mean that? |
22:08
|
redlizard |
What I meant is, do you ever feed lists of items discovered *so far* into the item tracker, while discovery is still going? |
22:09
|
astrid |
we usually don't feed the whole list into the tracker because it often overloads the database to no purpose |
22:09
|
astrid |
computers aren't very good |
22:10
|
arkiver |
Ah, no, the data we get from a discovery project usually needs some checking or post-processing before it can be fed to the tracker again |
22:10
|
arkiver |
This could be because of decision that have not been taken yet, or because processing needs all discovered data (for example in the case of prioritization) |
22:10
|
arkiver |
decisions* |
22:11
|
arkiver |
Also note that most projects don't have a discovery part, only recently many projects needed discovery |
22:12
|
redlizard |
Ah. |
22:13
|
arkiver |
Manually though yeah, there have been projects when a discovery project and an archiving project were running at the same time. |
22:13
|
arkiver |
But everything was manual. It wasn't a very big problem to be honest. |
22:14
|
arkiver |
I remember one nightmare website. Its performance was going up and down during the day and it was basically a 'get as much as possible' project. So we'd have to look at the tracker and the performance of the website all the time, adjust item rate, etc. |
22:15
|
|
jut_ has joined #warrior |
22:17
|
arkiver |
this one for the record https://www.archiveteam.org/index.php?title=Xfire |
22:17
|
arkiver |
still up :) http://tracker.archiveteam.org/xfire/ |
22:18
|
arkiver |
should probably clear those to do items |
22:21
|
jut_ |
does archive team keep its own archives of past tracker leaderboards? |
22:22
|
arkiver |
we have logs, those can be used to reconstruct leaderboards I think |