#warrior 2018-07-10,Tue


00:17 🔗 Jusque has quit IRC (Read error: Operation timed out)
01:51 🔗 Jusque has joined #warrior
06:19 🔗 nertzy3 has quit IRC (Read error: Operation timed out)
07:55 🔗 Flashfire has joined #warrior
07:55 🔗 Flashfire has quit IRC (Client Quit)
07:57 🔗 Flashfire has joined #warrior
10:48 🔗 Flashfire has quit IRC (Quit: Bye)
12:11 🔗 Jusque_ has joined #warrior
12:11 🔗 Jusque has quit IRC (Read error: Connection reset by peer)
12:11 🔗 Jusque_ is now known as Jusque
12:33 🔗 SmileyG has joined #warrior
12:35 🔗 Smiley has quit IRC (Ping timeout: 252 seconds)
18:09 🔗 Phoen1x_ is now known as Phoen1x
18:47 🔗 midas3 has joined #warrior
21:52 🔗 redlizard has joined #warrior
21:53 🔗 arkiver pasting for the logs:
21:53 🔗 arkiver <redlizard>arkiver: I wonder whether there would be value in having the tracker support combining jobs on the fly. Such that the tracker on job reservation can always return a list of atomic jobs, and the warrior will automatically iterate over that list and not report back until it has the whole set. That way, you never really have to think hard about job size again, because it's a value you can tune on the fly, and you just always store
21:53 🔗 arkiver <redlizard>atomic jobs in the tracker.
21:53 🔗 arkiver <arkiver>Well that might be handy, but to be honest I think it might be a lot of work for a problem that's not really big currently.
21:53 🔗 arkiver <redlizard>Because tracker performance is approximately never the bottleneck in an archival project anyway?
21:53 🔗 arkiver <arkiver>Something else would be to support selecting multiple projects to work on. If a warrior hits the items/min limit, it will continue with a different project, etc.
21:53 🔗 arkiver <arkiver>redlizard: not really, no
21:53 🔗 arkiver <arkiver>targets sometimes are, but that's only on very big projects; it just depends on the number of targets and their capacity then
21:53 🔗 arkiver <arkiver>This project will be ~1 TB, not big
21:53 🔗 arkiver <redlizard>Right.
21:53 🔗 arkiver <arkiver>Or for example an optional URLs/min limit, where permission for a certain number of URLs would be requested by the warrior.
21:53 🔗 arkiver <arkiver>That has been a problem in some projects, since we currently can't control the URL rate, only the item rate, and most of the time items have different numbers of URLs.
21:53 🔗 arkiver <redlizard>And wildly unpredictable numbers, too.
21:53 🔗 arkiver <arkiver>Well over time it becomes kind of predictable per item, but it can be a problem.
21:53 🔗 arkiver <redlizard>The puresilence project had lots and lots of bands with 20 urls, and then the occasional band with actual fans with a hundred thousand urls.
21:53 🔗 arkiver <arkiver>Yep, though purevolume is then again a little more exceptional when it comes to URLs/item.
21:53 🔗 arkiver <redlizard>Fair enough.
21:53 🔗 arkiver <arkiver>For example this project has very different numbers too, but more or less predictable over time.
21:54 🔗 arkiver <arkiver>Not as extreme as purevolume
21:54 🔗 arkiver <redlizard>Yeah, I don't see any four-digit numbers here.
21:54 🔗 arkiver <arkiver>Yeah
21:54 🔗 arkiver <arkiver>You might want to ask chfoo too for an opinion on this, but I think an optional URL rate could be a really good improvement.
21:54 🔗 arkiver <arkiver>But chfoo might know of other more pressing issues/improvements.
21:54 🔗 arkiver <redlizard>Is there a bug tracker?
21:54 🔗 arkiver <Jens>Since we're talking about the tracker: items should have a timeout, so if they're not done in, say, 24 hours, they get requeued.
21:54 🔗 arkiver <Jens>Right now we have to pester people with tracker access.
21:54 🔗 arkiver <redlizard>Jens: Better to just requeue things when the queue runs out. That way you don't have to guess an expiry date.
21:54 🔗 arkiver <Jens>That also works. Except when remaining jobs go low enough, you've got an infinite loop kind of situation.
21:54 🔗 arkiver <redlizard>If that means the last 0.1% of items get done by 5 people simultaneously, so be it.
21:54 🔗 arkiver <redlizard>Yeah, there definitely need to be some sanity checks in this process.
21:54 🔗 arkiver <arkiver>Should also be optional I think, but could be a nice improvement.
21:54 🔗 arkiver <arkiver>Little box saying 'requeue automatically' or something, if you know what I mean.
21:54 🔗 arkiver <redlizard>Yeah.
21:54 🔗 arkiver wooh spamming, sorry people
21:54 🔗 JAA From #zetatheplank, for the record.
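
The two tracker ideas pasted above, claims that hand out a batch of atomic items and an automatic requeue timeout, are easy to picture with a toy model. Below is a minimal Python sketch against a hypothetical SQLite-backed tracker; the schema and function names are illustrative only and do not come from universal-tracker, which is a separate codebase.

```python
# Toy tracker illustrating batched claims over atomic items plus an
# optional requeue timeout. Hypothetical schema and names throughout.
import sqlite3
import time

db = sqlite3.connect("tracker.db")
db.execute("""CREATE TABLE IF NOT EXISTS items (
    name TEXT PRIMARY KEY,
    claimed_at REAL,           -- NULL means unclaimed
    done INTEGER DEFAULT 0
)""")

BATCH_SIZE = 50            # tunable on the fly; storage stays atomic
CLAIM_TIMEOUT = 24 * 3600  # requeue items not finished within 24 hours

def claim_batch(batch_size=BATCH_SIZE):
    """Hand a warrior a list of atomic items; it reports back once per set."""
    now = time.time()
    rows = db.execute(
        "SELECT name FROM items WHERE done = 0 AND claimed_at IS NULL LIMIT ?",
        (batch_size,)).fetchall()
    names = [row[0] for row in rows]
    db.executemany("UPDATE items SET claimed_at = ? WHERE name = ?",
                   [(now, n) for n in names])
    db.commit()
    return names

def requeue_expired():
    """The 'requeue automatically' option: release stale claims."""
    cutoff = time.time() - CLAIM_TIMEOUT
    db.execute("UPDATE items SET claimed_at = NULL "
               "WHERE done = 0 AND claimed_at IS NOT NULL AND claimed_at < ?",
               (cutoff,))
    db.commit()
```

Because items stay atomic in storage, BATCH_SIZE can be retuned at any moment, which is the point of redlizard's suggestion, and requeue_expired() is one possible shape for the optional checkbox Jens and arkiver discuss.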
21:56 🔗 redlizard So is there in fact a bug tracker for AT infrastructure improvements?
21:57 🔗 arkiver tracker is here https://github.com/ArchiveTeam/universal-tracker
21:57 🔗 arkiver Some enhancement ideas there I see
21:58 🔗 arkiver Looks like plenty of stuff to do :), some nice ideas there
21:59 🔗 sep332 has quit IRC (Read error: Operation timed out)
22:02 🔗 redlizard Indeed.
22:03 🔗 redlizard Clearly we should use the tracker itself to store feature improvements. Anyone wanting to do development can just get a ticket assigned to them by the tracker :)
22:04 🔗 arkiver Hah :)
22:07 🔗 redlizard By the way, I was wondering.
22:07 🔗 redlizard Do we ever do item enumeration and item archiving in parallel?
22:07 🔗 redlizard Or is it a strictly sequential pipeline?
22:08 🔗 arkiver We make the lists of items, and load them in the tracker.
22:08 🔗 arkiver Did you mean that?
22:08 🔗 redlizard What I meant is, do you ever feed lists of items discovered *so far* into the item tracker, while discovery is still going?
22:09 🔗 astrid we usually don't feed the whole list into the tracker because it often overloads the database to no purpose
22:09 🔗 astrid computers aren't very good
22:10 🔗 arkiver Ah, no, the data we get from a discovery project usually needs some checking or post-processing before it can be fed back to the tracker
22:10 🔗 arkiver This could be because of decisions that have not been taken yet, or because processing needs all discovered data (for example in the case of prioritization)
22:11 🔗 arkiver Also note that most projects don't have a discovery part; only recently have many projects needed discovery
22:12 🔗 redlizard Ah.
22:13 🔗 arkiver Manually though, yeah, there have been projects where a discovery project and an archiving project were running at the same time.
22:13 🔗 arkiver But everything was manual. It wasn't a very big problem to be honest.
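
A one-function illustration of the constraint arkiver describes: post-processing such as prioritization operates over the complete discovered set, so items can't be fed to the tracker while discovery is still running. The names below are hypothetical.

```python
def queue_with_priority(discovered, score, queue_item):
    """Queue the full discovered set, highest-scoring items first.

    sorted() needs every discovered item up front, which is exactly why
    this kind of post-processing can't start until discovery finishes.
    """
    for item in sorted(set(discovered), key=score, reverse=True):
        queue_item(item)  # e.g. push the item name to the tracker queue
```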
22:14 🔗 arkiver I remember one nightmare website. Its performance was going up and down during the day and it was basically a 'get as much as possible' project. So we'd have to look at the tracker and the performance of the website all the time, adjust item rate, etc.
22:15 🔗 jut_ has joined #warrior
22:17 🔗 arkiver this one for the record https://www.archiveteam.org/index.php?title=Xfire
22:17 🔗 arkiver still up :) http://tracker.archiveteam.org/xfire/
22:18 🔗 arkiver should probably clear those to-do items
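
The optional URLs/min limit proposed earlier in the pasted conversation would have spared exactly this kind of manual babysitting. One plausible shape for it is a token bucket on the tracker side, with the warrior asking permission for an item's URL count before fetching; the sketch below is an assumption-laden illustration, not an existing tracker feature.

```python
# Hypothetical token-bucket limiter for a global URLs/min budget.
import time

class UrlRateLimiter:
    """Grants permission to fetch n URLs if the budget allows it."""

    def __init__(self, urls_per_minute):
        self.rate = urls_per_minute / 60.0      # tokens refilled per second
        self.capacity = float(urls_per_minute)  # burst allowance: one minute
        self.tokens = self.capacity
        self.last_refill = time.time()

    def request(self, n_urls):
        now = time.time()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if n_urls <= self.tokens:
            self.tokens -= n_urls
            return True   # warrior may fetch this item's URLs now
        return False      # over budget; retry later
```

A False return would pair naturally with arkiver's multi-project suggestion: a warrior throttled on one project could move on to the next project it has selected.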
22:21 🔗 jut_ does archive team keep its own archives of past tracker leaderboards?
22:22 🔗 arkiver we have logs, those can be used to reconstruct leaderboards I think
