[00:57] *** qw3rty114 has quit IRC (Read error: Operation timed out)
[00:59] *** qw3rty114 has joined #internetarchive
[01:12] *** qw3rty115 has joined #internetarchive
[01:15] *** qw3rty116 has joined #internetarchive
[01:16] *** qw3rty114 has quit IRC (Ping timeout: 600 seconds)
[01:21] *** Stiletto has quit IRC (Ping timeout: 268 seconds)
[01:22] *** qw3rty115 has quit IRC (Ping timeout: 600 seconds)
[01:29] *** Stilett0- has joined #internetarchive
[01:29] *** Stilett0- is now known as Stiletto
[01:31] *** Stiletto has quit IRC (Client Quit)
[01:32] *** Stiletto has joined #internetarchive
[01:35] *** qw3rty117 has joined #internetarchive
[01:35] *** qw3rty116 has quit IRC (Ping timeout: 600 seconds)
[01:40] *** qw3rty118 has joined #internetarchive
[01:45] *** qw3rty117 has quit IRC (Ping timeout: 600 seconds)
[01:47] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[01:47] *** qw3rty118 has joined #internetarchive
[02:09] *** fredgido has quit IRC (Read error: Connection reset by peer)
[02:10] *** fredgido has joined #internetarchive
[04:31] *** qw3rty119 has joined #internetarchive
[04:36] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[04:41] *** Jasjar has joined #internetarchive
[04:47] *** odemg has quit IRC (Ping timeout: 615 seconds)
[04:53] *** odemg has joined #internetarchive
[07:06] *** deevious has quit IRC (Quit: deevious)
[07:09] *** DopefishJ has joined #internetarchive
[07:10] *** DFJustin has quit IRC (Ping timeout: 615 seconds)
[07:18] *** deevious has joined #internetarchive
[09:14] <Nemo_bis> JAA: archive.php tends to do funny things when syncing to the same item from different s3 uploads
[09:14] <Nemo_bis> If you need a parallel upload to the same item, your best chance is using a torrent upload and multiple seeders
[09:15] <Nemo_bis> So the sync happens only once, when the entire thing is complete, and you don't risk conflicts
[09:18] <Nemo_bis> Serial upload of many files on a big item is also problematic, see e.g. https://archive.org/history/crossref-pre-1909-scholarly-works where 6k tasks were needed to upload a mere 2200 files
[09:20] <Nemo_bis> By the time the next 1 GB file was uploaded, the previous task moving around hundreds of GB of the previous files was usually not over, so all sorts of duplicates popped up in the history/ directory etc.
[09:57] *** atomotic has joined #internetarchive
[10:49] <JAA> Nemo_bis: Ah, interesting, thanks. Wouldn't the same issue also occur though when archive.php's too slow compared to the uploads? I know I've had items in the past where the archive.php tasks piled up. Especially so when a random derive.php appeared inbetween which started copying the data to another machine, processing one file, and then getting aborted because it noticed that there were new tasks.
[10:50] <JAA> (I now usually use --no-derive to avoid that particular problem, but the archive.php tasks piling up can still happen. In fact, it's happening right now while I'm moving a large amount of data from one item to another.)
[12:02] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[12:24] *** atomotic has joined #internetarchive
[12:27] *** atomotic has quit IRC (Client Quit)
[13:56] *** deevious has quit IRC (Read error: Connection reset by peer)
[13:57] *** deevious has joined #internetarchive
[16:06] *** DopefishJ is now known as DFJustin
[18:26] <Nemo_bis> JAA: yes, that's the main issue. When the item gets too big, a number of assumptions about tasks start crumbling down
[18:27] <JAA> Nemo_bis: Right, but then parallel uploads shouldn't be any worse, right?
[18:28] <Nemo_bis> JAA: they are because each upload causes the entire item to be copied over
[18:29] <Nemo_bis> In the example above, every time I uploaded 1 GB then there were hundreds of GB being copied 
[18:29] <Nemo_bis> So it was just unable to ever catch up
[18:29] <Nemo_bis> Then of course it *could* be smarter and avoid duplicate work, but it isn't
[18:30] <Nemo_bis> Or at least this is what I figured
[18:31] <JAA> Really? I've never seen an archive.php task after the upload rsync the entire item contents. It merely copies the new file(s?) from the S3 server to the storage server, updates _files.xml etc.
[18:31] <JAA> derive.php yes, but that can be prevented with the corresponding header or --no-derive.
[18:36] <Nemo_bis> That's what's supposed to happen, yes, but at some point e.g. https://catalogd.archive.org/log/1067157994 it started syncing a bunch of history/ files
[18:37] <Nemo_bis> So if you have parallel uploads I suspect you could have even more problems with incomplete uploads
[18:38] <Nemo_bis> Then I'm not sure what was going on in that case, I had never seen such a thing before
[18:38] <Nemo_bis> I do know that the next time I uploaded the very same files to another (test) item with torrent it all went very smoothly 
[18:39] <JAA> I see.
[18:39] <JAA> Maybe two parallel uploads ended up on the same S3 server?
[18:40] <JAA> I think the directories are just named by the item, not some unique identifier of the upload.
[18:40] <JAA> So perhaps when the first archive.php task ran, it shoved the partially uploaded other file into the item as well, and then the second task did that history mangling thing.
[18:41] <Nemo_bis> Yes, that was one of my hypotheses but I didn't bother to compare the ids
[18:41] <JAA> Ah nope, there is a UUID in the path.
[18:42] <Nemo_bis> Everything would be fine if only at some point the tasks didn't take hours 
[18:43] <JAA> Yeah
[18:44] <JAA> Also, tasks are really slow at the moment it seems. A derive of mine has been running on the same WARC for 5 hours now. Ok, that file is large (80 GiB), but last time the derive for it ran through in ~15 minutes.
[18:46] <Nemo_bis> When was the last time?
[18:46] <Nemo_bis> I saw such hours-long WARC derives several times in the last week or two
[18:47] <JAA> About two months ago.
[18:47] <Nemo_bis> At least now the derive servers don't seem to be struggling with iowait
[20:03] <JAA> Oh, the derive finally finished after over 5.5 hours.
[20:22] *** Stilett0 has joined #internetarchive
[20:24] *** Stiletto has quit IRC (Ping timeout: 265 seconds)
[20:38] *** Stilett0 is now known as Stiletto