[00:57] *** qw3rty114 has quit IRC (Read error: Operation timed out)
[00:59] *** qw3rty114 has joined #internetarchive
[01:12] *** qw3rty115 has joined #internetarchive
[01:15] *** qw3rty116 has joined #internetarchive
[01:16] *** qw3rty114 has quit IRC (Ping timeout: 600 seconds)
[01:21] *** Stiletto has quit IRC (Ping timeout: 268 seconds)
[01:22] *** qw3rty115 has quit IRC (Ping timeout: 600 seconds)
[01:29] *** Stilett0- has joined #internetarchive
[01:29] *** Stilett0- is now known as Stiletto
[01:31] *** Stiletto has quit IRC (Client Quit)
[01:32] *** Stiletto has joined #internetarchive
[01:35] *** qw3rty117 has joined #internetarchive
[01:35] *** qw3rty116 has quit IRC (Ping timeout: 600 seconds)
[01:40] *** qw3rty118 has joined #internetarchive
[01:45] *** qw3rty117 has quit IRC (Ping timeout: 600 seconds)
[01:47] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[01:47] *** qw3rty118 has joined #internetarchive
[02:09] *** fredgido has quit IRC (Read error: Connection reset by peer)
[02:10] *** fredgido has joined #internetarchive
[04:31] *** qw3rty119 has joined #internetarchive
[04:36] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[04:41] *** Jasjar has joined #internetarchive
[04:47] *** odemg has quit IRC (Ping timeout: 615 seconds)
[04:53] *** odemg has joined #internetarchive
[07:06] *** deevious has quit IRC (Quit: deevious)
[07:09] *** DopefishJ has joined #internetarchive
[07:10] *** DFJustin has quit IRC (Ping timeout: 615 seconds)
[07:18] *** deevious has joined #internetarchive
[09:14] JAA: archive.php tends to do funny things when syncing to the same item from different S3 uploads
[09:14] If you need a parallel upload to the same item, your best bet is a torrent upload with multiple seeders
[09:15] That way the sync happens only once, when the entire thing is complete, and you don't risk conflicts
[09:18] Serial upload of many files into a big item is also problematic; see e.g. https://archive.org/history/crossref-pre-1909-scholarly-works where 6k tasks were needed to upload a mere 2200 files
[09:20] By the time the next 1 GB file was uploaded, the previous task moving around hundreds of GB of earlier files was usually not finished, so all sorts of duplicates popped up in the history/ directory etc.
[09:57] *** atomotic has joined #internetarchive
[10:49] Nemo_bis: Ah, interesting, thanks. Wouldn't the same issue also occur though when archive.php is too slow compared to the uploads? I know I've had items in the past where the archive.php tasks piled up. Especially so when a random derive.php appeared in between, which started copying the data to another machine, processed one file, and then got aborted because it noticed that there were new tasks.
[10:50] (I now usually use --no-derive to avoid that particular problem, but the archive.php tasks piling up can still happen. In fact, it's happening right now while I'm moving a large amount of data from one item to another.)
[12:02] *** atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[12:24] *** atomotic has joined #internetarchive
[12:27] *** atomotic has quit IRC (Client Quit)
[13:56] *** deevious has quit IRC (Read error: Connection reset by peer)
[13:57] *** deevious has joined #internetarchive
[16:06] *** DopefishJ is now known as DFJustin
[18:26] JAA: yes, that's the main issue. When the item gets too big, a number of assumptions about tasks start crumbling
[18:27] Nemo_bis: Right, but then parallel uploads shouldn't be any worse, right?
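As an aside to the serial-upload discussion above: a minimal sketch of that kind of upload with derive suppressed, using the internetarchive Python library (pip install internetarchive, credentials via ia configure). The item identifier and file names are placeholders, and as far as I know x-archive-queue-derive is the header that --no-derive sets; note that archive.php still runs once per uploaded file, which is how a 2200-file item can rack up thousands of tasks.

```python
from internetarchive import upload

# Placeholders for illustration; not a real item or real files.
ITEM = 'example-big-item'
FILES = ['part-0001.warc.gz', 'part-0002.warc.gz']

# Suppress derive.php via the S3 queue-derive header (to my knowledge,
# what --no-derive does); archive.php still runs once per uploaded file
# to sync it from the S3 server into the item.
responses = upload(ITEM, files=FILES,
                   headers={'x-archive-queue-derive': '0'})
for r in responses:
    print(r.status_code, r.request.url)
```

This doesn't avoid the per-file archive.php tasks; for truly parallel writes to one item, the torrent route suggested above is the safer option.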
[18:28] JAA: they are, because each upload causes the entire item to be copied over
[18:29] In the example above, every time I uploaded 1 GB, hundreds of GB were being copied
[18:29] So it was simply never able to catch up
[18:29] Of course it *could* be smarter and avoid the duplicate work, but it isn't
[18:30] Or at least that's what I figured
[18:31] Really? I've never seen an archive.php task after an upload rsync the entire item contents. It merely copies the new file(s?) from the S3 server to the storage server, updates _files.xml, etc.
[18:31] derive.php does, yes, but that can be prevented with the corresponding header or --no-derive.
[18:36] That's what's supposed to happen, yes, but at some point, e.g. https://catalogd.archive.org/log/1067157994, it started syncing a bunch of history/ files
[18:37] So if you have parallel uploads, I suspect you could have even more problems with incomplete uploads
[18:38] Then I'm not sure what was going on in that case; I had never seen such a thing before
[18:38] I do know that the next time I uploaded the very same files to another (test) item with torrent, it all went very smoothly
[18:39] I see.
[18:39] Maybe two parallel uploads ended up on the same S3 server?
[18:40] I think the directories are just named by the item, not some unique identifier of the upload.
[18:40] So perhaps when the first archive.php task ran, it shoved the partially uploaded other file into the item as well, and then the second task did that history mangling thing.
[18:41] Yes, that was one of my hypotheses, but I didn't bother to compare the IDs
[18:41] Ah, nope, there is a UUID in the path.
[18:42] Everything would be fine if only the tasks didn't take hours at some point
[18:43] Yeah
[18:44] Also, tasks are really slow at the moment, it seems. A derive of mine has been running on the same WARC for 5 hours now. OK, that file is large (80 GiB), but last time the derive for it ran through in ~15 minutes.
[18:46] When was the last time?
[18:46] I saw such hours-long WARC derives several times in the last week or two
[18:47] About two months ago.
[18:47] At least now the derive servers don't seem to be struggling with iowait
[20:03] Oh, the derive finally finished after over 5.5 hours.
[20:22] *** Stilett0 has joined #internetarchive
[20:24] *** Stiletto has quit IRC (Ping timeout: 265 seconds)
[20:38] *** Stilett0 is now known as Stiletto
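On the piled-up archive.php tasks mentioned at 10:49 and the hours-long tasks at 18:42: one way to avoid stacking new uploads onto a clogged item is to wait for its task queue to drain first. A sketch, assuming the get_tasks helper from the internetarchive Python library and that a plain get_tasks(identifier=...) call returns only outstanding (queued or running) tasks; the identifier and the 60-second polling interval are made up for illustration.

```python
import time
from internetarchive import get_tasks

ITEM = 'example-big-item'  # placeholder identifier

# Poll the catalog until no queued/running tasks remain for the item.
# Requires configured credentials (`ia configure`). Assumption: the
# default call returns only outstanding tasks; adjust its params if
# your library version also returns finished ones.
while True:
    pending = list(get_tasks(identifier=ITEM))
    if not pending:
        break
    print(f'{len(pending)} task(s) still pending; sleeping 60s')
    time.sleep(60)
print('Task queue drained; safe to start the next upload.')
```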