#internetarchive 2019-03-28,Thu


Time Nickname Message
00:57 🔗 qw3rty114 has quit IRC (Read error: Operation timed out)
00:59 🔗 qw3rty114 has joined #internetarchive
01:12 🔗 qw3rty115 has joined #internetarchive
01:15 🔗 qw3rty116 has joined #internetarchive
01:16 🔗 qw3rty114 has quit IRC (Ping timeout: 600 seconds)
01:21 🔗 Stiletto has quit IRC (Ping timeout: 268 seconds)
01:22 🔗 qw3rty115 has quit IRC (Ping timeout: 600 seconds)
01:29 🔗 Stilett0- has joined #internetarchive
01:29 🔗 Stilett0- is now known as Stiletto
01:31 🔗 Stiletto has quit IRC (Client Quit)
01:32 🔗 Stiletto has joined #internetarchive
01:35 🔗 qw3rty117 has joined #internetarchive
01:35 🔗 qw3rty116 has quit IRC (Ping timeout: 600 seconds)
01:40 🔗 qw3rty118 has joined #internetarchive
01:45 🔗 qw3rty117 has quit IRC (Ping timeout: 600 seconds)
01:47 🔗 qw3rty118 has quit IRC (Read error: Operation timed out)
01:47 🔗 qw3rty118 has joined #internetarchive
02:09 🔗 fredgido has quit IRC (Read error: Connection reset by peer)
02:10 🔗 fredgido has joined #internetarchive
04:31 🔗 qw3rty119 has joined #internetarchive
04:36 🔗 qw3rty118 has quit IRC (Read error: Operation timed out)
04:41 🔗 Jasjar has joined #internetarchive
04:47 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
04:53 🔗 odemg has joined #internetarchive
07:06 🔗 deevious has quit IRC (Quit: deevious)
07:09 🔗 DopefishJ has joined #internetarchive
07:10 🔗 DFJustin has quit IRC (Ping timeout: 615 seconds)
07:18 🔗 deevious has joined #internetarchive
09:14 🔗 Nemo_bis JAA: archive.php tends to do funny things when syncing to the same item from different s3 uploads
09:14 🔗 Nemo_bis If you need a parallel upload to the same item, your best chance is using a torrent upload and multiple seeders
09:15 🔗 Nemo_bis So the sync happens only once, when the entire thing is complete, and you don't risk conflicts
09:18 🔗 Nemo_bis Serial upload of many files on a big item is also problematic, see e.g. https://archive.org/history/crossref-pre-1909-scholarly-works where 6k tasks were needed to upload a mere 2200 files
09:20 🔗 Nemo_bis By the time the next 1 GB file was uploaded, the previous task, which was moving around hundreds of GB of earlier files, was usually not finished, so all sorts of duplicates popped up in the history/ directory etc.
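
The serial-upload pattern Nemo_bis describes, one S3 PUT per file with each PUT queuing its own sync task against an ever-growing item, roughly corresponds to a loop like the one below. This is a minimal sketch using the `internetarchive` Python library; the item identifier and file names are placeholders, and it is an illustration of the pattern rather than a reconstruction of the actual upload mentioned in the log.

```python
# Sketch of the per-file serial upload pattern discussed above, using the
# `internetarchive` Python library. Identifier and file names are placeholders.
from internetarchive import get_item

item = get_item("crossref-pre-1909-scholarly-works")  # example item named in the log

files = [f"part-{n:04d}.zip" for n in range(2200)]  # hypothetical file names

# One upload call per file: each PUT queues its own sync task against an item
# that keeps growing, which is the pile-up described above.
for path in files:
    item.upload(path, verbose=True)
```
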
09:57 🔗 atomotic has joined #internetarchive
10:49 🔗 JAA Nemo_bis: Ah, interesting, thanks. Wouldn't the same issue also occur, though, when archive.php is too slow compared to the uploads? I know I've had items in the past where the archive.php tasks piled up, especially when a random derive.php appeared in between, started copying the data to another machine, processed one file, and then got aborted because it noticed that there were new tasks.
10:50 🔗 JAA (I now usually use --no-derive to avoid that particular problem, but the archive.php tasks piling up can still happen. In fact, it's happening right now while I'm moving a large amount of data from one item to another.)
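
The --no-derive behaviour JAA mentions is also reachable from the `internetarchive` Python library via its `queue_derive` keyword. A minimal sketch, with the item identifier and file name as placeholder assumptions:

```python
# Sketch of an upload with the derive step disabled, analogous to the
# --no-derive flag mentioned above. Identifier and file name are placeholders.
from internetarchive import upload

upload(
    "my-test-item",          # hypothetical item identifier
    files=["data.warc.gz"],  # hypothetical file
    queue_derive=False,      # do not queue a derive.php task after the upload
    verbose=True,
)
```
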
12:02 🔗 atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
12:24 🔗 atomotic has joined #internetarchive
12:27 🔗 atomotic has quit IRC (Client Quit)
13:56 🔗 deevious has quit IRC (Read error: Connection reset by peer)
13:57 🔗 deevious has joined #internetarchive
16:06 🔗 DopefishJ is now known as DFJustin
18:26 🔗 Nemo_bis JAA: yes, that's the main issue. When the item gets too big, a number of assumptions about tasks start to break down
18:27 🔗 JAA Nemo_bis: Right, but then parallel uploads shouldn't be any worse, right?
18:28 🔗 Nemo_bis JAA: they are because each upload causes the entire item to be copied over
18:29 🔗 Nemo_bis In the example above, every time I uploaded 1 GB then there were hundreds of GB being copied
18:29 🔗 Nemo_bis So it was just unable to ever catch up
18:29 🔗 Nemo_bis Then of course it *could* be smarter and avoid duplicate work, but it isn't
18:30 🔗 Nemo_bis Or at least this is what I figured
18:31 🔗 JAA Really? I've never seen an archive.php task after the upload rsync the entire item contents. It merely copies the new file(s?) from the S3 server to the storage server, updates _files.xml etc.
18:31 🔗 JAA derive.php yes, but that can be prevented with the corresponding header or --no-derive.
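
The "corresponding header" JAA refers to is, to my understanding, the IAS3 `x-archive-queue-derive` header. A hedged sketch of a raw PUT to the S3-compatible endpoint follows; the endpoint, credentials, identifier, and file name are assumptions based on the public ias3 interface, not taken from this log.

```python
# Rough sketch of a raw IAS3 upload: a single PUT per file, with
# x-archive-queue-derive: 0 to keep a derive.php task from being queued.
# Credentials, identifier, and file name below are placeholders.
import requests

ACCESS_KEY = "YOUR_ACCESS_KEY"   # hypothetical credentials
SECRET_KEY = "YOUR_SECRET_KEY"

identifier = "my-test-item"      # hypothetical item
filename = "data.warc.gz"        # hypothetical file

with open(filename, "rb") as f:
    resp = requests.put(
        f"https://s3.us.archive.org/{identifier}/{filename}",
        data=f,
        headers={
            "authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
            "x-archive-queue-derive": "0",  # suppress the derive task
        },
    )
resp.raise_for_status()
```
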
18:36 🔗 Nemo_bis That's what's supposed to happen, yes, but at some point e.g. https://catalogd.archive.org/log/1067157994 it started syncing a bunch of history/ files
18:37 🔗 Nemo_bis So if you have parallel uploads I suspect you could have even more problems with incomplete uploads
18:38 🔗 Nemo_bis Then I'm not sure what was going on in that case, I had never seen such a thing before
18:38 🔗 Nemo_bis I do know that the next time I uploaded the very same files to another (test) item with torrent it all went very smoothly
18:39 🔗 JAA I see.
18:39 🔗 JAA Maybe two parallel uploads ended up on the same S3 server?
18:40 🔗 JAA I think the directories are just named by the item, not some unique identifier of the upload.
18:40 🔗 JAA So perhaps when the first archive.php task ran, it shoved the other, partially uploaded file into the item as well, and then the second task did that history mangling thing.
18:41 🔗 Nemo_bis Yes, that was one of my hypotheses but I didn't bother to compare the ids
18:41 🔗 JAA Ah nope, there is a UUID in the path.
18:42 🔗 Nemo_bis Everything would be fine if only at some point the tasks didn't take hours
18:43 🔗 JAA Yeah
18:44 🔗 JAA Also, tasks are really slow at the moment it seems. A derive of mine has been running on the same WARC for 5 hours now. Ok, that file is large (80 GiB), but last time the derive for it ran through in ~15 minutes.
18:46 🔗 Nemo_bis When was the last time?
18:46 🔗 Nemo_bis I saw such hours-long WARC derives several times in the last week or two
18:47 🔗 JAA About two months ago.
18:47 🔗 Nemo_bis At least now the derive servers don't seem to be struggling with iowait
20:03 🔗 JAA Oh, the derive finally finished after over 5.5 hours.
20:22 🔗 Stilett0 has joined #internetarchive
20:24 🔗 Stiletto has quit IRC (Ping timeout: 265 seconds)
20:38 🔗 Stilett0 is now known as Stiletto
