Time |
Nickname |
Message |
00:57
🔗
|
|
qw3rty114 has quit IRC (Read error: Operation timed out) |
00:59
🔗
|
|
qw3rty114 has joined #internetarchive |
01:12
🔗
|
|
qw3rty115 has joined #internetarchive |
01:15
🔗
|
|
qw3rty116 has joined #internetarchive |
01:16
🔗
|
|
qw3rty114 has quit IRC (Ping timeout: 600 seconds) |
01:21
🔗
|
|
Stiletto has quit IRC (Ping timeout: 268 seconds) |
01:22
🔗
|
|
qw3rty115 has quit IRC (Ping timeout: 600 seconds) |
01:29
🔗
|
|
Stilett0- has joined #internetarchive |
01:29
🔗
|
|
Stilett0- is now known as Stiletto |
01:31
🔗
|
|
Stiletto has quit IRC (Client Quit) |
01:32
🔗
|
|
Stiletto has joined #internetarchive |
01:35
🔗
|
|
qw3rty117 has joined #internetarchive |
01:35
🔗
|
|
qw3rty116 has quit IRC (Ping timeout: 600 seconds) |
01:40
🔗
|
|
qw3rty118 has joined #internetarchive |
01:45
🔗
|
|
qw3rty117 has quit IRC (Ping timeout: 600 seconds) |
01:47
🔗
|
|
qw3rty118 has quit IRC (Read error: Operation timed out) |
01:47
🔗
|
|
qw3rty118 has joined #internetarchive |
02:09
🔗
|
|
fredgido has quit IRC (Read error: Connection reset by peer) |
02:10
🔗
|
|
fredgido has joined #internetarchive |
04:31
🔗
|
|
qw3rty119 has joined #internetarchive |
04:36
🔗
|
|
qw3rty118 has quit IRC (Read error: Operation timed out) |
04:41
🔗
|
|
Jasjar has joined #internetarchive |
04:47
🔗
|
|
odemg has quit IRC (Ping timeout: 615 seconds) |
04:53
🔗
|
|
odemg has joined #internetarchive |
07:06
🔗
|
|
deevious has quit IRC (Quit: deevious) |
07:09
🔗
|
|
DopefishJ has joined #internetarchive |
07:10
🔗
|
|
DFJustin has quit IRC (Ping timeout: 615 seconds) |
07:18
🔗
|
|
deevious has joined #internetarchive |
09:14
🔗
|
Nemo_bis |
JAA: archive.php tends to do funny things when syncing to the same item from different s3 uploads |
09:14
🔗
|
Nemo_bis |
If you need a parallel upload to the same item, your best chance is using a torrent upload and multiple seeders |
09:15
🔗
|
Nemo_bis |
So the sync happens only once, when the entire thing is complete, and you don't risk conflicts |
09:18
🔗
|
Nemo_bis |
Serial upload of many files on a big item is also problematic, see e.g. https://archive.org/history/crossref-pre-1909-scholarly-works where 6k tasks were needed to upload a mere 2200 files |
09:20
🔗
|
Nemo_bis |
By the time the next 1 GB file was uploaded, the previous task moving around hundreds of GB of the previous files was usually not over, so all sorts of duplicates popped up in the history/ directory etc. |
09:57
🔗
|
|
atomotic has joined #internetarchive |
10:49
🔗
|
JAA |
Nemo_bis: Ah, interesting, thanks. Wouldn't the same issue also occur though when archive.php's too slow compared to the uploads? I know I've had items in the past where the archive.php tasks piled up. Especially so when a random derive.php appeared in between, which started copying the data to another machine, processing one file, and then getting aborted because it noticed that there were new tasks. |
10:50
🔗
|
JAA |
(I now usually use --no-derive to avoid that particular problem, but the archive.php tasks piling up can still happen. In fact, it's happening right now while I'm moving a large amount of data from one item to another.) |
12:02
🔗
|
|
atomotic has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
12:24
🔗
|
|
atomotic has joined #internetarchive |
12:27
🔗
|
|
atomotic has quit IRC (Client Quit) |
13:56
🔗
|
|
deevious has quit IRC (Read error: Connection reset by peer) |
13:57
🔗
|
|
deevious has joined #internetarchive |
16:06
🔗
|
|
DopefishJ is now known as DFJustin |
18:26
🔗
|
Nemo_bis |
JAA: yes, that's the main issue. When the item gets too big, a number of assumptions about tasks start crumbling down |
18:27
🔗
|
JAA |
Nemo_bis: Right, but then parallel uploads shouldn't be any worse, right? |
18:28
🔗
|
Nemo_bis |
JAA: they are because each upload causes the entire item to be copied over |
18:29
🔗
|
Nemo_bis |
In the example above, every time I uploaded 1 GB, hundreds of GB were being copied |
18:29
🔗
|
Nemo_bis |
So it was just unable to ever catch up |
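The "unable to ever catch up" effect described above can be illustrated with a toy queue model. This is only a sketch under stated assumptions: one task is queued per finished upload, tasks for an item run strictly one after another, and the interval and duration values are invented for illustration. None of these names or numbers come from the Internet Archive task system.

```python
def simulate_backlog(n_uploads, upload_interval, task_duration):
    """Count tasks still unfinished at the moment the last upload completes.

    Hypothetical model: upload i finishes at i * upload_interval and queues
    one task; tasks for the item run serially, each taking task_duration.
    All times are in the same arbitrary unit (e.g. minutes).
    """
    worker_free_at = 0
    finish_times = []
    for i in range(1, n_uploads + 1):
        arrival = i * upload_interval
        # A task starts when it has been queued AND the previous task is done.
        start = max(arrival, worker_free_at)
        worker_free_at = start + task_duration
        finish_times.append(worker_free_at)
    last_arrival = n_uploads * upload_interval
    return sum(1 for t in finish_times if t > last_arrival)


if __name__ == "__main__":
    # Tasks faster than uploads: only the newest task is still pending.
    print(simulate_backlog(10, upload_interval=10, task_duration=5))
    # Tasks slower than uploads: the queue keeps growing.
    print(simulate_backlog(10, upload_interval=10, task_duration=30))
```

As long as a task takes less time than the gap between uploads, the queue stays short. Once each task takes longer than that gap, every further upload adds to the pile.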
18:29
🔗
|
Nemo_bis |
Then of course it *could* be smarter and avoid duplicate work, but it isn't |
18:30
🔗
|
Nemo_bis |
Or at least this is what I figured |
18:31
🔗
|
JAA |
Really? I've never seen an archive.php task after the upload rsync the entire item contents. It merely copies the new file(s?) from the S3 server to the storage server, updates _files.xml etc. |
18:31
🔗
|
JAA |
derive.php yes, but that can be prevented with the corresponding header or --no-derive. |
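For readers unfamiliar with "the corresponding header": a minimal sketch of how a request to the S3-like upload endpoint might be prepared with derive queuing switched off. The endpoint, the `LOW access:secret` authorization format, and the `x-archive-queue-derive` header are given here as recalled from the public IAS3 documentation and should be checked against it. The function name, item identifier, and keys are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import quote

# Assumed S3-like endpoint; verify against the current IAS3 documentation.
S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(identifier, filename, access_key, secret_key,
                         queue_derive=False):
    """Return (url, headers) for a PUT upload; hypothetical helper.

    With queue_derive=False the header asks the server not to queue a
    derive task after the upload, which is what --no-derive refers to.
    """
    url = f"{S3_ENDPOINT}/{quote(identifier)}/{quote(filename)}"
    headers = {
        "authorization": f"LOW {access_key}:{secret_key}",
        "x-archive-queue-derive": "1" if queue_derive else "0",
    }
    return url, headers


if __name__ == "__main__":
    url, headers = build_upload_request(
        "example-item", "my file.warc.gz", "ACCESS", "SECRET"
    )
    print(url)
    print(headers["x-archive-queue-derive"])
```

The returned pair could be handed to any HTTP client for the actual PUT. As the discussion notes, this only suppresses the derive task, not the archive.php task that follows each upload.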
18:36
🔗
|
Nemo_bis |
That's what's supposed to happen, yes, but at some point e.g. https://catalogd.archive.org/log/1067157994 it started syncing a bunch of history/ files |
18:37
🔗
|
Nemo_bis |
So if you have parallel uploads I suspect you could have even more problems with incomplete uploads |
18:38
🔗
|
Nemo_bis |
Then I'm not sure what was going on in that case, I had never seen such a thing before |
18:38
🔗
|
Nemo_bis |
I do know that the next time I uploaded the very same files to another (test) item with torrent it all went very smoothly |
18:39
🔗
|
JAA |
I see. |
18:39
🔗
|
JAA |
Maybe two parallel uploads ended up on the same S3 server? |
18:40
🔗
|
JAA |
I think the directories are just named by the item, not some unique identifier of the upload. |
18:40
🔗
|
JAA |
So perhaps when the first archive.php task ran, it shoved the partially uploaded other file into the item as well, and then the second task did that history mangling thing. |
18:41
🔗
|
Nemo_bis |
Yes, that was one of my hypotheses but I didn't bother to compare the ids |
18:41
🔗
|
JAA |
Ah nope, there is a UUID in the path. |
18:42
🔗
|
Nemo_bis |
Everything would be fine if only at some point the tasks didn't take hours |
18:43
🔗
|
JAA |
Yeah |
18:44
🔗
|
JAA |
Also, tasks are really slow at the moment it seems. A derive of mine has been running on the same WARC for 5 hours now. Ok, that file is large (80 GiB), but last time the derive for it ran through in ~15 minutes. |
18:46
🔗
|
Nemo_bis |
When was the last time? |
18:46
🔗
|
Nemo_bis |
I saw such hours-long WARC derives several times in the last week or two |
18:47
🔗
|
JAA |
About two months ago. |
18:47
🔗
|
Nemo_bis |
At least now the derive servers don't seem to be struggling with iowait |
20:03
🔗
|
JAA |
Oh, the derive finally finished after over 5.5 hours. |
20:22
🔗
|
|
Stilett0 has joined #internetarchive |
20:24
🔗
|
|
Stiletto has quit IRC (Ping timeout: 265 seconds) |
20:38
🔗
|
|
Stilett0 is now known as Stiletto |