Time |
Nickname |
Message |
04:16
|
|
VADemon has quit IRC (Read error: Connection reset by peer) |
06:56
|
|
JesseW has quit IRC (Read error: Operation timed out) |
09:06
|
|
demize has joined #internetarchive.bak |
09:19
|
|
closure has quit IRC (Ping timeout: 250 seconds) |
09:19
|
|
closure has joined #internetarchive.bak |
09:20
|
|
svchfoo3 sets mode: +o closure |
09:21
|
demize |
Is it to be expected to get a bunch of 403 errors from the IA when doing git-annex-get's? |
09:22
|
demize |
or 417 Expectation Failed |
09:26
|
HCross |
is it doing it on the metadata? |
12:09
|
|
VADemon has joined #internetarchive.bak |
13:40
|
demize |
On actual files, like https://ptpb.pw/mWuT |
14:03
|
demize |
It seems that the 39 files that still only have 1 copy in shard 10 are ones that return either 403, 404, or 417. |
15:33
|
db48x |
yep |
15:33
|
db48x |
sometimes the IA has to hide items |
16:02
|
|
GLaDOS has quit IRC (Quit: Oh crap, I died.) |
16:03
|
|
GLaDOS has joined #internetarchive.bak |
16:11
|
|
Rotab has quit IRC (Read error: Connection reset by peer) |
16:11
|
demize |
I'd expect all hidden items to have the same status code though. |
16:30
|
db48x |
I'm sure there are plenty of interesting edge cases |
16:31
|
|
JesseW has joined #internetarchive.bak |
17:10
|
demize |
Wish things like that were documented, hmm. |
17:10
|
JesseW |
demize: like what? |
17:10
|
demize |
The status code of missing files. |
17:10
|
* |
JesseW is reading the logs now |
17:10
|
demize |
Getting a mix of 403s, 404s, and 417s. |
17:11
|
JesseW |
well, we can document them now. :-) That's one of the points of this project. :-) |
17:12
|
demize |
It's hard to document them when there's no information given about them though. |
17:12
|
JesseW |
well, we can start by documenting what the range of responses is |
17:13
|
JesseW |
and include those on the wiki page (or a subpage): http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK |
17:14
|
JesseW |
demize: one very useful way to investigate particular items on IA is (while logged in) go to https://archive.org/history/IDENTIFIER (where IDENTIFIER is the item identifier, in this case crankygeeks_080_episode ) |
17:15
|
JesseW |
This shows that the item was (likely incorrectly) marked as spam back on 2015-09-30 |
17:15
|
JesseW |
I'll send an email to info@ and they will likely make it visible again. |
17:16
|
JesseW |
demize: please drop the full list of other items that are generating errors into a pastebin |
17:20
|
demize |
Cool, I'll save the list. |
17:20
|
JesseW |
Want me to CC you on the email to info@? |
17:22
|
demize |
Sure, johannes@kyriasis.com |
17:23
|
demize |
Should we also make a sub page for each shard with a list of files that cannot be mirrored? |
17:23
|
JesseW |
yes |
17:24
|
JesseW |
ok, email sent |
17:34
|
demize |
Hmm, https://archive.org/download/tolcher2005-03-13.shnf/tolcher2005-03-13.shnf_64kb_mp3.zip is another one that fails, with 417, with the page saying "missing required path parameter", hmm. |
17:38
|
JesseW |
Hm, will check /history/ |
17:38
|
db48x |
that item doesn't have a zip file in it |
17:38
|
JesseW |
ok, yeah, that's a *different* (and interesting) error |
17:38
|
db48x |
https://ia902305.us.archive.org/25/items/tolcher2005-03-13.shnf/tolcher2005-03-13.shnf_files.xml |
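The check db48x did by hand, looking in the item's _files.xml to see whether the zip is listed at all, could be scripted roughly like this. It is only a sketch: the sample XML and its file names are made up for illustration and are not the real tolcher listing, and `listed_files` and `missing_from_item` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of an item's _files.xml.
# The file names are invented; this is NOT the real tolcher listing.
SAMPLE_FILES_XML = """<files>
  <file name="example_item_files.xml" source="original">
    <format>Metadata</format>
  </file>
  <file name="track01.flac" source="original">
    <format>Flac</format>
  </file>
  <file name="track01.mp3" source="derivative">
    <format>VBR MP3</format>
    <original>track01.flac</original>
  </file>
</files>"""


def listed_files(files_xml):
    """Return the set of file names listed in a _files.xml document."""
    root = ET.fromstring(files_xml)
    return {f.get("name") for f in root.iter("file")}


def missing_from_item(wanted, files_xml):
    """Return the wanted names that the item no longer lists, sorted."""
    present = listed_files(files_xml)
    return sorted(name for name in wanted if name not in present)


if __name__ == "__main__":
    wanted = ["track01.flac", "example_item_64kb_mp3.zip"]
    print(missing_from_item(wanted, SAMPLE_FILES_XML))
```

A name that the shard expects but the listing lacks is a candidate for "deleted from the item" rather than "item hidden".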
17:40
|
JesseW |
I don't see anything in the history that would explain why that file used to exist but doesn't now... |
17:41
|
JesseW |
ah, maybe this task: https://catalogd.archive.org/log/345422350 |
17:41
|
demize |
https://catalogd.archive.org/log/400391779 |
17:41
|
demize |
IMPROPERLY _files.xml MARKED LIKELY DERIVATIVES: |
17:41
|
demize |
... |
17:41
|
demize |
DELETING tolcher2005-03-13.shnf_64kb_mp3.zip |
17:41
|
JesseW |
ha |
17:41
|
JesseW |
well, that explains it yeah |
17:42
|
db48x |
the error is probably weird because it's interacting with their online zip browser |
17:42
|
db48x |
which lets you load individual files from inside archives |
17:43
|
JesseW |
this is certainly exactly the sort of thing the IA.BAK project is intended to flush out :-) |
17:46
|
demize |
https://ptpb.pw/M2GY are the 39 files that aren't available in shard 10 |
17:46
|
JesseW |
nice! |
17:47
|
db48x |
we should just remove the zips from the shard |
17:47
|
demize |
Yeah |
17:47
|
JesseW |
https://archive.org/download/hhtm2008-12-04.Schoeps_64_24_bit/hhtm2008-12-04_mk4_2496_FLAC/.pureftpd-upload.4939a348.15.4bd9.b521510 -- this looks like a temporary file that got swept up in the shard |
17:47
|
JesseW |
agreed about removing the zips |
17:48
|
db48x |
ekafon048kmutant was darked |
17:49
|
demize |
Yeah, .pureftpd-upload.* files are essentially .part files. |
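One way the shard creation script could leave out the two kinds of files discussed here (generated zips and partial uploads) is a simple name filter. This is a sketch only: the patterns are guesses based on the two examples in this log, and `SKIP_PATTERNS` and `should_skip` are hypothetical names, not part of iabak.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Assumed patterns, taken from the two examples in this log:
# partial pure-ftpd uploads and the generated 64kb mp3 zips.
SKIP_PATTERNS = (".pureftpd-upload.*", "*_64kb_mp3.zip")


def should_skip(path):
    """Return True if the file name matches one of the skip patterns."""
    name = basename(path)
    return any(fnmatchcase(name, pattern) for pattern in SKIP_PATTERNS)


if __name__ == "__main__":
    paths = [
        "hhtm2008-12-04.Schoeps_64_24_bit/hhtm2008-12-04_mk4_2496_FLAC/"
        ".pureftpd-upload.4939a348.15.4bd9.b521510",
        "tolcher2005-03-13.shnf/tolcher2005-03-13.shnf_64kb_mp3.zip",
        "NXL064/03. By my self (Feat. Raz - Anti-Ven%C3%B6m).mp3",
    ]
    for path in paths:
        print(should_skip(path), path)
```

Matching on the base name only keeps the filter from tripping over directory names that happen to contain similar text.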
17:50
|
demize |
It seems that different darked files get either 403 or 404 |
17:51
|
JesseW |
Hm, I wonder why the difference. |
17:51
|
demize |
Actually, hmmm. Maybe the darked ones only get 403 |
17:51
|
demize |
And the 404s are just ones that were deleted due to <reason> |
17:51
|
JesseW |
that would make sense, yeah |
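As a starting point for documenting the range of responses, the working hypothesis so far could be written down as a lookup table. This is a sketch of the guesses made in this log, not an official Internet Archive specification; `LIKELY_CAUSE`, `likely_cause` and `summarize` are hypothetical names.

```python
from collections import Counter

# Tentative mapping, based only on the hypotheses in this log.
LIKELY_CAUSE = {
    403: "item darked (hidden)",
    404: "file deleted from the item",
    417: "path into a zip that no longer exists (zip browser error)",
}


def likely_cause(status):
    """Return the guessed cause for an HTTP status code."""
    return LIKELY_CAUSE.get(status, "undocumented")


def summarize(results):
    """Count guessed causes over an iterable of (url, status) pairs."""
    return Counter(likely_cause(status) for _url, status in results)


if __name__ == "__main__":
    # Placeholder URLs; real input would be the failing files of a shard.
    observed = [("file-a", 403), ("file-b", 404), ("file-c", 403), ("file-d", 417)]
    for cause, count in summarize(observed).items():
        print(count, cause)
```

Anything that comes back "undocumented" is a new case worth adding to the wiki page.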
17:53
|
demize |
Seems many of the 404s are mp3s that are previews |
17:54
|
db48x |
we should leave the dark items in the shard, in case they come back later |
17:54
|
demize |
Also |
17:54
|
demize |
"get nacion-libre-diy/NXL064/03. By my self (Feat. Raz - Anti-Ven%C3%B6m).mp3 (from web...) " |
17:54
|
db48x |
it's even possible we got a copy before they went dark |
17:55
|
JesseW |
yep :-) |
17:55
|
demize |
Is there, but getting a 404, which seems to be due to weird URL encoding |
17:55
|
JesseW |
well, that's a bug |
17:55
|
db48x |
hrm. I had thought that bug was fixed |
17:55
|
demize |
Real URL is https://archive.org/download/NXL064/03.%20By%20my%20self%20%28Feat.%20Raz%20-%20Anti-Ven%25C3%25B6m%29.mp3 |
17:55
|
demize |
But it tries https://archive.org/download/NXL064/03.%20By%20my%20self%20(Feat.%20Raz%20-%20Anti-Ven%C3%B6m).mp3_meta.txt |
17:56
|
demize |
err, https://archive.org/download/NXL064/03.%20By%20my%20self%20(Feat.%20Raz%20-%20Anti-Ven%C3%B6m).mp3 |
17:57
|
demize |
Actually |
17:57
|
demize |
I think it might be that the ö in Venöm was a different encoding and was then normalized.. |
18:00
|
demize |
Or, no... |
18:00
|
demize |
Gah, it's double URL encoded. |
18:01
|
demize |
%C3%B6 turned into %25 C3 %25 B6 |
18:01
|
db48x |
hah |
18:01
|
JesseW |
That would be a problem, yeah |
18:01
|
db48x |
that's probably why it's not fixed |
18:02
|
db48x |
the fix would be inelegant :) |
18:02
|
demize |
Yeah.. |
18:02
|
JesseW |
It's still a bug in iabak, that we got the wrong encoding, no? |
18:02
|
db48x |
I wonder if IA would fix it on their end |
18:03
|
demize |
https://catalogd.archive.org/log/151966441 |
18:03
|
demize |
So the file seems to have been uploaded with a percent-encoded filename. |
18:03
|
demize |
And then iabak didn't percent-encode the percent-encoded filename. |
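The fix on the shard-creation side would be to treat the stored filename as an opaque string and percent-encode it once more when building the download URL, so that a literal "%" becomes "%25". A minimal sketch using Python's `urllib.parse.quote` follows; `download_url` is a hypothetical helper, and the real shard script may build its URLs quite differently.

```python
from urllib.parse import quote, unquote

# The name as stored in the item: it already contains a literal "%C3%B6".
STORED_NAME = "03. By my self (Feat. Raz - Anti-Ven%C3%B6m).mp3"


def download_url(identifier, filename):
    """Build a download URL, percent-encoding the stored name once."""
    return "https://archive.org/download/%s/%s" % (quote(identifier), quote(filename))


if __name__ == "__main__":
    url = download_url("NXL064", STORED_NAME)
    print(url)
    # Decoding the last path segment once gives back the stored name,
    # including its literal "%C3%B6".
    print(unquote(url.rsplit("/", 1)[1]) == STORED_NAME)
```

Encoding unconditionally is the safe choice here: trying to detect "already encoded" names would break any file whose name legitimately contains a percent sign.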
18:03
|
db48x |
yep |
18:04
|
db48x |
technically iabak doesn't encode anything |
18:04
|
db48x |
but the script that creates the shard could |
18:05
|
db48x |
keep up the good work; I'll be around later |
18:06
|
JesseW |
yeah, I was thinking of iabak in the sense of the whole project |
18:07
|
JesseW |
specifically the shard creation script |
18:07
|
|
HCross2 has joined #internetarchive.bak |
19:26
|
|
JesseW has quit IRC (Read error: Operation timed out) |
19:41
|
|
Start has quit IRC (Read error: Connection reset by peer) |
19:41
|
|
Start has joined #internetarchive.bak |
19:42
|
|
svchfoo3 sets mode: +o Start |
21:43
|
|
JesseW has joined #internetarchive.bak |
23:04
|
|
xperia64 has joined #internetarchive.bak |
23:55
|
|
demize has quit IRC (Ping timeout: 250 seconds) |