[04:16] *** VADemon has quit IRC (Read error: Connection reset by peer)
[06:56] *** JesseW has quit IRC (Read error: Operation timed out)
[09:06] *** demize has joined #internetarchive.bak
[09:19] *** closure has quit IRC (Ping timeout: 250 seconds)
[09:19] *** closure has joined #internetarchive.bak
[09:20] *** svchfoo3 sets mode: +o closure
[09:21] Is it to be expected to get a bunch of 403 errors from the IA when doing git-annex-get's?
[09:22] or 417 Expectation Failed
[09:26] is it doing it on the metadata?
[12:09] *** VADemon has joined #internetarchive.bak
[13:40] On actual files, like https://ptpb.pw/mWuT
[14:03] It seems that the 39 files that still only have 1 copy in shard 10 are ones that return either 403, 404, or 417.
[15:33] yep
[15:33] sometimes the IA has to hide items
[16:02] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[16:03] *** GLaDOS has joined #internetarchive.bak
[16:11] *** Rotab has quit IRC (Read error: Connection reset by peer)
[16:11] I'd expect all hidden items to have the same status code though.
[16:30] I'm sure there are plenty of interesting edge cases
[16:31] *** JesseW has joined #internetarchive.bak
[17:10] Wish things like that was documented, hmm.
[17:10] demize: like what?
[17:10] The status code of missing files.
[17:10] * JesseW is reading the logs now
[17:10] Getting a mix of 403s, 404s, and 417s.
[17:11] well, we can document them now. :-) That's one of the points of this project. :-)
[17:12] It's hard to document them when there's no information given about them though.
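(Editorial aside: the mix of 403s, 404s, and 417s mentioned above could be tallied with something like the following. This is only a rough sketch, not anything the channel actually ran. The names head_status and group_by_status are hypothetical, and it assumes a HEAD request is answered with the same status code as a normal download, which has not been verified here.)

```python
import urllib.error
import urllib.request
from collections import defaultdict


def head_status(url, timeout=30):
    """Return the HTTP status code for url (hypothetical helper).

    Assumption: the server answers HEAD with the same status it
    would give a normal download request.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is what we want to record.
        return err.code


def group_by_status(results):
    """Group (url, status) pairs into {status: [urls]}, sorted by status."""
    groups = defaultdict(list)
    for url, status in results:
        groups[status].append(url)
    return dict(sorted(groups.items()))


# Usage, given a list of failing URLs named `urls`:
#   report = group_by_status((u, head_status(u)) for u in urls)
#   for status, members in report.items():
#       print(status, len(members))
```

(A per-status list like this is one possible starting point for documenting the range of responses.)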
[17:12] well, we can start by documenting what the range of responses is
[17:13] and include those on the wiki page (or a subpage): http://archiveteam.org/index.php?title=Talk:INTERNETARCHIVE.BAK
[17:14] demize: one very useful way to investigate particular items on IA is (while logged in) go to https://archive.org/history/IDENTIFIER (where IDENTIFIER is the item identifier, in this case crankygeeks_080_episode )
[17:15] This shows that the item was (likely incorrectly) marked as spam back on 2015-09-30
[17:15] I'll send an email to info@ and they will likely make it visible again.
[17:16] demize: please drop the full list of other items that are generating errors into a pastebin
[17:20] Cool, I'll save the list.
[17:20] Want me to CC you on the email to info@?
[17:22] Sure, johannes@kyriasis.com
[17:23] Should we also make a sub page for each shard with a list of files that cannot be mirrored?
[17:23] yes
[17:24] ok, email sent
[17:34] Hmm, https://archive.org/download/tolcher2005-03-13.shnf/tolcher2005-03-13.shnf_64kb_mp3.zip is another one that fails, with 417, with the page saying "missing required path parameter", hmm.
[17:38] Hm, will check /history/
[17:38] that item doesn't have a zip file in it
[17:38] ok, yeah, that's a *different* (and interesting) error
[17:38] https://ia902305.us.archive.org/25/items/tolcher2005-03-13.shnf/tolcher2005-03-13.shnf_files.xml
[17:40] I don't see anything in the history that would explain why that file used to exist but doesn't now...
[17:41] ah, maybe this task: https://catalogd.archive.org/log/345422350
[17:41] https://catalogd.archive.org/log/400391779
[17:41] IMPROPERLY _files.xml MARKED LIKELY DERIVATIVES:
[17:41] ...
[17:41] DELETING tolcher2005-03-13.shnf_64kb_mp3.zip
[17:41] ha
[17:41] well, that explains it yeah
[17:42] the error is probably weird because it's interacting with their online zip browser
[17:42] which lets you load individual files from inside archives
[17:43] this is certainly exactly the sort of thing the IA.BAK project is intended to flush out :-)
[17:46] https://ptpb.pw/M2GY are the 39 files that aren't available in shard 10
[17:46] nice!
[17:47] we should just remove the zips from the shard
[17:47] Yeah
[17:47] https://archive.org/download/hhtm2008-12-04.Schoeps_64_24_bit/hhtm2008-12-04_mk4_2496_FLAC/.pureftpd-upload.4939a348.15.4bd9.b521510 -- this looks like a temporary file that got swept up in the shard
[17:47] agreed about removing the zips
[17:48] ekafon048kmutant was darked
[17:49] Yeah, .pureftpd-upload.* files are essentially .part files.
[17:50] It seems that different darked files get either 403 or 404
[17:51] Hm, I wonder why the difference.
[17:51] Actually, hmmm. Maybe the darked ones only get 403
[17:51] And the 404s are just ones that were deleted due to
[17:51] that would make sense, yeah
[17:53] Seems many of the 404s are mp3s that are previews
[17:54] we should leave the dark items in the shard, in case they come back later
[17:54] Also
[17:54] "get nacion-libre-diy/NXL064/03. By my self (Feat. Raz - Anti-Ven%C3%B6m).mp3 (from web...) "
[17:54] it's even possible we got a copy before they went dark
[17:55] yep :-)
[17:55] Is there, but getting a 404, which seems to be due to weird URL encoding
[17:55] well, that's a bug
[17:55] hrm. I had thought that bug was fixed
[17:55] Real URL is https://archive.org/download/NXL064/03.%20By%20my%20self%20%28Feat.%20Raz%20-%20Anti-Ven%25C3%25B6m%29.mp3
[17:55] But it tries https://archive.org/download/NXL064/03.%20By%20my%20self%20(Feat.%20Raz%20-%20Anti-Ven%C3%B6m).mp3_meta.txt
[17:56] err, https://archive.org/download/NXL064/03.%20By%20my%20self%20(Feat.%20Raz%20-%20Anti-Ven%C3%B6m).mp3
[17:57] Actually
[17:57] I think it might be that the ö in Venöm was a different encoding and was then normalized..
[18:00] Or, no...
[18:00] Gah, it's double URL encoded.
[18:01] %C3%B6 turned into %25 C3 %25 B6
[18:01] hah
[18:01] That would be a problem, yeah
[18:01] that's probably why it's not fixed
[18:02] the fix would be inelegant :)
[18:02] Yeah..
[18:02] It's still a bug in iabak, that we got the wrong encoding, no?
[18:02] I wonder if IA would fix it on their end
[18:03] https://catalogd.archive.org/log/151966441
[18:03] So the file seems to have been uploaded with a percent-encoded filename.
[18:03] And then iabak didn't percent-encode the percent-encoded filename.
[18:03] yep
[18:04] technically iabak doesn't encode anything
[18:04] but the script that creates the shard could
[18:05] keep up the good work; I'll be around later
[18:06] yeah, I was thinking of iabak in the sense of the whole project
[18:07] specifically the shard creation script
[18:07] *** HCross2 has joined #internetarchive.bak
[19:26] *** JesseW has quit IRC (Read error: Operation timed out)
[19:41] *** Start has quit IRC (Read error: Connection reset by peer)
[19:41] *** Start has joined #internetarchive.bak
[19:42] *** svchfoo3 sets mode: +o Start
[21:43] *** JesseW has joined #internetarchive.bak
[23:04] *** xperia64 has joined #internetarchive.bak
[23:55] *** demize has quit IRC (Ping timeout: 250 seconds)
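
(Editorial aside: the double-encoding problem discussed at 17:55 through 18:04 can be illustrated roughly as follows. This is a sketch only, not the actual shard creation script. The function name download_url is hypothetical, and it assumes the stored filename literally contains the characters "%C3%B6", as the catalog log linked at 18:03 suggests.)

```python
from urllib.parse import quote

BASE = "https://archive.org/download"


def download_url(identifier, filename):
    """Build a download URL from an item identifier and stored filename.

    Hypothetical helper. quote() percent-encodes every reserved
    character in the stored name, including a literal '%', so a name
    that already looks percent-encoded is encoded a second time on
    purpose. '/' is left alone so subdirectory paths keep working.
    """
    return f"{BASE}/{quote(identifier)}/{quote(filename)}"


# The name as stored in the item: '%C3%B6' is six literal characters.
stored_name = "03. By my self (Feat. Raz - Anti-Ven%C3%B6m).mp3"

print(download_url("NXL064", stored_name))
# https://archive.org/download/NXL064/03.%20By%20my%20self%20%28Feat.%20Raz%20-%20Anti-Ven%25C3%25B6m%29.mp3
```

(The printed URL matches the "Real URL" quoted at 17:55. Passing the stored name through unencoded leaves "%C3%B6" in the URL, which the server decodes to "ö" and then fails to find.)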