[00:16] *** Raccoon has quit IRC (Ping timeout: 258 seconds)
[02:10] *** britmob has joined #archiveteam-bs
[02:20] *** odemgi has joined #archiveteam-bs
[02:21] *** odemgi_ has quit IRC (Ping timeout: 252 seconds)
[03:24] *** qw3rty has joined #archiveteam-bs
[03:32] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[03:47] *** odemgi_ has joined #archiveteam-bs
[03:53] *** odemgi has quit IRC (Read error: Operation timed out)
[05:25] *** systwi_ has joined #archiveteam-bs
[05:31] *** systwi has quit IRC (Ping timeout: 612 seconds)
[05:31] *** ShellyRol has quit IRC (Read error: Connection reset by peer)
[05:32] *** ShellyRol has joined #archiveteam-bs
[06:31] *** VADemon has quit IRC (Quit: left4dead)
[07:45] *** fredgido_ has joined #archiveteam-bs
[07:48] *** deevious1 has joined #archiveteam-bs
[07:50] *** deevious has quit IRC (Ping timeout: 252 seconds)
[07:50] *** deevious1 is now known as deevious
[07:52] *** fredgido has quit IRC (Read error: Operation timed out)
[08:31] *** Raccoon has joined #archiveteam-bs
[08:37] SketchCow: tape is fully uploaded now
[08:38] i'm also now uploading sbs 8 news for 2004-04 to 2004-06
[09:12] *** fuzzy802 has joined #archiveteam-bs
[09:12] *** fuzzy8021 has quit IRC (Read error: Operation timed out)
[09:14] *** fuzzy802 has quit IRC (Read error: Connection reset by peer)
[09:17] *** fuzzy8021 has joined #archiveteam-bs
[09:30] *** schbirid has joined #archiveteam-bs
[10:13] *** VADemon has joined #archiveteam-bs
[10:27] *** fuzzy8021 has quit IRC (Read error: Connection reset by peer)
[10:28] *** fuzzy8021 has joined #archiveteam-bs
[10:57] *** Mateon1 has quit IRC (Read error: Operation timed out)
[10:57] *** Mateon1 has joined #archiveteam-bs
[11:24] *** deevious1 has joined #archiveteam-bs
[11:25] *** deevious has quit IRC (Ping timeout: 252 seconds)
[11:25] *** deevious1 is now known as deevious
[13:24] *** killsushi has quit IRC (Quit: Leaving)
[13:38] *** DogsRNice has joined #archiveteam-bs
[13:41] I wonder how well https://www.medicare.gov/download/downloaddb.asp is crawled. Medicare data is used by researchers and the media, as well as for potential business uses.
[13:45] *** killsushi has joined #archiveteam-bs
[14:19] *** balrog has quit IRC (Quit: Bye)
[14:29] *** balrog has joined #archiveteam-bs
[14:36] alright, another one. I'm not an expert, but according to robots.txt, I feel like this should be able to be saved in the Wayback Machine: https://blog.revolutionanalytics.com/2019/07/r-361-is-now-available.html
[14:40] The entire site? Or just that article?
[14:40] And the Wayback Machine no longer obeys robots.txt AFAIK
[14:50] I just noticed that article.
[14:51] ...but it wouldn't hurt to crawl the site, if archivebot isn't too busy
[14:56] I can go ahead and crawl it myself
[15:01] Oh. Microsoft. Probably have rate limiting, that's gonna stop me.
[15:09] I might be blind, but I don't see anything problematic in that robots.txt?
[15:49] *** Raccoon` has joined #archiveteam-bs
[15:49] *** Raccoon has quit IRC (Ping timeout: 258 seconds)
[15:49] *** Raccoon` is now known as Raccoon
[16:10] *** killsushi has quit IRC (Quit: Leaving)
[16:49] *** Raccoon has quit IRC (Ping timeout: 258 seconds)
[17:32] *** fuzzy8021 has quit IRC (Read error: Operation timed out)
[17:33] *** fuzzy8021 has joined #archiveteam-bs
[17:37] *** icedice has joined #archiveteam-bs
[17:59] *** fuzzy8021 has quit IRC (Read error: Operation timed out)
[17:59] *** fuzzy8021 has joined #archiveteam-bs
[18:09] if only Wayback correctly interpreted robots.txt to begin with, because it just doesn't mind whitelisted URLs after a /* deny
[18:53] *** BlueMax has quit IRC (Quit: Leaving)
[19:13] *** Raccoon has joined #archiveteam-bs
[19:31] *** Atom__ has joined #archiveteam-bs
[19:37] *** Atom-- has quit IRC (Read error: Operation timed out)
[20:44] *** Ryz has quit IRC (Remote host closed the connection)
[20:44] *** kiska18 has quit IRC (Remote host closed the connection)
[20:45] *** Ryz has joined #archiveteam-bs
[20:45] *** svchfoo3 sets mode: +o Ryz
[20:45] *** Fusl sets mode: +o Ryz
[20:45] *** Fusl____ sets mode: +o Ryz
[20:45] *** Fusl_ sets mode: +o Ryz
[20:45] *** kiska18 has joined #archiveteam-bs
[20:45] *** Fusl____ sets mode: +o kiska18
[20:45] *** Fusl sets mode: +o kiska18
[20:45] *** Fusl_ sets mode: +o kiska18
[21:32] *** schbirid has quit IRC (Remote host closed the connection)
[22:18] So looking into that picosong S3 weirdness again. A delay doesn't seem to help, nor does retrying directly on the S3 URL. I also can't reproduce it though, even on the same track ID. I guess I'll try refetching the redirect.
[22:21] I can't reproduce it in the sense that if I get 403s under load and then try again without load for the same track, it works fine.
[22:21] At least it seems to have something to do with the load.
[22:31] Are you downloading from S3 right after the page load? It is using signed S3 URLs with what looks like a 15 min expiry.
[22:32] ^
[22:36] Yes
[22:36] I send the request for the download a few milliseconds after receiving the 302.
[22:38] And since I didn't mention it again: it doesn't happen on all downloads, only on some.
[22:44] race condition on their side?
[22:48] That's weird. cdn.picosong.com has the params for S3, but the actual server isn't S3. They're proxying the data through an nginx box somewhere.
[22:48] Yeah, I guess so. And one that's only triggered when the site is under load and only for the IP causing that load (since I can't reproduce it from an independent connection).
[22:49] Yeah, that's something I noticed as well. Sometimes cdn.picosong.com also redirects to picosong.s3.amazonaws.com.
[22:51] Hmm, or maybe it has something to do with my parallelism?
[22:53] Nope, doesn't look like it either. I'm sending lots of requests obviously, but sometimes the 403s also happen when there are no other requests pending.
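The 22:31 remark suggests cdn.picosong.com hands out V2-style signed S3 URLs, which carry an `Expires` epoch timestamp in the query string. A minimal sketch of checking that expiry before reusing a URL; the URL, field values, and safety margin below are illustrative assumptions, not picosong's actual values:

```python
import time
from urllib.parse import urlparse, parse_qs

def signed_url_still_valid(url, now=None, margin=30):
    """Return True if a V2-signed S3 URL's Expires epoch is still
    comfortably in the future (margin seconds of slack)."""
    now = time.time() if now is None else now
    qs = parse_qs(urlparse(url).query)
    expires = int(qs["Expires"][0])        # epoch seconds in the query string
    return expires - now > margin

# Made-up example: signed at t=1700000000 with a 15-minute expiry.
url = ("https://cdn.example.com/track.mp3"
       "?AWSAccessKeyId=AKIAEXAMPLE&Expires=1700000900&Signature=abc%2Bdef")
print(signed_url_still_valid(url, now=1700000000))   # True: ~15 min left
```

Since the 403s here hit milliseconds after the 302, expiry alone can't explain them, which is consistent with the race-condition theory above.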
[22:55] It could be AWS doing rate limiting, but the limits are very high for S3 (5k+/s iirc)
[22:56] oh, that
[22:56] Yeah, I'm nowhere near that.
[22:57] i had that on abox hel1 when doing a few hundred requests/sec against sketch
[22:57] But it is also AWS, so random API failures/errors are normal
[23:01] So I implemented refetching the /cdn/HEX.mp3 redirect whenever I get a non-200 response from the post-redirect request. It retries five times, then it gives up. This worked fine for about 2 minutes, then it started failing. There were plenty of 403s in those two minutes as well, but the retries were successful, unlike later.
[23:04] Still a massive improvement over the previous numbers though. I had ~250 failures in ~2700 requests previously, now it's 34 failures.
[23:09] That's 2700 requests in 9 minutes, by the way, so averaging 5 requests per second (against S3; many more against picosong.com).
[23:09] JAA: dump the headers and check them?
[23:09] *** HashbangI has quit IRC (Read error: Connection reset by peer)
[23:09] also, tried on mips yet?
[23:10] Fusl: Nothing telling in the headers, but the error in the body is "The request signature we calculated does not match the signature you provided. Check your key and signing method."
[23:11] I assume the signature's tied to the IP address?
[23:11] In that case, it wouldn't work on mips.
[23:12] it shouldnt be
[23:12] Signed URLs aren't tied to IP
[23:13] Huh, I think I got a 401 or 403 before when I tried to retrieve the same URL from a different IP, but not sure.
[23:13] Might've been cdn.picosong.com rather than S3 also.
[23:14] Without knowing how their server is generating the URLs, it's hard to say why they are getting generated incorrectly.
[23:15] Nevermind, yeah, seems to work from another IP.
[23:15] Do those filenames have characters other than a-zA-Z in them?
[23:16] Yes, all kinds of stuff, but I don't see a correlation with the errors there either.
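The retry scheme described at 23:01 can be sketched as follows: on a non-200 from the post-redirect request, go back and refetch the /cdn/HEX.mp3 redirect to obtain a freshly signed URL, giving up after five attempts. `fetch` is an injected callable returning `(status, payload)` so the logic stands alone; the function name and URL shapes are illustrative, not qwarc's actual API:

```python
def download_with_retry(track_url, fetch, max_retries=5):
    """Fetch track_url expecting a 302 to a signed URL, then fetch
    that URL. On any failure, restart from the redirect for a fresh
    signature, up to max_retries attempts."""
    for attempt in range(max_retries):
        status, location = fetch(track_url)    # expect 302 + Location header
        if status != 302:
            continue                           # no redirect; try again
        body_status, body = fetch(location)    # follow the redirect manually
        if body_status == 200:
            return body
    return None                                # give up after max_retries
```

The key design point, matching the log, is that a 403 triggers a refetch of the *redirect* (new signature) rather than a plain retry of the same pre-signed S3 URL, which was already observed not to help.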
[23:17] *** HashbangI has joined #archiveteam-bs
[23:17] As in, plenty of files with weird names get downloaded just fine.
[23:18] Gotcha. (A few S3 things don't like special chars)
[23:18] Something like %E0%B4%93%20%E0%B4%AE%E0%B4%B1%E0%B4%BF%E0%B4%AF%E0%B4%BE%E0%B4%82.........Kyamta%20-...........-%20O%20Mariyame.mp3 succeeded, just as an example.
[23:19] Meanwhile the much simpler SlowLifeMeoko%20(1).mp3 failed.
[23:19] JAA: you using wpull or grab-site for that?
[23:19] Fusl: qwarc
[23:20] i dunno if that has a similar problem but when trying to throw the sketches list into grab-site, it failed with 403s on the s3 urls
[23:20] but curl -L worked fine
[23:21] And it wasn't the UA I guess?
[23:23] Also, did they fail consistently or randomly?
[23:24] i think it was consistently
[23:25] actually, no it wasnt. it just started happening after spinning up more containers with grab-site
[23:25] Huh
[23:26] That does sound a bit like my issue.
[23:37] I'm finding all kinds of reports on this issue, some going back to 2013. :-|
[23:38] One issue that's being mentioned repeatedly is signatures containing spaces which are then encoded incorrectly in the URL, but that doesn't appear to be the problem here since I don't see any such pattern in the failing vs. succeeding Signature values.
[23:41] (I actually don't think it's spaces in the signature; rather, it's a + in the signature which is then inserted into the URL directly and therefore arrives as a space at AWS, failing the signature check. But multiple people claim "spaces in signatures", so I'll just leave it at that.)
[23:41] so i'm going to be grabbing the Sight & Sound Magazine archive
[23:42] The + seems to be encoded correctly as %2B by picosong though, and I see requests with it both succeeding and failing, so that's not the problem.
[23:55] *** BlueMax has joined #archiveteam-bs
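The failure mode discussed at 23:38-23:41 can be demonstrated with the Python standard library: a `+` pasted into a query string unencoded decodes as a space on the receiving end, which would invalidate a base64 signature, whereas `%2B` round-trips intact. The signature value here is made up:

```python
from urllib.parse import quote, parse_qs

raw_sig = "abc+def"  # made-up base64-ish signature fragment containing '+'

# '+' left unencoded in the query string: the parser decodes it as a space,
# so the signature the server sees no longer matches what was computed.
print(parse_qs("Signature=" + raw_sig)["Signature"])   # ['abc def']

# Percent-encoded as %2B: survives the round trip unchanged.
encoded = quote(raw_sig, safe="")                      # 'abc%2Bdef'
print(parse_qs("Signature=" + encoded)["Signature"])   # ['abc+def']
```

This matches the 23:42 observation: since picosong already emits `%2B`, this particular classic bug is ruled out as the cause of the intermittent 403s.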