#archiveteam-bs 2019-10-01,Tue

Time Nickname Message
00:16 🔗 Raccoon has quit IRC (Ping timeout: 258 seconds)
02:10 🔗 britmob has joined #archiveteam-bs
02:20 🔗 odemgi has joined #archiveteam-bs
02:21 🔗 odemgi_ has quit IRC (Ping timeout: 252 seconds)
03:24 🔗 qw3rty has joined #archiveteam-bs
03:32 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
03:47 🔗 odemgi_ has joined #archiveteam-bs
03:53 🔗 odemgi has quit IRC (Read error: Operation timed out)
05:25 🔗 systwi_ has joined #archiveteam-bs
05:31 🔗 systwi has quit IRC (Ping timeout: 612 seconds)
05:31 🔗 ShellyRol has quit IRC (Read error: Connection reset by peer)
05:32 🔗 ShellyRol has joined #archiveteam-bs
06:31 🔗 VADemon has quit IRC (Quit: left4dead)
07:45 🔗 fredgido_ has joined #archiveteam-bs
07:48 🔗 deevious1 has joined #archiveteam-bs
07:50 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
07:50 🔗 deevious1 is now known as deevious
07:52 🔗 fredgido has quit IRC (Read error: Operation timed out)
08:31 🔗 Raccoon has joined #archiveteam-bs
08:37 🔗 godane SketchCow: tape is fully uploaded now
08:38 🔗 godane i'm also now uploading sbs 8 news for 2004-04 to 2004-06
09:12 🔗 fuzzy802 has joined #archiveteam-bs
09:12 🔗 fuzzy8021 has quit IRC (Read error: Operation timed out)
09:14 🔗 fuzzy802 has quit IRC (Read error: Connection reset by peer)
09:17 🔗 fuzzy8021 has joined #archiveteam-bs
09:30 🔗 schbirid has joined #archiveteam-bs
10:13 🔗 VADemon has joined #archiveteam-bs
10:27 🔗 fuzzy8021 has quit IRC (Read error: Connection reset by peer)
10:28 🔗 fuzzy8021 has joined #archiveteam-bs
10:57 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
10:57 🔗 Mateon1 has joined #archiveteam-bs
11:24 🔗 deevious1 has joined #archiveteam-bs
11:25 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
11:25 🔗 deevious1 is now known as deevious
13:24 🔗 killsushi has quit IRC (Quit: Leaving)
13:38 🔗 DogsRNice has joined #archiveteam-bs
13:41 🔗 paul2520 I wonder how well https://www.medicare.gov/download/downloaddb.asp is crawled. Medicare data is used by researchers and the media, and has potential business uses as well.
13:45 🔗 killsushi has joined #archiveteam-bs
14:19 🔗 balrog has quit IRC (Quit: Bye)
14:29 🔗 balrog has joined #archiveteam-bs
14:36 🔗 paul2520 alright, another one. I'm not an expert, but according to robots.txt, I feel like this should be able to be saved in the Wayback Machine: https://blog.revolutionanalytics.com/2019/07/r-361-is-now-available.html
14:40 🔗 britmob The entire site? Or just that article?
14:40 🔗 britmob And the Wayback Machine no longer obeys robots.txt AFAIK
14:50 🔗 paul2520 I just noticed that article.
14:51 🔗 paul2520 ...but it wouldn't hurt to crawl the site, if archivebot isn't too busy
14:56 🔗 britmob I can go ahead and crawl it myself
15:01 🔗 britmob Oh. Microsoft. Probably have rate limiting, that's gonna stop me.
15:09 🔗 JAA I might be blind, but I don't see anything problematic in that robots.txt?
15:49 🔗 Raccoon` has joined #archiveteam-bs
15:49 🔗 Raccoon has quit IRC (Ping timeout: 258 seconds)
15:49 🔗 Raccoon` is now known as Raccoon
16:10 🔗 killsushi has quit IRC (Quit: Leaving)
16:49 🔗 Raccoon has quit IRC (Ping timeout: 258 seconds)
17:32 🔗 fuzzy8021 has quit IRC (Read error: Operation timed out)
17:33 🔗 fuzzy8021 has joined #archiveteam-bs
17:37 🔗 icedice has joined #archiveteam-bs
17:59 🔗 fuzzy8021 has quit IRC (Read error: Operation timed out)
17:59 🔗 fuzzy8021 has joined #archiveteam-bs
18:09 🔗 VADemon if only Wayback correctly interpreted robots.txt to begin with, because it just doesn't mind whitelisted URLs after a /* deny
18:53 🔗 BlueMax has quit IRC (Quit: Leaving)
19:13 🔗 Raccoon has joined #archiveteam-bs
19:31 🔗 Atom__ has joined #archiveteam-bs
19:37 🔗 Atom-- has quit IRC (Read error: Operation timed out)
20:44 🔗 Ryz has quit IRC (Remote host closed the connection)
20:44 🔗 kiska18 has quit IRC (Remote host closed the connection)
20:45 🔗 Ryz has joined #archiveteam-bs
20:45 🔗 svchfoo3 sets mode: +o Ryz
20:45 🔗 Fusl sets mode: +o Ryz
20:45 🔗 Fusl____ sets mode: +o Ryz
20:45 🔗 Fusl_ sets mode: +o Ryz
20:45 🔗 kiska18 has joined #archiveteam-bs
20:45 🔗 Fusl____ sets mode: +o kiska18
20:45 🔗 Fusl sets mode: +o kiska18
20:45 🔗 Fusl_ sets mode: +o kiska18
21:32 🔗 schbirid has quit IRC (Remote host closed the connection)
22:18 🔗 JAA So looking into that picosong S3 weirdness again. A delay doesn't seem to help, nor does retrying directly on the S3 URL. I also can't reproduce it though, even on the same track ID. I guess I'll try refetching the redirect.
22:21 🔗 JAA I can't reproduce it in the sense that if I get 403s under load and then try again without load for the same track, it works fine.
22:21 🔗 JAA At least it seems to have something to do with the load.
22:31 🔗 amelia386 Are you downloading from s3 right after the page load? It is using signed s3 urls with what looks like a 15 min expiry.
22:32 🔗 Fusl ^
22:36 🔗 JAA Yes
22:36 🔗 JAA I send the request for the download a few milliseconds after receiving the 302.
22:38 🔗 JAA And since I didn't mention it again: it doesn't happen on all downloads, only on some.
22:44 🔗 Fusl race condition on their side?
22:48 🔗 amelia386 That's weird. cdn.picosong.com has the params for S3, but the actual server isn't S3. They're proxying the data through an nginx box somewhere.
22:48 🔗 JAA Yeah, I guess so. And one that's only triggered when the site is under load and only for the IP causing that load (since I can't reproduce it from an independent connection).
22:49 🔗 JAA Yeah, that's something I noticed as well. Sometimes cdn.picosong.com also redirects to picosong.s3.amazonaws.com.
22:51 🔗 JAA Hmm, or maybe it has something to do with my parallelism?
22:53 🔗 JAA Nope, doesn't look like it either. I'm sending lots of requests obviously, but sometimes the 403s also happen when there are no other requests pending.
22:55 🔗 amelia386 It could be AWS doing rate limiting, but the limits are very high for S3 (5k+/s iirc)
22:56 🔗 Fusl oh, that
22:56 🔗 JAA Yeah, I'm nowhere near that.
22:57 🔗 Fusl i had that on abox hel1 when doing a few hundred requests/sec against sketch
22:57 🔗 amelia386 But it is also AWS, so random API failures/errors are normal
23:01 🔗 JAA So I implemented refetching the /cdn/HEX.mp3 redirect whenever I get a non-200 response from the post-redirect request. It retries five times, then it gives up. This worked fine for about 2 minutes, then it started failing. There were plenty of 403s in those two minutes as well, but the retries were successful, unlike later.
23:04 🔗 JAA Still a massive improvement over the previous numbers though. I had ~250 failures in ~2700 requests previously, now it's 34 failures.
23:09 🔗 JAA That's 2700 requests in 9 minutes, by the way, so averaging 5 requests per second (against S3; many more against picosong.com).
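[Editor's note] The retry strategy JAA describes (go back to the redirect for a fresh signed URL whenever the post-redirect request returns non-200, and give up after five retries) can be sketched roughly as below. This is not qwarc code. The function name and both callables are hypothetical stand-ins, and reading "retries five times" as one initial attempt plus up to five retries is an assumption.

```python
from typing import Callable, Optional, Tuple


def download_with_refetch(
    get_signed_url: Callable[[], str],
    fetch: Callable[[str], Tuple[int, bytes]],
    max_retries: int = 5,
) -> Optional[bytes]:
    """Fetch a file behind a signed-URL redirect, refetching the redirect on failure.

    get_signed_url: hypothetical callable that requests the redirecting URL
        (e.g. /cdn/HEX.mp3) and returns the signed URL from its Location header.
    fetch: hypothetical callable that requests a URL and returns (status, body).
    """
    # Assumption: one initial attempt plus up to max_retries retries.
    for _ in range(1 + max_retries):
        # Always ask for a new signed URL instead of reusing the failed one,
        # since the failure is a rejected signature rather than a missing file.
        signed_url = get_signed_url()
        status, body = fetch(signed_url)
        if status == 200:
            return body
    # Every attempt returned a non-200 response: give up.
    return None
```

Passing the two request steps in as callables keeps the sketch independent of any HTTP client, so it can be exercised with fakes that return 403 a few times before a 200.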
23:09 🔗 Fusl JAA: dump the headers and check them?
23:09 🔗 HashbangI has quit IRC (Read error: Connection reset by peer)
23:09 🔗 Fusl also, tried on mips yet?
23:10 🔗 JAA Fusl: Nothing telling in the headers, but the error in the body is "The request signature we calculated does not match the signature you provided. Check your key and signing method."
23:11 🔗 JAA I assume the signature's tied to the IP address?
23:11 🔗 JAA In that case, wouldn't work on mips.
23:12 🔗 Fusl it shouldnt be
23:12 🔗 amelia386 Signed URLs aren't tied to ip
23:13 🔗 JAA Huh, I think I got a 401 or 403 before when I tried to retrieve the same URL from a different IP, but not sure.
23:13 🔗 JAA Might've been cdn.picosong.com rather than S3 also.
23:14 🔗 amelia386 Without knowing how their server is generating the URLs, it's hard to say why they are getting generated incorrectly.
23:15 🔗 JAA Nevermind, yeah, seems to work from another IP.
23:15 🔗 amelia386 Do those filenames have characters other than a-zA-Z in them?
23:16 🔗 JAA Yes, all kinds of stuff, but I don't see a correlation with the errors there either.
23:17 🔗 HashbangI has joined #archiveteam-bs
23:17 🔗 JAA As in, plenty of files with weird names get downloaded just fine.
23:18 🔗 amelia386 Gotcha. (A few S3 things don't like special chars)
23:18 🔗 JAA Something like %E0%B4%93%20%E0%B4%AE%E0%B4%B1%E0%B4%BF%E0%B4%AF%E0%B4%BE%E0%B4%82.........Kyamta%20-...........-%20O%20Mariyame.mp3 succeeded, just as an example.
23:19 🔗 JAA Meanwhile the much simpler SlowLifeMeoko%20(1).mp3 failed.
23:19 🔗 Fusl JAA: you using wpull or grab-site for that?
23:19 🔗 JAA Fusl: qwarc
23:20 🔗 Fusl i dunno if that has a similar problem but when trying to throw the sketches list into grab-site, it failed with 403s on the s3 urls
23:20 🔗 Fusl but curl -L worked fine
23:21 🔗 JAA And it wasn't the UA I guess?
23:23 🔗 JAA Also, did they fail consistently or randomly?
23:24 🔗 Fusl i think it was consistently
23:25 🔗 Fusl actually, no it wasnt. it just started happening after spinning up more containers with grab-site
23:25 🔗 JAA Huh
23:26 🔗 JAA That does sound a bit like my issue.
23:37 🔗 JAA I'm finding all kinds of reports on this issue, some going back to 2013. :-|
23:38 🔗 JAA One issue that's being mentioned repeatedly is signatures containing spaces which are then encoded incorrectly in the URL, but that doesn't appear to be the problem here since I don't see any such pattern in the failing vs. succeeding Signature values.
23:41 🔗 JAA (I actually don't think it's spaces in the signature; rather, it's a + in the signature which is then inserted in the URL directly and therefore arrives as a space at AWS, failing the signature check. But multiple people claim "spaces in signatures", so I'll just leave it at that.)
23:41 🔗 godane so i'm going to be grabbing the Sight & Sound Magazine archive
23:42 🔗 JAA The + seem to be encoded correctly as %2B by picosong though, and I see requests with it both succeeding and failing, so that's not the problem.
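[Editor's note] The "+ arrives as a space" failure mode JAA describes can be illustrated with Python's standard query-string parser, which applies the usual form-decoding rule that a literal + means a space. This is only a minimal sketch: the signature value is made up, the helper name is hypothetical, and it shows Python's decoding, not how AWS itself parses requests.

```python
from urllib.parse import parse_qs, quote


def received_signature(query: str) -> str:
    """Return the Signature value as a form-decoding server would see it."""
    return parse_qs(query)["Signature"][0]


# Made-up signature containing a '+', as base64 output often does.
signature = "abc+def/ghi="

# Inserted into the URL directly: the '+' is decoded as a space on arrival.
raw_query = "Signature=" + signature

# Percent-encoded first: '+' becomes %2B and survives decoding intact.
encoded_query = "Signature=" + quote(signature, safe="")

print(repr(received_signature(raw_query)))      # 'abc def/ghi='
print(repr(received_signature(encoded_query)))  # 'abc+def/ghi='
```

In the first case the value that arrives no longer matches what was signed, which would fail a signature check. The second case matches what JAA reports picosong doing (encoding + as %2B), which is why this particular bug is ruled out here.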
23:55 🔗 BlueMax has joined #archiveteam-bs
