Time |
Nickname |
Message |
00:16
🔗
|
|
Raccoon has quit IRC (Ping timeout: 258 seconds) |
02:10
🔗
|
|
britmob has joined #archiveteam-bs |
02:20
🔗
|
|
odemgi has joined #archiveteam-bs |
02:21
🔗
|
|
odemgi_ has quit IRC (Ping timeout: 252 seconds) |
03:24
🔗
|
|
qw3rty has joined #archiveteam-bs |
03:32
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 745 seconds) |
03:47
🔗
|
|
odemgi_ has joined #archiveteam-bs |
03:53
🔗
|
|
odemgi has quit IRC (Read error: Operation timed out) |
05:25
🔗
|
|
systwi_ has joined #archiveteam-bs |
05:31
🔗
|
|
systwi has quit IRC (Ping timeout: 612 seconds) |
05:31
🔗
|
|
ShellyRol has quit IRC (Read error: Connection reset by peer) |
05:32
🔗
|
|
ShellyRol has joined #archiveteam-bs |
06:31
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
07:45
🔗
|
|
fredgido_ has joined #archiveteam-bs |
07:48
🔗
|
|
deevious1 has joined #archiveteam-bs |
07:50
🔗
|
|
deevious has quit IRC (Ping timeout: 252 seconds) |
07:50
🔗
|
|
deevious1 is now known as deevious |
07:52
🔗
|
|
fredgido has quit IRC (Read error: Operation timed out) |
08:31
🔗
|
|
Raccoon has joined #archiveteam-bs |
08:37
🔗
|
godane |
SketchCow: tape is fully uploaded now |
08:38
🔗
|
godane |
i'm also now uploading sbs 8 news for 2004-04 to 2004-06 |
09:12
🔗
|
|
fuzzy802 has joined #archiveteam-bs |
09:12
🔗
|
|
fuzzy8021 has quit IRC (Read error: Operation timed out) |
09:14
🔗
|
|
fuzzy802 has quit IRC (Read error: Connection reset by peer) |
09:17
🔗
|
|
fuzzy8021 has joined #archiveteam-bs |
09:30
🔗
|
|
schbirid has joined #archiveteam-bs |
10:13
🔗
|
|
VADemon has joined #archiveteam-bs |
10:27
🔗
|
|
fuzzy8021 has quit IRC (Read error: Connection reset by peer) |
10:28
🔗
|
|
fuzzy8021 has joined #archiveteam-bs |
10:57
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
10:57
🔗
|
|
Mateon1 has joined #archiveteam-bs |
11:24
🔗
|
|
deevious1 has joined #archiveteam-bs |
11:25
🔗
|
|
deevious has quit IRC (Ping timeout: 252 seconds) |
11:25
🔗
|
|
deevious1 is now known as deevious |
13:24
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
13:38
🔗
|
|
DogsRNice has joined #archiveteam-bs |
13:41
🔗
|
paul2520 |
I wonder how well https://www.medicare.gov/download/downloaddb.asp is crawled. Medicare data is used by researchers and the media, and it has potential business uses as well. |
13:45
🔗
|
|
killsushi has joined #archiveteam-bs |
14:19
🔗
|
|
balrog has quit IRC (Quit: Bye) |
14:29
🔗
|
|
balrog has joined #archiveteam-bs |
14:36
🔗
|
paul2520 |
alright, another one. I'm not an expert, but according to robots.txt, I feel like this should be able to be saved in the Wayback Machine: https://blog.revolutionanalytics.com/2019/07/r-361-is-now-available.html |
14:40
🔗
|
britmob |
The entire site? Or just that article? |
14:40
🔗
|
britmob |
And the Wayback Machine no longer obeys robots.txt AFAIK |
14:50
🔗
|
paul2520 |
I just noticed that article. |
14:51
🔗
|
paul2520 |
...but it wouldn't hurt to crawl the site, if archivebot isn't too busy |
14:56
🔗
|
britmob |
I can go ahead and crawl it myself |
15:01
🔗
|
britmob |
Oh. Microsoft. Probably have rate limiting, that's gonna stop me. |
15:09
🔗
|
JAA |
I might be blind, but I don't see anything problematic in that robots.txt? |
15:49
🔗
|
|
Raccoon` has joined #archiveteam-bs |
15:49
🔗
|
|
Raccoon has quit IRC (Ping timeout: 258 seconds) |
15:49
🔗
|
|
Raccoon` is now known as Raccoon |
16:10
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
16:49
🔗
|
|
Raccoon has quit IRC (Ping timeout: 258 seconds) |
17:32
🔗
|
|
fuzzy8021 has quit IRC (Read error: Operation timed out) |
17:33
🔗
|
|
fuzzy8021 has joined #archiveteam-bs |
17:37
🔗
|
|
icedice has joined #archiveteam-bs |
17:59
🔗
|
|
fuzzy8021 has quit IRC (Read error: Operation timed out) |
17:59
🔗
|
|
fuzzy8021 has joined #archiveteam-bs |
18:09
🔗
|
VADemon |
if only Wayback correctly interpreted robots.txt to begin with, because it just doesn't mind whitelisted URLs after a /* deny |
18:53
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
19:13
🔗
|
|
Raccoon has joined #archiveteam-bs |
19:31
🔗
|
|
Atom__ has joined #archiveteam-bs |
19:37
🔗
|
|
Atom-- has quit IRC (Read error: Operation timed out) |
20:44
🔗
|
|
Ryz has quit IRC (Remote host closed the connection) |
20:44
🔗
|
|
kiska18 has quit IRC (Remote host closed the connection) |
20:45
🔗
|
|
Ryz has joined #archiveteam-bs |
20:45
🔗
|
|
svchfoo3 sets mode: +o Ryz |
20:45
🔗
|
|
Fusl sets mode: +o Ryz |
20:45
🔗
|
|
Fusl____ sets mode: +o Ryz |
20:45
🔗
|
|
Fusl_ sets mode: +o Ryz |
20:45
🔗
|
|
kiska18 has joined #archiveteam-bs |
20:45
🔗
|
|
Fusl____ sets mode: +o kiska18 |
20:45
🔗
|
|
Fusl sets mode: +o kiska18 |
20:45
🔗
|
|
Fusl_ sets mode: +o kiska18 |
21:32
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
22:18
🔗
|
JAA |
So looking into that picosong S3 weirdness again. A delay doesn't seem to help, nor does retrying directly on the S3 URL. I also can't reproduce it though, even on the same track ID. I guess I'll try refetching the redirect. |
22:21
🔗
|
JAA |
I can't reproduce it in the sense that if I get 403s under load and then try again without load for the same track, it works fine. |
22:21
🔗
|
JAA |
At least it seems to have something to do with the load. |
22:31
🔗
|
amelia386 |
Are you downloading from s3 right after the page load? It is using signed s3 urls with what looks like a 15 min expiry. |
22:32
🔗
|
Fusl |
^ |
22:36
🔗
|
JAA |
Yes |
22:36
🔗
|
JAA |
I send the request for the download a few milliseconds after receiving the 302. |
22:38
🔗
|
JAA |
And since I didn't mention it again: it doesn't happen on all downloads, only on some. |
22:44
🔗
|
Fusl |
race condition on their side? |
22:48
🔗
|
amelia386 |
That's weird. cdn.picosong.com has the params for S3, but the actual server isn't S3. They're proxying the data through an nginx box somewhere. |
22:48
🔗
|
JAA |
Yeah, I guess so. And one that's only triggered when the site is under load and only for the IP causing that load (since I can't reproduce it from an independent connection). |
22:49
🔗
|
JAA |
Yeah, that's something I noticed as well. Sometimes cdn.picosong.com also redirects to picosong.s3.amazonaws.com. |
22:51
🔗
|
JAA |
Hmm, or maybe it has something to do with my parallelism? |
22:53
🔗
|
JAA |
Nope, doesn't look like it either. I'm sending lots of requests obviously, but sometimes the 403s also happen when there are no other requests pending. |
22:55
🔗
|
amelia386 |
It could be AWS doing rate limiting, but the limits are very high for S3 (5k+/s iirc) |
22:56
🔗
|
Fusl |
oh, that |
22:56
🔗
|
JAA |
Yeah, I'm nowhere near that. |
22:57
🔗
|
Fusl |
i had that on abox hel1 when doing a few hundred requests/sec against sketch |
22:57
🔗
|
amelia386 |
But it is also AWS, so random API failures/errors are normal |
23:01
🔗
|
JAA |
So I implemented refetching the /cdn/HEX.mp3 redirect whenever I get a non-200 response from the post-redirect request. It retries five times, then it gives up. This worked fine for about 2 minutes, then it started failing. There were plenty of 403s in those two minutes as well, but the retries were successful, unlike later. |
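The retry approach described in the message above could be sketched roughly as follows. This is not qwarc's code: `fetch_redirect` and `fetch_file` are hypothetical callables standing in for the request to /cdn/HEX.mp3 and the follow-up request to the signed URL. The only detail taken from the log is the limit of five retries.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 5  # from the log: "It retries five times, then it gives up."


def download_with_refetch(
    fetch_redirect: Callable[[], str],
    fetch_file: Callable[[str], Tuple[int, bytes]],
    max_retries: int = MAX_RETRIES,
) -> Optional[bytes]:
    """Fetch a file behind a signed redirect, asking for a fresh redirect on failure.

    fetch_redirect() returns the target of the 302 (a freshly signed URL).
    fetch_file(url) returns (status_code, body) for that URL.
    Both are hypothetical stand-ins for real HTTP calls.
    """
    # One initial attempt plus up to max_retries retries.
    for _attempt in range(1 + max_retries):
        signed_url = fetch_redirect()  # a new signature on every attempt
        status, body = fetch_file(signed_url)
        if status == 200:
            return body
    return None  # every attempt returned a non-200 response
```

Passing the two fetchers in as arguments keeps the sketch independent of any particular HTTP client, and a caller could add a delay between attempts if desired.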
23:04
🔗
|
JAA |
Still a massive improvement over the previous numbers though. I had ~250 failures in ~2700 requests previously, now it's 34 failures. |
23:09
🔗
|
JAA |
That's 2700 requests in 9 minutes, by the way, so averaging 5 requests per second (against S3; many more against picosong.com). |
23:09
🔗
|
Fusl |
JAA: dump the headers and check them? |
23:09
🔗
|
|
HashbangI has quit IRC (Read error: Connection reset by peer) |
23:09
🔗
|
Fusl |
also, tried on mips yet? |
23:10
🔗
|
JAA |
Fusl: Nothing telling in the headers, but the error in the body is "The request signature we calculated does not match the signature you provided. Check your key and signing method." |
23:11
🔗
|
JAA |
I assume the signature's tied to the IP address? |
23:11
🔗
|
JAA |
In that case, wouldn't work on mips. |
23:12
🔗
|
Fusl |
it shouldnt be |
23:12
🔗
|
amelia386 |
Signed URLs aren't tied to ip |
23:13
🔗
|
JAA |
Huh, I think I got a 401 or 403 before when I tried to retrieve the same URL from a different IP, but not sure. |
23:13
🔗
|
JAA |
Might've been cdn.picosong.com rather than S3 also. |
23:14
🔗
|
amelia386 |
Without knowing how their server is generating the URLs, it's hard to say why they are being generated incorrectly. |
23:15
🔗
|
JAA |
Nevermind, yeah, seems to work from another IP. |
23:15
🔗
|
amelia386 |
Do those filenames have characters other than a-zA-Z in them? |
23:16
🔗
|
JAA |
Yes, all kinds of stuff, but I don't see a correlation with the errors there either. |
23:17
🔗
|
|
HashbangI has joined #archiveteam-bs |
23:17
🔗
|
JAA |
As in, plenty of files with weird names get downloaded just fine. |
23:18
🔗
|
amelia386 |
Gotcha. (A few S3 things don't like special chars) |
23:18
🔗
|
JAA |
Something like %E0%B4%93%20%E0%B4%AE%E0%B4%B1%E0%B4%BF%E0%B4%AF%E0%B4%BE%E0%B4%82.........Kyamta%20-...........-%20O%20Mariyame.mp3 succeeded, just as an example. |
23:19
🔗
|
JAA |
Meanwhile the much simpler SlowLifeMeoko%20(1).mp3 failed. |
23:19
🔗
|
Fusl |
JAA: you using wpull or grab-site for that? |
23:19
🔗
|
JAA |
Fusl: qwarc |
23:20
🔗
|
Fusl |
i dunno if that has a similar problem but when trying to throw the sketches list into grab-site, it failed with 403s on the s3 urls |
23:20
🔗
|
Fusl |
but curl -L worked fine |
23:21
🔗
|
JAA |
And it wasn't the UA I guess? |
23:23
🔗
|
JAA |
Also, did they fail consistently or randomly? |
23:24
🔗
|
Fusl |
i think it was consistently |
23:25
🔗
|
Fusl |
actually, no it wasnt. it just started happening after spinning up more containers with grab-site |
23:25
🔗
|
JAA |
Huh |
23:26
🔗
|
JAA |
That does sound a bit like my issue. |
23:37
🔗
|
JAA |
I'm finding all kinds of reports on this issue, some going back to 2013. :-| |
23:38
🔗
|
JAA |
One issue that's being mentioned repeatedly is signatures containing spaces which are then encoded incorrectly in the URL, but that doesn't appear to be the problem here since I don't see any such pattern in the failing vs. succeeding Signature values. |
23:41
🔗
|
JAA |
(I actually don't think it's spaces in the signature; rather, it's a + in the signature which is then inserted in the URL directly and therefore arrives as a space at AWS, failing the signature check. But multiple people claim "spaces in signatures", so I'll just leave it at that.) |
23:41
🔗
|
godane |
so i'm going to be grabbing the Sight & Sound Magazine archive |
23:42
🔗
|
JAA |
The + seem to be encoded correctly as %2B by picosong though, and I see requests with it both succeeding and failing, so that's not the problem. |
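The "+" versus "%2B" effect mentioned in the last few messages can be illustrated with Python's standard library. This is only a sketch: the signature value below is made up, and nothing here reflects how picosong or AWS actually build or parse their URLs.

```python
from urllib.parse import parse_qs, quote

# Made-up Base64-style signature containing a '+'; not a real picosong value.
signature = "abc+def/ghi="

# Placed in the query string unencoded, the '+' is decoded as a space.
raw = parse_qs("Signature=" + signature)["Signature"][0]

# Percent-encoded first ('+' becomes %2B), it survives the round trip.
encoded = quote(signature, safe="")
ok = parse_qs("Signature=" + encoded)["Signature"][0]

print(repr(raw))        # the value a server would see for the unencoded URL
print(encoded)          # the percent-encoded form
print(ok == signature)  # whether the encoded form round-trips intact
```

A server that recomputes the signature and compares it against the space-containing value would reject the request, which matches the failure mode people describe as "spaces in signatures".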
23:55
🔗
|
|
BlueMax has joined #archiveteam-bs |