#archiveteam-bs 2019-10-02,Wed


Time Nickname Message
01:07 markedL this sounds quite a bit like the bug I found on
01:07 markedL the bug I reported for AB running on Sketch/S3 which was maybe more so in wpull
01:10 markedL trying to remember, must have been something like the Host: header not being reset on the 2nd request after the 302, leading to the signature not matching which is 403
01:16 markedL it should apply to Fusl's experience with Sketch and wpull
01:16 markedL https://github.com/ArchiveTeam/wpull/issues/425
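The failure mode markedL describes above can be illustrated with a sketch. S3's v2 pre-signed URLs carry an HMAC-SHA1 signature over a "string to sign" that includes the canonicalized resource (`/bucket/key`), and S3 derives the bucket from the Host header of the incoming request; so if a client keeps a stale Host header after following a 302, the resource S3 reconstructs differs and the signature check fails with a 403. The key and paths below are hypothetical, for illustration only:

```python
import base64
import hashlib
import hmac

def s3_v2_signature(secret_key: str, expires: int, resource: str) -> str:
    """Signature for an S3 v2 pre-signed GET URL (query string auth).

    S3 verifies HMAC-SHA1 over "GET\n\n\n{Expires}\n{CanonicalizedResource}",
    where the resource is /bucket/key and the bucket comes from the Host header.
    """
    string_to_sign = f"GET\n\n\n{expires}\n{resource}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Hypothetical key and resource paths.
key = "not-a-real-secret"
ok = s3_v2_signature(key, 1570053387, "/picosong/YW5d/song.mp3")
# A stale Host header makes S3 resolve a different bucket, so the
# recomputed signature no longer matches the one in the URL -> 403.
bad = s3_v2_signature(key, 1570053387, "/cdn.picosong.com/YW5d/song.mp3")
print(ok != bad)  # True
```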
01:19 Fusl yeah, very likely that
01:57 DogsRNice has quit IRC (Read error: Connection reset by peer)
03:23 qw3rty2 has joined #archiveteam-bs
03:31 icedice has quit IRC (Quit: Leaving)
03:32 qw3rty has quit IRC (Ping timeout: 745 seconds)
03:46 odemgi has joined #archiveteam-bs
03:51 odemgi_ has quit IRC (Read error: Operation timed out)
04:01 SynMonger has quit IRC (Quit: Wait, what?)
04:01 SynMonger has joined #archiveteam-bs
04:25 Lord_Nigh JAA: any news on reverting that patch from falconk which broke CONNECT support in archivebot?
04:25 Lord_Nigh i know you (or was it someone else?) were working on that
04:26 Lord_Nigh since falconk isn't even part of the archivebot project anymore that i can see, i think it's time to bite the bullet personally
06:34 Stiletto has quit IRC ()
06:54 m007a83 has joined #archiveteam-bs
07:00 Ryz has quit IRC (Remote host closed the connection)
07:00 kiska18 has quit IRC (Remote host closed the connection)
07:00 Ryz has joined #archiveteam-bs
07:00 Fusl sets mode: +o Ryz
07:00 Fusl_ sets mode: +o Ryz
07:00 Fusl____ sets mode: +o Ryz
07:00 kiska18 has joined #archiveteam-bs
07:00 Fusl____ sets mode: +o kiska18
07:00 Fusl sets mode: +o kiska18
07:00 Fusl_ sets mode: +o kiska18
07:48 deevious has quit IRC (Quit: deevious)
07:49 Stiletto has joined #archiveteam-bs
08:12 BlueMax has quit IRC (Read error: Connection reset by peer)
08:12 deevious has joined #archiveteam-bs
08:37 JAA markedL: Yeah, but I don't have that issue in qwarc.
08:39 JAA Lord_Nigh: Nope, no news. Too much other stuff on my list to do AB development. The CONNECT thing needs very thorough testing; FalconK actually removed it for a good reason.
08:39 JAA s/AB/wpull/ actually
09:09 Igloo What do you want CONNECT for, Lord_Nigh?
09:14 JAA Igloo: youtube-dl
09:25 Lord_Nigh if it turns out to be an exploitable security hole by using a carefully crafted video name or something on youtube to pwn the archivebot warrior, then maybe it isn't a good idea
09:25 Lord_Nigh (or maybe the warrior needs to do some better string sanitization in that case)
09:25 JAA I don't think it's a security issue. But it might break stuff inside asyncio.
09:26 JAA (Also, the warrior has nothing to do with AB.)
09:26 JAA The CONNECT code abused the private API of asyncio.
09:27 JAA Let's take further details to -dev though.
10:53 schbirid has joined #archiveteam-bs
11:26 HashbangI has quit IRC (Remote host closed the connection)
11:37 HashbangI has joined #archiveteam-bs
12:03 JAA I just looked at those picosong failures again and noticed something: the 403s always come from picosong.s3.amazonaws.com after cdn.picosong.com redirects to there. The retries are only successful if they get the file directly from cdn.picosong.com rather than being redirected. So clearly that "CDN" proxy is the problem here. *However*, I do get legitimate redirects sometimes as well, so it's not as
12:04 JAA simple as treating redirects from cdn.picosong.com as errors.
12:05 JAA Also, there's no point in refetching the /cdn/HEX.mp3 URL; it always redirects to the same URL on cdn.picosong.com once it's been generated.
12:06 JAA (But I'm too lazy to implement that, so I'll keep refetching it.)
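The retry behaviour described above (retries only succeed when the CDN serves the file directly instead of redirecting to S3) can be sketched as a loop. `fetch` here is a hypothetical callable, `fetch(url) -> (status, location, body)` with redirects not followed automatically; this is not qwarc's actual implementation:

```python
# Sketch of the retry loop described above, under the assumption that a
# 403 from S3 may resolve itself when the CDN later serves the file directly.
def fetch_song(fetch, cdn_url, max_tries=5):
    for _ in range(max_tries):
        status, location, body = fetch(cdn_url)
        if status == 200:            # CDN served the file directly
            return body
        if status == 302:            # redirect to S3, which sometimes 403s
            status, _, body = fetch(location)
            if status == 200:
                return body
        # On a 403 from S3, refetch the CDN URL: it always redirects to the
        # same signed URL, but sometimes serves the file directly instead.
    return None

# Simulated responses: the first attempt spills over to S3 and gets a 403,
# the second attempt is served directly by the CDN.
script = iter([
    (302, "https://picosong.s3.amazonaws.com/x.mp3", b""),
    (403, None, b""),
    (200, None, b"mp3 bytes"),
])
print(fetch_song(lambda url: next(script), "https://cdn.picosong.com/x.mp3"))
```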
12:08 Raccoon any way to predict a good working path if it tries to send you to a bad cdn redirect?
12:09 Raccoon or are the redirects always static
12:09 JAA The redirects are generated but then static.
12:09 JAA So when you fetch the song page, it generates the redirect for you it seems.
12:09 Raccoon see, because i noticed this happen on open movie database omdbapi.com for their posters links
12:10 JAA But then that redirect always goes to the same URL on cdn.picosong.com, which will sometimes redirect to S3 and sometimes return the file directly. When it redirects to S3, it sometimes reproducibly fails.
12:10 JAA But the same signature etc. then succeeds once it doesn't redirect.
12:10 Raccoon it would randomly change between cdn hosts to distribute load, but for some reason it cycled a bad host, so we took it out
12:10 Raccoon only the host part of the path was different
12:11 JAA Interesting.
12:11 Raccoon i don't know the story about how it was set up, but found the bug and got it solved
12:14 JAA cdn.picosong.com seems to be hosted at online.net, but they offer load balancers as well, so could be that.
12:18 markedL what's the ID discovery mechanism on picosong? if you export a list of ID's I can run on a different crawler to compare error rates
12:20 JAA I bruteforce all possible IDs.
12:21 JAA For the test, it's a random sample of 29k IDs.
12:22 JAA More precisely, an "item" for me is an ID prefix, e.g. wK2i?. The question mark is then replaced by 0-9A-Za-z. And I run 481 random items covering 62 IDs each.
12:23 JAA Of the 480563 possible items in total.
12:23 JAA I.e. about 1 per mille.
12:24 JAA (Plus a handful of special case IDs that Dash found on scraping the Disqus forums for picosong.)
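JAA's item scheme above can be sketched: an item is an ID prefix ending in `?`, and the `?` expands to each of the 62 characters 0-9A-Za-z. The function name is ours, for illustration:

```python
import string

# The trailing '?' of an AB-style "item" expands to 62 candidate IDs.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 0-9A-Za-z

def expand_item(item: str) -> list:
    """Expand an item like 'wK2i?' into its 62 candidate picosong IDs."""
    assert item.endswith("?")
    prefix = item[:-1]
    return [prefix + c for c in ALPHABET]

ids = expand_item("wK2i?")
print(len(ids))  # 62
```

At 481 items out of 480563 possible, the sample covers roughly one item per mille, matching the figure above.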
12:29 JAA markedL: Here are the 481 items I ran in my last test (they change on each run): https://transfer.notkiska.pw/aIsyO/picosong-test-items
12:30 markedL Ok, I'll run that list
13:37 JAA markedL: So, what does it look like on your end?
13:42 yano has quit IRC (Quit: WeeChat, The Better IRC Client, https://weechat.org/)
13:46 yano has joined #archiveteam-bs
14:24 markedL I'm getting redirects to https://picosong.com/ so far, does it need a referer or cookie?
14:43 JAA markedL: That's inexistent songs.
14:57 slyphic has quit IRC (Read error: Connection reset by peer)
14:58 slyphic has joined #archiveteam-bs
15:16 godane has quit IRC (Ping timeout: 745 seconds)
15:23 godane has joined #archiveteam-bs
17:16 markedL ok, I have something working, do you want it slow or fast, and if fast how many connections?
17:48 JAA markedL: *FAST* :-)
17:49 JAA I did 100 connections and 73k requests in 9 minutes.
17:57 markedL Ok, I went to 50ish and the first one starts to slow down
18:00 markedL ok, that is indeed odd that sometimes it's 200 and sometimes 302 from cdn.picosong.com
18:00 JAA That part is fine. I get redirects with no load in the browser as well. The odd part is just when it fails on the S3 URL then, and that one only works on cdn.picosong.com.
18:07 markedL ok, it finished a run at 53 connections, but I didn't test the AWS possibility so it'll be surprising if that was all captured fine as well
18:09 JAA What didn't you test exactly?
18:10 markedL it thinks it saved 1328 audio files
18:10 markedL is that high or low? I didn't have an example of the AWS redirect so I never checked if that was followed as well, but the logs will tell me now
18:10 JAA Yep, that's the correct number.
18:21 markedL 2656 HTTP/1.1 200
18:21 markedL 29900 HTTP/1.1 302
18:22 JAA How many requests per second did you send?
18:25 markedL I don't think I have that metric, all I know is I ran and fed 53 wget processes simultaneously, but there were pauses for bug fixes. I can run it again more densely.
18:27 JAA Running a test at 50 connections now from here.
18:27 JAA By the way, did you run it from a single IP?
18:27 markedL 1 consumer IP
18:27 JAA Hmm ok no, I'm getting failures at 50 conns as well.
18:28 markedL I can run your qwarc test file?
18:29 JAA 11.2k requests in 80 seconds seem to be too much. lol
18:30 JAA The problem's just: if I don't run it at this request rate, it probably won't finish in time.
18:30 JAA So if it is indeed a slightly overloaded server at cdn.picosong.com causing the problems, well...
18:30 JAA Sucks to be that server.
18:33 JAA I get the errors even at a concurrency of one.
18:34 JAA So uh...
18:34 JAA markedL: You running anything at the moment?
18:35 markedL I was about to start it again at 100 processes
18:35 JAA This is weird. Why would it fail now?
18:35 JAA Unless other people are also hammering the server.
18:37 JAA markedL: If you want to try with qwarc, this is my current code: https://transfer.notkiska.pw/oZaiY/picosong-dev.py
18:39 markedL Ok, I started qwarc with that
18:40 JAA You'll want to set that mmap env var and a --memorylimit unless you like your OOM killer.
18:41 JAA Well, I guess a test run shouldn't use more than 2-3 GiB of RAM, so probably it's fine.
18:41 JAA With the env var, that is.
18:41 JAA Without, it could be much much more.
18:41 JAA "That mmap env var" = MALLOC_MMAP_THRESHOLD_=4096
18:41 JAA (Details in -dev a while ago)
18:44 markedL are there any args that are needed to specify parallelization level?
18:45 JAA --concurrency
18:46 JAA Oh, and --warcdedupe to cut the WARC size in half in this case.
18:47 JAA So MALLOC_MMAP_THRESHOLD_=4096 qwarc --concurrency 100 --warcdedupe --memorylimit 1073741824 picosong-dev.py or similar
18:55 JAA And to get the numbers of successful S3 retrievals: grep -Fh 'MP3 URL' qwarc.*.log | sed "s,^.*',," | sort | uniq -c
18:56 JAA Er, qwarc.log in your case I guess.
18:59 markedL finished... 16 failed five times
18:59 markedL 1288 successful
18:59 JAA Seems low, I guess it finished due to the memory limit.
19:00 JAA Anyway, you're getting failures as well.
19:01 JAA Maybe your request rate with wget just wasn't high enough?
19:01 markedL I'll run wget at 100 processes
19:01 JAA Try to write a log file and get a request rate from that if possible.
19:02 Stiletto has quit IRC (Ping timeout: 252 seconds)
19:09 Stilett0 has joined #archiveteam-bs
19:17 icedice has joined #archiveteam-bs
19:28 markedL still not at the same rate, 9mins for qwarc and 15mins for wget-lua, but let me look at qwarcs
19:28 markedL look at the .warc.gz
19:38 markedL so first difference is I started with /download/$ID/ and qwarc starts with /$ID
19:40 markedL I don't expect that's it, but it could explain a race condition, so moving on
19:47 markedL I'll put the URLS in this https://gist.github.com/marked/e1797350f68e0f45762cf5c8c148fdbf
19:50 JAA Yeah, that's almost certainly not it. I'm seeing the MP3 URLs from both the stream widget and the download fail.
19:58 markedL I have a theory, but it might require another run of both
20:05 Stilett0 is now known as Stiletto
20:11 markedL so here's the unexpected thing after 2 URL traces: In the qwarc warc that has a 403, the wget warc got it from the cdn instead of s3.
20:11 systwi_ is now known as systwi
20:12 markedL it might be because I ran qwarc before wget, so the cdn is populated, but it also could be because the cdn sends any misses to s3 even if they're invalid
20:18 JAA But they're not invalid.
20:18 JAA Most of the time, my retries actually work.
20:18 JAA And those hit the exact same URL.
20:19 JAA Except that eventually the CDN does respond with the file instead of a redirect to S3 which then returns the 403 for some reason.
20:31 icedice has quit IRC (Quit: Leaving)
20:48 markedL that would imply you put load on the CDN enough that it tells you to spill over to S3, and qwarc vs S3 have different success rates getting S3 filled into cloudflare
20:49 markedL but I'll track it down from the qwarc's
20:49 markedL ^warcs
20:50 godane so i found this site for my pocketmags project: http://pocketmagsv2appservice.magazinecloner.com/
20:51 godane looks like the hyperlinks from this api are public : http://pocketmagsv2appservice.magazinecloner.com/GetIssueHyperlinks/GetIssueHyperlinks?issueId=74532
20:51 godane i want some one here to please help find documentation of this api
20:52 godane i want to get the image path ids to grab the bin and jpg files
20:53 godane image paths will look like this : https://mcdatastore.blob.core.windows.net/mcmags/ca74a247-fd3a-4891-885c-9527728ecf0f/44fdcee7-8727-4aca-96d9-55eababff38a/high/0000.bin
21:13 ats has quit IRC (Read error: Operation timed out)
21:14 JAA markedL: Yeah, I don't understand why you're not getting this issue with wget. Are you requesting the same things as my qwarc spec file?
21:50 markedL I found a scenario that's run on both now, looks like qwarc did an extra level of unescaping
21:53 markedL qwarc sent this, is this legal
21:53 markedL GET /YW5d/New%20year's%20day.MP3?Signature=zzwvr3hZ2l/amFOXl8p3SHxkIzA%3D&Expires=1570053387&AWSAccessKeyId=AKIAIVYGJY7
21:53 markedL GGRJY2Y3A HTTP/1.1
21:55 markedL and, it wasn't even given that. it was given Location: http://cdn.picosong.com/YW5d/New%20year%27s%20day.MP3?Signature=zzwvr3hZ2l%2FamFOXl8p3SHxkIzA%3D&Expires=1570
21:55 markedL 053387&AWSAccessKeyId=AKIAIVYGJY7GGRJY2Y3A
21:56 schbirid has quit IRC (Remote host closed the connection)
21:58 JAA Yes, that's valid, and the format qwarc uses is the "normal form" per RFC 7230.
21:58 JAA https://tools.ietf.org/html/rfc7230#section-2.7.3
22:06 ats has joined #archiveteam-bs
22:09 JAA In any case, I'm pretty sure that it's not related to filenames. I couldn't see any patterns in the filenames failing and succeeding.
22:44 BlueMax has joined #archiveteam-bs
22:45 markedL I'm willing to bet money it's this> cat qwarc.log | grep ": 403 " | cut -d" " -f6 | cut -d'?' -f1 | egrep --color=always "[?()'@&'\!,]"
22:46 JAA Sure, I'll take your money. :-)
22:50 JAA I have examples of successful retrievals with filenames containing these: ()'@&!,
22:51 JAA Can't find one with ?, succeeding or failing.
22:51 markedL against s3 or against cdn ?
22:52 JAA CDN
22:54 markedL I said that imprecisely. CDN is signed but will take either form. can't prove it on S3 because we can't generate new URLs for it
22:58 JAA Hmm, I see what you mean.
23:01 JAA This seems odd though. The S3 docs explicitly mention !,'() and a few others as safe special characters.
23:02 godane i'm starting my upload of Scientific American Frontiers
23:02 JAA I wonder if it has something to do with the signature generation involving those characters. And the CDN does some normalisation when attempting to retrieve the file itself.
23:14 JAA Yeah, seems like the picosong servers use the percent-encoded filename to calculate the signature. And presumably when the CDN server receives a request, it normalises the target URL to that form first as well, which is why the request then succeeds there in case it doesn't redirect to S3.
23:16 JAA At least one character is missing in your list, by the way: ;
23:26 JAA So that's happening in yarl's _Quoter class.
23:45 JAA I think this is a bug in yarl. I'll file an issue later or tomorrow. Per RFC 3986, this percent encoding shouldn't be stripped, even though the two URIs (percent-encoded and not) are considered equivalent; only encoded unreserved characters should be decoded. Which seems weird, but that's what the spec says as I understand it.
23:45 JAA In the context of HTTP, the decoded version is the "normal form" as mentioned above.
23:45 JAA But if the signature is generated using the percent-encoded path and then the request comes with the plain characters, the signature comparison fails.
23:46 JAA So that might additionally be a bug in S3. I'm not entirely sure about that though.
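The mismatch JAA describes (server signs the percent-encoded path, client sends the decoded form) can be reproduced with a toy signer. The secret and paths are hypothetical; the point is only that the two equivalent URI forms hash differently:

```python
import hashlib
import hmac

def sign(secret: bytes, path: str) -> str:
    # Stand-in for server-side signing of the resource path.
    return hmac.new(secret, path.encode(), hashlib.sha1).hexdigest()

secret = b"hypothetical-secret"
# The picosong server appears to sign the percent-encoded form ...
signed = sign(secret, "/YW5d/New%20year%27s%20day.MP3")
# ... but aiohttp/yarl normalises %27 back to ' before sending:
sent = sign(secret, "/YW5d/New%20year's%20day.MP3")
print(signed == sent)  # False -> the verifier rejects the request with 403
```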
23:46 JAA In any case, it means that any aiohttp-based tool will break on picosong and similar sites that redirect to S3. Nice...
23:48 JAA Now, how to resolve this? Well, there's no safe method. It's impossible to disable that decoding in yarl. You can disable all encoding handling, but then you rely on the URL being encoded correctly already. That should be fine in this case, but it won't work in the general case.
23:49 JAA This is getting quite deep into -dev territory now though.
23:52 JAA Here's the full list of characters that could be causing this, by the way: a-zA-Z0-9-._~!$'()*,+&=;@:
23:53 JAA If any of these appear in the filename in percent-encoded form, they get decoded and then the signature mismatches.
23:54 JAA None of these should be percent-encoded according to RFC 7230 §2.7.3, but ! and following typically are.
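Under JAA's reading of RFC 3986, a conservative normaliser would decode only percent-escapes of the unreserved set (A-Za-z0-9-._~) and leave everything else encoded. A sketch of that behaviour, as a contrast to what yarl does here (function name ours):

```python
import re

# RFC 3986 "unreserved" characters: the only ones whose percent-escapes
# may safely be decoded during normalisation.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def normalize_path(path: str) -> str:
    """Decode %XX only when it encodes an unreserved character;
    uppercase the hex digits of escapes that stay encoded."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else m.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, path)

print(normalize_path("/YW5d/New%20year%27s%20day.MP3"))
# -> /YW5d/New%20year%27s%20day.MP3  (%27 = ' is reserved-ish, kept encoded)
print(normalize_path("/f%6Fo"))
# -> /foo  (%6F = 'o' is unreserved, so it is decoded)
```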
