[01:07] this sounds quite a bit like the bug I found on
[01:07] the bug I reported for AB running on Sketch/S3 which was maybe more so in wpull
[01:10] trying to remember, must have been something like the Host: header not being reset on the 2nd request after the 302, leading to the signature not matching, which is the 403
[01:16] it should apply to Fusl's experience with Sketch and wpull
[01:16] https://github.com/ArchiveTeam/wpull/issues/425
[01:19] yeah, very likely that
[01:57] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:23] *** qw3rty2 has joined #archiveteam-bs
[03:31] *** icedice has quit IRC (Quit: Leaving)
[03:32] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[03:46] *** odemgi has joined #archiveteam-bs
[03:51] *** odemgi_ has quit IRC (Read error: Operation timed out)
[04:01] *** SynMonger has quit IRC (Quit: Wait, what?)
[04:01] *** SynMonger has joined #archiveteam-bs
[04:25] JAA: any news on reverting that patch from falconk which broke CONNECT support in archivebot?
[04:25] i know you (or was it someone else?) were working on that
[04:26] since falconk isn't even part of the archivebot project anymore that i can see, i think it's time to bite the bullet personally
[06:34] *** Stiletto has quit IRC ()
[06:54] *** m007a83 has joined #archiveteam-bs
[07:00] *** Ryz has quit IRC (Remote host closed the connection)
[07:00] *** kiska18 has quit IRC (Remote host closed the connection)
[07:00] *** Ryz has joined #archiveteam-bs
[07:00] *** Fusl sets mode: +o Ryz
[07:00] *** Fusl_ sets mode: +o Ryz
[07:00] *** Fusl____ sets mode: +o Ryz
[07:00] *** kiska18 has joined #archiveteam-bs
[07:00] *** Fusl____ sets mode: +o kiska18
[07:00] *** Fusl sets mode: +o kiska18
[07:00] *** Fusl_ sets mode: +o kiska18
[07:48] *** deevious has quit IRC (Quit: deevious)
[07:49] *** Stiletto has joined #archiveteam-bs
[08:12] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[08:12] *** deevious has joined #archiveteam-bs
[08:37] markedL: Yeah, but I don't have that issue in qwarc.
[08:39] Lord_Nigh: Nope, no news. Too much other stuff on my list to do AB development. The CONNECT thing needs very thorough testing; FalconK actually removed it for a good reason.
[08:39] s/AB/wpull/ actually
[09:09] What do you want CONNECT for, Lord_Nigh?
[09:14] Igloo: youtube-dl
[09:25] if it turns out to be an exploitable security hole by using a carefully crafted video name or something on youtube to pwn the archivebot warrior, then maybe it isn't a good idea
[09:25] (or maybe the warrior needs to do some better string sanitization in that case)
[09:25] I don't think it's a security issue. But it might break stuff inside asyncio.
[09:26] (Also, the warrior has nothing to do with AB.)
[09:26] The CONNECT code abused the private API of asyncio.
[09:27] Let's take further details to -dev though.
[10:53] *** schbirid has joined #archiveteam-bs
[11:26] *** HashbangI has quit IRC (Remote host closed the connection)
[11:37] *** HashbangI has joined #archiveteam-bs
[12:03] I just looked at those picosong failures again and noticed something: the 403s always come from picosong.s3.amazonaws.com after cdn.picosong.com redirects to there. The retries are only successful if they get the file directly from cdn.picosong.com rather than being redirected. So clearly that "CDN" proxy is the problem here. *However*, I do get legitimate redirects sometimes as well, so it's not as
[12:04] simple as treating redirects from cdn.picosong.com as errors.
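The Host header failure mode described at [01:10] (wpull issue #425) can be sketched as follows. This is a minimal illustration, not wpull's actual code: the helper name and the `reset_host` flag are invented here, and it assumes a server like S3 that validates a signature computed over the Host it receives.

```python
# Sketch of the wpull issue #425 failure mode: following a cross-host
# 302 while reusing the first request's Host: header. S3 derives the
# bucket and signature from the Host it sees, so a stale header makes
# the signature check fail with a 403.
import http.client
from urllib.parse import urlsplit

def fetch_following_redirect(url, reset_host=True):
    first = urlsplit(url)
    conn = http.client.HTTPSConnection(first.hostname)
    target = first.path + (("?" + first.query) if first.query else "")
    conn.request("GET", target, headers={"Host": first.netloc})
    resp = conn.getresponse()
    if resp.status not in (301, 302):
        return resp
    resp.read()  # drain the redirect body before moving on
    redir = urlsplit(resp.getheader("Location"))
    conn2 = http.client.HTTPSConnection(redir.hostname)
    target2 = redir.path + (("?" + redir.query) if redir.query else "")
    # The bug: with reset_host=False the second request keeps the old
    # Host: header, and S3 answers 403 (signature mismatch).
    host = redir.netloc if reset_host else first.netloc
    conn2.request("GET", target2, headers={"Host": host})
    return conn2.getresponse()
```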
[12:05] Also, there's no point in refetching the /cdn/HEX.mp3 URL; it always redirects to the same URL on cdn.picosong.com once it's been generated.
[12:06] (But I'm too lazy to implement that, so I'll keep refetching it.)
[12:08] any way to predict a good working path if it tries to send you to a bad cdn redirect?
[12:09] or are the redirects always static
[12:09] The redirects are generated but then static.
[12:09] So when you fetch the song page, it generates the redirect for you, it seems.
[12:09] see, because i noticed this happen on open movie database omdbapi.com for their poster links
[12:10] But then that redirect always goes to the same URL on cdn.picosong.com, which will sometimes redirect to S3 and sometimes return the file directly. When it redirects to S3, it sometimes reproducibly fails.
[12:10] But the same signature etc. then succeeds once it doesn't redirect.
[12:10] it would randomly change between cdn hosts to distribute load, but for some reason it cycled a bad host, so we took it out
[12:10] only the host part of the URL was different
[12:11] Interesting.
[12:11] i don't know the story about how it was set up, but found the bug and got it solved
[12:14] cdn.picosong.com seems to be hosted at online.net, but they offer load balancers as well, so could be that.
[12:18] what's the ID discovery mechanism on picosong? if you export a list of IDs I can run on a different crawler to compare error rates
[12:20] I bruteforce all possible IDs.
[12:21] For the test, it's a random sample of 29k IDs.
[12:22] More precisely, an "item" for me is an ID prefix, e.g. wK2i?. The question mark is then replaced by 0-9A-Za-z. And I run 481 random items covering 62 IDs each.
[12:23] Of the 480563 possible items in total.
[12:23] I.e. about 1 per mille.
[12:24] (Plus a handful of special case IDs that Dash found when scraping the Disqus forums for picosong.)
[12:29] markedL: Here are the 481 items I ran in my last test (they change on each run): https://transfer.notkiska.pw/aIsyO/picosong-test-items
[12:30] Ok, I'll run that list
[13:37] markedL: So, what does it look like on your end?
[13:42] *** yano has quit IRC (Quit: WeeChat, The Better IRC Client, https://weechat.org/)
[13:46] *** yano has joined #archiveteam-bs
[14:24] I'm getting redirects to https://picosong.com/ so far, does it need a referer or cookie?
[14:43] markedL: That's nonexistent songs.
[14:57] *** slyphic has quit IRC (Read error: Connection reset by peer)
[14:58] *** slyphic has joined #archiveteam-bs
[15:16] *** godane has quit IRC (Ping timeout: 745 seconds)
[15:23] *** godane has joined #archiveteam-bs
[17:16] ok, I have something working, do you want it slow or fast, and if fast how many connections?
[17:48] markedL: *FAST* :-)
[17:49] I did 100 connections and 73k requests in 9 minutes.
[17:57] Ok, I went to 50ish and the first one starts to slow down
[18:00] ok, that is indeed odd that sometimes it's 200 and sometimes 302 from cdn.picosong.com
[18:00] That part is fine. I get redirects with no load in the browser as well. The odd part is just when it fails on the S3 URL then, and that one only works on cdn.picosong.com.
[18:07] ok, it finished a run at 53 connections, but I didn't test the AWS possibility so it'll be surprising if that was all captured fine as well
[18:09] What didn't you test exactly?
[18:10] it thinks it saved 1328 audio files
[18:10] is that high or low?
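For reference, a minimal sketch of the item expansion described at [12:22]: an "item" is an ID prefix, and the trailing character ranges over 0-9A-Za-z, giving 62 IDs per item. The `https://picosong.com/<ID>` URL layout is an assumption based on the redirects mentioned above.

```python
# Expand one picosong "item" (an ID prefix like 'wK2i') into its 62
# candidate song-page URLs, using the 0-9A-Za-z alphabet from the log.
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars

def expand_item(prefix):
    """Yield the 62 candidate song-page URLs for one item, e.g. 'wK2i'."""
    for c in ALPHABET:
        yield f"https://picosong.com/{prefix}{c}"

assert len(list(expand_item("wK2i"))) == 62
```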
[18:10] I didn't have an example of the AWS redirect so I never checked if that was followed as well, but the logs will tell me now
[18:10] Yep, that's the correct number.
[18:21] 2656 HTTP/1.1 200
[18:21] 29900 HTTP/1.1 302
[18:22] How many requests per second did you send?
[18:25] I don't think I have that metric, all I know is I ran and fed 53 wget processes simultaneously, but there were pauses for bug fixes. I can run it again more densely.
[18:27] Running a test at 50 connections now from here.
[18:27] By the way, did you run it from a single IP?
[18:27] 1 consumer IP
[18:27] Hmm ok no, I'm getting failures at 50 conns as well.
[18:28] I can run your qwarc test file?
[18:29] 11.2k requests in 80 seconds seems to be too much. lol
[18:30] The problem's just: if I don't run it at this request rate, it probably won't finish in time.
[18:30] So if it is indeed a slightly overloaded server at cdn.picosong.com causing the problems, well...
[18:30] Sucks to be that server.
[18:33] I get the errors even at a concurrency of one.
[18:34] So uh...
[18:34] markedL: You running anything at the moment?
[18:35] I was about to start it again at 100 processes
[18:35] This is weird. Why would it fail now?
[18:35] Unless other people are also hammering the server.
[18:37] markedL: If you want to try with qwarc, this is my current code: https://transfer.notkiska.pw/oZaiY/picosong-dev.py
[18:39] Ok, I started qwarc with that
[18:40] You'll want to set that mmap env var and a --memorylimit unless you like your OOM killer.
[18:41] Well, I guess a test run shouldn't use more than 2-3 GiB of RAM, so probably it's fine.
[18:41] With the env var, that is.
[18:41] Without, it could be much much more.
[18:41] "That mmap env var" = MALLOC_MMAP_THRESHOLD_=4096
[18:41] (Details in -dev a while ago)
[18:44] are there any args that are needed to specify parallelization level?
[18:45] --concurrency
[18:46] Oh, and --warcdedupe to cut the WARC size in half in this case.
[18:47] So MALLOC_MMAP_THRESHOLD_=4096 qwarc --concurrency 100 --warcdedupe --memorylimit 1073741824 picosong-dev.py or similar
[18:55] And to get the numbers of successful S3 retrievals: grep -Fh 'MP3 URL' qwarc.*.log | sed "s,^.*',," | sort | uniq -c
[18:56] Er, qwarc.log in your case I guess.
[18:59] finished... 16 failed five times
[18:59] 1288 successful
[18:59] Seems low, I guess it finished due to the memory limit.
[19:00] Anyway, you're getting failures as well.
[19:01] Maybe your request rate with wget just wasn't high enough?
[19:01] I'll run wget at 100 processes
[19:01] Try to write a log file and get a request rate from that if possible.
[19:02] *** Stiletto has quit IRC (Ping timeout: 252 seconds)
[19:09] *** Stilett0 has joined #archiveteam-bs
[19:17] *** icedice has joined #archiveteam-bs
[19:28] still not at the same rate, 9mins for qwarc and 15mins for wget-lua, but let me look at qwarcs
[19:28] look at the .warc.gz
[19:38] so first difference is I started with /download/$ID/ and qwarc starts with /$ID
[19:40] I don't expect that's it, but it could explain a race condition, so moving on
[19:47] I'll put the URLs in this https://gist.github.com/marked/e1797350f68e0f45762cf5c8c148fdbf
[19:50] Yeah, that's almost certainly not it. I'm seeing the MP3 URLs from both the stream widget and the download fail.
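A Python equivalent of the one-liner at [18:55], resting on the same assumption about the log format (the MP3 URL is whatever follows the last single quote on lines containing 'MP3 URL'). It reads qwarc.*.log, falling back to qwarc.log for a single-process run as per [18:56]:

```python
# Tally how often each 'MP3 URL' line appears across the qwarc logs,
# mirroring: grep -Fh 'MP3 URL' qwarc.*.log | sed "s,^.*',," | sort | uniq -c
import collections
import glob
import os

paths = glob.glob("qwarc.*.log")
if not paths and os.path.exists("qwarc.log"):
    paths = ["qwarc.log"]

counts = collections.Counter()
for path in paths:
    with open(path, errors="replace") as f:
        for line in f:
            if "MP3 URL" in line:
                # Mirrors sed "s,^.*',,": keep what follows the last quote.
                counts[line.rsplit("'", 1)[-1].strip()] += 1

for url, n in sorted(counts.items()):
    print(n, url)
```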
[19:58] I have a theory, but it might require another run of both
[20:05] *** Stilett0 is now known as Stiletto
[20:11] so here's the unexpected thing after 2 URL traces: In the qwarc warc that has a 403, the wget warc got it from the cdn instead of s3.
[20:11] *** systwi_ is now known as systwi
[20:12] it might be because I ran qwarc before wget, so the cdn is populated, but it also could be because the cdn sends any misses to s3 even if they're invalid
[20:18] But they're not invalid.
[20:18] Most of the time, my retries actually work.
[20:18] And those hit the exact same URL.
[20:19] Except that eventually the CDN does respond with the file instead of a redirect to S3 which then returns the 403 for some reason.
[20:31] *** icedice has quit IRC (Quit: Leaving)
[20:48] that would imply you put enough load on the CDN that it tells you to spill over to S3, and qwarc vs wget have different success rates getting S3 filled into cloudflare
[20:49] but I'll track it down from the qwarc's
[20:49] ^warcs
[20:50] so i found this site for my pocketmags project: http://pocketmagsv2appservice.magazinecloner.com/
[20:51] looks like the hyperlinks from this api are public: http://pocketmagsv2appservice.magazinecloner.com/GetIssueHyperlinks/GetIssueHyperlinks?issueId=74532
[20:51] i want someone here to please help find documentation of this api
[20:52] i want to get the image path ids to grab the bin and jpg files
[20:53] image paths will look like this: https://mcdatastore.blob.core.windows.net/mcmags/ca74a247-fd3a-4891-885c-9527728ecf0f/44fdcee7-8727-4aca-96d9-55eababff38a/high/0000.bin
[21:13] *** ats has quit IRC (Read error: Operation timed out)
[21:14] markedL: Yeah, I don't understand why you're not getting this issue with wget. Are you requesting the same things as my qwarc spec file?
[21:50] I found a scenario that's run on both now, looks like qwarc did an extra level of unescaping
[21:53] qwarc sent this, is this legal
[21:53] GET /YW5d/New%20year's%20day.MP3?Signature=zzwvr3hZ2l/amFOXl8p3SHxkIzA%3D&Expires=1570053387&AWSAccessKeyId=AKIAIVYGJY7GGRJY2Y3A HTTP/1.1
[21:55] and, it wasn't even given that. it was given Location: http://cdn.picosong.com/YW5d/New%20year%27s%20day.MP3?Signature=zzwvr3hZ2l%2FamFOXl8p3SHxkIzA%3D&Expires=1570053387&AWSAccessKeyId=AKIAIVYGJY7GGRJY2Y3A
[21:56] *** schbirid has quit IRC (Remote host closed the connection)
[21:58] Yes, that's valid, and the format qwarc uses is the "normal form" per RFC 7230.
[21:58] https://tools.ietf.org/html/rfc7230#section-2.7.3
[22:06] *** ats has joined #archiveteam-bs
[22:09] In any case, I'm pretty sure that it's not related to filenames. I couldn't see any patterns in the filenames failing and succeeding.
[22:44] *** BlueMax has joined #archiveteam-bs
[22:45] I'm willing to bet money it's this> cat qwarc.log | grep ": 403 " | cut -d" " -f6 | cut -d'?' -f1 | egrep --color=always "[?()'@&'\!,]"
[22:46] Sure, I'll take your money. :-)
[22:50] I have examples of successful retrievals with filenames containing these: ()'@&!,
[22:51] Can't find one with ?, succeeding or failing.
[22:51] against s3 or against cdn ?
[22:52] CDN
[22:52] CDN is unsigned, it'll take either form. have to only check S3 hosts
[22:54] i said that imprecisely. CDN is signed but will take either form. can't prove it on S3 because we can't generate new URLs for it
[22:58] Hmm, I see what you mean.
[23:01] This seems odd though. The S3 docs explicitly mention !,'() and a few others as safe special characters.
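The extra unescaping found at [21:50] can be reproduced with yarl directly. This snippet reuses the Location value from [21:55] and mirrors the behaviour reported in this log; the exact output may vary by yarl version.

```python
# yarl requotes URLs into the RFC 7230 "normal form": percent-escapes
# of characters that need no encoding in that component are decoded.
# If the S3 signature was computed over the %27/%2F form, the rewritten
# request line no longer matches and S3 returns 403.
from yarl import URL

loc = ("http://cdn.picosong.com/YW5d/New%20year%27s%20day.MP3"
       "?Signature=zzwvr3hZ2l%2FamFOXl8p3SHxkIzA%3D"
       "&Expires=1570053387&AWSAccessKeyId=AKIAIVYGJY7GGRJY2Y3A")

print(URL(loc).raw_path_qs)
# %27 -> ' and %2F -> / get decoded (both are allowed unencoded there),
# while %20 and %3D stay encoded, matching the GET line at [21:53].

# encoded=True disables all of yarl's (re-)encoding, keeping %27 intact,
# but then the caller must guarantee the URL is already validly encoded.
print(URL(loc, encoded=True).raw_path_qs)
```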
[23:02] i'm starting my upload of Scientific American Frontiers
[23:02] I wonder if it has something to do with the signature generation involving those characters. And the CDN does some normalisation when attempting to retrieve the file itself.
[23:14] Yeah, seems like the picosong servers use the percent-encoded filename to calculate the signature. And presumably when the CDN server receives a request, it normalises the target URL to that form first as well, which is why the request then succeeds there in case it doesn't redirect to S3.
[23:16] At least one character is missing in your list, by the way: ;
[23:26] So that's happening in yarl's _Quoter class.
[23:45] I think this is a bug in yarl. I'll file an issue later or tomorrow. Per RFC 3986, this percent encoding shouldn't be stripped, even though the two URIs (percent-encoded and not) are considered equivalent; only encoded unreserved characters should be decoded. Which seems weird, but that's what the spec says as I understand it.
[23:45] In the context of HTTP, the decoded version is the "normal form" as mentioned above.
[23:45] But if the signature is generated using the percent-encoded path and then the request comes with the plain characters, the signature comparison fails.
[23:46] So that might additionally be a bug in S3. I'm not entirely sure about that though.
[23:46] In any case, it means that any aiohttp-based tool will break on picosong and similar sites that redirect to S3. Nice...
[23:48] Now, how to resolve this? Well, there's no safe method. It's impossible to disable that decoding in yarl. You can disable all encoding handling, but then you rely on the URL being encoded correctly already. That should be fine in this case, but it won't work in the general case.
[23:49] This is getting quite deep into -dev territory now though.
[23:52] Here's the full list of characters that could be causing this, by the way: a-zA-Z0-9-._~!$'()*,+&=;@:
[23:53] If any of these appear in the filename in percent-encoded form, they get decoded and then the signature mismatches.
[23:54] None of these should be percent-encoded according to RFC 7230 §2.7.3, but ! and following typically are.
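A sketch of a pre-flight check built on the character list at [23:52]: it flags percent-escapes in an S3 path that decode to characters a requoting client like yarl/aiohttp would emit unencoded, breaking the signature match. The character list is copied verbatim from the log; the function name is hypothetical.

```python
# Flag percent-escapes that a requoting client would decode, i.e. the
# escapes of a-zA-Z0-9-._~!$'()*,+&=;@: per the list at [23:52].
import re
import string

DECODED_BY_REQUOTER = set(string.ascii_letters + string.digits + "-._~!$'()*,+&=;@:")

def risky_escapes(path):
    """Return the percent-escapes in `path` that a requoter would decode."""
    return [m.group(0)
            for m in re.finditer(r"%([0-9A-Fa-f]{2})", path)
            if chr(int(m.group(1), 16)) in DECODED_BY_REQUOTER]

print(risky_escapes("/YW5d/New%20year%27s%20day.MP3"))  # ['%27']; %20 is safe
```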