#archiveteam 2012-01-26,Thu


Time Nickname Message
01:16 🔗 Evil_Rob This is relevant to our interests:
01:16 🔗 Evil_Rob http://www.goldenhillsoftware.com/
05:08 🔗 bsmith093 anyone still working on the ffnet grab? i realize it's a somewhat small side-project asteroid compared to the MobileMe blue supergiant, but it's still important
07:56 🔗 Nemo_bis bsmith093, I've not understood what's needed
07:57 🔗 Nemo_bis hmm, I've a wget-warc consuming 95 % CPU and 2 GB RAM
07:59 🔗 Nemo_bis and 2300 MB now :/
07:59 🔗 Nemo_bis wget --mirror
08:20 🔗 Nemo_bis https://github.com/alard/wget-warc/issues/13
09:22 🔗 alard Nemo_bis: The memory usage is most likely the result of the design of wget --mirror, not a problem of the WARC extensions.
09:22 🔗 Nemo_bis alard, yes, I know
09:22 🔗 alard wget --mirror keeps lots of information about the links it finds, all in memory.
09:22 🔗 Nemo_bis but since that was the third result in Google I thought adding more info didn't harm
09:23 🔗 alard Doesn't make it a less annoying problem, true.
09:23 🔗 Nemo_bis and perhaps someone could try that user again
09:28 🔗 Nemo_bis alard, is this it? https://savannah.gnu.org/bugs/index.php?33183
09:30 🔗 alard Not sure, this could refer to something else. The bug report seems to be about large files, whereas the problem you're having probably has to do with the number of links.
09:32 🔗 Nemo_bis but perhaps the reporter got it wrong
09:32 🔗 Nemo_bis it's the only memory bug which mentions --mirror
09:32 🔗 Nemo_bis but yes, this user was only 82 MB so far
09:32 🔗 alard On which part do you get the error, by the way? web/public/?
09:32 🔗 Nemo_bis ehh, how do I know?
09:33 🔗 alard Oh, sorry, this is splinder.
09:33 🔗 alard No, it's not.
09:33 🔗 alard The download script should tell you what it's downloading.
09:33 🔗 Nemo_bis Yes, it's MobileMe.
09:33 🔗 ersi I wouldn't say the memory usage is a defect/bug
09:34 🔗 Nemo_bis But this is before getting there
09:34 🔗 ersi Before getting to large memory usage?
09:34 🔗 Nemo_bis sorry, I was still replying to alard
09:34 🔗 Nemo_bis it's in the first part, Downloading web.me.com/nicolehan
09:35 🔗 Nemo_bis http://p.defau.lt/?I1v3MH2_1xVumrt9A1n19g
09:36 🔗 alard Is there any strange pattern in the wget log for that user?
09:37 🔗 Nemo_bis Uh, yes there is.
09:38 🔗 Nemo_bis http://p.defau.lt/?lpD9fC2RNPxwRkOMHpFfOQ
09:38 🔗 Nemo_bis Lots of 404 errors, perhaps escaping problems?
09:40 🔗 Nemo_bis I don't know whether the memory spiked before or after that, though.
09:40 🔗 Nemo_bis Growing at 3 GB/h, might have originated before.
09:41 🔗 Nemo_bis Seems a user with lots of tags anyway.
09:41 🔗 Nemo_bis ?? 2012-01-26 08:20:59 ERRORE 402: Payment Required.
09:45 🔗 alard Some 404 errors are to be expected, since the script is making urls up based on the feeds.
09:45 🔗 alard 402 Payment Required is mobileme's equivalent of a 404 error, sometimes.
09:47 🔗 Nemo_bis But why is it always requesting the same URL?
09:50 🔗 alard Is that url also in the urls.txt file?
10:05 🔗 Nemo_bis alard, no
10:13 🔗 alard Nemo_bis: Hmm. Then there may be something strange in the pages that confuses the wget link extractor.
10:13 🔗 alard Anyway: if you can't find the problem, just skip this user, it will be retried by someone else later.
10:17 🔗 Nemo_bis alard, sure, I just love debugging :p
18:37 🔗 tef yipdw: btw did that warcbug go upstream to wget ?
18:37 🔗 tef btw hanzo-warc-tools is now on pypi
18:37 🔗 tef http://pypi.python.org/pypi/hanzo-warc-tools/0.2
18:45 🔗 yipdw tef: oh, I haven't filed it
18:45 🔗 yipdw I'll get on that
18:50 🔗 tef cool
18:50 🔗 tef or I can
18:50 🔗 yipdw yeah, whichever is fine
18:51 🔗 tef actually since I don't have it installed or running
18:51 🔗 tef in retrospect maybe you might be better at it :3
18:51 🔗 yipdw I'm not too worried about it, since we do have a fix, and the WARC code isn't even in an official release yet
18:51 🔗 yipdw but getting something in their tracker is good nonetheless
18:51 🔗 tef fix -> postprocessing ?
18:51 🔗 yipdw yeah
18:51 🔗 tef ah
18:51 🔗 yipdw (the bug you're referring to is the erroneous Transfer-Encoding header, right)
18:53 🔗 yipdw I should write "fix"
18:53 🔗 yipdw :P
18:54 🔗 tef yes
18:54 🔗 tef well the header/body mismatch
18:55 🔗 tef the headers refer to the raw body, not to the decoded body. if wget did gzip decoding too, then that might be an issue
18:55 🔗 yipdw yeah
18:55 🔗 yipdw I don't think wget does
18:55 🔗 yipdw gzip decompression that is
19:03 🔗 Nemo_bis there was also a bug in their tracker where wget, instead of downloading a video, misinterprets it as an HTML page, puts it in memory and tries to parse it for links
20:13 🔗 Coderjoe i'm pretty sure it does
20:14 🔗 Coderjoe Nemo_bis: that is probably because the server sent a content type of text/html for the video.
20:42 🔗 yipdw Coderjoe: how do you activate gzip decompression in wget?
20:42 🔗 yipdw I haven't found such an option
20:42 🔗 yipdw and if wget receives a response with Content-Encoding: gzip that it wasn't expecting, it definitely is not decompressed
20:43 🔗 Coderjoe hmm
20:44 🔗 Coderjoe upon further investigation, I was wrong. I thought it was saying it would accept it when it made a request
20:45 🔗 Coderjoe the server definitely shouldn't be returning gzipped data if the client didn't say it could accept it
20:45 🔗 Coderjoe s
20:45 🔗 Coderjoe yay broken config
20:47 🔗 alard yipdw: So, just as a confirmation, the problem is that wget doesn't write the chunk size + \r\n to the WARC file?
20:47 🔗 alard That should be relatively easy to fix. Or is there something else?
20:50 🔗 Coderjoe the problem is that wget should be writing the raw response data and isn't
20:51 🔗 Coderjoe it is instead handling the encodings first
20:51 🔗 Coderjoe so for chunked data it is removing the chunking
20:51 🔗 alard But isn't that the only transfer-encoding it understands?
20:52 🔗 Coderjoe yeah
20:53 🔗 Coderjoe and apparently wget doesn't do gzip at all
20:53 🔗 alard In the fd_read_body function, it starts doing strange things to handle the chunking: https://github.com/alard/wget-warc/blob/master/trunk/src/retr.c#L297
20:54 🔗 alard So I think that it's enough to write the result of that fd_read_line (and the one a little bit further on) to the warc file.
20:58 🔗 alard Also, isn't there a memory leak in that fd_read_body function? I think fd_read_line allocates space for the line, but it's never freed.
21:01 🔗 Coderjoe there is a free(dlbuf) at the bottom of the function
21:02 🔗 Coderjoe or is there another allocation I missed?
21:02 🔗 alard What about char *line = fd_read_line (fd); on line 301?
21:03 🔗 alard As I understand it, fd_read_line calls fd_read_hunk, which does its own xmalloc.
21:06 🔗 Coderjoe https://gist.github.com/1685109
21:08 🔗 Coderjoe that's using my debian-packaged version on a page I know is returning chunked data
21:09 🔗 Coderjoe I'll need to build a version with symbols to know where the leaks are
21:09 🔗 alard I've just added a few free (line); to my version. It doesn't crash.
21:10 🔗 Coderjoe can you remove those temporarily, do a valgrind before and then another one with the frees?
21:11 🔗 alard https://gist.github.com/3b754b4cc74660041201
21:12 🔗 alard Seems to work.
21:12 🔗 Coderjoe indeed
21:12 🔗 alard How does one 'compile with symbols'?
21:12 🔗 Coderjoe -g i think
21:12 🔗 chronomex yeah
21:12 🔗 chronomex I do -ggdb3 for maximal debuggery
21:18 🔗 alard The memory leak is indeed in fd_read_line / fd_read_hunk.
21:28 🔗 SketchCow WHY HELLO GANG
21:32 🔗 Coderjoe well that might explain some of the OOM problems I was having doing recursive scrapes
21:37 🔗 alard Does this look ok? https://gist.github.com/645a6c4b07de5f1e7d67
21:38 🔗 alard (The chunked record starts on line 37.)
21:38 🔗 Coderjoe looks good to me
21:39 🔗 Coderjoe tef, yipdw: how about you?
21:44 🔗 Coderjoe does your symbols build give you insight into where that last leak was?
21:46 🔗 alard https://github.com/alard/wget-warc/blob/master/trunk/src/http.c#L2924
21:46 🔗 alard The string allocated by xstrdup is never freed.
21:47 🔗 alard But it looks a bit scary. I don't know where this local_file comes from, so I don't really want to touch it.
21:47 🔗 Coderjoe yeah. I agree. probably best to mention it in a bug report and let someone that knows more about that code look at it
21:48 🔗 Coderjoe I feel they should add more tests to the test suite which include memory leak checks with valgrind or the like
22:15 🔗 yipdw alard: re: preserving chunks -- that looks better
22:16 🔗 yipdw also, I didn't know about HTTPWatch
22:16 🔗 yipdw that's a pretty awesome site
22:17 🔗 yipdw or at least the gallery is
22:18 🔗 alard That's good to hear, I'm currently preparing a patch + email. (I didn't know HTTPWatch either, it was just the first Google result for chunked encodings.)
22:18 🔗 bsmith093 any updates on ffnet grab? or are we all ahead full on mobile me for now?
22:18 🔗 yipdw alard: actually, one sec
22:19 🔗 yipdw I didn't check that the prefixed byte lengths actually match up
22:19 🔗 yipdw (checking that in a web browser is impossible)
22:19 🔗 yipdw unless there's some hex editor extension for Firefox I don't know about
22:20 🔗 alard Well, all I do in the patch is copying the lines from the HTTP response to the WARC file, so I'd assume that they're correct.
22:20 🔗 yipdw sure
22:20 🔗 yipdw doesn't hurt to double-check :P
22:21 🔗 alard Do you have the wget source somewhere? I can send you the patch so you can try it on your own examples.
22:21 🔗 yipdw I do, yes
22:22 🔗 yipdw I think the chunks being written by wget may not be RFC 2616-compliant; the chunk length and chunk data don't appear to be separated by CRLF
22:22 🔗 yipdw rather it's just LF
22:23 🔗 alard https://gist.github.com/fcbd1025f8a439f811c0
22:23 🔗 alard The line endings may have been mangled when I copied them.
22:23 🔗 yipdw that said, headers are terminated in the WARC by just LF so maybe that's not that big of an issue
22:23 🔗 yipdw yeah
22:24 🔗 alard (Opened the warc file in gedit, copied it to Chrome.)
22:24 🔗 yipdw ahh
22:24 🔗 yipdw I'll apply this patch and try httpwatch again
22:29 🔗 yipdw hmm
22:29 🔗 yipdw neither homebrew nor OS X ship the necessary autotools versions
22:29 🔗 yipdw VM time!
22:50 🔗 alard yipdw: Found any problems, or can I safely push 'Send'?
23:04 🔗 alard Anyway, no need to hurry. If you or anyone else finds something, please let me know. If everything is OK I'll send the patches to the wget mailing list tomorrow morning.
23:14 🔗 yipdw alard: sure thing, will do
