[01:16] This is relevant to our interests:
[01:16] http://www.goldenhillsoftware.com/
[05:08] anyone still working on the ffnet grab? i realize it's a somewhat small side-project asteroid compared to the mobile me blue supergiant, but it's still important
[07:56] bsmith093, I've not understood what's needed
[07:57] hmm, I've a wget-warc consuming 95% CPU and 2 GB RAM
[07:59] and 2300 MB now :/
[07:59] wget --mirror
[08:20] https://github.com/alard/wget-warc/issues/13
[09:22] Nemo_bis: The memory usage is most likely the result of the design of wget --mirror, not a problem of the WARC extensions.
[09:22] alard, yes, I know
[09:22] wget --mirror keeps lots of information about the links it finds, all in memory.
[09:22] but since that was the third result in Google I thought adding more info didn't hurt
[09:23] Doesn't make it a less annoying problem, true.
[09:23] and perhaps someone could try that user again
[09:28] alard, is this it? https://savannah.gnu.org/bugs/index.php?33183
[09:30] Not sure, this could refer to something else. The bug report seems to be about large files, whereas the problem you're having probably has to do with the number of links.
[09:32] but perhaps the reporter got it wrong
[09:32] it's the only memory bug which mentions --mirror
[09:32] but yes, this user was only 82 MB so far
[09:32] On which part do you get the error, by the way? web/public/?
[09:32] ehh, how do I know?
[09:33] Oh, sorry, this is splinder.
[09:33] No, it's not.
[09:33] The download script should tell you what it's downloading.
[09:33] Yes, it's MobileMe.
[09:33] I wouldn't say the memory usage is a defect/bug
[09:34] But this is before getting there
[09:34] Before getting to large memory usage?
[09:34] sorry, I was still replying to alard
[09:34] it's in the first part, Downloading web.me.com/nicolehan
[09:35] http://p.defau.lt/?I1v3MH2_1xVumrt9A1n19g
[09:36] Is there any strange pattern in the wget log for that user?
[09:37] Uh, yes there is.
[09:38] http://p.defau.lt/?lpD9fC2RNPxwRkOMHpFfOQ
[09:38] Lots of 404 errors, perhaps escaping problems?
[09:40] I don't know whether the memory spiked before or after that, though.
[09:40] Growing at 3 GB/h, might have originated before.
[09:41] Seems a user with lots of tags anyway.
[09:41] 2012-01-26 08:20:59 ERROR 402: Payment Required.
[09:45] Some 404 errors are to be expected, since the script is making URLs up based on the feeds.
[09:45] 402 Payment Required is MobileMe's equivalent of a 404 error, sometimes.
[09:47] But why is it always requesting the same URL?
[09:50] Is that URL also in the urls.txt file?
[10:05] alard, no
[10:13] Nemo_bis: Hmm. Then there may be something strange in the pages that confuses the wget link extractor.
[10:13] Anyway: if you can't find the problem, just skip this user, it will be retried by someone else later.
[10:17] alard, sure, I just love debugging :p
[18:37] yipdw: btw did that warc bug go upstream to wget?
[18:37] btw hanzo-warc-tools is now on PyPI
[18:37] http://pypi.python.org/pypi/hanzo-warc-tools/0.2
[18:45] tef: oh, I haven't filed it
[18:45] I'll get on that
[18:50] cool
[18:50] or I can
[18:50] yeah, whichever is fine
[18:51] actually since I don't have it installed or running
[18:51] in retrospect maybe you might be better at it :3
[18:51] I'm not too worried about it, since we do have a fix, and the WARC code isn't even in an official release yet
[18:51] but getting something in their tracker is good nonetheless
[18:51] fix -> postprocessing ?
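To make the WARC bug being discussed here easier to follow (it is spelled out just below: the stored headers describe the raw, still-chunked body, while wget writes the decoded body), here is a minimal standalone C sketch of what a "Transfer-Encoding: chunked" body looks like on the wire versus after decoding. It is an illustration only, with made-up chunk data; it is not wget code and not the actual fix.

    /* Standalone illustration, not wget source: a chunked body as sent on the
     * wire, and the same body after decoding.  If a WARC record keeps the
     * original headers but stores the decoded bytes, the header no longer
     * describes the payload that follows. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Raw wire format: <hex chunk size>CRLF <data>CRLF ... 0 CRLF CRLF */
        const char *raw =
            "7\r\nMobile \r\n"
            "2\r\nMe\r\n"
            "0\r\n\r\n";

        char decoded[64] = "";
        const char *p = raw;
        for (;;) {
            char *end;
            long len = strtol(p, &end, 16);   /* chunk size is hexadecimal */
            p = strstr(end, "\r\n") + 2;      /* skip the CRLF after the size line */
            if (len == 0)
                break;                        /* a 0-sized chunk ends the body */
            strncat(decoded, p, (size_t) len);
            p += len + 2;                     /* skip chunk data and its CRLF */
        }

        printf("raw (chunked) body:  %zu bytes\n", strlen(raw));
        printf("decoded body:        %zu bytes (\"%s\")\n", strlen(decoded), decoded);
        return 0;
    }

Here the raw form is 24 bytes and the decoded form 9, which is why the two options on the table are rewriting the headers afterwards (the postprocessing "fix") or writing the raw response data in the first place.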
[18:51] yeah
[18:51] ah
[18:51] (the bug you're referring to is the erroneous Transfer-Encoding header, right)
[18:53] I should write "fix"
[18:53] :P
[18:54] yes
[18:54] well, the header/body mismatch
[18:55] the headers refer to the raw body, not to the decoded body. if wget did gzip decoding too, then that might be an issue
[18:55] yeah
[18:55] I don't think wget does
[18:55] gzip decompression, that is
[19:03] there was also a bug in their tracker where wget, instead of downloading a video, misunderstands it as an HTML page, puts it in memory and tries to parse it to get links
[20:13] I'm pretty sure it does
[20:14] Nemo_bis: that is probably because the server sent a content type of text/html for the video.
[20:42] Coderjoe: how do you activate gzip decompression in wget?
[20:42] I haven't found such an option
[20:42] and if wget receives a response with Content-Encoding: gzip that it wasn't expecting, it definitely is not decompressed
[20:43] hmm
[20:44] upon further investigation, I was wrong. I thought it was saying it would accept it when it made a request
[20:45] the server definitely shouldn't be returning gzipped data if the client didn't say it could accept it
[20:45] s
[20:45] yay broken config
[20:47] yipdw: So, just as a confirmation, the problem is that wget doesn't write the chunk size + \r\n to the WARC file?
[20:47] That should be relatively easy to fix. Or is there something else?
[20:50] the problem is that wget should be writing the raw response data and isn't
[20:51] it is instead handling the encodings first
[20:51] so for chunked data it is removing the chunking
[20:51] But isn't that the only transfer-encoding it understands?
[20:52] yeah
[20:53] and apparently wget doesn't do gzip at all
[20:53] In the fd_read_body function, it starts doing strange things to handle the chunking: https://github.com/alard/wget-warc/blob/master/trunk/src/retr.c#L297
[20:54] So I think that it's enough to write the result of that fd_read_line (and the one a little bit further on) to the WARC file.
[20:58] Also, isn't there a memory leak in that fd_read_body function? I think fd_read_line allocates space for the line, but it's never freed.
[21:01] there is a free(dlbuf) at the bottom of the function
[21:02] or is there another allocation I missed?
[21:02] What about char *line = fd_read_line (fd); on line 301?
[21:03] As I understand it, fd_read_line calls fd_read_hunk, which does its own xmalloc.
[21:06] https://gist.github.com/1685109
[21:08] that's using my Debian-packaged version on a page I know is returning chunked data
[21:09] I'll need to build a version with symbols to know where the leaks are
[21:09] I've just added a few free (line); to my version. It doesn't crash.
[21:10] can you remove those temporarily, do a valgrind before and then another one with the frees?
[21:11] https://gist.github.com/3b754b4cc74660041201
[21:12] Seems to work.
[21:12] indeed
[21:12] How does one 'compile with symbols'?
[21:12] -g I think
[21:12] yeah
[21:12] I do -ggdb3 for maximal debuggery
[21:18] The memory leak is indeed in fd_read_line / fd_read_hunk.
[21:28] WHY HELLO GANG
[21:32] well that might explain some of the OOM problems I was having doing recursive scrapes
[21:37] Does this look ok? https://gist.github.com/645a6c4b07de5f1e7d67
[21:38] (The chunked record starts on line 37.)
[21:38] looks good to me
[21:39] tef, yipdw: how about you?
[21:44] does your symbols build give you insight into where that last leak was?
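For readers following along, this is the shape of the leak that the free (line); additions above address. The sketch below is a simplified stand-in, not the real retr.c: the only details taken from the discussion are that wget's fd_read_line() hands back a freshly xmalloc'd buffer (via fd_read_hunk) and that the chunk-parsing loop in fd_read_body() never freed it, so every chunk leaked one size line. read_line_stub() and its canned data are invented for the illustration.

    /* Simplified stand-in for the fd_read_body() chunk loop -- not wget code.
     * read_line_stub() plays the role of fd_read_line(): it returns a
     * heap-allocated buffer that the caller owns and must free. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *read_line_stub(int call)
    {
        /* Pretend the server sent two chunks and then the terminating 0. */
        const char *fake[] = { "1000\r\n", "800\r\n", "0\r\n" };
        return strdup(fake[call % 3]);       /* freshly allocated, like xmalloc */
    }

    int main(void)
    {
        long total = 0;
        for (int i = 0; ; i++) {
            char *line = read_line_stub(i);          /* one allocation per chunk */
            long chunk_size = strtol(line, NULL, 16);
            free(line);                              /* the fix: drop this call and
                                                        valgrind reports one leaked
                                                        block per chunk */
            if (chunk_size == 0)
                break;
            total += chunk_size;
            /* a real client would now read chunk_size bytes of chunk data */
        }
        printf("decoded body would be %ld bytes\n", total);
        return 0;
    }

To compare before/after as suggested above, running something like valgrind --leak-check=full on a build compiled with -g (or -ggdb3) shows the per-chunk allocations with and without the free.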
[21:46] https://github.com/alard/wget-warc/blob/master/trunk/src/http.c#L2924
[21:46] The string allocated by xstrdup is never freed.
[21:47] But it looks a bit scary. I don't know where this local_file comes from, so I don't really want to touch it.
[21:47] yeah, I agree. probably best to mention it in a bug report and let someone who knows more about that code look at it
[21:48] I feel they should add more tests to the test suite which include memory-leak checks with valgrind or the like
[22:15] alard: re: preserving chunks -- that looks better
[22:16] also, I didn't know about HTTPWatch
[22:16] that's a pretty awesome site
[22:17] or at least the gallery is
[22:18] That's good to hear, I'm currently preparing a patch + email. (I didn't know HTTPWatch either, it was just the first Google result for chunked encodings.)
[22:18] any updates on the ffnet grab? or are we all ahead full on mobile me for now?
[22:18] alard: actually, one sec
[22:19] I didn't check that the prefixed byte lengths actually match up
[22:19] (checking that in a web browser is impossible)
[22:19] unless there's some hex editor extension for Firefox I don't know about
[22:20] Well, all I do in the patch is copy the lines from the HTTP response to the WARC file, so I'd assume that they're correct.
[22:20] sure
[22:20] doesn't hurt to double-check :P
[22:21] Do you have the wget source somewhere? I can send you the patch so you can try it on your own examples.
[22:21] I do, yes
[22:22] I think the chunks being written by wget may not be RFC 2616-compliant; the chunk length and chunk data don't appear to be separated by CRLF
[22:22] rather, it's just LF
[22:23] https://gist.github.com/fcbd1025f8a439f811c0
[22:23] The line endings may have been mangled when I copied them.
[22:23] that said, headers are terminated in the WARC by just LF, so maybe that's not that big of an issue
[22:23] yeah
[22:24] (Opened the WARC file in gedit, copied it to Chrome.)
[22:24] ahh
[22:24] I'll apply this patch and try HTTPWatch again
[22:29] hmm
[22:29] neither Homebrew nor OS X ship the necessary autotools versions
[22:29] VM time!
[22:50] yipdw: Found any problems, or can I safely push 'Send'?
[23:04] Anyway, no need to hurry. If you or anyone else finds something, please let me know. If everything is OK I'll send the patches to the wget mailing list tomorrow morning.
[23:14] alard: sure thing, will do
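As a footnote on the two checks raised above (do the prefixed byte lengths match up, and are chunk size and chunk data separated by CRLF as RFC 2616 requires), here is a rough, hypothetical C checker. It is not part of alard's patch; it simply walks a chunked payload, for example one cut out of a WARC record, and verifies that each chunk-size line ends in CRLF and that the declared lengths line up with the data. It ignores chunk extensions and trailer headers.

    /* Rough chunked-framing checker for eyeballing a WARC payload -- an
     * illustration, not part of the wget patch under discussion. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Returns 0 if buf holds well-formed chunked data, -1 otherwise. */
    static int check_chunked(const char *buf, size_t len)
    {
        const char *p = buf, *end = buf + len;
        for (;;) {
            char *after;
            long size = strtol(p, &after, 16);
            if (after == p || size < 0)
                return -1;                    /* no hex chunk size where expected */
            if (end - after < 2 || memcmp(after, "\r\n", 2) != 0)
                return -1;                    /* size line must end in CRLF */
            p = after + 2;
            if (size == 0)
                return 0;                     /* last chunk (trailers ignored) */
            if ((size_t)(end - p) < (size_t)size + 2)
                return -1;                    /* declared length overruns buffer */
            p += size;
            if (memcmp(p, "\r\n", 2) != 0)
                return -1;                    /* chunk data must end in CRLF too */
            p += 2;
        }
    }

    int main(void)
    {
        const char good[] = "4\r\nWARC\r\n0\r\n\r\n";
        const char bad[]  = "4\nWARC\n0\n\n";   /* LF-only framing */
        printf("good: %s\n", check_chunked(good, sizeof good - 1) == 0 ? "ok" : "broken");
        printf("bad:  %s\n", check_chunked(bad,  sizeof bad  - 1) == 0 ? "ok" : "broken");
        return 0;
    }

Run as-is it reports "ok" for the properly CRLF-delimited sample and "broken" for the LF-only variant, which is the kind of framing concern raised about the copied gist above.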