Time | Nickname | Message
01:16 | Evil_Rob | This is relevant to our interests:
01:16 | Evil_Rob | http://www.goldenhillsoftware.com/
05:08 | bsmith093 | anyone still working on the ffnet grab? I realize it's a somewhat small side-project asteroid compared to the MobileMe blue supergiant, but it's still important
07:56 | Nemo_bis | bsmith093, I haven't understood what's needed
07:57 | Nemo_bis | hmm, I have a wget-warc consuming 95% CPU and 2 GB RAM
07:59 | Nemo_bis | and 2300 MB now :/
07:59 | Nemo_bis | wget --mirror
08:20 | Nemo_bis | https://github.com/alard/wget-warc/issues/13
09:22 | alard | Nemo_bis: The memory usage is most likely the result of the design of wget --mirror, not a problem of the WARC extensions.
09:22 | Nemo_bis | alard, yes, I know
09:22 | alard | wget --mirror keeps lots of information about the links it finds, all in memory.
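[Note: the bookkeeping alard describes is duplicate detection for the recursive crawl. A minimal, self-contained C sketch of why that grows without bound; the names and structure here are invented for illustration and are not wget's actual code:]

    /* A recursive mirror must remember every URL it has ever seen in
       order not to fetch it twice, so the set only grows for the whole
       run; wget keeps comparable per-URL state in memory. */
    #include <stdlib.h>
    #include <string.h>

    static char **seen_urls;          /* every URL encountered so far */
    static size_t n_seen, capacity;

    static int already_seen (const char *url)
    {
      for (size_t i = 0; i < n_seen; i++)
        if (strcmp (seen_urls[i], url) == 0)
          return 1;
      return 0;
    }

    static void remember_url (const char *url)
    {
      if (n_seen == capacity)
        {
          capacity = capacity ? capacity * 2 : 1024;
          seen_urls = realloc (seen_urls, capacity * sizeof *seen_urls);
        }
      seen_urls[n_seen++] = strdup (url);  /* held until the process exits */
    }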
09:22 | Nemo_bis | but since that was the third result on Google, I thought adding more info couldn't hurt
09:23 | alard | Doesn't make it a less annoying problem, true.
09:23 | Nemo_bis | and perhaps someone could try that user again
09:28 | Nemo_bis | alard, is this it? https://savannah.gnu.org/bugs/index.php?33183
09:30 | alard | Not sure; this could refer to something else. The bug report seems to be about large files, whereas the problem you're having probably has to do with the number of links.
09:32 | Nemo_bis | but perhaps the reporter got it wrong
09:32 | Nemo_bis | it's the only memory bug which mentions --mirror
09:32 | Nemo_bis | but yes, this user was only 82 MB so far
09:32 | alard | On which part do you get the error, by the way? web/public/?
09:32 | Nemo_bis | ehh, how do I know?
09:33 | alard | Oh, sorry, this is Splinder.
09:33 | alard | No, it's not.
09:33 | alard | The download script should tell you what it's downloading.
09:33 | Nemo_bis | Yes, it's MobileMe.
09:34 | ersi | I wouldn't say the memory usage is a defect/bug
09:34 | Nemo_bis | But this is before getting there
09:34 | ersi | Before getting to large memory usage?
09:34 | Nemo_bis | sorry, I was still replying to alard
09:34 | Nemo_bis | it's in the first part, Downloading web.me.com/nicolehan
09:35 | Nemo_bis | http://p.defau.lt/?I1v3MH2_1xVumrt9A1n19g
09:36 | alard | Is there any strange pattern in the wget log for that user?
09:37 | Nemo_bis | Uh, yes, there is.
09:38 | Nemo_bis | http://p.defau.lt/?lpD9fC2RNPxwRkOMHpFfOQ
09:38 | Nemo_bis | Lots of 404 errors; perhaps escaping problems?
09:40 | Nemo_bis | I don't know whether the memory spiked before or after that, though.
09:40 | Nemo_bis | Growing at 3 GB/h, it might have originated before.
09:41 | Nemo_bis | Seems to be a user with lots of tags, anyway.
09:41 | Nemo_bis | 2012-01-26 08:20:59 ERRORE 402: Payment Required. ["ERRORE" is "ERROR" in wget's Italian-locale output]
09:45 | alard | Some 404 errors are to be expected, since the script is making URLs up based on the feeds.
09:45 | alard | 402 Payment Required is MobileMe's equivalent of a 404 error, sometimes.
09:47 | Nemo_bis | But why is it always requesting the same URL?
09:50 | alard | Is that URL also in the urls.txt file?
10:05 | Nemo_bis | alard, no
10:13 | alard | Nemo_bis: Hmm. Then there may be something strange in the pages that confuses the wget link extractor.
10:17 | alard | Anyway: if you can't find the problem, just skip this user; it will be retried by someone else later.
10:17 | Nemo_bis | alard, sure, I just love debugging :p
18:37 | tef | yipdw: btw did that WARC bug go upstream to wget?
18:37 | tef | btw hanzo-warc-tools is now on PyPI
18:37 | tef | http://pypi.python.org/pypi/hanzo-warc-tools/0.2
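[Note: for anyone wanting to try the package, it installs from PyPI in the usual way. The command names below are assumed from the Hanzo warc-tools project and may differ between versions; check the package listing:]

    pip install hanzo-warc-tools
    warcvalid crawl.warc.gz   # exit status says whether the WARC parses
    warcdump crawl.warc.gz    # human-readable listing of the records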
18:45 | yipdw | tef: oh, I haven't filed it
18:45 | yipdw | I'll get on that
18:50 | tef | cool
18:50 | tef | or I can
18:50 | yipdw | yeah, whichever is fine
18:51 | tef | actually, since I don't have it installed or running
18:51 | tef | in retrospect maybe you might be better at it :3
18:51 | yipdw | I'm not too worried about it, since we do have a fix, and the WARC code isn't even in an official release yet
18:51 | yipdw | but getting something in their tracker is good nonetheless
18:51 | tef | fix -> postprocessing?
18:51 | yipdw | yeah
18:51 | tef | ah
18:51 | yipdw | (the bug you're referring to is the erroneous Transfer-Encoding header, right)
18:53 | yipdw | I should write "fix"
18:53 | yipdw | :P
18:54 | tef | yes
18:54 | tef | well, the header/body mismatch
18:55 | tef | the headers refer to the raw body, not to the decoded body. If wget did gzip decoding too, then that might be an issue
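[Note: to make that mismatch concrete, the headers stored in a WARC record describe the body as it travelled over the wire, so storing a decoded body under the original headers produces a record like this invented example, which a strict reader will misparse:]

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked   <- header still promises chunked framing

    <plain, already-dechunked bytes>   <- but the chunk-size lines are gone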
18:55 | yipdw | yeah
18:55 | yipdw | I don't think wget does
18:55 | yipdw | gzip decompression, that is
19:03 | Nemo_bis | there was also a bug in their tracker where wget, instead of downloading a video, misinterprets it as an HTML page, puts it in memory, and tries to parse it for links
20:13 | Coderjoe | i'm pretty sure it does
20:14 | Coderjoe | Nemo_bis: that is probably because the server sent a content type of text/html for the video.
20:42 | yipdw | Coderjoe: how do you activate gzip decompression in wget?
20:42 | yipdw | I haven't found such an option
20:42 | yipdw | and if wget receives a response with Content-Encoding: gzip that it wasn't expecting, it definitely is not decompressed
20:43 | Coderjoe | hmm
20:44 | Coderjoe | upon further investigation, I was wrong. I thought it was saying it would accept it when it made a request
20:45 | Coderjoe | the server definitely shouldn't be returning gzipped data if the client didn't say it could accept it
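[Note: for background, a client lists the codings it can handle in the Accept-Encoding request header; wget does no decompression and, to my knowledge, advertises only identity. The non-compliant exchange being described would look roughly like this invented example:]

    GET /page HTTP/1.1
    Host: example.com
    Accept-Encoding: identity

    HTTP/1.1 200 OK
    Content-Encoding: gzip   <- a coding the client never said it accepts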
20:45 | Coderjoe | s
20:45 | Coderjoe | yay broken config
20:47 | alard | yipdw: So, just as a confirmation, the problem is that wget doesn't write the chunk size + \r\n to the WARC file?
20:47 | alard | That should be relatively easy to fix. Or is there something else?
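[Note: for reference, this is the chunked framing in question, per RFC 2616: a hexadecimal chunk-size line ending in CRLF, that many bytes of data, another CRLF, and a final zero-size chunk. The chunk-size lines and their CRLFs are exactly the bytes that were missing from the WARC records:]

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked

    7\r\n
    Mozilla\r\n
    9\r\n
    Developer\r\n
    0\r\n
    \r\n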
20:50 | Coderjoe | the problem is that wget should be writing the raw response data and isn't
20:51 | Coderjoe | it is instead handling the encodings first
20:51 | Coderjoe | so for chunked data it is removing the chunking
20:52 | alard | But isn't that the only transfer-encoding it understands?
20:52 | Coderjoe | yeah
20:53 | Coderjoe | and apparently wget doesn't do gzip at all
20:53 | alard | In the fd_read_body function, it starts doing strange things to handle the chunking: https://github.com/alard/wget-warc/blob/master/trunk/src/retr.c#L297
20:54 | alard | So I think that it's enough to write the result of that fd_read_line (and the one a little bit further on) to the WARC file.
20:58 | alard | Also, isn't there a memory leak in that fd_read_body function? I think fd_read_line allocates space for the line, but it's never freed.
21:01 | Coderjoe | there is a free(dlbuf) at the bottom of the function
21:02 | Coderjoe | or is there another allocation I missed?
21:02 | alard | What about char *line = fd_read_line (fd); on line 301?
21:03 | alard | As I understand it, fd_read_line calls fd_read_hunk, which does its own xmalloc.
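[Note: a sketch of the allocation pattern under discussion, reconstructed from memory of fd_read_body rather than copied from the source; the caller owns the buffer fd_read_line returns, which is what the free added later in this conversation addresses:]

    /* Inside the chunked-reading loop of fd_read_body (sketch only):
       fd_read_line returns a buffer xmalloc'd by fd_read_hunk, so the
       caller must release it after parsing the chunk size. */
    char *line = fd_read_line (fd);
    if (line == NULL)
      break;                                  /* read error or EOF */
    remaining_chunk_size = strtol (line, NULL, 16);
    free (line);                              /* the missing free */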
21:06 | Coderjoe | https://gist.github.com/1685109
21:08 | Coderjoe | that's using my Debian-packaged version on a page I know is returning chunked data
21:09 | Coderjoe | I'll need to build a version with symbols to know where the leaks are
21:09 | alard | I've just added a few free (line); calls to my version. It doesn't crash.
21:10 | Coderjoe | can you remove those temporarily, do a valgrind before, and then another one with the frees?
21:11 | alard | https://gist.github.com/3b754b4cc74660041201
21:12 | alard | Seems to work.
21:12 | Coderjoe | indeed
21:12 | alard | How does one 'compile with symbols'?
21:12 | Coderjoe | -g, I think
21:12 | chronomex | yeah
21:12 | chronomex | I do -ggdb3 for maximal debuggery
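[Note: putting the two suggestions together, a typical debug build plus leak check looks like this; these are standard gcc and valgrind invocations, not commands taken from the log:]

    CFLAGS='-ggdb3 -O0' ./configure
    make
    valgrind --leak-check=full ./src/wget --mirror http://example.com/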
21:18 | alard | The memory leak is indeed in fd_read_line / fd_read_hunk.
21:28 | SketchCow | WHY HELLO GANG
21:32 | Coderjoe | well, that might explain some of the OOM problems I was having doing recursive scrapes
21:37 | alard | Does this look OK? https://gist.github.com/645a6c4b07de5f1e7d67
21:38 | alard | (The chunked record starts on line 37.)
21:38 | Coderjoe | looks good to me
21:39 | Coderjoe | tef, yipdw: how about you?
21:44 | Coderjoe | does your symbols build give you insight into where that last leak was?
21:46 | alard | https://github.com/alard/wget-warc/blob/master/trunk/src/http.c#L2924
21:46 | alard | The string allocated by xstrdup is never freed.
21:47 | alard | But it looks a bit scary. I don't know where this local_file comes from, so I don't really want to touch it.
21:47 | Coderjoe | yeah, I agree. Probably best to mention it in a bug report and let someone who knows more about that code look at it
21:48 | Coderjoe | I feel they should add more tests to the test suite, including memory-leak checks with valgrind or the like
22:15 | yipdw | alard: re: preserving chunks -- that looks better
22:16 | yipdw | also, I didn't know about HTTPWatch
22:16 | yipdw | that's a pretty awesome site
22:17 | yipdw | or at least the gallery is
22:18 | alard | That's good to hear; I'm currently preparing a patch + email. (I didn't know HTTPWatch either; it was just the first Google result for chunked encodings.)
22:18 | bsmith093 | any updates on the ffnet grab? or are we all ahead full on MobileMe for now?
22:18 | yipdw | alard: actually, one sec
22:19 | yipdw | I didn't check that the prefixed byte lengths actually match up
22:19 | yipdw | (checking that in a web browser is impossible)
22:19 | yipdw | unless there's some hex editor extension for Firefox I don't know about
22:20 | alard | Well, all I do in the patch is copy the lines from the HTTP response to the WARC file, so I'd assume that they're correct.
22:20 | yipdw | sure
22:20 | yipdw | doesn't hurt to double-check :P
22:21 | alard | Do you have the wget source somewhere? I can send you the patch so you can try it on your own examples.
22:21 | yipdw | I do, yes
22:22 | yipdw | I think the chunks being written by wget may not be RFC 2616-compliant; the chunk length and chunk data don't appear to be separated by CRLF
22:22 | yipdw | rather, it's just LF
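[Note: the framing grammar from RFC 2616, section 3.6.1, which requires CRLF in both places yipdw mentions:]

    Chunked-Body   = *chunk
                     last-chunk
                     trailer
                     CRLF
    chunk          = chunk-size [ chunk-extension ] CRLF
                     chunk-data CRLF
    chunk-size     = 1*HEX
    last-chunk     = 1*("0") [ chunk-extension ] CRLF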
22:23 | alard | https://gist.github.com/fcbd1025f8a439f811c0
22:23 | alard | The line endings may have been mangled when I copied them.
22:23 | yipdw | that said, headers are terminated in the WARC by just LF, so maybe that's not that big of an issue
22:24 | yipdw | yeah
22:24 | alard | (Opened the WARC file in gedit, copied it to Chrome.)
22:24 | yipdw | ahh
22:24 | yipdw | I'll apply this patch and try HTTPWatch again
22:29 | yipdw | hmm
22:29 | yipdw | neither Homebrew nor OS X ships the necessary autotools versions
22:29 | yipdw | VM time!
22:50 | alard | yipdw: Found any problems, or can I safely push 'Send'?
23:04 | alard | Anyway, no need to hurry. If you or anyone else finds something, please let me know. If everything is OK, I'll send the patches to the wget mailing list tomorrow morning.
23:14 | yipdw | alard: sure thing, will do