[00:50] in "how the hell does it still exist" news, http://www.mysimon.com/ [01:09] uploaded and checked it: http://archive.org/details/laptops-manuals-dump-from-tim.id.au-20121111 [06:59] Back once again [06:59] Off to the UK Tuesday morning [12:38] i think i got sometime from way back before archvieteam was around [12:38] its called digpicz [12:38] it was to digg for pictures site [12:39] no warc but i have (maybe) full mirror of site after it was shutdown [12:40] including letter to them from digg.com to not use digg as part of there site [12:46] lol [13:32] uploaded volume 20 of smart computing or year 2009 [13:32] example url: http://archive.org/details/smartcomputing-magazine-v20i1 [14:32] item is going to be called BBC_Schools_Progs_VHSRip [16:02] alard: you're probably aware but standalone warctozip chokes on many warcs and online warctozip is down :/ [16:06] balrog: Does it? (The standalone thing might be an older version than the online version.) Is it? underscor maintains warctozip.archive.org. http://warctozip.herokuapp.com/ is an alternative. [16:06] I tried a warc from mobileme with the standalone one and it didn't work [16:06] http://ia601401.us.archive.org/tarview.php?tar=/27/items/mobileme-hero-1341176093/mobileme-full-1341176093.tar&file=full-1341176093%2Fc%2Fca%2Fcal%2Fcalhoun%2Fhomepage.mac.com%2Fhomepage.mac.com-calhoun.warc.gz [16:07] http://warctozip.herokuapp.com/http://ia601401.us.archive.org/tarview.php?tar=/27/items/mobileme-hero-1341176093/mobileme-full-1341176093.tar&file=full-1341176093%2Fc%2Fca%2Fcal%2Fcalhoun%2Fhomepage.mac.com%2Fhomepage.mac.com-calhoun.warc.gz returns an internal server error [16:08] And it is a valid warc.gz? (tarview.php doesn't always work.) [16:09] the standalone one is at https://github.com/alard/warctozip [16:09] how do I verify that it's valid? [16:09] un-gzipping it and looking at it with less *looks* ok [16:11] http://warctozip.herokuapp.com/46875685376-46888852316/homepage.mac.com-calhoun.zip/http://archive.org/download/mobileme-hero-1341176093/mobileme-full-1341176093.tar [16:11] well that's a 62.5gb file [16:13] it's possible that warc is not fully compliant [16:13] It isn't if you include the range. E.g.: curl -L --range 46875685376-46888852316 http://archive.org/download/mobileme-hero-1341176093/mobileme-full-1341176093.tar [16:14] is there a way to verify a warc? some better error reporting would be nice [16:15] "bad prefix on WARC version header" hmm [16:15] Heh, yes. Error reporting is not for lazy people. :) [16:15] "version field is not known (0.17,1.0,0.18)" [16:15] hmmmmmm [16:16] aha [16:16] line endings :/ [16:17] I've just tried the warc-proxy, it can't load the thing either. (But that may be since it shares the same code.) There's apparently something wrong with one of the WARC headers. [16:17] no [16:17] the issue is line endings [16:17] gunzip and look at each file in vim [16:17] this warc and another warc [16:18] Any line I should look for? [16:18] any line, from the beginning [16:19] That looks fine, I think. If I remember correctly the line endings should be \r\n in the WARC headers. 
[16:19] hmm, yeah they ARE \r\n
[16:20] the errors look like this:
[16:20] [('ignored line', 'software: Wget/1.13.4.59-2b1dd (darwin9.8.0)\r\n'), ('ignored line', 'format: WARC File Format 1.0\r\n'), ('version field is not known (0.17,1.0,0.18)', 'WARC_ISO_28500_version1_latestdraft.pdf'), ('bad prefix on WARC version header', 'conformsTo: http://bibnum.bnf.fr/')]
[16:20] warc errors at /Users/me/Downloads/homepage.mac.com-calhoun.warc:405
[16:22] wait, Content-Length: 0?
[16:23] Hmm, that's wrong.
[16:23] the rest of the warc looks fine
[16:24] The Content-Length is always 0 (in the WARC headers, not in the HTTP headers).
[16:24] another WARC I'm looking at has non-zero Content-Lengths in WARC headers
[16:24] http://ia601401.us.archive.org/tarview.php?tar=/25/items/mobileme-hero-mobileme-hero-1331593225/mobileme-hero-1331593225.tar&file=mobileme-hero-1331593225%2Fd%2Fdg%2Fdgc%2Fdgcx%2Fhomepage.mac.com%2Fhomepage.mac.com-dgcx.warc.gz
[16:25] http://warctozip.herokuapp.com/40017991168-40023049642/homepage.mac.com-dgcx.zip/http://archive.org/download/mobileme-hero-mobileme-hero-1331593225/mobileme-hero-1331593225.tar
[16:25] No problem there, it seems.
[16:26] that warc works fine
[16:26] the calhoun one doesn't
[16:26] No, the calhoun one is broken.
[16:27] that's what I just said
[16:27] Yes, and I agreed. :)
[16:27] question is how it's broken and how it can be fixed
[16:27] since the data looks ok
[16:28] also what's the index that warcvalid says?
[16:28] it's certainly not a line number
[16:28] It might have something to do with the Wget version: Wget/1.13.4.59-2b1dd (darwin9.8.0)
[16:28] I don't know, I've never used warcvalid.
[16:28] The byte offset, perhaps?
[16:29] We do have another copy of calhoun, by the way.
[16:29] ok
[16:29] still, it would be nice to fix broken warcs
[16:30] Certainly, and even better to know why they exist in the first place.
[16:31] balrog: It seems that you're the source of that file.
[16:32] probably.
[16:32] it probably was dumped on a big-endian machine
[16:32] I hope that's not the issue
[16:34] it's failing a regex match
[16:34] Well, here's how Wget gets that Content-Length: it writes the response to an open file, does an fseek to the end, uses ftell to determine the length of the file.
[16:34] The regex is not the problem, I think, just a symptom.
[16:35] The Content-Length: 0 header tells the warc reader that the record does not contain any data, and that the next record starts immediately. That 'record' doesn't have a WARC-Type header, obviously, so it breaks down.
[16:35] Have you made any warcs on that system, recently?
[16:36] Do they work?
[16:36] I haven't, that was probably the Mac G5
[16:36] yup, that was the Mac G5
[16:37] right now it's inaccessible but I can test on a Mac G4 probably
[16:37] other WARCs were made on a Mac G4 by Lord_Nigh
[16:37] (or Lord_Nightmare)
[16:38] might be worth checking if they have the same issues
[16:40] http://ia700403.us.archive.org/30/items/archiveteam-mobileme-index/mobileme-20120817.html#randycottingham
[16:40] It seems so.
[16:40] Lots of Content-Length: 0.
[16:41] Is it possible to make a fresh WARC with the latest Wget version?
[16:47] latest being latest git or latest release?
[16:47] Whatever is easiest. They should be the same, at least for the WARC parts.
[16:48] once wget compiles, sure
[16:49] it may take some time on a G4
[16:51] 72843264 89% 21.70kB/s 0:06:53
[16:51] :/
[16:51] world's slowest connection to aus it seems :(
[18:05] aus only has one connection, x.25 over troposcatter packet radio
[18:05] a nickel a minute
[18:39] D:
[18:43] alard: I have here some warcs that were generated with wget-1.14 on debian armel
[18:43] and they have content-length 1 for the warc headers
[18:43] and warcvalid also chokes on them with errors
[18:44] yup
[18:44] wget-1.14 on debian armel spits out Content-Length: 1
[18:44] Is armel also big-endian? (Weird that it is 1, not 0.)
[18:44] little-endian.
[18:44] Hmm.
[18:45] yes, on ppc it writes 0
[18:46] just tested with wget-1.14 from homebrew
[18:46] on intel/x86_64 it works as expected
[18:49] it's only the WARC headers that are wrong, the normal wget headers are correct
[18:49] btw, armel means "ARM, endian little"
[18:49] I think
[18:50] yep
[18:51] anyway, this is annoying, though totally fixable in post
[18:51] you just read the length of the headers and write the correct Content-Length
[18:51] As long as there are no warc headers inside the responses.
[18:51] you mean that you didn't wget a site containing warcs? yeah, but then you could go looking for response headers and ignore all data within those
[18:52] Ah, yes, of course.
[18:52] because all WARC headers start with WARC/1.0 and contain a Content-Length:
[18:52] while HTTP headers start with HTTP/
[18:52] and contain a *valid* Content-Length
[18:53] now where is the code messing this up?
[18:53] that's my question
[18:54] In here: http://git.savannah.gnu.org/cgit/wget.git/tree/src/warc.c
[18:54] I know, I had that open
[18:54] http://git.savannah.gnu.org/cgit/wget.git/tree/src/warc.c#n247 to be exact
[18:55] The extra gzip header might also be incorrect: http://git.savannah.gnu.org/cgit/wget.git/tree/src/warc.c#n196
[18:55] doesn't asprintf return -1 on failure?
[18:56] The documentation says so.
[18:56] well !(-1) == 0
[18:57] That might explain why there isn't an error message. But why does it go wrong?
[18:58] the data in content_length must be corrupted
[18:58] somewhere
[18:58] if the asprintf doesn't complete it would just be uninitialized mem
[18:59] but why does it spit out 0 or 1
[19:00] ftello should be returning the appropriate value
[19:20] alard:
[19:20] char *content_length;
[19:20] off_t data_len;
[19:20] asprintf (&content_length, "%ld", data_len)
[19:20] tell me what's wrong there.
[19:21] or not? this is weird
[19:21] content_length is never getting filled properly
[19:21] data_len is correct
[19:21] hmm, okay...
[19:32] I don't get this
[19:38] I changed the code to look like:
[19:38] data_len = ftello (data_in);
[19:38] if (asprintf (&content_length, "%ld", data_len) == -1)
[19:38] and data_len is correct while content_length is still zero
[19:45] alard: the issue is incorrect use of asprintf
[19:45] you need to use "%jd", not "%ld"
[19:45] also while you're at it, do == -1
[19:45] add jack daniels!
[19:46] alard: how to get this fixed upstream?
[19:52] http://gizmodo.com/5959812/john-mcafee-wanted-for-murder
[19:53] balrog: So %jd does work?
[19:53] yes
[19:53] Problem is, if I see that correctly, that %jd is for int while %ld is for long int.
[19:54] So %jd would break with large files on 64-bit systems?
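(A minimal standalone sketch, not wget's actual code, of the format-specifier mismatch balrog pins down above; the exchange that follows settles the %jd question. The assumption, per the later discussion, is a 32-bit build with large-file support, where off_t is 64 bits wide but long is only 32, so handing an off_t straight to asprintf's "%ld" is undefined behaviour. The variable names and the sample value are made up for illustration.)

```c
#define _GNU_SOURCE              /* asprintf() is a GNU extension */
#include <stdio.h>
#include <inttypes.h>            /* intmax_t */
#include <sys/types.h>           /* off_t */

int main (void)
{
  off_t data_len = 123456;       /* stands in for ftello (data_in) in warc.c */
  char *broken = NULL, *fixed = NULL;

  /* The bug: "%ld" expects a long, but with -D_FILE_OFFSET_BITS=64 on a
     32-bit target off_t is wider than long, so this is undefined behaviour.
     On the PPC/ARM builds discussed here it produced "0" or "1"; on x86_64,
     where long is already 64 bits, it happens to work.  */
  if (asprintf (&broken, "%ld", data_len) == -1)
    return 1;

  /* The portable C99 form: promote to intmax_t and print with "%jd".
     (wget's eventual patch uses wgint and number_to_static_string instead.) */
  if (asprintf (&fixed, "%jd", (intmax_t) data_len) == -1)
    return 1;

  printf ("with %%ld: %s\nwith %%jd: %s\n", broken, fixed);
  return 0;
}
```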
[19:54] j is intmax_t
[19:54] it's whatever is the maximum width supported on the platform
[19:54] so on 64-bit systems, it would be 64 bits
[19:55] on 32-bit systems, it would be 32 bits
[19:55] look it up ;)
[19:55] I may be confused with off_t, here: http://linux.die.net/man/3/fseeko
[19:56] off_t's width may vary
[19:56] Suppose you would compile Wget with large file support, so off_t is a 64-bit number. Would %jd still work?
[19:57] j is C99 though.
[19:58] not sure if wget allows that
[19:58] I think intmax_t is also C99
[19:59] Maybe it's better to use wgint (a Wget type) instead of off_t.
[19:59] what is a wgint?
[20:00] I mean, how wide?
[20:00] http://git.savannah.gnu.org/cgit/wget.git/tree/src/wget.h#n142
[20:00] There's a filesize function in util.c that returns a wgint.
[20:00] yes, that's probably the way to go
[20:00] using that function.
[20:01] and how do you print a wgint?
[20:01] No, not the function, that needs a filename.
[20:01] oh, doesn't work on a file pointer? :\
[20:01] But all it does is fseek to the end, then ftello, so that's the same.
[20:01] Unless there's no fseeko/ftello, but that's another case.
[20:02] number_to_static_string
[20:02] ah
[20:02] yes, that's the way to go here.
[20:03] can you fix it up and get me a patch to test? I don't have the time right now
[20:03] already wasted time debugging this :(
[20:03] Yes, I'll do that. ( http://git.savannah.gnu.org/cgit/wget.git/tree/src/utils.c#n1805 )
[20:04] that concisely describes the problem here ;)
[20:07] Someone's gonna have to write a script or something to fix broken warcs
[20:10] https://gist.github.com/741a969548d892183646
[20:12] Works for me. (But I have a sensible, modern OS. :)
[20:13] It didn't work for me on modern Debian for ARM
[20:13] ;)
[20:15] heh, "doesn't work on edgy new platform that barely anyone runs"
[20:15] :p
[20:16] good that you guys look into these things though
[20:16] <3<3
[20:21] alard: is that patch against 1.14 or git?
[20:22] git, but I think it works for 1.14 as well.
[20:22] It only touches src/warc.c, and that hasn't changed (I think).
[20:22] well that fixed the issue on ppc
[20:30] gonna test on arm
[20:30] how long will it take to get added upstream? :)
[20:41] alard: yes, works on arm too
[20:41] next steps are getting it fixed upstream, and making a warc-fixer
[20:55] balrog: Great. Shall I send the patch to the Wget people?
[20:55] alard: please do.
[20:56] explain that it's a pretty critical issue that breaks WARC on some platforms
[20:56] (It will take a year to get fixed. :)
[20:56] :[
[20:56] they won't be releasing a 1.14.1 sometime soon?
[20:56] or do they only release .x versions to fix security bugs?
[20:56] There was a year between 1.14 and the previous version, I think.
[20:57] Don't know, we'll see.
[20:57] So on which platforms is this a problem? 32 bit? No. Platforms without large-file support?
[20:58] not sure what the trigger is
[20:58] I've had it happen on debian armel and on ppc mac os x 10.5.8
[20:58] which means it probably happens on other arm platforms too, like Raspberry Pi
[20:59] (which is armhf — hf means hardware floating point)
[21:03] anyway how do we fix the warcs that have the wrong values?
[21:04] What should Content-Length be in a WARC?
[21:04] s/WARC/WARC Header/
[21:05] it should be the length of the WARC header, correct?
[21:05] Content-Length should be the length of the record payload.
[21:05] record body, sorry.
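(For reference, this is roughly what a healthy response record header looks like; every value below, the UUID, date, URI and lengths, is invented for illustration. The first Content-Length is the WARC one the buggy builds wrote as 0 or 1: as the next lines clarify, it must equal the size in bytes of everything after the blank line, i.e. the HTTP status line, headers and body. The second Content-Length belongs to the HTTP response itself and was never affected. Per [16:19], these header lines end in \r\n.)

```
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Date: 2012-11-12T16:20:00Z
WARC-Target-URI: http://example.com/
Content-Type: application/http;msgtype=response
Content-Length: 1234

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 987

<html>...
```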
[21:05] this bug caused it to be garbage (0 or 1 in the cases we've looked at, but this is not guaranteed)
[21:05] So that would be the length of the HTTP headers + the HTTP body.
[21:05] Ah, alright. I think I've seen a bug in tef's tools regarding that~ I'll investigatezor
[21:05] content length is how far to seek to skip the record, so headers + body sounds right to me
[21:06] the annoying part is that we have *many* warcs which are broken this way, that need fixing :|
[21:06] :| |: :| |:
[21:06] Well, it's not *that* much of a problem - since it's actually fixable :)
[21:06] since Lord_Nigh and I can't be the only ones who used PPC or ARM machines to participate
[21:06] iterate through ALL the records!
[21:07] Luckily, SketchCow *is* going to do that anyway, for the mobileme data.
[21:07] do what? :P
[21:07] balrog: iterate through all the records
[21:07] might as well fix them while doing that
[21:08] chug chug chug
[21:08] It will make it a lot slower, though.
[21:08] It's worth it, in the long run
[21:08] I think that warcs should be validated on the receiving end when rsynced
[21:09] then errors like this would be caught more quickly
[21:09] maybe only smaller warcs
[21:09] Was the Transfer-Encoding: chunked issue fixed before or during mobileme?
[21:09] Transfer-Encoding: chunked issue? what was that?
[21:09] I think that was during, but I'm not entirely sure
[21:09] Very early versions of the Wget code de-chunked the response but kept the Transfer-Encoding header.
[21:09] oh, oops
[21:10] that's still fixable, though painfully so
[21:10] what concerns me the most is when data is thrown away because of bugs
[21:10] The hanzo warctools fix that (warc2warc, I believe).
[21:10] and I doubt we have that here
[21:10] It's not thrown away.
[21:10] why not modify warc2warc to fix the Content-Length issues? :)
[21:10] It's just a bit retardedly stored
[21:11] There's plenty of ways to fix the WARCs, it's basically a non-issue
[21:11] it's an issue when you're trying to extract data from them
[21:11] Although the megawarc script doesn't check the warc headers, just checks for gzip errors, so these warcs will mess up the megawarc.warc.gz's.
[21:11] I know it's an issue, didn't say it wasn't. mark words dude :p
[21:11] what does megawarc do, cat warcs together?
[21:11] Yes.
[21:12] It checks for gzip errors, if it's a valid gzip named .warc.gz it gets added to the megawarc, if not it's added to a tar file.
[21:22] alard: are you sending the patch to the mailing list?
[21:23] Yes.
[21:23] Have any changes?
[21:23] also, http://www.mail-archive.com/bug-wget@gnu.org/msg02220.html
[21:23] no
[21:23] but that's an interesting thing to look at and probably introduced this bug (while fixing another)
[21:23] The code works fine on many platforms, but it is apparently a problem on some PowerPC and ARM systems, and maybe others as well.
[21:23] There's a somewhat serious issue in the WARC-generating code: on some platforms (presumably the ones where off_t is not a 64-bit number) the Content-Length header at the top of each WARC record has an incorrect length. On these platforms it is sometimes 0, sometimes 1, but never the correct length. This makes the whole WARC file unreadable.
[21:23] Existing WARC files with this problem can be repaired by replacing the value of the Content-Length header with the correct value, for each WARC record in the file. The content of the WARC records is there, it's just the Content-Length header that is wrong.
[21:23] The attached patch fixes the problem in warc.c. It replaces off_t by wgint and uses the number_to_static_string function from util.c.
[21:24] are there any other places where off_t is used "raw"?
[21:24] or size_t
[21:24] Oh, ah, wait, it's in the warc.h.
[21:25] funny that this compiled
[21:25] and worked
[21:28] Fixed, good catch.
[21:28] There's a little bit of size_t in warc.c.
[21:28] But that's just used for string and buffer sizes.
[21:29] there's a little in trunc.c but I think that's ok
[21:29] of off_t
[21:29] size_t has its place ...
[21:30] Where is trunc_t?
[21:30] yeah, size_t does
[21:30] trunc.c, sorry?
[21:30] do not cast size_t :(
[21:30] util/trunc.c
[21:30] I see. (Was looking in src/)
[21:30] every time you truncate an integer god bukkakes on a kitten
[21:31] awesome!
[21:31] trunc.c is probably ok
[21:32] since it looks like a standalone util
[21:32] I think that is everything.
[21:33] yup
[21:37] I'm curious to see which errors this patch will introduce.
[21:38] what sort of errors could it introduce? maybe we should test it more heavily first?
[21:38] I have no idea, but I thought off_t was a good solution.
[21:39] This wgint seems to be how they do it in the other parts of Wget, so it's probably better.
[21:39] off_t is good but isn't portable.
[21:39] I didn't know that, but a bit of googling indicated as much.
[21:39] Would this mean that your ARM system can't download files larger than 2GB?
[21:40] it shouldn't mean that, hmm
[21:40] I'm sure I was able to download 2+gb files before
[21:43] ah
[21:43] give it a try :)
[21:43] getconf LFS_CFLAGS tells you what defines you need for 64-bit off_t
[21:43] and it's -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
[21:43] I think the issue is that the system is 32-bit and the off_t is 64-bit
[21:44] and a long is 32 bits, not 64, typically
[21:44] %ld means long
[21:45] does this all make sense? :)
[21:46] long is 32 bits on 32 bit platforms, 64 bits on 64 bit platforms
[21:46] off_t is 64 bits on all platforms with the proper defines
[21:46] technically, this should have been broken if wget was compiled for 32-bit on any platform (without the patch)
[21:47] huh.
[21:47] and nobody noticed?
[21:47] There aren't that many 2GB files.
[21:48] it doesn't have anything to do with >2gb files
[21:48] ah
[21:48] >2gb files will download fine
[21:48] since that's another part of wget, and the proper flags are being set for 64-bit off_t
[21:48] Yes, but if you have a >2gb file, on a 32-bit system, the WARC record will have an incorrect Content-Length?
[21:49] if you have anything on a 32-bit system the WARC record will have an incorrect Content-Length
[21:49] because it's truncating
[21:49] Anything?
[21:49] Even if the Content-length is <32-bit?
[21:50] in value
[21:50] If it really is any size on a 32-bit system, that seems like something we should have noticed, then.
[21:50] that's what it seemed, my tests were downloading http://google.com/ which is only one small page
[21:50] I'm testing on i386
[21:50] wät, huh cool
[21:51] So it wouldn't crash, but it would print the wrong Content-Length and continue?
[21:51] nope, it's fine on intel-32
[21:51] wat
[21:52] and yes, that's what it was doing
[21:55] The webshots warcs are currently being parsed by the CDX indexer, so they're fine. That implies that there is either no-one with a 32-bit system working on webshots, or that it's not a problem (for those systems).
[21:56] I think the warriors are 32-bits, even.
[21:56] Isn't the VM 32-bit?
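(A tiny check, assuming a 32-bit glibc system, of what those large-file defines change; build it once plainly and once with -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64, as getconf LFS_CFLAGS suggests, and compare. This is only a sketch to illustrate the width mismatch discussed above, not part of wget.)

```c
#include <stdio.h>
#include <sys/types.h>   /* off_t */

int main (void)
{
  printf ("sizeof (long)      = %zu\n", sizeof (long));      /* 4 on 32-bit targets, 8 on 64-bit */
  printf ("sizeof (long long) = %zu\n", sizeof (long long)); /* at least 8 everywhere */
  printf ("sizeof (off_t)     = %zu\n", sizeof (off_t));     /* 8 once large-file support is enabled */
  return 0;
}
```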
[21:56] It may be that asprintf behaves differently on OS X and Linux
[21:56] Gonna test it :)
[21:56] I did some webshots but only on intel
[21:56] Hmm...
[21:56] There might be platform specific behavior of asprintf
[21:56] Or some GNU-specific fix.
[21:58] Then Debian ARM should work.
[22:00] Yes, you'd expect that. Unless the workaround excludes ARM.
[22:00] possible.
[22:00] this is why I hate hacks :/
[22:01] hmmm fascinating hack https://code.google.com/p/linear-book-scanner/
[22:02] alard: on ARM, I can printf an off_t only if I use %jd
[22:02] hmm
[22:03] or %lld. so it's a long long
[22:03] that's the problem
[22:03] stuff becomes "undefined behavior" here
[22:05] Somehow I find this a very tiring subject. :)
[22:05] Why doesn't it just work?!
[22:05] http://en.wikipedia.org/wiki/Undefined_behavior — implementation-defined stuff
[22:06] which you simply cannot rely on
[22:06] "undefined behavior" translates to "sends dirty pictures to your mum"
[22:06] keep that in mind and you'll be ok
[23:57] http://www.infodocket.com/2012/11/12/gif-is-the-oxford-dictionaries-usa-2012-word-of-the-year/
[23:57] so, who's worse: Network Solutions or GoDaddy?
[23:58] what, it took 20 years?
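(Returning to the open question from earlier in the log, how to repair the WARCs that already carry the bad values: a rough, untested sketch of the "read the real length, rewrite Content-Length" approach discussed around [21:03] and after, for an uncompressed .warc, so gunzip a .warc.gz first. It makes the same simplifying assumption raised at [18:51], namely that no record body itself contains a WARC header, loads the whole file into memory, and assumes the file ends with the usual \r\n\r\n record terminator. The file name and all identifiers are hypothetical; this is not an existing ArchiveTeam tool, and warc2warc or the megawarc tooling may be the better place for such a fix.)

```c
/* warc-cl-fix.c -- hypothetical sketch, not a tested tool.
 * Rewrites the Content-Length header of every WARC record in an
 * uncompressed .warc, recomputing it from the distance to the next record. */
#define _GNU_SOURCE              /* memmem() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>             /* strncasecmp() */

static const char REC_SEP[] = "\r\n\r\nWARC/1.0\r\n";  /* end of one record, start of the next */

int main (int argc, char **argv)
{
  if (argc != 3) {
    fprintf (stderr, "usage: %s broken.warc fixed.warc\n", argv[0]);
    return 1;
  }

  /* Read the whole WARC into memory (fine for a sketch; >2GB needs more care). */
  FILE *in = fopen (argv[1], "rb");
  if (!in) { perror (argv[1]); return 1; }
  fseek (in, 0, SEEK_END);
  long len = ftell (in);
  rewind (in);
  char *buf = malloc (len);
  if (!buf || fread (buf, 1, len, in) != (size_t) len) { fprintf (stderr, "read error\n"); return 1; }
  fclose (in);

  FILE *out = fopen (argv[2], "wb");
  if (!out) { perror (argv[2]); return 1; }

  long pos = 0;
  while (pos < len) {
    /* Every record starts with a WARC version line. */
    if (memcmp (buf + pos, "WARC/1.0\r\n", 10) != 0) {
      fprintf (stderr, "no WARC header at offset %ld\n", pos); return 1;
    }
    /* The WARC header block ends at the first blank line. */
    char *hdr_end = memmem (buf + pos, len - pos, "\r\n\r\n", 4);
    if (!hdr_end) { fprintf (stderr, "truncated header\n"); return 1; }
    long block_start = (hdr_end - buf) + 4;

    /* The record block runs up to the "\r\n\r\n" that precedes the next
       record (or the end of the file for the last record). */
    char *next = memmem (buf + block_start, len - block_start, REC_SEP, sizeof REC_SEP - 1);
    long next_rec  = next ? (next - buf) + 4 : len;
    long block_len = (next_rec - 4) - block_start;

    /* Re-emit the WARC headers, replacing Content-Length with the real size
       (the HTTP headers inside the block are left untouched). */
    char *p = buf + pos;
    while (p < hdr_end + 2) {
      char *eol = memmem (p, (size_t)((hdr_end + 2) - p), "\r\n", 2);
      if (!eol) { fprintf (stderr, "malformed header\n"); return 1; }
      if (strncasecmp (p, "Content-Length:", 15) == 0)
        fprintf (out, "Content-Length: %ld\r\n", block_len);
      else
        fwrite (p, 1, (size_t)((eol + 2) - p), out);
      p = eol + 2;
    }
    fwrite ("\r\n", 1, 2, out);                          /* blank line after headers */
    fwrite (buf + block_start, 1, (size_t) block_len, out);
    fwrite ("\r\n\r\n", 1, 4, out);                      /* record terminator */

    pos = next_rec;
  }
  fclose (out);
  free (buf);
  return 0;
}
```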