#archiveteam-bs 2012-11-12,Mon

Time Nickname Message
00:50 ๐Ÿ”— DFJustin in "how the hell does it still exist" news, http://www.mysimon.com/
01:09 ๐Ÿ”— godane uploaded and checked it: http://archive.org/details/laptops-manuals-dump-from-tim.id.au-20121111
06:59 ๐Ÿ”— SketchCow Back once again
06:59 ๐Ÿ”— SketchCow Off to the UK Tuesday morning
12:38 ๐Ÿ”— godane i think i got something from way back before archiveteam was around
12:38 ๐Ÿ”— godane its called digpicz
12:38 ๐Ÿ”— godane it was a digg-for-pictures site
12:39 ๐Ÿ”— godane no warc but i have (maybe) full mirror of site after it was shut down
12:40 ๐Ÿ”— godane including letter to them from digg.com to not use digg as part of their site
12:46 ๐Ÿ”— SmileyG lol
13:32 ๐Ÿ”— godane uploaded volume 20 of smart computing or year 2009
13:32 ๐Ÿ”— godane example url: http://archive.org/details/smartcomputing-magazine-v20i1
14:32 ๐Ÿ”— godane item is going to be called BBC_Schools_Progs_VHSRip
16:02 ๐Ÿ”— balrog alard: you're probably aware but standalone warctozip chokes on many warcs and online warctozip is down :/
16:06 ๐Ÿ”— alard balrog: Does it? (The standalone thing might be an older version than the online version.) Is it? underscor maintains warctozip.archive.org. http://warctozip.herokuapp.com/ is an alternative.
16:06 ๐Ÿ”— balrog I tried a warc from mobileme with the standalone one and it didn't work
16:06 ๐Ÿ”— balrog http://ia601401.us.archive.org/tarview.php?tar=/27/items/mobileme-hero-1341176093/mobileme-full-1341176093.tar&file=full-1341176093%2Fc%2Fca%2Fcal%2Fcalhoun%2Fhomepage.mac.com%2Fhomepage.mac.com-calhoun.warc.gz
16:07 ๐Ÿ”— balrog http://warctozip.herokuapp.com/http://ia601401.us.archive.org/tarview.php?tar=/27/items/mobileme-hero-1341176093/mobileme-full-1341176093.tar&file=full-1341176093%2Fc%2Fca%2Fcal%2Fcalhoun%2Fhomepage.mac.com%2Fhomepage.mac.com-calhoun.warc.gz returns an internal server error
16:08 ๐Ÿ”— alard And it is a valid warc.gz? (tarview.php doesn't always work.)
16:09 ๐Ÿ”— balrog the standalone one is at https://github.com/alard/warctozip
16:09 ๐Ÿ”— balrog how do I verify that it's valid?
16:09 ๐Ÿ”— balrog un-gzipping it and looking at it with less *looks* ok
16:11 ๐Ÿ”— alard http://warctozip.herokuapp.com/46875685376-46888852316/homepage.mac.com-calhoun.zip/http://archive.org/download/mobileme-hero-1341176093/mobileme-full-1341176093.tar
16:11 ๐Ÿ”— balrog well that's a 62.5gb file
16:13 ๐Ÿ”— balrog it's possible that warc is not fully compliant
16:13 ๐Ÿ”— alard It isn't if you include the range. E.g.: curl -L --range 46875685376-46888852316 http://archive.org/download/mobileme-hero-1341176093/mobileme-full-1341176093.tar
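The curl call above pulls one member out of the big tar by byte range. A rough Python equivalent is sketched below; `ranged_request` is a made-up helper name, and the block only builds the request, it does not download anything.

```python
import urllib.request


def ranged_request(url: str, first: int, last: int) -> urllib.request.Request:
    """Build a GET request for bytes first..last (inclusive) of url."""
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=%d-%d" % (first, last))
    return request


# URL and offsets as quoted in the chat.
TAR_URL = ("http://archive.org/download/mobileme-hero-1341176093/"
           "mobileme-full-1341176093.tar")
FIRST, LAST = 46875685376, 46888852316

request = ranged_request(TAR_URL, FIRST, LAST)
size_in_bytes = LAST - FIRST + 1  # both ends of an HTTP byte range are inclusive
```

Passing `request` to `urllib.request.urlopen` would fetch only that slice (about 13 MB rather than the whole tar), assuming the server honours the Range header.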
16:14 ๐Ÿ”— balrog is there a way to verify a warc? some better error reporting would be nice
16:15 ๐Ÿ”— balrog "bad prefix on WARC version header" hmm
16:15 ๐Ÿ”— alard Heh, yes. Error reporting is not for lazy people. :)
16:15 ๐Ÿ”— balrog "version field is not known (0.17,1.0,0.18)"
16:15 ๐Ÿ”— balrog hmmmmmm
16:16 ๐Ÿ”— balrog aha
16:16 ๐Ÿ”— balrog line endings :/
16:17 ๐Ÿ”— alard I've just tried the warc-proxy, it can't load the thing either. (But that may be since it shares the same code.) There's apparently something wrong with one of the WARC headers.
16:17 ๐Ÿ”— balrog no
16:17 ๐Ÿ”— balrog the issue is line endings
16:17 ๐Ÿ”— balrog gunzip and look at each file in vim
16:17 ๐Ÿ”— balrog this warc and another warc
16:18 ๐Ÿ”— alard Any line I should look for?
16:18 ๐Ÿ”— balrog any line, from the beginning
16:19 ๐Ÿ”— alard That looks fine, I think. If I remember correctly the line endings should be \r\n in the WARC headers.
16:19 ๐Ÿ”— balrog hmm, yeah they ARE \r\n
16:20 ๐Ÿ”— balrog the errors look like this:
16:20 ๐Ÿ”— balrog [('ignored line', 'software: Wget/1.13.4.59-2b1dd (darwin9.8.0)\r\n'), ('ignored line', 'format: WARC File Format 1.0\r\n'), ('version field is not known (0.17,1.0,0.18)', 'WARC_ISO_28500_version1_latestdraft.pdf'), ('bad prefix on WARC version header', 'conformsTo: http://bibnum.bnf.fr/')]
16:20 ๐Ÿ”— balrog warc errors at /Users/me/Downloads/homepage.mac.com-calhoun.warc:405
16:22 ๐Ÿ”— balrog wait, Content-Length: 0?
16:23 ๐Ÿ”— alard Hmm, that's wrong.
16:23 ๐Ÿ”— balrog the rest of the warc looks fine
16:24 ๐Ÿ”— alard The Content-Length is always 0 (in the WARC headers, not in the HTTP headers).
16:24 ๐Ÿ”— balrog another WARC I'm looking at has non-zero Content-Lengths in WARC headers
16:24 ๐Ÿ”— balrog http://ia601401.us.archive.org/tarview.php?tar=/25/items/mobileme-hero-mobileme-hero-1331593225/mobileme-hero-1331593225.tar&file=mobileme-hero-1331593225%2Fd%2Fdg%2Fdgc%2Fdgcx%2Fhomepage.mac.com%2Fhomepage.mac.com-dgcx.warc.gz
16:25 ๐Ÿ”— alard http://warctozip.herokuapp.com/40017991168-40023049642/homepage.mac.com-dgcx.zip/http://archive.org/download/mobileme-hero-mobileme-hero-1331593225/mobileme-hero-1331593225.tar
16:25 ๐Ÿ”— alard No problem there, it seems.
16:26 ๐Ÿ”— balrog that warc works fine
16:26 ๐Ÿ”— balrog the calhoun one doesn't
16:26 ๐Ÿ”— alard No, the calhoun one is broken.
16:27 ๐Ÿ”— balrog that's what I just said
16:27 ๐Ÿ”— alard Yes, and I agreed. :)
16:27 ๐Ÿ”— balrog question is how it's broken and how it can be fixed
16:27 ๐Ÿ”— balrog since the data looks ok
16:28 ๐Ÿ”— balrog also what's the index that warcvalid says?
16:28 ๐Ÿ”— balrog it's certainly not line
16:28 ๐Ÿ”— alard It might have something to do with the Wget version: Wget/1.13.4.59-2b1dd (darwin9.8.0)
16:28 ๐Ÿ”— alard I don't know, I've never used warcvalid.
16:28 ๐Ÿ”— alard The byte offset, perhaps?
16:29 ๐Ÿ”— alard We do have another copy of calhoun, by the way.
16:29 ๐Ÿ”— balrog ok
16:29 ๐Ÿ”— balrog still, it would be nice to fix broken warcs
16:30 ๐Ÿ”— alard Certainly, and even better to know why they exist in the first place.
16:31 ๐Ÿ”— alard balrog: It seems that you're the source of that file.
16:32 ๐Ÿ”— balrog probably.
16:32 ๐Ÿ”— balrog it probably was dumped on a big-endian machine
16:32 ๐Ÿ”— balrog I hope that's not the issue
16:34 ๐Ÿ”— balrog it's failing a regex match
16:34 ๐Ÿ”— alard Well, here's how Wget gets that Content-Length: it writes the response to an open file, does an fseek to the end, uses ftell to determine the length of the file.
16:34 ๐Ÿ”— alard The regex is not the problem, I think, just a symptom.
16:35 ๐Ÿ”— alard The Content-Length: 0 header tells the warc reader that the record does not contain any data, and that the next record starts immediately. That 'record' doesn't have a WARC-Type header, obviously, so it breaks down.
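What alard describes can be reproduced with a toy reader. This is a minimal sketch in Python, not the warctools code; `read_records` and the sample bytes are invented for illustration, and the input is assumed to be already gunzipped.

```python
def read_records(data: bytes):
    """Yield (headers, body) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        # A record starts with a version line; the header block ends at a blank line.
        end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:end].split(b"\r\n")
        if not lines[0].startswith(b"WARC/"):
            raise ValueError("bad prefix on WARC version header: %r" % lines[0])
        headers = dict(line.split(b": ", 1) for line in lines[1:])
        length = int(headers[b"Content-Length"])
        body_start = end + 4
        yield headers, data[body_start:body_start + length]
        # Trust Content-Length to find the next record (body + trailing CRLF CRLF).
        pos = body_start + length + 4


# Made-up sample record; BROKEN differs only in the WARC Content-Length.
GOOD = (b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\n"
        b"hello\r\n\r\n")
BROKEN = GOOD.replace(b"Content-Length: 5", b"Content-Length: 0")
```

`list(read_records(GOOD))` returns one record with body `b"hello"`. `list(read_records(BROKEN))` raises `ValueError`, because the reader skips zero bytes of body and then tries to parse leftover payload as the next record's version line.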
16:35 ๐Ÿ”— alard Have you made any warcs on that system, recently?
16:36 ๐Ÿ”— alard Do they work?
16:36 ๐Ÿ”— balrog I haven't, that was probably the Mac G5
16:36 ๐Ÿ”— balrog yup, that was the Mac G5
16:37 ๐Ÿ”— balrog right now it's inaccessible but I can test on a Mac G4 probably
16:37 ๐Ÿ”— balrog other WARCs were made on a Mac G4 by Lord_Nigh
16:37 ๐Ÿ”— balrog (or Lord_Nightmare)
16:38 ๐Ÿ”— balrog might be worth checking if they have the same issues
16:40 ๐Ÿ”— alard http://ia700403.us.archive.org/30/items/archiveteam-mobileme-index/mobileme-20120817.html#randycottingham
16:40 ๐Ÿ”— alard It seems so.
16:40 ๐Ÿ”— alard Lots of Content-Length: 0.
16:41 ๐Ÿ”— alard Is it possible to make a fresh WARC with the latest Wget version?
16:47 ๐Ÿ”— balrog latest being latest git or latest release?
16:47 ๐Ÿ”— alard Whatever is easiest. They should be the same, at least for the WARC parts.
16:48 ๐Ÿ”— balrog once wget compiles, sure
16:49 ๐Ÿ”— balrog it may take some time on a G4
16:51 ๐Ÿ”— SmileyG 72843264 89% 21.70kB/s 0:06:53
16:51 ๐Ÿ”— SmileyG :/
16:51 ๐Ÿ”— SmileyG world's slowest connection to aus it seems :(
18:05 ๐Ÿ”— chronomex aus only has one connection, x.25 over troposcatter packet radio
18:05 ๐Ÿ”— chronomex a nickel a minute
18:39 ๐Ÿ”— SmileyG D:
18:43 ๐Ÿ”— balrog alard: I have here some warcs that were generated with wget-1.14 on debian armel
18:43 ๐Ÿ”— balrog and they have content-length 1 for the warc headers
18:43 ๐Ÿ”— balrog and warcvalid also chokes on them with errors
18:44 ๐Ÿ”— balrog yup
18:44 ๐Ÿ”— balrog wget-1.14 on debian armel spits out Content-Length: 1
18:44 ๐Ÿ”— alard Is armel also big-endian? (Weird that it is 1, not 0.)
18:44 ๐Ÿ”— balrog little-endian.
18:44 ๐Ÿ”— alard Hmm.
18:45 ๐Ÿ”— balrog yes, on ppc it writes 0
18:46 ๐Ÿ”— balrog just tested with wget-1.14 from homebrew
18:46 ๐Ÿ”— balrog on intel/x86_64 it works as expected
18:49 ๐Ÿ”— balrog it's only the WARC headers that are wrong, the normal wget headers are correct
18:49 ๐Ÿ”— balrog btw, armel means "ARM, endian little"
18:49 ๐Ÿ”— balrog I think
18:50 ๐Ÿ”— chronomex yep
18:51 ๐Ÿ”— balrog anyway, this is annoying, though totally fixable in post
18:51 ๐Ÿ”— balrog you just read the length of the headers and write the correct Content-Length
18:51 ๐Ÿ”— alard As long as there are no warc headers inside the responses.
18:51 ๐Ÿ”— balrog you mean that you didn't wget a site containing warcs? yeah, but then you could go looking for response headers and ignore all data within those
18:52 ๐Ÿ”— alard Ah, yes, of course.
18:52 ๐Ÿ”— balrog because all WARC headers start with WARC/1.0 and contain a Content-Length:
18:52 ๐Ÿ”— balrog while HTTP headers start with HTTP/
18:52 ๐Ÿ”— balrog and contain a *valid* Content-Length
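The repair idea could be sketched roughly like this in Python. It is only an illustration under the assumptions stated in the chat: the input is already gunzipped, every record starts with `WARC/1.0`, and no payload contains a blank line followed by `WARC/1.0` (alard's caveat about archived WARCs). `fix_content_lengths` is a hypothetical name, not an existing tool.

```python
import re

# A record ends with CRLF CRLF; the next one begins with the version line.
BOUNDARY = re.compile(rb"\r\n\r\n(?=WARC/1\.0\r\n)")
LENGTH_LINE = re.compile(rb"^Content-Length: *\d+", re.MULTILINE | re.IGNORECASE)


def fix_content_lengths(data: bytes) -> bytes:
    """Rewrite the WARC-level Content-Length of every record in data."""
    if data.endswith(b"\r\n\r\n"):
        data = data[:-4]  # drop the final record's terminator
    fixed = []
    for record in BOUNDARY.split(data):
        # Everything after the first blank line is the record body
        # (for a response record: HTTP headers plus HTTP body).
        head, _, body = record.partition(b"\r\n\r\n")
        new_line = b"Content-Length: %d" % len(body)
        # Only the WARC header block is touched; HTTP headers live in the body.
        head = LENGTH_LINE.sub(lambda match: new_line, head, count=1)
        fixed.append(head + b"\r\n\r\n" + body + b"\r\n\r\n")
    return b"".join(fixed)
```

Running it twice gives the same output as running it once, which is a cheap sanity check. A real fixer would also have to re-gzip each record and would want a stricter boundary test than this regular expression.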
18:53 ๐Ÿ”— balrog now where is the code messing this up?
18:53 ๐Ÿ”— balrog that's my question
18:54 ๐Ÿ”— alard In here: http://git.savannah.gnu.org/cgit/wget.git/tree/src/warc.c
18:54 ๐Ÿ”— balrog I know, I had that open
18:54 ๐Ÿ”— balrog http://git.savannah.gnu.org/cgit/wget.git/tree/src/warc.c#n247 to be exact
18:55 ๐Ÿ”— alard The extra gzip header might also be incorrect: http://git.savannah.gnu.org/cgit/wget.git/tree/src/warc.c#n196
18:55 ๐Ÿ”— balrog doesn't asprintf return -1 on failure?
18:56 ๐Ÿ”— alard The documentation says so.
18:56 ๐Ÿ”— balrog well !(-1) == 0
18:57 ๐Ÿ”— alard That might explain why there isn't an error message. But why does it go wrong?
18:58 ๐Ÿ”— balrog the data in content_length must be corrupted
18:58 ๐Ÿ”— balrog somewhere
18:58 ๐Ÿ”— balrog if the asprintf doesn't complete it would just be uninitialized mem
18:59 ๐Ÿ”— balrog but why does it spit out 0 or 1
19:00 ๐Ÿ”— balrog ftello should be returning the appropriate value
19:20 ๐Ÿ”— balrog alard:
19:20 ๐Ÿ”— balrog char *content_length;
19:20 ๐Ÿ”— balrog off_t data_len;
19:20 ๐Ÿ”— balrog asprintf (&content_length, "%ld", data_len)
19:20 ๐Ÿ”— balrog tell me what's wrong there.
19:21 ๐Ÿ”— balrog or not? this is weird
19:21 ๐Ÿ”— balrog content_length is never getting filled properly
19:21 ๐Ÿ”— balrog data_len is correct
19:21 ๐Ÿ”— balrog hmm, okay...
19:32 ๐Ÿ”— balrog I don't get this
19:38 ๐Ÿ”— balrog I changed the code to look like:
19:38 ๐Ÿ”— balrog data_len = ftello (data_in);
19:38 ๐Ÿ”— balrog if (asprintf (&content_length, "%ld", data_len) == -1)
19:38 ๐Ÿ”— balrog and data_len is correct while content_length is still zero
19:45 ๐Ÿ”— balrog alard: the issue is incorrect use of asprintf
19:45 ๐Ÿ”— balrog you need to use "%jd", not "%ld"
19:45 ๐Ÿ”— balrog also while you're at it, do == -1
19:45 ๐Ÿ”— SmileyG add jack daniels!
19:46 ๐Ÿ”— balrog alard: how to get this fixed upstream?
19:52 ๐Ÿ”— DFJustin http://gizmodo.com/5959812/john-mcafee-wanted-for-murder
19:53 ๐Ÿ”— alard balrog: So %jd does work?
19:53 ๐Ÿ”— balrog yes
19:53 ๐Ÿ”— alard Problem is, if I see that correctly, that %jd is for int while %ld is for long int.
19:54 ๐Ÿ”— alard So %jd would break with large files on 64-bit systems?
19:54 ๐Ÿ”— balrog j is intmax_t
19:54 ๐Ÿ”— balrog it's whatever is the maximum width supported on the platform
19:54 ๐Ÿ”— balrog so on 64-bit systems, it would be 64-bits
19:55 ๐Ÿ”— balrog on 32-bit systems, it would be 32-bits
19:55 ๐Ÿ”— balrog look it up ;)
19:55 ๐Ÿ”— alard I may be confused with off_t, here: http://linux.die.net/man/3/fseeko
19:56 ๐Ÿ”— balrog off_t's width may vary
19:56 ๐Ÿ”— alard Suppose you would compile Wget with large file support, so off_t is a 64-bit number. Would %jd still work?
19:57 ๐Ÿ”— balrog j is C99 though.
19:58 ๐Ÿ”— balrog not sure if wget allows that
19:58 ๐Ÿ”— balrog I think intmax_t is also C99
19:59 ๐Ÿ”— alard Maybe it's better to use wgint (a Wget type) instead of off_t.
19:59 ๐Ÿ”— balrog what is a wgint?
20:00 ๐Ÿ”— balrog I mean, how wide?
20:00 ๐Ÿ”— alard http://git.savannah.gnu.org/cgit/wget.git/tree/src/wget.h#n142
20:00 ๐Ÿ”— alard There's a filesize function in util.c that returns a wgint.
20:00 ๐Ÿ”— balrog yes, that's probably the way to go
20:00 ๐Ÿ”— balrog using that function.
20:01 ๐Ÿ”— balrog and how do you print a wgint?
20:01 ๐Ÿ”— alard No, not the function, that needs a filename.
20:01 ๐Ÿ”— balrog oh, doesn't work on a file pointer? :\
20:01 ๐Ÿ”— alard But all it does is fseek to the end, then ftello, so that's the same.
20:01 ๐Ÿ”— alard Unless there's no fseeko/ftello, but that's another case.
20:02 ๐Ÿ”— alard number_to_static_string
20:02 ๐Ÿ”— balrog ah
20:02 ๐Ÿ”— balrog yes, that's the way to go here.
20:03 ๐Ÿ”— balrog can you fix it up and get me a patch to test? I don't have the time right now
20:03 ๐Ÿ”— balrog already wasted time debugging this :(
20:03 ๐Ÿ”— alard Yes, I'll do that. ( http://git.savannah.gnu.org/cgit/wget.git/tree/src/utils.c#n1805 )
20:04 ๐Ÿ”— balrog that concisely describes the problem here ;)
20:07 ๐Ÿ”— balrog Someone's gonna have to write a script or something to fix broken warcs
20:10 ๐Ÿ”— alard https://gist.github.com/741a969548d892183646
20:12 ๐Ÿ”— alard Works for me. (But I have a sensible, modern OS. :)
20:13 ๐Ÿ”— balrog It didn't work for me on modern Debian for ARM
20:13 ๐Ÿ”— balrog ;)
20:15 ๐Ÿ”— ersi heh, "doesn't work on edgy new platform that barely no one runs"
20:15 ๐Ÿ”— ersi :p
20:16 ๐Ÿ”— ersi good that you guys look into these things though
20:16 ๐Ÿ”— ersi <3<3
20:21 ๐Ÿ”— balrog alard: is that patch against 1.14 or git?
20:22 ๐Ÿ”— alard git, but I think it works for 1.14 as well.
20:22 ๐Ÿ”— alard It only touches src/warc.c, and that hasn't changed (I think).
20:22 ๐Ÿ”— balrog well that fixed the issue on ppc
20:30 ๐Ÿ”— balrog gonna test on arm
20:30 ๐Ÿ”— balrog how long will it take to get added upstream? :)
20:41 ๐Ÿ”— balrog alard: yes, works on arm too
20:41 ๐Ÿ”— balrog next steps are getting it fixed upstream, and making a warc-fixer
20:55 ๐Ÿ”— alard balrog: Great. Shall I send the patch to the Wget people?
20:55 ๐Ÿ”— balrog alard: please do.
20:56 ๐Ÿ”— balrog explain that it's a pretty critical issue that breaks WARC on some platforms
20:56 ๐Ÿ”— alard (It will take a year to get fixed. :)
20:56 ๐Ÿ”— balrog :[
20:56 ๐Ÿ”— balrog they won't be releasing a 1.14.1 sometime soon?
20:56 ๐Ÿ”— balrog or do they only release .x versions to fix security bugs?
20:56 ๐Ÿ”— alard There was a year between 1.14 and the previous version, I think.
20:57 ๐Ÿ”— alard Don't know, we'll see.
20:57 ๐Ÿ”— alard So on which platforms is this a problem? 32 bit? No. Platforms without large-file support?
20:58 ๐Ÿ”— balrog not sure what's the trigger
20:58 ๐Ÿ”— balrog I've had it happen on debian armel and on ppc mac os x 10.5.8
20:58 ๐Ÿ”— balrog which means it probably happens on other arm platforms too, like Raspberry Pi
20:59 ๐Ÿ”— balrog (which is armhf; hf means hardware floating point)
21:03 ๐Ÿ”— balrog anyway how do we fix the warcs that have the wrong values?
21:04 ๐Ÿ”— ersi What should Content-Length be in a WARC?
21:04 ๐Ÿ”— ersi s/WARC/WARC Header/
21:05 ๐Ÿ”— balrog it should be the length of the WARC header, correct?
21:05 ๐Ÿ”— alard Content-Length should be the length of the record payload.
21:05 ๐Ÿ”— alard record body, sorry.
21:05 ๐Ÿ”— balrog this bug caused it to be garbage (0 or 1 in the cases we've looked at, but this is not guaranteed)
21:05 ๐Ÿ”— alard So that would be the length of the HTTP headers + the HTTP body.
21:05 ๐Ÿ”— ersi Ah, alright. I think I've seen a bug in tef's tools regarding that~ I'll investigatezor
21:05 ๐Ÿ”— chronomex content length is how far to seek to skip the record, so headers + body sounds right to me
21:06 ๐Ÿ”— balrog the annoying part is that we have *many* warcs which are broken this way, that need fixing :|
21:06 ๐Ÿ”— chronomex :| |: :| |:
21:06 ๐Ÿ”— ersi Well, it's not *that* much of a problem - since it's actually fixable :)
21:06 ๐Ÿ”— balrog since Lord_Nigh and I can't be the only ones who used PPC or ARM machines to participate
21:06 ๐Ÿ”— ersi iterate through ALL the records!
21:07 ๐Ÿ”— alard Luckily, SketchCow *is* going to do that anyway, for the mobileme data.
21:07 ๐Ÿ”— balrog do what? :P
21:07 ๐Ÿ”— ersi balrog: iterate through all the records
21:07 ๐Ÿ”— ersi might as well fix as doing that
21:08 ๐Ÿ”— ersi chug chug chug
21:08 ๐Ÿ”— alard It will make it a lot slower, though.
21:08 ๐Ÿ”— ersi It's worth it, in the long run
21:08 ๐Ÿ”— balrog I think that warcs should be validated on the receiving end when rsynced
21:09 ๐Ÿ”— balrog then errors like this would be caught more quickly
21:09 ๐Ÿ”— balrog maybe only smaller warcs
21:09 ๐Ÿ”— alard Was the Transfer-Encoding: chunked issue fixed before or during mobileme?
21:09 ๐Ÿ”— balrog Transfer-Encoding: chunked issue? what was that?
21:09 ๐Ÿ”— ersi I think that was during, but I'm not entirely sure
21:09 ๐Ÿ”— alard Very early versions of the Wget code de-chunked the response but kept the Transfer-Encoding header.
21:09 ๐Ÿ”— balrog oh, oops
21:10 ๐Ÿ”— balrog that's still fixable, though painfully so
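For context on the chunked issue: a stored response that still says Transfer-Encoding: chunked but whose body was already de-chunked will confuse any reader that trusts the header. A minimal de-chunker looks roughly like the Python sketch below (`dechunk` is an illustrative name; trailers after the last chunk are ignored).

```python
def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body into the plain payload."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # Chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:eol].split(b";", 1)[0], 16)
        if size == 0:
            break  # the zero-size chunk marks the end of the body
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(out)
```

`dechunk(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")` gives `b"hello world"`. Feeding it a body that was already de-chunked, such as plain HTML, raises `ValueError`, which is roughly how the stale header shows up in practice.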
21:10 ๐Ÿ”— balrog what concerns me the most is when data is thrown away because of bugs
21:10 ๐Ÿ”— alard The hanzo warctools fix that (warc2warc, I believe).
21:10 ๐Ÿ”— balrog and I doubt we have that here
21:10 ๐Ÿ”— alard It's not thrown away.
21:10 ๐Ÿ”— balrog why not modify warc2warc to fix the Content-Length issues? :)
21:10 ๐Ÿ”— ersi It's just a bit retardedly stored
21:11 ๐Ÿ”— ersi There's plenty of ways to fix the WARCs, it's basically a non-issue
21:11 ๐Ÿ”— balrog it's an issue when you're trying to extract data from them
21:11 ๐Ÿ”— alard Although the megawarc script doesn't check the warc headers, just checks for gzip errors, so these warcs will mess up the megawarc.warc.gz's.
21:11 ๐Ÿ”— ersi I know it's an issue, didn't say it wasn't. mark words dude :p
21:11 ๐Ÿ”— balrog what does megawarc do, cat warcs together?
21:11 ๐Ÿ”— alard Yes.
21:12 ๐Ÿ”— alard It checks for gzip errors, if it's a valid gzip named .warc.gz it gets added to the megawarc, if not it's added to a tar file.
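The sorting rule alard describes can be sketched as follows. This is not the actual megawarc script, just an illustration of the check in Python; `classify` and its return values are invented names.

```python
import gzip
import zlib


def classify(filename: str, data: bytes) -> str:
    """Return "megawarc" for a readable gzip named *.warc.gz, else "tar"."""
    if not filename.endswith(".warc.gz"):
        return "tar"
    try:
        gzip.decompress(data)  # also handles multi-member gzip files
    except (OSError, EOFError, zlib.error):
        return "tar"  # corrupt or truncated gzip
    return "megawarc"
```

Note that a WARC with a wrong Content-Length is still a perfectly valid gzip, so a check like this sends it to the megawarc. Catching it would need a pass over the WARC headers as well.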
21:22 ๐Ÿ”— balrog alard: are you sending the patch to the mailing list?
21:23 ๐Ÿ”— alard Yes.
21:23 ๐Ÿ”— alard Have any changes?
21:23 ๐Ÿ”— balrog also, http://www.mail-archive.com/bug-wget@gnu.org/msg02220.html
21:23 ๐Ÿ”— balrog no
21:23 ๐Ÿ”— balrog but that's an interesting thing to look at and probably introduced this bug (while fixing another)
21:23 ๐Ÿ”— alard The code works fine on many platforms, but it is apparently a problem on some PowerPC and ARM systems, and maybe others as well.
21:23 ๐Ÿ”— alard There's a somewhat serious issue in the WARC-generating code: on some platforms (presumably the ones where off_t is not a 64-bit number) the Content-Length header at the top of each WARC record has an incorrect length. On these platforms it is sometimes 0, sometimes 1, but never the correct length. This makes the whole WARC file unreadable.
21:23 ๐Ÿ”— alard Existing WARC files with this problem can be repaired by replacing the value of the Content-Length header with the correct value, for each WARC record in the file. The content of the WARC records is there, it's just the Content-Length header that is wrong.
21:23 ๐Ÿ”— alard The attached patch fixes the problem in warc.c. It replaces off_t by wgint and uses the number_to_static_string function from util.c.
21:24 ๐Ÿ”— balrog are there any other places where off_t is used "raw"?
21:24 ๐Ÿ”— balrog or size_t
21:24 ๐Ÿ”— alard Oh, ah, wait, it's in the warc.h.
21:25 ๐Ÿ”— balrog funny that this compiled
21:25 ๐Ÿ”— balrog and worked
21:28 ๐Ÿ”— alard Fixed, good catch.
21:28 ๐Ÿ”— alard There's a little bit of size_t in warc.c.
21:28 ๐Ÿ”— alard But that's just used for string and buffer sizes.
21:29 ๐Ÿ”— balrog there's a little in trunc.c but I think that's ok
21:29 ๐Ÿ”— balrog of off_t
21:29 ๐Ÿ”— chronomex size_t has its place ...
21:30 ๐Ÿ”— alard Where is trunc_t?
21:30 ๐Ÿ”— balrog yeah, size_t does
21:30 ๐Ÿ”— alard trunc.c, sorry?
21:30 ๐Ÿ”— chronomex do not cast size_t :(
21:30 ๐Ÿ”— balrog util/trunc.c
21:30 ๐Ÿ”— alard I see. (Was looking in src/)
21:30 ๐Ÿ”— chronomex every time you truncate an integer god bukkakes on a kitten
21:31 ๐Ÿ”— ersi awesome!
21:31 ๐Ÿ”— balrog trunc.c is probably ok
21:32 ๐Ÿ”— balrog since it looks like a standalone util
21:32 ๐Ÿ”— alard I think that is everything.
21:33 ๐Ÿ”— balrog yup
21:37 ๐Ÿ”— alard I'm curious to see which errors this patch will introduce.
21:38 ๐Ÿ”— balrog what sort of errors could it introduce? maybe we should test it more heavily first?
21:38 ๐Ÿ”— alard I have no idea, but I thought off_t was a good solution.
21:39 ๐Ÿ”— alard This wgint seems to be how they do it in the other parts of Wget, so it's probably better.
21:39 ๐Ÿ”— balrog off_t is good but isn't portable.
21:39 ๐Ÿ”— balrog I didn't know that, but a bit of googling indicated as much.
21:39 ๐Ÿ”— alard Would this mean that your ARM system can't download files larger than 2GB?
21:40 ๐Ÿ”— balrog it shouldn't mean that, hmm
21:40 ๐Ÿ”— balrog I'm sure I was able to download 2+gb files before
21:43 ๐Ÿ”— balrog ah
21:43 ๐Ÿ”— ersi give it a try :)
21:43 ๐Ÿ”— balrog getconf LFS_CFLAGS tells you what defines you need for 64-bit off_t
21:43 ๐Ÿ”— balrog and it's -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
21:43 ๐Ÿ”— balrog I think the issue is that the system is 32-bit and the off_t is 64-bit
21:44 ๐Ÿ”— balrog and a long is 32 bits, not 64, typically
21:44 ๐Ÿ”— balrog %ld means long
21:45 ๐Ÿ”— balrog does this all make sense? :)
21:46 ๐Ÿ”— balrog long is 32 bits on 32 bit platforms, 64 bits on 64 bit platforms
21:46 ๐Ÿ”— balrog off_t is 64 bits on all platforms with the proper defines
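The widths balrog lists can be checked on whatever machine is at hand. A small Python sketch using ctypes follows; ctypes has no off_t, so only long, long long and the pointer size are shown, and `c_type_bits` is a made-up helper name.

```python
import ctypes


def c_type_bits() -> dict:
    """Return the width in bits of a few C types as seen by this Python build."""
    return {
        "long": ctypes.sizeof(ctypes.c_long) * 8,
        "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
        "pointer": ctypes.sizeof(ctypes.c_void_p) * 8,
    }


for name, bits in c_type_bits().items():
    print("%-9s %d bits" % (name, bits))
```

On 32-bit Linux and on 64-bit Windows, long comes out as 32 bits while long long is 64, which is the combination where "%ld" and a 64-bit off_t stop agreeing.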
21:46 ๐Ÿ”— balrog technically, this should have been broken if wget was compiled for 32-bit on any platform (without the patch)
21:47 ๐Ÿ”— chronomex huh.
21:47 ๐Ÿ”— chronomex and nobody noticed?
21:47 ๐Ÿ”— alard There aren't that many 2GB files.
21:48 ๐Ÿ”— balrog it doesn't have anything to do with >2gb files
21:48 ๐Ÿ”— chronomex ah
21:48 ๐Ÿ”— balrog >2gb files will download fine
21:48 ๐Ÿ”— balrog since that's another part of wget, and the proper flags are being set for 64-bit off_t
21:48 ๐Ÿ”— alard Yes, but if you have a >2gb file, on a 32-bit system, the WARC record will have an incorrect Content-Length?
21:49 ๐Ÿ”— balrog if you have anything on a 32-bit system the WARC record will have an incorrect Content-Length
21:49 ๐Ÿ”— balrog because it's truncating
21:49 ๐Ÿ”— alard Anything?
21:49 ๐Ÿ”— ersi Even if the Content-length is <32-bit?
21:50 ๐Ÿ”— ersi in value
21:50 ๐Ÿ”— alard If it really is any size on a 32-bit system, that seems like something we should have noticed, then.
21:50 ๐Ÿ”— balrog that's what it seemed, my tests were downloading http://google.com/ which is only one small page
21:50 ๐Ÿ”— balrog I'm testing on i386
21:50 ๐Ÿ”— ersi wät, huh cool
21:51 ๐Ÿ”— alard So it wouldn't crash, but it would print the wrong Content-Length and continue?
21:51 ๐Ÿ”— balrog nope, it's fine on intel-32
21:51 ๐Ÿ”— balrog wat
21:52 ๐Ÿ”— balrog and yes, that's what it was doing
21:55 ๐Ÿ”— alard The webshots warcs are currently being parsed by the CDX indexer, so they're fine. That implies that there is either no-one with a 32-bit system working on webshots, or that it's not a problem (for those systems).
21:56 ๐Ÿ”— alard I think the warriors are 32-bits, even.
21:56 ๐Ÿ”— ersi Isn't the VM 32-bit?
21:56 ๐Ÿ”— balrog It may be that asprintf behaves differently on OS X and Linux
21:56 ๐Ÿ”— balrog Gonna test it :)
21:56 ๐Ÿ”— balrog I did some webshots but only on intel
21:56 ๐Ÿ”— balrog Hmm...
21:56 ๐Ÿ”— balrog There might be platform specific behavior of asprintf
21:56 ๐Ÿ”— alard Or some GNU-specific fix.
21:58 ๐Ÿ”— balrog Then Debian ARM should work.
22:00 ๐Ÿ”— alard Yes, you'd expect that. Unless the workaround excludes ARM.
22:00 ๐Ÿ”— balrog possible.
22:00 ๐Ÿ”— balrog this is why I hate hacks :/
22:01 ๐Ÿ”— chronomex hmmm fascinating hack https://code.google.com/p/linear-book-scanner/
22:02 ๐Ÿ”— balrog alard: on ARM, I can printf an off_t only if I use %jd
22:02 ๐Ÿ”— balrog hmm
22:03 ๐Ÿ”— balrog or %lld. so it's a long long
22:03 ๐Ÿ”— balrog that's the problem
22:03 ๐Ÿ”— balrog stuff becomes "undefined behavior" here
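To see why a mismatched format can give platform-dependent garbage, here is a toy simulation in Python: an 8-byte integer is laid out in memory and only the first 4 bytes are read back. This is an illustration, not a diagnosis. Real varargs passing depends on the ABI, and the sketch does not explain the 1 seen on armel. `misread_as_32bit` is an invented name.

```python
import struct


def misread_as_32bit(value: int, byteorder: str) -> int:
    """Store value as a 64-bit integer, then read only its first 4 bytes."""
    prefix = "<" if byteorder == "little" else ">"
    raw = struct.pack(prefix + "q", value)          # 8 bytes, like a 64-bit off_t
    return struct.unpack(prefix + "i", raw[:4])[0]  # 4 bytes, like a 32-bit long
```

For a length of 13166941, the big-endian layout yields 0 (the high half of the number), while the little-endian layout happens to yield the right value. That fits the 0 seen on ppc, but it is only one possible mechanism.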
22:05 ๐Ÿ”— alard Somehow I find this a very tiring subject. :)
22:05 ๐Ÿ”— alard Why doesn't it just work?!
22:05 ๐Ÿ”— balrog http://en.wikipedia.org/wiki/Undefined_behavior - implementation-defined stuff
22:06 ๐Ÿ”— balrog which you simply cannot rely on
22:06 ๐Ÿ”— chronomex "undefined behavior" translates to "sends dirty pictures to your mum"
22:06 ๐Ÿ”— chronomex keep that in mind and you'll be ok
23:57 ๐Ÿ”— dashcloud http://www.infodocket.com/2012/11/12/gif-is-the-oxford-dictionaries-usa-2012-word-of-the-year/
23:57 ๐Ÿ”— dashcloud so, who's worse: Network Solutions or GoDaddy?
23:58 ๐Ÿ”— chronomex what, it took 20 years?
