#archiveteam 2012-03-05,Mon

Time Nickname Message
07:40 db48x alard: found an error
07:40 db48x Sleep and retry.
07:40 db48x ERROR contacting tracker. Could not mark 'es/arcoiris/metaf%edsica' done.
07:40 db48x Telling tracker that 'es/arcoiris/metaf%edsica' is done.
07:40 db48x Sleep and retry.
08:19 db48x I still taste garlic
08:49 db48x -rw-rw-r--. 1 db48x db48x 456M Mar 5 00:49 com-emachines-e5-20120305-014015.warc.gz
08:53 db48x oh man
08:53 db48x a lot of the filenames in this directory aren't unicode
08:54 ersi so? :)
08:55 db48x how will anyone know what they mean?
08:56 db48x -rw-rw-r--. 1 db48x db48x 50K Mar 5 00:11 h1-2-2.û?꺰??(??ߣܬ??)(T).htm
08:56 db48x completely mangled
08:56 ersi that's not mangled, that's you missing CJK fonts most likely
08:57 db48x it has no cjk characters in it
08:57 ersi the filename is there though
08:57 ersi it has at least one, I got it displayed
08:57 db48x that's a coincidence of the unicode stack you and I are using to communicate
08:57 ersi Oh, heh. my bad
08:57 db48x it's modern mojibake
08:57 yipdw I guess you can try to run it through iconv, if you've got the filename in a log or something
08:57 ersi yeah. Know the source encoding?
08:58 yipdw hell if I know what the source is
08:58 db48x that's supposedly a SYRIAC LETTER TAW in there too
08:58 db48x which is RTL and triggers bidi rendering
08:58 chronomex yaaay RTL
08:58 db48x ersi: nope. no way to know
08:58 db48x the content of the file is no better
08:59 yipdw what's the name of the account?
08:59 db48x I guess you could view it in a bunch of different code pages and see what looks readable
08:59 yipdw or rather
08:59 db48x emachines
08:59 yipdw what's the URI
08:59 yipdw hmm
08:59 db48x com/emachines/e5
09:00 db48x http://www.fortunecity.com/emachines/e5/136/Educa/h-k-1.htm, for example
09:00 yipdw oh jesus christ what a mess
09:01 yipdw fortunecity is awfully slow too
09:01 yipdw oh
09:01 yipdw it's EUC-JP
09:01 * db48x tries some encodings
09:01 yipdw I think
09:01 yipdw wait no
09:02 yipdw if it IS, that'd be odd as fuck
09:02 db48x Firefox says EUC-KR when I do autodetection for east asia
09:02 joepie91 h god fortunecity
09:02 joepie91 oh*
09:02 yipdw EUC-KR makes more sense
09:02 yipdw lots of 404s that way, though
09:03 joepie91 whoa, fortunecity is shutting down? ._.
09:03 yipdw heh wow
09:03 yipdw <body bgcolor="white" text="black" link="blue" vlink="purple" alink="red" background="../Image/한글배경6.gif">
09:03 yipdw old constructs NEVER die
09:04 db48x :)
09:06 kennethre yipdw: it's ISO-8859-1 apparently
09:07 kennethre according to my http library
09:07 chronomex codepages must die
09:08 yipdw kennethre: that page is?
09:09 db48x kennethre: you can't do korean in 8859-1 :)
09:09 kennethre yipdw: yeah
09:09 db48x or jp or whatever it really is
09:09 yipdw if I apply ISO-8859-1 to that page I get gibberish
09:09 kennethre this page? http://www.fortunecity.com/emachines/e5/136/Educa/h-k-1.htm
09:09 yipdw yes
09:09 db48x kennethre: that page
09:09 db48x the server doesn't report an encoding
09:09 yipdw I'm guessing EUC-KR, just because the title graphic is in Korean
09:09 db48x there's no meta tag
09:09 kennethre db48x: yeah that's chardet's best guess for the encoding
09:10 kennethre it may have a lot of gibberish, but all the chars are valid ;)
09:10 db48x because there's a lot of html in it
09:10 db48x so that biases it
09:10 db48x kennethre: :)
09:10 kennethre ah it's cp932
09:10 db48x look at all those valid bytes :)
09:11 kennethre a lot of Japanese sites do that
09:11 kennethre (Windows-31J)
09:11 kennethre yeah it works perfectly then
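
The "view it in a bunch of different code pages" idea can be sketched in Python. This is only an illustration, not what anyone in the channel ran: the sample bytes are a made-up stand-in (the word 한글 encoded as EUC-KR, not bytes from the fortunecity page), and `try_decodings` is a hypothetical helper name.

```python
# Candidate encodings named in the discussion, using Python's codec names.
CANDIDATES = ["euc_kr", "euc_jp", "cp932", "iso-8859-1"]


def try_decodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)  # strict mode raises on bad bytes
        except UnicodeDecodeError:
            results[name] = None
    return results


# Stand-in for page bytes of unknown encoding (assumption: Korean in EUC-KR).
sample = "한글".encode("euc_kr")

results = try_decodings(sample)
for name, text in results.items():
    print(f"{name:12} {text!r}")
```

ISO-8859-1 assigns a character to every possible byte, so that decode can never fail. A clean decode is therefore weak evidence on its own, which seems to be the point behind "look at all those valid bytes".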
09:11 yipdw I'm pretty sure it's not Japanese :P
09:11 kennethre it is
09:11 yipdw well
09:12 yipdw http://www.fortunecity.com/emachines/e5/136/Image/jaul.gif isn't
09:12 yipdw :P
09:12 yipdw I mean
09:12 kennethre http://cl.ly/2b0o1D3n1q2u0D0F3M2f
09:12 kennethre looks damn valid to me
09:12 yipdw I understand it's possible for a page to be written using a Japanese encoding but with hangul in graphics
09:12 joepie91 fwiw, script for fortunecity works here
09:12 yipdw but the fact that it's hangul in the graphic suggests to me that it's Korean
09:12 kennethre well whatever it is, i think that's the right encoding
09:12 ersi yipdw: that looks korean
09:12 yipdw ersi: it is
09:12 kennethre http://cl.ly/1b2s3S181M1O1R1z1G15
09:13 kennethre maybe not perfect quite though
09:13 ersi and that looks like katakana
09:13 yipdw I can't connect to cl.ly for some reason
09:13 yipdw WHAT THE HELL INTERNET
09:13 ersi WHAT THE INTERNETS
09:13 kennethre yipdw: flush dns cache :)
09:13 yipdw oh there it goes
09:14 yipdw kennethre: re: http://cl.ly/1b2s3S181M1O1R1z1G15 -> I doubt that's valid
09:14 yipdw I mean, yes, they're characters
09:14 kennethre yeah, but close :P
09:14 yipdw but nobody uses half-width katakana and kanji to name things
09:14 kennethre never say never :P
09:14 yipdw true
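
The half-width katakana observation can be turned into a rough plausibility check. The sketch below assumes made-up sample bytes (한글 in EUC-KR) and hypothetical helper names. It shows why bytes in the 0xA1 to 0xDF range decode "validly" under cp932 but come out as half-width katakana, which is a hint that the guess is wrong rather than proof.

```python
def ratio_in_range(text, low, high):
    """Fraction of non-ASCII characters whose code point is in [low, high]."""
    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return 0.0
    hits = sum(1 for ch in non_ascii if low <= ord(ch) <= high)
    return hits / len(non_ascii)


def hangul_ratio(text):
    return ratio_in_range(text, 0xAC00, 0xD7A3)  # Hangul Syllables block


def halfwidth_katakana_ratio(text):
    # Halfwidth katakana and its punctuation in the Halfwidth Forms block.
    return ratio_in_range(text, 0xFF61, 0xFF9F)


raw = "한글".encode("euc_kr")  # illustrative bytes, not taken from the page
as_korean = raw.decode("euc_kr")
as_cp932 = raw.decode("cp932")

print("euc_kr:", hangul_ratio(as_korean), halfwidth_katakana_ratio(as_korean))
print("cp932: ", hangul_ratio(as_cp932), halfwidth_katakana_ratio(as_cp932))
```

A real page mixes two-byte sequences that cp932 reads as kanji with single bytes it reads as katakana, so the ratios would be less clear-cut than in this toy input.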
09:15 yipdw the person running this site could have been an asshole :P
09:15 ersi douchery is allowed on the internets
09:15 yipdw I have one other test though
09:15 yipdw I do not read Korean, but I know someone who does
09:15 yipdw so if I give her the EUC-KR translation and it makes sense
09:15 yipdw well
09:15 yipdw yeah
09:16 yipdw I can say that that page -- while rendering ok under EUC-JP -- is not Japanese at all
09:16 yipdw same with any Chinese encoding
09:16 ersi fortuneshitty sure is slow now >_>
09:16 yipdw again, not hard proof, but I like to think that most pages on the interwebs are written with an encoding that is appropriate for the source language
09:17 db48x heh
09:17 kennethre or super shitty software that is clueless
09:17 yipdw it is of course always possible that "charmsol" is a dick who decided it'd be funny to upload Chinese stuff encoded in Windows-31J and add a title graphic in Korean
09:18 db48x you'd have to be pretty perverse to do that
09:18 yipdw yeah, especially since you can't encode Chinese in Windows-31J
09:19 yipdw I mean I guess you could use the visually similar characters between kanji and hanzi
09:19 yipdw but you'd probably offend both Japanese and Chinese people that way
09:19 yipdw :P
09:19 yipdw this fucking fortunecity page could start the next major Asian war
09:20 db48x lol
09:20 db48x good thing we're keeping it around then
09:20 yipdw yeah
09:20 yipdw Han unification was too tame
09:20 db48x would hate for the survivors to wonder what it was all for
09:21 yipdw KOUDO POINTO
09:22 yipdw han unification is actually really nuts
09:22 yipdw http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters <-- my Ubuntu system currently fails this
09:22 yipdw of course, that table could also be broken
09:23 kennethre works great on my mac
09:23 yipdw doesn't work on my Snow Leopard machine
09:23 kennethre http://cl.ly/1i0X06133l421Z2C0T3q
09:23 yipdw I see all the same glyphs in each column
09:23 kennethre yipdw: chrome supports it, not safari
09:23 yipdw oh
09:24 kennethre looks like firefox does too
09:24 kennethre kind of shocking actually
09:24 kennethre safari's usually a step ahead w/ typeface stuff
09:24 yipdw yeah, why would Chrome support it and not Safari
09:24 yipdw yeah, Firefox works for me
09:24 db48x you're supposed to see different glyphs
09:24 db48x not all the same in each column
09:24 yipdw yeah
09:25 db48x kennethre: so on your mac it doesn't work :)
09:25 yipdw those glyphs in the screenshot are correct
09:25 yipdw I think
09:25 kennethre db48x: it's working properly, look closely
09:26 yipdw the greatest variance is going to be between Chinese and Japanese/Korean; see e.g. U+7A7A (sky)
09:26 yipdw the Japanese rendering should have a hook on one stroke
09:26 yipdw it's also drawn differently from the Chinese traditional (and simplified too)
09:26 yipdw er
09:26 yipdw yeah
09:26 yipdw differs between both
09:26 kennethre http://cl.ly/1J3I2G1u1g3r0C0Y3z3D
09:27 kennethre (chrome)
09:27 kennethre firefox: http://cl.ly/413b3s441R2K0Y0a2C3r
09:28 db48x oh
09:28 db48x I have to zoom in on your screenshot to see :)
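
What Han unification means for U+7A7A can be shown with a short Python sketch. The choice of legacy codecs is an assumption made for illustration. The character has different byte sequences in each national encoding, but all of them decode to the same single code point, so the text itself carries no hint about which regional glyph to draw.

```python
import unicodedata

sky = "\u7a7a"  # the character yipdw cites
print(unicodedata.name(sky))

# Round-trip through several national encodings (codec list is illustrative).
round_trips = {}
for codec in ["euc_jp", "euc_kr", "gb2312", "big5"]:
    encoded = sky.encode(codec)
    decoded = encoded.decode(codec)
    round_trips[codec] = decoded
    print(f"{codec:8} {encoded.hex()} -> U+{ord(decoded):04X}")
```

Which glyph variant appears is then up to the font and any language tagging the renderer honours, which would explain browsers on the same machine disagreeing.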
09:32 yipdw holy crap it is so bedtime
09:32 yipdw I never thought I would stay up for a discussion of character encodings
09:32 yipdw how weird
09:33 db48x :)
09:33 db48x sleep well
09:33 yipdw 'night
09:33 db48x dream in ascii
09:37 chronomex dream in utf-8 ;)
09:37 tef heheh
09:44 nitro2k01 Technically speaking, UTF8 is just another type of extended ASCII
09:46 chronomex no, you're wrong.
09:49 nitro2k01 The lower 128 characters are identical, and any ASCII compatible system can simply transmit UTF8 higher codepoints without being aware of their meaning
09:50 chronomex you're wrong, and I'm content with not arguing it.
09:50 nitro2k01 It may not be ASCII by the standard, but for all intents and purposes, it is
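
Both sides of the UTF-8 versus ASCII exchange can be checked in a few lines of Python. This is a sketch with made-up sample strings, and it does not settle the argument. Pure ASCII text has identical bytes in both encodings, and every byte of a non-ASCII character is 0x80 or above. A system that cuts the byte stream at an arbitrary position can still break a character, though.

```python
ascii_text = "dream in ascii"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII text identical in UTF-8:", same_bytes)

non_ascii = "\u7a7a"  # a three-byte character in UTF-8
utf8_bytes = non_ascii.encode("utf-8")
all_high = all(b >= 0x80 for b in utf8_bytes)
print(utf8_bytes.hex(), "all bytes >= 0x80:", all_high)

# A byte-oriented system that truncates mid-character produces invalid UTF-8.
truncated = utf8_bytes[:2]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print("truncated sequence still valid:", cut_is_valid)
```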
09:50 db48x I'm trying to program and I just can't get started
09:51 db48x and I still taste garlic
09:51 chronomex what time is it for you?
09:51 db48x interesting question
09:52 db48x it's 1:53am at my location
09:52 db48x but I woke up at 4pm, so for me it's lunch time
09:52 chronomex greetings, fellow pacific standard tribesman
09:52 db48x :D
09:53 db48x haven't met many others who have read that
09:54 db48x actually, I'm a plant
09:55 db48x my home timezone is on mars, actually
09:55 chronomex uhhuh
09:55 * chronomex &
09:56 db48x I slip about an hour a day relative to earth-time
09:58 db48x I need a burger
10:00 db48x bbl
16:07 swebb_ http://www.theverge.com/2012/3/5/2845560/physical-archive-of-the-internet-archive
16:24 Schbirid wow, that is some terrible webdesign
16:30 Schbirid Nemo_bis: damn, i did not expect .tar.xz to kick 7z's butt so much. it is not even considerably slower. damn
16:30 Schbirid well, not gonna re zip them all now
16:57 db48x Hydriz is going to outstrip us all
17:59 Schbirid hm, seems like TARring before compressing is actually what makes them so small
17:59 Coderjoe it is called solid compression. 7z has an option for it, but I am pretty sure it doesn't do it by default
17:59 kennethre then what does a tar.7z look like?
18:00 Coderjoe without solid compression, each file is compressed separately. with solid compression, each file can refer to data from previous files.
18:00 Schbirid oh damn
18:01 Dark_Star solid compression is only cool if you want to pack/unpack the archive as a whole. if you need one single file from it, you're screwed most of the time ;-)
18:01 Coderjoe (and anything like tar.gz, tar.bz2, tar.xx is essentially solid, because you're packing all the files into a single file first, and then doing the compression)
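
The solid versus per-file difference is easy to reproduce in Python with the standard `lzma` module. The "files" below are invented stand-ins for many small, similar pages, and plain concatenation is used as a rough substitute for tar, so the numbers only illustrate the effect.

```python
import lzma

# Hypothetical stand-ins for many small, similar files (e.g. forum pages).
files = [
    (f"<html><head><title>Thread {i}</title></head>"
     f"<body><p>Post number {i} in a long forum thread.</p></body></html>"
     ).encode("utf-8")
    for i in range(200)
]

# Non-solid: every file is compressed on its own.
separate_total = sum(len(lzma.compress(data)) for data in files)

# Solid: concatenate first (roughly what tar does), then compress once.
solid_total = len(lzma.compress(b"".join(files)))

print("separate:", separate_total, "bytes")
print("solid:   ", solid_total, "bytes")
```

In the solid case the compressor can refer back to earlier files, and the per-stream container overhead is paid once instead of 200 times.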
18:02 Coderjoe Dark_Star: I'd been playing with some ideas that would make it less painful for gz and bz2
18:02 Coderjoe I have no plans to much with lzma, though
18:02 Coderjoe er, s/much/muck/
18:03 Dark_Star hm, you would still need to unpack the first 99% of the archive if the file you need is in the last 1%...
18:03 Schbirid hm, this is 1.8GB in total. might be half that with solid
18:03 Schbirid not worth it imo
18:03 Schbirid (archive of http://forumplanet.gamespy.com/ )
18:04 dnova I think it's worth saving half of the space
18:08 Coderjoe Dark_Star: not really.
18:08 Coderjoe well, you would need to decompress the archive at least once in order to index anything
18:20 Dark_Star so the gz/bz2 algos don't use the full data before point X but only the last n kb/mb? I admit I didn't look that closely into how they work internally :)
18:21 Coderjoe even lzma has a limited window, but the window is much larger
18:23 Coderjoe bz2 splits the data into blocks that it can fit in about 900000 bytes (though it may put more raw input into a block due to doing a simple RLE compression at that stage. the creator says it was a mistake. but it is what allows bz2 to compress several MB of the same character down to a few KB)
18:23 Coderjoe and then each block is independently compressed
18:24 Coderjoe deflate (the algorithm behind gz) has a history of up to 16KB, iirc
18:24 Coderjoe might be 32KB
18:27 Coderjoe and deflate also has the ability to close a block and open a new block. each block can be one of: stored (with a maximum length of 65535), static huffman, or dynamic huffman. the length of the huffman block is pretty much unlimited, but it usually is preferable to start a new block in order to adapt your huffman codes to changes in the input
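
The window-size point can be checked with a small Python experiment. DEFLATE back-references reach at most 32 KiB, so a repeat that sits 64 KiB away is invisible to it, while LZMA's far larger dictionary catches it. The input below is synthetic (a seeded pseudo-random block written twice), chosen only to isolate the effect.

```python
import lzma
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
block = bytes(rng.getrandbits(8) for _ in range(64 * 1024))  # incompressible
data = block + block  # the second copy starts 64 KiB after the first

deflate_size = len(zlib.compress(data, 9))
lzma_size = len(lzma.compress(data))

print("input:  ", len(data))
print("deflate:", deflate_size)
print("lzma:   ", lzma_size)
```

Deflate should end up close to the full input size, and LZMA close to half of it, because only LZMA can point back at the first copy.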
18:47 Dark_Star ah makes sense. then you can simply spit out something like "file-marks" during the compression and (e.g.) append them to the bz2 file to make it almost-randomly-accessible, similar to what ranlib does to .a files... nice idea, I wonder why nobody implemented it in tar (or any other compressor) yet
18:50 Coderjoe well, putting it at the end of the tar file is useless because you would need to decompress everything to get there.
18:53 Coderjoe and appending it after the compression will make the standard decompressors unhappy
18:56 Dark_Star you should of course put it at the end uncompressed... AFAIK at least bz2 ignores junk after the end of the archive, and you could add the size of the directory structure as the very last dword so that you can seek to the end of the compressed stream
18:58 Coderjoe I'd rather do it as a separate file, since it can be regenerated from the main file easily enough (though might be somewhat computationally expensive. just don't lose the index file and you don't have to regenerate it again.)
19:03 Dark_Star but having everything in one file has its benefits too... renaming, copying, uploading, etc...
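
The "separate index file" idea can be sketched with Python's `tarfile` module. This is not Coderjoe's design, just a minimal illustration: `build_index` is a hypothetical helper, the archive members are made up, and the offsets refer to the uncompressed tar stream. Mapping those offsets onto compressed gz or bz2 block boundaries is the hard part and is not shown.

```python
import io
import json
import tarfile


def build_index(tar_bytes):
    """Map member name -> offset and size of its data in the uncompressed tar."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = {
                    "offset_data": member.offset_data,
                    "size": member.size,
                }
    return index


# Build a tiny in-memory tar with made-up members to index.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [("a.txt", b"first file"), ("b.txt", b"second file")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
tar_bytes = buffer.getvalue()

index = build_index(tar_bytes)
print(json.dumps(index, indent=2))  # this JSON could live in a sidecar file

# With the index, a member is read by seeking, without scanning headers.
entry = index["b.txt"]
start = entry["offset_data"]
extracted = tar_bytes[start:start + entry["size"]]
print(extracted)
```

Keeping the index as a sidecar leaves the archive readable by standard tools, at the cost Dark_Star mentions of having two files to move around.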
19:08 Schbirid hm, "7z a -ms=on" did give me the same size as if i had not used -ms=on. for a directory tree with lots of files, no tarball
19:10 Schbirid SketchCow: Nemo_bis and i downloaded all the forums from http://forumplanet.gamespy.com/ . i got one 7z per forum (77, ~1.7GB). should i just upload it to archive.org anywhere and you move it to the archiveteam collection if you want?
19:11 Schbirid before anyone starts counting. GameSpy Forums & GameSpy Comrade are both /gamespy/
19:20 chronomex you can move an item into a collection after it's created, so that would be fine I think
19:25 kennethre SketchCow: we're in the news together :P http://thechangelog.com/post/18793520674/dive-into-mark-a-mirror-of-mark-pilgrims-github
19:31 Nemo_bis I think it's just a different compression level. With 7z you have to play a lot with many parameters, probably .xz just guessed the best one in that case.
19:43 soultcer Nemo_bis, xz also lets you set most compression parameters
19:45 Nemo_bis soultcer, yes, I'm saying that perhaps its defaults are the best for this case
19:48 soultcer If you try with the highest parameters be prepared to need a gigabyte of ram for compressing ;-)
19:50 Nemo_bis RAM is there slacking
23:26 Coderjoe wanted: magnetic shielding boxes for storing various magnetic media
23:28 chronomex would a ferrous ammo box work?
23:28 Coderjoe hmm
23:28 Coderjoe good point
23:28 Coderjoe I should probably check out some military surplus places
23:34 yipdw I get the feeling that the answer to a large number of life's odd requests is "military surplus"
23:35 Coderjoe not terribly efficient for storing things like VHS tapes, but usable, especially to shield from magnetic fields
23:57 chronomex 15:35:39 < yipdw> I get the feeling that the answer to a large number of life's odd requests is "military surplus"
23:57 chronomex the military does have some odd needs
