[07:40] alard: found an error
[07:40] Sleep and retry.
[07:40] ERROR contacting tracker. Could not mark 'es/arcoiris/metaf%edsica' done.
[07:40] Telling tracker that 'es/arcoiris/metaf%edsica' is done.
[07:40] Sleep and retry.
[08:19] I still taste garlic
[08:49] -rw-rw-r--. 1 db48x db48x 456M Mar 5 00:49 com-emachines-e5-20120305-014015.warc.gz
[08:53] oh man
[08:53] a lot of the filenames in this directory aren't unicode
[08:54] so? :)
[08:55] how will anyone know what they mean?
[08:56] -rw-rw-r--. 1 db48x db48x 50K Mar 5 00:11 h1-2-2.Ã»?êº°??(??ß£Ü¬??)(T).htm
[08:56] completely mangled
[08:56] that's not mangled, that's you missing CJK fonts most likely
[08:57] it has no cjk characters in it
[08:57] the filename is there though
[08:57] it has at least one, I got it displayed
[08:57] that's a coincidence of the unicode stack you and I are using to communicate
[08:57] Oh, heh. my bad
[08:57] it's modern mojibake
[08:57] I guess you can try to run it through iconv, if you've got the filename in a log or something
[08:57] yeah. Know the source encoding?
[08:58] hell if I know what the source is
[08:58] that's supposedly a SYRIAC LETTER TAW in there too
[08:58] which is RTL and triggers bidi rendering
[08:58] yaaay RTL
[08:58] ersi: nope. no way to know
[08:58] the content of the file is no better
[08:59] what's the name of the account?
[08:59] I guess you could view it in a bunch of different code pages and see what looks readable
[08:59] or rather
[08:59] emachines
[08:59] what's the URI
[08:59] hmm
[08:59] com/emachines/e5
[09:00] http://www.fortunecity.com/emachines/e5/136/Educa/h-k-1.htm, for example
[09:00] oh jesus christ what a mess
[09:01] fortunecity is awfully slow too
[09:01] oh
[09:01] it's EUC-JP
[09:01] * db48x tries some encodings
[09:01] I think
[09:01] wait no
[09:02] if it IS, that'd be odd as fuck
[09:02] Firefox says EUC-KR when I do autodetection for east asia
[09:02] h god fortunecity
[09:02] oh*
[09:02] EUC-KR makes more sense
[09:02] lots of 404s that way, though
[09:03] whoa, fortunecity is shutting down? ._.
[09:03] heh wow
[09:03] old constructs NEVER die
[09:04] :)
[09:06] yipdw: it's ISO-8859-1 apparently
[09:07] according to my http library
[09:07] codepages must die
[09:08] kennethre: that page is?
[09:09] kennethre: you can't do korean in 8859-1 :)
[09:09] yipdw: yeah
[09:09] or jp or whatever it really is
[09:09] if I apply ISO-8859-1 to that page I get gibberish
[09:09] this page? http://www.fortunecity.com/emachines/e5/136/Educa/h-k-1.htm
[09:09] yes
[09:09] kennethre: that page
[09:09] the server doesn't report an encoding
[09:09] I'm guessing EUC-KR, just because the title graphic is in Korean
[09:09] there's no meta tag
[09:09] db48x: yeah that's chardet's best guess for the encoding
[09:10] it may have a lot of gibberish, but all the chars are valid ;)
[09:10] because there's a lot of html in it
[09:10] so that biases it
[09:10] kennethre: :)
[09:10] ah it's cp932
[09:10] look at all those valid bytes :)
[09:11] a lot of japanese sites do that
[09:11] (Windows-31J)
[09:11] yeah it works perfectly then
[09:11] I'm pretty sure it's not Japanese :P
[09:11] it is
[09:11] well
[09:12] http://www.fortunecity.com/emachines/e5/136/Image/jaul.gif isn't
[09:12] :P
[09:12] I mean
[09:12] http://cl.ly/2b0o1D3n1q2u0D0F3M2f
[09:12] looks damn valid to me
[09:12] I understand it's possible for a page to be written using a Japanese encoding but with hangul in graphics
[09:12] fwiw, script for fortunecity works here
[09:12] but the fact that it's hangul in the graphic suggests to me that it's Korean
[09:12] well whatever it is, i think that's the right encoding
[09:12] yipdw: that looks korean
[09:12] ersi: it is
[09:12] http://cl.ly/1b2s3S181M1O1R1z1G15
[09:13] maybe not quite perfect though
[09:13] and that looks like katakana
[09:13] I can't connect to cl.ly for some reason
[09:13] WHAT THE HELL INTERNET
[09:13] WHAT THE INTERNETS
[09:13] yipdw: flush dns cache :)
[09:13] oh there it goes
[09:14] kennethre: re: http://cl.ly/1b2s3S181M1O1R1z1G15 -> I doubt that's valid
[09:14] I mean, yes, they're characters
[09:14] yeah, but close :P
[09:14] but nobody uses half-width katakana and kanji to name things
[09:14] never say never :P
[09:14] true
[09:15] the person running this site could have been an asshole :P
[09:15] douchery is allowed on the internets
[09:15] I have one other test though
[09:15] I do not read Korean, but I know someone who does
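The trial-and-error approach from the exchange above (decode the same bytes under several candidate encodings and see which result looks readable) can be sketched in Python. This is only an illustration under stated assumptions: the sample bytes are a made-up Korean phrase, not the contents of the fortunecity page, and `try_encodings` is a hypothetical helper name. Note that ISO-8859-1 accepts any byte sequence, which is why "all the chars are valid" says nothing about whether the guess is right.

```python
# Candidate encodings named in the discussion (Python codec names).
CANDIDATES = ["euc_kr", "euc_jp", "cp932", "iso-8859-1"]


def try_encodings(raw: bytes, candidates=CANDIDATES):
    """Decode raw under each candidate; map encoding -> text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all.
            results[name] = None
    return results


# Hypothetical sample: a Korean phrase stored as EUC-KR bytes.
sample = "한국어 교육".encode("euc_kr")

for name, text in try_encodings(sample).items():
    # A human still has to judge which decoding is actually readable.
    print(f"{name:12} {text!r}")
```

A successful decode only means the bytes are legal in that encoding; picking the right one still takes a reader of the language, or a statistical detector such as the chardet guess mentioned above.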
[09:15] so if I give her the EUC-KR translation and it makes sense
[09:15] well
[09:15] yeah
[09:16] I can say that that page -- while rendering ok under EUC-JP -- is not Japanese at all
[09:16] same with any Chinese encoding
[09:16] fortuneshitty sure is slow now >_>
[09:16] again, not hard proof, but I like to think that most pages on the interwebs are written with an encoding that is appropriate for the source language
[09:17] heh
[09:17] or super shitty software that is clueless
[09:17] it is of course always possible that "charmsol" is a dick who decided it'd be funny to upload Chinese stuff encoded in Windows-31J and add a title graphic in Korean
[09:18] you'd have to be pretty perverse to do that
[09:18] yeah, especially since you can't encode Chinese in Windows-31J
[09:19] I mean I guess you could use the visually similar characters between kanji and hanzi
[09:19] but you'd probably offend both Japanese and Chinese people that way
[09:19] :P
[09:19] this fucking fortunecity page could start the next major Asian war
[09:20] lol
[09:20] good thing we're keeping it around then
[09:20] yeah
[09:20] Han unification was too tame
[09:20] would hate for the survivors to wonder what it was all for
[09:21] KOUDO POINTO
[09:22] han unification is actually really nuts
[09:22] http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters <-- my Ubuntu system currently fails this
[09:22] of course, that table could also be broken
[09:23] works great on my mac
[09:23] doesn't work on my Snow Leopard machine
[09:23] http://cl.ly/1i0X06133l421Z2C0T3q
[09:23] I see all the same glyphs in each column
[09:23] yipdw: chrome supports it, not safari
[09:23] oh
[09:24] looks like firefox does too
[09:24] kind of shocking actually
[09:24] safari's usually a step ahead w/ typeface stuff
[09:24] yeah, why would Chrome support it and not Safari
[09:24] yeah, Firefox works for me
[09:24] you're supposed to see different glyphs
[09:24] not all the same in each column
[09:24] yeah
[09:25] kennethre: so on your mac it doesn't work :)
[09:25] those glyphs in the screenshot are correct
[09:25] I think
[09:25] db48x: it's working properly, look closely
[09:26] the greatest variance is going to be between Chinese and Japanese/Korean; see e.g. U+7A7A (sky)
[09:26] the Japanese rendering should have a hook on one stroke
[09:26] it's also drawn differently from the Chinese traditional (and simplified too)
[09:26] er
[09:26] yeah
[09:26] differs between both
[09:26] http://cl.ly/1J3I2G1u1g3r0C0Y3z3D
[09:27] (chrome)
[09:27] firefox: http://cl.ly/413b3s441R2K0Y0a2C3r
[09:28] oh
[09:28] I have to zoom in on your screenshot to see :)
[09:32] holy crap it is so bedtime
[09:32] I never thought I would stay up for a discussion of character encodings
[09:32] how weird
[09:33] :)
[09:33] sleep well
[09:33] 'night
[09:33] dream in ascii
[09:37] dream in utf-8 ;)
[09:37] heheh
[09:44] Technically speaking, UTF8 is just another type of extended ASCII
[09:46] no, you're wrong.
[09:49] The lower 128 characters are identical, and any ASCII compatible system can simply transmit UTF8 higher codepoints without being aware of their meaning
[09:50] you're wrong, and I'm content with not arguing it.
[09:50] It may not be ASCII by the standard, but for all intents and purposes, it is
[09:50] I'm trying to program and I just can't get started
[09:51] and I still taste garlic
[09:51] what time is it for you?
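Both sides of the UTF-8 versus ASCII disagreement above have something checkable behind them, and a few lines of Python show what. The sketch below is an illustration only, with a made-up sample word: pure ASCII text has identical bytes in both encodings, and every byte of a multi-byte UTF-8 sequence has its high bit set, so an 8-bit-clean system can carry it opaquely. A strict 7-bit ASCII decoder still rejects it, which is the sense in which UTF-8 is not ASCII by the standard.

```python
# Pure ASCII text produces the same bytes under ASCII and UTF-8.
plain = "dream in ascii"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# Hypothetical non-ASCII sample; the accented letter is U+00ED.
word = "metaf\u00edsica"
encoded = word.encode("utf-8")

# Bytes belonging to a multi-byte sequence all have the high bit set,
# so they can never be mistaken for an ASCII character.
high_bytes = [b for b in encoded if b >= 0x80]

# A strict ASCII decoder refuses the data...
try:
    encoded.decode("ascii")
    strict_ascii_ok = True
except UnicodeDecodeError:
    strict_ascii_ok = False

# ...while a byte-transparent pipeline passes it through unchanged.
passed_through = bytes(encoded)
round_trip = passed_through.decode("utf-8") == word

print(same_bytes, [hex(b) for b in high_bytes], strict_ascii_ok, round_trip)
```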
[09:51] interesting question
[09:52] it's 1:53am at my location
[09:52] but I woke up at 4pm, so for me it's lunch time
[09:52] greetings, fellow pacific standard tribesman
[09:52] :D
[09:53] haven't met many others who have read that
[09:54] actually, I'm a plant
[09:55] my home timezone is on mars, actually
[09:55] uhhuh
[09:55] * chronomex &
[09:56] I slip about an hour a day relative to earth-time
[09:58] I need a burger
[10:00] bbl
[16:07] http://www.theverge.com/2012/3/5/2845560/physical-archive-of-the-internet-archive
[16:24] wow, that is some terrible webdesign
[16:30] Nemo_bis: damn, i did not expect .tar.xz to kick 7z's butt so much. it is not even considerably slower. damn
[16:30] well, not gonna re-zip them all now
[16:57] Hydriz is going to outstrip us all
[17:59] hm, seems like TARring before compressing is actually what makes them so small
[17:59] it is called solid compression. 7z has an option for it, but I am pretty sure it doesn't do it by default
[17:59] then what does a tar.7z look like?
[18:00] without solid compression, each file is compressed separately. with solid compression, each file can refer to data from previous files.
[18:00] oh damn
[18:01] solid compression is only cool if you want to pack/unpack the archive as a whole. if you need one single file from it, you're screwed most of the time ;-)
[18:01] (and anything like tar.gz, tar.bz2, tar.xx is essentially solid, because you're packing all the files into a single file first, and then doing the compression)
[18:02] Dark_Star: I'd been playing with some ideas that would make it less painful for gz and bz2
[18:02] I have no plans to much with lzma, though
[18:02] er, s/much/muck/
[18:03] hm, you would still need to unpack the first 99% of the archive if the file you need is in the last 1%...
[18:03] hm, this is 1.8GB in total. might be half that with solid
[18:03] not worth it imo
[18:03] (archive of http://forumplanet.gamespy.com/ )
[18:04] I think it's worth saving half of the space
[18:08] Dark_Star: not really.
[18:08] well, you would need to decompress the archive at least once in order to index anything
[18:20] so the gz/bz2 algos don't use the full data before point X but only the last n kb/mb? I admit I didn't look that closely into how they work internally :)
[18:21] even lzma has a limited window, but the window is much larger
[18:23] bz2 splits the data into blocks that it can fit in about 900000 bytes (though it may put more raw input into a block due to doing a simple RLE compression at that stage. the creator says it was a mistake. but it is what allows bz2 to compress several MB of the same character down to a few KB)
[18:23] and then each block is independently compressed
[18:24] deflate (the algorithm behind gz) has a history of up to 16KB, iirc
[18:24] might be 32KB
[18:27] and deflate also has the ability to close a block and open a new block. each block can be one of: stored (with a maximum length of 65535), static huffman, or dynamic huffman. the length of the huffman block is pretty much unlimited, but it usually is preferable to start a new block in order to adapt your huffman codes to changes in the input
[18:47] ah makes sense. then you can simply spit out something like "file-marks" during the compression and (e.g.) append them to the bz2 file to make it almost-randomly-accessible, similar to what ranlib does to .a files... nice idea, I wonder why nobody implemented it in tar (or any other compressor) yet
[18:50] well, putting it at the end of the tar file is useless because you would need to decompress everything to get there.
[18:53] and appending it after the compression will make the standard decompressors unhappy
[18:56] you should of course put it at the end uncompressed... AFAIK at least bz2 ignores junk after the end of the archive, and you could add the size of the directory structure as the very last dword so that you can seek to the end of the compressed stream
[18:58] I'd rather do it as a separate file, since it can be regenerated from the main file easily enough (though might be somewhat computationally expensive. just don't lose the index file and you don't have to regenerate it again.)
[19:03] but having everything in one file has its benefits too... renaming, copying, uploading, etc...
[19:08] hm, "7z a -ms=on" did give me the same size as if i had not used -ms=on. for a directory tree with lots of files, no tarball
[19:10] SketchCow: Nemo_bis and i downloaded all the forums from http://forumplanet.gamespy.com/ . i got one 7z per forum (77, ~1.7GB). should i just upload it to archive.org anywhere and you move it to the archiveteam collection if you want?
[19:11] before anyone starts counting. GameSpy Forums & GameSpy Comrade are both /gamespy/
[19:20] you can move an item into a collection after it's created, so that would be fine I think
[19:25] SketchCow: we're in the news together :P http://thechangelog.com/post/18793520674/dive-into-mark-a-mirror-of-mark-pilgrims-github
[19:31] I think it's just a different compression level. With 7z you have to play a lot with many parameters, probably .xz just guessed the best one in that case.
[19:43] Nemo_bis, xz also lets you set most compression parameters
[19:45] soultcer, yes, I'm saying that perhaps its defaults are the best for this case
[19:48] If you try with the highest parameters be prepared to need a gigabyte of ram for compressing ;-)
[19:50] RAM is there slacking
[23:26] wanted: magnetic shielding boxes for storing various magnetic media
[23:28] would a ferrous ammo box work?
[23:28] hmm
[23:28] good point
[23:28] I should probably check out some military surplus places
[23:34] I get the feeling that the answer to a large number of life's odd requests is "military surplus"
[23:35] not terribly efficient for storing things like VHS tapes, but usable, especially to shield from magnetic fields
[23:57] 15:35:39 < yipdw> I get the feeling that the answer to a large number of life's odd requests is "military surplus"
[23:57] the military does have some odd needs
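Returning to the solid-compression thread from 17:59 to 19:03: the trade-off between one solid stream and separately compressed files with a separate index can be sketched in Python. This is not what anyone in the channel implemented; it is a minimal illustration that assumes made-up forum pages, hypothetical names (`files`, `index`, `read_one`), and one gzip member per file, which gives random access at the cost of the cross-file redundancy that makes solid archives small.

```python
import gzip

# Hypothetical stand-ins for forum pages that share a lot of boilerplate.
files = {
    f"thread-{i}.html": (
        f"<html><head><title>Thread {i}</title></head>"
        f"<body><div class='post'>Post number {i}</div></body></html>"
    ).encode()
    for i in range(200)
}

# Solid: concatenate everything first, then compress once, so later
# files can refer back to data from earlier ones.
solid = gzip.compress(b"".join(files.values()))

# Non-solid: one gzip member per file, concatenated into one archive.
# The index (name -> offset, length) would live in a separate file.
archive = bytearray()
index = {}
for name, data in files.items():
    member = gzip.compress(data)
    index[name] = (len(archive), len(member))
    archive += member


def read_one(name: str) -> bytes:
    """Fetch a single file by decompressing only its own gzip member."""
    offset, length = index[name]
    return gzip.decompress(bytes(archive[offset:offset + length]))


print("solid bytes:     ", len(solid))
print("per-member bytes:", len(archive))
print(read_one("thread-150.html")[:40])
```

Because concatenated gzip members still form a valid gzip stream, a standard decompressor can unpack the whole archive without knowing about the index, and the index can be regenerated from the archive if it is lost.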