| Time | Nickname | Message |
| 10:36 | db48x | https://gist.github.com/1931814 |
| 10:42 | db48x | so those two missing files were skipped because they're not actually in the correct month |
| 10:43 | db48x | the file size column does match the string "200409", however, so the estimate was off |
| 10:45 | * | db48x sighs |
| 10:45 | db48x | the svn history isn't very useful :) |
| 11:23 | emijrp | this morning i woke up thinking in a possible bug |
| 11:24 | emijrp | long filenames cant be saved |
| 11:24 | emijrp | mediawiki allows very long filenames, i had issues with that in dumpgenerator.py |
| 11:24 | emijrp | please, stop your wikimedia commons scripts, until i can make some tests |
| 11:25 | emijrp | db48x: Nemo_bis |
| 12:12 | emijrp | probably, also errors with names including '?' |
| 12:30 | Nemo_bis | emijrp, ok, stopping |
| 12:31 | Nemo_bis | emijrp, this is what I have now: http://p.defau.lt/?2oYOTUZwuvTRDtBXsFAtxg |
| 12:31 | soultcer | You could use the namehash as filename |
| 12:33 | emijrp | in dumpgenerator.py i use filename if it is less than 100 chars, or |
| 12:33 | emijrp | filename[:100] + hash (filename) if it is long |
| 12:34 | emijrp | just using hash as filename is not descriptive for browsing |
| 12:35 | emijrp | Nemo_bis: great, very fast, but i guess we need to re-generate all those, after fixing these bugs |
| 12:35 | Nemo_bis | emijrp, :/ |
| 12:35 | emijrp | im sorry, but i said yesterday we are still under testing |
| 12:35 | Nemo_bis | I think some form of check would be useful and needed anyway, to redownload wrong images |
| 12:35 | Nemo_bis | Sure, I don't mind. Just this ^ |
| 12:43 | ersi | emijrp: I'd suggest using some ID for filenames, ie a hash digest, and that you save the original filename in metadata |
| 12:43 | ersi | But really, what's the problem with long filenames? |
| 12:44 | emijrp | ubuntu (and others i guess) doesnt allow filenames greater than 143 chars |
| 12:44 | emijrp | but you can upload longer filenames to mediawiki |
| 12:51 | emijrp | by the way, i think we can create 1 item per year |
| 12:51 | emijrp | IA items |
| 12:56 | ersi | It depends on the file system, not the operating system. You mean EXT4? I think that's the default filesystem since Ubuntu 10.04 |
| 12:57 | ersi | Max filename length 256 bytes (characters) |
| 12:57 | emijrp | yes, but we want "global" compatibility |
| 12:57 | emijrp | please who download the zips, have to unpack all dont get nasty errors |
| 12:57 | ersi | then in my opinion, you can only go with hashes + real filename in metadata |
| 12:58 | ersi | then it'd even work on shitty fat filesystems |
| 12:58 | emijrp | please = people |
| 12:59 | emijrp | but hashes as filenames, or hash + fileextension ? |
| 12:59 | emijrp | if i use just hash, you dont know if it is a image, a odp, a video, etc |
| 13:00 | ersi | true |
| 13:00 | ersi | hash + file extension then :] |
| 13:00 | emijrp | anyway, i prefer the dumpgenerator.py method |
| 13:01 | emijrp | first 100 chars + hash (32 chars) + extension |
| 13:01 | emijrp | that is descriptive and fix the long filenames bug |
| 13:01 | emijrp | and you have info about the file type |
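
A rough sketch of the naming scheme emijrp describes above, not the actual dumpgenerator.py code; MD5 is assumed only because a 32-character digest is mentioned, and the helper name is made up:

```python
# Sketch of the scheme above: keep short names as-is; otherwise use the
# first 100 characters plus a 32-character hash of the full original
# name plus the extension. MD5 is an assumption (32 hex chars).
import hashlib
import os

def truncate_filename(filename, limit=100):
    if len(filename) <= limit:
        return filename
    root, ext = os.path.splitext(filename)
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    return root[:limit] + digest + ext
```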
    
| 13:02 | emijrp | what is the limit of files by IA item? |
| 13:03 | emijrp | yearly items = 365 zips + 365 .csv |
| 13:03 | ersi | 132 characters + 4-10 characters as file ext? |
| 13:03 | emijrp | yes |
| 13:04 | emijrp | hash = hash over original filename |
| 13:10 | db48x | you'll need to process the filenames a little more than that |
| 13:10 | db48x | most unix filesystems allow any character except / in a name, FAT and NTFS are a little more restrictive |
| 13:11 | emijrp | you mean unicode chars? |
| 13:11 | db48x | you can't use ?, :, \, /, and half a dozen others on FAT or NTFS |
| 13:12 | emijrp | those are not allowed in mediawiki |
| 13:12 | emijrp | my mistake |
| 13:12 | emijrp | yes, only : not allowed |
| 13:12 | db48x | \/:*?"<>| have to be eliminated |
| 13:13 | emijrp | i can use just a filename hash as filename, but, man, that is not usable |
| 13:13 | db48x | yea :( |
| 13:13 | emijrp | you have to browse with an open csv window |
| 13:14 | db48x | could build an html index |
| 13:14 | db48x | but still not as good |
| 13:15 | db48x | I've suggested before that we should create filesystem images instead of collections of files :) |
| 13:18 | emijrp | if we discard fat and ntfs, we can use first 100 chars + original filename hash (32 chars) + extension |
| 13:19 | db48x | yea |
| 13:19 | soultcer | If the internet archive used tar files you could just use the python tarfile extension and write the file directly from python. Would work on all platforms with zero unicode issues |
| 13:19 | emijrp | but if filename is less than 100, i dont change the filename |
| 13:19 | db48x | you could urlencode the first 50 characters, plus the filename hash and the extension |
| 13:20 | db48x | no, 50 would be too big in the worst case |
| 13:21 | emijrp | yes, but a guy who downloads the .tar in windows, what? |
| 13:21 | soultcer | Uses 7zip no problem |
| 13:21 | emijrp | he cant untar |
| 13:21 | db48x | and anyway urlencoded names are unreadable in the majority of languages |
| 13:21 | db48x | emijrp: almost any unzip program will also open tarballs |
| 13:22 | db48x | you could just replace those 9 characters with their urlencoded forms |
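
A minimal sketch of the substitution db48x suggests here, percent-encoding only the nine characters Windows filesystems reject and leaving everything else (including non-ASCII) untouched; the helper is illustrative, not from the script:

```python
# Sketch only: percent-encode just the nine characters FAT/NTFS reject.
RESERVED = '\\/:*?"<>|'

def encode_reserved(name):
    return ''.join('%%%02X' % ord(c) if c in RESERVED else c for c in name)

# encode_reserved('*snore*_-_Flickr_-_exfordy.jpg')
# -> '%2Asnore%2A_-_Flickr_-_exfordy.jpg'
```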
    
| 13:22 | soultcer | The POSIX.1-2001 standard fully supports unicode and long filenames and any other kind of metadata you can think of |
| 13:22 | emijrp | i mean, you can explore .tar from IA web interface, but on windows you cant untar files using arabic symbols + ? + / + * |
| 13:23 | soultcer | I still don't understand why the Internet Archive would prefer ZIP anyway. Tar is pretty much the gold standard in archiving files. |
| 13:25 | emijrp | because zip compress and tar no? |
| 13:26 | soultcer | You can compress a tar archive with any compressor you want (compress, gzip, bzip2, lzma, lzma2/xz, ...) |
| 13:26 | soultcer | Zip compresses each file individually, giving you almost no savings, especially on the images |
| 13:27 | soultcer | Compressing the tar archive yields a much higher compression ratio since all the metadata files contain a lot of the same data |
| 13:27 | emijrp | yes, but you cant browse tars |
| 13:27 | emijrp | compressed tars |
| 13:28 | emijrp | i understood from my question to Jason that the desired format for browsing is zip |
| 13:28 | emijrp | not tar |
| 13:28 | emijrp | zip = browsable + compress |
| 13:28 | emijrp | tar = browsable |
| 13:28 | soultcer | zip = browsable + compress + bad unicode support + filename problems |
| 13:29 | soultcer | tar = browsable + unicode support + long filenames |
| 13:29 | soultcer | The compression is pretty much useless on pictures (though it helps for metadata and svg/xml pictures) |
| 13:29 | emijrp | i read here yesterday that zip fixed their unicode issues |
| 13:30 | soultcer | Partially. The official Winzip standard knows of a flag which is set to "on" when the filenames are encoded in unicode. The unix version of zip adds a special metadata field for files which contains the unicode filename |
| 13:31 | soultcer | The downside is that the python zip module supports neither of those extensions, while the python tar module supports unicode just fine. |
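
A minimal sketch of the tarfile approach soultcer is describing, assuming the images have already been fetched to disk; the function and argument names are invented for illustration:

```python
# Sketch only: pack already-downloaded files into a gzipped tar under
# their original MediaWiki names. The standard tarfile module stores
# long and non-ASCII names without the workarounds zip needs.
import tarfile

def pack_images(archive_path, files):
    # files: iterable of (local_path, original_name) pairs
    with tarfile.open(archive_path, 'w:gz') as tar:
        for local_path, original_name in files:
            tar.add(local_path, arcname=original_name)
```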
    
| 13:31 | emijrp | i dont use python to save files, just wget + curl |
| 13:32 | emijrp | wikimedia servers dont like urllib |
| 13:32 | soultcer | That's my point: If you use python instead of wget, curl and zip commands, you could get around all those unicode and filename length issues |
| 13:33 | soultcer | If you can give me an example where urllib doesn't work, I'd actually be curious to investigate, even if you don't decide on using urllib. |
| 13:33 | emijrp | ok |
| 13:33 | emijrp | just try to save http://upload.wikimedia.org/wikipedia/commons/5/50/%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg using urllib or urllib2 |
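
For reference, a hedged sketch of the usual workaround when urllib2 fails against Wikimedia: the servers reject requests carrying the default Python-urllib User-Agent, so a custom one is sent here. Whether that is the exact failure emijrp hit is an assumption, and the agent string and output filename are made up.

```python
# Python 2 sketch (urllib2, as in the message above); a custom
# User-Agent is the common fix for Wikimedia refusing urllib requests.
import urllib2

url = ('http://upload.wikimedia.org/wikipedia/commons/5/50/'
       '%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg')
req = urllib2.Request(url, headers={'User-Agent': 'commons-grabber-test/0.1'})
with open('test.jpg', 'wb') as out:
    out.write(urllib2.urlopen(req).read())
```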
    
| 13:35 | emijrp | anyway, if we use tar, that allows unicode and 10000 chars filenames, how the hell i unpack that tar in my ubuntu? it doesnt like 143+ chars filenames |
| 13:40 | soultcer | Get a better filesystem? |
| 13:41 | soultcer | There must be some way to preserve the filenames. If you like to stick to zip, you could just write a list of "original filename:filename in zip archive" and put that list into each zip file? |
| 13:42 | emijrp | obviously if i rename files, i wont delete the originalname -> newname |
| 13:42 | emijrp | that goes to a .csv file |
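
A hedged sketch of the originalname -> newname record mentioned here; the column names and layout are assumptions, not the actual commonsdownloader.py format:

```python
# Sketch only: one CSV row per renamed file, mapping the original
# MediaWiki name to the name stored in the zip. Column names invented.
import csv

def write_rename_csv(csv_path, renames):
    # renames: iterable of (original_name, stored_name) pairs
    with open(csv_path, 'wb') as f:  # 'wb' for the Python 2 csv module
        writer = csv.writer(f)
        writer.writerow(['original_filename', 'stored_filename'])
        writer.writerows(renames)
```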
    
| 13:42 | soultcer | Okay, guess that's fine too. |
| 13:43 | soultcer | Btw, if you use Ubuntu with ext3 or ext4 the maximum filename length will be 255 bytes, same as for ZFS |
| 13:43 | soultcer | And since Mediawiki commons uses ZFS as well their filenames shouldn't be longer than 255 bytes either? |
| 13:44 | emijrp | i use ext4 |
| 13:44 | emijrp | but when i try to download this with wget, it crashes http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg |
| 13:44 | emijrp | try it |
| 13:45 | soultcer | Works fine for me with wget on ext4 file system |
| 13:45 | emijrp | not for me |
| 13:45 | db48x | worked for me |
| 13:46 | soultcer | Weird, but assuming that this also happens on other Ubuntu installs it is better to limit filenames somehow |
| 13:46 | db48x | GNU Wget 1.12-2507-dirty and ext4 |
| 13:48 | db48x | I guess I should go to work |
| 13:50 | emijrp | i dont understand why it fails on my comp |
| 13:53 | emijrp | do you know how to sha1sum in base 36? |
| 14:02 | db48x | you don't do sha1sum in any particular base |
| 14:02 | db48x | the result of the hashing algorithm is just a number |
| 14:02 | db48x | the sha1sum outputs that number in hexadecimal |
| 14:03 | db48x | you have that same number in base 36, so just convert |
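
A small sketch of the conversion db48x describes: treat the SHA-1 digest as one big integer and re-express it in base 36 (digits 0-9 plus a-z). The function name is illustrative.

```python
# Sketch: SHA-1 digest of some bytes, re-expressed in base 36.
import hashlib

ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'

def sha1_base36(data):
    n = int(hashlib.sha1(data).hexdigest(), 16)
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return ''.join(reversed(digits)) or '0'

# e.g. sha1_base36(open('example.jpg', 'rb').read())
```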
    
| 14:07 | db48x | well, be back later |
| 14:09 | ersi | emijrp: I can download that URL with wget just fine |
| 14:10 | ersi | 2f4ffd812a553a00cdfed2f2ec51f1f92baa0272  Ferenc_JOACHIM_(1882-1964),_Hungarian_(Magyar)_artist_painter,_his_wife_Margit_GRAF,_their_three_children_Piroska,_Ferenc_G_and_Attila,_and_two_of_nine_grandchildren,_1944,_Budapest,_Hungary..jpg |
| 14:10 | ersi | sha1sum on the file~ |
| 14:25 | emijrp | can you download this with wget too Nemo_bis ? http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg |
| 14:32 | ersi | emijrp: I could |
| 14:32 | ersi | oh, it's the same pic as before - nevermind |
| 14:45 | emijrp | im working on the long filenames and '?' chars bug, |
| 15:07 | emijrp | testing the new version. ... |
| 15:17 | emijrp | example of image with * http://commons.wikimedia.org/wiki/File:*snore*_-_Flickr_-_exfordy.jpg |
| 15:40 | emijrp | are there file extensions larger than 4 chars? |
| 15:42 | emijrp | http://ubuntuforums.org/showthread.php?t=1173541 |
| 16:02 | emijrp | ok |
| 16:02 | emijrp | the new version is out |
| 16:02 | emijrp | now .desc files are .xml |
| 16:03 | emijrp | filenames are truncated if length > 100, to this format: first 100 chars + hash (full filename) + extension |
| 16:03 | emijrp | script tested with arabic, hebrew, and weird filenames containing * ? |
| 16:04 | emijrp | oh, and now it creates the .zip and a .csv for every day |
| 16:04 | emijrp | the .csv contains all the metadata from the feed list for that day, including the renaming info |
| 16:04 | emijrp | http://code.google.com/p/wikiteam/source/browse/trunk/commonsdownloader.py |
| 16:05 | emijrp | i think we can download the 2004 year (it starts in 2004-09-07), and try to upload all that into IA |
| 16:05 | emijrp | make an item for 2004, create the browsable links, etc, and look if all is fine |
| 16:05 | emijrp | then, launch the script for 2005, 2006.. |
| 16:06 | emijrp | Nemo_bis: i downloaded 2004 yesterday, it is less than 1GB (all days), so we can make this test between today and tomorrow |
| 16:15 | emijrp | redownload the code, i have modified a line a second ago |
| 17:26 | Nemo_bis | hm, back |
| 17:49 | Thomas-ED | hi guys |
| 17:49 | Thomas-ED | Nemo_bis, are you there? |
| 17:49 | Nemo_bis | yes |
| 17:50 | Thomas-ED | alright |
| 17:50 | Thomas-ED | so, i'm a sysadmin for encyclopedia dramatica, we're gonna start offering image dumps |
| 17:50 | Thomas-ED | and possibly page dumps |
| 17:50 | Nemo_bis | good |
| 17:50 | Thomas-ED | we have around 30GB of images |
| 17:50 | Thomas-ED | how do we go about this? |
| 17:50 | Nemo_bis | it's not that hard |
| 17:51 | Thomas-ED | i mean |
| 17:51 | Nemo_bis | For the images, you could just make a tar of your image directory and put it on archive.org |
| 17:51 | Thomas-ED | but |
| 17:51 | Thomas-ED | we can provide the dumps on our servers |
| 17:51 | Thomas-ED | do you guys download it? or do we upload it? |
| 17:51 | Nemo_bis | it's better if you upload directly |
| 17:51 | Nemo_bis | I can give you the exact command if you don't know IA's s3 interface |
| 17:56 | Nemo_bis | Thomas-ED, text is usually more interesting, there's https://www.mediawiki.org/wiki/Manual:DumpBackup.php for it |
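
A hedged sketch of driving the maintenance script Nemo_bis links: the --full flag (dump every revision) comes from that manual page; wrapping it in Python and the output filename are just for illustration.

```python
# Sketch: produce a full-history XML dump with MediaWiki's
# maintenance/dumpBackup.php, capturing stdout to a file.
import subprocess

with open('ed-full-history.xml', 'wb') as out:
    subprocess.check_call(
        ['php', 'maintenance/dumpBackup.php', '--full'],
        stdout=out)
```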
    
| 17:56 | Thomas-ED | so nobody would want the images? |
| 17:57 | Thomas-ED | were you responsible for dumping ed.com? |
| 17:57 | Thomas-ED | we started .ch from those dumps :P |
| 18:00 | Nemo_bis | Thomas-ED, no, I didn't do that |
| 18:00 | Nemo_bis | the images are good for archival purposes, but the text is more important usually |
| 18:01 | Nemo_bis | also, without text one can't use the images, because one doesn't know what they are and what's their licensing status :) |
| 18:02 | Thomas-ED | well |
| 18:02 | Thomas-ED | i don't know if you're familiar with ED or anything |
| 18:02 | Thomas-ED | we don't really care about license or copyright, none of that info is available for our images |
| 18:02 | Nemo_bis | yes, that's why I said "usually" :p |
| 18:02 | Nemo_bis | some description will probably be there anyway |
| 18:03 | Nemo_bis | the images tar is very easy to do, but the dump script should work smoothly enough as well |
| 18:06 | Thomas-ED | can do the text, but the images will happen later as it will cause a lot of I/O and we are in peak right now |
| 18:11 | Thomas-ED | Nemo_bis, we're gonna offer HTTP dloads to monthly text dumps, and bittorrent as well |
| 18:11 | Thomas-ED | you think that will suffice? cba to upload anywhere. |
| 18:11 | Thomas-ED | people can grab from us |
| 18:11 | Nemo_bis | yep |
| 18:11 | Nemo_bis | just tell us when you publish the dumps |
| 18:11 | Nemo_bis | 7z is the best for the full history dumps |
| 18:12 | Nemo_bis | those will probably be fairly small, so easy to download and reupload |
| 18:15 | Thomas-ED | yeah we are dumping every page and revision |
| 18:15 | Thomas-ED | so its gonna be quite big i think |
| 18:16 | Thomas-ED | gonna offer RSS with the BT tracker |
| 18:16 | Thomas-ED | so |
| 18:16 | Thomas-ED | people can automate the dloads of our dumps |
| 18:17 | Nemo_bis | Thomas-ED, ok, good |
| 20:07 | Thomas-ED | Nemo_bis, |
| 20:07 | Thomas-ED | http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz |
| 20:07 | Thomas-ED | http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz.torrent |
| 20:07 | Nemo_bis | cool |
| 20:07 | Thomas-ED | http://tracker.srsdl.com/ed.php |
| 20:07 | Thomas-ED | will publish monthly on that |
| 20:08 | Nemo_bis | Thomas-ED, did you try 7z? |
| 20:09 | Thomas-ED | nah we're just gonna do that lol |
| 20:09 | Nemo_bis | or do you have more bandwidth than CPU to waste? |
| 20:09 | Nemo_bis | ok |
| 20:10 | chronomex | the best way I've found to compress wiki dumps is .rcs.gz |
| 20:10 | chronomex | RCS alone gets around the level of .xml.7z |
| 20:10 | chronomex | tho extraction into a wiki isnt easy from .rcs |
| 20:11 | Nemo_bis | chronomex, what dumps did you try? |
| 20:11 | Nemo_bis | I guess it varies wildly |
| 20:12 | Thomas-ED | to be honest, we just cant be bothered, we've made them available, so there :P |
| 20:12 | chronomex | I tried subsets of wikipedia, and a full dump of wikiti. |
| 20:13 | chronomex | works best with wikis that have lots of small changes in big articles |
| 21:17 | Nemo_bis | Thomas-ED, gzip: text-28-02-12.xml.gz: not in gzip format --> ?? |
| 21:18 | chronomex | what does `file` say? |
| 21:20 | Nemo_bis | chronomex, text-28-02-12.xml.gz: HTML document text |
| 21:20 | chronomex | hrm. |
| 21:21 | Nemo_bis | looks like it's just an xml |
| 21:23 | Nemo_bis | $ head text-28-02-12.xml.gz |
| 21:23 | Nemo_bis | <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en"> |
| 21:23 | Nemo_bis | etc. |
| 21:24 | Nemo_bis | was indeed quite huge to be compressed |