[10:36] https://gist.github.com/1931814
[10:42] so those two missing files were skipped because they're not actually in the correct month
[10:43] the file size column does match the string "200409", however, so the estimate was off
[10:45] * db48x sighs
[10:45] the svn history isn't very useful :)
[11:23] this morning I woke up thinking about a possible bug
[11:24] long filenames can't be saved
[11:24] MediaWiki allows very long filenames; I had issues with that in dumpgenerator.py
[11:24] please stop your Wikimedia Commons scripts until I can run some tests
[11:25] db48x: Nemo_bis
[12:12] probably also errors with names including '?'
[12:30] emijrp, ok, stopping
[12:31] emijrp, this is what I have now: http://p.defau.lt/?2oYOTUZwuvTRDtBXsFAtxg
[12:31] You could use the name hash as the filename
[12:33] in dumpgenerator.py I use the filename if it is less than 100 chars, or
[12:33] filename[:100] + hash(filename) if it is long
[12:34] just using the hash as the filename is not descriptive for browsing
[12:35] Nemo_bis: great, very fast, but I guess we need to re-generate all those after fixing these bugs
[12:35] emijrp, :/
[12:35] I'm sorry, but I said yesterday we are still testing
[12:35] I think some form of check would be useful and needed anyway, to redownload wrong images
[12:35] Sure, I don't mind. Just this ^
[12:43] emijrp: I'd suggest using some ID for filenames, i.e. a hash digest, and that you save the original filename in the metadata
[12:43] But really, what's the problem with long filenames?
[12:44] Ubuntu (and others, I guess) doesn't allow filenames longer than 143 chars
[12:44] but you can upload longer filenames to MediaWiki
[12:51] by the way, I think we can create 1 item per year
[12:51] IA items
[12:56] It depends on the file system, not the operating system. You mean ext4? I think that's the default filesystem since Ubuntu 10.04
[12:57] max filename length is 255 bytes
[12:57] yes, but we want "global" compatibility
[12:57] please who download the zips have to unpack it all without getting nasty errors
[12:57] then in my opinion you can only go with hashes + the real filename in metadata
[12:58] then it'd even work on shitty FAT filesystems
[12:58] please = people
[12:59] but hashes as filenames, or hash + file extension?
[12:59] if I use just the hash, you don't know if it is an image, an ODP, a video, etc.
[13:00] true
[13:00] hash + file extension then :]
[13:00] anyway, I prefer the dumpgenerator.py method
[13:01] first 100 chars + hash (32 chars) + extension
[13:01] that is descriptive and fixes the long filenames bug
[13:01] and you have info about the file type
[13:02] what is the limit of files per IA item?
[13:03] yearly items = 365 zips + 365 .csv files
[13:03] 132 characters + 4-10 characters as the file extension?
[13:03] yes
[13:04] hash = hash over the original filename
[13:10] you'll need to process the filenames a little more than that
[13:10] most unix filesystems allow any character except / in a name; FAT and NTFS are a little more restrictive
[13:11] you mean unicode chars?
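
The [13:01] scheme (keep short names as-is; otherwise first 100 chars + a 32-char hash of the full original name + the extension) would look roughly like this in Python. This is a minimal sketch, not dumpgenerator.py's actual code; the use of MD5 is only an assumption based on the 32-character digest length mentioned here, and the helper name is illustrative.

    import hashlib
    import os

    def truncate_filename(filename, limit=100):
        # Short names are kept unchanged.
        if len(filename) <= limit:
            return filename
        # 32-char digest of the full original name; MD5 is assumed here
        # purely because its hex digest happens to be 32 characters long.
        digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
        extension = os.path.splitext(filename)[1]
        return filename[:limit] + digest + extension
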
[13:11] you can't use ?, :, \, /, and half a dozen others on FAT or NTFS
[13:12] those are not allowed in mediawiki
[13:12] my mistake
[13:12] yes, only : not allowed
[13:12] \/:*?"<>| have to be eliminated
[13:13] I can use just a filename hash as the filename, but, man, that is not usable
[13:13] yea :(
[13:13] you have to browse with an open CSV window
[13:14] could build an HTML index
[13:14] but still not as good
[13:15] I've suggested before that we should create filesystem images instead of collections of files :)
[13:18] if we discard FAT and NTFS, we can use the first 100 chars + the original filename's hash (32 chars) + the extension
[13:19] yea
[13:19] If the Internet Archive used tar files you could just use the Python tarfile module and write the file directly from Python. Would work on all platforms with zero unicode issues
[13:19] but if the filename is shorter than 100 chars, I don't change it
[13:19] you could urlencode the first 50 characters, plus the filename hash and the extension
[13:20] no, 50 would be too big in the worst case
[13:21] yes, but what about a guy who downloads the .tar on Windows?
[13:21] Uses 7zip, no problem
[13:21] he can't untar it
[13:21] and anyway urlencoded names are unreadable in the majority of languages
[13:21] emijrp: almost any unzip program will also open tarballs
[13:22] you could just replace those 9 characters with their urlencoded forms
[13:22] The POSIX.1-2001 standard fully supports unicode and long filenames and any other kind of metadata you can think of
[13:22] I mean, you can explore a .tar from the IA web interface, but on Windows you can't untar files with names using Arabic symbols + ? + / + *
[13:23] I still don't understand why the Internet Archive would prefer ZIP anyway. Tar is pretty much the gold standard in archiving files.
[13:25] because zip compresses and tar doesn't?
[13:26] You can compress a tar archive with any compressor you want (compress, gzip, bzip2, lzma, lzma2/xz, ...)
[13:26] Zip compresses each file individually, giving you almost no savings, especially on the images
[13:27] Compressing the tar archive yields a much higher compression ratio since all the metadata files contain a lot of the same data
[13:27] yes, but you can't browse tars
[13:27] compressed tars
[13:28] I understood from my question to Jason that the desired format for browsing is zip
[13:28] not tar
[13:28] zip = browsable + compression
[13:28] tar = browsable
[13:28] zip = browsable + compression + bad unicode support + filename problems
[13:29] tar = browsable + unicode support + long filenames
[13:29] The compression is pretty much useless on pictures (though it helps for metadata and SVG/XML pictures)
[13:29] I read here yesterday that zip fixed their unicode issues
[13:30] Partially. The official WinZip standard has a flag which is set to "on" when the filenames are encoded in unicode. The unix version of zip adds a special metadata field for files which contains the unicode filename
[13:31] The downside is that the Python zip module supports neither of those extensions, while the Python tar module supports unicode just fine.
[13:31] I don't use Python to save files, just wget + curl
[13:32] Wikimedia servers don't like urllib
[13:32] That's my point: if you use Python instead of the wget, curl and zip commands, you could get around all those unicode and filename length issues
[13:33] If you can give me an example where urllib doesn't work, I'd actually be curious to investigate, even if you don't decide on using urllib.
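
A minimal sketch of the character cleanup mentioned at [13:12] ("\/:*?"<>| have to be eliminated"). Replacing the offending characters with an underscore is an assumption; the conversation does not settle on a replacement character, and the function name is illustrative.

    # Characters rejected by FAT/NTFS filenames.
    WINDOWS_UNSAFE = set('\\/:*?"<>|')

    def sanitize_for_fat_ntfs(filename, replacement='_'):
        # Replace every forbidden character; everything else, including
        # non-ASCII characters, is kept untouched.
        return ''.join(replacement if c in WINDOWS_UNSAFE else c
                       for c in filename)

    # e.g. sanitize_for_fat_ntfs('*snore*_-_Flickr_-_exfordy.jpg')
    #      -> '_snore__-_Flickr_-_exfordy.jpg'
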
[13:33] ok
[13:33] just try to save http://upload.wikimedia.org/wikipedia/commons/5/50/%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg using urllib or urllib2
[13:35] anyway, if we use tar, which allows unicode and 10000-char filenames, how the hell do I unpack that tar on my Ubuntu? it doesn't like 143+ char filenames
[13:40] Get a better filesystem?
[13:41] There must be some way to preserve the filenames. If you'd like to stick with zip, you could just write a list of "original filename:filename in zip archive" and put that list into each zip file?
[13:42] obviously if I rename files, I won't delete the originalname -> newname mapping
[13:42] that goes into a .csv file
[13:42] Okay, guess that's fine too.
[13:43] Btw, if you use Ubuntu with ext3 or ext4 the maximum filename length will be 255 bytes, same as for ZFS
[13:43] And since Wikimedia Commons uses ZFS as well, their filenames shouldn't be longer than 255 bytes either?
[13:44] I use ext4
[13:44] but when I try to download this with wget, it crashes: http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg
[13:44] try it
[13:45] Works fine for me with wget on an ext4 file system
[13:45] not for me
[13:45] worked for me
[13:46] Weird, but assuming that this also happens on other Ubuntu installs, it is better to limit filenames somehow
[13:46] GNU Wget 1.12-2507-dirty and ext4
[13:48] I guess I should go to work
[13:50] I don't understand why it fails on my computer
[13:53] do you know how to sha1sum in base 36?
[14:02] you don't do sha1sum in any particular base
[14:02] the result of the hashing algorithm is just a number
[14:02] sha1sum outputs that number in hexadecimal
[14:03] you can have that same number in base 36, so just convert
[14:07] well, be back later
[14:09] emijrp: I can download that URL with wget just fine
[14:10] 2f4ffd812a553a00cdfed2f2ec51f1f92baa0272  Ferenc_JOACHIM_(1882-1964),_Hungarian_(Magyar)_artist_painter,_his_wife_Margit_GRAF,_their_three_children_Piroska,_Ferenc_G_and_Attila,_and_two_of_nine_grandchildren,_1944,_Budapest,_Hungary..jpg
[14:10] sha1sum on the file
[14:25] can you download this with wget too, Nemo_bis? http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg
[14:32] emijrp: I could
[14:32] oh, it's the same pic as before - nevermind
[14:45] I'm working on the long filenames and '?' chars bug
[15:07] testing the new version...
[15:17] example of an image with *: http://commons.wikimedia.org/wiki/File:*snore*_-_Flickr_-_exfordy.jpg
[15:40] are there file extensions longer than 4 chars?
[15:42] http://ubuntuforums.org/showthread.php?t=1173541
[16:02] ok
[16:02] the new version is out
[16:02] now .desc files are .xml
[16:03] filenames are truncated if length > 100, to this format: first 100 chars + hash(full filename) + extension
[16:03] script tested with Arabic, Hebrew, and weird filenames containing * ?
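
Going back to the base-36 question at [13:53]: as [14:03] says, the digest is just a number, so it is only a matter of re-encoding it. A small illustrative Python sketch (the function names are not from any of the scripts discussed here):

    import hashlib
    import string

    BASE36_DIGITS = string.digits + string.ascii_lowercase  # 0-9 then a-z

    def to_base36(n):
        # Repeatedly divide by 36 and collect the remainders.
        if n == 0:
            return '0'
        digits = []
        while n:
            n, remainder = divmod(n, 36)
            digits.append(BASE36_DIGITS[remainder])
        return ''.join(reversed(digits))

    def sha1_base36(data):
        # SHA-1 is a 160-bit number; sha1sum prints it in hex,
        # this prints the same number in base 36 (at most 31 chars).
        return to_base36(int(hashlib.sha1(data).hexdigest(), 16))
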
[16:04] oh, and now it creates the .zip and a .csv for every day
[16:04] the .csv contains all the metadata from the feed list for that day, including the renaming info
[16:04] http://code.google.com/p/wikiteam/source/browse/trunk/commonsdownloader.py
[16:05] I think we can download the year 2004 (it starts on 2004-09-07) and try to upload all that to IA
[16:05] make an item for 2004, create the browsable links, etc., and see if all is fine
[16:05] then launch the script for 2005, 2006...
[16:06] Nemo_bis: I downloaded 2004 yesterday; it is less than 1 GB (all days), so we can make this test between today and tomorrow
[16:15] redownload the code, I modified a line a second ago
[17:26] hm, back
[17:49] hi guys
[17:49] Nemo_bis, are you there?
[17:49] yes
[17:50] alright
[17:50] so, I'm a sysadmin for Encyclopedia Dramatica, we're gonna start offering image dumps
[17:50] and possibly page dumps
[17:50] good
[17:50] we have around 30 GB of images
[17:50] how do we go about this?
[17:50] it's not that hard
[17:51] I mean
[17:51] For the images, you could just make a tar of your image directory and put it on archive.org
[17:51] but
[17:51] we can provide the dumps on our servers
[17:51] do you guys download it? or do we upload it?
[17:51] it's better if you upload directly
[17:51] I can give you the exact command if you don't know IA's s3 interface
[17:56] Thomas-ED, text is usually more interesting, there's https://www.mediawiki.org/wiki/Manual:DumpBackup.php for it
[17:56] so nobody would want the images?
[17:57] were you responsible for dumping ed.com?
[17:57] we started .ch from those dumps :P
[18:00] Thomas-ED, no, I didn't do that
[18:00] the images are good for archival purposes, but the text is usually more important
[18:01] also, without text one can't use the images, because one doesn't know what they are and what their licensing status is :)
[18:02] well
[18:02] I don't know if you're familiar with ED or anything
[18:02] we don't really care about license or copyright, none of that info is available for our images
[18:02] yes, that's why I said "usually" :p
[18:02] some description will probably be there anyway
[18:03] the images tar is very easy to do, but the dump script should work smoothly enough as well
[18:06] we can do the text, but the images will happen later as it will cause a lot of I/O and we are at peak right now
[18:11] Nemo_bis, we're gonna offer HTTP downloads of monthly text dumps, and BitTorrent as well
[18:11] you think that will suffice? cba to upload anywhere.
[18:11] people can grab from us
[18:11] yep
[18:11] just tell us when you publish the dumps
[18:11] 7z is the best for the full history dumps
[18:12] those will probably be fairly small, so easy to download and reupload
[18:15] yeah, we are dumping every page and revision
[18:15] so it's gonna be quite big, I think
[18:16] gonna offer RSS with the BT tracker
[18:16] so
[18:16] people can automate the downloads of our dumps
[18:17] Thomas-ED, ok, good
[20:07] Nemo_bis,
[20:07] http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz
[20:07] http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz.torrent
[20:07] cool
[20:07] http://tracker.srsdl.com/ed.php
[20:07] will publish monthly on that
[20:08] Thomas-ED, did you try 7z?
[20:09] nah, we're just gonna do that lol
[20:09] or do you have more bandwidth than CPU to waste?
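
A hedged sketch of the per-day packaging described at [16:04]: one .zip with the images plus one .csv with the metadata and renaming info. The tuple layout and CSV columns here are illustrative assumptions, not commonsdownloader.py's actual format.

    import csv
    import zipfile

    def pack_day(date, files):
        # files: list of (original_name, stored_name, local_path, metadata)
        # tuples collected while downloading that day's images.
        with zipfile.ZipFile(date + '.zip', 'w', zipfile.ZIP_DEFLATED) as z:
            for original, stored, path, _meta in files:
                z.write(path, arcname=stored)
        with open(date + '.csv', 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['original_filename', 'stored_filename', 'metadata'])
            for original, stored, _path, meta in files:
                writer.writerow([original, stored, meta])
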
[20:09] ok
[20:10] the best way I've found to compress wiki dumps is .rcs.gz
[20:10] RCS alone gets around the level of .xml.7z
[20:10] though extraction into a wiki isn't easy from .rcs
[20:11] chronomex, what dumps did you try?
[20:11] I guess it varies wildly
[20:12] to be honest, we just can't be bothered; we've made them available, so there :P
[20:12] I tried subsets of Wikipedia, and a full dump of wikiti.
[20:13] works best with wikis that have lots of small changes in big articles
[21:17] Thomas-ED, gzip: text-28-02-12.xml.gz: not in gzip format --> ??
[21:18] what does `file` say?
[21:20] chronomex, text-28-02-12.xml.gz: HTML document text
[21:20] hrm.
[21:21] looks like it's just an XML file
[21:23] $ head text-28-02-12.xml.gz
[21:23]
[21:23] etc.
[21:24] it was indeed quite huge for a compressed file
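
For the "not in gzip format" error at [21:17]: real gzip data always starts with the magic bytes 0x1f 0x8b, so a quick check (roughly what `file` is doing) shows the download is actually a plain XML/HTML document served under a .gz name. A tiny sketch, with an illustrative function name:

    def looks_like_gzip(path):
        # A gzip stream begins with the two magic bytes 1f 8b (RFC 1952).
        with open(path, 'rb') as f:
            return f.read(2) == b'\x1f\x8b'

    # looks_like_gzip('text-28-02-12.xml.gz') would return False here,
    # since the file begins with plain XML text instead.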