#wikiteam 2012-02-28,Tue


Time Nickname Message
10:36 🔗 db48x https://gist.github.com/1931814
10:42 🔗 db48x so those two missing files were skipped because they're not actually in the correct month
10:43 🔗 db48x the file size column does match the string "200409", however, so the estimate was off
10:45 🔗 * db48x sighs
10:45 🔗 db48x the svn history isn't very useful :)
11:23 🔗 emijrp this morning i woke up thinking about a possible bug
11:24 🔗 emijrp long filenames cant be saved
11:24 🔗 emijrp mediawiki allows very long filenames, i had issues with that in dumpgenerator.py
11:24 🔗 emijrp please, stop your wikimedia commons scripts, until i can make some tests
11:25 🔗 emijrp db48x: Nemo_bis
12:12 🔗 emijrp probably there are also errors with names including '?'
12:30 🔗 Nemo_bis emijrp, ok, stopping
12:31 🔗 Nemo_bis emijrp, this is what I have now: http://p.defau.lt/?2oYOTUZwuvTRDtBXsFAtxg
12:31 🔗 soultcer You could use the namehash as filename
12:33 🔗 emijrp in dumpgenerator.py i use the filename if it is less than 100 chars, or
12:33 🔗 emijrp filename[:100] + hash(filename) if it is longer
12:34 🔗 emijrp just using hash as filename is not descriptive for browsing
12:35 🔗 emijrp Nemo_bis: great, very fast, but i guess we need to re-generate all those, after fixing these bugs
12:35 🔗 Nemo_bis emijrp, :/
12:35 🔗 emijrp im sorry, but i said yesterday we are still under testing
12:35 🔗 Nemo_bis I think some form of check would be useful and needed anyway, to redownload wrong images
12:35 🔗 Nemo_bis Sure, I don't mind. Just this ^
12:43 🔗 ersi emijrp: I'd suggest using some ID for filenames, ie a hash digest and that you save the original filename in metadata
12:43 🔗 ersi But really, what's the problem with long filenames?
12:44 🔗 emijrp ubuntu (and others i guess) doesnt allow filenames greater than 143 chars
12:44 🔗 emijrp but you can upload longer filenames to mediawiki
12:51 🔗 emijrp by the way, i think we can create 1 item per year
12:51 🔗 emijrp IA items
12:56 🔗 ersi It depends on the file system, not the operating system. You mean EXT4? I think that's the default filesystem since Ubuntu 10.04
12:57 🔗 ersi Max filename length 256 bytes (characters)
12:57 🔗 emijrp yes, but we want "global" compatibility
12:57 🔗 emijrp people who download the zips have to be able to unpack everything without nasty errors
12:57 🔗 ersi then in my opinion, you can only go with hashes + real filename in metadata
12:58 🔗 ersi then it'd even work on shitty fat filesystems
12:59 🔗 emijrp but hashes as filenames, or hash + file extension?
12:59 🔗 emijrp if i use just the hash, you dont know if it is an image, an odp, a video, etc
13:00 🔗 ersi true
13:00 🔗 ersi hash + file extension then :]
13:00 🔗 emijrp anyway, i prefer the dumpgenerator.py method
13:01 🔗 emijrp first 100 chars + hash (32 chars) + extension
13:01 🔗 emijrp that is descriptive and fixes the long filenames bug
13:01 🔗 emijrp and you have info about the file type
13:02 🔗 emijrp what is the limit of files per IA item?
13:03 🔗 emijrp yearly items = 365 zips + 365 .csv
13:03 🔗 ersi 132 characters + 4-10 characters as file ext?
13:03 🔗 emijrp yes
13:04 🔗 emijrp hash = hash over original filename
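
A minimal sketch of the scheme emijrp describes; the 32-character hash suggests MD5, which is an assumption here, and the helper name is made up:

    import hashlib
    import os

    def truncate_filename(filename, limit=100):
        # filename is assumed to be a unicode string.
        # Short names are left untouched.
        base, ext = os.path.splitext(filename)
        if len(base) <= limit:
            return filename
        # Otherwise keep the first `limit` characters, append a hash of the
        # *original* filename (32 hex chars with MD5) so truncated names stay
        # unique, and keep the extension so the file type remains visible.
        digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
        return base[:limit] + digest + ext
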
13:10 🔗 db48x you'll need to process the filenames a little more than that
13:10 🔗 db48x most unix filesystems allow any character except / in a name, FAT and NTFS are a little more restrictive
13:11 🔗 emijrp you mean unicode chars?
13:11 🔗 db48x you can't use ?, :, \, /, and half a dozen others on FAT or NTFS
13:12 🔗 emijrp those are not allowed in mediawiki
13:12 🔗 emijrp my mistake
13:12 🔗 emijrp yes, only : not allowed
13:12 🔗 db48x \/:*?"<>| have to be eliminated
13:13 🔗 emijrp i can use just a filename hash as filename, but, man, that is not usable
13:13 🔗 db48x yea :(
13:13 🔗 emijrp you have to browse with an open csv window
13:14 🔗 db48x could build an html index
13:14 🔗 db48x but still not as good
13:15 🔗 db48x I've suggested before that we should create filesystem images instead of collections of files :)
13:18 🔗 emijrp if we discard fat and ntfs, we can use first 100 chars + original filename hash (32 chars) + extension
13:19 🔗 db48x yea
13:19 🔗 soultcer If the internet archive used tar files you could just use the python tarfile module and write the files directly from python. Would work on all platforms with zero unicode issues
13:19 🔗 emijrp but if the filename is shorter than 100 chars, i dont change it
13:19 🔗 db48x you could urlencode the first 50 characters, plus the filename hash and the extension
13:20 🔗 db48x no, 50 would be too big in the worst case
13:21 🔗 emijrp yes, but what about a guy who downloads the .tar on windows?
13:21 🔗 soultcer Uses 7zip no problem
13:21 🔗 emijrp he cant untar
13:21 🔗 db48x and anyway urlencoded names are unreadable in the majority of languages
13:21 🔗 db48x emijrp: almost any unzip program will also open tarballs
13:22 🔗 db48x you could just replace those 9 characters with their urlencoded forms
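
A sketch of db48x's suggestion, percent-encoding only the nine characters FAT/NTFS reject (Python 2 urllib; the function name is made up):

    import urllib

    WINDOWS_FORBIDDEN = '\\/:*?"<>|'  # the nine characters FAT/NTFS refuse

    def sanitize_for_windows(name):
        # Replace only the forbidden characters with their %XX forms;
        # everything else, including non-ASCII, is left untouched.
        return ''.join(urllib.quote(c, safe='') if c in WINDOWS_FORBIDDEN else c
                       for c in name)
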
13:22 🔗 soultcer The POSIX.1-2001 standard fully supports unicode and long filenames and any other kind of metadata you can think of
13:22 🔗 emijrp i mean, you can explore a .tar from the IA web interface, but on windows you cant untar files whose names contain arabic characters + ? + / + *
13:23 🔗 soultcer I still don't understand why the Internet Archive would prefer ZIP anyway. Tar is pretty much the gold standard in archiving files.
13:25 🔗 emijrp because zip compresses and tar doesn't?
13:26 🔗 soultcer You can compress a tar archive with any compressor you want (compress, gzip, bzip2, lzma, lzma2/xz, ...)
13:26 🔗 soultcer Zip compresses each file individually, giving you almost no savings, especially on the images
13:27 🔗 soultcer Compressing the tar archive yields a much higher compression ratio since all the metadata files contain a lot of the same data
13:27 🔗 emijrp yes, but you cant browse tars
13:27 🔗 emijrp compressed tars
13:28 🔗 emijrp i understood from my question to Jason that the desired format for browsing is zip
13:28 🔗 emijrp not tar
13:28 🔗 emijrp zip = browsable + compress
13:28 🔗 emijrp tar = browsable
13:28 🔗 soultcer zip = browsable + compress + bad unicode support + filename problems
13:29 🔗 soultcer tar = browsable + unicode support + long filenames
13:29 🔗 soultcer The compression is pretty much useless on pictures (though it helps for metadata and svg/xml pictures)
13:29 🔗 emijrp i read here yesterday that zip fixed their unicode issues
13:30 🔗 soultcer Partially. The official zip spec defines a flag which is set when the filenames are encoded in unicode. The unix version of zip adds a special metadata field for each file which contains the unicode filename
13:31 🔗 soultcer The downside is that the python zipfile module supports neither of those extensions, while the python tarfile module supports unicode just fine.
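
For what it's worth, a minimal sketch of soultcer's tarfile suggestion; the archive and member names are made up. The PAX (POSIX.1-2001) format stores UTF-8 member names of any length:

    import tarfile

    # Write an already-downloaded local file into a compressed tar under its
    # original (Unicode) name, independent of local filesystem limits.
    tar = tarfile.open('2004-09-07.tar.gz', 'w:gz', format=tarfile.PAX_FORMAT)
    tar.add('local_copy_0001.jpg', arcname=u'שניצלר.jpg')
    tar.close()
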
13:31 🔗 emijrp i dont use python to save files, just wget + curl
13:32 🔗 emijrp wikimedia servers dont like urllib
13:32 🔗 soultcer That's my point: If you use python instead of wget, curl and zip commands, you could get around all those unicode and filename length issues
13:33 🔗 soultcer If you can give me an example where urllib doesn't work, I'd actually be curious to investigate, even if you don't decide on using urllib.
13:33 🔗 emijrp ok
13:33 🔗 emijrp just try to save http://upload.wikimedia.org/wikipedia/commons/5/50/%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg using urllib or urllib2
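
For the record, the usual reason such requests fail is that Wikimedia rejects the default "Python-urllib" User-Agent; here is a Python 2 sketch with a custom one (the agent string and output filename are made up):

    import urllib2

    url = ('http://upload.wikimedia.org/wikipedia/commons/5/50/'
           '%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg')
    # Send a descriptive User-Agent instead of the default one.
    request = urllib2.Request(url, headers={'User-Agent': 'commons-archiver/0.1'})
    data = urllib2.urlopen(request).read()
    with open('download.jpg', 'wb') as handle:
        handle.write(data)
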
13:35 🔗 emijrp anyway, if we use tar, which allows unicode and 10000-char filenames, how the hell do i unpack that tar on my ubuntu? it doesnt like filenames longer than 143 chars
13:40 🔗 soultcer Get a better filesystem?
13:41 🔗 soultcer There must be some way to preserve the filenames. If you want to stick with zip, you could just write a list of "original filename:filename in zip archive" and put that list into each zip file?
13:42 🔗 emijrp obviously if i rename files, i wont throw away the originalname -> newname mapping
13:42 🔗 emijrp that goes to a .csv file
13:42 🔗 soultcer Okay, guess that's fine too.
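
A sketch of that rename log as a CSV, one row per file with the original name first (filenames here are placeholders):

    import csv

    with open('2004-09-07.csv', 'wb') as handle:  # 'wb' for the Python 2 csv module
        writer = csv.writer(handle)
        writer.writerow(['original_filename', 'saved_filename'])
        writer.writerow(['Some_very_long_original_name.jpg',
                         'Some_very_long_orig_0123456789abcdef0123456789abcdef.jpg'])
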
13:43 🔗 soultcer Btw, if you use Ubuntu with ext3 or ext4 the maximum filename length will be 255 bytes, same as for ZFS
13:43 🔗 soultcer And since Wikimedia Commons uses ZFS as well, their filenames shouldn't be longer than 255 bytes either?
13:44 🔗 emijrp i use ext4
13:44 🔗 emijrp but when i try to download this with wget, it crashes http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg
13:44 🔗 emijrp try it
13:45 🔗 soultcer Works fine for me with wget on ext4 file system
13:45 🔗 emijrp not for me
13:45 🔗 db48x worked for me
13:46 🔗 soultcer Weird, but assuming that this also happens on other Ubuntu installs it is better to limit filenames somehow
13:46 🔗 db48x GNU Wget 1.12-2507-dirty and ext4
13:48 🔗 db48x I guess I should go to work
13:50 🔗 emijrp i dont understand why it fails on my comp
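
One quick check on a machine where the long name fails is to ask the filesystem itself for its limit. Plain ext4 reports 255, while an eCryptfs-encrypted Ubuntu home directory is a common way to end up with the ~143-character ceiling mentioned earlier (that cause is a guess):

    import os

    # Maximum filename length accepted by the filesystem of the current directory.
    print(os.pathconf('.', 'PC_NAME_MAX'))
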
13:53 🔗 emijrp do you know how to sha1sum in base 36?
14:02 🔗 db48x you don't do sha1sum in any particular base
14:02 🔗 db48x the result of the hashing algorithm is just a number
14:02 🔗 db48x sha1sum outputs that number in hexadecimal
14:03 🔗 db48x to get that same number in base 36, just convert it
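
As db48x says, it is just a base conversion; a small sketch (the sample filename is made up):

    import hashlib

    def to_base36(number):
        # Convert a non-negative integer to a base-36 string (0-9, a-z).
        digits = '0123456789abcdefghijklmnopqrstuvwxyz'
        if number == 0:
            return '0'
        out = []
        while number:
            number, rem = divmod(number, 36)
            out.append(digits[rem])
        return ''.join(reversed(out))

    # sha1sum prints the digest in hexadecimal; the same number in base 36:
    hex_digest = hashlib.sha1(b'Example_filename.jpg').hexdigest()
    print(to_base36(int(hex_digest, 16)))
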
14:07 🔗 db48x well, be back later
14:09 🔗 ersi emijrp: I can download that URL with wget just fine
14:10 🔗 ersi 2f4ffd812a553a00cdfed2f2ec51f1f92baa0272 Ferenc_JOACHIM_(1882-1964),_Hungarian_(Magyar)_artist_painter,_his_wife_Margit_GRAF,_their_three_children_Piroska,_Ferenc_G_and_Attila,_and_two_of_nine_grandchildren,_1944,_Budapest,_Hungary..jpg
14:10 🔗 ersi sha1sum on the file~
14:25 🔗 emijrp can you download this with wget too Nemo_bis ? http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg
14:32 🔗 ersi emijrp: I could
14:32 🔗 ersi oh, it's the same pic as before - nevermind
14:45 🔗 emijrp im working on the long filenames and '?' chars bugs
15:07 🔗 emijrp testing the new version...
15:17 🔗 emijrp example of image with * http://commons.wikimedia.org/wiki/File:*snore*_-_Flickr_-_exfordy.jpg
15:40 🔗 emijrp are there file extensions longer than 4 chars?
15:42 🔗 emijrp http://ubuntuforums.org/showthread.php?t=1173541
16:02 🔗 emijrp ok
16:02 🔗 emijrp the new version is out
16:02 🔗 emijrp now .desc files are .xml
16:03 🔗 emijrp filenames are truncated if length > 100, to this format: first 100 chars + hash (full filename) + extension
16:03 🔗 emijrp script tested with arabic, hebrew, and weird filenames containing * ?
16:04 🔗 emijrp oh, and now it creates a .zip and a .csv for every day
16:04 🔗 emijrp the .csv contains all the metadata from the feed list for that day, including the renaming info
16:04 🔗 emijrp http://code.google.com/p/wikiteam/source/browse/trunk/commonsdownloader.py
16:05 🔗 emijrp i think we can download the year 2004 (it starts on 2004-09-07) and try to upload all that to IA
16:05 🔗 emijrp make an item for 2004, create the browsable links, etc, and see if everything is fine
16:05 🔗 emijrp then, launch the script for 2005, 2006...
16:06 🔗 emijrp Nemo_bis: 2004, downloaded yesterday, is less than 1GB (all days), so we can do this test between today and tomorrow
16:15 🔗 emijrp redownload the code, i modified a line a second ago
17:26 🔗 Nemo_bis hm, back
17:49 🔗 Thomas-ED hi guys
17:49 🔗 Thomas-ED Nemo_bis, are you there?
17:49 🔗 Nemo_bis yes
17:50 🔗 Thomas-ED alright
17:50 🔗 Thomas-ED so, i'm a sysadmin for encyclopedia dramatica, we're gonna start offering image dumps
17:50 🔗 Thomas-ED and possibly page dumps
17:50 🔗 Nemo_bis good
17:50 🔗 Thomas-ED we have around 30GB of images
17:50 🔗 Thomas-ED how do we go about this?
17:50 🔗 Nemo_bis it's not that hard
17:51 🔗 Thomas-ED i mean
17:51 🔗 Nemo_bis For the images, you could just make a tar of your image directory and put it on archive.org
17:51 🔗 Thomas-ED but
17:51 🔗 Thomas-ED we can provide the dumps on our servers
17:51 🔗 Thomas-ED do you guys download it? or do we upload it?
17:51 🔗 Nemo_bis it's better if you upload directly
17:51 🔗 Nemo_bis I can give you the exact command if you don't know IA's s3 interface
17:56 🔗 Nemo_bis Thomas-ED, text is usually more interesting, there's https://www.mediawiki.org/wiki/Manual:DumpBackup.php for it
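
For reference, a sketch of running that maintenance script for a full-history dump, driven from Python; the install path and output name are assumptions, while --full and --output=gzip: are documented dumpBackup.php options:

    import subprocess

    subprocess.check_call(
        ['php', 'maintenance/dumpBackup.php',
         '--full',                               # every page, every revision
         '--output=gzip:ed-full-history.xml.gz'],
        cwd='/srv/www/wiki')                     # assumed MediaWiki install directory
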
17:56 🔗 Thomas-ED so nobody would want the images?
17:57 🔗 Thomas-ED were you responsible for dumping ed.com?
17:57 🔗 Thomas-ED we started .ch from those dumps :P
18:00 🔗 Nemo_bis Thomas-ED, no, I didn't do that
18:00 🔗 Nemo_bis the images are good for archival purposes, but the text is more important usually
18:01 🔗 Nemo_bis also, without the text one can't use the images, because one doesn't know what they are or what their licensing status is :)
18:02 🔗 Thomas-ED well
18:02 🔗 Thomas-ED i don't know if you're familiar with ED or anything
18:02 🔗 Thomas-ED we don't really care about license or copyright, none of that info is available for our images
18:02 🔗 Nemo_bis yes, that's why I said "usually" :p
18:02 🔗 Nemo_bis some description will probably be there anyway
18:03 🔗 Nemo_bis the images tar is very easy to do, but the dump script should work smoothly enough as well
18:06 🔗 Thomas-ED can do the text, but the images will happen later as it will cause a lot of I/O and we are at peak right now
18:11 🔗 Thomas-ED Nemo_bis, we're gonna offer HTTP downloads of monthly text dumps, and bittorrent as well
18:11 🔗 Thomas-ED you think that will suffice? can't be bothered to upload anywhere.
18:11 🔗 Thomas-ED people can grab from us
18:11 🔗 Nemo_bis yep
18:11 🔗 Nemo_bis just tell us when you publish the dumps
18:11 🔗 Nemo_bis 7z is the best for the full history dumps
18:12 🔗 Nemo_bis those will probably be fairly small, so easy to download and reupload
18:15 🔗 Thomas-ED yeah we are dumping every page and revision
18:15 🔗 Thomas-ED so its gonna be quite big i think
18:16 🔗 Thomas-ED gonna offer RSS with the BT tracker
18:16 🔗 Thomas-ED so
18:16 🔗 Thomas-ED people can automate the downloads of our dumps
18:17 🔗 Nemo_bis Thomas-ED, ok, good
20:07 🔗 Thomas-ED Nemo_bis,
20:07 🔗 Thomas-ED http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz
20:07 🔗 Thomas-ED http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz.torrent
20:07 🔗 Nemo_bis cool
20:07 🔗 Thomas-ED http://tracker.srsdl.com/ed.php
20:07 🔗 Thomas-ED will publish monthly on that
20:08 🔗 Nemo_bis Thomas-ED, did you try 7z?
20:09 🔗 Thomas-ED nah we're just gonna do that lol
20:09 🔗 Nemo_bis or do you have more bandwidth than CPU to waste?
20:09 🔗 Nemo_bis ok
20:10 🔗 chronomex the best way I've found to compress wiki dumps is .rcs.gz
20:10 🔗 chronomex RCS alone gets around the level of .xml.7z
20:10 🔗 chronomex tho extraction into a wiki isnt easy from .rcs
20:11 🔗 Nemo_bis chronomex, what dumps did you try?
20:11 🔗 Nemo_bis I guess it varies wildly
20:12 🔗 Thomas-ED to be honest, we just cant be bothered, we've made them available, so there :P
20:12 🔗 chronomex I tried subsets of wikipedia, and a full dump of wikiti.
20:13 🔗 chronomex works best with wikis that have lots of small changes in big articles
21:17 🔗 Nemo_bis Thomas-ED, gzip: text-28-02-12.xml.gz: not in gzip format --> ??
21:18 🔗 chronomex what does `file` say?
21:20 🔗 Nemo_bis chronomex, text-28-02-12.xml.gz: HTML document text
21:20 🔗 chronomex hrm.
21:21 🔗 Nemo_bis looks like it's just an xml
21:23 🔗 Nemo_bis $ head text-28-02-12.xml.gz
21:23 🔗 Nemo_bis <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en">
21:23 🔗 Nemo_bis etc.
21:24 🔗 Nemo_bis it was indeed quite huge for a compressed file
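
A quick way to check this kind of thing: real gzip streams begin with the magic bytes 0x1f 0x8b (the helper below is hypothetical):

    def is_gzip(path):
        # gzip files always start with the two-byte magic number 1f 8b.
        with open(path, 'rb') as handle:
            return handle.read(2) == b'\x1f\x8b'

    print(is_gzip('text-28-02-12.xml.gz'))
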
