#wikiteam 2012-02-28,Tue


Time Nickname Message
10:36 🔗 db48x https://gist.github.com/1931814
10:42 🔗 db48x so those two missing files were skipped because they're not actually in the correct month
10:43 🔗 db48x the file size column does match the string "200409", however, so the estimate was off
10:45 🔗 * db48x sighs
10:45 🔗 db48x the svn history isn't very useful :)
11:23 🔗 emijrp this morning i woke up thinking about a possible bug
11:24 🔗 emijrp long filenames cant be saved
11:24 🔗 emijrp mediawiki allows very long filenames, i had issues with that in dumpgenerator.py
11:24 🔗 emijrp please, stop your wikimedia commons scripts, until i can make some tests
11:25 🔗 emijrp db48x: Nemo_bis
12:12 🔗 emijrp probably there are also errors with names including '?'
12:30 🔗 Nemo_bis emijrp, ok, stopping
12:31 🔗 Nemo_bis emijrp, this is what I have now: http://p.defau.lt/?2oYOTUZwuvTRDtBXsFAtxg
12:31 🔗 soultcer You could use the namehash as filename
12:33 🔗 emijrp in dumpgenerator.py i use the filename if it is less than 100 chars, or
12:33 🔗 emijrp filename[:100] + hash(filename) if it is longer
12:34 🔗 emijrp just using hash as filename is not descriptive for browsing
12:35 🔗 emijrp Nemo_bis: great, very fast, but i guess we need to re-generate all those, after fixing these bugs
12:35 🔗 Nemo_bis emijrp, :/
12:35 🔗 emijrp im sorry, but i said yesterday we are still under testing
12:35 🔗 Nemo_bis I think some form of check would be useful and needed anyway, to redownload wrong images
12:35 🔗 Nemo_bis Sure, I don't mind. Just this ^
12:43 🔗 ersi emijrp: I'd suggest using some ID for filenames, ie a hash digest and that you save the original filename in metadata
12:43 🔗 ersi But really, what's the problem with long filenames?
12:44 🔗 emijrp ubuntu (and others i guess) doesnt allow filenames greater than 143 chars
12:44 🔗 emijrp but you can upload longer filenames to mediawiki
12:51 🔗 emijrp by the way, i think we can create 1 item per year
12:51 🔗 emijrp IA items
12:56 🔗 ersi It depends on the file system, not the operating system. You mean EXT4? I think that's the default filesystem since Ubuntu 10.04
12:57 🔗 ersi Max filename length 256 bytes (characters)
12:57 🔗 emijrp yes, but we want "global" compatibility
12:57 🔗 emijrp people who download the zips have to be able to unpack everything without nasty errors
12:57 🔗 ersi then in my opinion, you can only go with hashes + real filename in metadata
12:58 🔗 ersi then it'd even work on shitty fat filesystems
12:59 🔗 emijrp but hashes as filenames, or hash + file extension?
12:59 🔗 emijrp if i use just the hash, you dont know if it is an image, an odp, a video, etc
13:00 🔗 ersi true
13:00 🔗 ersi hash + file extension then :]
13:00 🔗 emijrp anyway, i prefer the dumpgenerator.py method
13:01 🔗 emijrp first 100 chars + hash (32 chars) + extension
13:01 🔗 emijrp that is descriptive and fixes the long filenames bug
13:01 🔗 emijrp and you have info about the file type
13:02 🔗 emijrp what is the limit of files per IA item?
13:03 🔗 emijrp yearly items = 365 zips + 365 .csv
13:03 🔗 ersi 132 characters + 4-10 characters as file ext?
13:03 🔗 emijrp yes
13:04 🔗 emijrp hash = hash over original filename
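
A minimal sketch of the scheme emijrp describes; the 32-character hash suggests MD5, which is an assumption here, and the helper name is made up:

    import hashlib
    import os

    def truncate_filename(filename, limit=100):
        # filename is assumed to be a unicode string.
        # Short names are left untouched.
        base, ext = os.path.splitext(filename)
        if len(base) <= limit:
            return filename
        # Otherwise keep the first `limit` characters, append a hash of the
        # *original* filename (32 hex chars with MD5) so truncated names stay
        # unique, and keep the extension so the file type remains visible.
        digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
        return base[:limit] + digest + ext
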
13:10 🔗 db48x you'll need to process the filenames a little more than that
13:10 🔗 db48x most unix filesystems allow any character except / in a name, FAT and NTFS are a little more restrictive
13:11 🔗 emijrp you mean unicode chars?
13:11 🔗 db48x you can't use ?, :, \, /, and half a dozen others on FAT or NTFS
13:12 🔗 emijrp those are not allowed in mediawiki
13:12 🔗 emijrp my mistake
13:12 🔗 emijrp yes, only : not allowed
13:12 🔗 db48x \/:*?"<>| have to be eliminated
13:13 🔗 emijrp i can use just a filename hash as filename, but, man, that is not usable
13:13 🔗 db48x yea :(
13:13 🔗 emijrp you have to browse with an open csv window
13:14 🔗 db48x could build an html index
13:14 🔗 db48x but still not as good
13:15 🔗 db48x I've suggested before that we should create filesystem images instead of collections of files :)
13:18 🔗 emijrp if we discard fat and ntfs, we can use first 100 chars + original filename hash (32 chars) + extension
13:19 🔗 db48x yea
13:19 🔗 soultcer If the internet archive used tar files you could just use the python tarfile module and write the files directly from python. Would work on all platforms with zero unicode issues
13:19 🔗 emijrp but if the filename is shorter than 100 chars, i dont change it
13:19 🔗 db48x you could urlencode the first 50 characters, plus the filename hash and the extension
13:20 🔗 db48x no, 50 would be too big in the worst case
13:21 🔗 emijrp yes, but what about a guy who downloads the .tar on windows?
13:21 🔗 soultcer Uses 7zip no problem
13:21 🔗 emijrp he cant untar
13:21 🔗 db48x and anyway urlencoded names are unreadable in the majority of languages
13:21 🔗 db48x emijrp: almost any unzip program will also open tarballs
13:22 🔗 db48x you could just replace those 9 characters with their urlencoded forms
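
A sketch of db48x's suggestion, percent-encoding only the nine characters FAT/NTFS reject (Python 2 urllib; the function name is made up):

    import urllib

    WINDOWS_FORBIDDEN = '\\/:*?"<>|'  # the nine characters FAT/NTFS refuse

    def sanitize_for_windows(name):
        # Replace only the forbidden characters with their %XX forms;
        # everything else, including non-ASCII, is left untouched.
        return ''.join(urllib.quote(c, safe='') if c in WINDOWS_FORBIDDEN else c
                       for c in name)
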
13:22 🔗 soultcer The POSIX.1-2001 standard fully supports unicode and long filenames and any other kind of metadata you can think of
13:22 🔗 emijrp i mean, you can explore a .tar from the IA web interface, but on windows you cant untar files whose names contain arabic characters + ? + / + *
13:23 🔗 soultcer I still don't understand why the Internet Archive would prefer ZIP anyway. Tar is pretty much the gold standard in archiving files.
13:25 🔗 emijrp because zip compresses and tar doesn't?
13:26 🔗 soultcer You can compress a tar archive with any compressor you want (compress, gzip, bzip2, lzma, lzma2/xz, ...)
13:26 🔗 soultcer Zip compresses each file individually, giving you almost no savings, especially on the images
13:27 🔗 soultcer Compressing the tar archive yields a much higher compression ratio since all the metadata files contain a lot of the same data
13:27 🔗 emijrp yes, but you cant browse tars
13:27 🔗 emijrp compressed tars
13:28 🔗 emijrp i understood from my question to Jason that the desired format for browsing is zip
13:28 🔗 emijrp not tar
13:28 🔗 emijrp zip = browsable + compress
13:28 🔗 emijrp tar = browsable
13:28 🔗 soultcer zip = browsable + compress + bad unicode support + filename problems
13:29 🔗 soultcer tar = browsable + unicode support + long filenames
13:29 🔗 soultcer The compression is pretty much useless on pictures (though it helps for metadata and svg/xml pictures)
13:29 🔗 emijrp i read here yesterday that zip fixed their unicode issues
13:30 🔗 soultcer Partially. The official zip spec defines a flag which is set when the filenames are encoded in unicode. The unix version of zip adds a special metadata field for each file which contains the unicode filename
13:31 🔗 soultcer The downside is that the python zipfile module supports neither of those extensions, while the python tarfile module supports unicode just fine.
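
For what it's worth, a minimal sketch of soultcer's tarfile suggestion; the archive and member names are made up. The PAX (POSIX.1-2001) format stores UTF-8 member names of any length:

    import tarfile

    # Write an already-downloaded local file into a compressed tar under its
    # original (Unicode) name, independent of local filesystem limits.
    tar = tarfile.open('2004-09-07.tar.gz', 'w:gz', format=tarfile.PAX_FORMAT)
    tar.add('local_copy_0001.jpg', arcname=u'שניצלר.jpg')
    tar.close()
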
13:31 🔗 emijrp i dont use python to save files, just wget + curl
13:32 🔗 emijrp wikimedia servers dont like urllib
13:32 🔗 soultcer That's my point: If you use python instead of wget, curl and zip commands, you could get around all those unicode and filename length issues
13:33 🔗 soultcer If you can give me an example where urllib doesn't work, I'd actually be curious to investigate, even if you don't decide on using urllib.
13:33 🔗 emijrp ok
13:33 🔗 emijrp just try to save http://upload.wikimedia.org/wikipedia/commons/5/50/%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg using urllib or urllib2
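
For the record, the usual reason such requests fail is that Wikimedia rejects the default "Python-urllib" User-Agent; here is a Python 2 sketch with a custom one (the agent string and output filename are made up):

    import urllib2

    url = ('http://upload.wikimedia.org/wikipedia/commons/5/50/'
           '%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg')
    # Send a descriptive User-Agent instead of the default one.
    request = urllib2.Request(url, headers={'User-Agent': 'commons-archiver/0.1'})
    data = urllib2.urlopen(request).read()
    with open('download.jpg', 'wb') as handle:
        handle.write(data)
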
13:35 🔗 emijrp anyway, if we use tar, which allows unicode and 10000-char filenames, how the hell do i unpack that tar on my ubuntu? it doesnt like filenames longer than 143 chars
13:40 🔗 soultcer Get a better filesystem?
13:41 🔗 soultcer There must be some way to preserve the filenames. If you want to stick with zip, you could just write a list of "original filename:filename in zip archive" and put that list into each zip file?
13:42 🔗 emijrp obviously if i rename files, i wont throw away the originalname -> newname mapping
13:42 🔗 emijrp that goes to a .csv file
13:42 🔗 soultcer Okay, guess that's fine too.
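
A sketch of that rename log as a CSV, one row per file with the original name first (filenames here are placeholders):

    import csv

    with open('2004-09-07.csv', 'wb') as handle:  # 'wb' for the Python 2 csv module
        writer = csv.writer(handle)
        writer.writerow(['original_filename', 'saved_filename'])
        writer.writerow(['Some_very_long_original_name.jpg',
                         'Some_very_long_orig_0123456789abcdef0123456789abcdef.jpg'])
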
13:43 🔗 soultcer Btw, if you use Ubuntu with ext3 or ext4 the maximum filename length will be 255 bytes, same as for ZFS
13:43 🔗 soultcer And since Wikimedia Commons uses ZFS as well, their filenames shouldn't be longer than 255 bytes either?
13:44 🔗 emijrp i use ext4
13:44 🔗 emijrp but when i try to download this with wget, it crashes http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg
13:44 🔗 emijrp try it
13:45 🔗 soultcer Works fine for me with wget on ext4 file system
13:45 🔗 emijrp not for me
13:45 🔗 db48x worked for me
13:46 🔗 soultcer Weird, but assuming that this also happens on other Ubuntu installs it is better to limit filenames somehow
13:46 🔗 db48x GNU Wget 1.12-2507-dirty and ext4
13:48 🔗 db48x I guess I should go to work
13:50 🔗 emijrp i dont understand why it fails on my comp
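
One quick check on a machine where the long name fails is to ask the filesystem itself for its limit. Plain ext4 reports 255, while an eCryptfs-encrypted Ubuntu home directory is a common way to end up with the ~143-character ceiling mentioned earlier (that cause is a guess):

    import os

    # Maximum filename length accepted by the filesystem of the current directory.
    print(os.pathconf('.', 'PC_NAME_MAX'))
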
13:53 🔗 emijrp do you know how to sha1sum in base 36?
14:02 🔗 db48x you don't do sha1sum in any particular base
14:02 🔗 db48x the result of the hashing algorithm is just a number
14:02 🔗 db48x sha1sum outputs that number in hexadecimal
14:03 🔗 db48x to get that same number in base 36, just convert it
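
As db48x says, it is just a base conversion; a small sketch (the sample filename is made up):

    import hashlib

    def to_base36(number):
        # Convert a non-negative integer to a base-36 string (0-9, a-z).
        digits = '0123456789abcdefghijklmnopqrstuvwxyz'
        if number == 0:
            return '0'
        out = []
        while number:
            number, rem = divmod(number, 36)
            out.append(digits[rem])
        return ''.join(reversed(out))

    # sha1sum prints the digest in hexadecimal; the same number in base 36:
    hex_digest = hashlib.sha1(b'Example_filename.jpg').hexdigest()
    print(to_base36(int(hex_digest, 16)))
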
14:07 🔗 db48x well, be back later
14:09 🔗 ersi emijrp: I can download that URL with wget just fine
14:10 🔗 ersi 2f4ffd812a553a00cdfed2f2ec51f1f92baa0272 Ferenc_JOACHIM_(1882-1964),_Hungarian_(Magyar)_artist_painter,_his_wife_Margit_GRAF,_their_three_children_Piroska,_Ferenc_G_and_Attila,_and_two_of_nine_grandchildren,_1944,_Budapest,_Hungary..jpg
14:10 🔗 ersi sha1sum on the file~
14:25 🔗 emijrp can you download this with wget too Nemo_bis ? http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg
14:32 🔗 ersi emijrp: I could
14:32 🔗 ersi oh, it's the same pic as before - nevermind
14:45 🔗 emijrp im working on the long filenames and '?' chars bugs
15:07 🔗 emijrp testing the new version...
15:17 🔗 emijrp example of image with * http://commons.wikimedia.org/wiki/File:*snore*_-_Flickr_-_exfordy.jpg
15:40 🔗 emijrp are there file extensions longer than 4 chars?
15:42 🔗 emijrp http://ubuntuforums.org/showthread.php?t=1173541
16:02 🔗 emijrp ok
16:02 🔗 emijrp the new version is out
16:02 🔗 emijrp now .desc files are .xml
16:03 🔗 emijrp filenames are truncated if length > 100, to this format: first 100 chars + hash (full filename) + extension
16:03 🔗 emijrp script tested with arabic, hebrew, and weird filenames containing * ?
16:04 🔗 emijrp oh, and now it creates a .zip and a .csv for every day
16:04 🔗 emijrp the .csv contains all the metadata from the feed list for that day, including the renaming info
16:04 🔗 emijrp http://code.google.com/p/wikiteam/source/browse/trunk/commonsdownloader.py
16:05 🔗 emijrp i think we can download the year 2004 (it starts on 2004-09-07) and try to upload all that to IA
16:05 🔗 emijrp make an item for 2004, create the browsable links, etc, and see if everything is fine
16:05 🔗 emijrp then, launch the script for 2005, 2006...
16:06 🔗 emijrp Nemo_bis: 2004, downloaded yesterday, is less than 1GB (all days), so we can do this test between today and tomorrow
16:15 🔗 emijrp redownload the code, i modified a line a second ago
17:26 🔗 Nemo_bis hm, back
17:49 🔗 Thomas-ED hi guys
17:49 🔗 Thomas-ED Nemo_bis, are you there?
17:49 🔗 Nemo_bis yes
17:50 🔗 Thomas-ED alright
17:50 🔗 Thomas-ED so, i'm a sysadmin for encyclopedia dramatica, we're gonna start offering image dumps
17:50 🔗 Thomas-ED and possibly page dumps
17:50 🔗 Nemo_bis good
17:50 🔗 Thomas-ED we have around 30GB of images
17:50 🔗 Thomas-ED how do we go about this?
17:50 🔗 Nemo_bis it's not that hard
17:51 🔗 Thomas-ED i mean
17:51 🔗 Nemo_bis For the images, you could just make a tar of your image directory and put it on archive.org
17:51 🔗 Thomas-ED but
17:51 🔗 Thomas-ED we can provide the dumps on our servers
17:51 🔗 Thomas-ED do you guys download it? or do we upload it?
17:51 🔗 Nemo_bis it's better if you upload directly
17:51 🔗 Nemo_bis I can give you the exact command if you don't know IA's s3 interface
17:56 🔗 Nemo_bis Thomas-ED, text is usually more interesting, there's https://www.mediawiki.org/wiki/Manual:DumpBackup.php for it
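
For reference, a sketch of running that maintenance script for a full-history dump, driven from Python; the install path and output name are assumptions, while --full and --output=gzip: are documented dumpBackup.php options:

    import subprocess

    subprocess.check_call(
        ['php', 'maintenance/dumpBackup.php',
         '--full',                               # every page, every revision
         '--output=gzip:ed-full-history.xml.gz'],
        cwd='/srv/www/wiki')                     # assumed MediaWiki install directory
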
17:56 🔗 Thomas-ED so nobody would want the images?
17:57 🔗 Thomas-ED were you responsible for dumping ed.com?
17:57 🔗 Thomas-ED we started .ch from those dumps :P
18:00 🔗 Nemo_bis Thomas-ED, no, I didn't do that
18:00 🔗 Nemo_bis the images are good for archival purposes, but the text is more important usually
18:01 🔗 Nemo_bis also, without the text one can't use the images, because one doesn't know what they are or what their licensing status is :)
18:02 🔗 Thomas-ED well
18:02 🔗 Thomas-ED i don't know if you're familiar with ED or anything
18:02 🔗 Thomas-ED we don't really care about license or copyright, none of that info is available for our images
18:02 🔗 Nemo_bis yes, that's why I said "usually" :p
18:02 🔗 Nemo_bis some description will probably be there anyway
18:03 🔗 Nemo_bis the images tar is very easy to do, but the dump script should work smoothly enough as well
18:06 🔗 Thomas-ED can do the text, but the images will happen later as it will cause a lot of I/O and we are at peak right now
18:11 🔗 Thomas-ED Nemo_bis, we're gonna offer HTTP downloads of monthly text dumps, and bittorrent as well
18:11 🔗 Thomas-ED you think that will suffice? can't be bothered to upload anywhere.
18:11 🔗 Thomas-ED people can grab from us
18:11 🔗 Nemo_bis yep
18:11 🔗 Nemo_bis just tell us when you publish the dumps
18:11 🔗 Nemo_bis 7z is the best for the full history dumps
18:12 🔗 Nemo_bis those will probably be fairly small, so easy to download and reupload
18:15 🔗 Thomas-ED yeah we are dumping every page and revision
18:15 🔗 Thomas-ED so its gonna be quite big i think
18:16 🔗 Thomas-ED gonna offer RSS with the BT tracker
18:16 🔗 Thomas-ED so
18:16 🔗 Thomas-ED people can automate the downloads of our dumps
18:17 🔗 Nemo_bis Thomas-ED, ok, good
20:07 🔗 Thomas-ED Nemo_bis,
20:07 🔗 Thomas-ED http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz
20:07 🔗 Thomas-ED http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz.torrent
20:07 🔗 Nemo_bis cool
20:07 🔗 Thomas-ED http://tracker.srsdl.com/ed.php
20:07 🔗 Thomas-ED will publish monthly on that
20:08 🔗 Nemo_bis Thomas-ED, did you try 7z?
20:09 🔗 Thomas-ED nah we're just gonna do that lol
20:09 🔗 Nemo_bis or do you have more bandwidth than CPU to waste?
20:09 🔗 Nemo_bis ok
20:10 🔗 chronomex the best way I've found to compress wiki dumps is .rcs.gz
20:10 🔗 chronomex RCS alone gets around the level of .xml.7z
20:10 🔗 chronomex tho extraction into a wiki isnt easy from .rcs
20:11 🔗 Nemo_bis chronomex, what dumps did you try?
20:11 🔗 Nemo_bis I guess it varies wildly
20:12 🔗 Thomas-ED to be honest, we just cant be bothered, we've made them available, so there :P
20:12 🔗 chronomex I tried subsets of wikipedia, and a full dump of wikiti.
20:13 🔗 chronomex works best with wikis that have lots of small changes in big articles
21:17 🔗 Nemo_bis Thomas-ED, gzip: text-28-02-12.xml.gz: not in gzip format --> ??
21:18 🔗 chronomex what does `file` say?
21:20 🔗 Nemo_bis chronomex, text-28-02-12.xml.gz: HTML document text
21:20 🔗 chronomex hrm.
21:21 🔗 Nemo_bis looks like it's just an xml
21:23 🔗 Nemo_bis $ head text-28-02-12.xml.gz
21:23 🔗 Nemo_bis <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en">
21:23 🔗 Nemo_bis etc.
21:24 🔗 Nemo_bis it was indeed quite huge for a compressed file
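
A quick way to check this kind of thing: real gzip streams begin with the magic bytes 0x1f 0x8b (the helper below is hypothetical):

    def is_gzip(path):
        # gzip files always start with the two-byte magic number 1f 8b.
        with open(path, 'rb') as handle:
            return handle.read(2) == b'\x1f\x8b'

    print(is_gzip('text-28-02-12.xml.gz'))
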
