10:36 <db48x> https://gist.github.com/1931814
10:42 <db48x> so those two missing files were skipped because they're not actually in the correct month
10:43 <db48x> the file size column does match the string "200409", however, so the estimate was off
10:45 * db48x sighs
10:45 <db48x> the svn history isn't very useful :)
11:23 <emijrp> this morning i woke up thinking about a possible bug
11:24 <emijrp> long filenames can't be saved
11:24 <emijrp> mediawiki allows very long filenames, i had issues with that in dumpgenerator.py
11:24 <emijrp> please, stop your wikimedia commons scripts until i can make some tests
11:25 <emijrp> db48x: Nemo_bis
12:12 <emijrp> probably also errors with names including '?'
12:30 <Nemo_bis> emijrp, ok, stopping
12:31 <Nemo_bis> emijrp, this is what I have now: http://p.defau.lt/?2oYOTUZwuvTRDtBXsFAtxg
12:31 <soultcer> You could use the name hash as the filename
12:33 <emijrp> in dumpgenerator.py i use the filename if it is less than 100 chars, or
12:33 <emijrp> filename[:100] + hash(filename) if it is long
12:34 <emijrp> just using a hash as the filename is not descriptive for browsing
12:35 <emijrp> Nemo_bis: great, very fast, but i guess we need to re-generate all those after fixing these bugs
12:35 <Nemo_bis> emijrp, :/
12:35 <emijrp> i'm sorry, but i said yesterday we are still testing
12:35 <Nemo_bis> I think some form of check would be useful and needed anyway, to redownload wrong images
12:35 <Nemo_bis> Sure, I don't mind. Just this ^
12:43 <ersi> emijrp: I'd suggest using some ID for filenames, i.e. a hash digest, and saving the original filename in the metadata
12:43 <ersi> But really, what's the problem with long filenames?
12:44 <emijrp> ubuntu (and others i guess) doesn't allow filenames longer than 143 chars
12:44 <emijrp> but you can upload longer filenames to mediawiki
12:51 <emijrp> by the way, i think we can create 1 item per year
12:51 <emijrp> IA items
12:56 <ersi> It depends on the file system, not the operating system. You mean ext4? I think that's the default filesystem since Ubuntu 10.04
12:57 <ersi> Max filename length 256 bytes (characters)
12:57 <emijrp> yes, but we want "global" compatibility
12:57 <emijrp> please who download the zips have to unpack it all without getting nasty errors
12:58 <ersi> then in my opinion you can only go with hashes + real filename in metadata
12:58 <ersi> then it'd even work on shitty FAT filesystems
12:59 <emijrp> please = people
12:59 <emijrp> but hashes as filenames, or hash + file extension?
13:00 <emijrp> if i use just the hash, you don't know if it is an image, an odp, a video, etc
13:00 <ersi> true
13:00 <ersi> hash + file extension then :]
13:01 <emijrp> anyway, i prefer the dumpgenerator.py method
13:01 <emijrp> first 100 chars + hash (32 chars) + extension
13:01 <emijrp> that is descriptive and fixes the long filenames bug
13:02 <emijrp> and you have info about the file type
13:03 <emijrp> what is the limit of files per IA item?
13:03 <emijrp> yearly items = 365 zips + 365 .csv
13:03 <ersi> 132 characters + 4-10 characters as file ext?
13:04 <emijrp> yes
13:10 <emijrp> hash = hash over the original filename
13:10 <db48x> you'll need to process the filenames a little more than that
13:11 <db48x> most unix filesystems allow any character except / in a name; FAT and NTFS are a little more restrictive
13:11 <emijrp> you mean unicode chars?
13:12 <db48x> you can't use ?, :, \, /, and half a dozen others on FAT or NTFS
13:12 <emijrp> those are not allowed in mediawiki
13:12 <emijrp> my mistake
13:12 <emijrp> yes, only : is not allowed
13:13 <db48x> \/:*?"<>| have to be eliminated
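A minimal sketch of the kind of character clean-up db48x is asking for here, percent-encoding the nine reserved FAT/NTFS characters so the rest of the name stays readable (the replacement choice is illustrative, not necessarily what the script ends up doing):

    # Illustrative only: swap the characters FAT/NTFS reject for their
    # percent-encoded forms so the rest of the name stays readable.
    FORBIDDEN = '\\/:*?"<>|'

    def sanitize(filename):
        return ''.join('%%%02X' % ord(c) if c in FORBIDDEN else c
                       for c in filename)

    print(sanitize('*snore*_-_Flickr_-_exfordy.jpg'))
    # -> %2Asnore%2A_-_Flickr_-_exfordy.jpg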
13:13
🔗
|
emijrp |
i can use just a filename hash as filename, but, man, that is not usable |
13:13
🔗
|
db48x |
yea :( |
13:13
🔗
|
emijrp |
you have to browse with a open csv window |
13:14
🔗
|
db48x |
could build an html index |
13:14
🔗
|
db48x |
but still not as good |
13:15
🔗
|
db48x |
I've suggested before that we should create filesystem images instead of collections of files :) |
13:18
🔗
|
emijrp |
if we discard fat and ntfs, we can use first 100 chars + original filename hash (32 chars) + extension |
13:19
🔗
|
db48x |
yea |
13:19
🔗
|
soultcer |
If the internet archive used tar files you could just use the python tarfile extension and write the file directly from python. Would work on all platforms with zero unicode issues |
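What soultcer describes would look roughly like this (a sketch; the archive name, member names and bytes are made up): downloaded bytes go straight into a tar member, so the long or unicode original name never has to exist on the local filesystem.

    import io
    import tarfile

    # Sketch: the downloaded bytes become a tar member directly, so the
    # long/unicode original name only ever lives inside the archive.
    def add_bytes(tar, name, data):
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

    with tarfile.open('commons-2004-09-07.tar', 'w') as tar:
        add_bytes(tar, 'שניצלר.jpg', b'...image bytes...')
        add_bytes(tar, 'a_very_long_name_' + 'x' * 200 + '.jpg', b'...image bytes...')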
13:19
🔗
|
emijrp |
but if filename is less than 100, i dont change the filename |
13:19
🔗
|
db48x |
you could urlencode the first 50 characters, plus the filename hash and the extension |
13:20
🔗
|
db48x |
no, 50 would be too big in the worst case |
13:21
🔗
|
emijrp |
yes, but a guy who donwload the .tar in windows, what? |
13:21
🔗
|
soultcer |
Uses 7zip no problem |
13:21
🔗
|
emijrp |
he cant untar |
13:21
🔗
|
db48x |
and anyway urlencoded names are unreadable in the majority of languages |
13:21
🔗
|
db48x |
emijrp: almost any unzip program will also open tarballs |
13:22
🔗
|
db48x |
you could just replace those 9 characters with their urlencoded forms |
13:22
🔗
|
soultcer |
The POSIX.1-2001 standard fully supports unicode and long filenames and any other kind of metadata you can think of |
13:22
🔗
|
emijrp |
i mean, you can explore .tar from IA web interface, but on windows you cant untar files using arabic symbols + ? + / + * |
13:23
🔗
|
soultcer |
I still don't understand why the Internet Archive would prefer ZIP anyway. Tar is pretty much the gold standard in archiving files. |
13:25
🔗
|
emijrp |
becase zip compress and tar no'? |
13:26
🔗
|
soultcer |
You can compress a tar archive with any compressor you want (compress, gzip, bzip2, lzma, lzma2/xz, ...) |
13:26
🔗
|
soultcer |
Zip compresses each file individually, giving you almost no savings, especially on the images |
13:27
🔗
|
soultcer |
Compressing the tar archive yields a much higher compression ratio since all the metadata files contain a lot of the same data |
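A toy illustration of that difference (the "metadata" snippets are invented and real numbers will vary): zip deflates each member on its own, while gzip over a tar sees one long stream and can also exploit redundancy between similar files.

    import io
    import tarfile
    import zipfile

    # Invented, repetitive "metadata" files, just to show the effect.
    files = [('desc_%04d.xml' % i,
              ('<description><license>cc-by-sa</license>'
               '<uploader>user%d</uploader></description>' % i).encode('utf-8'))
             for i in range(1000)]

    zbuf = io.BytesIO()
    with zipfile.ZipFile(zbuf, 'w', zipfile.ZIP_DEFLATED) as z:
        for name, data in files:
            z.writestr(name, data)              # each member deflated on its own

    tbuf = io.BytesIO()
    with tarfile.open(fileobj=tbuf, mode='w:gz') as t:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            t.addfile(info, io.BytesIO(data))   # one gzip stream over everything

    print('zip:    %d bytes' % len(zbuf.getvalue()))
    print('tar.gz: %d bytes' % len(tbuf.getvalue()))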
13:27
🔗
|
emijrp |
yes, but you cant browse tars |
13:27
🔗
|
emijrp |
compressed tars |
13:28
🔗
|
emijrp |
i understood from my question to Jason that the desired format for browsing is zip |
13:28
🔗
|
emijrp |
not tar |
13:28
🔗
|
emijrp |
zip = browsable + compress |
13:28
🔗
|
emijrp |
tar = browsable |
13:28
🔗
|
soultcer |
zip = browsable + compress + bad unicode support + filename problems |
13:29
🔗
|
soultcer |
tar = browsable + unicode support + long filenames |
13:29
🔗
|
soultcer |
The compression is pretty much useless on pictures (though it helps for metadata and svg=xml pictures) |
13:29
🔗
|
emijrp |
i read here yesterday that zip fixed their unicode issues |
13:30
🔗
|
soultcer |
Partially. The official Winzip standard knows of a flag which is set to "on" when the filenames are encoded in unicode. The unix version of zip adds a special metadata field for files which contains the unicode filename |
13:31
🔗
|
soultcer |
The downside is, that the python zip module supports neither of those extensions, while the python tar module supports unicode just fine. |
13:31
🔗
|
emijrp |
i dont use python to save files, just wget + culr |
13:32
🔗
|
emijrp |
wikimedia eservers dont like urllib |
13:32
🔗
|
soultcer |
That's my point: If you use python instead of wget, curl and zip commands, you could get around all those unicode and filename length issues |
13:33
🔗
|
soultcer |
If you can give me an example where urllib doesn't work, I'd actually be curious to investigate, even if you don't decide on using urllib. |
13:33
🔗
|
emijrp |
ok |
13:33
🔗
|
emijrp |
just try to save http://upload.wikimedia.org/wikipedia/commons/5/50/%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg using urllib or urllib2 |
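For what it's worth, a common reason urllib appears to fail against Wikimedia is the default Python User-Agent being rejected, rather than the percent-encoded Hebrew name. A hedged sketch with Python 3's urllib.request (the scripts of the era would have used urllib2; the local filename and UA string are arbitrary):

    import urllib.request

    url = ('http://upload.wikimedia.org/wikipedia/commons/5/50/'
           '%D7%A9%D7%A0%D7%99%D7%A6%D7%9C%D7%A8.jpg')
    # Guess: the stock Python User-Agent gets refused, so send an explicit one.
    req = urllib.request.Request(url, headers={'User-Agent': 'wikiteam-test/0.1'})
    with urllib.request.urlopen(req) as resp, open('schnitzler.jpg', 'wb') as out:
        out.write(resp.read())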
13:35
🔗
|
emijrp |
anyway, if we use tar, that allows unicode and 10000 chars filenames, how the hell i unpack that tar in my ubuntu? it doesnt like 143+ chars filenames |
13:40
🔗
|
soultcer |
Get a better filesystem? |
13:41
🔗
|
soultcer |
There must be some way to preserve the filenames. If you like to stick to zip, you could just write a list of "original filename:filename in zip archive" and put that list into each zip file? |
13:42
🔗
|
emijrp |
obviously if i rename files, i wont delete the originalname -> newname |
13:42
🔗
|
emijrp |
that goes to a .csv file |
13:42
🔗
|
soultcer |
Okay, guess that's fine too. |
13:43
🔗
|
soultcer |
Btw, if you use Ubuntu with ext3 or ext4 the maximum filename length will be 255 bytes, same as for ZFS |
13:43
🔗
|
soultcer |
And since Mediawiki commons uses ZFS as well their filenames shouldn't be longer than 255 bytes either? |
13:44
🔗
|
emijrp |
i use ext4 |
13:44
🔗
|
emijrp |
but when i try to download this with wget, it crashes http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg |
13:44
🔗
|
emijrp |
try it |
13:45
🔗
|
soultcer |
Works fine for me with wget on ext4 file system |
13:45
🔗
|
emijrp |
not for me |
13:45
🔗
|
db48x |
worked for me |
13:46
🔗
|
soultcer |
Weird, but assuming that this also happens on other Ubuntu installs it is better to limit filenames somehow |
13:46
🔗
|
db48x |
GNU Wget 1.12-2507-dirty and ext4 |
13:48
🔗
|
db48x |
I guess I should go to work |
13:50
🔗
|
emijrp |
i dont understand why it fails on my comp |
13:53
🔗
|
emijrp |
do you know how to sha1sum in base 36? |
14:02
🔗
|
db48x |
you don't do sha1sum in any particular base |
14:02
🔗
|
db48x |
the result of the hashing alorithm is just a number |
14:02
🔗
|
db48x |
the sha1sum outputs that number in hexidecimal |
14:03
🔗
|
db48x |
you have that same number in base 36, so just convert |
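In Python that conversion is only a few lines; a sketch with 0-9a-z as the digit set (MediaWiki's own img_sha1 field is stored in base 36, which is presumably the motivation for the question):

    import hashlib

    DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz'

    def sha1_base36(data):
        # Treat the sha1 digest as one big integer and re-express it in base 36.
        n = int(hashlib.sha1(data).hexdigest(), 16)
        out = ''
        while n:
            n, r = divmod(n, 36)
            out = DIGITS[r] + out
        return out or '0'

    print(sha1_base36('Example.jpg'.encode('utf-8')))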
14:07
🔗
|
db48x |
well, be back later |
14:09
🔗
|
ersi |
emijrp: I can download that URL with wget just fine |
14:10
🔗
|
ersi |
2f4ffd812a553a00cdfed2f2ec51f1f92baa0272 Ferenc_JOACHIM_(1882-1964),_Hungarian_(Magyar)_artist_painter,_his_wife_Margit_GRAF,_their_three_children_Piroska,_Ferenc_G_and_Attila,_and_two_of_nine_grandchildren,_1944,_Budapest,_Hungary..jpg |
14:10
🔗
|
ersi |
sha1sum on the file~ |
14:25
🔗
|
emijrp |
can you download this with wget too Nemo_bis ? http://upload.wikimedia.org/wikipedia/commons/d/d8/Ferenc_JOACHIM_%281882-1964%29%2C_Hungarian_%28Magyar%29_artist_painter%2C_his_wife_Margit_GRAF%2C_their_three_children_Piroska%2C_Ferenc_G_and_Attila%2C_and_two_of_nine_grandchildren%2C_1944%2C_Budapest%2C_Hungary..jpg |
14:32
🔗
|
ersi |
emijrp: I could |
14:32
🔗
|
ersi |
oh, it's the same pic as before - nevermind |
14:45
🔗
|
emijrp |
im working on the long filenames and '?' chars bug, |
15:07
🔗
|
emijrp |
testing the new version. ... |
15:17
🔗
|
emijrp |
example of image with * http://commons.wikimedia.org/wiki/File:*snore*_-_Flickr_-_exfordy.jpg |
15:40
🔗
|
emijrp |
are there file extension larger than 4 chars? |
15:42
🔗
|
emijrp |
http://ubuntuforums.org/showthread.php?t=1173541 |
16:02
🔗
|
emijrp |
ok |
16:02
🔗
|
emijrp |
the new version is out |
16:02
🔗
|
emijrp |
now .desc files are .xml |
16:03
🔗
|
emijrp |
filenames are truncated if length > 100, to this format: first 100 chars + hash (full filename) + extension |
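A sketch of that renaming rule (illustrative: whether the 100-character cut is taken before or after splitting off the extension, and whether the 32-character hash is md5 of the full original name, is defined by commonsdownloader.py itself):

    import hashlib
    import os

    def truncate_filename(original):
        # Keep short names untouched; otherwise first 100 chars + 32-char hash
        # of the full original name + the original extension.
        if len(original) <= 100:
            return original
        base, ext = os.path.splitext(original)
        digest = hashlib.md5(original.encode('utf-8')).hexdigest()  # 32 hex chars
        return base[:100] + digest + ext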
16:03 <emijrp> script tested with arabic, hebrew, and weird filenames containing * ?
16:04 <emijrp> oh, and now it creates the .zip and a .csv for every day
16:04 <emijrp> the .csv contains all the metadata from the feed list for that day, including the renaming info
16:04 <emijrp> http://code.google.com/p/wikiteam/source/browse/trunk/commonsdownloader.py
16:05 <emijrp> i think we can download the 2004 year (it starts on 2004-09-07), and try to upload all that to IA
16:05 <emijrp> make an item for 2004, create the browsable links, etc, and see if all is fine
16:05 <emijrp> then launch the script for 2005, 2006..
16:06 <emijrp> Nemo_bis: i downloaded 2004 yesterday, it is less than 1GB (all days), so we can make this test between today and tomorrow
16:15 <emijrp> redownload the code, i modified a line a second ago
17:26 <Nemo_bis> hm, back
17:49 <Thomas-ED> hi guys
17:49 <Thomas-ED> Nemo_bis, are you there?
17:50 <Nemo_bis> yes
17:50 <Thomas-ED> alright
17:50 <Thomas-ED> so, i'm a sysadmin for encyclopedia dramatica, we're gonna start offering image dumps
17:50 <Thomas-ED> and possibly page dumps
17:50 <Nemo_bis> good
17:50 <Thomas-ED> we have around 30GB of images
17:50 <Thomas-ED> how do we go about this?
17:51 <Nemo_bis> it's not that hard
17:51 <Thomas-ED> i mean
17:51 <Nemo_bis> For the images, you could just make a tar of your image directory and put it on archive.org
17:51 <Thomas-ED> but
17:51 <Thomas-ED> we can provide the dumps on our servers
17:51 <Thomas-ED> do you guys download it? or do we upload it?
17:51 <Nemo_bis> it's better if you upload directly
17:56 <Nemo_bis> I can give you the exact command if you don't know IA's s3 interface
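For context, IA's S3-like interface boils down to a single HTTP PUT with a "LOW accesskey:secret" authorization header and a few x-archive-* headers. A rough sketch only: the item identifier, filename, metadata and keys below are placeholders, and Nemo_bis's actual command may well differ.

    import http.client

    conn = http.client.HTTPSConnection('s3.us.archive.org')
    with open('ed-images.tar', 'rb') as body:
        # PUT /<item-identifier>/<filename> creates the item and uploads the file.
        conn.request('PUT', '/ed-image-dump-2012/ed-images.tar', body=body, headers={
            'authorization': 'LOW ACCESSKEY:SECRET',   # placeholder keys
            'x-archive-auto-make-bucket': '1',
            'x-archive-meta-title': 'Encyclopedia Dramatica image dump',
        })
    resp = conn.getresponse()
    print(resp.status, resp.reason)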
17:56 <Nemo_bis> Thomas-ED, text is usually more interesting, there's https://www.mediawiki.org/wiki/Manual:DumpBackup.php for it
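The usual way to run that maintenance script is something like the following (sketched through Python's subprocess to stay in one language; the wiki path and output name are placeholders):

    import subprocess

    # dumpBackup.php --full exports every page with its complete revision history.
    with open('ed-pages-full-history.xml', 'wb') as out:
        subprocess.check_call(
            ['php', 'maintenance/dumpBackup.php', '--full', '--quiet'],
            cwd='/path/to/mediawiki',   # placeholder install path
            stdout=out,
        )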
17:57 <Thomas-ED> so nobody would want the images?
17:57 <Thomas-ED> were you responsible for dumping ed.com?
18:00 <Thomas-ED> we started .ch from those dumps :P
18:00 <Nemo_bis> Thomas-ED, no, I didn't do that
18:01 <Nemo_bis> the images are good for archival purposes, but the text is usually more important
18:02 <Nemo_bis> also, without the text one can't use the images, because one doesn't know what they are or what their licensing status is :)
18:02 <Thomas-ED> well
18:02 <Thomas-ED> i don't know if you're familiar with ED or anything
18:02 <Thomas-ED> we don't really care about license or copyright, none of that info is available for our images
18:02 <Nemo_bis> yes, that's why I said "usually" :p
18:03 <Nemo_bis> some description will probably be there anyway
18:06 <Nemo_bis> the images tar is very easy to do, but the dump script should work smoothly enough as well
18:11 <Thomas-ED> we can do the text, but the images will happen later as it will cause a lot of I/O and we are at peak right now
18:11 <Thomas-ED> Nemo_bis, we're gonna offer HTTP downloads of monthly text dumps, and bittorrent as well
18:11 <Thomas-ED> you think that will suffice? cba to upload anywhere.
18:11 <Thomas-ED> people can grab from us
18:11 <Nemo_bis> yep
18:11 <Nemo_bis> just tell us when you publish the dumps
18:12 <Nemo_bis> 7z is the best for the full history dumps
18:15 <Nemo_bis> those will probably be fairly small, so easy to download and reupload
18:15 <Thomas-ED> yeah, we are dumping every page and revision
18:16 <Thomas-ED> so it's gonna be quite big i think
18:16 <Thomas-ED> gonna offer RSS with the BT tracker
18:16 <Thomas-ED> so
18:17 <Thomas-ED> people can automate the downloads of our dumps
20:07 <Nemo_bis> Thomas-ED, ok, good
20:07 <Thomas-ED> Nemo_bis,
20:07 <Thomas-ED> http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz
20:07 <Thomas-ED> http://dl.srsdl.com/public/ed/text-28-02-12.xml.gz.torrent
20:07 <Nemo_bis> cool
20:07 <Thomas-ED> http://tracker.srsdl.com/ed.php
20:08 <Thomas-ED> will publish monthly on that
20:09 <Nemo_bis> Thomas-ED, did you try 7z?
20:09 <Thomas-ED> nah, we're just gonna do that lol
20:09 <Nemo_bis> or do you have more bandwidth than CPU to waste?
20:10 <Nemo_bis> ok
20:10 <chronomex> the best way I've found to compress wiki dumps is .rcs.gz
20:10 <chronomex> RCS alone gets around the level of .xml.7z
20:11 <chronomex> though extraction into a wiki isn't easy from .rcs
20:11 <Nemo_bis> chronomex, what dumps did you try?
20:12 <Nemo_bis> I guess it varies wildly
20:12 <Thomas-ED> to be honest, we just can't be bothered; we've made them available, so there :P
20:13 <chronomex> I tried subsets of wikipedia, and a full dump of wikiti.
20:13 <chronomex> works best with wikis that have lots of small changes in big articles
21:17 <Nemo_bis> Thomas-ED, gzip: text-28-02-12.xml.gz: not in gzip format --> ??
21:18 <chronomex> what does `file` say?
21:20 <Nemo_bis> chronomex, text-28-02-12.xml.gz: HTML document text
21:20 <chronomex> hrm.
21:21 <Nemo_bis> looks like it's just an xml
21:23 <Nemo_bis> $ head text-28-02-12.xml.gz
21:23 <Nemo_bis> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en">
21:23 <Nemo_bis> etc.
21:24 <Nemo_bis> was indeed quite huge to be compressed