[09:04] http://www.archive.org/details/wiki-encyclopediadramatica.ch
[15:32] how are those test runs going?
[15:33] Nemo_bis: all ok?
[15:33] emijrp, I think so
[15:33] ok
[15:33] did you finish 2004?
[15:33] emijrp, but as I said i don't have time to check much
[15:33] yes
[15:33] from 2004-09-07?
[15:33] if you tell me how to count images and so on I'll run the check
[15:33] yep
[15:34] have you deleted the folders?
[15:34] if so, to check, we have to read the zips
[15:34] emijrp, http://p.defau.lt/?BIWr35BatT0gAOdUEjIXXg
[15:34] no, didn't delete, to be able to check :)
[15:36] anyway, i don't know whether to check against the commonsql.csv or the daily csv
[15:36] * Nemo_bis has no idea
[15:36] i guess the daily csv is better, it must contain all the images for that day, downloaded or not
[15:37] i will code a checker later, based on file size
[15:37] ok, makes sense
[15:37] checksum is unreliable
[15:37] yep
[15:37] and expensive
[15:39] emijrp, if you write the descriptions for the files I'd upload them immediately
[15:39] Ideally, you could prepare it in csv format for the bulk uploader.
[15:40] So that people only have to run that simple perl script with all the metadata; perhaps not even needing to cut the rows for archives they don't have.
[15:40] If you create the csv with some script, you can also add links to the zip browsing tool.
[15:40] (Quite tedious to do manually.)
[15:41] Also, I think one item per month is better. Usually they don't want items bigger than 20-40 GiB, and some months will be very big.
[15:50] simple and perl detected in same line, aborting
[15:55] i don't know much about the s3 upload and its scripts
[15:56] i hope some other guy does that
[15:56] the browser links are trivial to generate: itemurl + yyyy-mm-dd.zip + '/'
[16:01] Nemo_bis: Isn't the absolute limit like, 200GB/item?
[16:01] ersi, there isn't any limit
[16:01] emijrp, trivial but boring :)
[16:01] Alrighty, that's what it can *handle* then
[16:01] Nemo_bis: you can write a trivial script for that
[16:02] emijrp, the metadata file is very simple https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
[16:02] item size limit is infinite? not sure if it's enough
[16:02] actually I'm not able to write such a script, I can't program at all :(
[16:02] man, Python is easy
[16:03] emijrp++
[16:03] if you look at code and write some stuff, in a year you can make fun stuff
[16:03] I made my first snippet within a few days (it was a very simple script, and I had some prior knowledge)
[16:03] easy to express yourself imo
[16:04] uff, I can write small python or bash scripts, but...
[16:04] dunno, too many things to learn
[16:08] yeah, totally. I'm lucky I get to do a lot at my work, like code python
[16:24] is there a way to read file sizes from inside a zip using python?
[16:26] emijrp: I'd guess you'd have to decompress the zip first
[16:26] Using the zipfile module you can open the file, then iterate over the infolist() method's result
[16:26] what are you using? built-in zip() unzip()? or libs?
[16:26] ah, nice
[16:27] You will get a ZipInfo object for each file; the file_size attribute contains the uncompressed size.
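A minimal sketch of the zipfile approach described above, assuming Python 2.x (as used elsewhere in this log); the archive name 2006-12-10.zip is only an example:

    import zipfile

    # Open the archive and walk its table of contents without extracting anything.
    zf = zipfile.ZipFile('2006-12-10.zip', 'r')
    for info in zf.infolist():  # one ZipInfo object per stored file
        # file_size is the uncompressed size in bytes (compress_size would be the stored size)
        print('%s\t%d' % (info.filename, info.file_size))
    zf.close()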
[16:27] The downside is that the zipfile module doesn't support any of the unicode extensions that zip has, so you might have some trouble getting the correct filenames
[16:31] ok thanks
[16:32] TIL ^_^
[16:35] apart from wikimedia commons, some wikipedias have their own local images, en.wp for example has 800,000
[16:35] most of them are in the 10,000-100,000 range
[16:36] by the way, not all of them free, due to fair use
[16:39] soultcer: i have no issues reading arabic filenames inside a zip with python
[16:39] [False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False]
[16:39] [i.filename.endswith(u'نستعلیق.png') for i in zipfile.ZipFile('2006-12-10.zip', 'r').infolist()]
[16:40] look at the True
[16:41] Good
[16:41] The python documentation says that it does not do anything with the encoding, so as long as you use the same encoding as your filesystem it should work ;-)
[16:42] Length: 19804246 (19M) [application/ogg]
[16:42] Saving to: "2005/02/11/Beethoven_-_Sonata_opus_111_-2.ogg"
[16:43] i saw some images over 100mb
[16:44] 36mb is the biggest file in the current csv
[19:46] emijrp, what does one need to be able to mark issues as duplicate and so on on google code?
[19:46] I can't do anything (nor edit the wiki)
[19:48] try now
[19:49] emijrp, still can't change issues
[19:53] now you can edit issues but not delete them
[19:54] ok
[19:54] yep, works
[19:54] thanks
[19:54] * Nemo_bis afk now
[22:59] Nemo_bis: i have added the checker script http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process and how to use it
[22:59] emijrp, thanks, I'll look now
[22:59] it works on the zips and csv, not the folders
[23:00] it detects missing or corrupted files
[23:02] yep, it's finding some things
[23:03] the output is a bit mysterious :)
[23:05] pastebin please
[23:06] it's still running
[23:06] missing or corrupt errors?
[23:07] emijrp, see first snippet http://p.defau.lt/?jh_6BAbw9hcHgHcWWvNCIQ
[23:08] looks like a bug in the checker
[23:08] all the files contain a !
[23:08] but there is a real corrupt file, i think
[23:10] also "" filenames
[23:10] but I don't think those are all the files with a !
[23:10] no idea what the "" ones are
[23:11] emijrp, http://p.defau.lt/?PIDSvPJ5DAV9bxAW6wxf7A
[23:18] can you send me the 2004-11-03 zip and csv?
[23:25] nevermind
[23:25] i get the same errors for that day
[23:25] looks like a bug in the checker... trying to fix it
[23:28] it is broken on the server http://commons.wikimedia.org/wiki/File:Separation_axioms.png#filehistory
[23:31] :/
[23:32] a fuck ton http://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory
[23:40] how the hell are there empty filenames in the database?
[23:43] TRUST IN WIKIPEDIA.
[23:49] i'm going to send an email to the wikimedia tech mailing list
[23:52] The globalusage table probably wants a proper cleanup at some point
[23:52] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/112687
[23:52] 605 with C:% and underscores, though
[23:52] I see zero with them
[23:52] ouch
[23:52] well, that's normal for both good and bad files
[23:52] (with spaces)
[23:52] There were over 4 million
[23:52] C:\documents_and_setting\desktop\FA.bmp This is not even a valid path...
[23:52] if you build a list of all the bad ones, I'll kill them later
[23:53] but
[23:53] I don't see a deleted entry
[23:53] globalusage?
[23:53] It's in the globalusage table
[23:53] yes
[23:53] that's probably someone linking to that
[23:53] well, whatever
[23:53] it's still stupid :p
[23:53] what better way to show your file in wikipedia?
[23:55] email sent
[23:55] awaiting a response...
[23:56] this is a complex archive project
[23:56] and wikipedia sucks
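For reference, a rough sketch of the kind of size-based check discussed above: compare the filenames and sizes recorded in a daily csv against the members of the matching zip, and flag anything missing or with a mismatched size. The csv layout assumed here (one line per image with filename and size as the first two comma-separated fields) and the Python 2 style are assumptions, not the actual checker linked from the wiki page:

    import csv
    import zipfile

    def check_day(day):
        # Read the expected sizes from the daily csv.
        # ASSUMPTION: each row starts with "filename,size"; the real dumps may differ.
        expected = {}
        for row in csv.reader(open('%s.csv' % day, 'rb')):
            if len(row) >= 2 and row[0]:
                expected[row[0]] = int(row[1])

        # Read the actual uncompressed sizes from the zip's table of contents.
        zf = zipfile.ZipFile('%s.zip' % day, 'r')
        stored = dict((i.filename, i.file_size) for i in zf.infolist())

        # Report files that are absent or whose size does not match.
        for name, size in sorted(expected.items()):
            if name not in stored:
                print('MISSING  %s' % name)
            elif stored[name] != size:
                print('CORRUPT  %s (expected %d bytes, got %d)' % (name, size, stored[name]))

    check_day('2004-11-03')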