[09:04] <Nemo_bis> http://www.archive.org/details/wiki-encyclopediadramatica.ch
[15:32] <emijrp> how are those test runs going?
[15:33] <emijrp> Nemo_bis: all ok?
[15:33] <Nemo_bis> emijrp, I think so
[15:33] <emijrp> ok
[15:33] <emijrp> did you finish 2004?
[15:33] <Nemo_bis> emijrp, but as I said I don't have time to check much
[15:33] <Nemo_bis> yes
[15:33] <emijrp> from 2004-09-07?
[15:33] <Nemo_bis> if you tell me how to count images and so on I'll run the check
[15:33] <Nemo_bis> yep
[15:34] <emijrp> have you deleted the folders?
[15:34] <emijrp> if so, to check, we have to read the zips
[15:34] <Nemo_bis> emijrp, http://p.defau.lt/?BIWr35BatT0gAOdUEjIXXg
[15:34] <Nemo_bis> no, didn't delete, to be able to check :)
[15:36] <emijrp> anyway, I don't know whether to check from the commonsql.csv or from the daily csv
[15:36] * Nemo_bis has no idea
[15:36] <emijrp> I guess the daily csv is better, it must contain all the images for that day, downloaded or not
[15:37] <emijrp> I will code a checker later, based on file size
[15:37] <Nemo_bis> ok, makes sense
[15:37] <Nemo_bis> checksum is unreliable
[15:37] <emijrp> yep
[15:37] <Nemo_bis> and expensive
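A size-based checker along the lines emijrp proposes could be sketched as follows. The CSV layout (`filename` and `size` columns) and the function name are assumptions for illustration only; the real daily CSVs from the grab may use different columns:

```python
import csv
import os

def check_day(csv_path, folder):
    """Compare on-disk file sizes against sizes recorded in a daily CSV.

    Assumes a hypothetical CSV with 'filename' and 'size' columns.
    Returns (missing, corrupt): files absent from disk, and files whose
    on-disk size differs from the recorded size.
    """
    missing, corrupt = [], []
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            path = os.path.join(folder, row['filename'])
            if not os.path.exists(path):
                missing.append(row['filename'])
            elif os.path.getsize(path) != int(row['size']):
                corrupt.append(row['filename'])
    return missing, corrupt
```

As discussed above, a size comparison is cheap; it catches truncated downloads without the cost (or unreliability, when the server recompresses) of checksumming every file.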
    
[15:39] <Nemo_bis> emijrp, if you make the description for files I'd upload them immediately
[15:39] <Nemo_bis> Ideally, you could prepare it in csv format for the bulk uploader.
[15:40] <Nemo_bis> So that people only have to run that simple perl script with all the metadata; perhaps not even needing to cut the rows about archives they don't have.
[15:40] <Nemo_bis> If you create the csv with some script, you can also add links to the browsing tool for zips.
[15:40] <Nemo_bis> (Quite tedious to do manually.)
[15:41] <Nemo_bis> Also, I think one item per month is better. Usually they don't want items to be bigger than 20-40 GiB, and some months will be very big.
[15:50] <ersi> simple and perl detected in same line, aborting
[15:55] <emijrp> I don't know much about the s3 upload and its scripts
[15:56] <emijrp> I hope someone else does that
[15:56] <emijrp> the browser links are trivial to generate: itemurl + yyyy-mm-dd.zip + '/'
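The link scheme emijrp describes (itemurl + yyyy-mm-dd.zip + '/') is indeed a one-liner; the item URL below is a hypothetical example, not a real item:

```python
def browse_link(item_url: str, day: str) -> str:
    """Build a zip-browsing URL per the scheme: itemurl + yyyy-mm-dd.zip + '/'."""
    return f"{item_url.rstrip('/')}/{day}.zip/"

# browse_link("https://archive.org/download/some-item", "2006-12-10")
# → "https://archive.org/download/some-item/2006-12-10.zip/"
```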
    
[16:01] <ersi> Nemo_bis: Isn't the absolute limit like, 200GB/item?
[16:01] <Nemo_bis> ersi, there isn't any limit
[16:01] <Nemo_bis> emijrp, trivial but boring :)
[16:01] <ersi> Alrighty, that it can *handle* then
[16:01] <emijrp> Nemo_bis: you can do a trivial script for that
[16:02] <Nemo_bis> emijrp, the metadata file is very simple https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
[16:02] <emijrp> item size limit is infinite? not sure if enough
[16:02] <Nemo_bis> actually I'm not able to do such a script, I can't program at all :(
[16:02] <emijrp> man, Python is easy
[16:03] <ersi> emijrp++
[16:03] <emijrp> if you read code and write some stuff, in a year you can make fun stuff
[16:03] <ersi> I made my first snippet within a few days (it was a very simple script, and I had some prior knowledge)
[16:03] <ersi> easy to express yourself imo
[16:04] <Nemo_bis> uff, I can write small python or bash scripts, but...
[16:04] <Nemo_bis> dunno, too many things to learn
[16:08] <ersi> yeah, totally. I'm lucky I get to do a lot at my work, like code python
[16:24] <emijrp> is there a way to read file sizes from inside a zip using python?
[16:26] <ersi> emijrp: I'd guess you'd have to decompress the zip first
[16:26] <soultcer> Using the zipfile module you can open the file, then iterate over the infolist() method's result
[16:26] <ersi> what are you using? built-in zip() unzip()? or libs?
[16:26] <ersi> ah, nice
[16:27] <soultcer> You will get a ZipInfo object for each file; the file_size attribute contains the uncompressed size.
[16:27] <soultcer> The downside is that the zipfile module doesn't support any of the unicode extensions that zip has, so you might have some trouble getting the correct filenames
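soultcer's suggestion can be sketched in a few lines: open the archive with the stdlib `zipfile` module and read `file_size` from each `ZipInfo` in `infolist()`, with no extraction needed:

```python
import zipfile

def zip_member_sizes(path):
    """Return (filename, uncompressed_size) for each member of a zip.

    infolist() yields one ZipInfo per member; ZipInfo.file_size is the
    uncompressed size, read from the central directory without extracting.
    """
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```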
    
[16:31] <emijrp> ok thanks
[16:32] <ersi> TIL ^_^
[16:35] <emijrp> apart from wikimedia commons, some wikipedias have their own local images, en.wp for example 800,000
[16:35] <emijrp> most of them in the 10,000-100,000 range
[16:36] <emijrp> by the way, not all free, due to fair use
[16:39] <emijrp> soultcer: I have no issues reading arabic filenames inside a zip with python
[16:39] <emijrp> [False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False]
[16:39] <emijrp> [i.filename.endswith(u'รยรยณรยชรยนรยรยรย.png') for i in zipfile.ZipFile('2006-12-10.zip', 'r').infolist()]
[16:40] <emijrp> look at the True
[16:41] <soultcer> Good
[16:41] <soultcer> The documentation in python says that it does not do anything with the encoding, so as long as you use the same encoding as your filesystem it should work ;-)
[16:42] <Nemo_bis> Length: 19804246 (19M) [application/ogg]
[16:42] <Nemo_bis> Saving to: "2005/02/11/Beethoven_-_Sonata_opus_111_-2.ogg"
[16:43] <emijrp> I saw some images over 100mb
[16:44] <emijrp> 36mb is the biggest file in the current csv
[19:46] <Nemo_bis> emijrp, what does one need to mark issues as duplicate and so on on google code?
[19:46] <Nemo_bis> I can't do anything (nor edit the wiki)
[19:48] <emijrp> try now
    
[19:49] <Nemo_bis> emijrp, still can't change issues
[19:53] <emijrp> now you can edit issues but not delete
[19:54] <Nemo_bis> ok
[19:54] <Nemo_bis> yep, works
[19:54] <Nemo_bis> thanks
[19:54] * Nemo_bis afk now
[22:59] <emijrp> Nemo_bis: I have added the checker script http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process and how to use it
[22:59] <Nemo_bis> emijrp, thanks, I'll look now
[22:59] <emijrp> it works on the zips and csv, not the folders
[23:00] <emijrp> it detects missing files or corrupted ones
[23:02] <Nemo_bis> yep, it's finding some things
[23:03] <Nemo_bis> the output is a bit mysterious :)
[23:05] <emijrp> pastebin please
[23:06] <Nemo_bis> it's still running
[23:06] <emijrp> missing or corrupt errors?
[23:07] <Nemo_bis> emijrp, see first snippet http://p.defau.lt/?jh_6BAbw9hcHgHcWWvNCIQ
[23:08] <emijrp> looks like a bug in the checker
[23:08] <emijrp> all the files contain a !
[23:08] <emijrp> but there is a real corrupt file, I think
[23:10] <emijrp> also "" filenames
[23:10] <Nemo_bis> but I don't think those are all the files with a !
[23:10] <Nemo_bis> no idea what "" are
[23:11] <Nemo_bis> emijrp, http://p.defau.lt/?PIDSvPJ5DAV9bxAW6wxf7A
[23:18] <emijrp> can you send me the 2004-11-03 zip and csv?
[23:25] <emijrp> nevermind
[23:25] <emijrp> I get the same errors for that day
[23:25] <emijrp> looks like a bug in the checker... trying to fix
[23:28] <emijrp> it is broken on the server http://commons.wikimedia.org/wiki/File:Separation_axioms.png#filehistory
[23:31] <Nemo_bis> :/
[23:32] <emijrp> a fuck ton http://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory
[23:40] <emijrp> how the hell are there empty filenames in the database?
[23:43] <emijrp> TRUST IN WIKIPEDIA.
[23:49] <emijrp> I'm going to send an email to the wikimedia tech mailing list
[23:52] <Nemo_bis> <Reedy> The globalusage table probably wants a proper cleanup at some point
[23:52] <Nemo_bis> <Reedy> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/112687
[23:52] <Nemo_bis> <Platonides> 605 with C:% and underscores, though
[23:52] <Nemo_bis> <Platonides> I see zero with them
[23:52] <Nemo_bis> <Platonides> ouch
[23:52] <Nemo_bis> <Platonides> well, that's normal for both good and bad files
[23:52] <Nemo_bis> <Reedy> (with spaces)
[23:52] <Nemo_bis> <Reedy> There were over 4 million
[23:52] <Nemo_bis> <Platonides> C:\documents_and_setting\desktop\FA.bmp This is not even a valid path...
[23:52] <Nemo_bis> <Reedy> if you build a list of all the bad ones, I'll kill them later
[23:53] <Nemo_bis> but
[23:53] <Nemo_bis> <Platonides> I don't see a deleted entry
[23:53] <Nemo_bis> <Platonides> globalusage?
[23:53] <Nemo_bis> <Reedy> It's in the globalusage table
[23:53] <Nemo_bis> <Reedy> yes
[23:53] <Nemo_bis> <Platonides> that's probably someone linking to that
[23:53] <Nemo_bis> <Reedy> well, whatever
[23:53] <Nemo_bis> <Reedy> it's still stupid :p
[23:53] <Nemo_bis> <Platonides> what better way to show your file in wikipedia?
[23:55] <emijrp> email sent
[23:55] <emijrp> awaiting response...
[23:56] <emijrp> this is a complex archive project
[23:56] <emijrp> and wikipedia sucks