#wikiteam 2012-02-29,Wed


Time Nickname Message
09:04 Nemo_bis http://www.archive.org/details/wiki-encyclopediadramatica.ch
15:32 emijrp how are the test runs going?
15:33 emijrp Nemo_bis: all ok ?
15:33 Nemo_bis emijrp, I think so
15:33 emijrp ok
15:33 emijrp did you finish 2004 ?
15:33 Nemo_bis emijrp, but as I said I don't have time to check much
15:33 Nemo_bis yes
15:33 emijrp from 2004-09-07 ?
15:33 Nemo_bis if you tell me how to count images and so on I'll run the check
15:33 Nemo_bis yep
15:34 emijrp have you deleted the folders?
15:34 emijrp if so, to check, we have to read the zips
15:34 Nemo_bis emijrp, http://p.defau.lt/?BIWr35BatT0gAOdUEjIXXg
15:34 Nemo_bis no, didn't delete, to be able to check :)
15:36 emijrp anyway, i don't know whether to check from the commonsql.csv or from the daily csv
15:36 * Nemo_bis has no idea
15:36 emijrp i guess from the daily csv is better, it must contain all the images for that day, downloaded or not
15:37 emijrp i will code a checker later, based on file size
15:37 Nemo_bis ok, makes sense
15:37 Nemo_bis checksum is unreliable
15:37 emijrp yep
15:37 Nemo_bis and expensive
15:39 Nemo_bis emijrp, if you make the description for files I'd upload them immediately
15:39 Nemo_bis Ideally, you could prepare it in csv format for the bulk uploader.
15:40 Nemo_bis So that people only have to run that simple perl script with all the metadata; perhaps not even needing to cut the rows about archives they don't have.
15:40 Nemo_bis If you create the csv with some script, you can also add links to the browsing tool for zips.
15:40 Nemo_bis (Quite tedious to do manually.)
15:41 Nemo_bis Also, I think one item per month is better. Usually they don't want items to be bigger than 20-40 GiB, and some months will be very big.
15:50 ersi simple and perl detected in same line, aborting
15:55 emijrp i don't know much about the s3 upload and its scripts
15:56 emijrp i hope someone else does that
15:56 emijrp the browser links are trivial to generate: itemurl + yyyy-mm-dd.zip + '/'
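
The link recipe in the last message (itemurl + yyyy-mm-dd.zip + '/') could be scripted roughly as below. This is only a sketch: the function names are made up for illustration, and the item URL in the usage note is a placeholder, not a real archive.org item.

```python
from datetime import date, timedelta


def browse_url(item_url, day):
    """Build the zip-browsing link for one daily archive.

    Follows the recipe from the chat: itemurl + yyyy-mm-dd.zip + '/'.
    item_url is assumed to be the item's base URL (hypothetical value).
    """
    # Normalise the trailing slash so the pieces join cleanly.
    return item_url.rstrip('/') + '/' + day.isoformat() + '.zip/'


def month_links(item_url, year, month):
    """Return one browsing link per day of the given month."""
    links = []
    day = date(year, month, 1)
    while day.month == month:
        links.append(browse_url(item_url, day))
        day += timedelta(days=1)
    return links
```

For example, `month_links('http://example.org/some-item/', 2004, 9)` yields one link per daily zip of September 2004, which fits the one-item-per-month layout suggested above.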
16:01 ersi Nemo_bis: Isn't the absolute limit like, 200GB/item?
16:01 Nemo_bis ersi, there isn't any limit
16:01 Nemo_bis emijrp, trivial but boring :)
16:01 ersi Alrighty, that it can *handle* then
16:01 emijrp Nemo_bis: you can do a trivial script for that
16:02 Nemo_bis emijrp, the metadata file is very simple https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
16:02 emijrp item size limit is infinite? not sure if enough
16:02 Nemo_bis actually I'm not able to do such a script, I can't program at all :(
16:02 emijrp man, Python is easy
16:03 ersi emijrp++
16:03 emijrp if you read code and write some stuff, in a year you can make fun stuff
16:03 ersi I made my first snippet within a few days (it was a very simple script, and I had some prior knowledge)
16:03 ersi easy to express yourself imo
16:04 Nemo_bis uff, I can write small python or bash scripts, but...
16:04 Nemo_bis dunno, too many things to learn
16:08 ersi yeah, totally. I'm lucky I get to do a lot at my work, like code python
16:24 emijrp is there a way to read file sizes from inside a zip using python?
16:26 ersi emijrp: I'd guess you'd have to decompress the zip first
16:26 soultcer Using the zipfile module you can open the file, then iterate over the infolist() method's result
16:26 ersi what are you using? built-in zip() unzip()? or libs?
16:26 ersi ah, nice
16:27 soultcer You will get a ZipInfo object for each file, the file_size attribute contains the uncompressed size.
16:27 soultcer The downside is that the zipfile module doesn't support any of the unicode extensions that zip has, so you might have some trouble with getting the correct filenames
16:31 emijrp ok thanks
16:32 ersi TIL ^_^
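
A minimal sketch of what soultcer describes, using only the standard zipfile module. The function name is made up for illustration. No decompression is needed, because infolist() reads the sizes from the zip's own directory.

```python
import zipfile


def zip_file_sizes(zip_path):
    """Map each member name in the zip to its uncompressed size in bytes."""
    with zipfile.ZipFile(zip_path, 'r') as archive:
        # Each ZipInfo carries the name and the uncompressed size.
        return {info.filename: info.file_size for info in archive.infolist()}
```

Calling `zip_file_sizes('2006-12-10.zip')` on one of the daily archives would return a dict such as `{name: size, ...}` for every file stored in it.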
16:35 emijrp apart from wikimedia commons, some wikipedias have their local images, en.wp for example 800,000
16:35 emijrp most of them in the 10,000-100,000 range
16:36 emijrp by the way, not all free, due to fair use
16:39 emijrp soultcer: i have no issues reading arabic filenames inside a zip with python
16:39 emijrp [False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False]
16:39 emijrp [i.filename.endswith(u'نستعلیق.png') for i in zipfile.ZipFile('2006-12-10.zip', 'r').infolist()]
16:40 emijrp look at the True
16:41 soultcer Good
16:41 soultcer The documentation in python says that it does not do anything with the encoding, so as long as you use the same encoding as your filesystem it should work ;-)
16:42 Nemo_bis Length: 19804246 (19M) [application/ogg]
16:42 Nemo_bis Saving to: "2005/02/11/Beethoven_-_Sonata_opus_111_-2.ogg"
16:43 emijrp i saw some images over 100mb
16:44 emijrp 36mb is the biggest file in the current csv
19:46 Nemo_bis emijrp, what does one need to mark issues as duplicate and so on on google code?
19:46 Nemo_bis I can't do anything (nor edit the wiki)
19:48 emijrp try now
19:49 Nemo_bis emijrp, still can't change issues
19:53 emijrp now you can edit issues but not delete
19:54 Nemo_bis ok
19:54 Nemo_bis yep, works
19:54 Nemo_bis thanks
19:54 * Nemo_bis afk now
22:59 emijrp Nemo_bis: i have added the checker script http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process and how to use it
22:59 Nemo_bis emijrp, thanks, I'll look now
22:59 emijrp it works on the zips and csv, not the folders
23:00 emijrp it detects missing files or corrupted ones
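
The actual checker script is on the wiki page linked above and is not reproduced in this log. What follows is only a rough sketch of the size-based idea discussed earlier, not emijrp's implementation. It assumes the expected sizes have already been read from the daily csv into a dict; the csv layout is not shown in this log, so parsing is left out, and the names `check_zip` and `expected_sizes` are hypothetical.

```python
import zipfile


def check_zip(zip_path, expected_sizes):
    """Compare a zip's contents against expected file sizes.

    expected_sizes: dict of member name -> size in bytes (hypothetical
    input; in the project it would be derived from the daily csv).
    Returns (missing, corrupt) as sorted lists of member names.
    """
    with zipfile.ZipFile(zip_path, 'r') as archive:
        actual = {info.filename: info.file_size for info in archive.infolist()}

    # Expected but not present in the zip at all.
    missing = sorted(name for name in expected_sizes if name not in actual)
    # Present, but the stored size differs from the expected one.
    corrupt = sorted(name for name, size in expected_sizes.items()
                     if name in actual and actual[name] != size)
    return missing, corrupt
```

Comparing sizes is cheap because only the zip's directory is read, but it cannot catch a file that is damaged while keeping the same length.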
23:02 Nemo_bis yep, it's finding some things
23:03 Nemo_bis the output is a bit mysterious :)
23:05 emijrp pastebin please
23:06 Nemo_bis it's still running
23:06 emijrp missing or corrupt errors?
23:07 Nemo_bis emijrp, see first snippet http://p.defau.lt/?jh_6BAbw9hcHgHcWWvNCIQ
23:08 emijrp looks like a bug in the checker
23:08 emijrp all the files contain a !
23:08 emijrp but there is a real corrupt file, i think
23:10 emijrp also "" filenames
23:10 Nemo_bis but I don't think those are all the files with a !
23:10 Nemo_bis no idea what "" are
23:11 Nemo_bis emijrp, http://p.defau.lt/?PIDSvPJ5DAV9bxAW6wxf7A
23:18 emijrp can you send me the 2004-11-03 zip and csv?
23:25 emijrp never mind
23:25 emijrp i get the same errors for that day
23:25 emijrp looks like a bug in the checker... trying to fix
23:28 emijrp it is broken on the server http://commons.wikimedia.org/wiki/File:Separation_axioms.png#filehistory
23:31 Nemo_bis :/
23:32 emijrp a fuck ton http://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory
23:40 emijrp how the hell are there empty filenames in the database?
23:43 emijrp TRUST IN WIKIPEDIA.
23:49 emijrp i'm going to send an email to the wikimedia tech mailing list
23:52 Nemo_bis <Reedy> The globalusage table probably wants a proper cleanup at some point
23:52 Nemo_bis <Reedy> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/112687
23:52 Nemo_bis <Platonides> 605 with C:% and underscores, though
23:52 Nemo_bis <Platonides> I see zero with them
23:52 Nemo_bis <Platonides> ouch
23:52 Nemo_bis <Platonides> well, that's normal for both good and bad files
23:52 Nemo_bis <Reedy> (with spaces)
23:52 Nemo_bis <Reedy> There were over 4 million
23:52 Nemo_bis <Platonides> C:\documents_and_setting\desktop\FA.bmp This is not even a valid path...
23:52 Nemo_bis <Reedy> if you build a list of all the bad ones, I'll kill them later
23:53 Nemo_bis but
23:53 Nemo_bis <Platonides> I don't see a deleted entry
23:53 Nemo_bis <Platonides> globalusage?
23:53 Nemo_bis <Reedy> It's in the globalusage table
23:53 Nemo_bis <Reedy> yes
23:53 Nemo_bis <Platonides> that's probably someone linking to that
23:53 Nemo_bis <Reedy> well, whatever
23:53 Nemo_bis <Reedy> it's still stupid :p
23:53 Nemo_bis <Platonides> what better way to show your file in wikipedia?
23:55 emijrp email sent
23:55 emijrp waiting for a response...
23:56 emijrp this is a complex archive project
23:56 emijrp and wikipedia sucks
