Time |
Nickname |
Message |
09:04
|
Nemo_bis |
http://www.archive.org/details/wiki-encyclopediadramatica.ch |
15:32
|
emijrp |
how are those test runs going? |
15:33
|
emijrp |
Nemo_bis: all ok ? |
15:33
|
Nemo_bis |
emijrp, I think so |
15:33
|
emijrp |
ok |
15:33
|
emijrp |
did you finish 2004 ? |
15:33
|
Nemo_bis |
emijrp, but as I said i don't have time to check much |
15:33
|
Nemo_bis |
yes |
15:33
|
emijrp |
from 2004-09-07 ? |
15:33
|
Nemo_bis |
if you tell me how to count images and so on I'll run the check |
15:33
|
Nemo_bis |
yep |
15:34
|
emijrp |
have you deleted the folders? |
15:34
|
emijrp |
if so, to check, we have to read the zips |
15:34
|
Nemo_bis |
emijrp, http://p.defau.lt/?BIWr35BatT0gAOdUEjIXXg |
15:34
|
Nemo_bis |
no, didn't delete, to be able to check :) |
15:36
|
emijrp |
anyway, i don't know whether to check from the commonsql.csv or from the daily csv |
15:36
|
* |
Nemo_bis has no idea |
15:36
|
emijrp |
i guess from the daily csv is better, it must contain all the images for that day, downloaded or not |
15:37
|
emijrp |
i will code a checker later, based on file size |
15:37
|
Nemo_bis |
ok, makes sense |
15:37
|
Nemo_bis |
checksum is unreliable |
15:37
|
emijrp |
yep |
15:37
|
Nemo_bis |
and expensive |
15:39
|
Nemo_bis |
emijrp, if you make the description for files I'd upload them immediately |
15:39
|
Nemo_bis |
Ideally, you could prepare it in csv format for the bulk uploader. |
15:40
|
Nemo_bis |
So that people only have to run that simple perl script with all the metadata; perhaps not even needing to cut the rows about archives they don't have. |
15:40
|
Nemo_bis |
If you create the csv with some script, you can also add links to the browsing tool for zips. |
15:40
|
Nemo_bis |
(Quite tedious to do manually.) |
15:41
|
Nemo_bis |
Also, I think one item per month is better. Usually they don't want items to be bigger than 20-40 GiB, and some months will be very big. |
15:50
|
ersi |
simple and perl detected in same line, aborting |
15:55
|
emijrp |
i dont know much about the s3 upload and its scripts |
15:56
|
emijrp |
i hope other guy does that |
15:56
|
emijrp |
the browser links are trivial to generate: itemurl + yyyy-mm-dd.zip + '/' |
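[Editor's note] The formula emijrp gives (itemurl + yyyy-mm-dd.zip + '/') could be scripted roughly as below. This is only a sketch: the function name `browse_url` and the example.org item URL are placeholders invented for illustration, and whether the real item URL already ends in a slash is not stated in the log, so the sketch normalizes it.

```python
from datetime import date


def browse_url(item_url: str, day: date) -> str:
    """Build a zip-browsing link as: item URL + yyyy-mm-dd.zip + '/'.

    `item_url` is a placeholder for whatever base URL the item ends up
    with; a trailing slash on it is tolerated.
    """
    return f"{item_url.rstrip('/')}/{day.isoformat()}.zip/"


if __name__ == "__main__":
    # Hypothetical item URL, for illustration only.
    print(browse_url("https://example.org/download/some-item", date(2004, 9, 7)))
```

Looping this over every day of a month would give the link column for a metadata csv without any manual editing.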
16:01
|
ersi |
Nemo_bis: Isn't the absolute limit like, 200GB/item? |
16:01
|
Nemo_bis |
ersi, there isn't any limit |
16:01
|
Nemo_bis |
emijrp, trivial but boring :) |
16:01
|
ersi |
Alrighty, that it can *handle* then |
16:01
|
emijrp |
Nemo_bis: you can do a trivial script for that |
16:02
|
Nemo_bis |
emijrp, the metadata file is very simple https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader |
16:02
|
emijrp |
item size limit is infinite? not sure if enough |
16:02
|
Nemo_bis |
actually I'm not able to do such a script, I can't program at all :( |
16:02
|
emijrp |
man, Python is easy |
16:03
|
ersi |
emijrp++ |
16:03
|
emijrp |
if you look at code and write some stuff, in a year you can make fun stuff |
16:03
|
ersi |
I made my first snippet within a few days (it was a very simple script, and I had some prior knowledge) |
16:03
|
ersi |
easy to express yourself imo |
16:04
|
Nemo_bis |
uff, I can write small python or bash scripts, but... |
16:04
|
Nemo_bis |
dunno, too many things to learn |
16:08
|
ersi |
yeah, totally. I'm lucky I get to do a lot at my work, like code python |
16:24
|
emijrp |
is there a way to read filename sizes from inside a zip using python? |
16:26
|
ersi |
emijrp: I'd guess you'd have to decompress the zip first |
16:26
|
soultcer |
Using the zipfile module you can open the file, then iterate over the infolist() method's result |
16:26
|
ersi |
what are you using? built-in zip() unzip()? or libs? |
16:26
|
ersi |
ah, nice |
16:27
|
soultcer |
You will get a ZipInfo object for each file, the file_size attribute contains the uncompressed size. |
16:27
|
soultcer |
The downside is that the zipfile module doesn't support any of the unicode extensions that zip has, so you might have some trouble with getting the correct filenames |
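[Editor's note] What soultcer describes can be sketched in a few lines with the standard-library `zipfile` module: open the archive, walk `infolist()`, and read each entry's `file_size`. The helper name `zip_member_sizes` is made up for this note. Nothing is decompressed, since the sizes come from the archive's directory.

```python
import zipfile


def zip_member_sizes(zip_path):
    """Map each member name to its uncompressed size, without extracting."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        # infolist() yields one ZipInfo per entry; file_size is the
        # uncompressed size recorded in the archive's directory.
        return {info.filename: info.file_size for info in archive.infolist()}
```

Calling it on one of the daily zips (for example a file named like 2006-12-10.zip) returns a dict that can be compared against sizes listed elsewhere.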
16:31
|
emijrp |
ok thanks |
16:32
|
ersi |
TIL ^_^ |
16:35
|
emijrp |
apart from wikimedia commons, some wikipedias have their own local images, en.wp for example has 800,000 |
16:35
|
emijrp |
most of them in the 10,000-100,000 range |
16:36
|
emijrp |
by the way, not all free, due to fair use |
16:39
|
emijrp |
soultcer: i have no issues reading arabic filenames inside a zip with python |
16:39
|
emijrp |
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False] |
16:39
|
emijrp |
[i.filename.endswith(u'รยรยณรยชรยนรยรยรย.png') for i in zipfile.ZipFile('2006-12-10.zip', 'r').infolist()] |
16:40
|
emijrp |
look at the True |
16:41
|
soultcer |
Good |
16:41
|
soultcer |
The documentation in python says that it does not do anything with the encoding, so as long as you use the same encoding as your filesystem it should work ;-) |
16:42
|
Nemo_bis |
Length: 19804246 (19M) [application/ogg] |
16:42
|
Nemo_bis |
Saving to: "2005/02/11/Beethoven_-_Sonata_opus_111_-2.ogg" |
16:43
|
emijrp |
i saw some images over 100mb |
16:44
|
emijrp |
36mb is the biggest file in the current csv |
19:46
|
Nemo_bis |
emijrp, what does one need to mark issues as duplicate and so on on google code? |
19:46
|
Nemo_bis |
I can't do anything (nor edit the wiki) |
19:48
|
emijrp |
try now |
19:49
|
Nemo_bis |
emijrp, still can't change issues |
19:53
|
emijrp |
now you can edit issues but not delete |
19:54
|
Nemo_bis |
ok |
19:54
|
Nemo_bis |
yep, works |
19:54
|
Nemo_bis |
thanks |
19:54
|
* |
Nemo_bis afk now |
22:59
|
emijrp |
Nemo_bis: i have added the checker script http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process and how to use it |
22:59
|
Nemo_bis |
emijrp, thanks, I'll look now |
22:59
|
emijrp |
it works on the zips and csv, not the folders |
23:00
|
emijrp |
it detects missing files or corrupted ones |
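[Editor's note] The actual checker script is the one linked on the Archive Team wiki page above and is not reproduced in this log. Purely as an illustration of the size-based idea discussed earlier, a minimal sketch might look like the following. Everything about it is an assumption: the function name `check_day`, a headerless daily csv whose first two columns are filename and size in bytes, and zip member names that match the csv filenames exactly. The real csv layout may differ.

```python
import csv
import zipfile


def check_day(csv_path, zip_path, name_col=0, size_col=1):
    """Compare sizes listed in a daily csv with sizes stored in the zip.

    Assumed (hypothetical) csv layout: no header row, filename in
    column `name_col`, size in bytes in column `size_col`.
    Returns (missing, corrupt): names absent from the zip, and names
    whose uncompressed size differs from the csv value.
    """
    with zipfile.ZipFile(zip_path) as archive:
        actual = {info.filename: info.file_size for info in archive.infolist()}

    missing, corrupt = [], []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            name, expected = row[name_col], int(row[size_col])
            if name not in actual:
                missing.append(name)
            elif actual[name] != expected:
                corrupt.append(name)
    return missing, corrupt
```

A matching size is a cheap signal rather than proof that a file is intact. Conversely, a mismatch can also mean the size recorded in the database is wrong, which is the situation emijrp runs into later in this log.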
23:02
|
Nemo_bis |
yep, it's finding some things |
23:03
|
Nemo_bis |
the output is a bit mysterious :) |
23:05
|
emijrp |
pastebin please |
23:06
|
Nemo_bis |
it's still running |
23:06
|
emijrp |
missing or corrupt errors? |
23:07
|
Nemo_bis |
emijrp, see first snippet http://p.defau.lt/?jh_6BAbw9hcHgHcWWvNCIQ |
23:08
|
emijrp |
looks like a bug in the checker |
23:08
|
emijrp |
all the files contain a ! |
23:08
|
emijrp |
but there is a real corrupt file, i think |
23:10
|
emijrp |
also "" filenames |
23:10
|
Nemo_bis |
but I don't think those are all the files with a ! |
23:10
|
Nemo_bis |
no idea what "" are |
23:11
|
Nemo_bis |
emijrp, http://p.defau.lt/?PIDSvPJ5DAV9bxAW6wxf7A |
23:18
|
emijrp |
can you send me 2004-11-03 zip and csv? |
23:25
|
emijrp |
nevermind |
23:25
|
emijrp |
i get the same errors for that day |
23:25
|
emijrp |
looks like a bug in the checker... trying to fix |
23:28
|
emijrp |
it is broken on the server http://commons.wikimedia.org/wiki/File:Separation_axioms.png#filehistory |
23:31
|
Nemo_bis |
:/ |
23:32
|
emijrp |
a fuck ton http://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory |
23:40
|
emijrp |
how the hell are there empty filenames in the database? |
23:43
|
emijrp |
TRUST IN WIKIPEDIA. |
23:49
|
emijrp |
im going to send an email to wikimedia tech mailing list |
23:52
|
Nemo_bis |
<Reedy> The globalusage table probably wants a proper cleanup at somepoint |
23:52
|
Nemo_bis |
<Reedy> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/112687 |
23:52
|
Nemo_bis |
<Platonides> 605 with C:% and underscores, though |
23:52
|
Nemo_bis |
<Platonides> I see zero with them |
23:52
|
Nemo_bis |
<Platonides> ouch |
23:52
|
Nemo_bis |
<Platonides> well, that's normal for both good and bad files |
23:52
|
Nemo_bis |
<Reedy> (with spaces) |
23:52
|
Nemo_bis |
<Reedy> There were over 4 million |
23:52
|
Nemo_bis |
<Platonides> C:\documents_and_setting\desktop\FA.bmp This is not even a valid path... |
23:52
|
Nemo_bis |
<Reedy> if you build a list of all the bad ones, I'll kill them later |
23:53
|
Nemo_bis |
but |
23:53
|
Nemo_bis |
<Platonides> I don't see a deleted entry |
23:53
|
Nemo_bis |
<Platonides> globalusage? |
23:53
|
Nemo_bis |
<Reedy> It's in the globalusage table |
23:53
|
Nemo_bis |
<Reedy> yes |
23:53
|
Nemo_bis |
<Platonides> that's probably someone linking to that |
23:53
|
Nemo_bis |
<Reedy> well, whatever |
23:53
|
Nemo_bis |
<Reedy> its' still stupid :p |
23:53
|
Nemo_bis |
<Platonides> what better way to show your file in wikipedia? |
23:55
|
emijrp |
email sent |
23:55
|
emijrp |
waiting for a response... |
23:56
|
emijrp |
this is a complex archive project |
23:56
|
emijrp |
and wikipedia sucks |