16:51 <emijrp> i need some volunteers to archive http://commons.wikimedia.org, it contains 12M files, but the first chunk is about 1M files ~= 500 GB, and that chunk is made of daily chunks
16:51 <emijrp> i'm going to upload the script and feed list
16:58 <Nemo_bis> emijrp, what do you need?
16:59 <Nemo_bis> what I can offer is limited technical knowledge, quite a lot of bandwidth and some disk space :)
16:59 <Nemo_bis> if you put the script somewhere and a list of chunks to claim on archiveteam.org, I'll do as much as possible
17:09 <emijrp> wait a second
17:20 <emijrp> http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
17:20 <Nemo_bis> emijrp, thanks
17:21 <Nemo_bis> I can download 150 GiB or something to start with
17:21 <Nemo_bis> unless you want to crowdsource it more
17:21 <emijrp> first i want just some tests
17:21 <Nemo_bis> ok
17:22 <Nemo_bis> not today though
17:22 <Nemo_bis> also, why 7z? doesn't look very useful for images
17:22 <Nemo_bis> and zip or tar can be browsed with the IA tools
17:22 <Nemo_bis> (without downloading)
17:25 <emijrp> every day-by-day folder contains files + .xml with descriptions
17:25 <emijrp> that's why i chose 7z, for the xml
17:25 <emijrp> but if zip is browsable..
17:25 <emijrp> by the way, some days have +5000 pics
17:26 <emijrp> can it browse that?
17:26 <Nemo_bis> I think so
17:26 <Nemo_bis> It has some problems also with huge archives, like 5-10 GiB
17:27 <Nemo_bis> limit unclear
17:27 <Nemo_bis> s/also/only/
17:29 <emijrp> that would be the size for every package
17:29 <emijrp> 1 MB/image, 5000 images, 5 GB
17:32 <Nemo_bis> should be ok
18:48 <emijrp> ok, we can run some tests http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
18:48 <emijrp> i have tested the script, but i hope we can catch any errors before we start to DDoS the wikipedia servers
18:48 <emijrp> by the way, my upload speed is shit, so i won't use this script much, ironically
18:52 <emijrp> Nemo_bis: can you get from 2004-09-07 to 2004-09-30?
18:56 <Nemo_bis> emijrp, I can run it, but not look much at it today, no time
19:01 <Nemo_bis> emijrp, I download that 7z and uncompress it in the working directory?
19:02 <emijrp> yes
19:03 <Nemo_bis> emijrp, running
19:04 <Nemo_bis> emijrp, as they're small enough, should we put them in a rather big item on archive.org?
19:04 <emijrp> it creates zips by day, i think we can follow that format
19:04 <emijrp> like the pagecounts from 2007
19:05 <emijrp> do you upload them one by one or in batches?
19:05 <Nemo_bis> emijrp, pagecounts?
19:05 <Nemo_bis> those are in monthly items
19:05 <emijrp> domas' visit logs
19:06 <emijrp> yes, but those are single files
19:06 <emijrp> if you upload several .zips into one item, can you browse them separately?
19:06 <Nemo_bis> emijrp, yes
19:07 <emijrp> ok, in that case we can create one item per month
19:07 <Nemo_bis> but anyway you need to add the link
19:07 <Nemo_bis> I don't think it links to the browsing thingy automatically
19:07 <Nemo_bis> not always at least
19:08 <emijrp> you create an item, and upload all 30 zips for a month, then they are listed
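One way to realize that layout with the Internet Archive's Python uploader; the item identifier and metadata below are made up for illustration, not the project's real values:

```python
from internetarchive import upload

# Hypothetical item name and metadata: one IA item holding a month of
# day-by-day zips, which the item page then lists individually.
upload('wikimedia-commons-2004-09',
       files=['2004-09-%02d.zip' % d for d in range(1, 31)],
       metadata={'title': 'Wikimedia Commons files, 2004-09',
                 'mediatype': 'data'})
```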
19:08 <emijrp> oh, db48x is downloading september too
19:08 <emijrp> look at the wiki
19:09 <Nemo_bis> emijrp, yes, but how do you link to the zipview.php thingy?
19:09 <Nemo_bis> ah, you edit-conflicted me
19:10 <Nemo_bis> -rw-rw-r-- 1 federico federico 11M 2012-02-27 20:07 2004-09-08.zip
19:10 <Nemo_bis> -rw-rw-r-- 1 federico federico 560K 2012-02-27 20:04 2004-09-07.zip
19:11 <Nemo_bis> -rw-rw-r-- 1 federico federico 3,7M 2012-02-27 20:09 2004-09-09.zip
19:13 <emijrp> not bad
19:13 <emijrp> : )
19:13 <emijrp> i hope we can archive Commons finally
19:15 <Nemo_bis> db48x, maybe you can do the next month?
19:16 <soultcer> The internet archive prefers zips?
19:16 <Nemo_bis> soultcer, SketchCow said so
19:17 <soultcer> Kinda surprising, I always thought tar was the "most future proof" format
19:17 <db48x> Nemo_bis: I claimed this one first :)
19:17 <Nemo_bis> db48x, no, you only saved first
19:17 <db48x> same thing :)
19:17 <Nemo_bis> I claimed first and you stole it because you were in the wrong channel :p
19:18 <db48x> I didn't start editing the page until after I had checked out the software from svn, downloaded the csv file and started the download :)
19:19 <db48x> anyway, I don't think duplicates will hurt us
19:19 <emijrp> come guys
19:19 <Nemo_bis> still wrong :p
19:19 <Nemo_bis> we're just kidding :(
19:19 <emijrp> come on *
19:20 <Nemo_bis> emijrp, if you create a standard description then everyone can upload to IA without errors
19:20 <soultcer> emijrp: What do you think about adding a hash check?
19:20 * Nemo_bis switching to next month
19:20 <soultcer> Since you already have direct access to the commons database, you could just add another field to the csv file
19:21 <emijrp> they use sha1 in base 36, i don't know how to compare that with the downloaded pics
19:21 <emijrp> sha1sum works in another base
19:22 <soultcer> Easy as pie
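A minimal sketch of the comparison being discussed, assuming MediaWiki's img_sha1 column stores the SHA-1 in base 36, zero-padded to 31 characters; the file path in the comment is illustrative:

```python
import hashlib

BASE36 = '0123456789abcdefghijklmnopqrstuvwxyz'

def sha1_base36(path):
    # Stream the file so large images don't have to fit in memory.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    # Convert the 160-bit digest from hex to base 36 and zero-pad it to the
    # 31 characters MediaWiki appears to use for img_sha1.
    n = int(h.hexdigest(), 16)
    digits = ''
    while n:
        n, r = divmod(n, 36)
        digits = BASE36[r] + digits
    return digits.rjust(31, '0')

# Compare against the sha1 column from the feed list / database dump, e.g.:
# assert sha1_base36('2004-09-07/Example.jpg') == sha1_from_csv
```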
19:22 <soultcer> Hm, do you mind if I play around with your script for a bit and give you some patches?
19:22 <emijrp> by the way, i don't think we will have many corrupted files..
19:22 <emijrp> ok
19:23 <soultcer> I'm paranoid about flipped bits, even though the odds are that we won't have a single corrupted image
19:23 <chronomex> it's more likely than you think!
19:23 <soultcer> I have my db server in an underground room to prevent cosmic rays from getting in
19:24 <emijrp> otober is not a month http://www.archiveteam.org/index.php?title=Wikimedia_Commons&curid=461&diff=7312&oldid=7311&rcid=10599
19:25 <chronomex> soultcer: but CENTIPEDES!
19:25 <soultcer> I have a cat that eats those things for breakfast
19:26 <Nemo_bis> emijrp, {{cn}}
19:26 <Nemo_bis> emijrp, is that field actually populated?
19:28 <emijrp> sha1sum? no
19:28 <chronomex> soultcer: memes are lost on the young.
19:28 <soultcer> I try to not follow internet culture too much
19:28 <soultcer> It's distracting
19:30 <chronomex> this is an -ancient- meme
19:30 <emijrp> old as the internets
19:30 <chronomex> http://www.cigarsmokers.com/threads/2898-quot-centipedes-In-MY-vagina-quot
19:31 <chronomex> photoshop of an even older ad "Porn? on MY PC?!?"
19:32 <Nemo_bis> emijrp, I meant this perhaps https://bugzilla.wikimedia.org/show_bug.cgi?id=17057
19:32 <emijrp> soultcer: look at that bug, looks like the sha1 in the wikimedia database sucks
19:33 <chronomex> ignore everything on that site but the picture
19:34 <soultcer> emijrp, Yeah well that shit happens if you deploy one of the largest websites from svn trunk
19:35 <Nemo_bis> soultcer, they no longer do so actually
19:36 <soultcer> We could check the hashes and then tell them which images have the wrong hashes :-)
19:36 <Nemo_bis> Or emijrp could download all those files from the toolserver claiming he's just checking hashes, then upload everything to archive.org
19:36 <soultcer> Wikimedia Germany would probably not approve
19:36 <Nemo_bis> I doubt they care
19:36 <chronomex> the germans can suck it
19:36 <soultcer> But the toolserver belongs to them
19:36 <Nemo_bis> it's just 500 MB
19:37 <Nemo_bis> *GB
19:37 <Nemo_bis> there's a user downloading 3 TB of images for upload to Commons on TS
19:37 <emijrp> topic?
19:37 <Nemo_bis> emijrp, dunno, US-gov stuff
19:38 <emijrp> planes, war
19:38 <Nemo_bis> hopefully not
19:39 <chronomex> nuclear reactors, cornfields, bridges, moon landing
19:41 <emijrp> how many images is 3 TB?
19:42 <Nemo_bis> emijrp, dunno, they're huge TIFFs IIRC
19:42 <Nemo_bis> like 20 MB each
19:42 <Nemo_bis> ask multichill
19:43 <Nemo_bis> or check [[Commons:Batch uploading]] or whatever
19:43 <Nemo_bis> seriously though, emijrp, I think this first batch could also be done on TS
19:43 <emijrp> why?
19:44 <Nemo_bis> perhaps everything is too much given that the WMF says they're working on it
19:44 <Nemo_bis> well, you could run it very fast there I guess, but it's fine if we manage it this way (it's quite CPU-consuming apparently)
19:45 <Nemo_bis> btw emijrp, do you use https://wiki.toolserver.org/view/SGE ? I see those hosts are basically idle
19:45 <Nemo_bis> lots of CPU to (ab)use there
19:45 <emijrp> no
19:45
π
|
soultcer |
I don't think CPU is the limit |
19:51
π
|
Nemo_bis |
unless you download at 100 MiB/s or something I doubt you can harm much |
19:55
π
|
soultcer |
emjirp: Any specifiy reason why you used wget and curl instead of urllib? |
19:57
π
|
emijrp |
i had issues in the past trying to get stuff with urllib from wikipedia servers |
19:58
π
|
soultcer |
Weird |
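A guess at the cause, not something emijrp confirmed: Wikimedia's servers tend to refuse clients that send no descriptive User-Agent, which plain urllib does by default. A hedged sketch with the header set explicitly (the URL and agent string are placeholders):

```python
import urllib.request

# Placeholder URL and contact info; the point is only the explicit User-Agent,
# since requests arriving without one are often rejected by Wikimedia.
url = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg'
req = urllib.request.Request(url, headers={
    'User-Agent': 'commons-archiver/0.1 (https://archiveteam.org; you@example.org)',
})
with urllib.request.urlopen(req) as resp:
    data = resp.read()
```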
19:58
π
|
soultcer |
Hm, what do you do about unicode in the zip files, afaik zip doesn't have any concept of filename encoding |
20:00
π
|
soultcer |
Nevermind, I see they fixed that and the Info-ZIP program will write the unicode filenames as a special metadata field |
20:03
π
|
emijrp |
i did a test with hebrew and arabic filenames |
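A quick round-trip check along those lines; Python's zipfile takes a slightly different route from Info-ZIP's extra field, storing such names as UTF-8 and setting the language-encoding flag (bit 11):

```python
import zipfile

# Write a couple of Hebrew/Arabic filenames and confirm they survive the round trip.
names = ['דוגמה.jpg', 'مثال.jpg']
with zipfile.ZipFile('unicode-test.zip', 'w') as zf:
    for name in names:
        zf.writestr(name, b'placeholder contents')
with zipfile.ZipFile('unicode-test.zip') as zf:
    assert zf.namelist() == names
```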
20:09
π
|
db48x |
probably going to grab a year next |
20:12
π
|
emijrp |
check the size estimates http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Size_stats |
20:18
π
|
db48x |
2006 is only 327 GB |
20:18
π
|
db48x |
I'd want to run a number of copies though |
20:26
π
|
Nemo_bis |
emijrp, doesn't the script delete files after zipping them? |
20:26
π
|
emijrp |
no |
20:26
π
|
emijrp |
i dont like script deleting stuff |
20:27
π
|
Nemo_bis |
hm |
20:29
π
|
Nemo_bis |
well, I can't even see tha bandwidth consumption in my machine graphs |
20:29
π
|
db48x |
should spawn a separate thread for every day |
20:31 <emijrp> yep
20:31 <emijrp> that's the point of the day-by-day download too
20:34 <emijrp> multiple threads
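A hypothetical sketch of the one-thread-per-day idea; download_day() stands in for whatever the real script does for a single daily chunk:

```python
from concurrent.futures import ThreadPoolExecutor

def download_day(day):
    # Stand-in for the real per-day work: fetch the files and the .xml
    # descriptions for one YYYY-MM-DD chunk.
    print('would fetch', day)

days = ['2004-09-%02d' % d for d in range(7, 31)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # A bounded pool rather than literally one thread per day, to keep the
    # load on the Wikimedia servers reasonable.
    list(pool.map(download_day, days))
```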
20:34 <db48x> size estimates are probably low
20:40 <db48x> 2004-09 was estimated at 180 MB, it came to 206 MB after compression
20:45 <Nemo_bis> that's the descriptions
20:46 <db48x> unlikely
20:47 <db48x> only 8.3 megs of those
20:47 <emijrp> estimates don't include old versions of pics
20:47 <db48x> ah
20:48 <emijrp> some pics are reuploaded with changes (white balancing, photoshop, etc)
20:48 <emijrp> those are the 200XYYZZHHMMSS!... filenames
20:49 <emijrp> don't worry if you see a 2006.....! pic in a 2005 folder, or whatever
20:49 <db48x> heh
20:49 <Nemo_bis> lots of 2011 images in the 2004 folder
20:49 <emijrp> those timestamped filenames contain the date when the image was reuploaded
20:50 <emijrp> yes Nemo_bis, but the 2011 means that the 2004 image was changed in 2011, not a 2011 image
20:50 <Nemo_bis> emijrp, yes, "2011 images" = "images with filename 2011*"
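So a 2004 image superseded in 2011 stays in its 2004 folder but gets a name starting with the 2011 re-upload timestamp. A small sketch of parsing that pattern, with a made-up filename:

```python
import re
from datetime import datetime

# Archived versions are named <14-digit timestamp>!<original name>, per the
# explanation above; the timestamp marks when the image was reuploaded.
m = re.match(r'^(\d{14})!(.+)$', '20110102030405!Old_photo.jpg')
reuploaded_at = datetime.strptime(m.group(1), '%Y%m%d%H%M%S')
original_name = m.group(2)
print(reuploaded_at, original_name)
```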
20:50 <emijrp> it is weird, mediawiki way of life..
20:50 <Nemo_bis> not so weird, just annoying that there's no permalink
20:51 <Nemo_bis> (for new images, and IIRC)
20:51 <emijrp> grep "|2005" commonssql.csv | awk -F"|" '{x += $5} END {print "sum: "x/1024/1024/1024}'
20:51 <emijrp> sum: 111.67
20:51 <emijrp> gb
20:53 <emijrp> grep "|200409" commonssql.csv -c
20:53 <emijrp> says 790 pics
20:53 <emijrp> but you wrote 788 in the wiki
20:54 <emijrp> db48x:
20:55 <emijrp> 2 missing files or you didn't count right
21:00 <emijrp> i'm going to announce this commons project on the mailing list
21:01 <Nemo_bis> 2004-10-27 has sooooo many flowers
21:02 <emijrp> not only that day, commons is full of flower lovers
21:02 <emijrp> gaaaaaaaaaaaaaaaaaaaaaaaay
21:03 <Nemo_bis> hm?
21:04 <Nemo_bis> there's not even a featured flower project https://commons.wikimedia.org/wiki/Commons:Featured_Kittens
21:06 <db48x> I counted the files it actually downloaded
21:06 <db48x> find 2004 -type f | grep -v '\.desc$' | wc -l
21:08 <Nemo_bis> lol
21:08 <Nemo_bis> Aesculus hippocastanum flowers (53 F)
21:08 <Nemo_bis> Centaurea cyanus (2 C, 1 P, 139 F)
21:08 <Nemo_bis> Cornflowers by country (1 C)
21:08 <Nemo_bis> Cornflowers in Germany (1 C, 3 F)
21:08 <Nemo_bis> Flowers by taxon (33 C, 1 F)
21:08 <Nemo_bis> Cornflowers in Baden-Württemberg (1 C)
21:08 <Nemo_bis> Cornflowers in Rhein-Neckar-Kreis (5 F)
21:15 <emijrp> db48x: can you compare the filenames from the .csv with those in your directory?
21:15 <emijrp> and detect the 2 missing ones, this is important
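A sketch only, not the project's tooling, for finding which of the ~790 files listed for 2004-09 never made it to disk. The column positions are assumptions (filename in the first '|'-separated field, upload timestamp in the second); adjust them to match the real commonssql.csv:

```python
import csv
import os

# Filenames the dump says belong to September 2004.
with open('commonssql.csv') as f:
    wanted = {row[0] for row in csv.reader(f, delimiter='|')
              if len(row) > 1 and row[1].startswith('200409')}

# Filenames actually downloaded (ignoring the .desc description files).
have = set()
for _root, _dirs, files in os.walk('2004'):
    have.update(name for name in files if not name.endswith('.desc'))

print('missing:', sorted(wanted - have))
```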