16:51 <emijrp> i need some volunteers to archive http://commons.wikimedia.org, it contains 12M files, but the first chunk is about 1M files ~= 500 gb, that chunk is made of daily chunks
16:51 <emijrp> im going to upload the script and feed list
16:58 <Nemo_bis> emijrp, what do you need?
16:59 <Nemo_bis> what I can offer is limited technical knowledge, quite a lot of bandwidth and some disk space :)
16:59 <Nemo_bis> if you put the script somewhere and a list of chunks to claim on archiveteam.org, I'll do as much as possible
17:09 <emijrp> wait a second
17:20 <emijrp> http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
17:20 <Nemo_bis> emijrp, thanks
17:21 <Nemo_bis> I can download 150 GiB or something to start with
17:21 <Nemo_bis> unless you want to crowdsource it more
17:21 <emijrp> first i want just a test
17:21 <Nemo_bis> ok
17:22 <Nemo_bis> not today though
17:22 <Nemo_bis> also, why 7z? doesn't look very useful for images
17:22 <Nemo_bis> and zip or tar can be browsed with the IA tools
17:22 <Nemo_bis> (without downloading)
17:25 <emijrp> every day-by-day folder contains files + .xml with descriptions
17:25 <emijrp> thats why i chose 7z, for the xml
17:25 <emijrp> but if zip is browsable..
17:25 <emijrp> by the way, some days have +5000 pics
17:26 <emijrp> can it browse that?
17:26 <Nemo_bis> I think so
17:26 <Nemo_bis> It has some problems also with huge archives, like 5-10 GiB
17:27 <Nemo_bis> limit unclear
17:27 <Nemo_bis> s/also/only/
17:29 <emijrp> that would be the size for every package
17:29 <emijrp> 1mb/image, 5000 images, 5gb
17:32 <Nemo_bis> should be ok
18:48 <emijrp> ok, we can make some tests http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
18:48 <emijrp> i have tested the script, but i hope we can find any errors before we start to DDoS wikipedia servers
18:48 <emijrp> by the way, my upload stream is shit, so i wont use this script much, irony
18:52 <emijrp> Nemo_bis: can you get from 2004-09-07 to 2004-09-30 ?
18:56 <Nemo_bis> emijrp, I can run it, but not look much at it today, no time
19:01 <Nemo_bis> emijrp, I download that 7z and uncompress it in the working directory?
19:02 <emijrp> yes
19:03 <Nemo_bis> emijrp, running
19:04 <Nemo_bis> emijrp, as they're small enough, should we put them in a rather big item on archive.org?
19:04 <emijrp> it creates zips by day, i think we can follow that format
19:04 <emijrp> like the pagecounts from 2007
19:05 <emijrp> do you upload 1 to 1 or in batches?
19:05 <Nemo_bis> emijrp, pagecounts?
19:05 <Nemo_bis> those are in monthly items
19:05 <emijrp> domas visits logs
19:06 <emijrp> yes, but those are single files
19:06 <emijrp> if you upload several .zip into one item, can you browse them separately?
19:06 <Nemo_bis> emijrp, yes
19:07 <emijrp> ok, in that case we can create one item per month
19:07 <Nemo_bis> but anyway you need to add the link
19:07 <Nemo_bis> I don't think it links to the browsing thingy automatically
19:07 <Nemo_bis> not always at least
19:08 <emijrp> you create an item, and upload all the 30 zips for a month, then they are listed
19:08 <emijrp> oh, db48x is downloading september too
19:08 <emijrp> look at the wiki
19:09 <Nemo_bis> emijrp, yes, but how do you link to the zipview.php thingy?
19:09 <Nemo_bis> ah, you edit conflicted me
19:10 <Nemo_bis> -rw-rw-r-- 1 federico federico 11M 2012-02-27 20:07 2004-09-08.zip
19:10 <Nemo_bis> -rw-rw-r-- 1 federico federico 560K 2012-02-27 20:04 2004-09-07.zip
19:11 <Nemo_bis> -rw-rw-r-- 1 federico federico 3,7M 2012-02-27 20:09 2004-09-09.zip
19:13 <emijrp> not bad
19:13 <emijrp> : )
19:13 <emijrp> i hope we can archive Commons finally
19:15 <Nemo_bis> db48x, maybe you can do the next month?
19:16 <soultcer> The internet archive prefers zips?
19:16 <Nemo_bis> soultcer, SketchCow said so
19:17 <soultcer> Kinda surprising, I always thought tar was the "most future proof" format
19:17 <db48x> Nemo_bis: I claimed this one first :)
19:17 <Nemo_bis> db48x, no, you only saved first
19:17 <db48x> same thing :)
19:17 <Nemo_bis> I claimed first and you stole it because you were in the wrong channel :p
19:18 <db48x> I didn't start editing the page until after I had checked out the software from svn, downloaded the csv file and started the download :)
19:19 <db48x> anyway, I don't think duplicates will hurt us
19:19 <emijrp> come guys
19:19 <Nemo_bis> still wrong :p
19:19 <Nemo_bis> we're just kidding :(
19:19 <emijrp> come on *
19:19 <Nemo_bis> emijrp, if you create a standard description then everyone can upload to IA without errors
19:20 <soultcer> emijrp: What do you think about adding a hash check?
19:20 * Nemo_bis switching to next month
19:20 <soultcer> Since you already have direct access to the commons database, you could just add another field to the csv file
19:20 <emijrp> they use sha1 in base 36, i dont know how to compare that with the downloaded pics
19:21 <emijrp> sha1sum works in another base
19:21 <soultcer> Easy as pie
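soultcer's "easy as pie" amounts to a base conversion: sha1sum prints hex, and MediaWiki's img_sha1 column stores the same digest re-encoded in base 36, zero-padded to 31 digits. A minimal sketch (not taken from the actual script; the `payload` bytes are illustrative):

```python
import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def hex_to_base36(hex_digest: str) -> str:
    """Re-encode a hex SHA-1 (what sha1sum prints) as base 36,
    zero-padded to 31 digits -- the format of MediaWiki's img_sha1."""
    n = int(hex_digest, 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = BASE36[r] + out
    return out.rjust(31, "0")

# Compare a downloaded file against the hash from the database dump:
payload = b"example image bytes"                              # stand-in for file contents
db_sha1 = hex_to_base36(hashlib.sha1(payload).hexdigest())    # value as it would appear in the csv
local = hex_to_base36(hashlib.sha1(payload).hexdigest())      # recomputed from the downloaded file
assert local == db_sha1
```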
19:22 <soultcer> Hm, do you mind if I play around with your script for a bit and give you some patches?
19:22 <emijrp> by the way, i dont think we will have many corrupted files..
19:22 <emijrp> ok
19:22 <soultcer> I'm paranoid about flipped bits, even though odds are that we won't have a single corrupted image
19:23 <chronomex> it's more likely than you think!
19:23 <soultcer> I have my db server in an underground room to prevent cosmic rays from getting in
19:23 <emijrp> otober is not a month http://www.archiveteam.org/index.php?title=Wikimedia_Commons&curid=461&diff=7312&oldid=7311&rcid=10599
19:24 <chronomex> soultcer: but CENTIPEDES!
19:25 <soultcer> I have a cat that eats those things for breakfast
19:25 <Nemo_bis> emijrp, {{cn}}
19:26 <Nemo_bis> emijrp, is that field actually populated?
19:26 <emijrp> sha1sum ? no
19:28 <chronomex> soultcer: memes are lost on the young.
19:28 <soultcer> I try to not follow internet culture too much
19:28 <soultcer> It's distracting
19:28 <chronomex> this is an -ancient- meme
19:30 <emijrp> old as internets
19:30 <chronomex> http://www.cigarsmokers.com/threads/2898-quot-centipedes-In-MY-vagina-quot
19:30 <chronomex> photoshop of an even older ad "Porn? on MY PC?!?"
19:31 <Nemo_bis> emijrp, I meant this perhaps https://bugzilla.wikimedia.org/show_bug.cgi?id=17057
19:32 <emijrp> soultcer: look at that bug, looks like sha1 in the wikimedia database sucks
19:32 <chronomex> ignore everything on that site but the picture
19:33 <soultcer> emijrp, Yeah well that shit happens if you deploy one of the largest websites from svn trunk
19:34 <Nemo_bis> soultcer, they no longer do so actually
19:35 <soultcer> We could check the hashes and then tell them which images have the wrong hashes :-)
19:36 <Nemo_bis> Or emijrp could download all those files from the toolserver claiming he's just checking hashes, then upload everything to archive.org
19:36 <soultcer> Wikimedia Germany would probably not approve
19:36 <Nemo_bis> I doubt they care
19:36 <chronomex> the germans can suck it
19:36 <soultcer> But the toolserver belongs to them
19:36 <Nemo_bis> it's just 500 MB
19:37 <Nemo_bis> *GB
19:37 <Nemo_bis> there's a user downloading 3 TB of images for upload to Commons on TS
19:37 <emijrp> topic?
19:37 <Nemo_bis> emijrp, dunno, US-gov stuff
19:37 <emijrp> planes, war
19:38 <Nemo_bis> hopefully not
19:38 <chronomex> nuclear reactors, cornfields, bridges, moon landing
19:39 <emijrp> how many images are 3tb?
19:41 <Nemo_bis> emijrp, dunno, they're huge TIFFs IIRC
19:42 <Nemo_bis> like 20 Mb each
19:42 <Nemo_bis> ask multichill
19:42 <Nemo_bis> or check [[Commons:Batch uploading]] or whatever
19:43 <Nemo_bis> seriously though, emijrp, I think this first batch could also be done on TS
19:43 <emijrp> why ?
19:43 <Nemo_bis> perhaps everything is too much given that the WMF says they're working on it
19:44 <Nemo_bis> well, you could run it very fast I guess, but if we manage this way it's ok (it's quite CPU-consuming apparently)
19:44 <Nemo_bis> btw emijrp, do you use https://wiki.toolserver.org/view/SGE ? I see those hosts are basically idle
19:45 <Nemo_bis> lots of CPU to (ab)use there
19:45 <emijrp> no
19:45 <soultcer> I don't think CPU is the limit
19:51 <Nemo_bis> unless you download at 100 MiB/s or something I doubt you can harm much
19:55 <soultcer> emijrp: Any specific reason why you used wget and curl instead of urllib?
19:57 <emijrp> i had issues in the past trying to get stuff with urllib from wikipedia servers
19:58 <soultcer> Weird
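A plausible explanation for those urllib issues (not confirmed in the log): Wikimedia's servers reject requests carrying urllib's default `Python-urllib/x.y` User-Agent, while wget and curl send their own. The usual workaround is setting a descriptive agent; the string below is illustrative, not from the script:

```python
import urllib.request

# Wikimedia tends to reject urllib's default "Python-urllib/x.y" agent,
# so we attach a descriptive User-Agent before fetching.
req = urllib.request.Request(
    "https://commons.wikimedia.org/wiki/Special:Random",
    headers={"User-Agent": "commonsarchiver/0.1 (archiveteam.org)"},
)
# urllib.request.urlopen(req) would then behave much like wget/curl.
```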
19:58 <soultcer> Hm, what do you do about unicode in the zip files, afaik zip doesn't have any concept of filename encoding
20:00 <soultcer> Nevermind, I see they fixed that and the Info-ZIP program will write the unicode filenames as a special metadata field
20:03 <emijrp> i did a test with hebrew and arabic filenames
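emijrp's hebrew/arabic test can be reproduced with Python's zipfile module. Note it uses the zip spec's later UTF-8 general-purpose flag bit (0x800) rather than the Info-ZIP extra metadata field soultcer mentions; a small sketch:

```python
import io
import zipfile

# Write a zip containing a Hebrew filename; zipfile marks non-ASCII
# names with the UTF-8 flag bit (0x800) in the file header.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("שלום.xml", "<desc>test</desc>")

# Read it back: the name round-trips and the flag is set.
with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    assert info.filename == "שלום.xml"
    assert info.flag_bits & 0x800  # UTF-8 filename flag
```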
20:09 <db48x> probably going to grab a year next
20:12 <emijrp> check the size estimates http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Size_stats
20:18 <db48x> 2006 is only 327 GB
20:18 <db48x> I'd want to run a number of copies though
20:26 <Nemo_bis> emijrp, doesn't the script delete files after zipping them?
20:26 <emijrp> no
20:26 <emijrp> i dont like scripts deleting stuff
20:27 <Nemo_bis> hm
20:29 <Nemo_bis> well, I can't even see the bandwidth consumption in my machine graphs
20:29 <db48x> should spawn a separate thread for every day
20:31 <emijrp> yep
20:31 <emijrp> thats the point of day-by-day download too
20:31 <emijrp> multiple threads
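db48x's one-thread-per-day idea could be sketched with a thread pool; `fetch_day` here is a hypothetical stand-in for the script's per-day download-and-zip routine, not its real code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_day(day: str) -> str:
    # Hypothetical stand-in: download the day's file list, fetch the
    # images and .xml descriptions, then zip the folder.
    return f"{day}.zip"

days = ["2004-09-07", "2004-09-08", "2004-09-09"]

# A bounded pool rather than unbounded threads, so the archiving run
# doesn't hammer the Wikimedia servers; map preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    archives = list(pool.map(fetch_day, days))
```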
20:34 <db48x> size estimates are probably low
20:34 <db48x> 2004-09 was estimated at 180 Mb, it came to 206 Mb after compression
20:40 <Nemo_bis> that's descriptions
20:45 <db48x> unlikely
20:46 <db48x> only 8.3 megs of those
20:47 <emijrp> estimates don't include old versions of pics
20:47 <db48x> ah
20:47 <emijrp> some pics are reuploaded with changes (white balancing, photoshop, etc)
20:48 <emijrp> those are the 200XYYZZHHMMSS!... filenames
20:48 <emijrp> don't worry if you see a 2006.....! pic in a 2005 folder, or whatever
20:49 <db48x> heh
20:49 <Nemo_bis> lots of 2011 images in the 2004 folder
20:49 <emijrp> those timestamped filenames contain the date when the image was reuploaded
20:49 <emijrp> yes Nemo_bis, but the 2011 means that the 2004 image was changed in 2011, not a 2011 image
20:50 <Nemo_bis> emijrp, yes, "2011 images" = "images with filename 2011*"
20:50 <emijrp> it is weird, mediawiki way of life..
20:50 <Nemo_bis> not so weird, just annoying that there's no permalink
20:51 <Nemo_bis> (for new images, and IIRC)
20:51 <emijrp> grep "|2005" commonssql.csv | awk -F"|" '{x += $5} END {print "sum: "x/1024/1024/1024}'
20:51 <emijrp> sum: 111.67
20:51 <emijrp> gb
20:53 <emijrp> grep "|200409" commonssql.csv -c
20:53 <emijrp> says 790 pics
20:53 <emijrp> but you wrote 788 in the wiki
20:54 <emijrp> db48x:
20:55 <emijrp> 2 missing files or you didnt count right
21:00 <emijrp> im going to announce this commons project in the mailing list
21:01 <Nemo_bis> 2004-10-27 has sooooo many flowers
21:02 <emijrp> not only that day, commons is full of flower lovers
21:02 <emijrp> gaaaaaaaaaaaaaaaaaaaaaaaay
21:03 <Nemo_bis> hm?
21:04 <Nemo_bis> there's not even a featured flower project https://commons.wikimedia.org/wiki/Commons:Featured_Kittens
21:06 <db48x> I counted the files it actually downloaded
21:06 <db48x> find 2004 -type f | grep -v '\.desc$' | wc -l
21:08 <Nemo_bis> lol
21:08 <Nemo_bis> [×] Aesculus hippocastanum flowers (53 F)
21:08 <Nemo_bis> [−] Centaurea cyanus (2 C, 1 P, 139 F)
21:08 <Nemo_bis> [−] Cornflowers by country (1 C)
21:08 <Nemo_bis> [−] Cornflowers in Germany (1 C, 3 F)
21:08 <Nemo_bis> [−] Flowers by taxon (33 C, 1 F)
21:08 <Nemo_bis> [−] Cornflowers in Baden-Württemberg (1 C)
21:08 <Nemo_bis> [×] Cornflowers in Rhein-Neckar-Kreis (5 F)
21:15 <emijrp> db48x: can you compare filenames from the .csv with those in your directory?
21:15 <emijrp> and detect the 2 missing ones, this is important
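The comparison emijrp asks for is a set difference between the csv and the directory listing. A sketch, assuming a pipe-delimited csv with the filename in the first column (matching the `awk -F"|"` commands earlier) and skipping the .desc description files, as db48x's find command does:

```python
import csv
from pathlib import Path

def missing_files(csv_path: str, root: str) -> set:
    """Names listed in the dump csv but absent on disk.

    Assumes a pipe-delimited csv whose first column is the filename;
    the real commonssql.csv layout may differ."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        expected = {row[0] for row in csv.reader(fh, delimiter="|") if row}
    downloaded = {p.name for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix != ".desc"}
    return expected - downloaded
```

Running it over the 2004 tree would print the two filenames whose count came out as 788 instead of 790.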