#wikiteam 2012-02-27,Mon


Time Nickname Message
16:51 πŸ”— emijrp i need some volunteers to archive http://commons.wikimedia.org, it contains 12M files, but the first chunk is about 1M files ~= 500 gb, that chunk is made of daily chunks
16:51 πŸ”— emijrp im going to upload the script and feed list
16:58 πŸ”— Nemo_bis emijrp, what do you need?
16:59 πŸ”— Nemo_bis what I can offer is limited technical knowledge, quite a lot of bandwidth and some disk space :)
16:59 πŸ”— Nemo_bis if you put the script somewhere and a list of chunks to claim on archiveteam.org, I'll do as much as possible
17:09 πŸ”— emijrp wait a second
17:20 πŸ”— emijrp http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
17:20 πŸ”— Nemo_bis emijrp, thanks
17:21 πŸ”— Nemo_bis I can download 150 GiB or something to start with
17:21 πŸ”— Nemo_bis unless you want to crowdsource it more
17:21 πŸ”— emijrp first i want just a test
17:21 πŸ”— Nemo_bis ok
17:22 πŸ”— Nemo_bis not today though
17:22 πŸ”— Nemo_bis also, why 7z? doesn't look very useful for images
17:22 πŸ”— Nemo_bis and zip or tar can be browsed with the IA tools
17:22 πŸ”— Nemo_bis (without downloading)
17:25 πŸ”— emijrp every day-by-day folder contains files + .xml with descriptions
17:25 πŸ”— emijrp thats why i chose 7z, for the xml
17:25 πŸ”— emijrp but if zip is browsable..
17:25 πŸ”— emijrp by the way, some days have +5000 pics
17:26 πŸ”— emijrp can it browse that?
17:26 πŸ”— Nemo_bis I think so
17:26 πŸ”— Nemo_bis It has some problems also with huge archives, like 5-10 GiB
17:27 πŸ”— Nemo_bis limit unclear
17:27 πŸ”— Nemo_bis s/also/only/
17:29 πŸ”— emijrp that would be the size for every package
17:29 πŸ”— emijrp 1mb/image, 5000 images, 5gb
17:32 πŸ”— Nemo_bis should be ok
18:48 πŸ”— emijrp ok, we can make some tests http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
18:48 πŸ”— emijrp i have tested the script, but, i hope we can find any error before we start to DDoS wikipedia servers
18:48 πŸ”— emijrp by the way, my upload stream is shit, so, i wont use this script much, irony
18:52 πŸ”— emijrp Nemo_bis: can you get from 2004-09-07 to 2004-09-30 ?
18:56 πŸ”— Nemo_bis emijrp, I can run it, but not look much at it today, no time
19:01 πŸ”— Nemo_bis emijrp, I download that 7z and uncompress it in the working directory?
19:02 πŸ”— emijrp yes
19:03 πŸ”— Nemo_bis emijrp, running
19:04 πŸ”— Nemo_bis emijrp, as they're small enough, should we put them in a rather big item on archive.org?
19:04 πŸ”— emijrp it creates zips by day, i think we can follow that format
19:04 πŸ”— emijrp as the pagecounts from 2007
19:05 πŸ”— emijrp do you upload 1 to 1 or in batches?
19:05 πŸ”— Nemo_bis emijrp, pagecounts?
19:05 πŸ”— Nemo_bis those are in monthly items
19:05 πŸ”— emijrp domas visits logs
19:06 πŸ”— emijrp yes, but are single files
19:06 πŸ”— emijrp if you upload several .zip into one item, you can browse them separately?
19:06 πŸ”— Nemo_bis emijrp, yes
19:07 πŸ”— emijrp ok, in that case we can create one item per month
19:07 πŸ”— Nemo_bis but anyway you need to add the link
19:07 πŸ”— Nemo_bis I don't think it links to the browsing thingy automatically
19:07 πŸ”— Nemo_bis not always at least
19:08 πŸ”— emijrp you create an item, and upload all the 30 zips for a month, then they are listed
19:08 πŸ”— emijrp oh db48x is download september too
19:08 πŸ”— emijrp look at the wiki
19:09 πŸ”— Nemo_bis emijrp, yes, but how do you link to the zipview.php thingy?
19:09 πŸ”— Nemo_bis ah, you edit conflicted me
19:10 πŸ”— Nemo_bis -rw-rw-r-- 1 federico federico 11M 2012-02-27 20:07 2004-09-08.zip
19:10 πŸ”— Nemo_bis -rw-rw-r-- 1 federico federico 560K 2012-02-27 20:04 2004-09-07.zip
19:11 πŸ”— Nemo_bis -rw-rw-r-- 1 federico federico 3,7M 2012-02-27 20:09 2004-09-09.zip
19:13 πŸ”— emijrp not bad
19:13 πŸ”— emijrp : )
19:13 πŸ”— emijrp i hope we can archive Commons finally
19:15 πŸ”— Nemo_bis db48x, maybe you can do the next month?
19:16 πŸ”— soultcer The internet archive prefers zips?
19:16 πŸ”— Nemo_bis soultcer, SketchCow said so
19:17 πŸ”— soultcer Kinda surprising, I always thought tar was the "most future proof" format
19:17 πŸ”— db48x Nemo_bis: I claimed this one first :)
19:17 πŸ”— Nemo_bis db48x, no, you only saved first
19:17 πŸ”— db48x same thing :)
19:17 πŸ”— Nemo_bis I claimed first and you stole it because you were in the wrong channel :p
19:18 πŸ”— db48x I didn't start editing the page until after I had checked out the software from svn, downloaded the csv file and started the download :)
19:19 πŸ”— db48x anyway, I don't think duplicates will hurt us
19:19 πŸ”— emijrp come guys
19:19 πŸ”— Nemo_bis still wrong :p
19:19 πŸ”— Nemo_bis we're just kidding :(
19:19 πŸ”— emijrp come on *
19:19 πŸ”— Nemo_bis emijrp, if you create a standard description then everyone can upload to IA without errors
19:20 πŸ”— soultcer emijrp: What do you think about adding a hash check?
19:20 πŸ”— * Nemo_bis switching to next month
19:20 πŸ”— soultcer Since you already have direct access to the commons database, you could just add another field to the csv file
19:20 πŸ”— emijrp they use sha1 in base 36, i dont know how to compare that with the downloaded pics
19:21 πŸ”— emijrp sha1sum works in another base
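(For reference: MediaWiki stores image SHA-1 digests in base 36, while sha1sum prints hex, so converting one to the other makes them comparable. A minimal sketch — the helper names are my own, not from the wikiteam script:)

```python
import hashlib

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def sha1_base36(path):
    """SHA-1 of a file, rendered in base 36 (the encoding
    MediaWiki uses for image hashes)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    n = int(h.hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = BASE36[r] + out
    return out or "0"

def same_hash(base36_digest, hex_digest):
    # Compare as integers so leading-zero padding differences
    # (MediaWiki may pad its base-36 digests) don't matter.
    return int(base36_digest, 36) == int(hex_digest, 16)
```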
19:21 πŸ”— soultcer Easy as pie
19:22 πŸ”— soultcer Hm, do you mind if I play around with your script for a bit and give you some patches?
19:22 πŸ”— emijrp by the way, i dont think we will have many corrupted files..
19:22 πŸ”— emijrp ok
19:22 πŸ”— soultcer I'm paranoid about flipped bits, even though odds are that we won't have a single corrupted image
19:23 πŸ”— chronomex it's more likely than you think!
19:23 πŸ”— soultcer I have my db server in an underground room to prevent cosmic rays from getting in
19:23 πŸ”— emijrp otober is not a month http://www.archiveteam.org/index.php?title=Wikimedia_Commons&curid=461&diff=7312&oldid=7311&rcid=10599
19:24 πŸ”— chronomex soultcer: but CENTIPEDES!
19:25 πŸ”— soultcer I have a cat that eats those things for breakfast
19:25 πŸ”— Nemo_bis emijrp, {{cn}}
19:26 πŸ”— Nemo_bis emijrp, is that field actually populated?
19:26 πŸ”— emijrp sha1sum ? no
19:28 πŸ”— chronomex soultcer: memes are lost on the young.
19:28 πŸ”— soultcer I try to not follow internet culture too much
19:28 πŸ”— soultcer It's distracting
19:28 πŸ”— chronomex this is -ancient- meme
19:30 πŸ”— emijrp old as internets
19:30 πŸ”— chronomex http://www.cigarsmokers.com/threads/2898-quot-centipedes-In-MY-vagina-quot
19:30 πŸ”— chronomex photoshop of an even older ad "Porn? on MY PC?!?"
19:31 πŸ”— Nemo_bis emijrp, I meant this perhaps https://bugzilla.wikimedia.org/show_bug.cgi?id=17057
19:32 πŸ”— emijrp soultcer: look that bug, looks like sha1 in wikimedia database sucks
19:32 πŸ”— chronomex ignore everything on that site but the picture
19:33 πŸ”— soultcer emijrp, Yeah well that shit happens if you deploy one of the largest websites from svn trunk
19:34 πŸ”— Nemo_bis soultcer, they no longer do so actually
19:35 πŸ”— soultcer We could check the hashes and then tell them which images have the wrong hashes :-)
19:36 πŸ”— Nemo_bis Or emijrp could download all those files from the toolserver claiming he's just checking hashes, then upload everything to archive.org
19:36 πŸ”— soultcer Wikimedia Germany would probably not approve
19:36 πŸ”— Nemo_bis I doubt they care
19:36 πŸ”— chronomex the germans can suck it
19:36 πŸ”— soultcer But the toolserver belongs to them
19:36 πŸ”— Nemo_bis it's just 500 MB
19:36 πŸ”— Nemo_bis *GB
19:37 πŸ”— Nemo_bis there's a user downloading 3 TB of images for upload to Commons on TS
19:37 πŸ”— emijrp topic?
19:37 πŸ”— Nemo_bis emijrp, dunno, US-gov stuff
19:37 πŸ”— emijrp planes, war
19:38 πŸ”— Nemo_bis hopefully not
19:38 πŸ”— chronomex nuclear reactors, cornfields, bridges, moon landing
19:39 πŸ”— emijrp how many images are 3tb?
19:41 πŸ”— Nemo_bis emijrp, dunno, they're huge TIFFs IIRC
19:42 πŸ”— Nemo_bis like 20 Mb each
19:42 πŸ”— Nemo_bis ask multichill
19:42 πŸ”— Nemo_bis or check [[Commons:Batch uploading]] or whatever
19:43 πŸ”— Nemo_bis seriously though, emijrp, I think this first batch could also be done on TS
19:43 πŸ”— emijrp why ?
19:43 πŸ”— Nemo_bis perhaps everything is too much given that the WMF says they're working on it
19:44 πŸ”— Nemo_bis well, you could run it very fast I guess, but if we manage this way ok (it's quite CPU-consuming apparently)
19:44 πŸ”— Nemo_bis btw emijrp, do you use https://wiki.toolserver.org/view/SGE ? I see those hosts are basically idle
19:45 πŸ”— Nemo_bis lots of CPu to (ab)use there
19:45 πŸ”— emijrp no
19:45 πŸ”— soultcer I don't think CPU is the limit
19:51 πŸ”— Nemo_bis unless you download at 100 MiB/s or something I doubt you can harm much
19:55 πŸ”— soultcer emijrp: Any specific reason why you used wget and curl instead of urllib?
19:57 πŸ”— emijrp i had issues in the past trying to get stuff with urllib from wikipedia servers
19:58 πŸ”— soultcer Weird
19:58 πŸ”— soultcer Hm, what do you do about unicode in the zip files, afaik zip doesn't have any concept of filename encoding
20:00 πŸ”— soultcer Nevermind, I see they fixed that and the Info-ZIP program will write the unicode filenames as a special metadata field
20:03 πŸ”— emijrp i did a test with hebrew and arabic filenames
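(That test is easy to reproduce with Python's zipfile module, which sets the UTF-8 filename flag — general-purpose bit 11, the Info-ZIP fix mentioned above — automatically for non-ASCII names. A small round-trip sketch:)

```python
import io
import zipfile

# Round-trip a Hebrew and an Arabic filename through a zip archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("עברית.txt", b"shalom")
    z.writestr("عربي.txt", b"salaam")

# Reopen the archive and confirm the names survived intact.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()
```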
20:09 πŸ”— db48x probably going to grab a year next
20:12 πŸ”— emijrp check the size estimates http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Size_stats
20:18 πŸ”— db48x 2006 is only 327 GB
20:18 πŸ”— db48x I'd want to run a number of copies though
20:26 πŸ”— Nemo_bis emijrp, doesn't the script delete files after zipping them?
20:26 πŸ”— emijrp no
20:26 πŸ”— emijrp i dont like script deleting stuff
20:27 πŸ”— Nemo_bis hm
20:29 πŸ”— Nemo_bis well, I can't even see the bandwidth consumption in my machine graphs
20:29 πŸ”— db48x should spawn a separate thread for every day
20:31 πŸ”— emijrp yep
20:31 πŸ”— emijrp thats the point for day-by-day download too
20:31 πŸ”— emijrp multiple threads
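(A hedged sketch of the per-day parallelism db48x describes — not the actual wikiteam script; `archive_day` here is a stand-in for the real download-and-zip step:)

```python
from concurrent.futures import ThreadPoolExecutor

def archive_day(day):
    # Placeholder for the real per-day work: fetch the day's
    # files + .desc metadata, then zip the folder.
    return day

# One task per day of the month; a bounded pool keeps the
# load on the servers polite.
days = ["2004-09-%02d" % d for d in range(1, 31)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(archive_day, days))
```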
20:34 πŸ”— db48x size estimates are probably low
20:34 πŸ”— db48x 2004-09 was estimated at 180 Mb, it came to 206 Mb after compression
20:40 πŸ”— Nemo_bis that's descriptions
20:45 πŸ”— db48x unlikely
20:46 πŸ”— db48x only 8.3 megs of those
20:47 πŸ”— emijrp estimates don't include old versions of pics
20:47 πŸ”— db48x ah
20:47 πŸ”— emijrp some pics are reuploaded with changes (white balancing, photoshop, etc)
20:48 πŸ”— emijrp those are the 200XYYZZHHMMSS!... filenames
20:48 πŸ”— emijrp don't worry if you see a 2006.....! pic in a 2005 folder, or whatever
20:49 πŸ”— db48x heh
20:49 πŸ”— Nemo_bis lots of 2011 images in the 2004 folder
20:49 πŸ”— emijrp those timestamped filenames contain the date when the image was reuploaded
20:49 πŸ”— emijrp yes Nemo_bis, but the 2011 means that the 2004 image was changed in 2011, not a 2011 image
20:50 πŸ”— Nemo_bis emijrp, yes, "2011 images" = "images with filename 2011*"
20:50 πŸ”— emijrp it is weird, mediawiki way of life..
20:50 πŸ”— Nemo_bis not so weird, just annoying that there's no permalink
20:50 πŸ”— Nemo_bis (for new images, and IIRC)
20:51 πŸ”— emijrp grep "|2005" commonssql.csv | awk -F"|" '{x += $5} END {print "sum: "x/1024/1024/1024}'
20:51 πŸ”— emijrp sum: 111.67
20:51 πŸ”— emijrp gb
20:53 πŸ”— emijrp grep "|200409" commonssql.csv -c
20:53 πŸ”— emijrp says 790 pics
20:53 πŸ”— emijrp but you wrote 788 in the wiki
20:54 πŸ”— emijrp db48x:
20:55 πŸ”— emijrp 2 missing files or you didnt count right
21:00 πŸ”— emijrp im going to announce this commons project in the mailing list
21:01 πŸ”— Nemo_bis 2004-10-27 has sooooo many flowers
21:02 πŸ”— emijrp not only that day, commons is full of flower lovers
21:02 πŸ”— emijrp gaaaaaaaaaaaaaaaaaaaaaaaay
21:03 πŸ”— Nemo_bis hm?
21:04 πŸ”— Nemo_bis there's not even a featured flower project https://commons.wikimedia.org/wiki/Commons:Featured_Kittens
21:06 πŸ”— db48x I counted the files it actually downloaded
21:06 πŸ”— db48x find 2004 -type f | grep -v '\.desc$' | wc -l
21:08 πŸ”— Nemo_bis lol
21:08 πŸ”— Nemo_bis [×] Aesculus hippocastanum flowers (53 F)
21:08 πŸ”— Nemo_bis [−] Centaurea cyanus (2 C, 1 P, 139 F)
21:08 πŸ”— Nemo_bis [−] Cornflowers by country (1 C)
21:08 πŸ”— Nemo_bis [−] Cornflowers in Germany (1 C, 3 F)
21:08 πŸ”— Nemo_bis [−] Flowers by taxon (33 C, 1 F)
21:08 πŸ”— Nemo_bis [−] Cornflowers in Baden-Württemberg (1 C)
21:08 πŸ”— Nemo_bis [×] Cornflowers in Rhein-Neckar-Kreis (5 F)
21:15 πŸ”— emijrp db48x: can you compare filenames from .csv with those in your directory?
21:15 πŸ”— emijrp and detect the 2 missing ones, this is important
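(A sketch of that comparison, assuming commonssql.csv is pipe-delimited with the filename in the first column — an assumption, since the real column layout isn't shown here. It skips the .desc metadata files, matching db48x's `grep -v '\.desc$'` count:)

```python
import csv
import os

def missing_files(csv_path, download_dir):
    """Return csv filenames with no matching file on disk."""
    on_disk = set()
    for root, _, files in os.walk(download_dir):
        for name in files:
            if not name.endswith(".desc"):
                on_disk.add(name)
    missing = []
    with open(csv_path) as f:
        for row in csv.reader(f, delimiter="|"):
            if row and row[0] not in on_disk:
                missing.append(row[0])
    return missing
```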
