16:51 <emijrp> i need some volunteers to archive http://commons.wikimedia.org, it contains 12M files, but the first chunk is about 1M files ~= 500 GB, and that chunk is made of daily chunks
16:51 <emijrp> i'm going to upload the script and feed list
16:58 <Nemo_bis> emijrp, what do you need?
16:59 <Nemo_bis> what I can offer is limited technical knowledge, quite a lot of bandwidth and some disk space :)
16:59 <Nemo_bis> if you put the script somewhere and a list of chunks to claim on archiveteam.org, I'll do as much as possible
17:09 <emijrp> wait a second
17:20 <emijrp> http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
17:20 <Nemo_bis> emijrp, thanks
17:21 <Nemo_bis> I can download 150 GiB or something to start with
17:21 <Nemo_bis> unless you want to crowdsource it more
17:21 <emijrp> first i want just some tests
17:21 <Nemo_bis> ok
17:22 <Nemo_bis> not today though
17:22 <Nemo_bis> also, why 7z? doesn't look very useful for images
17:22 <Nemo_bis> and zip or tar can be browsed with the IA tools
17:22 <Nemo_bis> (without downloading)
17:25 <emijrp> every day-by-day folder contains files + .xml with descriptions
17:25 <emijrp> that's why i chose 7z, for the xml
17:25 <emijrp> but if zip is browsable..
17:25 <emijrp> by the way, some days have +5000 pics
17:26 <emijrp> can it browse that?
17:26 <Nemo_bis> I think so
17:26 <Nemo_bis> It has some problems also with huge archives, like 5-10 GiB
17:27 <Nemo_bis> limit unclear
17:27 <Nemo_bis> s/also/only/
17:29 <emijrp> that would be the size for every package
17:29 <emijrp> 1 MB/image, 5000 images, 5 GB
17:32 <Nemo_bis> should be ok
18:48 <emijrp> ok, we can run some tests http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
18:48 <emijrp> i have tested the script, but i hope we can catch any errors before we start to DDoS the wikipedia servers
18:48 <emijrp> by the way, my upload speed is shit, so i won't use this script much, ironically
18:52 <emijrp> Nemo_bis: can you get from 2004-09-07 to 2004-09-30?
18:56 <Nemo_bis> emijrp, I can run it, but not look much at it today, no time
19:01 <Nemo_bis> emijrp, I download that 7z and uncompress it in the working directory?
19:02 <emijrp> yes
19:03 <Nemo_bis> emijrp, running
19:04 <Nemo_bis> emijrp, as they're small enough, should we put them in a rather big item on archive.org?
19:04 <emijrp> it creates zips by day, i think we can follow that format
19:04 <emijrp> like the pagecounts from 2007
19:05 <emijrp> do you upload them one by one or in batches?
19:05 <Nemo_bis> emijrp, pagecounts?
19:05 <Nemo_bis> those are in monthly items
19:05 <emijrp> domas' visit logs
19:06 <emijrp> yes, but those are single files
19:06 <emijrp> if you upload several .zips into one item, can you browse them separately?
19:06 <Nemo_bis> emijrp, yes
19:07 <emijrp> ok, in that case we can create one item per month
19:07 <Nemo_bis> but anyway you need to add the link
19:07 <Nemo_bis> I don't think it links to the browsing thingy automatically
19:07 <Nemo_bis> not always at least
19:08 <emijrp> you create an item, and upload all 30 zips for a month, then they are listed
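One way to realize that layout with the Internet Archive's Python uploader; the item identifier and metadata below are made up for illustration, not the project's real values:

```python
from internetarchive import upload

# Hypothetical item name and metadata: one IA item holding a month of
# day-by-day zips, which the item page then lists individually.
upload('wikimedia-commons-2004-09',
       files=['2004-09-%02d.zip' % d for d in range(1, 31)],
       metadata={'title': 'Wikimedia Commons files, 2004-09',
                 'mediatype': 'data'})
```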
19:08 <emijrp> oh, db48x is downloading september too
19:08 <emijrp> look at the wiki
19:09 <Nemo_bis> emijrp, yes, but how do you link to the zipview.php thingy?
19:09 <Nemo_bis> ah, you edit-conflicted me
19:10 <Nemo_bis> -rw-rw-r-- 1 federico federico 11M 2012-02-27 20:07 2004-09-08.zip
19:10 <Nemo_bis> -rw-rw-r-- 1 federico federico 560K 2012-02-27 20:04 2004-09-07.zip
19:11 <Nemo_bis> -rw-rw-r-- 1 federico federico 3,7M 2012-02-27 20:09 2004-09-09.zip
19:13 <emijrp> not bad
19:13 <emijrp> : )
19:13 <emijrp> i hope we can archive Commons finally
19:15 <Nemo_bis> db48x, maybe you can do the next month?
19:16 <soultcer> The internet archive prefers zips?
19:16 <Nemo_bis> soultcer, SketchCow said so
19:17 <soultcer> Kinda surprising, I always thought tar was the "most future proof" format
19:17 <db48x> Nemo_bis: I claimed this one first :)
19:17 <Nemo_bis> db48x, no, you only saved first
19:17 <db48x> same thing :)
19:17 <Nemo_bis> I claimed first and you stole it because you were in the wrong channel :p
19:18 <db48x> I didn't start editing the page until after I had checked out the software from svn, downloaded the csv file and started the download :)
19:19 <db48x> anyway, I don't think duplicates will hurt us
19:19 <emijrp> come guys
19:19 <Nemo_bis> still wrong :p
19:19 <Nemo_bis> we're just kidding :(
19:19 <emijrp> come on *
19:20 <Nemo_bis> emijrp, if you create a standard description then everyone can upload to IA without errors
19:20 <soultcer> emijrp: What do you think about adding a hash check?
19:20 * Nemo_bis switching to next month
19:20 <soultcer> Since you already have direct access to the commons database, you could just add another field to the csv file
19:21 <emijrp> they use sha1 in base 36, i don't know how to compare that with the downloaded pics
19:21 <emijrp> sha1sum works in another base
19:22 <soultcer> Easy as pie
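A minimal sketch of the comparison being discussed, assuming MediaWiki's img_sha1 column stores the SHA-1 in base 36, zero-padded to 31 characters; the file path in the comment is illustrative:

```python
import hashlib

BASE36 = '0123456789abcdefghijklmnopqrstuvwxyz'

def sha1_base36(path):
    # Stream the file so large images don't have to fit in memory.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    # Convert the 160-bit digest from hex to base 36 and zero-pad it to the
    # 31 characters MediaWiki appears to use for img_sha1.
    n = int(h.hexdigest(), 16)
    digits = ''
    while n:
        n, r = divmod(n, 36)
        digits = BASE36[r] + digits
    return digits.rjust(31, '0')

# Compare against the sha1 column from the feed list / database dump, e.g.:
# assert sha1_base36('2004-09-07/Example.jpg') == sha1_from_csv
```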
19:22 <soultcer> Hm, do you mind if I play around with your script for a bit and give you some patches?
19:22 <emijrp> by the way, i don't think we will have many corrupted files..
19:22 <emijrp> ok
19:23 <soultcer> I'm paranoid about flipped bits, even though the odds are that we won't have a single corrupted image
19:23 <chronomex> it's more likely than you think!
19:23 <soultcer> I have my db server in an underground room to prevent cosmic rays from getting in
19:24 <emijrp> otober is not a month http://www.archiveteam.org/index.php?title=Wikimedia_Commons&curid=461&diff=7312&oldid=7311&rcid=10599
19:25 <chronomex> soultcer: but CENTIPEDES!
19:25 <soultcer> I have a cat that eats those things for breakfast
19:26 <Nemo_bis> emijrp, {{cn}}
19:26 <Nemo_bis> emijrp, is that field actually populated?
19:28 <emijrp> sha1sum? no
19:28 <chronomex> soultcer: memes are lost on the young.
19:28 <soultcer> I try to not follow internet culture too much
19:28 <soultcer> It's distracting
19:30 <chronomex> this is an -ancient- meme
19:30 <emijrp> old as the internets
19:30 <chronomex> http://www.cigarsmokers.com/threads/2898-quot-centipedes-In-MY-vagina-quot
19:31 <chronomex> photoshop of an even older ad "Porn? on MY PC?!?"
19:32 <Nemo_bis> emijrp, I meant this perhaps https://bugzilla.wikimedia.org/show_bug.cgi?id=17057
19:32 <emijrp> soultcer: look at that bug, looks like the sha1 in the wikimedia database sucks
19:33 <chronomex> ignore everything on that site but the picture
19:34 <soultcer> emijrp, Yeah well that shit happens if you deploy one of the largest websites from svn trunk
19:35 <Nemo_bis> soultcer, they no longer do so actually
19:36 <soultcer> We could check the hashes and then tell them which images have the wrong hashes :-)
19:36 <Nemo_bis> Or emijrp could download all those files from the toolserver claiming he's just checking hashes, then upload everything to archive.org
19:36 <soultcer> Wikimedia Germany would probably not approve
19:36 <Nemo_bis> I doubt they care
19:36 <chronomex> the germans can suck it
19:36 <soultcer> But the toolserver belongs to them
19:36 <Nemo_bis> it's just 500 MB
19:37 <Nemo_bis> *GB
19:37 <Nemo_bis> there's a user downloading 3 TB of images for upload to Commons on TS
19:37 <emijrp> topic?
19:37 <Nemo_bis> emijrp, dunno, US-gov stuff
19:38 <emijrp> planes, war
19:38 <Nemo_bis> hopefully not
19:39 <chronomex> nuclear reactors, cornfields, bridges, moon landing
19:41 <emijrp> how many images is 3 TB?
19:42 <Nemo_bis> emijrp, dunno, they're huge TIFFs IIRC
19:42 <Nemo_bis> like 20 MB each
19:42 <Nemo_bis> ask multichill
19:43 <Nemo_bis> or check [[Commons:Batch uploading]] or whatever
19:43 <Nemo_bis> seriously though, emijrp, I think this first batch could also be done on TS
19:43 <emijrp> why?
19:44 <Nemo_bis> perhaps everything is too much given that the WMF says they're working on it
19:44 <Nemo_bis> well, you could run it very fast there I guess, but it's fine if we manage it this way (it's quite CPU-consuming apparently)
19:45 <Nemo_bis> btw emijrp, do you use https://wiki.toolserver.org/view/SGE ? I see those hosts are basically idle
19:45 <Nemo_bis> lots of CPU to (ab)use there
19:45 <emijrp> no
19:45
π
|
soultcer |
I don't think CPU is the limit |
19:51
π
|
Nemo_bis |
unless you download at 100 MiB/s or something I doubt you can harm much |
19:55
π
|
soultcer |
emjirp: Any specifiy reason why you used wget and curl instead of urllib? |
19:57
π
|
emijrp |
i had issues in the past trying to get stuff with urllib from wikipedia servers |
19:58
π
|
soultcer |
Weird |
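A guess at the cause, not something emijrp confirmed: Wikimedia's servers tend to refuse clients that send no descriptive User-Agent, which plain urllib does by default. A hedged sketch with the header set explicitly (the URL and agent string are placeholders):

```python
import urllib.request

# Placeholder URL and contact info; the point is only the explicit User-Agent,
# since requests arriving without one are often rejected by Wikimedia.
url = 'https://upload.wikimedia.org/wikipedia/commons/a/ab/Example.jpg'
req = urllib.request.Request(url, headers={
    'User-Agent': 'commons-archiver/0.1 (https://archiveteam.org; you@example.org)',
})
with urllib.request.urlopen(req) as resp:
    data = resp.read()
```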
19:58
π
|
soultcer |
Hm, what do you do about unicode in the zip files, afaik zip doesn't have any concept of filename encoding |
20:00
π
|
soultcer |
Nevermind, I see they fixed that and the Info-ZIP program will write the unicode filenames as a special metadata field |
20:03
π
|
emijrp |
i did a test with hebrew and arabic filenames |
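A quick round-trip check along those lines; Python's zipfile takes a slightly different route from Info-ZIP's extra field, storing such names as UTF-8 and setting the language-encoding flag (bit 11):

```python
import zipfile

# Write a couple of Hebrew/Arabic filenames and confirm they survive the round trip.
names = ['דוגמה.jpg', 'مثال.jpg']
with zipfile.ZipFile('unicode-test.zip', 'w') as zf:
    for name in names:
        zf.writestr(name, b'placeholder contents')
with zipfile.ZipFile('unicode-test.zip') as zf:
    assert zf.namelist() == names
```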
20:09
π
|
db48x |
probably going to grab a year next |
20:12
π
|
emijrp |
check the size estimates http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Size_stats |
20:18
π
|
db48x |
2006 is only 327 GB |
20:18
π
|
db48x |
I'd want to run a number of copies though |
20:26
π
|
Nemo_bis |
emijrp, doesn't the script delete files after zipping them? |
20:26
π
|
emijrp |
no |
20:26
π
|
emijrp |
i dont like script deleting stuff |
20:27
π
|
Nemo_bis |
hm |
20:29
π
|
Nemo_bis |
well, I can't even see tha bandwidth consumption in my machine graphs |
20:29
π
|
db48x |
should spawn a separate thread for every day |
20:31 <emijrp> yep
20:31 <emijrp> that's the point of the day-by-day download too
20:34 <emijrp> multiple threads
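A hypothetical sketch of the one-thread-per-day idea; download_day() stands in for whatever the real script does for a single daily chunk:

```python
from concurrent.futures import ThreadPoolExecutor

def download_day(day):
    # Stand-in for the real per-day work: fetch the files and the .xml
    # descriptions for one YYYY-MM-DD chunk.
    print('would fetch', day)

days = ['2004-09-%02d' % d for d in range(7, 31)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # A bounded pool rather than literally one thread per day, to keep the
    # load on the Wikimedia servers reasonable.
    list(pool.map(download_day, days))
```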
20:34 <db48x> size estimates are probably low
20:40 <db48x> 2004-09 was estimated at 180 MB, it came to 206 MB after compression
20:45 <Nemo_bis> that's the descriptions
20:46 <db48x> unlikely
20:47 <db48x> only 8.3 megs of those
20:47 <emijrp> estimates don't include old versions of pics
20:47 <db48x> ah
20:48 <emijrp> some pics are reuploaded with changes (white balancing, photoshop, etc)
20:48 <emijrp> those are the 200XYYZZHHMMSS!... filenames
20:49 <emijrp> don't worry if you see a 2006.....! pic in a 2005 folder, or whatever
20:49 <db48x> heh
20:49 <Nemo_bis> lots of 2011 images in the 2004 folder
20:49 <emijrp> those timestamped filenames contain the date when the image was reuploaded
20:50 <emijrp> yes Nemo_bis, but the 2011 means that the 2004 image was changed in 2011, not a 2011 image
20:50 <Nemo_bis> emijrp, yes, "2011 images" = "images with filename 2011*"
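So a 2004 image superseded in 2011 stays in its 2004 folder but gets a name starting with the 2011 re-upload timestamp. A small sketch of parsing that pattern, with a made-up filename:

```python
import re
from datetime import datetime

# Archived versions are named <14-digit timestamp>!<original name>, per the
# explanation above; the timestamp marks when the image was reuploaded.
m = re.match(r'^(\d{14})!(.+)$', '20110102030405!Old_photo.jpg')
reuploaded_at = datetime.strptime(m.group(1), '%Y%m%d%H%M%S')
original_name = m.group(2)
print(reuploaded_at, original_name)
```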
20:50 <emijrp> it is weird, mediawiki way of life..
20:50 <Nemo_bis> not so weird, just annoying that there's no permalink
20:51 <Nemo_bis> (for new images, and IIRC)
20:51 <emijrp> grep "|2005" commonssql.csv | awk -F"|" '{x += $5} END {print "sum: "x/1024/1024/1024}'
20:51 <emijrp> sum: 111.67
20:51 <emijrp> gb
20:53 <emijrp> grep "|200409" commonssql.csv -c
20:53 <emijrp> says 790 pics
20:53 <emijrp> but you wrote 788 in the wiki
20:54 <emijrp> db48x:
20:55 <emijrp> 2 missing files or you didn't count right
21:00 <emijrp> i'm going to announce this commons project on the mailing list
21:01 <Nemo_bis> 2004-10-27 has sooooo many flowers
21:02 <emijrp> not only that day, commons is full of flower lovers
21:02 <emijrp> gaaaaaaaaaaaaaaaaaaaaaaaay
21:03 <Nemo_bis> hm?
21:04 <Nemo_bis> there's not even a featured flower project https://commons.wikimedia.org/wiki/Commons:Featured_Kittens
21:06 <db48x> I counted the files it actually downloaded
21:06 <db48x> find 2004 -type f | grep -v '\.desc$' | wc -l
21:08 <Nemo_bis> lol
21:08 <Nemo_bis> Aesculus hippocastanum flowers (53 F)
21:08 <Nemo_bis> Centaurea cyanus (2 C, 1 P, 139 F)
21:08 <Nemo_bis> Cornflowers by country (1 C)
21:08 <Nemo_bis> Cornflowers in Germany (1 C, 3 F)
21:08 <Nemo_bis> Flowers by taxon (33 C, 1 F)
21:08 <Nemo_bis> Cornflowers in Baden-Württemberg (1 C)
21:08 <Nemo_bis> Cornflowers in Rhein-Neckar-Kreis (5 F)
21:15 <emijrp> db48x: can you compare the filenames from the .csv with those in your directory?
21:15 <emijrp> and detect the 2 missing ones, this is important
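A sketch only, not the project's tooling, for finding which of the ~790 files listed for 2004-09 never made it to disk. The column positions are assumptions (filename in the first '|'-separated field, upload timestamp in the second); adjust them to match the real commonssql.csv:

```python
import csv
import os

# Filenames the dump says belong to September 2004.
with open('commonssql.csv') as f:
    wanted = {row[0] for row in csv.reader(f, delimiter='|')
              if len(row) > 1 and row[1].startswith('200409')}

# Filenames actually downloaded (ignoring the .desc description files).
have = set()
for _root, _dirs, files in os.walk('2004'):
    have.update(name for name in files if not name.endswith('.desc'))

print('missing:', sorted(wanted - have))
```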