[16:51] i need some volunteers to archive http://commons.wikimedia.org, it contains 12M files, but the first chunk is about 1M files ~= 500 gb, that chunk is made of daily chunks
[16:51] im going to upload the script and feed list
[16:58] emijrp, what do you need?
[16:59] what I can offer is limited technical knowledge, quite a lot of bandwidth and some disk space :)
[16:59] if you put the script somewhere and a list of chunks to claim on archiveteam.org, I'll do as much as possible
[17:09] wait a second
[17:20] http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
[17:20] emijrp, thanks
[17:21] I can download 150 GiB or something to start with
[17:21] unless you want to crowdsource it more
[17:21] first i want just a test
[17:21] ok
[17:22] not today though
[17:22] also, why 7z? doesn't look very useful for images
[17:22] and zip or tar can be browsed with the IA tools
[17:22] (without downloading)
[17:25] every day-by-day folder contains files + .xml with descriptions
[17:25] thats why i chose 7z, for the xml
[17:25] but if zip is browsable..
[17:25] by the way, some days have +5000 pics
[17:26] can it browse that?
[17:26] I think so
[17:26] It has some problems also with huge archives, like 5-10 GiB
[17:27] limit unclear
[17:27] s/also/only/
[17:29] that would be the size for every package
[17:29] 1mb/image, 5000 images, 5gb
[17:32] should be ok
[18:48] ok, we can make some tests http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Archiving_process
[18:48] i have tested the script, but i hope we can find any errors before we start to DDoS wikipedia servers
[18:48] by the way, my upload stream is shit, so i wont use this script much, irony
[18:52] Nemo_bis: can you get from 2004-09-07 to 2004-09-30 ?
[18:56] emijrp, I can run it, but not look much at it today, no time
[19:01] emijrp, I download that 7z and uncompress it in the working directory?
[19:02] yes
[19:03] emijrp, running
[19:04] emijrp, as they're small enough, should we put them in a rather big item on archive.org?
[19:04] it creates zips by day, i think we can follow that format
[19:04] like the pagecounts from 2007
[19:05] do you upload 1 to 1 or in batches?
[19:05] emijrp, pagecounts?
[19:05] those are in monthly items
[19:05] domas visits logs
[19:06] yes, but those are single files
[19:06] if you upload several .zips into one item, can you browse them separately?
[19:06] emijrp, yes
[19:07] ok, in that case we can create one item per month
[19:07] but anyway you need to add the link
[19:07] I don't think it links to the browsing thingy automatically
[19:07] not always at least
[19:08] you create an item, and upload all the 30 zips for a month, then they are listed
[19:08] oh db48x is downloading september too
[19:08] look at the wiki
[19:09] emijrp, yes, but how do you link to the zipview.php thingy?
[19:09] ah, you edit conflicted me
[19:10] -rw-rw-r-- 1 federico federico 11M 2012-02-27 20:07 2004-09-08.zip
[19:10] -rw-rw-r-- 1 federico federico 560K 2012-02-27 20:04 2004-09-07.zip
[19:11] -rw-rw-r-- 1 federico federico 3,7M 2012-02-27 20:09 2004-09-09.zip
[19:13] not bad
[19:13] : )
[19:13] i hope we can archive Commons finally
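As an illustration of the per-day packaging discussed above (each day's folder of images plus description files zipped into something like 2004-09-08.zip, up to ~5 GB per package), here is a minimal Python sketch. The directory layout and naming are assumptions, not the actual script's interface:

```python
import os
import zipfile

def pack_day(day_dir):
    """Pack one day's folder (images + description files) into <day>.zip.

    Assumption: day_dir is a directory named like '2004-09-08' holding
    the downloaded files and their description sidecars.
    """
    day = os.path.basename(os.path.normpath(day_dir))
    # allowZip64 matters once a day's archive approaches the 4 GiB zip limit
    with zipfile.ZipFile(day + ".zip", "w", zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
        for root, _dirs, files in os.walk(day_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                # keep the day folder as a prefix inside the archive,
                # matching the 2004-09-08.zip listing above
                zf.write(path, os.path.join(day, os.path.relpath(path, day_dir)))

if __name__ == "__main__":
    pack_day("2004-09-08")
```

Deflate gains little on JPEGs but keeps the description files small, and the resulting zip can still be browsed with the IA tools (zipview.php) without downloading the whole archive.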
[19:15] db48x, maybe you can do the next month?
[19:16] The internet archive prefers zips?
[19:16] soultcer, SketchCow said so
[19:17] Kinda surprising, I always thought tar was the "most future proof" format
[19:17] Nemo_bis: I claimed this one first :)
[19:17] db48x, no, you only saved first
[19:17] same thing :)
[19:17] I claimed first and you stole it because you were in the wrong channel :p
[19:18] I didn't start editing the page until after I had checked out the software from svn, downloaded the csv file and started the download :)
[19:19] anyway, I don't think duplicates will hurt us
[19:19] come guys
[19:19] still wrong :p
[19:19] we're just kidding :(
[19:19] come on *
[19:19] emijrp, if you create a standard description then everyone can upload to IA without errors
[19:20] emijrp: What do you think about adding a hash check?
[19:20] * Nemo_bis switching to next month
[19:20] Since you already have direct access to the commons database, you could just add another field to the csv file
[19:20] they use sha1 in base 36, i dont know how to compare that with the downloaded pics
[19:21] sha1sum works in another base
[19:21] Easy as pie
[19:22] Hm, do you mind if I play around with your script for a bit and give you some patches?
[19:22] by the way, i dont think we will have many corrupted files..
[19:22] ok
[19:22] I'm paranoid about flipped bits, even though odds are that we won't have a single corrupted image
[19:23] it's more likely than you think!
[19:23] I have my db server in an underground room to prevent cosmic rays from getting in
[19:23] october is not a month http://www.archiveteam.org/index.php?title=Wikimedia_Commons&curid=461&diff=7312&oldid=7311&rcid=10599
[19:24] soultcer: but CENTIPEDES!
[19:25] I have a cat that eats those things for breakfast
[19:25] emijrp, {{cn}}
[19:26] emijrp, is that field actually populated?
[19:26] sha1sum ? no
[19:28] soultcer: memes are lost on the young.
[19:28] I try to not follow internet culture too much
[19:28] It's distracting
[19:28] this is -ancient- meme
[19:30] old as internets
[19:30] http://www.cigarsmokers.com/threads/2898-quot-centipedes-In-MY-vagina-quot
[19:30] photoshop of an even older ad "Porn? on MY PC?!?"
[19:31] emijrp, I meant this perhaps https://bugzilla.wikimedia.org/show_bug.cgi?id=17057
[19:32] soultcer: look at that bug, looks like sha1 in the wikimedia database sucks
[19:32] ignore everything on that site but the picture
[19:33] emijrp, Yeah well that shit happens if you deploy one of the largest websites from svn trunk
[19:34] soultcer, they no longer do so actually
[19:35] We could check the hashes and then tell them which images have the wrong hashes :-)
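On the hash question raised above: MediaWiki stores file hashes as the SHA-1 digest written in base 36, so a hash check would first have to render a local sha1sum in that form. A rough sketch follows; the 31-character zero-padding is an assumption taken from MediaWiki's img_sha1 field, and the csv hash column it would compare against is hypothetical:

```python
import hashlib

def sha1_base36(path, width=31):
    """SHA-1 of a local file, rendered the way MediaWiki stores img_sha1:
    the digest as a base-36 number, left-padded with zeros
    (width=31 is an assumption based on MediaWiki's schema)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    n = int(h.hexdigest(), 16)
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(width, "0")

# hypothetical usage, assuming the csv gained a sha1 column:
# if sha1_base36(local_path) != row_sha1:
#     print("corrupted or changed:", local_path)
```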
[19:36] Or emijrp could download all those files from the toolserver claiming he's just checking hashes, then upload everything to archive.org
[19:36] Wikimedia Germany would probably not approve
[19:36] I doubt they care
[19:36] the germans can suck it
[19:36] But the toolserver belongs to them
[19:36] it's just 500 MB
[19:36] *GB
[19:37] there's a user downloading 3 TB of images for upload to Commons on TS
[19:37] topic?
[19:37] emijrp, dunno, US-gov stuff
[19:37] planes, war
[19:38] hopefully not
[19:38] nuclear reactors, cornfields, bridges, moon landing
[19:39] how many images are 3tb?
[19:41] emijrp, dunno, they're huge TIFFs IIRC
[19:42] like 20 Mb each
[19:42] ask multichill
[19:42] or check [[Commons:Batch uploading]] or whatever
[19:43] seriously though, emijrp, I think this first batch could also be done on TS
[19:43] why ?
[19:43] perhaps everything is too much given that the WMF says they're working on it
[19:44] well, you could run it very fast there I guess, but it's ok if we manage it this way (it's quite CPU-consuming apparently)
[19:44] btw emijrp, do you use https://wiki.toolserver.org/view/SGE ? I see those hosts are basically idle
[19:45] lots of CPU to (ab)use there
[19:45] no
[19:45] I don't think CPU is the limit
[19:51] unless you download at 100 MiB/s or something I doubt you can harm much
[19:55] emijrp: Any specific reason why you used wget and curl instead of urllib?
[19:57] i had issues in the past trying to get stuff with urllib from wikipedia servers
[19:58] Weird
[19:58] Hm, what do you do about unicode in the zip files, afaik zip doesn't have any concept of filename encoding
[20:00] Nevermind, I see they fixed that and the Info-ZIP program will write the unicode filenames as a special metadata field
[20:03] i did a test with hebrew and arabic filenames
[20:09] probably going to grab a year next
[20:12] check the size estimates http://www.archiveteam.org/index.php?title=Wikimedia_Commons#Size_stats
[20:18] 2006 is only 327 GB
[20:18] I'd want to run a number of copies though
[20:26] emijrp, doesn't the script delete files after zipping them?
[20:26] no
[20:26] i dont like scripts deleting stuff
[20:27] hm
[20:29] well, I can't even see the bandwidth consumption in my machine graphs
[20:29] should spawn a separate thread for every day
[20:31] yep
[20:31] thats the point for day-by-day download too
[20:31] multiple threads
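A sketch of the thread-per-day idea raised just above, using a small pool rather than one thread per day. The per-day worker and the commonsdownloader.py invocation it shells out to are placeholders, since the real script's command line isn't shown here:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def download_day(day):
    # hypothetical per-day invocation: the real script's name and
    # arguments may differ, this is just a placeholder worker
    subprocess.check_call(["python", "commonsdownloader.py", day])

days = ["2004-09-%02d" % d for d in range(7, 31)]

# a handful of workers is plenty; the limit is bandwidth, not CPU
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(download_day, days))  # consume the iterator so errors surface
```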
[20:34] size estimates are probably low
[20:34] 2004-09 was estimated at 180 Mb, it came to 206 Mb after compression
[20:40] that's descriptions
[20:45] unlikely
[20:46] only 8.3 megs of those
[20:47] estimates don't include old versions of pics
[20:47] ah
[20:47] some pics are reuploaded with changes (white balancing, photoshop, etc)
[20:48] those are the YYYYMMDDHHMMSS!... filenames
[20:48] don't worry if you see a 2006.....! pic in a 2005 folder, or whatever
[20:49] heh
[20:49] lots of 2011 images in the 2004 folder
[20:49] those timestamped filenames contain the date when the image was reuploaded
[20:49] yes Nemo_bis, but the 2011 means that the 2004 image was changed in 2011, not a 2011 image
[20:50] emijrp, yes, "2011 images" = "images with filename 2011*"
[20:50] it is weird, mediawiki way of life..
[20:50] not so weird, just annoying that there's no permalink
[20:50] (for new images, and IIRC)
[20:51] grep "|2005" commonssql.csv | awk -F"|" '{x += $5} END {print "sum: "x/1024/1024/1024}'
[20:51] sum: 111.67
[20:51] gb
[20:53] grep "|200409" commonssql.csv -c
[20:53] says 790 pics
[20:53] but you wrote 788 in the wiki
[20:54] db48x:
[20:55] 2 missing files or you didnt count right
[21:00] im going to announce this commons project in the mailing list
[21:01] 2004-10-27 has sooooo many flowers
[21:02] not only that day, commons is full of flower lovers
[21:02] gaaaaaaaaaaaaaaaaaaaaaaaay
[21:03] hm?
[21:04] there's not even a featured flower project https://commons.wikimedia.org/wiki/Commons:Featured_Kittens
[21:06] I counted the files it actually downloaded
[21:06] find 2004 -type f | grep -v '\.desc$' | wc -l
[21:08] lol
[21:08] [×] Aesculus hippocastanum flowers (53 F)
[21:08] [−] Centaurea cyanus (2 C, 1 P, 139 F)
[21:08] [−] Cornflowers by country (1 C)
[21:08] [−] Cornflowers in Germany (1 C, 3 F)
[21:08] [−] Flowers by taxon (33 C, 1 F)
[21:08] [−] Cornflowers in Baden-Württemberg (1 C)
[21:08] [×] Cornflowers in Rhein-Neckar-Kreis (5 F)
[21:15] db48x: can you compare filenames from .csv with those in your directory?
[21:15] and detect the 2 missing ones, this is important
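For the closing request, finding which of the 790 filenames in commonssql.csv never made it to disk, a sketch of the comparison. The pipe-delimited format matches the awk one-liner above, but the column holding the filename (name_field) is a guess, and any escaping differences between csv names and on-disk names are ignored:

```python
import os

def downloaded_names(root):
    """Filenames actually on disk under root, skipping the .desc sidecars
    (same filter as the find | grep -v .desc count above)."""
    names = set()
    for dirpath, _dirs, files in os.walk(root):
        names.update(n for n in files if not n.endswith(".desc"))
    return names

def csv_names(csv_path, datestamp, name_field=1):
    """Filenames listed in the pipe-delimited csv for one month or day.

    Assumptions: rows are '|'-delimited as in the awk one-liner above,
    the date is matched the same way as `grep "|200409"`, and name_field
    is a guessed column index for the filename.
    """
    names = set()
    with open(csv_path) as f:
        for line in f:
            if "|" + datestamp in line:
                fields = line.rstrip("\n").split("|")
                if len(fields) > name_field:
                    names.add(fields[name_field])
    return names

missing = csv_names("commonssql.csv", "200409") - downloaded_names("2004")
for name in sorted(missing):
    print(name)
```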