#wikiteam 2012-01-21, Sat


Time Nickname Message
01:46 🔗 underscor soultcer: We would leave it up
01:46 🔗 underscor Unless a DMCA is explicitly filed
01:46 🔗 underscor It's a don't-ask-don't-tell thing
01:46 🔗 underscor ;)
09:09 🔗 Nemo_bis oh, so Obama will remove it too?
09:53 🔗 ersi It's the freckin Internet Archive
09:53 🔗 ersi not a spam-ridden-advertisement-crap-site
10:28 🔗 chronomex as far as the rapidshare-clone sites go, megaupload was one of the better ones
13:12 🔗 soultcer underscor: If you can use your VM at the Internet Archive, then archiving Wikimedia Commons one day at a time should be easy
13:39 🔗 Nemo_bis does he have 18 TB?
13:39 🔗 Nemo_bis anyway, WMF folks are going to set up an rsync soon
13:40 🔗 soultcer No, but if we just download all images from one day, put them into a tar archive and then put it into the internet archive storage it will work
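One way the per-day idea above could look in practice, sketched here with Python's requests and tarfile against the public MediaWiki API (list=allimages); the chosen date, filenames, and the lack of retry/error handling are illustrative assumptions, not anything agreed on in the channel:

```python
# Sketch: fetch everything uploaded to Commons on one (illustrative) day
# and pack it into a single tarball. Uses the standard MediaWiki API.
import io
import tarfile
import requests

API = "https://commons.wikimedia.org/w/api.php"
DAY_START = "2011-01-21T00:00:00Z"   # placeholder day
DAY_END = "2011-01-22T00:00:00Z"

def images_for_day():
    """Yield (name, url) for every file uploaded between DAY_START and DAY_END."""
    params = {
        "action": "query", "list": "allimages", "format": "json",
        "aisort": "timestamp", "aistart": DAY_START, "aiend": DAY_END,
        "aiprop": "url|timestamp", "ailimit": "500", "continue": "",
    }
    while True:
        data = requests.get(API, params=params).json()
        for img in data["query"]["allimages"]:
            yield img["name"], img["url"]
        if "continue" not in data:
            break
        params.update(data["continue"])

with tarfile.open("commons-uploads-20110121.tar", "w") as tar:
    for name, url in images_for_day():
        blob = requests.get(url).content
        info = tarfile.TarInfo(name=name)
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))
```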
13:49 🔗 Nemo_bis I wouldn't call it "easy", though :)
13:49 🔗 Nemo_bis and perhaps the WMF will do the tarballs
13:50 🔗 soultcer In 10 years maybe
13:50 🔗 soultcer They don't even have an offsite backup
13:52 🔗 Nemo_bis actually, maybe a few weeks, they were discussing it yesterday
13:52 🔗 soultcer Link?
14:01 🔗 Nemo_bis it was on IRC
14:01 🔗 Nemo_bis anyway, fresh off the press: https://wikitech.wikimedia.org/view/Dumps/Development_2012
14:02 🔗 soultcer Cool, thanks
14:04 🔗 soultcer It says they're waiting for a contact at archive.org; I hope they find someone
19:07 🔗 underscor If anyone sees emijrp, let him know I can talk with wikimedia people about that
19:08 🔗 underscor since he's an admin there or something
19:08 🔗 underscor soultcer: Nemo_bis I don't have 18TB at once, but we can't put things that big anyway
19:08 🔗 underscor Things in the archive system shouldn't be more than about 10GB
19:09 🔗 underscor I was thinking having an item for each Category: namespace
19:09 🔗 underscor Each with its own date
19:09 🔗 soultcer It would be easier without categories. Categories can change
19:09 🔗 underscor like wmf_20110121_categoryname
19:09 🔗 underscor How would you logically break it up then?
19:10 🔗 soultcer Just by date
19:10 🔗 soultcer Then you could basically take the database dump of the image table and go bananas on it
19:11 🔗 soultcer They don't want people to get old image versions, according to their robots.txt, but we could also add the oldimage table and get old image revisions too, no problem
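What working from the image and oldimage tables could look like, assuming a local MySQL copy of the Commons database and the stock MediaWiki schema; the connection details and the chosen day are placeholders:

```python
# Sketch: list one day's uploads (current and superseded revisions) from a
# local copy of the Commons database. Column names follow the stock MediaWiki
# schema; host, credentials, and the day are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="archiver",
                       password="secret", database="commonswiki")
DAY = "20110121"  # img_timestamp/oi_timestamp are stored as 14-digit strings

with conn.cursor() as cur:
    # Current file revisions uploaded that day
    cur.execute("SELECT img_name, img_size FROM image "
                "WHERE img_timestamp LIKE %s", (DAY + "%",))
    current = cur.fetchall()

    # Old revisions, the ones robots.txt hides from crawlers
    cur.execute("SELECT oi_name, oi_archive_name, oi_size FROM oldimage "
                "WHERE oi_timestamp LIKE %s", (DAY + "%",))
    old = cur.fetchall()

print(len(current), "current files and", len(old), "old revisions on", DAY)
```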
19:11 🔗 underscor You mean like everything uploaded on 2011-01-21, then everything uploaded on 2011-01-20...
19:11 🔗 underscor etcetera
19:11 🔗 soultcer Yes
19:11 🔗 soultcer Best would of course be to just give the Wikipedia guys an introduction to the s3 api and an item where they can upload
19:11 🔗 soultcer s/item/bucket/
19:12 🔗 underscor They can't just do that though, with how the s3 api works
19:12 🔗 underscor It's 10GB per item
19:12 🔗 underscor and item = bucket
19:13 🔗 soultcer Oh, collection then
19:13 🔗 soultcer I sometimes confuse Amazon S3 stuff with IA S3 stuff
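For context, pushing a finished tarball into an archive.org item through the IA S3-style API looks roughly like the sketch below; the keys, item identifier, and metadata values are placeholders, and x-amz-auto-make-bucket creates the item (bucket) on the first PUT:

```python
# Sketch: upload one tarball into an archive.org item via the IA S3-style API.
# Access/secret keys, item identifier, and metadata are placeholders.
import requests

ACCESS, SECRET = "IA_ACCESS_KEY", "IA_SECRET_KEY"
item = "wmf_uploads_20110121"                 # one item per day, well under ~10GB
filename = "commons-uploads-20110121-part001.tar"

with open(filename, "rb") as f:
    resp = requests.put(
        f"https://s3.us.archive.org/{item}/{filename}",
        data=f,
        headers={
            "authorization": f"LOW {ACCESS}:{SECRET}",
            "x-amz-auto-make-bucket": "1",     # create the item on first PUT
            "x-archive-meta-mediatype": "data",
            "x-archive-meta-title": "Wikimedia Commons uploads 2011-01-21",
        },
    )
resp.raise_for_status()
```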
19:20 🔗 underscor Oh, yeah
19:20 🔗 underscor Sorry, haha
19:21 🔗 underscor Yeah, if they're up for that, it's the best solution
19:21 🔗 underscor I kinda want to figure out a way we can do it without having these massive tarballs, but I dunno how feasible that is
19:21 🔗 soultcer If not, would it be possible to install MySQL on your IA VM? How much space does it have?
19:21 🔗 soultcer What's wrong with massive tarballs?
19:21 🔗 underscor I have roughly 2TB I can play with at any one point
19:22 🔗 underscor Nothing in particular, it's just not easy to restore a single image
19:24 🔗 soultcer I don't think the IA S3 system will like some 15 million files with a couple of hundred kilobytes each
19:30 🔗 underscor The IA doesn't really use S3 structure; S3 is just a translation layer to the storage system they use.
19:30 🔗 underscor The internal logic should handle it fine
19:31 🔗 underscor Well, actually, I guess that depends on whether there are more than about 1000 files uploaded a day
19:31 🔗 underscor Because that's the practical limit per item
19:31 🔗 soultcer Maybe we should create smaller tarballs of around 100 MB each
19:32 🔗 underscor With an index, too
19:32 🔗 underscor Yeah, that'd be a good idea
19:32 🔗 soultcer If we index them on something like filename or upload date, you won't really need an external index
19:34 🔗 soultcer Upload date is good because then we can start with old images that are unlikely to be deleted. But you will always need a copy of the wikipedia db or some other index to see when the file you are looking for was uploaded
19:49 🔗 Nemo_bis the problem with daily tarballs is that their size can indeed vary a lot
19:49 🔗 Nemo_bis but they won't easily become unmanageable
19:50 🔗 soultcer You can always split the daily tarball into multiple daily tarballs with a predefined size
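A sketch of that splitting step, assuming one day's files are already on disk; the directory name and the 100 MB threshold are illustrative, and sorting the filenames keeps each part covering a predictable range so the parts themselves act as a crude index:

```python
# Sketch: pack one (already downloaded) day's files into ~100 MB tarball parts.
# The source directory and size threshold are placeholders.
import os
import tarfile

SRC_DIR = "commons-uploads-20110121"
LIMIT = 100 * 1024 * 1024                      # ~100 MB per part

part, current_size, tar = 1, 0, None
for name in sorted(os.listdir(SRC_DIR)):       # sorted, so parts cover predictable ranges
    path = os.path.join(SRC_DIR, name)
    size = os.path.getsize(path)
    if tar is None or current_size + size > LIMIT:
        if tar:
            tar.close()
        tar = tarfile.open(f"commons-uploads-20110121-part{part:03d}.tar", "w")
        part, current_size = part + 1, 0
    tar.add(path, arcname=name)
    current_size += size
if tar:
    tar.close()
```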
21:46 🔗 underscor True
21:46 🔗 underscor soultcer: We could always keep a copy of the images table in sqldump format
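A minimal sketch of keeping that SQL copy of the images table, assuming mysqldump access to a local copy of the database; host, credentials, and the output filename are placeholders:

```python
# Sketch: snapshot the image table as a compressed SQL dump to keep alongside
# the tarballs. Connection details and the output name are placeholders.
import gzip
import subprocess

dump = subprocess.run(
    ["mysqldump", "--host=localhost", "--user=archiver",
     "--password=secret", "commonswiki", "image"],
    check=True, capture_output=True,
)

with gzip.open("commonswiki-image-20120121.sql.gz", "wb") as out:
    out.write(dump.stdout)
```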
