[01:46] soultcer: We would leave it up
[01:46] Unless a DMCA is explicitly filed
[01:46] It's a don't-ask-don't-tell thing
[01:46] ;)
[09:09] oh, so Obama will remove it too?
[09:53] It's the freakin' Internet Archive
[09:53] not a spam-ridden-advertisement-crap-site
[10:28] as far as the rapidshare-clone sites go, megaupload was one of the better ones
[13:12] underscor: If you can use your VM at the Internet Archive, then archiving Wikimedia Commons one at a time should be easy
[13:39] does he have 18 TB?
[13:39] anyway, WMF folks are going to set up an rsync soon
[13:40] No, but if we just download all images from one day, put them into a tar archive, and then put that into the Internet Archive storage, it will work
[13:49] I wouldn't call it "easy", though :)
[13:49] and perhaps the WMF will do the tarballs
[13:50] In 10 years maybe
[13:50] They don't even have an offsite backup
[13:52] actually, maybe a few weeks, they were discussing it yesterday
[13:52] Link?
[14:01] it was on IRC
[14:01] anyway, fresh off the press: https://wikitech.wikimedia.org/view/Dumps/Development_2012
[14:02] Cool, thanks
[14:04] It says "waiting for contact on archive.org" there, I hope they find someone
[19:07] If anyone sees emijrp, let him know I can talk with Wikimedia people about that
[19:08] since he's an admin there or something
[19:08] soultcer: Nemo_bis: I don't have 18 TB at once, but we can't put things that big anyway
[19:08] Things in the archive system shouldn't be more than about 10 GB
[19:09] I was thinking of having an item for each Category: namespace
[19:09] Each with its own date
[19:09] It would be easier without categories. Categories can change
[19:09] like wmf_20110121_categoryname
[19:09] How would you logically break it up then?
[19:10] Just by date
[19:10] Then you could basically take the database dump of the image table and go bananas on it
[19:11] They don't want people to get old image versions according to their robots.txt, but we could also add the oldimage table and get old image revisions too, no problem
[19:11] You mean like everything uploaded on 2011-01-21, then everything uploaded on 2011-01-20...
[19:11] etcetera
[19:11] Yes
[19:11] Best would of course be to just give the Wikipedia guys an introduction to the S3 API and an item where they can upload
[19:11] s/item/bucket/
[19:12] They can't just do that though, with how the S3 API works
[19:12] It's 10 GB per item
[19:12] and item = bucket
[19:13] Oh, collection then
[19:13] I sometimes confuse Amazon S3 stuff with IA S3 stuff
[19:20] Oh, yeah
[19:20] Sorry, haha
[19:21] Yeah, if they're up for that, it's the best solution
[19:21] I kinda want to figure out a way we can do it without having these massive tarballs, but I dunno how feasible that is
[19:21] If not, would it be possible to install MySQL on your IA VM? How much space does it have?
[19:21] What's wrong with massive tarballs?
[19:21] I have roughly 2 TB I can play with at any one point
[19:22] Nothing in particular, it's just not easy to restore a single image
[19:24] I don't think the IA S3 system will like some 15 million files of a couple hundred kilobytes each
[19:30] The IA doesn't really use an S3 structure; the S3 API is just a translation layer to the storage system they use.
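A minimal sketch of the per-day workflow discussed above: download one day's uploads, pack them into a tarball named after the date (following the wmf_20110121 naming pattern floated at 19:09), and PUT the result into an Internet Archive item through the S3-compatible endpoint. It assumes a plain-text list of original file URLs for that day already exists, that IA keys are available as IA_ACCESS/IA_SECRET environment variables, and that the s3.us.archive.org headers shown are acceptable for the target collection; none of these specifics come from the log, and staying under the ~10 GB per item figure mentioned above is left to the caller.

```python
#!/usr/bin/env python3
# Sketch only: fetch the Commons files uploaded on a single day, pack them
# into one tarball, and PUT it into an Internet Archive item via the
# S3-compatible endpoint. Error handling is deliberately minimal.

import os
import sys
import tarfile
import urllib.parse

import requests

IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_daily_tarball(url_list_path, day, workdir="."):
    """Download every listed file and add it to wmf_<day>.tar."""
    tar_path = os.path.join(workdir, f"wmf_{day}.tar")
    with tarfile.open(tar_path, "w") as tar:
        with open(url_list_path) as fh:
            for line in fh:
                url = line.strip()
                if not url:
                    continue
                name = urllib.parse.unquote(url.rsplit("/", 1)[-1])
                local = os.path.join(workdir, name)
                with requests.get(url, stream=True, timeout=60) as r:
                    r.raise_for_status()
                    with open(local, "wb") as out:
                        for chunk in r.iter_content(1 << 20):
                            out.write(chunk)
                tar.add(local, arcname=name)
                os.remove(local)  # keep only the tarball on disk
    return tar_path


def upload_to_ia(tar_path, identifier, access, secret):
    """PUT the tarball into an IA item (auto-created if it doesn't exist)."""
    headers = {
        "authorization": f"LOW {access}:{secret}",
        "x-amz-auto-make-bucket": "1",
        "x-archive-meta-mediatype": "data",
        "x-archive-meta-title": identifier,
    }
    url = f"{IA_S3_ENDPOINT}/{identifier}/{os.path.basename(tar_path)}"
    with open(tar_path, "rb") as fh:
        resp = requests.put(url, data=fh, headers=headers)
    resp.raise_for_status()


if __name__ == "__main__":
    day = sys.argv[1]       # e.g. 20110121
    url_list = sys.argv[2]  # plain-text file, one image URL per line
    tarball = build_daily_tarball(url_list, day)
    upload_to_ia(tarball, f"wmf_{day}",
                 os.environ["IA_ACCESS"], os.environ["IA_SECRET"])
```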
[19:30] The internal logic should handle it fine
[19:31] Well, actually, I guess that depends on whether there are more than about 1000 files uploaded a day
[19:31] Because that's the practical limit on an item's size
[19:31] Maybe we should create smaller tarballs of 100 MB or so each
[19:32] With an index, too
[19:32] Yeah, that'd be a good idea
[19:32] If we index them on something like filename or upload date, you won't really need an external index
[19:34] Upload date is good because then we can start with old images that are unlikely to be deleted. But you will always need a copy of the Wikipedia DB or some other index to see when the file you are looking for was uploaded
[19:49] the problem with daily tarballs is that their size can indeed vary a lot
[19:49] but they won't easily become unmanageable
[19:50] You can always split the daily tarball into multiple daily tarballs with a predefined size
[21:46] True
[21:46] soultcer: We could always keep a copy of the image table in SQL dump format
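A sketch of the ~100 MB tarball-plus-index idea from 19:31–19:50: one day's files are packed into fixed-size tar volumes, and a plain-text index records which volume holds each filename, so a single image can be restored without unpacking the whole day. The 100 MB threshold, the file names, and sorting by filename are illustrative choices, not something settled in the log.

```python
# Sketch only: split one day's files into ~100 MB tar volumes and write a
# filename -> volume index alongside them.

import os
import tarfile

VOLUME_LIMIT = 100 * 1024 * 1024  # ~100 MB per tarball, as suggested above


def split_into_volumes(file_paths, day, outdir="."):
    """Pack file_paths into wmf_<day>_partNNNN.tar volumes plus an index."""
    index_path = os.path.join(outdir, f"wmf_{day}.index.txt")
    volume_no, current_size, tar, volume_name = 0, 0, None, None
    with open(index_path, "w") as index:
        for path in sorted(file_paths):
            size = os.path.getsize(path)
            # Start a new volume when the current one would grow past the limit.
            if tar is None or (current_size and current_size + size > VOLUME_LIMIT):
                if tar is not None:
                    tar.close()
                volume_no += 1
                volume_name = f"wmf_{day}_part{volume_no:04d}.tar"
                tar = tarfile.open(os.path.join(outdir, volume_name), "w")
                current_size = 0
            arcname = os.path.basename(path)
            tar.add(path, arcname=arcname)
            index.write(f"{arcname}\t{volume_name}\n")
            current_size += size
        if tar is not None:
            tar.close()
    return index_path
```

Sorting by upload date instead of filename would match the "start with old images that are unlikely to be deleted" ordering suggested at 19:34.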
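For the image-table copy mentioned at 21:46, one low-effort option is to mirror the published SQL dump of Commons' image table next to the tarballs, so the "when was this file uploaded" lookup does not depend on the live wiki. The exact dumps.wikimedia.org URL pattern below is an assumption about the public dumps layout, not something stated in the log.

```python
# Sketch only: keep a local copy of the Commons image table dump.
import requests

# Assumed URL pattern for the public image table dump; verify before use.
IMAGE_TABLE_DUMP = (
    "https://dumps.wikimedia.org/commonswiki/latest/"
    "commonswiki-latest-image.sql.gz"
)


def fetch_image_table(dest="commonswiki-image.sql.gz"):
    """Download the image table dump to dest and return the local path."""
    with requests.get(IMAGE_TABLE_DUMP, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in r.iter_content(1 << 20):
                out.write(chunk)
    return dest
```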