01:46 <underscor> soultcer: We would leave it up
01:46 <underscor> Unless a DMCA is explicitly filed
01:46 <underscor> It's a don't-ask-don't-tell thing
01:46 <underscor> ;)
09:09 <Nemo_bis> oh, so Obama will remove it too?
09:53 <ersi> It's the freckin Internet Archive
09:53 <ersi> not a spam-ridden-advertisement-crap-site
10:28 <chronomex> as far as the rapidshare-clone sites go, megaupload was one of the better ones
13:12 <soultcer> underscor: If you can use your vm at the internet archive then archiving wikimedia commons one at a time should be easy
13:39 <Nemo_bis> does he have 18 TB?
13:39 <Nemo_bis> anyway, WMF folks are going to set up an rsync soon
13:40 <soultcer> No, but if we just download all images from one day, put them into a tar archive and then put it into the internet archive storage it will work
13:49 <Nemo_bis> I wouldn't call it "easy", though :)
13:49 <Nemo_bis> and perhaps the WMF will do the tarballs
13:50 <soultcer> In 10 years maybe
13:50 <soultcer> They don't even have an offsite backup
13:52 <Nemo_bis> actually, maybe a few weeks, they were discussing it yesterday
13:52 <soultcer> Link?
14:01 <Nemo_bis> it was on IRC
14:01 <Nemo_bis> anyway, fresh of print: https://wikitech.wikimedia.org/view/Dumps/Development_2012
14:02 <soultcer> Cool, thanks
14:04 <soultcer> It says waiting for contact on archive.org there, I hope they find someone
19:07 <underscor> If anyone sees emijrp, let him know I can talk with wikimedia people about that
19:08 <underscor> since he's an admin there or something
19:08 <underscor> soultcer: Nemo_bis I don't have 18TB at once, but we can't put things that big anyway
19:08 <underscor> Things in the archive system shouldn't be more than about 10GB
19:09 <underscor> I was thinking having an item for each Category: namespace
19:09 <underscor> Each with its own date
19:09 <soultcer> It would be easier without categories. Categories can change
19:09 <underscor> like wmf_20110121_categoryname
19:09 <underscor> How would you logically break it up then?
19:10 <soultcer> Just by date
19:10 <soultcer> Then you could basically take the database dump of the image table and go bananas on it
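The image-table approach soultcer describes can be sketched roughly like this. In the MediaWiki schema the `image` table's `img_timestamp` column is a `YYYYMMDDHHMMSS` string, so grouping a dump's rows into per-day buckets is just a prefix split; the filenames and rows below are made up for illustration.

```python
from collections import defaultdict

# Rows as (img_name, img_timestamp) pairs, as they appear in the
# MediaWiki `image` table; img_timestamp is a YYYYMMDDHHMMSS string.
def bucket_by_day(rows):
    """Group image names by upload day (the YYYYMMDD prefix)."""
    days = defaultdict(list)
    for name, ts in rows:
        days[ts[:8]].append(name)
    return dict(days)

# Illustrative rows, not real Commons data:
rows = [
    ("Example_one.jpg", "20110121093000"),
    ("Example_two.png", "20110121170502"),
    ("Older_file.svg",  "20110120120000"),
]
buckets = bucket_by_day(rows)
# buckets["20110121"] -> ["Example_one.jpg", "Example_two.png"]
```

Each resulting bucket would then correspond to one day's tarball.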
19:11 <soultcer> They don't want people to get old image versions according to their robots.txt, but we could also add the oldimage table and get old image revisions too no problem
19:11 <underscor> You mean like everything uploaded on 2011-01-21, then everything uploaded on 2011-01-20...
19:11 <underscor> etcetera
19:11 <soultcer> Yes
19:11 <soultcer> Best would of course be to just give the wikipedia guys an introduction into the s3 api and an item where they can upload
19:11 <soultcer> s/item/bucket/
19:12 <underscor> They can't just do that though, with how the s3 api works
19:12 <underscor> It's 10GB per item
19:12 <underscor> and item = bucket
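For reference, an upload through the IA S3-like API is a plain HTTP PUT to `s3.us.archive.org` with a `LOW accesskey:secret` authorization header, where the bucket name is the item identifier. A minimal sketch of assembling such a request follows; the item name and keys are placeholders, and the `wmf_upload_YYYYMMDD` naming is only an assumed convention based on the per-day scheme discussed above.

```python
# Hypothetical helper: build the URL and headers for a PUT to the
# Internet Archive's S3-like endpoint. Item name and keys are made up.
def ia_s3_put(item, filename, access_key, secret_key):
    url = f"https://s3.us.archive.org/{item}/{filename}"
    headers = {
        "authorization": f"LOW {access_key}:{secret_key}",
        # create the item (bucket) automatically if it doesn't exist yet
        "x-archive-auto-make-bucket": "1",
    }
    return url, headers

url, headers = ia_s3_put("wmf_upload_20110121", "uploads.tar",
                         "ACCESSKEY", "SECRETKEY")
# An actual upload would then be something like:
#   requests.put(url, data=open("uploads.tar", "rb"), headers=headers)
```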
19:13 <soultcer> Oh, collection then
19:13 <soultcer> I sometimes confuse Amazon S3 stuff with IA S3 stuff
19:20 <underscor> Oh, yeah
19:20 <underscor> Sorry, haha
19:21 <underscor> Yeah, if they're up for that, it's the best solution
19:21 <underscor> I kinda want to figure out a way we can do it without having these massive tarballs, but I dunno how feasible that is
19:21 <soultcer> If not, would it be possible to install mysql on your IA VM? How much space does it have?
19:21 <soultcer> What's wrong with massive tarballs?
19:21
🔗
|
underscor |
I have roughtly 2TB I can play with at any one point |
19:22
🔗
|
underscor |
Nothing particularly, just not easy to go restore a single image |
19:24
🔗
|
soultcer |
I don't think the IA S3 system will like some 15 million files with a couple of hundred kilobytes each |
19:30 <underscor> The IA doesn't really use S3 structure, the s3 is just a translation layer to the storage system they use.
19:30
🔗
|
underscor |
The internal logic should handle it fine |
19:31
🔗
|
underscor |
Well, actually, I guess that depends on whether there are more than about 1000 files uploaded a day |
19:31
🔗
|
underscor |
Because that's the practical limit on an itemsize |
19:31
🔗
|
soultcer |
Maybe we should create smaller tarballs in the size of 100 mb or so each |
19:32
🔗
|
underscor |
With an index, too |
19:32
🔗
|
underscor |
Yeah, that'd be a good idea |
19:32
🔗
|
soultcer |
If we index them on something like filename or upload date you won't really need an external index |
19:34
🔗
|
soultcer |
Upload date is good because then we can start with old images that are unlikely to be deleted. But you will always need a copy of the wikipedia db or some other index to see when the file you are looking for was uploaded |
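The fixed-size-chunks-plus-index idea above can be sketched as a planning step: pack a day's (filename, size) entries into roughly 100 MB groups and record which tarball each file lands in. The `wmf_YYYYMMDD_partN.tar` names are a hypothetical convention, and real code would go on to actually tar the listed files.

```python
# Sketch: pack (filename, size) entries into ~100 MB chunks and keep a
# filename -> tarball index, so restoring one image doesn't mean
# scanning every tarball. Naming convention is assumed, not official.
CHUNK_LIMIT = 100 * 1024 * 1024  # 100 MB

def plan_chunks(entries, day, limit=CHUNK_LIMIT):
    chunks, index = [], {}
    current, used, part = [], 0, 0
    for name, size in entries:
        # start a new tarball when the next file would overflow this one
        if current and used + size > limit:
            chunks.append((f"wmf_{day}_part{part}.tar", current))
            part += 1
            current, used = [], 0
        current.append(name)
        used += size
        index[name] = f"wmf_{day}_part{part}.tar"
    if current:
        chunks.append((f"wmf_{day}_part{part}.tar", current))
    return chunks, index
```

A single oversized file still gets its own chunk here, so no entry is ever dropped.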
19:49 <Nemo_bis> the problem with daily tarballs is that their size can indeed vary a lot
19:49
🔗
|
Nemo_bis |
but they won't easily become unmanageable |
19:50
🔗
|
soultcer |
You can always split the daily tarball into multiple daily tarballs with a predefined size |
21:46
🔗
|
underscor |
True |
21:46
🔗
|
underscor |
soultcer: We could always keep a copy of the images table in sqldump format |
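Keeping the image table as an SQL dump amounts to a single `mysqldump` of that one table. A minimal sketch of assembling the command follows; host, database, and user names are placeholders.

```python
# Hypothetical: build the mysqldump invocation for just the `image`
# metadata table, to store alongside the tarballs as the lookup index.
def image_table_dump_cmd(host, db, user):
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot without locking
        "-h", host,
        "-u", user,
        db,
        "image",                 # dump only the image metadata table
    ]

cmd = image_table_dump_cmd("db.example.org", "commonswiki", "backup")
# subprocess.run(cmd, stdout=open("image.sql", "wb")) would execute it
```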