01:46 <underscor> soultcer: We would leave it up
01:46 <underscor> Unless a DMCA is explicitly filed
01:46 <underscor> It's a don't-ask-don't-tell thing
01:46 <underscor> ;)
09:09 <Nemo_bis> oh, so Obama will remove it too?
09:53 <ersi> It's the freckin Internet Archive
09:53 <ersi> not a spam-ridden-advertisement-crap-site
10:28 <chronomex> as far as the rapidshare-clone sites go, megaupload was one of the better ones
13:12 <soultcer> underscor: If you can use your vm at the internet archive then archiving wikimedia commons one at a time should be easy
13:39 <Nemo_bis> does he have 18 TB?
13:39 <Nemo_bis> anyway, WMF folks are going to set up an rsync soon
13:40 <soultcer> No, but if we just download all images from one day, put them into a tar archive and then put it into the internet archive storage it will work
13:49 <Nemo_bis> I wouldn't call it "easy", though :)
13:49 <Nemo_bis> and perhaps the WMF will do the tarballs
13:50 <soultcer> In 10 years maybe
13:50 <soultcer> They don't even have an offsite backup
13:52 <Nemo_bis> actually, maybe a few weeks, they were discussing it yesterday
13:52 <soultcer> Link?
14:01 <Nemo_bis> it was on IRC
14:01 <Nemo_bis> anyway, fresh of print: https://wikitech.wikimedia.org/view/Dumps/Development_2012
14:02 <soultcer> Cool, thanks
14:04 <soultcer> It says waiting for contact on archive.org there, I hope they find someone
19:07 <underscor> If anyone sees emijrp, let him know I can talk with wikimedia people about that
19:08 <underscor> since he's an admin there or something
19:08 <underscor> soultcer: Nemo_bis I don't have 18TB at once, but we can't put things that big anyway
19:08 <underscor> Things in the archive system shouldn't be more than about 10GB
19:09 <underscor> I was thinking having an item for each Category: namespace
19:09 <underscor> Each with its own date
19:09 <soultcer> It would be easier without categories. Categories can change
19:09 <underscor> like wmf_20110121_categoryname
19:09 <underscor> How would you logically break it up then?
19:10 <soultcer> Just by date
19:10 <soultcer> Then you could basically take the database dump of the image table and go bananas on it
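The image-table approach soultcer describes can be sketched roughly like this. In the MediaWiki schema the `image` table's `img_timestamp` column is a `YYYYMMDDHHMMSS` string, so grouping a dump's rows into per-day buckets is just a prefix split; the filenames and rows below are made up for illustration.

```python
from collections import defaultdict

# Rows as (img_name, img_timestamp) pairs, as they appear in the
# MediaWiki `image` table; img_timestamp is a YYYYMMDDHHMMSS string.
def bucket_by_day(rows):
    """Group image names by upload day (the YYYYMMDD prefix)."""
    days = defaultdict(list)
    for name, ts in rows:
        days[ts[:8]].append(name)
    return dict(days)

# Illustrative rows, not real Commons data:
rows = [
    ("Example_one.jpg", "20110121093000"),
    ("Example_two.png", "20110121170502"),
    ("Older_file.svg",  "20110120120000"),
]
buckets = bucket_by_day(rows)
# buckets["20110121"] -> ["Example_one.jpg", "Example_two.png"]
```

Each resulting bucket would then correspond to one day's tarball.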
19:11 <soultcer> They don't want people to get old image versions according to their robots.txt, but we could also add the oldimage table and get old image revisions too no problem
19:11 <underscor> You mean like everything uploaded on 2011-01-21, then everything uploaded on 2011-01-20...
19:11 <underscor> etcetera
19:11 <soultcer> Yes
19:11 <soultcer> Best would of course be to just give the wikipedia guys an introduction into the s3 api and an item where they can upload
19:11 <soultcer> s/item/bucket/
19:12 <underscor> They can't just do that though, with how the s3 api works
19:12 <underscor> It's 10GB per item
19:12 <underscor> and item = bucket
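For reference, an upload through the IA S3-like API is a plain HTTP PUT to `s3.us.archive.org` with a `LOW accesskey:secret` authorization header, where the bucket name is the item identifier. A minimal sketch of assembling such a request follows; the item name and keys are placeholders, and the `wmf_upload_YYYYMMDD` naming is only an assumed convention based on the per-day scheme discussed above.

```python
# Hypothetical helper: build the URL and headers for a PUT to the
# Internet Archive's S3-like endpoint. Item name and keys are made up.
def ia_s3_put(item, filename, access_key, secret_key):
    url = f"https://s3.us.archive.org/{item}/{filename}"
    headers = {
        "authorization": f"LOW {access_key}:{secret_key}",
        # create the item (bucket) automatically if it doesn't exist yet
        "x-archive-auto-make-bucket": "1",
    }
    return url, headers

url, headers = ia_s3_put("wmf_upload_20110121", "uploads.tar",
                         "ACCESSKEY", "SECRETKEY")
# An actual upload would then be something like:
#   requests.put(url, data=open("uploads.tar", "rb"), headers=headers)
```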
19:13 <soultcer> Oh, collection then
19:13 <soultcer> I sometimes confuse Amazon S3 stuff with IA S3 stuff
19:20 <underscor> Oh, yeah
19:20 <underscor> Sorry, haha
19:21 <underscor> Yeah, if they're up for that, it's the best solution
19:21 <underscor> I kinda want to figure out a way we can do it without having these massive tarballs, but I dunno how feasible that is
19:21 <soultcer> If not, would it be possible to install mysql on your IA VM? How much space does it have?
19:21 <soultcer> What's wrong with massive tarballs?
19:21
🔗
|
underscor |
I have roughtly 2TB I can play with at any one point |
19:22
🔗
|
underscor |
Nothing particularly, just not easy to go restore a single image |
19:24
🔗
|
soultcer |
I don't think the IA S3 system will like some 15 million files with a couple of hundred kilobytes each |
19:30 <underscor> The IA doesn't really use S3 structure, the s3 is just a translation layer to the storage system they use.
19:30
🔗
|
underscor |
The internal logic should handle it fine |
19:31
🔗
|
underscor |
Well, actually, I guess that depends on whether there are more than about 1000 files uploaded a day |
19:31
🔗
|
underscor |
Because that's the practical limit on an itemsize |
19:31
🔗
|
soultcer |
Maybe we should create smaller tarballs in the size of 100 mb or so each |
19:32
🔗
|
underscor |
With an index, too |
19:32
🔗
|
underscor |
Yeah, that'd be a good idea |
19:32
🔗
|
soultcer |
If we index them on something like filename or upload date you won't really need an external index |
19:34
🔗
|
soultcer |
Upload date is good because then we can start with old images that are unlikely to be deleted. But you will always need a copy of the wikipedia db or some other index to see when the file you are looking for was uploaded |
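The fixed-size-chunks-plus-index idea above can be sketched as a planning step: pack a day's (filename, size) entries into roughly 100 MB groups and record which tarball each file lands in. The `wmf_YYYYMMDD_partN.tar` names are a hypothetical convention, and real code would go on to actually tar the listed files.

```python
# Sketch: pack (filename, size) entries into ~100 MB chunks and keep a
# filename -> tarball index, so restoring one image doesn't mean
# scanning every tarball. Naming convention is assumed, not official.
CHUNK_LIMIT = 100 * 1024 * 1024  # 100 MB

def plan_chunks(entries, day, limit=CHUNK_LIMIT):
    chunks, index = [], {}
    current, used, part = [], 0, 0
    for name, size in entries:
        # start a new tarball when the next file would overflow this one
        if current and used + size > limit:
            chunks.append((f"wmf_{day}_part{part}.tar", current))
            part += 1
            current, used = [], 0
        current.append(name)
        used += size
        index[name] = f"wmf_{day}_part{part}.tar"
    if current:
        chunks.append((f"wmf_{day}_part{part}.tar", current))
    return chunks, index
```

A single oversized file still gets its own chunk here, so no entry is ever dropped.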
19:49 <Nemo_bis> the problem with daily tarballs is that their size can indeed vary a lot
19:49
🔗
|
Nemo_bis |
but they won't easily become unmanageable |
19:50
🔗
|
soultcer |
You can always split the daily tarball into multiple daily tarballs with a predefined size |
21:46
🔗
|
underscor |
True |
21:46
🔗
|
underscor |
soultcer: We could always keep a copy of the images table in sqldump format |
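Keeping the image table as an SQL dump amounts to a single `mysqldump` of that one table. A minimal sketch of assembling the command follows; host, database, and user names are placeholders.

```python
# Hypothetical: build the mysqldump invocation for just the `image`
# metadata table, to store alongside the tarballs as the lookup index.
def image_table_dump_cmd(host, db, user):
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot without locking
        "-h", host,
        "-u", user,
        db,
        "image",                 # dump only the image metadata table
    ]

cmd = image_table_dump_cmd("db.example.org", "commonswiki", "backup")
# subprocess.run(cmd, stdout=open("image.sql", "wb")) would execute it
```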