01:46 <underscor> soultcer: We would leave it up
01:46 <underscor> Unless a DMCA is explicitly filed
01:46 <underscor> It's a don't-ask-don't-tell thing
01:46 <underscor> ;)
09:09 <Nemo_bis> oh, so Obama will remove it too?
09:53 <ersi> It's the freckin Internet Archive
09:53 <ersi> not a spam-ridden-advertisement-crap-site
10:28 <chronomex> as far as the rapidshare-clone sites go, megaupload was one of the better ones
13:12 <soultcer> underscor: If you can use your vm at the internet archive then archiving wikimedia commons one at a time should be easy
13:39 <Nemo_bis> does he have 18 TB?
13:39 <Nemo_bis> anyway, WMF folks are going to set up an rsync soon
13:40 <soultcer> No, but if we just download all images from one day, put them into a tar archive and then put it into the internet archive storage it will work
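The per-day scheme soultcer describes (fetch one day's uploads, pack them into a tarball, then hand the tarball to Internet Archive storage) could be sketched roughly like this; the directory layout and the `wmf_upload_YYYYMMDD` naming are assumptions for illustration, not anything agreed in the log:

```python
import tarfile
from pathlib import Path

def build_daily_tarball(day: str, source_dir: str, out_dir: str) -> Path:
    """Pack every file already downloaded for one day (e.g. '2011-01-21')
    into a single tarball named after that day.

    The wmf_upload_<day> name and the per-day prefix inside the archive
    are hypothetical conventions, not anything the IA or WMF prescribes.
    """
    out = Path(out_dir) / f"wmf_upload_{day.replace('-', '')}.tar"
    with tarfile.open(out, "w") as tar:
        for f in sorted(Path(source_dir).iterdir()):
            # store files under a per-day prefix inside the archive
            tar.add(f, arcname=f"{day}/{f.name}")
    return out
```

The actual upload to IA storage is left out here, since the log never settles on whether it would go through the S3-like API or something else.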
13:49 <Nemo_bis> I wouldn't call it "easy", though :)
13:49 <Nemo_bis> and perhaps the WMF will do the tarballs
13:50 <soultcer> In 10 years maybe
13:50 <soultcer> They don't even have an offsite backup
13:52 <Nemo_bis> actually, maybe a few weeks, they were discussing it yesterday
13:52 <soultcer> Link?
14:01 <Nemo_bis> it was on IRC
14:01 <Nemo_bis> anyway, fresh off the press: https://wikitech.wikimedia.org/view/Dumps/Development_2012
14:02 <soultcer> Cool, thanks
14:04 <soultcer> It says waiting for contact on archive.org there, I hope they find someone
19:07 <underscor> If anyone sees emijrp, let him know I can talk with wikimedia people about that
19:08 <underscor> since he's an admin there or something
19:08 <underscor> soultcer: Nemo_bis I don't have 18TB at once, but we can't put things that big anyway
19:08 <underscor> Things in the archive system shouldn't be more than about 10GB
19:09 <underscor> I was thinking having an item for each Category: namespace
19:09 <underscor> Each with its own date
19:09 <soultcer> It would be easier without categories. Categories can change
19:09 <underscor> like wmf_20110121_categoryname
19:09 <underscor> How would you logically break it up then?
19:10 <soultcer> Just by date
19:10 <soultcer> Then you could basically take the database dump of the image table and go bananas on it
19:11 <soultcer> They don't want people to get old image versions according to their robots.txt, but we could also add the oldimage table and get old image revisions too no problem
19:11 <underscor> You mean like everything uploaded on 2011-01-21, then everything uploaded on 2011-01-20...
19:11 <underscor> etcetera
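Breaking the corpus up by upload date, as discussed above, amounts to bucketing the rows of the image table dump by day. A minimal sketch, assuming rows of (file name, MediaWiki timestamp) pairs and reusing a hypothetical `wmf_upload_<day>` item naming (the log itself only shows the category variant `wmf_20110121_categoryname`):

```python
from collections import defaultdict

def group_by_upload_day(rows):
    """Group (name, timestamp) pairs from an image-table dump into
    per-day buckets keyed by an IA-style item identifier.

    MediaWiki stores timestamps as 14-character strings like
    '20110121093042' (YYYYMMDDHHMMSS), so the first 8 characters
    give the upload day.
    """
    items = defaultdict(list)
    for name, ts in rows:
        day = ts[:8]  # YYYYMMDD
        items[f"wmf_upload_{day}"].append(name)
    return dict(items)
```

Each resulting bucket would then correspond to one daily tarball / item.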
19:11 <soultcer> Yes
19:11 <soultcer> Best would of course be to just give the wikipedia guys an introduction into the s3 api and an item where they can upload
19:11 <soultcer> s/item/bucket/
19:12 <underscor> They can't just do that though, with how the s3 api works
19:12 <underscor> It's 10GB per item
19:12 <underscor> and item = bucket
19:13 <soultcer> Oh, collection then
19:13 <soultcer> I sometimes confuse Amazon S3 stuff with IA S3 stuff
19:20 <underscor> Oh, yeah
19:20 <underscor> Sorry, haha
19:21 <underscor> Yeah, if they're up for that, it's the best solution
19:21 <underscor> I kinda want to figure out a way we can do it without having these massive tarballs, but I dunno how feasible that is
19:21 <soultcer> If not, would it be possible to install mysql on your IA VM? How much space does it have?
19:21 <soultcer> What's wrong with massive tarballs?
19:21 <underscor> I have roughly 2TB I can play with at any one point
19:22 <underscor> Nothing particularly, just not easy to go restore a single image
19:24 <soultcer> I don't think the IA S3 system will like some 15 million files with a couple of hundred kilobytes each
19:30 <underscor> The IA doesn't really use S3 structure, the s3 is just a translation layer to the storage system they use.
19:30 <underscor> The internal logic should handle it fine
19:31 <underscor> Well, actually, I guess that depends on whether there are more than about 1000 files uploaded a day
19:31 <underscor> Because that's the practical limit on an item size
19:31 <soultcer> Maybe we should create smaller tarballs in the size of 100 mb or so each
19:32 <underscor> With an index, too
19:32 <underscor> Yeah, that'd be a good idea
19:32 <soultcer> If we index them on something like filename or upload date you won't really need an external index
19:34 <soultcer> Upload date is good because then we can start with old images that are unlikely to be deleted. But you will always need a copy of the wikipedia db or some other index to see when the file you are looking for was uploaded
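The index idea above, which addresses underscor's complaint that a single image is hard to restore from a massive tarball, can be sketched with Python's tarfile module: record each member's data offset and size once, then recover any one file with a single seek and read. This relies on `TarInfo.offset_data`, an attribute tarfile exposes but documents only lightly, and assumes an uncompressed tar:

```python
import tarfile

def index_tarball(path):
    """Map each member name to its (data offset, size) inside an
    uncompressed tar, so one file can be recovered without unpacking
    the whole archive."""
    with tarfile.open(path, "r") as tar:
        return {m.name: (m.offset_data, m.size)
                for m in tar.getmembers() if m.isfile()}

def restore_one(path, index, name):
    """Pull a single member's bytes out of the tarball via the index."""
    offset, size = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

The index dict itself could be stored alongside the tarball in the same item, which is roughly the "with an index, too" suggestion.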
19:49 <Nemo_bis> the problem with daily tarballs is that their size can indeed vary a lot
19:49 <Nemo_bis> but they won't easily become unmanageable
19:50 <soultcer> You can always split the daily tarball into multiple daily tarballs with a predefined size
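Splitting a day's files into fixed-size volumes, as soultcer suggests, is a simple greedy pass over (name, size) pairs; the 100 MB ceiling echoes the figure floated earlier in the log, and the function name is just illustrative:

```python
def split_into_volumes(files, max_bytes=100 * 1024 * 1024):
    """Greedily split a list of (name, size) pairs into volumes whose
    total size stays at or under max_bytes; a single oversized file
    still gets a volume of its own."""
    volumes, current, total = [], [], 0
    for name, size in files:
        if current and total + size > max_bytes:
            volumes.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        volumes.append(current)
    return volumes
```

Each resulting volume would become one `day.N.tar`-style tarball, keeping every archive comfortably below the item limit regardless of how much was uploaded that day.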
21:46 <underscor> True
21:46 <underscor> soultcer: We could always keep a copy of the images table in sqldump format