01:15 <xmc> \o/
11:22 <ersi> snails
11:22 <ersi> snails everywhereeee
12:07 * Nemo_bis hands a hammer
12:07 <Nemo_bis> useful also if they lose an s
18:15 <pft> if i pull wikia backups should i put them in https://archive.org/details/wikiteam ?
18:33 <Nemo_bis> pft: you can't
18:34 <Nemo_bis> but please add the WikiTeam keyword so that an admin can later move them
18:34 <pft> ok
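A minimal sketch of what that upload-and-tag step could look like, assuming the internetarchive Python library with credentials already configured via "ia configure"; the item identifier, file name and title below are made-up examples, and only the "wikiteam" subject keyword matters for the later move:

    from internetarchive import upload

    # Upload a Wikia dump under a personal account and tag it with the
    # "wikiteam" subject so an admin can later move the item into
    # https://archive.org/details/wikiteam
    upload(
        "wikia_dump_memoryalpha_example",          # hypothetical item identifier
        files=["memoryalpha_pages_full.xml.gz"],   # hypothetical local dump file
        metadata={
            "mediatype": "web",
            "subject": "wikiteam; wiki; wikia",    # the keyword discussed above
            "title": "Memory Alpha XML dump (example)",
        },
    )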
18:34 <Nemo_bis> pft: What sort of Wikia backups are you pulling?
18:34 <pft> just their own dumps
18:34 <Nemo_bis> yes but which
18:34 <pft> memory alpha and wookiepedia at the moment
18:34 <pft> i might pull more later
18:34 <Nemo_bis> ah ok, little scale
18:34 <pft> yes
18:35 <pft> wookiepedia is 400m so it's pretty small
18:35 <balrog> are their dumps complete?
18:35 <pft> i haven't tested yet
18:35 <Nemo_bis> I'm in contact with them for them to upload to archive.org all their dumps, but I've been told it needs to be discussed in some senior staff meeting
18:35 <pft> i will ungzip at home and probably load into mediawikis
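A sketch of the fetch-and-decompress step pft describes, assuming the requests library; the dump URL is a placeholder for whatever link the wiki's Special:Statistics page offers, and the resulting XML would then be loaded into a local MediaWiki with maintenance/importDump.php:

    import gzip
    import shutil

    import requests

    # Placeholder for the dump link offered on the wiki's Special:Statistics page.
    DUMP_URL = "http://example.org/examplewiki_pages_full.xml.gz"

    # Download the gzipped XML dump to disk in chunks.
    with requests.get(DUMP_URL, stream=True) as resp:
        resp.raise_for_status()
        with open("pages_full.xml.gz", "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

    # "ungzip at home": decompress so the XML can be fed to importDump.php.
    with gzip.open("pages_full.xml.gz", "rb") as src, open("pages_full.xml", "wb") as dst:
        shutil.copyfileobj(src, dst)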
18:35 <Nemo_bis> balrog: define complete?
18:36 <balrog> have all the data that's visible
18:36 <balrog> Nemo_bis: why is a senior staff meeting required, if I may ask?
18:36 <Nemo_bis> how would I know :)
18:36 <Nemo_bis> and no, not all data visible, that's impossible with XML dumps
18:37 <Nemo_bis> but all data needed to make all the data which is visible, minus logs and private user data :) except I don't see image dumps any longer and they don't dump all wikis
18:37 <pft> yeah i didn't see any image dumps anywhere which is frustrating
18:37 <balrog> they don't rpovide image dumps? :/
18:37 <balrog> provide*
18:38 <balrog> and not all wikis? can wiki administrators turn it off individually?
18:38 <pft> well dumps.wikia.net appears to be gone and the downloads available on Special:Statistics seem to be user-generated by staff and have a limited duration
18:38 <pft> http://en.memory-alpha.org/wiki/Special:Statistics
18:38 <pft> that page has a "current pages and history" link but i don't see anything about images
18:38 <Nemo_bis> it never did but they made them nevertheless
18:39 <Nemo_bis> perhaps we just need to find out the filenames
18:39 <balrog> s3://wikia_xml_dumps/w/wo/wowwiki_pages_current.xml.gz etc
18:39 <Nemo_bis> for images
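One way to "find out the filenames" for images would be to probe the public bucket over plain HTTP with HEAD requests; a sketch assuming the requests library, with candidate names that are pure guesses modeled on the wowwiki example above (none of them are confirmed to exist):

    import requests

    # Candidate image-archive names next to the known text dump; all guesses.
    candidates = [
        "http://s3.amazonaws.com/wikia_xml_dumps/w/wo/wowwiki_images.tar",
        "http://s3.amazonaws.com/wikia_xml_dumps/w/wo/wowwiki_images.zip",
        # Known-good control, from the s3:// path quoted above:
        "http://s3.amazonaws.com/wikia_xml_dumps/w/wo/wowwiki_pages_current.xml.gz",
    ]

    for url in candidates:
        # 200 means the object exists and is publicly readable;
        # 403/404 means it does not (or is not readable anonymously).
        print(requests.head(url).status_code, url)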
18:40 <Nemo_bis> this is how it used to be https://ia801507.us.archive.org/zipview.php?zip=/28/items/wikia_dump_20121204/c.zip
18:41 <balrog> hmm
18:41 <balrog> wikia's source code is open
18:41 <balrog> including the part that uploads the dumps to S3
18:41 <pft> interesting
18:42 <balrog> https://github.com/Wikia/app/blob/dev/extensions/wikia/WikiFactory/Close/maintenance.php
18:42 <balrog> look for DumpsOnDemand::putToAmazonS3
18:42 <Nemo_bis> well, not actually all of it
18:42 <Nemo_bis> though they are working on open sourcing it all
18:43 <balrog> Nemo_bis: interesting
18:43 <balrog> https://github.com/Wikia/app/blob/dev/extensions/wikia/WikiFactory/Dumps/DumpsOnDemand.php
18:43 <balrog> "url" => 'http://s3.amazonaws.com/wikia_xml_dumps/' . self::getPath( "{$wgDBname}_pages_current.xml.gz" ),
18:43 <balrog> "url" => 'http://s3.amazonaws.com/wikia_xml_dumps/' . self::getPath( "{$wgDBname}_pages_full.xml.gz" ),
18:43 <balrog> don't see anything for images
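For the text dumps the pattern is explicit, so here is a Python sketch of the same URL construction, assuming the first-letter/first-two-letters layout from the wowwiki path stands in for self::getPath(), and using the requests library to fetch both files for a given database name:

    import requests

    def get_path(filename):
        # Apparent equivalent of self::getPath(): first letter, then first two
        # letters, then the file name, e.g. w/wo/wowwiki_pages_current.xml.gz
        return "{0}/{1}/{2}".format(filename[0], filename[:2], filename)

    def fetch_dumps(dbname):
        # The two dump flavours named in DumpsOnDemand.php above.
        for suffix in ("pages_current", "pages_full"):
            filename = "{0}_{1}.xml.gz".format(dbname, suffix)
            url = "http://s3.amazonaws.com/wikia_xml_dumps/" + get_path(filename)
            resp = requests.get(url, stream=True)
            resp.raise_for_status()
            with open(filename, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)

    fetch_dumps("wowwiki")  # database name taken from the example above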
18:45 <pft> yeah, they appear as tars in the link Nemo_bis pasted
18:45 <pft> i'm guessing that was more of a manual thing they did
18:45 <balrog> "Wikia does not perform dumps of images (but see m:Wikix)."
18:45 <balrog> http://meta.wikimedia.org/wiki/Wikix
18:46 <balrog> ...interesting
18:46 <balrog> that will extract and grab all images in an xml dump
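The rough idea behind Wikix fits in a few lines of Python: scan the dump for [[File:...]] and [[Image:...]] links and fetch each one through Special:FilePath. A sketch assuming the requests library and the wiki mentioned earlier; a real image grabber would also need to handle templates, galleries, namespace aliases and file names with slashes, and should stream the XML rather than read a large dump into memory:

    import re

    import requests

    BASE_URL = "http://en.memory-alpha.org/wiki"   # wiki discussed above
    DUMP_PATH = "pages_full.xml"                   # decompressed XML dump

    # Match [[File:Foo.jpg|...]] and the older [[Image:Foo.jpg|...]] syntax.
    IMAGE_LINK = re.compile(r"\[\[(?:File|Image):([^|\]]+)", re.IGNORECASE)

    with open(DUMP_PATH, encoding="utf-8") as dump:
        names = sorted(set(IMAGE_LINK.findall(dump.read())))

    for name in names:
        name = name.strip().replace(" ", "_")
        # Special:FilePath redirects to the actual file location.
        url = "{0}/Special:FilePath/{1}".format(BASE_URL, name)
        resp = requests.get(url)
        if resp.ok:
            with open(name, "wb") as out:
                out.write(resp.content)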
18:46 <pft> nice1
18:46 <pft> er nice!
18:46 <Nemo_bis> pft: it was not manual
18:46 <pft> ahh o
18:46 <pft> er ok
18:47 <Nemo_bis> wikix is horribly painful
18:47 <Nemo_bis> and it's not designed to handle 300k wikis
18:47 <pft> ahhh
18:47 <pft> sorry, i realize this is all stuff you have been down before
18:48 <balrog> Nemo_bis: really? :/
18:48 <pft> just trying to figure out how to help
18:48 <balrog> Nemo_bis: is there a reference to what's been done?
18:55 <Nemo_bis> balrog: about?
18:55 <balrog> with regards to what tools have been tested and such
18:56 <Nemo_bis> for what
18:57 <balrog> dumping large wikis
18:57 <Nemo_bis> most Wikia wikis are very tiny
18:57 <Nemo_bis> there isn't much to test, we only need to see if Wikia is helpful or not
18:58 <Nemo_bis> if it's not helpful, we'll have to run dumpgenerator on all their 350k wikis to get all the text and images
18:58 <balrog> ouch
18:58 <Nemo_bis> but that's not particularly painful, just a bit boring
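A sketch of that boring-but-simple loop, assuming WikiTeam's dumpgenerator.py is available locally and that a wikis.txt file lists one api.php URL per line; building that list for all ~350k wikis is the real chore:

    import subprocess

    # One api.php URL per line, e.g. http://community.wikia.com/api.php
    with open("wikis.txt", encoding="utf-8") as f:
        wikis = [line.strip() for line in f if line.strip()]

    for api in wikis:
        # WikiTeam's dumpgenerator: full page history plus images.
        subprocess.call([
            "python", "dumpgenerator.py",
            "--api=" + api,
            "--xml",
            "--images",
        ])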
18:58 <balrog> how difficult would it be to submit a PR to their repo that would cause images to also be archived?
18:58 <Nemo_bis> unless they go really rogue and disable API or so, which I don't think they'd do though
18:59 <Nemo_bis> they allegedly have problems with space
18:59 <balrog> how many wikis have we run into which have disabled API access?
18:59 <Nemo_bis> this is probably what the seniors have to discuss, whether to spend 10 $ instead of 5 for the space on S3 :)
18:59 <Nemo_bis> thousands
18:59 <balrog> how do we dump those? :/
18:59 <Nemo_bis> with pre-API method
18:59 <Nemo_bis> Special:Export
19:00 <Nemo_bis> some disable even that, but it's been only a couple wikis so far
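A sketch of the pre-API route: POSTing page titles to Special:Export to get the same XML format back. It assumes the requests library and a hypothetical index.php location; the exact parameters accepted can vary with wiki configuration and MediaWiki version:

    import requests

    # Hypothetical wiki whose api.php is disabled; index.php location varies.
    INDEX = "http://example.org/w/index.php"

    def export_pages(titles):
        # Special:Export returns an XML dump for the listed pages; requesting
        # the full history mirrors what the API-based path would fetch.
        resp = requests.post(INDEX, data={
            "title": "Special:Export",
            "pages": "\n".join(titles),   # one page title per line
            "history": "1",
        })
        resp.raise_for_status()
        return resp.text                  # <mediawiki ...> XML export

    xml = export_pages(["Main Page", "Help:Contents"])
    with open("export.xml", "w", encoding="utf-8") as out:
        out.write(xml)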
19:00 <pft> i tried to grab memory-alpha but couldn't find the api page for it before i did more reading and found that I could download the dump
19:00 <Nemo_bis> usually the problem with wiki sysadmins is stupidity, not malice
19:04 <xmc> same with forums, too
19:06 <Nemo_bis> :)
19:08 <balrog> what's the best way to dump forums though? they're not as rough on wget at least
19:11 <pft> we need to start contributing to open-source projects to put in easy backup things that are publicly enabled by default ;)
19:12 <Nemo_bis> pft: you're welcome :) https://bugzilla.wikimedia.org/buglist.cgi?resolution=---&query_format=advanced&component=Export%2FImport
19:13 <pft> nice