Time |
Nickname |
Message |
16:49 |
ete_ |
Nemo_bis was a way to export all WikiApiary urls ever set up? |
16:53 |
Nemo_bis |
ete_: not as far as I know |
16:58 |
ete_ |
Okay, I'll add a link to a page with that to one of the archiveteam pages. What info do you want in the table? API URL, main page URL, anything else? |
17:04 |
Nemo_bis |
ete_: api.php and index.php url, nothing else |
17:04 |
ete_ |
Okay. And excluding sites marked as backed up by you guys is fine? |
17:05 |
ete_ |
anything else I should exclude (WMF, Wikia, defunct sites?) |
17:07 |
Nemo_bis |
ete_: WMF, Wikia, wiki-site and editthis |
17:07 |
Nemo_bis |
Defunct I'd say no, in case someone has a backup |
17:07 |
ete_ |
Okay |
17:07 |
Nemo_bis |
Not sure about excluding archived or not, as currently the property is not reliable |
17:08 |
ete_ |
If I don't exclude archived, there may be too many results |
17:08 |
ete_ |
I'll test |
17:09 |
Nemo_bis |
That's true as well :) and hopefully we'll get better at matching |
17:09 |
ete_ |
yep :) |
17:09 |
ete_ |
there's an alternate API url field available on wikiapiary now too |
17:09 |
ete_ |
do you want that displayed? |
17:17 |
Nemo_bis |
ete_: no, main is enough |
17:17 |
ete_ |
okay |
17:18 |
ete_ |
Is it okay if I exclude WMF by netblock organization, or does each WMF subfarm need excluding in case something's missed? |
17:24 |
ete_ |
huh, wikiapiary is missing editthis and wikisite. I'll fix that too. |
17:26 |
Nemo_bis |
ete_: probably netblock is enough |
17:27 |
ete_ |
okay |
18:16 |
ete_ |
Nemo_bis https://wikiapiary.com/wiki/User:Nemo_bis/WikiApiary_URL_export |
18:16 |
ete_ |
it's super expensive so I put it in your userspace to reduce traffic |
18:17 |
Nemo_bis |
indeed it's loaaaaaaaaaaaaading |
18:17 |
Nemo_bis |
oh I got the <title> now |
18:25 |
ete_ |
hm, there seems to be 2500 duplicate lines |
18:27 |
ete_ |
good news: it's because the offset is too high in later queries and there's actually only ~4500 results |
18:29 |
Nemo_bis |
Oh :) |
18:29 |
Nemo_bis |
Yes, I reported that bug on #semantic-mediawiki the other day but I'm not sure it was filed |
18:33 |
ete_ |
I wonder how many WA wikis will be new to the WikiTeam's collection |
18:37 |
Nemo_bis |
ete_: considering that only 35 % of the unarchived wikis on wikiapiary are actually unarchived, and that we have a list of 2500 unarchived wikis, probably not that many I must say :) |
18:37 |
Nemo_bis |
WikiApiary got most of its URLs from us (and mutante) in a way or another :P |
18:37 |
ete_ |
right |
18:38 |
Nemo_bis |
The other way I was just starting 100 parallel dumpgenerator.py over your lists. |
18:38 |
Nemo_bis |
day |
18:39 |
Nemo_bis |
Then I thought, hey, let's make a sanity check... and found out 65 % were already dumped, so I stopped. |
18:39 |
ete_ |
heh |
18:39 |
Nemo_bis |
Because the other thousands are those which error out for various issues with our scripts |
18:40 |
ete_ |
right |
18:41 |
ete_ |
how much is dumped into WARC and how much to MW XML? |
18:54 |
Nemo_bis |
ete_: no warcs, only XML and images |
18:54 |
Nemo_bis |
well, files |
18:56 |
Nemo_bis |
uh he's out :( |
18:56 |
Nemo_bis |
but we'd like to collect some more data https://code.google.com/p/wikiteam/issues/detail?id=82 |
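
Note on the 18:25 to 18:39 exchange: what follows is a minimal sketch, not taken from the log, of the two checks discussed there. The first drops the duplicate lines that show up when later queries use too high an offset; the second compares a list of API URLs against what is already dumped before any dumpgenerator.py run is started. The function names, the tab-separated "api.php, index.php" line format, and the example.org URLs are all assumptions made for illustration; the real export page and the WikiTeam lists may look different.

```python
def dedupe_lines(lines):
    """Return non-empty lines with repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        row = line.strip()
        if row and row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def split_by_dump_status(api_urls, dumped_api_urls):
    """Split api_urls into (already_dumped, still_missing) lists.

    Matching is by exact string after stripping whitespace, which is an
    assumption; real lists may need URL normalisation first.
    """
    dumped = {url.strip() for url in dumped_api_urls}
    already, missing = [], []
    for url in api_urls:
        url = url.strip()
        if url in dumped:
            already.append(url)
        else:
            missing.append(url)
    return already, missing


if __name__ == "__main__":
    # Hypothetical export rows: api.php URL, a tab, then index.php URL.
    export = [
        "http://a.example.org/api.php\thttp://a.example.org/index.php",
        "http://b.example.org/api.php\thttp://b.example.org/index.php",
        # A later query with too high an offset repeats an earlier row.
        "http://b.example.org/api.php\thttp://b.example.org/index.php",
        "http://c.example.org/api.php\thttp://c.example.org/index.php",
    ]
    rows = dedupe_lines(export)
    api_urls = [row.split("\t")[0] for row in rows]

    # Hypothetical set of API URLs that already have a dump.
    dumped = {"http://b.example.org/api.php"}
    already, missing = split_by_dump_status(api_urls, dumped)

    share = 100 * len(already) / len(api_urls)
    print(f"{len(rows)} unique rows, {share:.0f} % already dumped, "
          f"{len(missing)} left to try")
```

Exact string comparison is the simplest possible matching rule and will miss wikis whose URLs differ only in scheme, case, or trailing path, so the "already dumped" share it reports is a lower bound at best.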