[16:49] <ete_> Nemo_bis: is there a way to export all WikiApiary urls ever set up?
[16:53] <Nemo_bis> ete_: not as far as I know
[16:58] <ete_> Okay, I'll add a link to a page with that to one of the archiveteam pages. What info do you want in the table? API URL, main page URL, anything else?
[17:04] <Nemo_bis> ete_: api.php and index.php url, nothing else
[17:04] <ete_> Okay. And excluding sites marked as backed up by you guys is fine?
[17:05] <ete_> anything else I should exclude (WMF, Wikia, defunct sites?)
[17:07] <Nemo_bis> ete_: WMF, Wikia, wiki-site and editthis
[17:07] <Nemo_bis> Defunct I'd say no, in case someone has a backup
[17:07] <ete_> Okay
[17:07] <Nemo_bis> Not sure about excluding archived or not, as currently the property is not reliable
[17:08] <ete_> If I don't exclude archived, there may be too many results
[17:08] <ete_> I'll test
[17:09] <Nemo_bis> That's true as well :) and hopefully we'll get better at matching
[17:09] <ete_> yep :)
[17:09] <ete_> there's an alternate API url field available on wikiapiary now too
[17:09] <ete_> do you want that displayed?
[17:17] <Nemo_bis> ete_: no, main is enough
[17:17] <ete_> okay
[17:18] <ete_> Is it okay if I exclude WMF by netblock organization, or does each WMF subfarm need excluding in case something's missed?
[17:24] <ete_> huh, wikiapiary is missing editthis and wikisite. I'll fix that too.
[17:26] <Nemo_bis> ete_: probably netblock is enough
[17:27] <ete_> okay
[18:16] <ete_> Nemo_bis: https://wikiapiary.com/wiki/User:Nemo_bis/WikiApiary_URL_export
[18:16] <ete_> it's super expensive so I put it in your userspace to reduce traffic
[18:17] <Nemo_bis> indeed it's loaaaaaaaaaaaaading
[18:17] <Nemo_bis> oh I got them now
[18:25] <ete_> hm, there seem to be 2500 duplicate lines
[18:27] <ete_> good news: it's because the offset is too high in later queries and there are actually only ~4500 results
[18:29] <Nemo_bis> Oh :)
[18:29] <Nemo_bis> Yes, I reported that bug on #semantic-mediawiki the other day but I'm not sure it was filed
[18:33] <ete_> I wonder how many WA wikis will be new to WikiTeam's collection
[18:37] <Nemo_bis> ete_: considering that only 35 % of the unarchived wikis on wikiapiary are actually unarchived, and that we have a list of 2500 unarchived wikis, probably not that many, I must say :)
[18:37] <Nemo_bis> WikiApiary got most of its URLs from us (and mutante) in one way or another :P
[18:37] <ete_> right
[18:38] <Nemo_bis> The other day I was just starting 100 parallel dumpgenerator.py over your lists.
[18:39] <Nemo_bis> Then I thought, hey, let's make a sanity check... and found out 65 % were already dumped, so I stopped.
[18:39] <ete_> heh
[18:39] <Nemo_bis> Because the other thousands are those which error out for various issues with our scripts
[18:40] <ete_> right
[18:41] <ete_> how much is dumped into WARC and how much to MW XML?
[18:54] <Nemo_bis> ete_: no warcs, only XML and images
[18:54] <Nemo_bis> well, files
[18:56] <Nemo_bis> uh, he's out :(
[18:56] <Nemo_bis> but we'd like to collect some more data: https://code.google.com/p/wikiteam/issues/detail?id=82
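
The export discussed at 18:25-18:27 contained ~2500 duplicate lines caused by a Semantic MediaWiki offset bug in the later queries. A minimal sketch of a cleanup step, assuming the export was saved to a local file named "wikiapiary_urls.txt" (both file names here are hypothetical):

```python
# Drop the duplicate lines produced by the SMW offset bug,
# keeping the first occurrence of each URL in its original order.
seen = set()
with open("wikiapiary_urls.txt") as src, open("wikiapiary_urls_dedup.txt", "w") as dst:
    for line in src:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            dst.write(url + "\n")
```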
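Nemo_bis mentions (18:38) starting 100 parallel dumpgenerator.py processes over the exported lists. A minimal sketch of that approach, assuming a file "api_urls.txt" (hypothetical name) with one api.php URL per line; --api, --xml and --images are standard dumpgenerator.py options, but check your WikiTeam checkout for the exact invocation:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def dump(api_url):
    # Each worker blocks on one dumpgenerator.py process until it finishes.
    return subprocess.run(
        ["python", "dumpgenerator.py", "--api=" + api_url, "--xml", "--images"]
    ).returncode

with open("api_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Bound the pool at 100 workers, matching the 100 parallel runs mentioned above.
with ThreadPoolExecutor(max_workers=100) as pool:
    for url, code in zip(urls, pool.map(dump, urls)):
        if code != 0:
            print("failed:", url)
```

Threads are enough here because each worker just waits on an external process; the real concurrency is the 100 child processes.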
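The sanity check at 18:39 (65 % of the list turned out to be already dumped) amounts to a set difference between the exported URLs and the URLs already in the collection. A minimal sketch under that assumption, with both file names hypothetical; real matching would also need URL normalization, which is the "getting better at matching" problem mentioned at 17:09:

```python
# Keep only the wikis not already dumped: exported list minus dumped list.
with open("wikiapiary_urls_dedup.txt") as f:
    exported = {line.strip() for line in f if line.strip()}
with open("already_dumped.txt") as f:
    dumped = {line.strip() for line in f if line.strip()}

new = sorted(exported - dumped)
print(len(new), "wikis not yet dumped")
```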