#wikiteam 2014-06-16, Mon


Time Nickname Message
16:49 ete_ Nemo_bis: was a way to export all WikiApiary URLs ever set up?
16:53 Nemo_bis ete_: not as far as I know
16:58 ete_ Okay, I'll add a link to a page with that to one of the archiveteam pages. What info do you want in the table? API URL, main page URL, anything else?
17:04 Nemo_bis ete_: api.php and index.php URLs, nothing else
17:04 ete_ Okay. And excluding sites marked as backed up by you guys is fine?
17:05 ete_ anything else I should exclude (WMF, Wikia, defunct sites)?
17:07 Nemo_bis ete_: WMF, Wikia, wiki-site and editthis
17:07 Nemo_bis Defunct I'd say no, in case someone has a backup
17:07 ete_ Okay
17:07 Nemo_bis Not sure about excluding archived ones or not, as currently the property is not reliable
17:08 ete_ If I don't exclude archived, there may be too many results
17:08 ete_ I'll test
17:09 Nemo_bis That's true as well :) and hopefully we'll get better at matching
17:09 ete_ yep :)
17:09 ete_ there's an alternate API URL field available on WikiApiary now too
17:09 ete_ do you want that displayed?
17:17 Nemo_bis ete_: no, main is enough
17:17 ete_ okay
17:18 ete_ Is it okay if I exclude WMF by netblock organization, or does each WMF subfarm need excluding in case something's missed?
17:24 ete_ huh, WikiApiary is missing editthis and wiki-site. I'll fix that too.
17:26 Nemo_bis ete_: probably netblock is enough
17:27 ete_ okay
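The export specced above can also be reproduced against WikiApiary's own API. Below is a minimal sketch using Semantic MediaWiki's `ask` module; the property names (`Has API URL`, `Has index URL`) and the `Category:Website` filter are assumptions inferred from this conversation, not verified against WikiApiary's current schema.

```python
# Minimal sketch of the export discussed above: ask WikiApiary's
# Semantic MediaWiki API for each wiki's api.php and index.php URL.
# Property and category names here are assumptions, not verified.
import requests

API = "https://wikiapiary.com/w/api.php"
QUERY = "[[Category:Website]]|?Has API URL|?Has index URL|limit=500"

data = requests.get(API, params={
    "action": "ask",
    "query": QUERY,
    "format": "json",
}).json()

for page, row in data["query"]["results"].items():
    print(page, row["printouts"])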
18:16 ete_ Nemo_bis https://wikiapiary.com/wiki/User:Nemo_bis/WikiApiary_URL_export
18:16 ete_ it's super expensive so I put it in your userspace to reduce traffic
18:17 Nemo_bis indeed it's loaaaaaaaaaaaaading
18:17 Nemo_bis oh I got the <title> now
18:25 ete_ hm, there seem to be 2500 duplicate lines
18:27 ete_ good news: it's because the offset is too high in later queries and there's actually only ~4500 results
18:29 Nemo_bis Oh :)
18:29 Nemo_bis Yes, I reported that bug on #semantic-mediawiki the other day but I'm not sure it was filed
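For what it's worth, the duplicate rows can be worked around client-side by paging with an explicit offset and deduplicating on page title. A rough sketch, reusing the assumed `API` and `QUERY` from the previous snippet:

```python
# Page through the ask query with an explicit offset, deduplicating on
# page title so the too-high-offset bug only costs wasted requests,
# not duplicate rows. Reuses API and QUERY from the sketch above.
import requests

seen = {}
offset = 0
while True:
    data = requests.get(API, params={
        "action": "ask",
        "query": QUERY + "|offset=%d" % offset,
        "format": "json",
    }).json()
    results = data.get("query", {}).get("results", {})
    new = {k: v for k, v in results.items() if k not in seen}
    if not new:
        break
    seen.update(new)
    offset += len(results)

print("%d unique wikis" % len(seen))  # ~4500 expected per the log
```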
18:33 ete_ I wonder how many WA wikis will be new to WikiTeam's collection
18:37 Nemo_bis ete_: considering that only 35% of the wikis marked as unarchived on WikiApiary are actually unarchived, and that we have a list of 2500 unarchived wikis, probably not that many I must say :)
18:37 Nemo_bis WikiApiary got most of its URLs from us (and mutante) one way or another :P
18:37 ete_ right
18:38 Nemo_bis The other day I was just starting 100 parallel dumpgenerator.py runs over your lists.
18:39 Nemo_bis Then I thought, hey, let's make a sanity check... and found out 65% were already dumped, so I stopped.
18:39 ete_ heh
18:39 Nemo_bis Because the other thousands are the ones that error out due to various issues with our scripts
18:40 ete_ right
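The sanity check Nemo_bis describes might look something like the sketch below: before launching dumpgenerator.py over a URL list, drop every wiki whose domain already appears in a list of finished dumps. The file names and the domain-keying rule are illustrative assumptions, not WikiTeam's exact naming convention.

```python
# Rough sketch of the pre-launch sanity check: skip wikis whose domain
# already shows up in a list of finished dumps. dumped.txt and
# apiurls.txt are hypothetical local files; the keying rule is a
# simplification of whatever naming convention the dumps actually use.
from urllib.parse import urlparse

with open("dumped.txt") as f:                 # one dump name per line
    dumped = {line.strip() for line in f}

todo = []
with open("apiurls.txt") as f:                # the WikiApiary export
    for line in f:
        url = line.strip()
        if not url:
            continue
        key = urlparse(url).netloc            # compare by domain only
        if not any(key in name for name in dumped):
            todo.append(url)

print("%d of the wikis still need a dump" % len(todo))
```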
18:41 ete_ how much is dumped to WARC and how much to MediaWiki XML?
18:54 Nemo_bis ete_: no WARCs, only XML and images
18:54 Nemo_bis well, files
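For context, a single WikiTeam dump of the kind described here (full XML history plus the wiki's uploaded files, no WARC) is produced roughly like this; the target URL is a placeholder:

```python
# One dump as described above: MediaWiki XML (full history) plus the
# wiki's uploaded files, no WARC. The target api.php URL is a
# placeholder; the flags follow dumpgenerator.py's documented usage.
import subprocess

subprocess.check_call([
    "python", "dumpgenerator.py",
    "--api=http://wiki.example.org/api.php",  # placeholder wiki
    "--xml",     # export every page with full revision history
    "--images",  # also download uploaded files and their descriptions
])
```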
18:56 Nemo_bis uh he's out :(
18:56 Nemo_bis but we'd like to collect some more data: https://code.google.com/p/wikiteam/issues/detail?id=82
