09:11 <VoynichCr> 2,000 wikidots found so far, still finding more
09:12 <VoynichCr> we could use warrior/tracker or an ad-hoc script to export wikicode+images, or both approaches
09:13 <VoynichCr> i think wikidot is the largest non-MediaWiki wikifarm without a dump
09:14 <VoynichCr> the last one was wikispaces, and it was archived with warrior (400,000 wikispaces) and an ad-hoc script (200,000 wikispaces)
11:44 *** kiska18 has quit IRC (Ping timeout (120 seconds))
11:47 *** kiska18 has joined #wikiteam
11:48 *** Iglooop1 sets mode: +o kiska18
13:33 <VoynichCr> 3,000 wikidots found
14:06 <Nemo_bis> VoynichCr: are you still using a Google scraper?
14:29 *** LowLevelM has joined #wikiteam
15:14 *** vicarage_ has joined #wikiteam
15:16 <vicarage_> Hi. I've just spent 9 months porting our 30000-page wiki from wikidot to mediawiki. It was not an easy process, so I suspect if wikidot goes down, most of its content will go with it
15:55 <vicarage_> We had to write special software to extract the page metadata (title, tags, a.k.a. categories) and combine it with the wikidot-provided backup, which only gives page content
16:00 <vicarage_> Backups can only be made by administrators, not by members who've merely contributed to each wiki
16:21 <Nemo_bis> vicarage_: hello, welcome to #wikiteam
16:21 <Nemo_bis> any and all code you used for your migration would be very welcome if published under a free license, so it can possibly be reused for migration or export tools
16:23 <Nemo_bis> LowLevelM: no, I don't think archivebot is able to cycle through all the content on wikidot wikis
16:24 <LowLevelM> Maybe just the forum?
16:24 <vicarage_> It's the world's nastiest bash and sed script, combined with a colleague's nasty Python. But you are welcome to it
16:41 <JAA> ArchiveTeam *thrives* on nasty code.
16:55 <Nemo_bis> heh
16:56 <Nemo_bis> vicarage_: we can probably recycle regular expressions, for instance
16:57 <Nemo_bis> LowLevelM: sure, the forum might be fine
17:00 <Nemo_bis> Meanwhile, the admin of http://editthis.info/ mysteriously came back from the dead. Over five years ago I would have bet the death of the wiki farm was imminent. :)
18:24 <VoynichCr> Nemo_bis: if you append /random-site.php to any wikidot wiki, you jump to another one
18:24 <VoynichCr> i used that, but i only got 3155 wikis
18:25 <VoynichCr> i scraped the four wiki suggestions that sometimes are available at the bottom of any wiki
18:25 <VoynichCr> 3155 seems low to me... but i can't get more using this scheme
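
A minimal sketch of scripting that /random-site.php hop, assuming the endpoint answers with an HTTP redirect whose Location header points at another wiki; the seed URL, iteration count, and delay are illustrative only:

    import time
    from urllib.parse import urlparse

    import requests

    SEED = "http://community.wikidot.com/random-site.php"  # any wiki should work as a seed (assumption)
    found = set()

    for _ in range(1000):
        try:
            r = requests.get(SEED, allow_redirects=False, timeout=30)
        except requests.RequestException:
            continue
        # Record the subdomain the redirect points at, if any.
        host = urlparse(r.headers.get("Location", "")).netloc
        if host.endswith(".wikidot.com"):
            found.add(host)
        time.sleep(1)  # keep the request rate polite

    print(len(found), "wikis discovered")
    for host in sorted(found):
        print(host)
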
18:29 <VoynichCr> wikidot list posted on wikiteam github
18:37 <vicarage_> That list is definitely not complete: it does not include my big wiki fancyclopedia.wikidot.com, nor a trivial one, testfancy.wikidot.com.
18:40 <vicarage_> My colleague wrote an API-based download that uses a key which is only available to administrators of paid sites, so it's not much use for external archivers
18:44 <vicarage_> When I tried /random-page.php 5 times, I got all 5 in your file, which suggests it only gives access to a curated sub-selection
19:57 <Nemo_bis> Which makes sense
20:05 <Nemo_bis> Can the list of users be used for discovery? e.g. https://www.wikidot.com/user:info/andraholguin69
20:05 <Nemo_bis> The commoncrawl.org dataset seems to have quite a few, see e.g. https://index.commoncrawl.org/CC-MAIN-2019-47-index?url=*.wikidot.com&output=json (warning, big page)
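
A sketch of reducing that commoncrawl.org index query to a deduplicated list of subdomains; the page/showNumPages parameters are assumed to follow the standard CDX-server API that index.commoncrawl.org exposes:

    import json
    from urllib.parse import urlparse

    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-47-index"
    QUERY = "*.wikidot.com"

    # Ask how many result pages the query spans (assumed CDX-server behaviour).
    pages = requests.get(
        INDEX, params={"url": QUERY, "output": "json", "showNumPages": "true"}, timeout=60
    ).json()["pages"]

    hosts = set()
    for page in range(pages):
        resp = requests.get(
            INDEX, params={"url": QUERY, "output": "json", "page": page}, timeout=120
        )
        # output=json yields one JSON record per line, each with a "url" field.
        for line in resp.text.splitlines():
            if line.strip():
                hosts.add(urlparse(json.loads(line)["url"]).netloc)

    for host in sorted(hosts):
        print(host)
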
20:21 <astrid> JAA: yeah i was about to say :P
20:26 <vicarage_> Via a convoluted route. If you have a wiki, you can send invites, and after you've typed 2 characters, the user list appears, which you could scrape to see which wikis they are members of
20:31 <vicarage_> Note it only shows some 100 usernames, so for 'aa' it gets to 'aaa*'. When sending a message, the dropdown list is much shorter, only 10 names
20:45 <vicarage_> A Google search for a page that exists on every wiki, 'site:wikidot.com -www search:site', gives 129000 hits. But I expect you lot know more about subdomain finders than me
20:47 <Nemo_bis> Usually we try to search for some content that appears on every subdomain and doesn't fall foul of Google's deduplication (yes, a contradictory requirement), a bit like Special:Version on MediaWiki
20:57 <vicarage_> site:wikidot.com "system:list-all-pages" gives 6400 clean results
21:18 <VoynichCr> vicarage_: what is your estimate for the number of wikis?
21:19 <VoynichCr> according to the wikidot mainpage, there are 80+ million pages
21:19 <VoynichCr> even in my list there are a lot of test wikis and spam ones
21:22 <VoynichCr> Nemo_bis: the "member of" list loads after a click, via javascript
21:23 <VoynichCr> never scraped dynamic content, tips welcome
21:36 <Nemo_bis> VoynichCr: seems to be a simple POST request of the kind curl 'https://www.wikidot.com/ajax-module-connector.php' --data 'user_id=3657632&moduleName=userinfo%2FUserInfoMemberOfModule&callbackIndex=1&wikidot_token7=abcd'
21:37 <Nemo_bis> So it might be enough to scrape the user pages from the commoncrawl.org list and then query the corresponding user IDs with this "API", at a reasonable speed
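
A sketch of that same call in Python, assuming the wikidot_token7 value only needs to match a cookie of the same name and that the JSON response carries the rendered module HTML in a "body" field; neither detail is confirmed in this log:

    import requests

    def member_of(user_id, token="abcd"):
        """Fetch the 'member of' HTML for a Wikidot user id via ajax-module-connector.php."""
        resp = requests.post(
            "https://www.wikidot.com/ajax-module-connector.php",
            data={
                "user_id": user_id,
                "moduleName": "userinfo/UserInfoMemberOfModule",
                "callbackIndex": "1",
                "wikidot_token7": token,  # assumed: must match the cookie below
            },
            cookies={"wikidot_token7": token},
            timeout=60,
        )
        # Assumed response shape: JSON with the rendered HTML under "body".
        return resp.json().get("body", "")

    print(member_of(3657632))  # user id taken from the example URL above
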
21:37 <vicarage_> The Wikipedia page suggests 150000 sites. I had a look, and no running tally seems to be available.
21:45 <vicarage_> The wikis found via search engines would be the most useful ones to save. http://community.wikidot.com/whatishot shows the activity level
21:51 <vicarage_> https://web.archive.org/web/20131202221632/http://www.wikidot.com/stats has 128000 sites in 2009