#wikiteam 2019-12-04,Wed


Time Nickname Message
09:11 🔗 VoynichCr 2,000 wikidots found so far, still finding more
09:12 🔗 VoynichCr we could use warrior/tracker or a script ad-hoc to export wikicode+images, or both approaches
09:13 🔗 VoynichCr i think wikidot is the largest non-mediawiki wikifarm without a dump
09:14 🔗 VoynichCr the last one was wikispaces, and it was archived with warrior (400,000 wikispaces) and ad-hoc script (200,000 wikispaces)
11:44 🔗 kiska18 has quit IRC (Ping timeout (120 seconds))
11:47 🔗 kiska18 has joined #wikiteam
11:48 🔗 Iglooop1 sets mode: +o kiska18
13:33 🔗 VoynichCr 3,000 wikidots found
14:06 🔗 Nemo_bis VoynichCr: are you still using a Google scraper?
14:29 🔗 LowLevelM has joined #wikiteam
15:14 🔗 vicarage_ has joined #wikiteam
15:16 🔗 vicarage_ Hi. I've just spent 9 months porting our 30000 page wiki from wikidot to mediawiki. It was not an easy process, so I suspect if
15:16 🔗 vicarage_ wikidot goes down, most of its content will go with it
15:55 🔗 vicarage_ We had to write special software to extract the page metadata (title, tags, aka categories) to combine with the wikidot-provided backup, which just provides page content
16:00 🔗 vicarage_ Backup can only be done by administrators, not merely members who've contributed to each wiki
16:21 🔗 Nemo_bis vicarage_: hello, welcome to #wikiteam
16:21 🔗 Nemo_bis any and all code you used for your migration would be very welcome if published under a free license, so it can possibly be used for migration or export tools
16:23 🔗 Nemo_bis LowLevelM: no, I don't think archivebot is able to cycle through all the content on wikidot wikis
16:24 🔗 LowLevelM Maybe just the forum?
16:24 🔗 vicarage_ It's the world's nastiest bash and sed script, combined with a colleague's nasty Python. But you are welcome to it
16:41 🔗 JAA ArchiveTeam *thrives* on nasty code.
16:55 🔗 Nemo_bis heh
16:56 🔗 Nemo_bis vicarage_: we can probably recycle regular expressions, for instance
16:57 🔗 Nemo_bis LowLevelM: sure, the forum might be fine
17:00 🔗 Nemo_bis Meanwhile, the admin of http://editthis.info/ mysteriously came back from the dead. Over five years ago I would have bet the death of the wiki farm was imminent. :)
18:24 🔗 VoynichCr Nemo_bis: if you append /random-site.php to any wikidot wiki, you jump to another one
18:24 🔗 VoynichCr i used that, but i only got 3155 wikis
18:25 🔗 VoynichCr i scraped the four wiki suggestions that sometimes are available at the bottom of any wiki
18:25 🔗 VoynichCr 3155 seems low to me... but i can't get more using this scheme
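
A minimal sketch of the random-jump discovery described above, assuming /random-site.php answers with an HTTP redirect whose Location header points at another *.wikidot.com subdomain (inferred from the chat, not verified):

    # Collect wikidot subdomains by repeatedly hitting the random-site endpoint.
    import time
    from urllib.parse import urlparse

    import requests

    SEED = "http://community.wikidot.com/random-site.php"  # any wiki should work as a seed
    found = set()

    for _ in range(1000):
        r = requests.get(SEED, allow_redirects=False, timeout=30)
        host = urlparse(r.headers.get("Location", "")).hostname
        if host and host.endswith(".wikidot.com"):
            found.add(host)
        time.sleep(1)  # be polite to a live wiki farm

    print(len(found), "wikis discovered")

As the chat notes, this scheme plateaus around 3,155 wikis, so it apparently only reaches a curated subset.
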
18:29 🔗 VoynichCr wikidot list posted on wikiteam github
18:37 🔗 vicarage_ That list is definitely not complete; it does not include my big wiki fancyclopedia.wikidot.com, nor a trivial testfancy.wikidot.com.
18:40 🔗 vicarage_ My colleague wrote an API-based download that uses a key which is only available to administrators of paid sites, so not much use for external archivers
18:44 🔗 vicarage_ When I tried /random-page.php 5 times, all 5 results were already in your file, so it suggests that it just gives access to a curated sub-selection
19:57 🔗 Nemo_bis Which makes sense
20:05 🔗 Nemo_bis Can the list of users be used for discovery? e.g. https://www.wikidot.com/user:info/andraholguin69
20:05 🔗 Nemo_bis The commoncrawl.org dataset seems to have quite a few, see e.g. https://index.commoncrawl.org/CC-MAIN-2019-47-index?url=*.wikidot.com&output=json (warning, big page)
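
A sketch for pulling *.wikidot.com hostnames out of the commoncrawl.org index linked above; the index returns one JSON record per line with a "url" field, and the "page" parameter used here to split up the very large result is assumed to behave like the usual CDX-style pagination:

    # Stream one page of the Common Crawl index and collect distinct wikidot subdomains.
    import json
    from urllib.parse import urlparse

    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-47-index"
    params = {"url": "*.wikidot.com", "output": "json", "page": 0}
    subdomains = set()

    with requests.get(INDEX, params=params, stream=True, timeout=120) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            host = urlparse(json.loads(line)["url"]).hostname
            if host and host.endswith(".wikidot.com"):
                subdomains.add(host)

    print(len(subdomains), "distinct *.wikidot.com hosts on this index page")

Records whose URLs are user pages (www.wikidot.com/user:info/...) can then feed the "member of" lookup sketched further down.
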
20:21 🔗 astrid JAA: yeah i was about to say :P
20:26 🔗 vicarage_ Via a convoluted route. If you have a wiki, you can send invites, and after you've typed 2 characters, the user list appears, which you could scrape and see what wikis they were members of
20:31 🔗 vicarage_ Note it only shows some 100 usernames, so for 'aa' it only gets as far as 'aaa*'. When sending a message, the dropdown list is much shorter, only 10 names
20:45 🔗 vicarage_ A google search for a page present on every wiki, 'site:wikidot.com -www search:site', gives 129,000 hits. But I expect you lot know more about subdomain finders than me
20:47 🔗 Nemo_bis Usually we try to search for some content that appears on every subdomain and doesn't fall foul of Google's deduplication (yes, a contradictory requirement), a bit like Special:Version on MediaWiki
20:57 🔗 vicarage_ site:wikidot.com "system:list-all-pages" gives 6400 clean results
21:18 🔗 VoynichCr vicarage_: what is your estimate for number of wikis?
21:19 🔗 VoynichCr according to the wikidot mainpage, there are 80+ million pages
21:19 🔗 VoynichCr even in my list there are a lot of test wikis and spam ones
21:22 🔗 VoynichCr Nemo_bis: the "member of" list loads after a click, via javascript
21:23 🔗 VoynichCr never scraped dynamic content, tips welcome
21:36 🔗 Nemo_bis VoynichCr: seems to be a simple POST request of the kind curl 'https://www.wikidot.com/ajax-module-connector.php' --data 'user_id=3657632&moduleName=userinfo%2FUserInfoMemberOfModule&callbackIndex=1&wikidot_token7=abcd'
21:37 🔗 Nemo_bis So it might be enough to scrape the user pages from the commoncrawl.org list and then query the corresponding user IDs with this "API", at a reasonable speed
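
A sketch of that "member of" lookup, rebuilt from the curl command quoted above; the assumptions are that wikidot_token7 must also be sent as a cookie with the same value, and that the response is JSON whose "body" field holds the rendered HTML listing the user's wikis:

    # Query the ajax-module-connector endpoint for the wikis a user is a member of.
    import re

    import requests

    def member_of(user_id, token="123456"):
        resp = requests.post(
            "https://www.wikidot.com/ajax-module-connector.php",
            data={
                "user_id": user_id,
                "moduleName": "userinfo/UserInfoMemberOfModule",
                "callbackIndex": 1,
                "wikidot_token7": token,  # assumed to need to match the cookie below
            },
            cookies={"wikidot_token7": token},
            timeout=60,
        )
        html = resp.json().get("body", "")  # assumed response shape
        return sorted(set(re.findall(r"([a-z0-9-]+)\.wikidot\.com", html)))

    print(member_of(3657632))  # user ID taken from the curl example above

Run at a gentle rate, this would turn the user IDs scraped from commoncrawl.org user pages into another list of candidate wikis.
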
21:37 🔗 vicarage_ The Wikipedia page suggests 150,000 sites. I had a look, and no running tally seems available.
21:45 🔗 vicarage_ The wikis found via the search engine would be the most useful ones to save. http://community.wikidot.com/whatishot shows the activity level
21:51 🔗 vicarage_ https://web.archive.org/web/20131202221632/http://www.wikidot.com/stats has 128000 sites in 2009
