09:11 <VoynichCr> 2,000 wikidots found so far, still finding more
09:12 <VoynichCr> we could use warrior/tracker or an ad-hoc script to export wikicode+images, or both approaches
09:13 <VoynichCr> i think wikidot is the largest non-MediaWiki wikifarm without a dump
09:14 <VoynichCr> the last one was wikispaces, and it was archived with warrior (400,000 wikispaces) and an ad-hoc script (200,000 wikispaces)
11:44 *** kiska18 has quit IRC (Ping timeout (120 seconds))
11:47 *** kiska18 has joined #wikiteam
11:48 *** Iglooop1 sets mode: +o kiska18
13:33 <VoynichCr> 3,000 wikidots found
14:06 <Nemo_bis> VoynichCr: are you still using a Google scraper?
14:29 *** LowLevelM has joined #wikiteam
15:14 *** vicarage_ has joined #wikiteam
15:16 <vicarage_> Hi. I've just spent 9 months porting our 30000-page wiki from wikidot to mediawiki. It was not an easy process, so I suspect if wikidot goes down, most of its content will go with it
15:55 <vicarage_> We had to write special software to extract the page metadata (title, tags, a.k.a. categories) and combine it with the wikidot-provided backup, which only gives page content
16:00 <vicarage_> Backups can only be made by administrators, not by members who've merely contributed to each wiki
16:21 <Nemo_bis> vicarage_: hello, welcome to #wikiteam
16:21 <Nemo_bis> any and all code you used for your migration would be very welcome if published under a free license, so it can possibly be reused for migration or export tools
16:23 <Nemo_bis> LowLevelM: no, I don't think archivebot is able to cycle through all the content on wikidot wikis
16:24 <LowLevelM> Maybe just the forum?
16:24 <vicarage_> It's the world's nastiest bash and sed script, combined with a colleague's nasty Python. But you are welcome to it
16:41 <JAA> ArchiveTeam *thrives* on nasty code.
16:55 <Nemo_bis> heh
16:56 <Nemo_bis> vicarage_: we can probably recycle regular expressions, for instance
16:57 <Nemo_bis> LowLevelM: sure, the forum might be fine
17:00 <Nemo_bis> Meanwhile, the admin of http://editthis.info/ mysteriously came back from the dead. Over five years ago I would have bet the death of the wiki farm was imminent. :)
18:24 <VoynichCr> Nemo_bis: if you append /random-site.php to any wikidot wiki, you jump to another one
18:24 <VoynichCr> i used that, but i only got 3155 wikis
18:25 <VoynichCr> i scraped the four wiki suggestions that sometimes are available at the bottom of any wiki
18:25 <VoynichCr> 3155 seems low to me... but i can't get more using this scheme
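
A minimal sketch of scripting that /random-site.php hop, assuming the endpoint answers with an HTTP redirect whose Location header points at another wiki; the seed URL, iteration count, and delay are illustrative only:

    import time
    from urllib.parse import urlparse

    import requests

    SEED = "http://community.wikidot.com/random-site.php"  # any wiki should work as a seed (assumption)
    found = set()

    for _ in range(1000):
        try:
            r = requests.get(SEED, allow_redirects=False, timeout=30)
        except requests.RequestException:
            continue
        # Record the subdomain the redirect points at, if any.
        host = urlparse(r.headers.get("Location", "")).netloc
        if host.endswith(".wikidot.com"):
            found.add(host)
        time.sleep(1)  # keep the request rate polite

    print(len(found), "wikis discovered")
    for host in sorted(found):
        print(host)
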
18:29 <VoynichCr> wikidot list posted on wikiteam github
18:37 <vicarage_> That list is definitely not complete: it does not include my big wiki fancyclopedia.wikidot.com, nor a trivial one, testfancy.wikidot.com.
18:40 <vicarage_> My colleague wrote an API-based download that uses a key which is only available to administrators of paid sites, so it's not much use for external archivers
18:44 <vicarage_> When I tried /random-page.php 5 times, I got all 5 in your file, which suggests it only gives access to a curated sub-selection
19:57 <Nemo_bis> Which makes sense
20:05 <Nemo_bis> Can the list of users be used for discovery? e.g. https://www.wikidot.com/user:info/andraholguin69
20:05 <Nemo_bis> The commoncrawl.org dataset seems to have quite a few, see e.g. https://index.commoncrawl.org/CC-MAIN-2019-47-index?url=*.wikidot.com&output=json (warning, big page)
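
A sketch of reducing that commoncrawl.org index query to a deduplicated list of subdomains; the page/showNumPages parameters are assumed to follow the standard CDX-server API that index.commoncrawl.org exposes:

    import json
    from urllib.parse import urlparse

    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-47-index"
    QUERY = "*.wikidot.com"

    # Ask how many result pages the query spans (assumed CDX-server behaviour).
    pages = requests.get(
        INDEX, params={"url": QUERY, "output": "json", "showNumPages": "true"}, timeout=60
    ).json()["pages"]

    hosts = set()
    for page in range(pages):
        resp = requests.get(
            INDEX, params={"url": QUERY, "output": "json", "page": page}, timeout=120
        )
        # output=json yields one JSON record per line, each with a "url" field.
        for line in resp.text.splitlines():
            if line.strip():
                hosts.add(urlparse(json.loads(line)["url"]).netloc)

    for host in sorted(hosts):
        print(host)
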
20:21 <astrid> JAA: yeah i was about to say :P
20:26 <vicarage_> Via a convoluted route. If you have a wiki, you can send invites, and after you've typed 2 characters, the user list appears, which you could scrape to see which wikis they are members of
20:31 <vicarage_> Note it only shows some 100 usernames, so for 'aa' it gets to 'aaa*'. When sending a message, the dropdown list is much shorter, only 10 names
20:45 <vicarage_> A Google search for a page that exists on every wiki, 'site:wikidot.com -www search:site', gives 129000 hits. But I expect you lot know more about subdomain finders than me
20:47 <Nemo_bis> Usually we try to search for some content that appears on every subdomain and doesn't fall foul of Google's deduplication (yes, a contradictory requirement), a bit like Special:Version on MediaWiki
20:57 <vicarage_> site:wikidot.com "system:list-all-pages" gives 6400 clean results
21:18 <VoynichCr> vicarage_: what is your estimate for the number of wikis?
21:19 <VoynichCr> according to the wikidot mainpage, there are 80+ million pages
21:19 <VoynichCr> even in my list there are a lot of test wikis and spam ones
21:22 <VoynichCr> Nemo_bis: the "member of" list loads after a click, via javascript
21:23 <VoynichCr> never scraped dynamic content, tips welcome
21:36 <Nemo_bis> VoynichCr: seems to be a simple POST request of the kind curl 'https://www.wikidot.com/ajax-module-connector.php' --data 'user_id=3657632&moduleName=userinfo%2FUserInfoMemberOfModule&callbackIndex=1&wikidot_token7=abcd'
21:37 <Nemo_bis> So it might be enough to scrape the user pages from the commoncrawl.org list and then query the corresponding user IDs with this "API", at a reasonable speed
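
A sketch of that same call in Python, assuming the wikidot_token7 value only needs to match a cookie of the same name and that the JSON response carries the rendered module HTML in a "body" field; neither detail is confirmed in this log:

    import requests

    def member_of(user_id, token="abcd"):
        """Fetch the 'member of' HTML for a Wikidot user id via ajax-module-connector.php."""
        resp = requests.post(
            "https://www.wikidot.com/ajax-module-connector.php",
            data={
                "user_id": user_id,
                "moduleName": "userinfo/UserInfoMemberOfModule",
                "callbackIndex": "1",
                "wikidot_token7": token,  # assumed: must match the cookie below
            },
            cookies={"wikidot_token7": token},
            timeout=60,
        )
        # Assumed response shape: JSON with the rendered HTML under "body".
        return resp.json().get("body", "")

    print(member_of(3657632))  # user id taken from the example URL above
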
21:37 <vicarage_> The Wikipedia page suggests 150000 sites. I had a look, and no running tally seems to be available.
21:45 <vicarage_> The wikis found via search engines would be the most useful ones to save. http://community.wikidot.com/whatishot shows the activity level
21:51 <vicarage_> https://web.archive.org/web/20131202221632/http://www.wikidot.com/stats has 128000 sites in 2009