[15:05] SketchCow: how hard would it be in your opinion for me to adapt your keywords-generator to wiki XML dumps?
[15:06] It's not hard.
[15:06] Give me a place to grab an .xml dump to test.
[15:13] SketchCow: http://archive.org/download/wikia_dump_20140125/o.zip/originals_pages_current.xml.gz for instance
[15:18] Ok, so first, I'm running this straight up at the XML.
[15:18] I suspect this will be a d i s a s t e r
[15:18] But we need a control group
[15:32] :) The meat is between tags
[15:33] Yeah.
[15:34] http://fos.textfiles.com/explosion.txt
[15:34] Well, that came out better than expected.
[15:34] That's no changes whatsoever.
[15:35] When I generate the things for internet archive, I take the top 10 single-words and the top 10 phrases.
[15:39] Indeed, clean enough.
[15:39] Half of the top10 is usernames I think?
[15:45] Can't speak for that
[19:19] so, couple of weeks/month later, http://imslp.org/ is now at 1.1GB downloaded :P
[19:52] midas: if this can make you feel better, I'm at 60 MB downloaded from wiki-site.com after more than 8 months
[19:55] * 28 MB of deduplicated 7z in 7 months
[20:09] lol
[20:10] I've almost finished the wikis named b*
[20:12] mine stops every so many hours and i'll have to wait for 24 before i can restart them :p
[20:14] with a 25 second delay already, but hey, wait and see
[20:15] I'm using a 240 s delay + 720 s between each wiki and it still gets stuck often. :<
[20:16] geez, thats very long between each request
[20:17] They have a very annoying throttling, allegedly to combat spambots
[20:19] yeah, i noticed the same with imslp
[20:19] No, IMSLP is plainly a jerk :)
[20:20] If you ask they tell you clearly that they don't like to be crawled
[20:21] yeah emailed and send a twitter message just now
[20:22] "Move over, i want to scroll your website."
[20:22] crawl*
[20:25] midas: I don't recommend that, usually the outcome is that your IP gets banned instantly
[20:26] thats fine, i have enough ip's
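
Editor's sketch of the keyword pass discussed at 15:32 to 15:39. This is not the actual keywords-generator, whose code does not appear in the log. It assumes the page text of a MediaWiki XML dump sits in `<text>` elements (true of the standard export format, though the log itself does not name the tag), and it counts the top 10 single words and top 10 two-word phrases from that text only, which should keep usernames and other metadata out of the results. The function names, the stopword list, and the choice of two-word phrases are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would want a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "that"}
WORD_RE = re.compile(r"[a-z0-9']+")


def iter_page_texts(xml_source):
    """Yield the contents of every <text> element, ignoring XML namespaces.

    xml_source may be a filename or a binary file object. iterparse streams
    the document, so a large dump is not loaded into memory at once.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        # Namespaced tags look like "{http://...}text"; keep the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
        elem.clear()  # release the element's content once it has been seen


def top_keywords(xml_source, n=10):
    """Return (top n words, top n two-word phrases) as lists of (term, count)."""
    words = Counter()
    phrases = Counter()
    for text in iter_page_texts(xml_source):
        tokens = WORD_RE.findall(text.lower())
        words.update(t for t in tokens if t not in STOPWORDS)
        # A phrase is two adjacent tokens, neither of which is a stopword.
        phrases.update(
            f"{a} {b}"
            for a, b in zip(tokens, tokens[1:])
            if a not in STOPWORDS and b not in STOPWORDS
        )
    return words.most_common(n), phrases.most_common(n)
```

For the gzipped dump linked at 15:13, one option would be to open the file with `gzip.open("originals_pages_current.xml.gz", "rb")` and pass the resulting file object to `top_keywords`. Wiki markup (templates, link brackets) is left in place here, so some of it may still surface among the top terms.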