[12:51] Nemo_bis: tested with vagrant and seems to work
[13:08] ah great
[13:13] I made a PR
[13:14] already merged
[13:14] balrog: this was the reason we added POST https://github.com/WikiTeam/wikiteam/issues/68
[13:14] no it's not
[13:15] If you found any bug, report a new issue here (Google account required): http://code.google.com/p/wikiteam/issues/list
[13:15] that needs to be changed :)
[13:15] PR PR PR
[13:16] meh, not worth it
[13:16] can be fixed in the github editor even
[13:19] Nemo_bis: has the wiki been migrated?
[13:22] balrog: yes but the formatting still has to be fixed
[13:22] ok :/
[13:22] https://github.com/WikiTeam/wikiteam/issues/102 is this reproducible?
[13:26] Nemo_bis: dumpgenerator seems to be able to dump non-API wikis
[13:26] via index.php
[13:26] is this right?
[13:35] Nemo_bis: ya there?
[13:35] saveTitles -- the output = line
[13:35] it's assuming the input is ASCII
[13:35] is this correct?
[13:35] I'm having it crash on http://qed.princeton.edu/index.php
[13:37] there are other places where u'text' is used
[13:37] balrog: yes, it should be able to screenscrape all versions
[13:38] that wiki crashes for another reason, or at least it did last time I tried
[13:38] Nemo_bis: my point is, should all u'text' instances be changed to unicode('text', 'utf-8')?
[13:38] yeah, that's why I tried it :)
[13:38] balrog: https://github.com/WikiTeam/wikiteam/issues/89
[13:38] I know, I got to it from that page
[13:38] ah
[13:38] sorry
[13:38] it quickly crashed with a unicode error though
[13:38] specifically in saveTitles
[13:39] That didn't happen to me ;/
[13:39] maybe they added a title with unicode in it
[13:39] u'text' seems to interpret text as ASCII
[13:39] and if there's a non-ASCII char in there, it just bails
[13:40] I don't remember ever having problems with unicode characters in recent years
[13:40] can you try scraping that site?
[13:40] it should fail pretty quickly
[13:40] or try this
[13:41] I believe you :)
[13:41] err, hang on.
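[Editor's note: the failure mode balrog is describing above is Python 2's implicit ASCII decode: formatting a UTF-8 byte string into a u'…' literal makes the interpreter decode the bytes as ASCII, which bails on any byte ≥ 0x80. A minimal sketch of the crash and the explicit-decode fix (written so it also runs on Python 3, with the raw title spelled as a bytes literal):]

```python
# A page title as fetched from a UTF-8 wiki: "Avalokiteśvara"
raw = b'Avalokite\xc5\x9bvara'

# What u"%s" % raw does under the hood on Python 2: an implicit
# ASCII decode, which fails on the first non-ASCII byte (0xc5)
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc5 in position 9 ...

# The fix: decode the fetched bytes as UTF-8 explicitly,
# then format unicode with unicode
title = raw.decode('utf-8')
print(u"%s\n--END--" % title)
```

[A related belt-and-braces option on Python 2 is to open output files with codecs.open(path, 'w', encoding='utf-8') so the encoding stays explicit on the write side as well.]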
[13:41] I'm just thinking how to make it more robust
[13:42] open a python shell
[13:42] k = (u"%s\n--END--" % 'Avalokite\xc5\x9bvara\nCour')
[13:42] those bytes encode an ś apparently
[13:42] I'm told one should use the "codecs" module to avoid unicode headaches in python
[13:43] or migrate to python 3 ;)
[13:43] Heh, can't do that yet though
[13:43] heh why?
[13:43] can you test that in a python shell, though?
[13:43] see what you get
[13:43] We have some users stuck with 2.6 even, or 2.7
[13:44] UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 9: ordinal not in range(128)
[13:44] yep
[13:44] with Python 2.7.5
[13:46] this wiki has titles and stuff in utf-8
[13:52] this may not be the case for all websites though
[14:10] balrog: I've archived all sorts of unicode websites
[14:10] with dumpgenerator?
[14:11] yes
[14:11] hundreds of Russian and Chinese wikis
[14:12] that said, dumpgenerator is held together with duct tape so I'm sure it does plenty of wrong things with unicode strings as well :)
[14:13] are wikis never not in unicode?
[14:13] ever*
[14:14] I think the default was switched to utf-8 in 2004 or so
[14:19] have you scraped a site without an API which had unicode titles?
[14:20] the API might be urlencoding them
[14:20] can you find me a small wiki with unicode titles?
[14:24] balrog: try https://archive.org/search.php?query=collection%3Awikiteam%20language%3Arussian maybe?
[14:26] got one
[14:33] fixed
[14:36] see PR
[14:37] Nemo_bis: ^
[14:38] are you sure undoHTMLEntities shouldn't be called earlier?
[14:40] no
[14:42] I think it should be
[14:44] Nemo_bis: is this a bug?
[14:44] https://github.com/WikiTeam/wikiteam/commit/d60e560571cb412f9296cc7a09919f7c47f8d9ac#diff-d1709eb16c251acb84fe017e89703340L258
[14:44] req2 = urllib2.Request(url=url, headers={'User-Agent': getUserAgent()})
[14:45] raw2 = urllib2.urlopen(req).read()
[14:45] req2 isn't ever used
[14:47] embarrassing
[14:47] that's surely one of my mistakes
[15:14] not a very productive update https://github.com/WikiTeam/wikiteam/commit/c420d4d843feb1bf66d0b284d87f07b20035ea5e
[15:30] Nemo_bis: can you fix that mistake?
[15:34] ok, done, credited you in the summary https://github.com/WikiTeam/wikiteam/commit/62d961fa979649423d95a5ca0dd1a5ba081d8cf2
[15:34] I didn't want to steal your typospotting :)
[15:37] more like spotted by python :)
[15:38] it seems this would be a good candidate to break into multiple repos
[15:42] balrog: emijrp asked for opinions on how to split the repo at https://groups.google.com/forum/#!forum/wikiteam-discuss , it would be useful if you could reply there
[15:50] ahhh ok
[15:50] Nemo_bis: I don't think that test is necessary
[15:50] the reason I used it with an API-enabled wiki was so I could see the difference in the output between API and scraping
[15:50] and my fix makes the output the same
[16:11] Nemo_bis: I'm testing on qed.princeton.edu
[16:12] cute
[16:13] Nemo_bis: it would be nice if we could store dumpgenerator output.
[16:13] it would certainly make debugging easier.
[16:13] should I add such a feature? :)
[16:16] balrog: does redirection / tee not work? :)
[16:16] yes but no one does it :P
[16:17] But yes, it would be useful. Think of a way compatible with launcher.py if possible
[16:17] how about dumping the output into the same location as the wiki dump?
[16:17] hah you can just replace sys.stdout with your own version :P
[16:18] or you can use the logging module, which is less hacky
[16:18] don't ask me for "less hacky" things
[16:18] Oh, I forgot, there are WINDOWS USERS
[16:18] No idea what they do
[16:19] ugh.