#wikiteam 2014-06-27,Fri

Time Nickname Message
12:51 🔗 balrog Nemo_bis: tested with vagrant and seems to work
13:08 🔗 Nemo_bis ah great
13:13 🔗 balrog I made a PR
13:14 🔗 Nemo_bis already merged
13:14 🔗 Nemo_bis balrog: this was the reason we added POST https://github.com/WikiTeam/wikiteam/issues/68
13:14 🔗 Nemo_bis no it's not
13:15 🔗 balrog "If you found any bug, report a new issue here (Google account required): http://code.google.com/p/wikiteam/issues/list"
13:15 🔗 balrog that needs to be changed :)
13:15 🔗 Nemo_bis PR PR PR
13:16 🔗 balrog meh, not worth it
13:16 🔗 balrog can be fixed in the github editor even
13:19 🔗 balrog Nemo_bis: has the wiki been migrated?
13:22 🔗 Nemo_bis balrog: yes but the formatting has to be fixed still
13:22 🔗 balrog ok :/
13:22 🔗 balrog https://github.com/WikiTeam/wikiteam/issues/102 is this reproducible?
13:26 🔗 balrog Nemo_bis: dumpgenerator seems to be able to dump non-api wikis
13:26 🔗 balrog via index.php
13:26 🔗 balrog is this right?
13:35 🔗 balrog Nemo_bis: ya there?
13:35 🔗 balrog saveTitles -- the output = line
13:35 🔗 balrog it's assuming input is in ascii
13:35 🔗 balrog is this correct?
13:35 🔗 balrog I'm having it crash on http://qed.princeton.edu/index.php
13:37 🔗 balrog there are other places where u'text' is used
13:37 🔗 Nemo_bis balrog: yes, it should be able to screenscrape all versions
13:38 🔗 Nemo_bis that wiki crashes for another reason, or at least it did last time I tried
13:38 🔗 balrog Nemo_bis: my point is, should all u'text' instances be changed to unicode('text', 'utf-8') ?
13:38 🔗 balrog yeah, that's why I tried it :)
13:38 🔗 Nemo_bis balrog: https://github.com/WikiTeam/wikiteam/issues/89
13:38 🔗 balrog I know, I got to it from that page
13:38 🔗 Nemo_bis ah
13:38 🔗 Nemo_bis sorry
13:38 🔗 balrog quickly crashed with unicode though
13:38 🔗 balrog specifically in saveTitles
13:39 🔗 Nemo_bis That didn't happen to me ;/
13:39 🔗 balrog maybe they added a title with unicode in it
13:39 🔗 balrog u'text' seems to interpret text as ascii
13:39 🔗 balrog and if there's a unicode char in there, it just bails
13:40 🔗 Nemo_bis I don't remember ever having problems with unicode characters in the last years
13:40 🔗 balrog can you try scraping that site?
13:40 🔗 balrog it should fail pretty quickly
13:40 🔗 balrog or try this
13:41 🔗 Nemo_bis I believe you :)
13:41 🔗 balrog err, hang on.
13:41 🔗 Nemo_bis I'm just thinking how to get more robust
13:42 🔗 balrog open a python shell
13:42 🔗 balrog k = (u"%s\n--END--" % 'Avalokite\xc5\x9bvara\nCour')
13:42 🔗 balrog those bytes (\xc5\x9b) are the UTF-8 encoding of ś apparently
13:42 🔗 Nemo_bis I'm told one should use "codecs" module to avoid unicode headaches in python
13:43 🔗 balrog or migrate to python 3 ;)
13:43 🔗 Nemo_bis Heh, can't do that yet though
13:43 🔗 balrog heh why?
13:43 🔗 balrog can you test that in a python shell, though?
13:43 🔗 balrog see what you get
13:43 🔗 Nemo_bis We have some users stuck with 2.6 even, or 2.7
13:44 🔗 Nemo_bis UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 9: ordinal not in range(128)
13:44 🔗 balrog yep
13:44 🔗 Nemo_bis with Python 2.7.5
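The failure above comes from Python 2 implicitly decoding the byte string as ASCII when it is formatted into a u"..." template. Python 3 never mixes bytes and text implicitly, so the sketch below writes that ASCII decode out explicitly to reproduce the same error, then shows one possible fix: decode first with a known encoding. The helper names decode_as_ascii and format_titles are hypothetical, and UTF-8 is an assumption about this particular wiki.

```python
# Python 3 sketch. The explicit .decode('ascii') stands in for the implicit
# coercion that Python 2 performs in: u"%s\n--END--" % some_byte_string
raw_title = b'Avalokite\xc5\x9bvara\nCour'  # UTF-8 bytes quoted in the log


def decode_as_ascii(data):
    """Mimic Python 2's implicit byte-string-to-unicode coercion."""
    return data.decode('ascii')


def format_titles(data, encoding='utf-8'):
    """Decode explicitly (encoding is an assumption), then format as text."""
    return u"%s\n--END--" % data.decode(encoding)


try:
    decode_as_ascii(raw_title)
except UnicodeDecodeError as err:
    # Same complaint as in the log: byte 0xc5 at position 9.
    error_message = str(err)
    print(error_message)

# Decoding with the right codec before formatting avoids the crash.
formatted = format_titles(raw_title)
```

The position reported in the error is 9 because "Avalokite" occupies bytes 0 to 8 and the two-byte sequence for ś starts right after it.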
13:46 🔗 balrog this wiki has titles and stuff in utf-8
13:52 🔗 balrog this may not be the case from all websites though
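If some wikis serve titles in an encoding other than UTF-8, one defensive approach is to try a short list of encodings and write the result through a file object that encodes on the way out. This is only a sketch under assumptions: decode_title and save_titles are hypothetical names, not dumpgenerator functions, and the fallback order (UTF-8, then Latin-1) is a guess rather than something the log establishes.

```python
import io


def decode_title(raw, encodings=('utf-8', 'latin-1')):
    """Return text for a raw byte title, trying each encoding in turn."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed: keep what can be kept.
    return raw.decode('utf-8', 'replace')


def save_titles(path, raw_titles):
    """Write one decoded title per line, followed by the --END-- marker."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(u'\n'.join(decode_title(t) for t in raw_titles))
        handle.write(u'\n--END--')
```

io.open with an encoding argument plays the same role as the codecs module mentioned earlier and is available on Python 2.6, 2.7 and 3. Because Latin-1 accepts any byte sequence, the fallback never raises, but it can silently produce the wrong characters for other legacy encodings.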
14:10 🔗 Nemo_bis balrog: I've archived all sorts of Unicode websites
14:10 🔗 balrog with dumpgenerator?
14:11 🔗 Nemo_bis yes
14:11 🔗 Nemo_bis hundreds of Russian and Chinese wikis
14:12 🔗 Nemo_bis that said, dumpgenerator is kept together with duct tape so I'm sure it does plenty of wrong things about unicode strings as well :)
14:13 🔗 balrog are wikis never not in unicode?
14:13 🔗 balrog ever*
14:14 🔗 Nemo_bis I think the default was switched to utf-8 in 2004 or so
14:19 🔗 balrog have you scraped a site without api which had unicode titles?
14:20 🔗 balrog API might be urlencoding them
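As an illustration of why URL-encoded titles would sidestep the crash: percent-encoding turns the UTF-8 bytes of a title into plain ASCII, so an implicit ASCII decode has nothing to choke on. The snippet below only demonstrates that property with the standard library; it does not show that the MediaWiki API actually returns titles this way.

```python
from urllib.parse import quote, unquote

title = 'Avalokite\u015bvara'   # the title from the log, containing ś
encoded = quote(title)          # UTF-8 encode, then %XX-escape non-ASCII bytes
is_ascii_only = all(ord(ch) < 128 for ch in encoded)
round_tripped = unquote(encoded)

print(encoded)
```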
14:20 🔗 balrog can you find me a small wiki with unicode titles?
14:24 🔗 Nemo_bis balrog: try https://archive.org/search.php?query=collection%3Awikiteam%20language%3Arussian maybe?
14:26 🔗 balrog got one
14:33 🔗 balrog fixed
14:36 🔗 balrog see PR
14:37 🔗 balrog Nemo_bis: ^
14:38 🔗 balrog are you sure undoHTMLEntities shouldn't be called earlier?
14:40 🔗 Nemo_bis no
14:42 🔗 balrog I think it should be
14:44 🔗 balrog Nemo_bis: is this a bug?
14:44 🔗 balrog https://github.com/WikiTeam/wikiteam/commit/d60e560571cb412f9296cc7a09919f7c47f8d9ac#diff-d1709eb16c251acb84fe017e89703340L258
14:44 🔗 balrog req2 = urllib2.Request(url=url, headers={'User-Agent': getUserAgent()})
14:45 🔗 balrog raw2 = urllib2.urlopen(req).read()
14:45 🔗 balrog req2 isn't ever used
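The two quoted lines build req2 with a User-Agent header and then open a different variable, so the header never reaches the server. A minimal sketch of the intended pattern follows, written against Python 3's urllib.request (the successor of the urllib2 module in the log). build_request and fetch are hypothetical helpers and the opener parameter exists only to make the sketch testable offline; the fix that was actually committed may look different.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url=url, headers={'User-Agent': user_agent})


def fetch(url, user_agent, opener=urllib.request.urlopen):
    """Open the request that was just built, not some other variable."""
    req = build_request(url, user_agent)
    return opener(req).read()
```

Keeping construction and use of the request inside one small function makes it hard for the two variable names to drift apart again.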
14:47 🔗 Nemo_bis embarrassing
14:47 🔗 Nemo_bis that's surely one of my mistakes
15:14 🔗 Nemo_bis not very productive update https://github.com/WikiTeam/wikiteam/commit/c420d4d843feb1bf66d0b284d87f07b20035ea5e
15:30 🔗 balrog Nemo_bis: can you fix that mistake?
15:34 🔗 Nemo_bis ok, done, credited you in summary https://github.com/WikiTeam/wikiteam/commit/62d961fa979649423d95a5ca0dd1a5ba081d8cf2
15:34 🔗 Nemo_bis I didn't want to steal your typospotting :)
15:37 🔗 balrog more like spotted by python :)
15:38 🔗 balrog it seems this would be a good candidate to break into multiple repos
15:42 🔗 Nemo_bis balrog: emijrp asked opinions on how to split the repo at https://groups.google.com/forum/#!forum/wikiteam-discuss , it would be useful if you could reply there
15:50 🔗 balrog ahhh ok
15:50 🔗 balrog Nemo_bis: I don't think that test is necessary
15:50 🔗 balrog the reason I used it with an api-enabled wiki was so I could see the difference in the output between api and scraped
15:50 🔗 balrog and my fix makes the output the same
16:11 🔗 balrog Nemo_bis: I'm testing on qed.princeton.edu
16:12 🔗 Nemo_bis cute
16:13 🔗 balrog Nemo_bis: it would be nice if we could store dumpgenerator output.
16:13 🔗 balrog would certainly make debugging easier.
16:13 🔗 balrog should I add such a feature? :)
16:16 🔗 Nemo_bis balrog: does redirection / tee not work? :)
16:16 🔗 balrog yes but no one does it :P
16:17 🔗 Nemo_bis But yes, it would be useful. Think of a way compatible with launcher.py if possible
16:17 🔗 balrog how about dumping output into the same location as the wiki dump?
16:17 🔗 balrog hah you can just replace sys.stdout with your version :P
16:18 🔗 balrog or you can use the logging module which is less hacky
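The logging-module option could look roughly like this: one logger with two handlers, so every message goes to the console and also to a file stored next to the dump. This is a sketch, not dumpgenerator code; setup_dump_logger, the logger name and the dumpgenerator.log filename are all assumptions, and compatibility with launcher.py is not addressed here.

```python
import logging
import os
import sys


def setup_dump_logger(dump_dir, filename='dumpgenerator.log'):
    """Return a logger that writes to stdout and to a file in dump_dir."""
    logger = logging.getLogger('dumpgenerator-sketch')
    logger.setLevel(logging.INFO)

    # Drop handlers from an earlier call so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')

    console = logging.StreamHandler(sys.stdout)
    console.setFormatter(formatter)
    logger.addHandler(console)

    logfile = logging.FileHandler(os.path.join(dump_dir, filename),
                                  encoding='utf-8')
    logfile.setFormatter(formatter)
    logger.addHandler(logfile)
    return logger
```

Existing print calls would have to be switched to logger.info for this to capture them, which is the main cost compared with replacing sys.stdout. The same code should behave identically on Windows, since it avoids shell redirection and tee entirely.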
16:18 🔗 Nemo_bis don't ask me for "less hacky" things
16:18 🔗 Nemo_bis Oh, I forgot, there are WINDOWS USERS
16:18 🔗 Nemo_bis No idea what they do
16:19 🔗 balrog ugh.
