Time |
Nickname |
Message |
12:51
🔗
|
balrog |
Nemo_bis: tested with vagrant and seems to work |
13:08
🔗
|
Nemo_bis |
ah great |
13:13
🔗
|
balrog |
I made a PR |
13:14
🔗
|
Nemo_bis |
already merged |
13:14
🔗
|
Nemo_bis |
balrog: this was the reason we added POST https://github.com/WikiTeam/wikiteam/issues/68 |
13:14
🔗
|
Nemo_bis |
no it's not |
13:15
🔗
|
balrog |
If you found any bug, report a new issue here (Google account required): http://code.google.com/p/wikiteam/issues/list |
13:15
🔗
|
balrog |
that needs to be changed :) |
13:15
🔗
|
Nemo_bis |
PR PR PR |
13:16
🔗
|
balrog |
meh, not worth it |
13:16
🔗
|
balrog |
can be fixed in the github editor even |
13:19
🔗
|
balrog |
Nemo_bis: has the wiki been migrated? |
13:22
🔗
|
Nemo_bis |
balrog: yes but the formatting has to be fixed still |
13:22
🔗
|
balrog |
ok :/ |
13:22
🔗
|
balrog |
https://github.com/WikiTeam/wikiteam/issues/102 is this reproducible? |
13:26
🔗
|
balrog |
Nemo_bis: dumpgenerator seems to be able to dump non-api wikis |
13:26
🔗
|
balrog |
via index.php |
13:26
🔗
|
balrog |
is this right? |
13:35
🔗
|
balrog |
Nemo_bis: ya there? |
13:35
🔗
|
balrog |
saveTitles -- the output = line |
13:35
🔗
|
balrog |
it's assuming input is in ascii |
13:35
🔗
|
balrog |
is this correct? |
13:35
🔗
|
balrog |
I'm having it crash on http://qed.princeton.edu/index.php |
13:37
🔗
|
balrog |
there are other places where u'text' is used |
13:37
🔗
|
Nemo_bis |
balrog: yes, it should be able to screenscrape all versions |
13:38
🔗
|
Nemo_bis |
that wiki crashes for another reason, or at least it did last time I tried |
13:38
🔗
|
balrog |
Nemo_bis: my point is, should all u'text' instances be changed to unicode('text', 'utf-8') ? |
13:38
🔗
|
balrog |
yeah, that's why I tried it :) |
13:38
🔗
|
Nemo_bis |
balrog: https://github.com/WikiTeam/wikiteam/issues/89 |
13:38
🔗
|
balrog |
I know, I got to it from that page |
13:38
🔗
|
Nemo_bis |
ah |
13:38
🔗
|
Nemo_bis |
sorry |
13:38
🔗
|
balrog |
quickly crashed with unicode though |
13:38
🔗
|
balrog |
specifically in saveTitles |
13:39
🔗
|
Nemo_bis |
That didn't happen to me ;/ |
13:39
🔗
|
balrog |
maybe they added a title with unicode in it |
13:39
🔗
|
balrog |
u'text' seems to interpret text as ascii |
13:39
🔗
|
balrog |
and if there's a unicode char in there, it just bails |
13:40
🔗
|
Nemo_bis |
I don't remember ever having problems with unicode characters in the last years |
13:40
🔗
|
balrog |
can you try scraping that site? |
13:40
🔗
|
balrog |
it should fail pretty quickly |
13:40
🔗
|
balrog |
or try this |
13:41
🔗
|
Nemo_bis |
I believe you :) |
13:41
🔗
|
balrog |
err, hang on. |
13:41
🔗
|
Nemo_bis |
I'm just thinking how to get more robust |
13:42
🔗
|
balrog |
open a python shell |
13:42
🔗
|
balrog |
k = (u"%s\n--END--" % 'Avalokite\xc5\x9bvara\nCour') |
13:42
🔗
|
balrog |
that unicode codes for an ś apparently |
13:42
🔗
|
Nemo_bis |
I'm told one should use "codecs" module to avoid unicode headaches in python |
13:43
🔗
|
balrog |
or migrate to python 3 ;) |
13:43
🔗
|
Nemo_bis |
Heh, can't do that yet though |
13:43
🔗
|
balrog |
heh why? |
13:43
🔗
|
balrog |
can you test that in a python shell, though? |
13:43
🔗
|
balrog |
see what you get |
13:43
🔗
|
Nemo_bis |
We have some users stuck with 2.6 even, or 2.7 |
13:44
🔗
|
Nemo_bis |
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 9: ordinal not in range(128) |
13:44
🔗
|
balrog |
yep |
13:44
🔗
|
Nemo_bis |
with Python 2.7.5 |
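The traceback above is what Python 2 produces when a byte string is interpolated into a u"..." template: the bytes are implicitly decoded as ASCII. A minimal sketch of the same failure and of decoding explicitly first, assuming the scraped titles really are UTF-8 (`raw_title` and `format_title` are made-up names for illustration, not taken from dumpgenerator):

```python
# -*- coding: utf-8 -*-
# Byte string as scraped from the page: "Avalokiteśvara" encoded as UTF-8.
raw_title = b'Avalokite\xc5\x9bvara'

# Python 2 does this implicitly for u"%s" % raw_title: decode as ASCII.
try:
    raw_title.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc  # byte 0xc5 is outside the ASCII range


def format_title(raw):
    """Decode explicitly as UTF-8 (an assumption), then format as text."""
    return u"%s\n--END--" % raw.decode('utf-8')


formatted = format_title(raw_title)
```

Decoding at the point where bytes enter the program keeps every later string operation on text, which is the usual way to avoid this class of error. A wiki served in another encoding would need that encoding instead of the hard-coded 'utf-8'.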
13:46
🔗
|
balrog |
this wiki has titles and stuff in utf-8 |
13:52
🔗
|
balrog |
this may not be the case from all websites though |
14:10
🔗
|
Nemo_bis |
balrog: I've archived all sorts of Unicode websites |
14:10
🔗
|
balrog |
with dumpgenerator? |
14:11
🔗
|
Nemo_bis |
yes |
14:11
🔗
|
Nemo_bis |
hundreds of Russian and Chinese wikis |
14:12
🔗
|
Nemo_bis |
that said, dumpgenerator is kept together with duct tape so I'm sure it does plenty of wrong things about unicode strings as well :) |
14:13
🔗
|
balrog |
are wikis never not in unicode? |
14:13
🔗
|
balrog |
ever* |
14:14
🔗
|
Nemo_bis |
I think the default was switched to utf-8 in 2004 or so |
14:19
🔗
|
balrog |
have you scraped a site without api which had unicode titles? |
14:20
🔗
|
balrog |
API might be urlencoding them |
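For reference, "urlencoding" a title would look like the sketch below, written with Python 3's `urllib.parse`. Whether the MediaWiki API actually returns titles in this form is balrog's guess; the log does not confirm it.

```python
from urllib.parse import quote, unquote

title = u'Avalokite\u015bvara'  # "Avalokiteśvara"

# quote() encodes the text as UTF-8 and percent-escapes the non-ASCII bytes.
encoded = quote(title)

# unquote() reverses it, decoding the escaped bytes as UTF-8 by default.
decoded = unquote(encoded)
```

A percent-encoded title contains only ASCII characters, so it would pass through an implicit ASCII decode without error, which is one way the API path could hide the problem.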
14:20
🔗
|
balrog |
can you find me a small wiki with unicode titles? |
14:24
🔗
|
Nemo_bis |
balrog: try https://archive.org/search.php?query=collection%3Awikiteam%20language%3Arussian maybe? |
14:26
🔗
|
balrog |
got one |
14:33
🔗
|
balrog |
fixed |
14:36
🔗
|
balrog |
see PR |
14:37
🔗
|
balrog |
Nemo_bis: ^ |
14:38
🔗
|
balrog |
are you sure undoHTMLEntities shouldn't be called earlier? |
14:40
🔗
|
Nemo_bis |
no |
14:42
🔗
|
balrog |
I think it should be |
14:44
🔗
|
balrog |
Nemo_bis: is this a bug? |
14:44
🔗
|
balrog |
https://github.com/WikiTeam/wikiteam/commit/d60e560571cb412f9296cc7a09919f7c47f8d9ac#diff-d1709eb16c251acb84fe017e89703340L258 |
14:44
🔗
|
balrog |
req2 = urllib2.Request(url=url, headers={'User-Agent': getUserAgent()}) |
14:45
🔗
|
balrog |
raw2 = urllib2.urlopen(req).read() |
14:45
🔗
|
balrog |
req2 isn't ever used |
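A sketch of what the corrected pair of lines presumably looks like: build the request with the User-Agent header and pass that same object to `urlopen`. It is written with Python 3's `urllib.request` rather than the `urllib2` shown in the log, and `get_user_agent` is a hypothetical stand-in for dumpgenerator's `getUserAgent()`.

```python
import urllib.request


def get_user_agent():
    # Hypothetical stand-in for dumpgenerator's getUserAgent().
    return 'Mozilla/5.0 (example)'


def build_request(url):
    # Attach the User-Agent header to the request object.
    return urllib.request.Request(
        url=url, headers={'User-Agent': get_user_agent()})


def fetch(url):
    req2 = build_request(url)
    # Pass req2 itself, so the header built above is actually sent.
    return urllib.request.urlopen(req2).read()
```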
14:47
🔗
|
Nemo_bis |
embarrassing |
14:47
🔗
|
Nemo_bis |
that's surely one of my mistakes |
15:14
🔗
|
Nemo_bis |
not very productive update https://github.com/WikiTeam/wikiteam/commit/c420d4d843feb1bf66d0b284d87f07b20035ea5e |
15:30
🔗
|
balrog |
Nemo_bis: can you fix that mistake? |
15:34
🔗
|
Nemo_bis |
ok, done, credited you in summary https://github.com/WikiTeam/wikiteam/commit/62d961fa979649423d95a5ca0dd1a5ba081d8cf2 |
15:34
🔗
|
Nemo_bis |
I didn't want to steal your typospotting :) |
15:37
🔗
|
balrog |
more like spotted by python :) |
15:38
🔗
|
balrog |
it seems this would be a good candidate to break into multiple repos |
15:42
🔗
|
Nemo_bis |
balrog: emijrp asked opinions on how to split the repo at https://groups.google.com/forum/#!forum/wikiteam-discuss , it would be useful if you could reply there |
15:50
🔗
|
balrog |
ahhh ok |
15:50
🔗
|
balrog |
Nemo_bis: I don't think that test is necessary |
15:50
🔗
|
balrog |
the reason I used it with an api-enabled wiki was so I could see the difference in the output between api and scraped |
15:50
🔗
|
balrog |
and my fix makes the output the same |
16:11
🔗
|
balrog |
Nemo_bis: I'm testing on qed.princeton.edu |
16:12
🔗
|
Nemo_bis |
cute |
16:13
🔗
|
balrog |
Nemo_bis: it would be nice if we could store dumpgenerator output. |
16:13
🔗
|
balrog |
would certainly make debugging easier. |
16:13
🔗
|
balrog |
should I add such a feature? :) |
16:16
🔗
|
Nemo_bis |
balrog: does redirection / tee not work? :) |
16:16
🔗
|
balrog |
yes but no one does it :P |
16:17
🔗
|
Nemo_bis |
But yes, it would be useful. Think of a way compatible with launcher.py if possible |
16:17
🔗
|
balrog |
how about dumping output into the same location as the wiki dump? |
16:17
🔗
|
balrog |
hah you can just replace sys.stdout with your version :P |
16:18
🔗
|
balrog |
or you can use the logging module which is less hacky |
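A rough sketch of the logging idea: every message goes both to the console and to a log file stored next to the dump. The function name `setup_dump_logging` and the file name `dumpgenerator.log` are hypothetical, not existing dumpgenerator features.

```python
import logging
import os
import sys


def setup_dump_logging(dump_dir, name='dumpgenerator'):
    """Return (logger, log_path); the logger writes to stdout and to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers from an earlier call so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    log_path = os.path.join(dump_dir, 'dumpgenerator.log')
    file_handler = logging.FileHandler(log_path, encoding='utf-8')
    console_handler = logging.StreamHandler(sys.stdout)
    for handler in (file_handler, console_handler):
        handler.setFormatter(logging.Formatter('%(asctime)s %(message)s'))
        logger.addHandler(handler)
    return logger, log_path
```

Calling `logger.info(...)` where the script currently prints would keep the console output and also leave a copy in the dump directory, without relying on shell redirection or tee.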
16:18
🔗
|
Nemo_bis |
don't ask me for "less hacky" things |
16:18
🔗
|
Nemo_bis |
Oh, I forgot, there are WINDOWS USERS |
16:18
🔗
|
Nemo_bis |
No idea what they do |
16:19
🔗
|
balrog |
ugh. |