[10:37] *** vitzli has joined #wikiteam
[10:40] hello, I am trying to dump lurkmore.to with dumpgenerator from https://github.com/WikiTeam/wikiteam. I've changed useragents[0] to useragents[1] in dumpgenerator.py, but now I get "XML for "Main_Page" is wrong. Waiting 20 seconds and reloading..." - I've replaced Main_Page with another name, but I got the same result. How can I fix this? Is it possible to use debug mode for that page?
[10:42] It's a bit urgent, since they are moving the wiki to read-only mode (there is some probability that this wiki will disappear one day, because of censorship issues in Russia)
[11:21] if it is read-only mode: NO PANIC
[11:21] you can still get the data then, vitzli
[11:22] yes, I hope so, just a little bit nervous about it
[11:23] please provide the exact command you execute, so others can retry
[11:26] I'm doing python dumpgenerator.py --api=https://lurkmore.co/api.php --index=https://lurkmore.co/index.php --xml --curonly (I'm using this command from issue #201: https://github.com/WikiTeam/wikiteam/issues/201). It does not work with useragents[0], but when I use another UA string it does work (except that wget gives 404 and googlebot is a bit tricky). The UA can be Firefox for Windows, Firefox for Linux, or Chrome, AFAIK
[11:30] debian stretch, did pip install --user -r requirements.txt, kitchen from the debian repo, everything else from pip; I could put some junk there since I'm doing this inside a VM
[11:37] thx, please wait until someone from the team looks at it, I can't run anything right now
[11:37] *** behind_yo is now known as Erkan
[11:38] Should I open an issue on GitHub?
[11:39] It seems this problem is related to issue #241
[11:40] I'd first check whether you can normally export a page via Special:Export
[11:40] just to be sure nothing is wrong there
[11:45] Yes, it does work, but it uses the translated Cyrillic name for Special:Export - "Служебная:Export"
[11:53] ok, that's fine then
[11:55] https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 - sorry for the percent-encoded chars
[12:29] Erkan, I'm sorry for the panic, I think I found the problem! Something seems wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download XML pages
[12:30] by default, it seems to be 'gzip,deflate'
[12:32] It could be caused by squid too, I have a transparent proxy between my VM and the website
[12:35] *** Erkan has quit IRC (Ping timeout: 512 seconds)
[12:45] *** behind_yo has joined #wikiteam
[12:48] behind_yo, I'm sorry for the panic, I think I found the problem! Something seems wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download XML pages. The default Accept-Encoding seems to be 'gzip,deflate'. It could be caused by my proxy too, I have a transparent one between my VM and the website
[12:48] *** behind_yo is now known as Erkan
[12:48] hi, the last I saw was: "It could be caused..."
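A minimal sketch of the header workaround vitzli describes above, assuming a requests.Session set up the way dumpgenerator.py does it (session.headers.update({'User-Agent': getUserAgent()}), per the 17:22 message below); the inline getUserAgent() stub and its User-Agent string are placeholders, not the script's actual values:

```python
import requests

def getUserAgent():
    # Placeholder: dumpgenerator.py picks this from its own useragents list;
    # any browser-like string (Firefox/Chrome) reportedly works for this wiki.
    return 'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0'

session = requests.Session()
# Stock behaviour: only the User-Agent is set, and requests sends
# 'Accept-Encoding: gzip, deflate' on its own (the default mentioned at 12:30).
# Workaround: force plain 'deflate' (plus the Accept-Language vitzli added),
# which made the wiki return readable XML.
session.headers.update({
    'User-Agent': getUserAgent(),
    'Accept-Encoding': 'deflate',
    'Accept-Language': 'ru,en;q=0.5',
})
```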
[12:48] ah, I see you summarized it all now
[12:50] ~3600 pages downloaded now
[13:00] yay
[13:12] tested on a (very) remote VPS - it looks like squid is not the reason; when I added Accept-Encoding: deflate, it did the trick again
[13:13] so it is caused by either the python-requests lib or the remote webserver itself
[13:14] about 5000 pages downloaded, crashed 2 times
[14:37] ~18000 pages downloaded
[14:59] when I use --images, does that only back up images or all uploaded files?
[17:15] Fletcher: all files
[17:16] cheers Nemo_bis
[17:16] vitzli: iirc we let python-requests set the deflate header, weird that it doesn't work
[17:22] Nemo_bis, I've actually added 'Accept-Encoding':'deflate','Accept-Language':'ru,en;q=0.5' everywhere session.headers.update({'User-Agent': getUserAgent()}) is called, just to be sure, but I think Accept-Encoding alone would have worked
[17:24] it seems to be an issue with the webserver software; I found it when I tried to use the pure requests lib to get XML data and got an unreadable piece of junk.
[17:25] ran the network monitor in Firefox, saw its headers, added them to the requests code, and it worked!
[17:26] then patched dumpgenerator and managed to download almost all the text pages, maybe about 30000. Unfortunately, I had to split the namespaces between 2 hosts
[17:51] may I post the link to the .7z archive?
[17:55] *** vitzli has quit IRC (Quit: Leaving)
[18:13] one can always post links in archiveteam channels
[21:40] so maybe they blocked our user-agent
[21:41] Or they require ru in Accept-Language, but that would be weird :)
[22:13] *** Command-S has quit IRC (Read error: Connection reset by peer)
[22:13] *** Control-S has joined #wikiteam
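A rough reconstruction of the standalone check described at 17:24-17:25 (hypothetical code, not the commands actually used): fetch one page through the localized Special:Export URL from 11:55 with the default 'gzip,deflate' encoding and again with plain 'deflate', and compare which response comes back as readable XML. The User-Agent string is a placeholder for whatever Firefox sent.

```python
import requests

# Localized Special:Export ("Служебная:Export") URL for the main page, as posted at 11:55.
EXPORT_URL = ('https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/'
              '%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0')

# Browser-like headers copied from Firefox's network monitor; the UA string is a placeholder.
BASE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0',
    'Accept-Language': 'ru,en;q=0.5',
}

for encoding in ('gzip,deflate', 'deflate'):
    r = requests.get(EXPORT_URL, headers={**BASE_HEADERS, 'Accept-Encoding': encoding})
    # With the working encoding the body starts with the <mediawiki ...> XML root;
    # with the broken one it came back as undecodable junk in vitzli's test.
    print(encoding, r.status_code, repr(r.text[:60]))
```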