[10:37] *** vitzli has joined #wikiteam
[10:40] hello, I am trying to dump lurkmore.to with dumpgenerator from https://github.com/WikiTeam/wikiteam. I've changed useragents[0] to useragents[1] in dumpgenerator.py, but now I get "XML for "Main_Page" is wrong. Waiting 20 seconds and reloading..." - I've replaced Main_Page with another name, but I got the same result. How can I fix this? Is it possible to use debug mode for that page?
[10:42] It's a bit urgent, since they are moving the wiki to read-only mode (there is some probability that this wiki will disappear one day, because of censorship issues in Russia)
[11:21] if it is read-only mode: NO PANIC
[11:21] you can still get the data then, vitzli
[11:22] yes, I hope so, just a little bit nervous about it
[11:23] please provide the exact command you execute, so others can retry
[11:26] I'm doing python dumpgenerator.py --api=https://lurkmore.co/api.php --index=https://lurkmore.co/index.php --xml --curonly (I'm using this command from issue #201: https://github.com/WikiTeam/wikiteam/issues/201). It does not work with useragents[0], but when I use another UA string it does work (except that wget gives 404 and googlebot is a bit tricky). The UA can be Firefox for Windows, Firefox for Linux, or Chrome, AFAIK
[11:30] debian stretch, did pip install --user -r requirements.txt, kitchen from the debian repo, everything else from pip; I could put some junk there since I'm doing this inside a VM
[11:37] thx, please wait until someone from the team looks at it, I can't run anything right now
[11:37] *** behind_yo is now known as Erkan
[11:38] Should I open an issue on GitHub?
[11:39] It seems this problem is related to issue #241
[11:40] I'd first check whether you can normally export a page via Special:Export
[11:40] just to be sure nothing is wrong there
[11:45] Yes, it does work, but it uses the translated Cyrillic name for Special:Export - "Служебная:Export"
[11:53] ok, that's fine then
[11:55] https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 - sorry for the percent-encoded chars
[12:29] Erkan, I'm sorry for the panic, I think I found the problem! Something seems wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download XML pages
[12:30] by default, it seems to be 'gzip,deflate'
[12:32] It could be caused by squid too, I have a transparent proxy between my VM and the website
[12:35] *** Erkan has quit IRC (Ping timeout: 512 seconds)
[12:45] *** behind_yo has joined #wikiteam
[12:48] behind_yo, I'm sorry for the panic, I think I found the problem! Something seems wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download XML pages. The default Accept-Encoding seems to be 'gzip,deflate'. It could be caused by my proxy too, I have a transparent one between my VM and the website
[12:48] *** behind_yo is now known as Erkan
[12:48] hi, the last I saw was: "It could be caused..."
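A minimal sketch of the header workaround vitzli describes above, assuming a requests.Session set up the way dumpgenerator.py does it (session.headers.update({'User-Agent': getUserAgent()}), per the 17:22 message below); the inline getUserAgent() stub and its User-Agent string are placeholders, not the script's actual values:

```python
import requests

def getUserAgent():
    # Placeholder: dumpgenerator.py picks this from its own useragents list;
    # any browser-like string (Firefox/Chrome) reportedly works for this wiki.
    return 'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0'

session = requests.Session()
# Stock behaviour: only the User-Agent is set, and requests sends
# 'Accept-Encoding: gzip, deflate' on its own (the default mentioned at 12:30).
# Workaround: force plain 'deflate' (plus the Accept-Language vitzli added),
# which made the wiki return readable XML.
session.headers.update({
    'User-Agent': getUserAgent(),
    'Accept-Encoding': 'deflate',
    'Accept-Language': 'ru,en;q=0.5',
})
```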
[12:48] ah, I see you summarized it all now
[12:50] ~3600 pages downloaded now
[13:00] yay
[13:12] tested on a (very) remote VPS - it looks like squid is not the reason; when I added Accept-Encoding: deflate, it did the trick again
[13:13] so it is caused by either the python-requests lib or the remote webserver itself
[13:14] about 5000 pages downloaded, crashed 2 times
[14:37] ~18000 pages downloaded
[14:59] when I use --images, does that only back up images or all uploaded files?
[17:15] Fletcher: all files
[17:16] cheers Nemo_bis
[17:16] vitzli: iirc we let python-requests set the deflate header, weird that it doesn't work
[17:22] Nemo_bis, I've actually added 'Accept-Encoding':'deflate','Accept-Language':'ru,en;q=0.5' everywhere session.headers.update({'User-Agent': getUserAgent()}) is called, just to be sure, but I think Accept-Encoding alone would have worked
[17:24] it seems to be an issue with the webserver software; I found it when I tried to use the pure requests lib to get XML data and got an unreadable piece of junk.
[17:25] ran the network monitor in Firefox, saw its headers, added them to the requests code, and it worked!
[17:26] then patched dumpgenerator and managed to download almost all the text pages, maybe about 30000. Unfortunately, I had to split the namespaces between 2 hosts
[17:51] may I post the link to the .7z archive?
[17:55] *** vitzli has quit IRC (Quit: Leaving)
[18:13] one can always post links in archiveteam channels
[21:40] so maybe they blocked our user-agent
[21:41] Or they require ru in Accept-Language, but that would be weird :)
[22:13] *** Command-S has quit IRC (Read error: Connection reset by peer)
[22:13] *** Control-S has joined #wikiteam
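A rough reconstruction of the standalone check described at 17:24-17:25 (hypothetical code, not the commands actually used): fetch one page through the localized Special:Export URL from 11:55 with the default 'gzip,deflate' encoding and again with plain 'deflate', and compare which response comes back as readable XML. The User-Agent string is a placeholder for whatever Firefox sent.

```python
import requests

# Localized Special:Export ("Служебная:Export") URL for the main page, as posted at 11:55.
EXPORT_URL = ('https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/'
              '%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0')

# Browser-like headers copied from Firefox's network monitor; the UA string is a placeholder.
BASE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0',
    'Accept-Language': 'ru,en;q=0.5',
}

for encoding in ('gzip,deflate', 'deflate'):
    r = requests.get(EXPORT_URL, headers={**BASE_HEADERS, 'Accept-Encoding': encoding})
    # With the working encoding the body starts with the <mediawiki ...> XML root;
    # with the broken one it came back as undecodable junk in vitzli's test.
    print(encoding, r.status_code, repr(r.text[:60]))
```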