#wikiteam 2015-06-22,Mon


Time Nickname Message
10:37 πŸ”— vitzli has joined #wikiteam
10:40 πŸ”— vitzli hello, I am trying to dump lurkmore.to with dumpgenerator from https://github.com/WikiTeam/wikiteam. I've changed useragents[0] to useragents[1] in dumpgenerator.py, but now I get "XML for "Main_Page" is wrong. Waiting 20 seconds and reloading..." - I've replaced Main_Page with another name, but I got the same result. How can I fix this? Is it possible to use debug mode for that page?
10:42 πŸ”— vitzli It's a bit urgent, since they are moving the wiki to read-only mode (there is some probability that this wiki will disappear one day, because of censorship issues in Russia)
11:21 πŸ”— behind_yo if it is read-only mode: NO PANIC
11:21 πŸ”— behind_yo you can still get the data then, vitzli
11:22 πŸ”— vitzli yes, I hope so, just a little bit nervous about it
11:23 πŸ”— behind_yo please provide the exact command you execute, so others can retry
11:26 πŸ”— vitzli I'm doing python dumpgenerator.py --api=https://lurkmore.co/api.php --index=https://lurkmore.co/index.php --xml --curonly (I'm using this command from issue #201: https://github.com/WikiTeam/wikiteam/issues/201) it does not work with useragents[0], but when I'm using other UA string (except that wget gives 404 and googlebot is a bit tricky) - it does work. UA could be firefox for windows, ff for linux, chrome AFAIK
11:30 πŸ”— vitzli debian stretch, did pip install --user -r requirements.txt, kitchen from the debian repo, everything else from pip; I could put some junk there since I'm doing this inside a VM
11:37 πŸ”— behind_yo thx, please wait until someone from the team looks at it, I can't execute something right now
11:37 πŸ”— behind_yo is now known as Erkan
11:38 πŸ”— vitzli Should I open issue on github?
11:39 πŸ”— vitzli It seems that this problem relates to issue #241
11:40 πŸ”— Erkan I'd first check if you can normally export a page via Special:Export
11:40 πŸ”— Erkan just to be sure nothing wrong there
11:45 πŸ”— vitzli Yes, it does work, but it uses the translated Cyrillic name for Special:Export - "БлуТСбная:Export"
11:53 πŸ”— Erkan ok, that's fine then
11:55 πŸ”— vitzli https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 - sorry for quoted chars
12:29 πŸ”— vitzli Erkan, I'm sorry for the panic, I think I found the problem! Something seems wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download XML pages
12:30 πŸ”— vitzli by default, it seems to be 'gzip,deflate'
12:32 πŸ”— vitzli It could be caused by squid too, I have transparent proxy between my VM and website
12:35 πŸ”— Erkan has quit IRC (Ping timeout: 512 seconds)
12:45 πŸ”— behind_yo has joined #wikiteam
12:48 πŸ”— vitzli behind_yo, I'm sorry for the panic, I think I found the problem! Something seems wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download XML pages. The default Accept-Encoding seems to be 'gzip,deflate'. It could be caused by my proxy too, I have a transparent one between my VM and the website
12:48 πŸ”— behind_yo is now known as Erkan
12:48 πŸ”— Erkan hi, the last I saw was: "It could be caused..."
12:48 πŸ”— Erkan ah, I see you summarized all now
12:50 πŸ”— vitzli ~3600 pages downloaded now
13:00 πŸ”— Erkan yay
13:12 πŸ”— vitzli tested on a (very) remote VPS - it looks like squid is not the reason; when I added Accept-Encoding: deflate, it did the trick again
13:13 πŸ”— vitzli it is caused by either the python-requests lib or the remote webserver itself
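[editor's note: the workaround described in the log can be sketched with the Python standard library alone; urllib stands in for python-requests here only to show the header being pinned, and the URL is illustrative]

```python
import urllib.request

# Force 'Accept-Encoding: deflate' instead of the usual 'gzip,deflate'
# default - the workaround vitzli describes above. No request is sent
# here; we only build the request object to show the header.
req = urllib.request.Request(
    "https://lurkmore.to/api.php?action=query&format=xml",
    headers={"Accept-Encoding": "deflate"},
)
# urllib stores header names capitalized internally
print(req.get_header("Accept-encoding"))  # -> deflate
```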
13:14 πŸ”— vitzli about 5000 pages downloaded, crashed 2 times
14:37 πŸ”— vitzli ~18000 pages downloaded
14:59 πŸ”— Fletcher when I use --images does that only backup images or all uploaded files?
17:15 πŸ”— Nemo_bis Fletcher: all files
17:16 πŸ”— Fletcher cheers Nemo_bis
17:16 πŸ”— Nemo_bis vitzli: iirc we let python-requests set the deflate header, weird that it doesn't work
17:22 πŸ”— vitzli Nemo_bis, I've actually added 'Accept-Encoding':'deflate','Accept-Language':'ru,en;q=0.5' everywhere session.headers.update({'User-Agent': getUserAgent()}) is called, just to be sure, but I think Accept-Encoding alone did it
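[editor's note: a minimal sketch of the patch vitzli describes, assuming the python-requests Session API; the Firefox User-Agent string is illustrative, not the actual value returned by dumpgenerator.py's getUserAgent()]

```python
import requests

session = requests.Session()
# Applied wherever dumpgenerator.py calls
# session.headers.update({'User-Agent': getUserAgent()}):
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:38.0) "
                  "Gecko/20100101 Firefox/38.0",  # illustrative UA
    "Accept-Encoding": "deflate",     # the part that did the trick
    "Accept-Language": "ru,en;q=0.5",
})
print(session.headers["Accept-Encoding"])  # -> deflate
```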
17:24 πŸ”— vitzli it seems to be an issue with the webserver software; I found it when I tried to use the pure requests lib to get XML data and got an unreadable piece of junk.
17:25 πŸ”— vitzli ran the network monitor in Firefox, saw its headers, added them to the requests code and it worked!
17:26 πŸ”— vitzli then patched dumpgenerator and managed to download almost all text pages, maybe about 30000. Unfortunately, I had to split the namespaces between 2 hosts
17:51 πŸ”— vitzli may I post the link to the .7z archive?
17:55 πŸ”— vitzli has quit IRC (Quit: Leaving)
18:13 πŸ”— xmc one can always post links in archiveteam channels
21:40 πŸ”— Nemo_bis so maybe they blocked our user-agent
21:41 πŸ”— Nemo_bis Or they require ru in Accept-Language, but that would be weird :)
22:13 πŸ”— Command-S has quit IRC (Read error: Connection reset by peer)
22:13 πŸ”— Control-S has joined #wikiteam
