10:37 -- vitzli has joined #wikiteam
10:40 <vitzli> hello, I am trying to dump lurkmore.to with dumpgenerator from https://github.com/WikiTeam/wikiteam. I've changed useragents[0] to useragents[1] in dumpgenerator.py, but now I get "XML for "Main_Page" is wrong. Waiting 20 seconds and reloading..." - I've replaced Main_Page with another name, but got the same result. How can I fix this? Is it possible to use debug mode for that page?
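For reference, the change vitzli describes amounts to selecting a different entry from dumpgenerator.py's useragents list; a minimal sketch of that pattern (the list contents and function body here are illustrative stand-ins, not the script's actual code):

```python
# Illustrative stand-ins for the user-agent strings dumpgenerator.py ships with
useragents = [
    'Mozilla/5.0 (compatible; WikiTeam dumpgenerator)',                       # useragents[0]
    'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0',   # useragents[1]
]

def getUserAgent():
    """Return the user-agent string sent with every request."""
    return useragents[1]  # the described change: was useragents[0]
```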
10:42 <vitzli> It's a bit urgent, since they are moving the wiki to read-only mode (there is some probability that this wiki will disappear one day, because of censorship issues in Russia)
11:21 <behind_yo> if it is read-only mode: NO PANIC
11:21 <behind_yo> you can still get the data then, vitzli
11:22 <vitzli> yes, I hope so, just a little bit nervous about it
11:23 <behind_yo> please provide the exact command you execute, so others can retry
11:26 <vitzli> I'm running python dumpgenerator.py --api=https://lurkmore.co/api.php --index=https://lurkmore.co/index.php --xml --curonly (the command from issue #201: https://github.com/WikiTeam/wikiteam/issues/201). It does not work with useragents[0], but it does work when I use another UA string (except that wget gives a 404 and googlebot is a bit tricky). The UA could be Firefox for Windows, FF for Linux, or Chrome AFAIK
11:30 <vitzli> debian stretch; did the pip --user -r requirements.txt setup, kitchen from the debian repo, everything else from pip. I could put some junk in there since I'm doing this inside a VM
11:37 <behind_yo> thx, please wait until someone from the team looks at it, I can't execute anything right now
11:37 -- behind_yo is now known as Erkan
11:38 <vitzli> Should I open an issue on GitHub?
11:39 <vitzli> It seems that this problem is related to issue #241
11:40 <Erkan> I'd first check if you can normally export a page via Special:Export
11:40 <Erkan> just to be sure nothing's wrong there
11:45 <vitzli> Yes, it does work, but it uses the translated Cyrillic name for Special:Export - "Служебная:Export"
11:53 <Erkan> ok, that's fine then
11:55 <vitzli> https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 - sorry for the quoted chars
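Those percent-escapes are just the UTF-8 bytes of the localized page title; decoding them back is one standard-library call:

```python
from urllib.parse import unquote

url = ('https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F'
       ':Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0'
       '%D0%BD%D0%B8%D1%86%D0%B0')
print(unquote(url))
# -> https://lurkmore.to/Служебная:Export/Главная_страница
#    i.e. Special:Export/Main_page under the Russian localization
```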
12:29 <vitzli> Erkan, I'm sorry for the panic, I think I found the problem! Something seems to be wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download the xml pages
12:30 <vitzli> by default, it seems to be 'gzip,deflate'
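That default is easy to confirm against whatever requests version is installed:

```python
import requests

# requests fills in Accept-Encoding on every new session by default
print(requests.Session().headers['Accept-Encoding'])  # typically 'gzip, deflate'
```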
12:32 <vitzli> It could be caused by squid too; I have a transparent proxy between my VM and the website
12:35 -- Erkan has quit IRC (Ping timeout: 512 seconds)
12:45 -- behind_yo has joined #wikiteam
12:48 <vitzli> behind_yo, I'm sorry for the panic, I think I found the problem! Something seems to be wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download the xml pages. The default accept-encoding seems to be 'gzip,deflate'. It could be caused by my proxy too; I have a transparent one between my VM and the website
12:48 -- behind_yo is now known as Erkan
12:48 <Erkan> hi, the last I saw was: "It could be caused..."
12:48 <Erkan> ah, I see you summarized it all now
12:50 <vitzli> ~3600 pages downloaded now
13:00 <Erkan> yay
13:12 <vitzli> tested on a (very) remote VPS - it looks like squid is not the reason; when I added Accept-Encoding: deflate, it did the trick again
13:13 <vitzli> so it is caused either by the python-requests lib or by the remote webserver itself
13:14 <vitzli> about 5000 pages downloaded, crashed 2 times
14:37 <vitzli> ~18000 pages downloaded
14:59 <Fletcher> when I use --images, does that back up only images or all uploaded files?
17:15 <Nemo_bis> Fletcher: all files
17:16 <Fletcher> cheers Nemo_bis
17:16 <Nemo_bis> vitzli: iirc we let python-requests set the deflate header, weird that it doesn't work
17:22 <vitzli> Nemo_bis, I've actually added 'Accept-Encoding':'deflate','Accept-Language':'ru,en;q=0.5' everywhere session.headers.update({'User-Agent': getUserAgent()}) is called, just to be sure, but I think the Accept-Encoding alone would have worked
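Spelled out, each patched call site would look roughly like this (a sketch assuming dumpgenerator keeps its HTTP state on a requests session, as the quoted line suggests; the getUserAgent body is a stand-in):

```python
import requests

def getUserAgent():
    # stand-in for dumpgenerator.py's helper of the same name
    return 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

session = requests.Session()
session.headers.update({
    'User-Agent': getUserAgent(),
    'Accept-Encoding': 'deflate',      # the part that fixed the broken XML
    'Accept-Language': 'ru,en;q=0.5',  # added "just to be sure"
})
```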
17:24 <vitzli> it seems to be an issue with the webserver software; I found it when I tried to use the pure requests lib to get the xml data and got some unreadable piece of junk
17:25 <vitzli> I ran the network monitor in Firefox, saw its headers, added them to the requests code, and it worked!
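A minimal standalone reproduction along those lines, using the Special:Export URL posted earlier and varying only Accept-Encoding (requests' default versus the value that worked):

```python
import requests

# Special:Export URL for the main page, as posted above
url = ('https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F'
       ':Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0'
       '%D0%BD%D0%B8%D1%86%D0%B0')

# 'gzip, deflate' (requests' default) reportedly came back as unreadable junk;
# plain 'deflate' returned well-formed export XML
for encoding in ('gzip, deflate', 'deflate'):
    r = requests.get(url, headers={'Accept-Encoding': encoding})
    print(encoding, '->', r.status_code, repr(r.text[:60]))
```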
17:26 <vitzli> then I patched dumpgenerator and managed to download almost all of the text pages, about 30000 maybe. Unfortunately, I had to split the namespaces between 2 hosts
17:51 <vitzli> may I post the link to the .7z archive?
17:55 -- vitzli has quit IRC (Quit: Leaving)
18:13 <xmc> one can always post links in archiveteam channels
21:40 <Nemo_bis> so maybe they blocked our user-agent
21:41 <Nemo_bis> Or they require ru in Accept-Language, but that would be weird :)
22:13 -- Command-S has quit IRC (Read error: Connection reset by peer)
22:13 -- Control-S has joined #wikiteam