10:37 -- vitzli has joined #wikiteam
10:40 <vitzli> hello, I am trying to dump lurkmore.to with dumpgenerator from https://github.com/WikiTeam/wikiteam. I've changed useragents[0] to useragents[1] in dumpgenerator.py, but now I get "XML for "Main_Page" is wrong. Waiting 20 seconds and reloading..." - I've replaced Main_Page with another name, but got the same result. How can I fix this? Is it possible to use debug mode for that page?
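For reference, the change vitzli describes amounts to selecting a different entry from dumpgenerator.py's useragents list; a minimal sketch of that pattern (the list contents and function body here are illustrative stand-ins, not the script's actual code):

```python
# Illustrative stand-ins for the user-agent strings dumpgenerator.py ships with
useragents = [
    'Mozilla/5.0 (compatible; WikiTeam dumpgenerator)',                       # useragents[0]
    'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0',   # useragents[1]
]

def getUserAgent():
    """Return the user-agent string sent with every request."""
    return useragents[1]  # the described change: was useragents[0]
```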
10:42 <vitzli> It's a bit urgent, since they are moving the wiki to read-only mode (there is some probability that this wiki will disappear one day, because of censorship issues in Russia)
11:21 <behind_yo> if it is read-only mode: NO PANIC
11:21 <behind_yo> you can still get the data then, vitzli
11:22 <vitzli> yes, I hope so, just a little bit nervous about it
11:23 <behind_yo> please provide the exact command you execute, so others can retry
11:26 <vitzli> I'm running python dumpgenerator.py --api=https://lurkmore.co/api.php --index=https://lurkmore.co/index.php --xml --curonly (the command from issue #201: https://github.com/WikiTeam/wikiteam/issues/201). It does not work with useragents[0], but it does work when I use another UA string (except that wget gives a 404 and googlebot is a bit tricky). The UA could be Firefox for Windows, FF for Linux, or Chrome AFAIK
11:30 <vitzli> debian stretch; did the pip --user -r requirements.txt setup, kitchen from the debian repo, everything else from pip. I could put some junk in there since I'm doing this inside a VM
11:37 <behind_yo> thx, please wait until someone from the team looks at it, I can't execute anything right now
11:37 -- behind_yo is now known as Erkan
11:38 <vitzli> Should I open an issue on GitHub?
11:39 <vitzli> It seems that this problem is related to issue #241
11:40 <Erkan> I'd first check if you can normally export a page via Special:Export
11:40 <Erkan> just to be sure nothing's wrong there
11:45 <vitzli> Yes, it does work, but it uses the translated Cyrillic name for Special:Export - "Служебная:Export"
11:53 <Erkan> ok, that's fine then
11:55 <vitzli> https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0 - sorry for the quoted chars
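Those percent-escapes are just the UTF-8 bytes of the localized page title; decoding them back is one standard-library call:

```python
from urllib.parse import unquote

url = ('https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F'
       ':Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0'
       '%D0%BD%D0%B8%D1%86%D0%B0')
print(unquote(url))
# -> https://lurkmore.to/Служебная:Export/Главная_страница
#    i.e. Special:Export/Main_page under the Russian localization
```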
12:29 <vitzli> Erkan, I'm sorry for the panic, I think I found the problem! Something seems to be wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download the xml pages
12:30 <vitzli> by default, it seems to be 'gzip,deflate'
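That default is easy to confirm against whatever requests version is installed:

```python
import requests

# requests fills in Accept-Encoding on every new session by default
print(requests.Session().headers['Accept-Encoding'])  # typically 'gzip, deflate'
```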
12:32 <vitzli> It could be caused by squid too; I have a transparent proxy between my VM and the website
12:35 -- Erkan has quit IRC (Ping timeout: 512 seconds)
12:45 -- behind_yo has joined #wikiteam
12:48 <vitzli> behind_yo, I'm sorry for the panic, I think I found the problem! Something seems to be wrong with either python-requests or the webserver: when I added 'Accept-Encoding': 'deflate' to the HTTP request headers, the program started to download the xml pages. The default accept-encoding seems to be 'gzip,deflate'. It could be caused by my proxy too; I have a transparent one between my VM and the website
12:48 -- behind_yo is now known as Erkan
12:48 <Erkan> hi, the last I saw was: "It could be caused..."
12:48 <Erkan> ah, I see you summarized it all now
12:50 <vitzli> ~3600 pages downloaded now
13:00 <Erkan> yay
13:12 <vitzli> tested on a (very) remote VPS - it looks like squid is not the reason; when I added Accept-Encoding: deflate, it did the trick again
13:13 <vitzli> so it is caused either by the python-requests lib or by the remote webserver itself
13:14 <vitzli> about 5000 pages downloaded, crashed 2 times
14:37 <vitzli> ~18000 pages downloaded
14:59 <Fletcher> when I use --images, does that back up only images or all uploaded files?
17:15 <Nemo_bis> Fletcher: all files
17:16 <Fletcher> cheers Nemo_bis
17:16 <Nemo_bis> vitzli: iirc we let python-requests set the deflate header, weird that it doesn't work
17:22 <vitzli> Nemo_bis, I've actually added 'Accept-Encoding':'deflate','Accept-Language':'ru,en;q=0.5' everywhere session.headers.update({'User-Agent': getUserAgent()}) is called, just to be sure, but I think the Accept-Encoding alone would have worked
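Spelled out, each patched call site would look roughly like this (a sketch assuming dumpgenerator keeps its HTTP state on a requests session, as the quoted line suggests; the getUserAgent body is a stand-in):

```python
import requests

def getUserAgent():
    # stand-in for dumpgenerator.py's helper of the same name
    return 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'

session = requests.Session()
session.headers.update({
    'User-Agent': getUserAgent(),
    'Accept-Encoding': 'deflate',      # the part that fixed the broken XML
    'Accept-Language': 'ru,en;q=0.5',  # added "just to be sure"
})
```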
17:24 <vitzli> it seems to be an issue with the webserver software; I found it when I tried to use the pure requests lib to get the xml data and got some unreadable piece of junk
17:25 <vitzli> I ran the network monitor in Firefox, saw its headers, added them to the requests code, and it worked!
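A minimal standalone reproduction along those lines, using the Special:Export URL posted earlier and varying only Accept-Encoding (requests' default versus the value that worked):

```python
import requests

# Special:Export URL for the main page, as posted above
url = ('https://lurkmore.to/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F'
       ':Export/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0'
       '%D0%BD%D0%B8%D1%86%D0%B0')

# 'gzip, deflate' (requests' default) reportedly came back as unreadable junk;
# plain 'deflate' returned well-formed export XML
for encoding in ('gzip, deflate', 'deflate'):
    r = requests.get(url, headers={'Accept-Encoding': encoding})
    print(encoding, '->', r.status_code, repr(r.text[:60]))
```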
17:26 <vitzli> then I patched dumpgenerator and managed to download almost all of the text pages, about 30000 maybe. Unfortunately, I had to split the namespaces between 2 hosts
17:51 <vitzli> may I post the link to the .7z archive?
17:55 -- vitzli has quit IRC (Quit: Leaving)
18:13 <xmc> one can always post links in archiveteam channels
21:40 <Nemo_bis> so maybe they blocked our user-agent
21:41 <Nemo_bis> Or they require ru in Accept-Language, but that would be weird :)
22:13 -- Command-S has quit IRC (Read error: Connection reset by peer)
22:13 -- Control-S has joined #wikiteam