#wikiteam 2014-06-04,Wed

Time Nickname Message
10:09 🔗 midas right so, the delay is different?
10:17 🔗 Nemo_bis midas: till some time ago --delay was not applied to all kinds of requests
10:18 🔗 Nemo_bis I *think* I've added it everywhere now
10:18 🔗 midas lol :p ill check if my current version is different from the one online now
10:18 🔗 Nemo_bis But in addition to that there is some automatic delay when the server feels slow
10:19 🔗 Nemo_bis On top of that, the python delay function used doesn't literally delay as much as you tell it to, AFAIK
10:19 🔗 midas and to top that, it's IMSLP, seriously not cool
10:19 🔗 midas but fuckit, im going to grab it.
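On the delay point above: Python's time.sleep() is only approximate. OS scheduling can stretch the pause, and on Python 2 a caught signal can cut it short, so --delay=N is never exactly N seconds. A minimal illustration, not code from the wikiteam repo:

    import time

    # time.sleep() is approximate: scheduling can lengthen the pause,
    # and on Python 2 a caught signal can terminate it early.
    start = time.time()
    time.sleep(0.5)
    print 'asked for 0.5s, actually slept %.4fs' % (time.time() - start)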
10:33 🔗 danneh_ Nemo_bis: Just wondering, I haven't done too much wiki archiving before, but I've got some old dumps uploaded on Archive.org (automatically uploaded by uploader.py)
10:34 🔗 danneh_ If I do some new dumps of the same wikis and upload them with uploader.py, will it conflict with the old dumps already up there?
10:34 🔗 danneh_ The old dumps were done at least a few months ago or so, just wanna try updating them
11:07 🔗 SketchCow No.
11:07 🔗 SketchCow Just be careful you're not doing dumps of massive wikis frequently
11:08 🔗 danneh_ Fair enough. Will do, thanks for the advice
11:09 🔗 danneh_ I can't throw up the archiveteam.org grab I just finished using uploader.py, but given it says "Item already exists.", I'm guessing it's just because someone else already 'has' the archiveteam wiki archive item, and they need to upload it themselves (or I could, changing the metadata/etc)
11:11 🔗 Nemo_bis danneh_: that's because you don't have the rights to edit/manage those items
11:11 🔗 Nemo_bis Can you tell me what wikis those are?
11:13 🔗 Nemo_bis Ouch, I had downloaded it twice in about a month by mistake :P https://archive.org/details/wiki-archiveteamorg
11:13 🔗 Nemo_bis But it's a small wiki anyway ;)
11:13 🔗 danneh_ Yep. The one that's failing is the ArchiveTeam.org archive I just did, but if all else fails I'll just leave it 'til whoever normally archives/manages it does it
11:13 🔗 danneh_ Aha, fair enough
11:14 🔗 Nemo_bis If you upload your dump anywhere I'll update the item and delete one of mine
11:14 🔗 Nemo_bis In general, if you want to archive many wikis we can just make you admin :)
11:14 🔗 danneh_ It's on my server at home, not too sure where I'd be able to easily upload it via command line
11:16 🔗 Nemo_bis If you want to try some mass downloading we'd need some of those: https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/taskforce/mediawikis_pavlo.alive.filtered.todo.txt
11:16 🔗 Nemo_bis Let me know if you're interested and I'll update the list
11:17 🔗 danneh_ Ah, those are currently undownloaded?
11:17 🔗 Nemo_bis Those are wikis that failed download for me, for one reason or another
11:18 🔗 danneh_ I've got about 250GB I can spare right now, unfortunately rest is allocated to archiving personal junk
11:18 🔗 danneh_ But I'll do my best and letcha know if I'm able to grab stuff
11:19 🔗 midas hmm, we should be thinking about organising this a little bit so we don't do them over and over again (multiple people grabbing the same wiki and such)
11:19 🔗 danneh_ launcher.py is giving me a little trouble, though: http://pastebin.com/niyNQfH8
11:20 🔗 danneh_ Running Python 2.7.6 on Arch, dumpgenerator.py itself works (just used it to do that grab)
11:20 🔗 danneh_ I'll have a look and try to see where it's failing, got some spare time
11:21 🔗 midas seems you need to add some extra information in your command, but i never used launcher.py yet
11:22 🔗 danneh_ oh, wait
11:23 🔗 danneh_ it's because here: subprocess.call('python dumpgenerator.py --api=%s --xml --images' % wiki, shell=True)
11:23 🔗 danneh_ that line assumes python is python2, which it is on pretty much everything except Arch (where python defaults to py3)
11:24 🔗 Nemo_bis hmpf, thought I had fixed that
11:25 🔗 Nemo_bis can just replace "python " with "./"?
11:25 🔗 danneh_ that's what I was thinking, lemme try it out
11:25 🔗 Nemo_bis What happens on Windows (horror!) I have no idea
11:26 🔗 midas probably default to blue screen
11:26 🔗 danneh_ Yep, that works fine when it's ./dumpgenerator.py instead of the python ...
11:27 🔗 danneh_ Windows is a bit silly, especially in that python's not usually in the $PATH by default (you have to add it manually, at least last time I installed it on Win)
11:31 🔗 Nemo_bis ./ will stop working if one forgot to make the file executable, but we can live with that I suppose
11:31 🔗 danneh_ ah yeah, I had to chmod +x it
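For reference, the fix discussed above replaces the launcher.py line quoted at 11:23 with a direct invocation; a sketch, assuming dumpgenerator.py is chmod +x and its shebang points at a Python 2 interpreter:

    import subprocess

    wiki = 'http://archiveteam.org/api.php'  # example entry from the wiki list

    # The './' fix: relies on the executable bit and the script's
    # shebang choosing the right interpreter.
    subprocess.call('./dumpgenerator.py --api=%s --xml --images' % wiki, shell=True)

A more portable variant (an assumption, not what the channel settled on) reuses the interpreter already running launcher.py, which sidesteps both the python-vs-python2 naming and the executable bit:

    import subprocess
    import sys

    wiki = 'http://archiveteam.org/api.php'  # as above

    subprocess.call([sys.executable, 'dumpgenerator.py',
                     '--api=%s' % wiki, '--xml', '--images'])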
11:32 🔗 danneh_ do file permissions carry over with svn source control?
11:39 🔗 Nemo_bis I hope so on UNIX, but again no idea what happens on Windows
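On the svn question: Subversion does carry the executable bit on Unix, but only if the svn:executable property is set on the file (Windows has no such bit to restore). A sketch of setting it, wrapped in Python for consistency with the other examples, though one would normally run svn directly:

    import subprocess

    # svn tracks the Unix executable bit via the svn:executable
    # property; setting it keeps './dumpgenerator.py' runnable
    # after a fresh checkout.
    subprocess.call(['svn', 'propset', 'svn:executable', 'ON', 'dumpgenerator.py'])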
11:50 🔗 danneh_ If we're really looking to go all-out on Windows support, you'd probably wanna do something like: locate_python() that finds the system python and throws the path in front of the command
11:50 🔗 danneh_ couldn't we just do argv[-1] or something silly like that, to get the actual name/path of the program that started running launcher.py, maybe?
11:54 🔗 danneh_ ah, but even if we could that wouldn't work for ./launcher.py
11:55 🔗 danneh_ we'd probably just need to do the function, which returns "./" on Posix-like and otherwise searches the usual Win32 directories for it
11:55 🔗 danneh_ if we super wanted to
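A sketch of the locate_python() helper danneh_ describes: './' on POSIX-like systems, otherwise probe the usual Windows install locations. The Windows paths below are assumptions, not tested values:

    import os

    def locate_python():
        if os.name == 'posix':
            # Rely on the executable bit and the script's shebang.
            return './'
        # Hypothetical list of common Windows install paths.
        for path in (r'C:\Python27\python.exe', r'C:\Python26\python.exe'):
            if os.path.isfile(path):
                return path + ' '
        return 'python '  # last resort: hope it's on %PATH%

    # launcher.py would then build its command as:
    # subprocess.call(locate_python() + 'dumpgenerator.py --api=%s --xml --images' % wiki, shell=True)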
11:56 🔗 Nemo_bis if something breaks and there is someone in the world who cares, we'll get to know them at last
11:56 🔗 Nemo_bis danneh_: https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/taskforce/mediawikis_pavlo.alive.filtered.todo.txt is all for you, I removed some 100 wikis which completed in the last few months
11:57 🔗 Nemo_bis It's always like that, 2000 wikis get done in one day and then the last 100 take months
11:57 🔗 danneh_ Nemo_bis: if you're on a Windows system, mind trying this for me? http://pastebin.com/uWXLtaES
11:57 🔗 Nemo_bis I'm not
11:57 🔗 danneh_ hopefully that should be a quick, simple answer to it
11:57 🔗 Nemo_bis Didn't touch Windows in many years
11:57 🔗 danneh_ Ah, I'll try it at work tomorrow and letcha know if it returns the right thing
11:58 🔗 danneh_ Also, thanks for the list. Do you know whether most of them are smallish? Only got about 250GB to kick around right now, unfortunately
11:59 🔗 danneh_ I'll probably just glance through before I try downloading, get some sorta measure of their size
11:59 🔗 Nemo_bis Sob, many of those wikis fail for certificate errors
12:01 🔗 SketchCow We are their last hope.
12:02 🔗 danneh_ I'll go through and do my best
12:02 🔗 danneh_ 'specially the ones just running off direct IPs
12:02 🔗 Nemo_bis You could try some s/https/http/
12:02 🔗 danneh_ Yeah, that's what I'm hoping
12:02 🔗 Nemo_bis I have no idea what makes most of those fail :)
12:03 🔗 danneh_ All else fails, I'll just go through and try to coax Py to ignore the cert errors :)
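A minimal sketch of the s/https/http/ fallback suggested above, with an illustrative helper name: if an https wiki won't negotiate, the same request is retried over plain http before giving up.

    import urllib2

    def fetch(url, timeout=30):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except urllib2.URLError:
            # Retry https URLs over plain http before giving up.
            if url.startswith('https://'):
                return urllib2.urlopen('http' + url[len('https'):], timeout=timeout).read()
            raise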
13:09 🔗 danneh_ Nemo_bis: Currently grabbing 163.13.175.46 , the first one failed with "Error. You forget mandatory parameters:" (suspect it's due to api.php not being enabled), and the second one failed to grab the main page ~5 times before I killed it and skipped it manually
13:10 🔗 danneh_ But ah well, this one has about 60k pages, some Chinese baseball wiki so I'll leave it running for a few days and see how it goes
13:24 🔗 Nemo_bis danneh_: ah, many many of those fail because they redirect to different URLs. If you also want to do some coding, you can try replacing urllib with Requests and see how many more wikis work https://code.google.com/p/wikiteam/issues/detail?id=104
13:26 🔗 danneh_ aha, Requests is the best
13:27 🔗 danneh_ I'll do my best, hopefully it shouldn't be too difficult/annoying to port over
13:28 🔗 danneh_ I'm pretty busy right now, so can't promise anything, but I'll do my best
13:30 🔗 Nemo_bis Sure, I'm just giving ideas :)
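A sketch of what the Requests port would buy for the redirect failures mentioned at 13:24; the helper name is illustrative:

    import requests

    def fetch_url(url, timeout=30):
        # Requests follows redirects by default, so a wiki that moved
        # to a new URL still resolves; r.url shows where it landed.
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return r.url, r.text

For the certificate failures, requests.get(url, verify=False) would skip verification entirely, at the cost of the security check.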
21:45 🔗 danneh_ Nemo_bis: Had things fail like this much before? http://pastebin.com/GrG3wckD
22:02 🔗 danneh_ .n
22:02 🔗 danneh_ Also, sorry 'bout bugging you in particular, you've just seemed to do most dev stuff from what I've seen in here!
