#wikiteam 2011-08-04,Thu


Time Nickname Message
02:08 🔗 underscor XML for "Suggestions" is wrong. Waiting 80 seconds and reloading...
02:08 🔗 underscor What would cause this error to occur?
02:09 🔗 underscor I guess emijrp is probably the best to ask
18:43 🔗 underscor chronomex: DoubleJ ersi Nemo_bis ops please
18:43 🔗 ersi sure
18:43 🔗 underscor gracias
18:45 🔗 underscor I've been meeting with Alexis this morning (The collections manager at the archive)
18:46 🔗 underscor Eventually what they'd like to have are backups, every 6 to 12 months or so, of all the wikis we can find
18:47 🔗 underscor Obviously this will require more power than whatall we can do, so she's having ops set up a machine that I can use to download wikis too
18:47 🔗 underscor 24TB space, dual gigabit, fun stuff
18:48 🔗 underscor Anyway
18:48 🔗 underscor Did I mention this software is amazing, emijrp?
18:50 🔗 ersi Neat ~
18:50 🔗 emijrp lolololololololo
18:50 🔗 emijrp 24tb for us?
18:51 🔗 emijrp underscor: can you post a thread about this in the mailing list?
18:51 🔗 underscor Yeah
18:51 🔗 emijrp i saw yesterday a collection at IA for wikiteam, i guess you created it
18:51 🔗 emijrp thanks
18:51 🔗 underscor Yep
18:52 🔗 underscor I'm interning at the archive thanks to jason, so I have admin power on the site now
18:52 🔗 emijrp getting paid?
18:52 🔗 underscor Nope
18:52 🔗 underscor But I do get to meet a ton of people
18:52 🔗 emijrp lucky anyway
18:53 🔗 underscor and if it goes well they want to hire me when I graduate
18:53 🔗 emijrp : )
18:53 🔗 underscor I'm so excited
18:53 🔗 underscor I love doing stuff like this
18:53 🔗 emijrp give them my email for WikiTeam tools details
18:53 🔗 emijrp if necessary
18:54 🔗 emijrp what is that spreadsheet?
18:55 🔗 underscor It's to manage all the wikis we're tracking
18:55 🔗 underscor The archive wants 6 month xml backups, yearly image backups of all the wikis we can find
18:56 🔗 underscor That's part of why we get the server
18:56 🔗 underscor I have to write some automation magic
18:56 🔗 emijrp great \o/
18:56 🔗 emijrp google code 4GB limit sucks
18:57 🔗 underscor hehe
18:57 🔗 emijrp any feature you need, file a request on google code
18:58 🔗 underscor http://www.archive.org/~tracey/mrtg/df-day.png
18:58 🔗 underscor We have a ways to go before we fill it up
18:58 🔗 underscor ;)
18:58 🔗 emijrp and, please, check dumps before upload; there is a section in the FAQ/Tutorial
18:58 🔗 underscor Is --namespaces=all by default? It looks like it's checking them all, but I wasn't sure
19:02 🔗 emijrp yes
19:02 🔗 emijrp all namespaces, complete histories, by default
19:05 🔗 underscor Ok, excellent
19:06 🔗 underscor I've found a couple of wikis where the dumpgenerator fails
19:06 🔗 underscor Should I just file a bug report?
19:06 🔗 underscor It goes "Server is slow. Retrying in some time"
19:06 🔗 underscor and then it waits a while
19:06 🔗 underscor then says the backup failed, try resuming, etc
19:06 🔗 underscor It's only foreign language ones (specifically Japanese and Korean) so I'm guessing it's a character encoding issue
19:08 🔗 emijrp yes, file an issue
19:10 🔗 underscor ok
19:10 🔗 underscor I'll do that in a bit
19:12 🔗 underscor emijrp: Is wikanda like wiki in spanish?
19:13 🔗 underscor or is it just a name of a site?
19:20 🔗 underscor 53592 titles retrieved in the namespace 0
19:21 🔗 underscor wow
19:21 🔗 emijrp i have wikanda dumps
19:21 🔗 emijrp but if you want to redo
19:22 🔗 emijrp wikanda is an encyclopedia about Andalusia region in Spain
19:22 🔗 emijrp wiki + anda
19:24 🔗 underscor aha
19:24 🔗 underscor Did you do image dumps?
19:25 🔗 emijrp yes
19:25 🔗 emijrp but i cant upload them from home
19:26 🔗 emijrp maybe from university (but next month), so do what you want : )
19:27 🔗 underscor Okay :)
19:27 🔗 emijrp there are some wikifarms
19:27 🔗 emijrp shoutwiki has troubles
19:28 🔗 underscor Do we have lists?
19:28 🔗 emijrp in the repository
19:28 🔗 underscor Archiving wikis, archiving all day, the archiving wiki game is fun to play!
19:28 🔗 emijrp Nemo_bis was working on downloading shoutwikis, but there has been no news for a long time
19:28 🔗 * underscor sings
19:28 🔗 underscor http://i.imgur.com/eBGYE.png
19:28 🔗 underscor http://i.imgur.com/V7K68.png
19:28 🔗 underscor http://i.imgur.com/xXh0n.png
19:28 🔗 underscor Wheeee
19:30 🔗 emijrp are you redownloading all that with images?
19:30 🔗 emijrp oh, i see the timestamps
19:30 🔗 emijrp ok
19:31 🔗 underscor 0 12:25PM:abuie@teamarchive-0:/1/UNDERTHESTAIRS/wikiteam 147 π du -sh .
19:31 🔗 underscor 6.6G .
19:32 🔗 underscor haha
19:32 🔗 emijrp the config.txt file is not necessary
19:32 🔗 underscor I've already broken the google code limit
19:32 🔗 underscor Oh, when packing the archive?
19:32 🔗 emijrp yep
19:32 🔗 emijrp it contains the parameters you used to call dumpgenerator, the path to the directory, etc
19:32 🔗 emijrp so, if you dont want to show your path, remove it
19:33 🔗 underscor Ah, I see
19:33 🔗 underscor Thanks
19:35 🔗 emijrp there is a script to download wikipedia dumps
19:36 🔗 underscor It's running :)
19:36 🔗 emijrp ok
19:36 🔗 emijrp the wikiadownloader for wikia.com is broken i guess, they removed the periodic dumps and dumps are now only created when requested
19:38 🔗 emijrp be careful with wikipediadownloader, it doesnt download items marked as "dump in progress" http://dumps.wikimedia.org/backup-index.html
19:41 🔗 underscor oh I see
19:41 🔗 underscor Is it smart enough to not redownload stuff it's already downloaded?
19:41 🔗 emijrp yes
19:41 🔗 emijrp it uses wget -c
19:42 🔗 underscor delicious
19:42 🔗 emijrp sort by project/date
19:42 🔗 emijrp check md5
19:42 🔗 emijrp etc
19:42 🔗 underscor Is it just me or is shoutwiki incredibly slow?
19:42 🔗 emijrp sorts* checks*
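
The md5 check mentioned above can be sketched roughly as follows. This is not the code of the wikipediadownloader script; it assumes a checksum list in the common `md5sum` output format (`<hex digest>  <filename>` per line), and the function names are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse 'digest  filename' lines (assumed md5sum-style) into a dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # md5sum marks binary mode with a leading '*' on the name.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    return expected


def verify(path, name, expected):
    """True only if 'name' is listed and its digest matches the file."""
    return expected.get(name) == md5_of_file(path)
```

A file that is missing from the checksum list is reported as not verified, which is the cautious choice before uploading.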
19:43 🔗 emijrp another tip, if you download several wikis from a server, you can crash it
19:44 🔗 emijrp now wikanda wikis are a bit slow, probably because you are downloading all 8 wikis from there http://huelvapedia.wikanda.es/wiki/Portada
19:45 🔗 emijrp so, better to run parallel downloads from several servers (distinct farms)
19:45 🔗 underscor Okay
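
The advice above (parallel across farms, one at a time within a farm) could be sketched like this. It is an illustration, not part of the WikiTeam tools: `farm_key` crudely treats the last two labels of the hostname as the farm, and `download` is a placeholder for whatever actually fetches one wiki.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def farm_key(url):
    """Crude farm guess: the last two labels of the hostname."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def plan_by_farm(urls):
    """Group wiki URLs so each farm gets its own sequential queue."""
    plan = defaultdict(list)
    for url in urls:
        plan[farm_key(url)].append(url)
    return dict(plan)


def run_plan(plan, download, max_farms=4):
    """Run farms in parallel, but the wikis inside one farm one by one."""
    def drain(queue):
        return [download(url) for url in queue]

    with ThreadPoolExecutor(max_workers=max_farms) as pool:
        results = pool.map(drain, plan.values())
        return dict(zip(plan.keys(), results))
```

With this layout a single farm never sees more than one download at a time, no matter how many farms run at once.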
19:47 🔗 emijrp give me a gmail account, and i give you committer access at google code
19:47 🔗 emijrp for any list of wikis, tutorial fixes, batch scripts you can add
19:48 🔗 underscor abuie@kwdservices.com
19:50 🔗 emijrp test if you can edit wiki pages, and commit
19:50 🔗 underscor Ok
19:50 🔗 underscor btw, is this bad?
19:50 🔗 underscor ATTENTION: This wiki does not allow some parameters in Special:Export, so, pages with large histories may be truncated
19:51 🔗 underscor Yep, I can commit
19:51 🔗 underscor (and edit wiki pages)
19:52 🔗 emijrp it is bad if the wiki has long histories, not very usual... but possible. When that warning appears, the script has to truncate, discarding the older revisions of the long histories
19:52 🔗 underscor So is it like an option that people turn off, or what?
19:52 🔗 emijrp old mediawikis
19:53 🔗 underscor Oh I see
19:53 🔗 emijrp post the url
19:53 🔗 underscor http://enwada.es/api.php
19:53 🔗 underscor http://enwada.es/wiki/Especial:Exportar
19:54 🔗 emijrp that mediawiki is not so old, 1.15, wait
19:54 🔗 emijrp http://enwada.es/wiki/Especial:Versi%C3%B3n
19:55 🔗 underscor Hm
19:55 🔗 emijrp ah ok, old mediawikis or mediawikis which don't allow downloading histories in batches
19:55 🔗 emijrp so, truncated the old revisions of histories
19:56 🔗 emijrp but there is no choice, you get what the server gives you, or nothing
19:56 🔗 emijrp so, it truncates*
19:58 🔗 emijrp but it is a warning, it is not bad always, only when wiki has long histories
20:00 🔗 underscor Thanks
20:00 🔗 underscor Okay, so yeah
20:00 🔗 underscor looks like most of these are only 1 or 2 edits anyways
20:00 🔗 emijrp yep
20:01 🔗 emijrp man, this tool is a "patch" for this bloody wiki destruction, we can't expect to download an entire site without failures
20:02 🔗 underscor hehe
20:07 🔗 emijrp at the end of the tutorial there is the integrity check http://code.google.com/p/wikiteam/wiki/Tutorial
20:08 🔗 emijrp it is not a hard check, but it helps to find broken dumps
20:08 🔗 emijrp i have had only a few broken dumps, and they were big wikis or slow servers that kept failing (resume and resume and resume)
20:09 🔗 underscor Will do
20:10 🔗 emijrp a broken dump is not a disaster (the data is inside, but you will have problems while trying to import it into a clean mediawiki)
20:11 🔗 emijrp i developed a tiny script to remove corrupted <page></page> XML items inside broken dumps, but, dumps are usually ok (not needed)
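
A soft integrity check of the kind described above could look like the sketch below. It is an assumption about what such a check involves, not a copy of the tutorial: it counts a few XML tags in the dump and flags the file when opening and closing tags do not pair up. The function names are illustrative.

```python
TAGS = ("<title>", "<page>", "</page>", "<revision>", "</revision>")


def count_tags(path):
    """Count occurrences of a few structural tags in an XML dump."""
    counts = dict.fromkeys(TAGS, 0)
    with open(path, encoding="utf-8", errors="replace") as dump:
        for line in dump:
            for tag in TAGS:
                counts[tag] += line.count(tag)
    return counts


def looks_consistent(counts):
    """Each page should have one title, and every open tag a close tag."""
    pages_ok = counts["<page>"] == counts["</page>"] == counts["<title>"]
    revisions_ok = counts["<revision>"] == counts["</revision>"]
    return pages_ok and revisions_ok
```

As in the conversation, this is not a hard check: matching counts do not prove the dump is valid XML, but mismatched counts are a cheap sign that a download was cut off or resumed badly.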
20:12 🔗 underscor http://wiki.greasespot.net/api.php
20:12 🔗 underscor Error in api.php, please, provide a correct path to api.php
20:12 🔗 underscor Any idea why it would say that?
20:13 🔗 underscor That api looks correct
20:13 🔗 emijrp no api in that url
20:13 🔗 emijrp There is currently no text in this page. You can search for this page title in other pages, or search the related logs.
20:13 🔗 emijrp api is this http://en.wikipedia.org/w/api.php
20:14 🔗 underscor http://i.imgur.com/FsDvr.png
20:14 🔗 emijrp url you posted contains a monkey
20:15 🔗 emijrp http://wiki.greasespot.net/api.php redirects to http://wiki.greasespot.net/Api.php and shows an empty page
20:16 🔗 emijrp weird
20:16 🔗 emijrp use --index:http://wiki.greasespot.net/index.php
20:16 🔗 emijrp use --index=http://wiki.greasespot.net/index.php
20:17 🔗 emijrp --api is better, but when it fails, use --index
20:18 🔗 underscor ok
20:18 🔗 underscor How does index differ from api?
20:18 🔗 underscor (Aside from one using the API)
20:19 🔗 underscor Rather, what makes --api better?
20:19 🔗 emijrp index option scrapes html, api is xml/json
20:19 🔗 emijrp I mean, it scrapes the page titles
20:20 🔗 emijrp later, the content is exported using Special:Export as usual (as done with api)
20:20 🔗 emijrp but the page titles are scraped, and it is not cool
20:22 🔗 underscor Oh I see
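
The rule of thumb from this exchange (prefer `--api`, fall back to `--index` when the API fails) can be written down as a small helper. The function name and the `api_works` parameter are hypothetical; only the `--api=` and `--index=` flag spellings come from the conversation, and how to detect a working API is left to the caller.

```python
def entry_point_flag(api_url, index_url, api_works):
    """Pick the dumpgenerator entry-point flag: API first, index as fallback."""
    if api_works and api_url:
        return "--api=" + api_url
    if index_url:
        # Page titles are then scraped from HTML instead of read via the API.
        return "--index=" + index_url
    raise ValueError("need a working api.php or an index.php URL")
```

For the wiki discussed above, a failed API probe would lead to `--index=http://wiki.greasespot.net/index.php`.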
20:24 🔗 emijrp all these questions are good to add to the FAQ
20:25 🔗 underscor :)
20:25 🔗 underscor I'll work on adding them as I have time today
20:25 🔗 underscor Getting ready to go out with family
20:26 🔗 emijrp ok
20:41 🔗 emijrp seeya
