[02:08] XML for "Suggestions" is wrong. Waiting 80 seconds and reloading...
[02:08] What would cause this error to occur?
[02:09] I guess emijrp is probably the best to ask
[18:43] chronomex: DoubleJ ersi Nemo_bis ops please
[18:43] sure
[18:43] thanks
[18:45] I've been meeting with Alexis this morning (the collections manager at the archive)
[18:46] Eventually what they'd like to have is backups every 6 to 12 months or so of all the wikis we can find
[18:47] Obviously this will require more power than what we can do ourselves, so she's having ops set up a machine that I can use to download wikis to
[18:47] 24TB space, dual gigabit, fun stuff
[18:48] Anyway
[18:48] Did I mention this software is amazing, emijrp?
[18:50] Neat ~
[18:50] lolololololololo
[18:50] 24TB for us?
[18:51] underscor: can you post a thread about this on the mailing list?
[18:51] Yeah
[18:51] I saw a collection at IA for WikiTeam yesterday, I guess you created it
[18:51] thanks
[18:51] Yep
[18:52] I'm interning at the archive thanks to Jason, so I have admin power on the site now
[18:52] getting paid?
[18:52] Nope
[18:52] But I do get to meet a ton of people
[18:52] lucky anyway
[18:53] and if it goes well they want to hire me when I graduate
[18:53] : )
[18:53] I'm so excited
[18:53] I love doing stuff like this
[18:53] give them my email for WikiTeam tools details
[18:53] if necessary
[18:54] what is that spreadsheet?
[18:55] It's to manage all the wikis we're tracking
[18:55] The archive wants 6-month XML backups and yearly image backups of all the wikis we can find
[18:56] That's part of why we get the server
[18:56] I have to write some automation magic
[18:56] great \o/
[18:56] the Google Code 4GB limit sucks
[18:57] hehe
[18:57] any feature you need, file a request on Google Code
[18:58] http://www.archive.org/~tracey/mrtg/df-day.png
[18:58] We have a ways to go before we fill it up
[18:58] ;)
[18:58] and, please, check dumps before uploading, there is a section in the FAQ/Tutorial
[18:58] Is --namespaces=all the default? It looks like it's checking them all, but I wasn't sure
[19:02] yes
[19:02] all namespaces, complete histories, by default
[19:05] Ok, excellent
[19:06] I've found a couple of wikis where dumpgenerator fails
[19:06] Should I just file a bug report?
[19:06] It goes "Server is slow. Retrying in some time"
[19:06] and then it waits a while
[19:06] then says the backup failed, try resuming, etc
[19:06] It's only foreign language ones (specifically Japanese and Korean) so I'm guessing it's a character encoding issue
[19:08] yes, file an issue
[19:10] ok
[19:10] I'll do that in a bit
[19:12] emijrp: Is wikanda like wiki in Spanish?
[19:13] or is it just the name of a site?
[19:20] 53592 titles retrieved in the namespace 0
[19:21] wow
[19:21] I have Wikanda dumps
[19:21] but you can redo them if you want
[19:22] Wikanda is an encyclopedia about the Andalusia region in Spain
[19:22] wiki + anda
[19:24] aha
[19:24] Did you do image dumps?
[19:25] yes
[19:25] but I can't upload them from home
[19:26] maybe from university (but next month), so do what you want : )
[19:27] Okay :)
[19:27] there are some wikifarms
[19:27] ShoutWiki is having trouble
[19:28] Do we have lists?
[19:28] in the repository
[19:28] Archiving wikis, archiving all day, the archiving wiki game is fun to play!
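
A minimal sketch of the "automation magic" mentioned at [18:56]: walk a list of tracked wikis and shell out to dumpgenerator.py, leaning on the all-namespaces, complete-histories default confirmed at [19:02]. The wikis.txt file name and the exact flag choices are illustrative assumptions, not the setup actually used on the archive machine.

    #!/usr/bin/env python
    # Minimal sketch: dump every wiki listed in wikis.txt (one api.php URL per
    # line) by shelling out to WikiTeam's dumpgenerator.py. The list file name
    # and flag choices here are assumptions for illustration.
    import subprocess

    def dump_all(list_file="wikis.txt"):
        with open(list_file) as f:
            apis = [line.strip() for line in f if line.strip()]
        for api in apis:
            # full histories and all namespaces by default; --images adds the files
            cmd = ["python", "dumpgenerator.py", "--api=%s" % api, "--xml", "--images"]
            if subprocess.call(cmd) != 0:
                print("dump failed for %s, try resuming" % api)

    if __name__ == "__main__":
        dump_all()

Scheduled from cron, something along these lines would roughly match the 6-month XML / yearly image cadence described above (with --images dropped on the more frequent runs).
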
[19:28] Nemo_bis was working on downloading ShoutWiki wikis, but no news in a long time
[19:28] * underscor sings
[19:28] http://i.imgur.com/eBGYE.png
[19:28] http://i.imgur.com/V7K68.png
[19:28] http://i.imgur.com/xXh0n.png
[19:28] Wheeee
[19:30] are you redownloading all that with images?
[19:30] oh, I see the timestamps
[19:30] ok
[19:31] 0 12:25PM:abuie@teamarchive-0:/1/UNDERTHESTAIRS/wikiteam 147 π du -sh .
[19:31] 6.6G .
[19:32] haha
[19:32] the config.txt file is not necessary
[19:32] I've already broken the Google Code limit
[19:32] Oh, when packing the archive?
[19:32] yep
[19:32] it contains the parameters you used to call dumpgenerator, the path to the directory, etc
[19:32] so, if you don't want to show your path, remove it
[19:33] Ah, I see
[19:33] Thanks
[19:35] there is a script to download Wikipedia dumps
[19:36] It's running :)
[19:36] ok
[19:36] the wikiadownloader for wikia.com is broken, I guess; they removed periodical dumps and now they are only created when requested
[19:38] be careful with wikipediadownloader, it doesn't download items marked as "dump in progress" http://dumps.wikimedia.org/backup-index.html
[19:41] oh I see
[19:41] Is it smart enough to not redownload stuff it's already downloaded?
[19:41] yes
[19:41] it uses wget -c
[19:42] delicious
[19:42] sort by project/date
[19:42] check md5
[19:42] etc
[19:42] Is it just me or is ShoutWiki incredibly slow?
[19:42] sorts* checks*
[19:43] another tip: if you download several wikis from one server, you can crash it
[19:44] right now the Wikanda wikis are a bit slow, probably because you are downloading all 8 wikis from there http://huelvapedia.wikanda.es/wiki/Portada
[19:45] so, parallel downloads are better, but from several servers (distinct farms)
[19:45] Okay
[19:47] give me a Gmail account, and I'll give you committer access at Google Code
[19:47] for any list of wikis, tutorial fixes, batch scripts you can add
[19:48] abuie@kwdservices.com
[19:50] test if you can edit wiki pages, and commit
[19:50] Ok
[19:50] btw, is this bad?
[19:50] ATTENTION: This wiki does not allow some parameters in Special:Export, so, pages with large histories may be truncated
[19:51] Yep, I can commit
[19:51] (and edit wiki pages)
[19:52] it is bad if the wiki has long histories, which is not very usual... but possible. When that error appears, the script has to truncate, discarding the older revisions of the long histories
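
A rough sketch of the fallback behaviour described above for the ATTENTION warning: ask Special:Export for the full history, and if the wiki rejects those parameters, keep whatever the server returns. The requests client and the form field names (history, limit, curonly) are assumptions based on how Special:Export is usually driven, not dumpgenerator's actual code.

    # Rough sketch: request full history via Special:Export; if the wiki ignores
    # or forbids those parameters and nothing useful comes back, retry asking for
    # the current revision only -- "you get what the server gives you, or nothing".
    import requests  # assumed HTTP client for this sketch

    def export_page(index_url, title):
        full = {"title": "Special:Export", "pages": title,
                "action": "submit", "history": "1", "limit": "1000"}
        xml = requests.post(index_url, data=full).text
        if "</page>" not in xml:
            # fall back to the current revision only
            cur = {"title": "Special:Export", "pages": title,
                   "action": "submit", "curonly": "1"}
            xml = requests.post(index_url, data=cur).text
        return xml
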
[19:52] So is it like an option that people turn off, or what?
[19:52] old MediaWikis
[19:53] Oh I see
[19:53] post the URL
[19:53] http://enwada.es/api.php
[19:53] http://enwada.es/wiki/Especial:Exportar
[19:54] that MediaWiki is not so old, 1.15, wait
[19:54] http://enwada.es/wiki/Especial:Versi%C3%B3n
[19:55] Hm
[19:55] ah OK, old MediaWikis or MediaWikis which don't allow downloading histories in batches
[19:55] so, truncated the old revisions of histories
[19:56] but there is no choice, you get what the server gives you, or nothing
[19:56] so, it truncates*
[19:58] but it is a warning, it is not always bad, only when the wiki has long histories
[20:00] Thanks
[20:00] Okay, so yeah
[20:00] looks like most of these are only 1 or 2 edits anyway
[20:00] yep
[20:01] man, this tool is a "patch" for this bloody wiki destruction, we can't expect to download an entire site without failures
[20:02] hehe
[20:07] at the end there is the integrity check http://code.google.com/p/wikiteam/wiki/Tutorial
[20:08] it is not a strict check, but it helps to find broken dumps
[20:08] I have had only a few broken dumps, and they were big wikis or slow servers with failures (resume and resume and resume)
[20:09] Will do
[20:10] a broken dump is not a disaster (the data is inside, but you will have problems while trying to import it into a clean MediaWiki)
[20:11] I developed a tiny script to remove corrupted XML items inside broken dumps, but dumps are usually OK (so it's not needed)
[20:12] http://wiki.greasespot.net/api.php
[20:12] Error in api.php, please, provide a correct path to api.php
[20:12] Any idea why it would say that?
[20:13] That api.php looks correct
[20:13] no API at that URL
[20:13] There is currently no text in this page. You can search for this page title in other pages, or search the related logs.
[20:13] the api is like this: http://en.wikipedia.org/w/api.php
[20:14] http://i.imgur.com/FsDvr.png
[20:14] the URL you posted contains a monkey
[20:15] http://wiki.greasespot.net/api.php redirects to http://wiki.greasespot.net/Api.php and shows an empty page
[20:16] weird
[20:16] use --index:http://wiki.greasespot.net/index.php
[20:16] use --index=http://wiki.greasespot.net/index.php
[20:17] --api is better, but when it fails, use --index
[20:18] ok
[20:18] How does index differ from api?
[20:18] (Aside from one using the API)
[20:19] Rather, what makes --api better?
[20:19] the index option scrapes HTML, the api is XML/JSON
[20:19] I mean, it scrapes the page titles
[20:20] later, the content is exported using Special:Export as usual (as is done with the api)
[20:20] but the page titles are scraped, and that is not cool
[20:22] Oh I see
[20:24] all these questions are good to add to the FAQ
[20:25] :)
[20:25] I'll work on adding them as I have time today
[20:25] Getting ready to go out with family
[20:26] ok
[20:41] seeya
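
The integrity check linked at [20:07] amounts to making sure the dump's XML tags balance out; a small sanity-check sketch along those lines (the tag counting is in the spirit of the Tutorial page rather than a copy of it):

    # Sketch of a dump sanity check: count opening and closing <page>/<revision>
    # tags and make sure they match. Mismatched counts usually point at a
    # truncated or otherwise broken XML dump.
    def check_dump(path):
        tags = ["<title>", "<page>", "</page>", "<revision>", "</revision>"]
        counts = dict.fromkeys(tags, 0)
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                for tag in tags:
                    counts[tag] += line.count(tag)
        ok = (counts["<title>"] == counts["<page>"] == counts["</page>"]
              and counts["<revision>"] == counts["</revision>"])
        print(counts, "looks OK" if ok else "possibly broken")
        return ok

If the counts disagree, that is the "broken dump" case discussed above, where resuming or stripping the corrupted XML items is the way out.
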