Time |
Nickname |
Message |
02:08
🔗
|
underscor |
XML for "Suggestions" is wrong. Waiting 80 seconds and reloading... |
02:08
🔗
|
underscor |
What would cause this error to occur? |
02:09
🔗
|
underscor |
I guess emijrp is probably the best to ask |
18:43
🔗
|
underscor |
chronomex: DoubleJ ersi Nemo_bis ops please |
18:43
🔗
|
ersi |
sure |
18:43
🔗
|
underscor |
gracias |
18:45
🔗
|
underscor |
I've been meeting with Alexis this morning (The collections manager at the archive) |
18:46
🔗
|
underscor |
Eventually what they'd like to have are backups, every 6 to 12 months or so, of all the wikis we can find |
18:47
🔗
|
underscor |
Obviously this will require more power than whatall we can do, so she's having ops set up a machine that I can use to download wikis too |
18:47
🔗
|
underscor |
24TB space, dual gigabit, fun stuff |
18:48
🔗
|
underscor |
Anyway |
18:48
🔗
|
underscor |
Did I mention this software is amazing, emijrp? |
18:50
🔗
|
ersi |
Neat ~ |
18:50
🔗
|
emijrp |
lolololololololo |
18:50
🔗
|
emijrp |
24tb for us? |
18:51
🔗
|
emijrp |
underscor: can you post a thread about this in the mailing list? |
18:51
🔗
|
underscor |
Yeah |
18:51
🔗
|
emijrp |
i saw yesterday a collection at IA for wikiteam, i guess you created it |
18:51
🔗
|
emijrp |
thanks |
18:51
🔗
|
underscor |
Yep |
18:52
🔗
|
underscor |
I'm interning at the archive thanks to jason, so I have admin power on the site now |
18:52
🔗
|
emijrp |
getting paid? |
18:52
🔗
|
underscor |
Nope |
18:52
🔗
|
underscor |
But I do get to meet a ton of people |
18:52
🔗
|
emijrp |
lucky anyway |
18:53
🔗
|
underscor |
and if it goes well they want to hire me when I graduate |
18:53
🔗
|
emijrp |
: ) |
18:53
🔗
|
underscor |
I'm so excited |
18:53
🔗
|
underscor |
I love doing stuff like this |
18:53
🔗
|
emijrp |
give them my email for WikiTeam tools details |
18:53
🔗
|
emijrp |
if necessary |
18:54
🔗
|
emijrp |
what is that spreadsheet? |
18:55
🔗
|
underscor |
It's to manage all the wikis we're tracking |
18:55
🔗
|
underscor |
The archive wants 6 month xml backups, yearly image backups of all the wikis we can find |
18:56
🔗
|
underscor |
That's part of why we get the server |
18:56
🔗
|
underscor |
I have to write some automation magic |
18:56
🔗
|
emijrp |
great \o/ |
18:56
🔗
|
emijrp |
google code 4GB limit sucks |
18:57
🔗
|
underscor |
hehe |
18:57
🔗
|
emijrp |
any feature you need, file a request on google code |
18:58
🔗
|
underscor |
http://www.archive.org/~tracey/mrtg/df-day.png |
18:58
🔗
|
underscor |
We have a ways to go before we fill it up |
18:58
🔗
|
underscor |
;) |
18:58
🔗
|
emijrp |
and, please, check dumps before upload; there is a section about it in the FAQ/Tutorial |
18:58
🔗
|
underscor |
Is --namespaces=all by default? It looks like it's checking them all, but I wasn't sure |
19:02
🔗
|
emijrp |
yes |
19:02
🔗
|
emijrp |
all namespaces, complete histories, by default |
19:05
🔗
|
underscor |
Ok, excellent |
19:06
🔗
|
underscor |
I've found a couple of wikis where the dumpgenerator fails |
19:06
🔗
|
underscor |
Should I just file a bug report? |
19:06
🔗
|
underscor |
It goes "Server is slow. Retrying in some time" |
19:06
🔗
|
underscor |
and then it waits a while |
19:06
🔗
|
underscor |
then says the backup failed, try resuming, etc |
19:06
🔗
|
underscor |
It's only foreign-language ones (specifically Japanese and Korean) so I'm guessing it's a character encoding issue |
19:08
🔗
|
emijrp |
yes, file an issue |
19:10
🔗
|
underscor |
ok |
19:10
🔗
|
underscor |
I'll do that in a bit |
19:12
🔗
|
underscor |
emijrp: Is wikanda like wiki in spanish? |
19:13
🔗
|
underscor |
or is it just a name of a site? |
19:20
🔗
|
underscor |
53592 titles retrieved in the namespace 0 |
19:21
🔗
|
underscor |
wow |
19:21
🔗
|
emijrp |
i have wikanda dumps |
19:21
🔗
|
emijrp |
but if you want to redo |
19:22
🔗
|
emijrp |
wikanda is an encyclopedia about Andalusia region in Spain |
19:22
🔗
|
emijrp |
wiki + anda |
19:24
🔗
|
underscor |
aha |
19:24
🔗
|
underscor |
Did you do image dumps? |
19:25
🔗
|
emijrp |
yes |
19:25
🔗
|
emijrp |
but i cant upload them from home |
19:26
🔗
|
emijrp |
maybe from university (but next month), so do what you want : ) |
19:27
🔗
|
underscor |
Okay :) |
19:27
🔗
|
emijrp |
there are some wikifarms |
19:27
🔗
|
emijrp |
shoutwiki has troubles |
19:28
🔗
|
underscor |
Do we have lists? |
19:28
🔗
|
emijrp |
in the repository |
19:28
🔗
|
underscor |
Archiving wikis, archiving all day, the archiving wiki game is fun to play! |
19:28
🔗
|
emijrp |
Nemo_bis was working on downloading shoutwikis, but no news for a long time |
19:28
🔗
|
* |
underscor sings |
19:28
🔗
|
underscor |
http://i.imgur.com/eBGYE.png |
19:28
🔗
|
underscor |
http://i.imgur.com/V7K68.png |
19:28
🔗
|
underscor |
http://i.imgur.com/xXh0n.png |
19:28
🔗
|
underscor |
Wheeee |
19:30
🔗
|
emijrp |
are you redownloading all that with images? |
19:30
🔗
|
emijrp |
oh, i see the timestamps |
19:30
🔗
|
emijrp |
ok |
19:31
🔗
|
underscor |
0 12:25PM:abuie@teamarchive-0:/1/UNDERTHESTAIRS/wikiteam 147 Ï du -sh . |
19:31
🔗
|
underscor |
6.6G . |
19:32
🔗
|
underscor |
haha |
19:32
🔗
|
emijrp |
the config.txt file is not necessary |
19:32
🔗
|
underscor |
I've already broken the google code limit |
19:32
🔗
|
underscor |
Oh, when packing the archive? |
19:32
🔗
|
emijrp |
yep |
19:32
🔗
|
emijrp |
it contains the parameters you used to call dumpgenerator, the path to the directory, etc |
19:32
🔗
|
emijrp |
so, if you dont want to show your path, remove it |
19:33
🔗
|
underscor |
Ah, I see |
19:33
🔗
|
underscor |
Thanks |
19:35
🔗
|
emijrp |
there is a script to download wikipedia dumps |
19:36
🔗
|
underscor |
It's running :) |
19:36
🔗
|
emijrp |
ok |
19:36
🔗
|
emijrp |
the wikiadownloader for wikia.com is broken i guess; they removed the periodic dumps, which are now only created when requested |
19:38
🔗
|
emijrp |
be careful with wikipediadownloader, it doesnt download items marked as "dump in progress" http://dumps.wikimedia.org/backup-index.html |
19:41
🔗
|
underscor |
oh I see |
19:41
🔗
|
underscor |
Is it smart enough to not redownload stuff it's already downloaded? |
19:41
🔗
|
emijrp |
yes |
19:41
🔗
|
emijrp |
it uses wget -c |
19:42
🔗
|
underscor |
delicious |
19:42
🔗
|
emijrp |
sort by project/date |
19:42
🔗
|
emijrp |
check md5 |
19:42
🔗
|
emijrp |
etc |
19:42
🔗
|
underscor |
Is it just me or is shoutwiki incredibly slow? |
19:42
🔗
|
emijrp |
sorts* checks* |
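The "check md5" step above can be sketched in Python as follows. This is a minimal sketch, not wikipediadownloader's actual code: the function names `md5_of_file` and `matches_md5` are hypothetical, and it assumes the expected checksum is available as a hex string (for example, copied from a checksum file published next to the dump).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks so large dumps fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 with an expected hex string (hypothetical helper)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A file whose checksum does not match is a candidate for re-downloading rather than uploading.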
19:43
🔗
|
emijrp |
another tip, if you download several wikis from a server, you can crash it |
19:44
🔗
|
emijrp |
now wikanda wikis are a bit slow, probably because you are downloading all 8 wikis from there http://huelvapedia.wikanda.es/wiki/Portada |
19:45
🔗
|
emijrp |
so, better to run parallel downloads, but from several servers (distinct farms) |
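The tip above (spread downloads across farms instead of hammering one server) could be approximated by ordering the download queue round-robin by farm. This is only a sketch under assumptions: `farm_key` and `interleave_farms` are hypothetical names, and treating the last two labels of the hostname as "the farm" is a naive guess that works for hosts like huelvapedia.wikanda.es but not for every domain.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse


def farm_key(url):
    """Naive farm guess (assumption): the last two labels of the hostname."""
    host = urlparse(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def interleave_farms(urls):
    """Order URLs round-robin across farms so neighbours hit different servers."""
    groups = defaultdict(list)
    for url in urls:
        groups[farm_key(url)].append(url)
    # Take one URL from each farm per round; shorter farms are padded with None.
    rounds = zip_longest(*groups.values())
    return [url for batch in rounds for url in batch if url is not None]
```

Working through the resulting list in order, a few wikis at a time, makes it less likely that the same farm is being dumped several times at once, although it does not strictly guarantee it.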
19:45
🔗
|
underscor |
Okay |
19:47
🔗
|
emijrp |
give me a gmail account, and i give you committer access at google code |
19:47
🔗
|
emijrp |
for any list of wikis, tutorial fixes, batch scripts you can add |
19:48
🔗
|
underscor |
abuie@kwdservices.com |
19:50
🔗
|
emijrp |
test if you can edit wiki pages, and commit |
19:50
🔗
|
underscor |
Ok |
19:50
🔗
|
underscor |
btw, is this bad? |
19:50
🔗
|
underscor |
ATTENTION: This wiki does not allow some parameters in Special:Export, so, pages with large histories may be truncated |
19:51
🔗
|
underscor |
Yep, I can commit |
19:51
🔗
|
underscor |
(and edit wiki pages) |
19:52
🔗
|
emijrp |
it is bad if the wiki has long histories, not very usual... but possible. When that warning appears, the script has to truncate, discarding the older revisions of the long histories |
19:52
🔗
|
underscor |
So is it like an option that people turn off, or what? |
19:52
🔗
|
emijrp |
old mediawikis |
19:53
🔗
|
underscor |
Oh I see |
19:53
🔗
|
emijrp |
post the url |
19:53
🔗
|
underscor |
http://enwada.es/api.php |
19:53
🔗
|
underscor |
http://enwada.es/wiki/Especial:Exportar |
19:54
🔗
|
emijrp |
that mediawiki is not so old, 1.15, wait |
19:54
🔗
|
emijrp |
http://enwada.es/wiki/Especial:Versi%C3%B3n |
19:55
🔗
|
underscor |
Hm |
19:55
🔗
|
emijrp |
ah ok, old mediawikis or mediawikis which don't allow downloading histories in batches |
19:55
🔗
|
emijrp |
so, truncated the old revisions of histories |
19:56
🔗
|
emijrp |
but there is no choice, you get what the server gives you, or nothing |
19:56
🔗
|
emijrp |
so, it truncates* |
19:58
🔗
|
emijrp |
but it is a warning, it is not bad always, only when wiki has long histories |
19:59
🔗
|
emijrp |
21:55:05 <underscor> Hm |
19:59
🔗
|
emijrp |
21:55:36 <emijrp> ah ok, old mediawikis or mediawikis which don't allow downloading histories in batches |
19:59
🔗
|
emijrp |
21:55:44 <emijrp> so, truncated the old revisions of histories |
19:59
🔗
|
emijrp |
21:56:01 <emijrp> but there is no choice, you get what the server gives you, or nothing |
19:59
🔗
|
emijrp |
21:56:20 <emijrp> so, it truncates* |
19:59
🔗
|
emijrp |
21:58:02 <emijrp> but it is a warning, it is not bad always, only when wiki has long histories |
20:00
🔗
|
underscor |
Thanks |
20:00
🔗
|
underscor |
Okay, so yeah |
20:00
🔗
|
underscor |
looks like most of these are only 1 or 2 edits anyways |
20:00
🔗
|
emijrp |
yep |
20:01
🔗
|
emijrp |
man, this tool is a "patch" for this bloody wiki destruction, we can't expect to download an entire site without failures |
20:02
🔗
|
underscor |
hehe |
20:07
🔗
|
emijrp |
at the end there is the integrity check http://code.google.com/p/wikiteam/wiki/Tutorial |
20:08
🔗
|
emijrp |
it is not a hard check, but it helps to find broken dumps |
20:08
🔗
|
emijrp |
i have had only a few broken dumps, and they were big wikis or slow servers with failures (resume and resume and resume) |
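A soft integrity check of the kind described above could look like the Python sketch below. The Tutorial's exact procedure is not quoted in this log, so this is an assumption about what such a check does: it only counts a few opening and closing tags in the XML text and compares them. The names `tag_counts` and `looks_consistent` are hypothetical.

```python
def tag_counts(xml_text):
    """Count the tags that should pair up in a well-formed MediaWiki XML dump."""
    return {
        "titles": xml_text.count("<title>"),
        "page_open": xml_text.count("<page>"),
        "page_close": xml_text.count("</page>"),
        "revision_open": xml_text.count("<revision>"),
        "revision_close": xml_text.count("</revision>"),
    }


def looks_consistent(xml_text):
    """Soft check: one title per page, and every page/revision tag is closed."""
    counts = tag_counts(xml_text)
    return (
        counts["titles"] == counts["page_open"] == counts["page_close"]
        and counts["revision_open"] == counts["revision_close"]
    )
```

Matching counts do not prove a dump is complete, and for multi-gigabyte dumps the file would need to be scanned in pieces rather than loaded as one string.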
20:09
🔗
|
underscor |
Will do |
20:10
🔗
|
emijrp |
a broken dump is not a disaster (the data is inside, but you will have problems while trying to import it into a clean mediawiki) |
20:11
🔗
|
emijrp |
i developed a tiny script to remove corrupted <page></page> XML items inside broken dumps, but, dumps are usually ok (not needed) |
20:12
🔗
|
underscor |
http://wiki.greasespot.net/api.php |
20:12
🔗
|
underscor |
Error in api.php, please, provide a correct path to api.php |
20:12
🔗
|
underscor |
Any idea why it would say that? |
20:13
🔗
|
underscor |
That api looks correct |
20:13
🔗
|
emijrp |
no api in that url |
20:13
🔗
|
emijrp |
There is currently no text in this page. You can search for this page title in other pages, or search the related logs. |
20:13
🔗
|
emijrp |
api is this http://en.wikipedia.org/w/api.php |
20:14
🔗
|
underscor |
http://i.imgur.com/FsDvr.png |
20:14
🔗
|
emijrp |
url you posted contains a monkey |
20:15
🔗
|
emijrp |
http://wiki.greasespot.net/api.php this redirects to http://wiki.greasespot.net/Api.php and shows an empty page |
20:16
🔗
|
emijrp |
weird |
20:16
🔗
|
emijrp |
use --index:http://wiki.greasespot.net/index.php |
20:16
🔗
|
emijrp |
use --index=http://wiki.greasespot.net/index.php |
20:17
🔗
|
emijrp |
--api is better, but when it fails, use --index |
20:18
🔗
|
underscor |
ok |
20:18
🔗
|
underscor |
How does index differ from api? |
20:18
🔗
|
underscor |
(Aside from one using the API) |
20:19
🔗
|
underscor |
Rather, what makes --api better? |
20:19
🔗
|
emijrp |
index option scrapes html, api is xml/json |
20:19
🔗
|
emijrp |
I mean, it scrapes the page titles |
20:20
🔗
|
emijrp |
later, the content is exported using Special:Export as usual (as done with api) |
20:20
🔗
|
emijrp |
but the page titles are scraped, and it is not cool |
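To illustrate the difference described above: with an API, page titles can be requested as structured data instead of being scraped from HTML. The sketch below builds a MediaWiki `list=allpages` query URL. Whether dumpgenerator builds exactly this request is an assumption; `allpages_url` is a hypothetical helper, and the page size of 500 is an arbitrary choice.

```python
from urllib.parse import urlencode


def allpages_url(api_url, namespace=0, start_from="", limit=500):
    """Build a MediaWiki API URL that lists page titles in one namespace as JSON."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,  # 0 is the main article namespace
        "apfrom": start_from,      # title to start listing from, for paging
        "aplimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)
```

For example, `allpages_url("http://en.wikipedia.org/w/api.php")` yields a URL whose response is machine-readable, so no HTML parsing is needed; the --index route has to extract the same titles from rendered pages instead.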
20:22
🔗
|
underscor |
Oh I see |
20:24
🔗
|
emijrp |
all these questions are good to add to the FAQ |
20:25
🔗
|
underscor |
:) |
20:25
🔗
|
underscor |
I'll work on adding them as I have time today |
20:25
🔗
|
underscor |
Getting ready to go out with family |
20:26
🔗
|
emijrp |
ok |
20:41
🔗
|
emijrp |
seeya |