#wikiteam 2011-07-15,Fri


Time Nickname Message
08:18 🔗 Nemo_bis http://www.archive.org/search.php?query=subject%3A%22wikiteam%22
20:13 🔗 * Nemo_bis uploading a 20 GiB image dump of a wiki to IA :-O
20:14 🔗 Nemo_bis emijrp, http://www.archive.org/search.php?query=subject%3A%22wikiteam%22
20:15 🔗 emijrp nice
20:16 🔗 emijrp i'm going to contact jason scott to try to add the wikiteam items to the archive team collection http://www.archive.org/details/archiveteam
20:17 🔗 Nemo_bis Yes, I wanted to ask you about it.
20:17 🔗 Nemo_bis I tried to upload them there but it says I have no permission (obviously).
20:18 🔗 Nemo_bis By the way, I'm using http://www.archive.org/help/abouts3.txt after trying browser and FTP it's sooo sweet
20:20 🔗 emijrp i check the .xml files before uploading them to google code
20:20 🔗 emijrp i use this line
20:20 🔗 emijrp inside dump directory
20:20 🔗 emijrp wc -l *.txt;grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
20:21 🔗 emijrp it first 3 numbers are equal, and the last two too, then the dump is ok
20:21 🔗 emijrp most times there is no problem
20:21 🔗 emijrp but check them before uploading
20:22 🔗 emijrp you can remove the first wc -l
20:22 🔗 emijrp grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
20:22 🔗 emijrp if first 3 numbers*
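The check emijrp describes above can also be scripted. What follows is a minimal sketch in Python, not part of wikiteam: the names `count_tag_lines` and `dump_looks_ok` are invented for illustration. It counts lines that contain each tag, which is what `grep -c` reports, and it assumes the dump is a MediaWiki XML export in which the tags appear as plain text.

```python
import sys
from typing import Dict, Iterable

# The five tags counted by the grep one-liner, in the same order.
TAGS = ("<title>", "<page>", "</page>", "<revision>", "</revision>")


def count_tag_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Count the lines that contain each tag (same semantics as `grep -c`)."""
    counts = {tag: 0 for tag in TAGS}
    for line in lines:
        for tag in TAGS:
            if tag in line:
                counts[tag] += 1
    return counts


def dump_looks_ok(counts: Dict[str, int]) -> bool:
    """First three counts must be equal, and the last two must be equal."""
    pages_match = counts["<title>"] == counts["<page>"] == counts["</page>"]
    revisions_match = counts["<revision>"] == counts["</revision>"]
    return pages_match and revisions_match


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_dump.py somewiki-history.xml
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        result = count_tag_lines(dump)
    for tag in TAGS:
        print(tag, result[tag])
    print("dump is ok" if dump_looks_ok(result) else "dump looks corrupt")
```

Because `count_tag_lines` takes any iterable of lines and reads one line at a time, it should also work on a stream such as `sys.stdin`, so the whole dump never has to sit in memory.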
20:43 🔗 Nemo_bis Thank you.
20:43 🔗 Nemo_bis Did my dumps have problems?
20:45 🔗 emijrp i have not checked them
20:46 🔗 Nemo_bis ok. It would be useful to add to the Tutorials which changes/checks you do on dumps before uploading them, so we can save you some time.
20:46 🔗 Nemo_bis It's also quite boring to uncompress multi-GiB 7z archives to run grep on them.
20:47 🔗 emijrp not much, only the integrity check
20:48 🔗 Nemo_bis that is?
20:51 🔗 emijrp the grep thing
20:51 🔗 emijrp last section http://code.google.com/p/wikiteam/wiki/Tutorial
20:55 🔗 emijrp 30513
20:55 🔗 emijrp 30513
20:55 🔗 emijrp 75202
20:55 🔗 Nemo_bis yes, I understand how it works
20:55 🔗 * Nemo_bis loves grep
20:55 🔗 emijrp infictivecom-20110712-history is ok
20:58 🔗 Nemo_bis what CPU do you have?
20:58 🔗 Nemo_bis mine is very slow, but I could uncompress some of them and check myself
20:58 🔗 emijrp dualcore
20:58 🔗 emijrp 2ghz
20:59 🔗 emijrp acwikkiinet_w-20110712-history is ok too
21:00 🔗 Nemo_bis there's also https://docs.google.com/leaf?id=0By9Ct0yopDdVNzExNzIxMWQtN2Q5Ny00NzQzLTgyOWQtMTdkZjcwNDNhY2E0&hl=it
21:01 🔗 emijrp what the hell you saved 911dataset? that site had 700k pages
21:01 🔗 emijrp are you sure you got it entirely?
21:02 🔗 Nemo_bis took a while but it was quite fast
21:02 🔗 Nemo_bis not completely sure, though
21:03 🔗 emijrp by the way the site is now offline
21:03 🔗 emijrp i hope you did not waste their bandwidth lol
21:03 🔗 Nemo_bis :-D
21:03 🔗 Nemo_bis 6 days
21:04 🔗 Nemo_bis but only 274 MB ??
21:04 🔗 Nemo_bis perhaps something got lost
21:05 🔗 Nemo_bis Can you access http://www.wikilib.com/ ? Either it's offline or they blocked my IP
21:05 🔗 Nemo_bis couldn't complete download
21:05 🔗 emijrp what connection do you have?
21:06 🔗 Nemo_bis 10 Mb/s full duplex, fiber
21:06 🔗 Nemo_bis Fastweb (Milan)
21:07 🔗 Nemo_bis If those websites had enough bandwidth I'd use some PC at university with 60-100 Mb/s connection...
21:07 🔗 emijrp openwetwareorg-20110712-history is ok
21:08 🔗 emijrp sometimes the bottleneck is the cpu, special:export is resource-consuming
21:09 🔗 emijrp boobpedia crashes if you dont add a delay
21:09 🔗 emijrp wikilib error 503
21:10 🔗 emijrp i developed a script to repair corrupt dumps
21:11 🔗 emijrp it removes corrupt pages
21:11 🔗 emijrp but i want to test it before uploading it to svn
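emijrp's repair script is not shown in this log, so the following is only a guess at the general idea, written as a minimal Python sketch. The function names and the definition of "corrupt" are assumptions: here a page is kept only if it has exactly one `<title>`, a closing `</page>`, and at least one balanced `<revision>`/`</revision>` pair. It also assumes `<page>`, `</page>` and `</mediawiki>` each sit on their own line, as in typical Special:Export output.

```python
from typing import Iterable, Iterator, List, Optional


def page_is_complete(block: List[str]) -> bool:
    """Assumed definition of a healthy page: one title, one close, balanced revisions."""
    text = "".join(block)
    revisions = text.count("<revision>")
    return (
        text.count("<title>") == 1
        and text.count("</page>") == 1
        and revisions > 0
        and revisions == text.count("</revision>")
    )


def drop_corrupt_pages(lines: Iterable[str]) -> Iterator[str]:
    """Yield the dump's lines, skipping any page that is truncated or unbalanced."""
    block: Optional[List[str]] = None  # lines of the page being buffered, if any
    for line in lines:
        if "<page>" in line:
            # A new page starts; an unfinished buffered page is truncated, so drop it.
            block = [line]
        elif block is None:
            # Outside any page: header or footer material, passed through as-is.
            yield line
        elif "</mediawiki>" in line:
            # The dump ended inside a page: drop that page, keep the footer.
            block = None
            yield line
        else:
            block.append(line)
            if "</page>" in line:
                if page_is_complete(block):
                    yield from block
                block = None
```

One design caveat: each page is buffered whole before it is written, so a page with thousands of large revisions would need correspondingly large amounts of RAM.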
21:13 🔗 Nemo_bis ok
21:14 🔗 emijrp dumpgenerator.py includes some checks before merging a page into the whole dump
21:14 🔗 emijrp but man, this is a script to recover wikis, not websites with 100,000 pages, some of them with hundreds of revisions
21:15 🔗 emijrp : P
21:16 🔗 emijrp a user reported an issue because he cant download all the images of English Wikipedia using dumpgenerator.py, maaaannn
21:16 🔗 Nemo_bis :-D
21:17 🔗 Nemo_bis just had a memoryerror on some huge page
21:17 🔗 Nemo_bis but was only 900 MiB RAM, :-p I had more free :-/
21:19 🔗 emijrp you saved some big wikis
21:20 🔗 emijrp wikieducator users requesting dump http://wikieducator.org/WikiEducator:Community_Council/Meetings/First/Motiondump
21:21 🔗 Nemo_bis lol
21:21 🔗 Nemo_bis didn't have much support
21:21 🔗 Nemo_bis wowpedia dump is currently at 12 Gib and increasing
21:21 🔗 Nemo_bis *GiB
21:22 🔗 emijrp you didnt upload images for wikieducator
21:22 🔗 emijrp wowpedia is a wikia site? they offer backups in that case
21:23 🔗 Nemo_bis I'm still uploading images
21:23 🔗 emijrp i usually search on google like this: dump site:wikieducator.org
21:24 🔗 emijrp looking for official dumps
21:24 🔗 Nemo_bis yes
21:24 🔗 emijrp before running the script
21:24 🔗 Nemo_bis www.wowpedia.org doesn't seem to be Wikia
21:24 🔗 emijrp wikieducatororg-20110712-history is ok
21:25 🔗 Nemo_bis argh, why do people let pages grow at 20000 150 KB revisions??
21:26 🔗 Nemo_bis I need all my RAM for that page .-/
21:27 🔗 Nemo_bis I think this project could be a threat: dump your wiki or I'll DoS it - with a good use. .-p
21:29 🔗 emijrp also, you can add links to backups in wikiindex
21:30 🔗 Nemo_bis boooring
21:30 🔗 Nemo_bis We need some automated system for this, sooner or later
21:30 🔗 Nemo_bis (I mean, the whole project)
21:31 🔗 emijrp look at the last infobox parameter http://wikiindex.org/Rezepte-Wiki
21:33 🔗 Nemo_bis yes. wikiindex also needs an API parameter in infobox
21:34 🔗 Nemo_bis although some evil wikis disable API or are just too outdated
21:39 🔗 emijrp thats a problem
21:42 🔗 emijrp what is the limit of google doc hosting?
21:43 🔗 emijrp s23org_w-20110707-wikidump is ok
21:46 🔗 Nemo_bis Google Docs is only 1 GiB
21:46 🔗 Nemo_bis although you can expand
21:46 🔗 Nemo_bis I have several Google accounts :-p
21:58 🔗 emijrp strategywiki is ok
22:14 🔗 Nemo_bis for some reason can't export http://wiki.guildwars.com/index.php?title=Guild:Its_Mocha_(historical)&action=history
22:14 🔗 Nemo_bis (only last revision export works)
22:17 🔗 emijrp stupidediaorg-20110712-history is ok
22:17 🔗 emijrp all the dumps you uploaded to IA are ok, please check future dumps before uploading them
