Time | Nickname | Message
08:18 | Nemo_bis | http://www.archive.org/search.php?query=subject%3A%22wikiteam%22
20:13 | * | Nemo_bis uploading a 20 GiB image dump of a wiki to IA :-O
20:14 | Nemo_bis | emijrp, http://www.archive.org/search.php?query=subject%3A%22wikiteam%22
20:15 | emijrp | nice
20:16 | emijrp | I'm going to contact Jason Scott to try to add the WikiTeam items to the Archive Team collection http://www.archive.org/details/archiveteam
20:17 | Nemo_bis | Yes, I wanted to ask you about it.
20:17 | Nemo_bis | I tried to upload them there but it says I have no permission (obviously).
20:18 | Nemo_bis | By the way, I'm using http://www.archive.org/help/abouts3.txt; after trying the browser and FTP, it's sooo sweet
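
The S3-like API documented at abouts3.txt accepts plain HTTP PUTs with Internet Archive headers. A minimal sketch of such an upload with curl, where the credentials, item name, and file name are all placeholders:

    # Sketch: PUT one dump file into a new IA item (all names are placeholders)
    curl --location \
         --header "authorization: LOW ACCESSKEY:SECRET" \
         --header "x-amz-auto-make-bucket:1" \
         --header "x-archive-meta-subject:wikiteam" \
         --upload-file examplewiki-20110712-history.xml.7z \
         http://s3.us.archive.org/examplewiki-20110712/examplewiki-20110712-history.xml.7z
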
20:20 | emijrp | I check the .xml files before uploading them to Google Code
20:20 | emijrp | I use this line
20:20 | emijrp | inside the dump directory
20:20 | emijrp | wc -l *.txt;grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
20:21 | emijrp | if the first 3 numbers are equal, and the last two too, then the dump is ok
20:21 | emijrp | most times there is no problem
20:21 | emijrp | but check them before uploading
20:22 | emijrp | you can remove the first wc -l
20:22 | emijrp | grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
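
The same check can be wrapped so the comparison is automatic rather than done by eye. This is only an illustrative sketch; the script name and messages are made up, not part of the WikiTeam tools:

    #!/bin/bash
    # checkdump.sh <dump.xml>: a dump looks complete when
    # #titles == #<page> == #</page> and #<revision> == #</revision>
    xml="$1"
    titles=$(grep -c "<title>" "$xml")
    popen=$(grep -c "<page>" "$xml")
    pclose=$(grep -c "</page>" "$xml")
    ropen=$(grep -c "<revision>" "$xml")
    rclose=$(grep -c "</revision>" "$xml")
    if [ "$titles" -eq "$popen" ] && [ "$popen" -eq "$pclose" ] && [ "$ropen" -eq "$rclose" ]; then
        echo "$xml: ok"
    else
        echo "$xml: possibly corrupt ($titles titles, $popen/$pclose pages, $ropen/$rclose revisions)"
    fi
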
20:43 | Nemo_bis | Thank you.
20:43 | Nemo_bis | Did my dumps have problems?
20:45 | emijrp | I haven't checked them
20:46 | Nemo_bis | ok. It would be useful to add to the Tutorial which changes/checks you do on dumps before uploading them, so we can save you some time.
20:46 | Nemo_bis | It's also quite boring to uncompress multi-GiB 7z archives to run grep on them.
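
Full decompression can be avoided, assuming a workflow like the one above: 7z can stream an archive member to stdout with -so, so all five counts can be computed on the fly in a single pass (the archive name here is hypothetical):

    # Stream the dump out of the archive and count every tag in one pass
    7z e -so examplewiki-20110712-history.xml.7z \
      | awk '/<title>/{t++} /<page>/{p++} /<\/page>/{pc++}
             /<revision>/{r++} /<\/revision>/{rc++}
             END{print t, p, pc, r, rc}'
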
20:47 | emijrp | not much, only the integrity check
20:48 | Nemo_bis | that is?
20:51 | emijrp | the grep thing
20:51 | emijrp | last section of http://code.google.com/p/wikiteam/wiki/Tutorial
20:55 | emijrp | 30513
20:55 | emijrp | 30513
20:55 | emijrp | 75202
20:55 | Nemo_bis | yes, I understand how it works
20:55 | * | Nemo_bis loves grep
20:55 | emijrp | infictivecom-20110712-history is ok
20:58 | Nemo_bis | what CPU do you have?
20:58 | Nemo_bis | mine is very slow, but I could uncompress some of them and check myself
20:58 | emijrp | dual core
20:59 | emijrp | 2 GHz
21:00 | emijrp | acwikkiinet_w-20110712-history is ok too
21:01 | Nemo_bis | there's also https://docs.google.com/leaf?id=0By9Ct0yopDdVNzExNzIxMWQtN2Q5Ny00NzQzLTgyOWQtMTdkZjcwNDNhY2E0&hl=it
21:01 | emijrp | what the hell, you saved 911dataset? that site had 700k pages
21:02 | emijrp | are you sure you got it entirely?
21:02 | Nemo_bis | took a while but it was quite fast
21:03 | Nemo_bis | not completely sure, though
21:03 | emijrp | by the way, the site is now offline
21:03 | emijrp | I hope you did not waste their bandwidth lol
21:03 | Nemo_bis | :-D
21:04 | Nemo_bis | 6 days
21:04 | Nemo_bis | but only 274 MB ??
21:05 | Nemo_bis | perhaps something got lost
21:05 | Nemo_bis | Can you access http://www.wikilib.com/ ? Either it's offline or they blocked my IP
21:05 | Nemo_bis | couldn't complete the download
21:06 | emijrp | what connection do you have?
21:06 | Nemo_bis | 10 Mb/s full duplex, fiber
21:07 | Nemo_bis | Fastweb (Milan)
21:07 | Nemo_bis | If those websites had enough bandwidth I'd use some PC at university with a 60-100 Mb/s connection...
21:08 | emijrp | openwetwareorg-20110712-history is ok
21:09 | emijrp | sometimes the bottleneck is the CPU; Special:Export is resource-consuming
21:09 | emijrp | Boobpedia crashes if you don't add a delay
21:10 | emijrp | wikilib gives error 503
21:11 | emijrp | I developed a script to repair corrupt dumps
21:11 | emijrp | it removes corrupt pages
21:13 | emijrp | but I want to test it before uploading it to SVN
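
That repair script isn't in SVN yet, so the following is only a rough sketch of the general idea as described (an assumption about its approach, not emijrp's actual code): buffer each <page> block and emit it only once its closing tag is seen, so truncated pages are dropped:

    # Sketch: copy a dump, keeping only complete <page>...</page> blocks;
    # a page with no closing tag is dropped when the next page starts
    # or the file ends
    awk '/<page>/  {buf = ""; inpage = 1}
         inpage    {buf = buf $0 "\n"}
         /<\/page>/{printf "%s", buf; inpage = 0}
         !inpage && !/<\/page>/{print}' dump.xml > dump-repaired.xml
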
21:13
🔗
|
Nemo_bis |
ok |
21:14
🔗
|
emijrp |
dumpgenerator.py includes some comprobations before merge a page to the whole dump |
21:14
🔗
|
emijrp |
but man, this is a script to recover wikis, not websites with 100,000 pages some of then with hundreds of revisions |
21:15
🔗
|
emijrp |
: P |
21:16
🔗
|
emijrp |
a user reported an issue because he cant download all the images of English Wikipedia using dumpgenerator.py, maaaannn |
21:16
🔗
|
Nemo_bis |
:-D |
21:17
🔗
|
Nemo_bis |
just had a memoryerror on some huge page |
21:17
🔗
|
Nemo_bis |
but was only 900 MiB RAM, :-p I had more free :-/ |
21:19
🔗
|
emijrp |
you saved some big wikis |
21:20
🔗
|
emijrp |
wikieducator users requesting dump http://wikieducator.org/WikiEducator:Community_Council/Meetings/First/Motiondump |
21:21
🔗
|
Nemo_bis |
lol |
21:21
🔗
|
Nemo_bis |
didn't have much support |
21:21
🔗
|
Nemo_bis |
wowpedia dump is currently at 12 Gib and increasing |
21:21
🔗
|
Nemo_bis |
*GiB |
21:22
🔗
|
emijrp |
you didnt upload images for wikieducator |
21:22
🔗
|
emijrp |
wowpedia is a wikia site? they offer backups in that case |
21:23
🔗
|
Nemo_bis |
I'm uploading still images |
21:23
🔗
|
emijrp |
i use to search of google like this: dump site:wikieducator.org |
21:24
🔗
|
emijrp |
looking for official dumps |
21:24
🔗
|
Nemo_bis |
yes |
21:24
🔗
|
emijrp |
before run the script |
21:24
🔗
|
Nemo_bis |
www.wowpedia.org doesn't seem to be Wikia |
21:24
🔗
|
emijrp |
wikieducatororg-20110712-history is ok |
21:25
🔗
|
Nemo_bis |
argh, why do people let pages grow at 20000 150 KB revisions?? |
21:26
🔗
|
Nemo_bis |
I need all my RAM for that page .-/ |
21:27
🔗
|
Nemo_bis |
I think this project could be a threat: dump your wiki or I'll DoS it � with a good use. .-p |
21:29
🔗
|
emijrp |
also, you can add links to backups in wikiindex |
21:30
🔗
|
Nemo_bis |
boooring |
21:30
🔗
|
Nemo_bis |
We need some automated system for this, sooner or later |
21:30
🔗
|
Nemo_bis |
(I mean, the whole project) |
21:31
🔗
|
emijrp |
look the infobox last parameter http://wikiindex.org/Rezepte-Wiki |
21:33
🔗
|
Nemo_bis |
yes. wikiindex also needs an API parameter in infobox |
21:34
🔗
|
Nemo_bis |
although some evil wikis disable API or are just too outdated |
21:39
🔗
|
emijrp |
thats a problem |
21:42
🔗
|
emijrp |
what is the limit of google doc hosting? |
21:43
🔗
|
emijrp |
s23org_w-20110707-wikidump is ok |
21:46
🔗
|
Nemo_bis |
Google Docs is only 1 GiB |
21:46
🔗
|
Nemo_bis |
although you can expand |
21:46
🔗
|
Nemo_bis |
I have several Google accounts :-p |
21:58
🔗
|
emijrp |
strategywiki is oko |
22:14
🔗
|
Nemo_bis |
for some reason can't export http://wiki.guildwars.com/index.php?title=Guild:Its_Mocha_(historical)&action=history |
22:14
🔗
|
Nemo_bis |
(only last revision export works) |
22:17
🔗
|
emijrp |
stupidediaorg-20110712-history is ok |
22:17
🔗
|
emijrp |
all the dumps you uploaded to IA are ok, please check future dumps before upload them |