08:18 <Nemo_bis> http://www.archive.org/search.php?query=subject%3A%22wikiteam%22
20:13 * Nemo_bis uploading a 20 GiB image dump of a wiki to IA :-O
20:14 <Nemo_bis> emijrp, http://www.archive.org/search.php?query=subject%3A%22wikiteam%22
20:15 <emijrp> nice
20:16 <emijrp> I'm going to contact Jason Scott, to try to add the WikiTeam items to the Archive Team collection http://www.archive.org/details/archiveteam
20:17 <Nemo_bis> Yes, I wanted to ask you about it.
20:17 <Nemo_bis> I tried to upload them there but it says I have no permission (obviously).
20:18 <Nemo_bis> By the way, I'm using http://www.archive.org/help/abouts3.txt; after trying the browser and FTP, it's sooo sweet
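
The abouts3.txt interface boils down to a single authenticated HTTP PUT; here is a minimal sketch, assuming placeholder credentials and item name (ACCESSKEY, SECRET and wiki-dump-example are hypothetical):

    # create the item and upload one file in a single PUT;
    # the x-archive-meta-* headers set the item's metadata
    # (ACCESSKEY:SECRET and wiki-dump-example are placeholders)
    curl --location \
         --header "authorization: LOW ACCESSKEY:SECRET" \
         --header "x-amz-auto-make-bucket:1" \
         --header "x-archive-meta-subject:wikiteam" \
         --upload-file wikidump.7z \
         http://s3.us.archive.org/wiki-dump-example/wikidump.7z
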
20:20 <emijrp> I check the .xml files before uploading them to Google Code
20:20 <emijrp> I use this line
20:20 <emijrp> inside the dump directory
20:20 <emijrp> wc -l *.txt;grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
20:21 <emijrp> if the first 3 numbers are equal, and the last two too, then the dump is ok
20:21 <emijrp> most of the time there is no problem
20:21 <emijrp> but check them before uploading
20:22 <emijrp> you can remove the first wc -l
20:22 <emijrp> grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
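
emijrp's check can be scripted so the comparison doesn't rely on eyeballing five numbers; a minimal sketch of the same logic (the dump filename is a placeholder):

    # integrity check: the <title>, <page> and </page> counts must match,
    # and so must the <revision> and </revision> counts
    t=$(grep -c "<title>" dump.xml)
    p=$(grep -c "<page>" dump.xml)
    pe=$(grep -c "</page>" dump.xml)
    r=$(grep -c "<revision>" dump.xml)
    re=$(grep -c "</revision>" dump.xml)
    if [ "$t" -eq "$p" ] && [ "$p" -eq "$pe" ] && [ "$r" -eq "$re" ]; then
        echo "dump is ok"
    else
        echo "dump is broken: titles=$t pages=$p/$pe revisions=$r/$re"
    fi
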
20:43 <Nemo_bis> Thank you.
20:43 <Nemo_bis> Did my dumps have problems?
20:45 <emijrp> I haven't checked them
20:46 <Nemo_bis> ok. It would be useful to add to the Tutorials which changes/checks you do on dumps before uploading them, so we can save you some time.
20:46 <Nemo_bis> It's also quite boring to uncompress multi-GiB 7z archives to run grep on them.
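
The full uncompression can be skipped if 7z is asked to extract to stdout; a sketch, assuming the standard -so (write to stdout) switch and a placeholder archive name:

    # stream the archive straight into grep; nothing is written to disk
    7z e -so wikidump-history.xml.7z 2>/dev/null | grep -c "<title>"
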
20:47 <emijrp> not much, only the integrity check
20:48 <Nemo_bis> that is?
20:51 <emijrp> the grep thing
20:51 <emijrp> last section http://code.google.com/p/wikiteam/wiki/Tutorial
20:55 <emijrp> 30513
20:55 <emijrp> 30513
20:55 <emijrp> 75202
20:55 <Nemo_bis> yes, I understand how it works
20:55 * Nemo_bis loves grep
20:55 <emijrp> infictivecom-20110712-history is ok
20:58 <Nemo_bis> what CPU do you have?
20:58 <Nemo_bis> mine is very slow, but I could uncompress some of them and check myself
20:58 <emijrp> dual core
20:58 <emijrp> 2 GHz
20:59 <emijrp> acwikkiinet_w-20110712-history is ok too
21:00 <Nemo_bis> there's also https://docs.google.com/leaf?id=0By9Ct0yopDdVNzExNzIxMWQtN2Q5Ny00NzQzLTgyOWQtMTdkZjcwNDNhY2E0&hl=it
21:01 <emijrp> what the hell, you saved 911dataset? that site had 700k pages
21:01 <emijrp> are you sure you downloaded it entirely?
21:02 <Nemo_bis> took a while but it was quite fast
21:02 <Nemo_bis> not completely sure, though
21:03 <emijrp> by the way, the site is now offline
21:03 <emijrp> I hope you did not waste their bandwidth lol
21:03 <Nemo_bis> :-D
21:03 <Nemo_bis> 6 days
21:04 <Nemo_bis> but only 274 MB ??
21:04 <Nemo_bis> perhaps something got lost
21:05 <Nemo_bis> Can you access http://www.wikilib.com/ ? Either it's offline or they blocked my IP
21:05 <Nemo_bis> couldn't complete the download
21:05 <emijrp> what connection do you have?
21:06 <Nemo_bis> 10 Mb/s full duplex, fiber
21:06 <Nemo_bis> Fastweb (Milan)
21:07 <Nemo_bis> If those websites had enough bandwidth I'd use some PC at university with a 60-100 Mb/s connection...
21:07 <emijrp> openwetwareorg-20110712-history is ok
21:08 <emijrp> sometimes the bottleneck is the CPU; Special:Export is resource-consuming
21:09 <emijrp> Boobpedia crashes if you don't add a delay
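
The delay emijrp mentions is, if memory serves, a command-line option of dumpgenerator.py; a hedged usage sketch (the wiki URL is a placeholder, and the exact option name should be checked against the script's --help):

    # wait a few seconds between requests so small wikis aren't hammered
    python dumpgenerator.py --api=http://wiki.example.org/api.php --xml --images --delay=5
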
21:09 <emijrp> wikilib: error 503
21:10 <emijrp> I developed a script to repair corrupt dumps
21:11 <emijrp> it removes corrupt pages
21:11 <emijrp> but I want to test it before uploading it to SVN
21:13 <Nemo_bis> ok
21:14 <emijrp> dumpgenerator.py includes some checks before merging a page into the whole dump
21:14 <emijrp> but man, this is a script to recover wikis, not websites with 100,000 pages, some of them with hundreds of revisions
21:15 <emijrp> :P
21:16 <emijrp> a user reported an issue because he can't download all the images of English Wikipedia using dumpgenerator.py, maaaannn
21:16 <Nemo_bis> :-D
21:17 <Nemo_bis> just had a MemoryError on some huge page
21:17 <Nemo_bis> but it was only 900 MiB of RAM :-p I had more free :-/
21:19 <emijrp> you saved some big wikis
21:20 <emijrp> WikiEducator users are requesting a dump: http://wikieducator.org/WikiEducator:Community_Council/Meetings/First/Motiondump
21:21 <Nemo_bis> lol
21:21 <Nemo_bis> didn't have much support
21:21 <Nemo_bis> the wowpedia dump is currently at 12 GiB and increasing
21:22 <emijrp> you didn't upload images for WikiEducator
21:22 <emijrp> is Wowpedia a Wikia site? they offer backups in that case
21:23 <Nemo_bis> I'm still uploading the images
21:23 <emijrp> I usually search on Google like this: dump site:wikieducator.org
21:24 <emijrp> looking for official dumps
21:24 <Nemo_bis> yes
21:24 <emijrp> before running the script
21:24 <Nemo_bis> www.wowpedia.org doesn't seem to be Wikia
21:25 <emijrp> wikieducatororg-20110712-history is ok
21:25
🔗
|
Nemo_bis |
argh, why do people let pages grow at 20000 150 KB revisions?? |
|
21:26
🔗
|
Nemo_bis |
I need all my RAM for that page .-/ |
|
21:27
🔗
|
Nemo_bis |
I think this project could be a threat: dump your wiki or I'll DoS it � with a good use. .-p |
|
21:29 <emijrp> also, you can add links to the backups on WikiIndex
21:30 <Nemo_bis> boooring
21:30 <Nemo_bis> We need some automated system for this, sooner or later
21:30 <Nemo_bis> (I mean, the whole project)
21:31 <emijrp> look at the last parameter of the infobox: http://wikiindex.org/Rezepte-Wiki
21:33 <Nemo_bis> yes. WikiIndex also needs an API parameter in the infobox
21:34 <Nemo_bis> although some evil wikis disable the API or are just too outdated
21:39 <emijrp> that's a problem
21:42 <emijrp> what is the limit of Google Docs hosting?
21:43 <emijrp> s23org_w-20110707-wikidump is ok
21:46 <Nemo_bis> Google Docs is only 1 GiB
21:46 <Nemo_bis> although you can expand it
21:46 <Nemo_bis> I have several Google accounts :-p
21:58 <emijrp> strategywiki is ok
22:14 <Nemo_bis> for some reason I can't export http://wiki.guildwars.com/index.php?title=Guild:Its_Mocha_(historical)&action=history
22:14 <Nemo_bis> (only exporting the last revision works)
22:17 <emijrp> stupidediaorg-20110712-history is ok
22:17 <emijrp> all the dumps you uploaded to IA are ok; please check future dumps before uploading them