Time |
Nickname |
Message |
15:05
🔗
|
Nemo_bis |
SketchCow: how hard would it be in your opinion for me to adapt your keywords-generator to wiki XML dumps? |
15:06
🔗
|
SketchCow |
It's not hard. |
15:06
🔗
|
SketchCow |
Give me a place to grab an .xml dump to test. |
15:13
🔗
|
Nemo_bis |
SketchCow: http://archive.org/download/wikia_dump_20140125/o.zip/originals_pages_current.xml.gz for instance |
15:18
🔗
|
SketchCow |
Ok, so first, I'm running this straight up at the XML. |
15:18
🔗
|
SketchCow |
I suspect this will be a d i s a s t e r |
15:18
🔗
|
SketchCow |
But we need a control group |
15:32
🔗
|
Nemo_bis |
:) The meat is between <text*></text> tags |
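For reference, pulling out only the `<text>` contents before keyword extraction could look roughly like the sketch below. It assumes a standard MediaWiki XML export; the `iter_page_texts` name and the inline `SAMPLE` document are made up for illustration and are not part of the keywords-generator discussed here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # Export files declare an XML namespace, so tags arrive as "{uri}text".
    return tag.rsplit("}", 1)[-1]


def iter_page_texts(source):
    """Yield the contents of every <text> element, streaming through the dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "text":
            yield elem.text or ""
        elif name == "page":
            elem.clear()  # drop the finished page's children to limit memory use


# Tiny stand-in for a real dump (hypothetical content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Example</title>
    <revision>
      <contributor><username>SomeUser</username></contributor>
      <text xml:space="preserve">Hello '''wiki''' world</text>
    </revision>
  </page>
</mediawiki>"""

texts = list(iter_page_texts(io.BytesIO(SAMPLE)))
print(texts)
```

For a compressed dump, the file object returned by `gzip.open(path, "rb")` can be passed in place of the `BytesIO` object. Note that usernames live in `<contributor>` elements, so this approach would leave them out of the keyword input.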
15:33
🔗
|
SketchCow |
Yeah. |
15:34
🔗
|
SketchCow |
http://fos.textfiles.com/explosion.txt |
15:34
🔗
|
SketchCow |
Well, that came out better than expected. |
15:34
🔗
|
SketchCow |
That's no changes whatsoever. |
15:35
🔗
|
SketchCow |
When I generate the things for internet archive, I take the top 10 single-words and the top 10 phrases. |
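A minimal sketch of that "top 10 words plus top 10 phrases" selection follows. This is not SketchCow's actual generator, which is not shown in the log; it assumes a "phrase" means two consecutive words and does no stop-word filtering. The `top_keywords` name and its parameters are hypothetical.

```python
import re
from collections import Counter


def top_keywords(text, n=10, phrase_len=2):
    """Return (top single words, top phrases) as lists of (item, count) pairs."""
    # Assumption: a word is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    singles = Counter(words)
    # Assumption: a phrase is phrase_len consecutive words.
    phrases = Counter(
        " ".join(words[i:i + phrase_len])
        for i in range(len(words) - phrase_len + 1)
    )
    return singles.most_common(n), phrases.most_common(n)


sample = "the cat sat on the mat the cat ran"
top_words, top_phrases = top_keywords(sample, n=3)
print(top_words)
print(top_phrases)
```

A real run would probably want a stop-word list, since otherwise words like "the" dominate the single-word ranking.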
15:39
🔗
|
Nemo_bis |
Indeed, clean enough. |
15:39
🔗
|
Nemo_bis |
Half of the top10 is usernames I think? |
15:45
🔗
|
SketchCow |
Can't speak for that |
19:19
🔗
|
midas |
so, couple of weeks/month later, http://imslp.org/ is now at 1.1GB downloaded :P |
19:52
🔗
|
Nemo_bis |
midas: if this can make you feel better, I'm at 60 MB downloaded from wiki-site.com after more than 8 months |
19:55
🔗
|
Nemo_bis |
* 28 MB of deduplicated 7z in 7 months |
20:09
🔗
|
midas |
lol |
20:10
🔗
|
Nemo_bis |
I've almost finished the wikis named b* |
20:12
🔗
|
midas |
mine stops every so many hours and i'll have to wait for 24 hours before i can restart them :p |
20:14
🔗
|
midas |
with a 25 second delay already, but hey, wait and see |
20:15
🔗
|
Nemo_bis |
I'm using a 240 s delay + 720 s between each wiki and it still gets stuck often. :< |
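The two-level delay described above can be sketched like this. It is a generic illustration and not the options of any particular dump tool; `crawl`, `fetch`, and the `sleep` parameter are hypothetical names, and the demo swaps in a fake sleep so it finishes instantly.

```python
import time


def crawl(wikis, fetch, request_delay=240, wiki_delay=720, sleep=time.sleep):
    """Fetch every URL of every wiki, pausing between requests and between wikis.

    wikis: mapping of wiki name -> list of URLs.
    fetch: callable taking a URL and returning its content.
    """
    results = {}
    for i, (name, urls) in enumerate(wikis.items()):
        if i:
            sleep(wiki_delay)  # longer pause before starting the next wiki
        pages = []
        for j, url in enumerate(urls):
            if j:
                sleep(request_delay)  # pause between requests to the same wiki
            pages.append(fetch(url))
        results[name] = pages
    return results


# Demo with stand-ins: record the pauses instead of really waiting.
pauses = []
demo = crawl(
    {"wiki-a": ["a/1", "a/2"], "wiki-b": ["b/1"]},
    fetch=lambda url: "content of " + url,
    sleep=pauses.append,
)
print(pauses)
print(demo)
```

Injecting `sleep` keeps the delays testable; in real use the default `time.sleep` applies.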
20:16
🔗
|
midas |
geez, that's very long between each request |
20:17
🔗
|
Nemo_bis |
They have a very annoying throttling, allegedly to combat spambots |
20:19
🔗
|
midas |
yeah, i noticed the same with imslp |
20:19
🔗
|
Nemo_bis |
No, IMSLP is plainly a jerk :) |
20:20
🔗
|
Nemo_bis |
If you ask they tell you clearly that they don't like to be crawled |
20:21
🔗
|
midas |
yeah, emailed and sent a Twitter message just now |
20:22
🔗
|
midas |
"Move over, i want to scroll your website." |
20:22
🔗
|
midas |
crawl* |
20:25
🔗
|
Nemo_bis |
midas: I don't recommend that, usually the outcome is that your IP gets banned instantly |
20:26
🔗
|
midas |
that's fine, i have enough IPs