#newsgrabber 2017-07-08, Sat

<jrwr> arkiver: on a key that does not exist
I'll return just a plain 0
[00:00]
<JAA> I've been thinking about the "only store one hash" thing again. Do we only want to deduplicate for a single URL or also across URLs?
Example: I had some ArchiveBot jobs recently where the website would append a random "token" to every static URL.
So it retrieved stylesheets, scripts, etc. hundreds of times because of different token strings.
Of course, the content was always identical.
Do we want to account for such things?
[00:03]
If so, we'd need to use only the payload hash as the key but also store the full URL for each hash so we can populate the WARC-Refers-To-Target-URI field. Of course, this would inflate the value stored in the DB massively for potentially relatively little gain. [00:10]
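A minimal sketch of the cross-URL idea JAA describes here, assuming a plain key-value store keyed on the payload hash alone; the function and field names are illustrative, not the project's actual schema:

```python
# Minimal sketch of cross-URL dedupe keyed on payload hash alone.
# The store and field names here are hypothetical, not the actual service schema.
seen = {}  # payload_sha256 -> (original_url, original_date)

def check_or_record(payload_sha256, url, date):
    """Return the original (url, date) if this payload was seen before,
    otherwise record this capture as the original and return None."""
    if payload_sha256 in seen:
        # Duplicate payload, possibly under a different URL (e.g. a random token
        # appended to static assets). The stored URL/date would populate
        # WARC-Refers-To-Target-URI and WARC-Refers-To-Date in a revisit record.
        return seen[payload_sha256]
    seen[payload_sha256] = (url, date)
    return None
```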
<jrwr> Ya [00:19]
........ (idle for 38mn)
arkiver: http://163.172.138.207/5d2fd6db77cc8e8461cc5c51077f361960f6fc9aac1b57f083a4d51bf057cdbd
it's a bullshit record
but it has everything
at rest, when it hits the cache, it's 0.00011 -- when it's a miss it's 0.00352
I'll fill the DB with a million records and see how it holds
arkiver: if the service returns anything but that OK (mostly a -1 or a 0) it's a MISS on the dedupe
or an error
[00:57]
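A rough client for the lookup jrwr describes, assuming only what's stated in the channel (GET the payload hash off the service root; a body starting with "OK:" is a hit, anything else counts as a miss):

```python
# Rough client for the dedupe lookup: GET the hash and treat anything that
# isn't an "OK:" body as a miss. Response details beyond the prefix are assumed.
from urllib.request import urlopen
from urllib.error import URLError

DEDUPE_HOST = "http://163.172.138.207"

def lookup(payload_sha256):
    """Return the record body on a hit, or None on a miss/error."""
    try:
        body = urlopen(f"{DEDUPE_HOST}/{payload_sha256}", timeout=10).read().decode()
    except URLError:
        return None  # errors count as a miss, per the conversation
    return body if body.startswith("OK:") else None
```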
I'm shoving 5 million records into the DB right now [01:14]
................................................. (idle for 4h3mn)
***kyan has quit IRC (Read error: Operation timed out) [05:17]
.... (idle for 16mn)
kyan has joined #newsgrabber [05:33]
.................................... (idle for 2h55mn)
<HCross2> Does this db need a lot of disk IO? [08:28]
............ (idle for 58mn)
***kyan has quit IRC (Read error: Operation timed out) [09:26]
kyan has joined #newsgrabber [09:31]
........... (idle for 52mn)
kyan has quit IRC (Read error: Operation timed out) [10:23]
<HCross2> jrwr: my concern about running dedupe on the same server is the IO impact
Doesn't take much to topple 3 HDDs - especially when megawarc is thrashing away
[10:24]
***kyan has joined #newsgrabber [10:27]
........... (idle for 52mn)
kyan has quit IRC (Read error: Operation timed out) [11:19]
............................................................................................... (idle for 7h52mn)
<jrwr> HCross2: Kind of
HCross2: That is why I'm setting it up on a fast VPS
[19:11]
<HCross2> ah nice [19:12]
<jrwr> I've got a key return of 0.00001 seconds currently
and that's single requests, on top of a memcached cache that handles it even quicker
Inserts are a little more costly at 0.0001s per insert
I would like to test it; I'm going to look into a siege of it and try to randomly access all 5 million records I shoved into it last night
[19:12]
<HCross2> jrwr: Would you mind if I tested it from around the world as well to get an idea of how latency affects things? [19:15]
<jrwr> Ya
So I sha256 1-5000000
those are the keys
[19:16]
<HCross2> I've got some stuff in Singapore I can thrash it with [19:16]
<jrwr> 5000001 was the last key
the first hit will pull out of the DB, then Nginx's Cache from there
[19:16]
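A sketch of how that test key set could be regenerated, assuming the keys are SHA-256 digests of the decimal strings jrwr mentions (he gives 1-5000000 but also says 5000001 was the last key, so the exact upper bound is uncertain):

```python
# Regenerate the ~5M test keys and the urls.txt file fed to siege,
# assuming each key is sha256 of the decimal string of the integer.
import hashlib

with open("urls.txt", "w") as f:
    for i in range(1, 5_000_001):  # upper bound assumed; jrwr mentions 5000001 as the last key
        key = hashlib.sha256(str(i).encode()).hexdigest()
        f.write(f"http://163.172.138.207/{key}\n")
```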
<HCross2> ah right [19:17]
<jrwr> you will find a header in there
X-MainCache
a HIT is nginx, a MISS is the DB, and the time right after it is the raw processing time
[19:17]
<HCross2> ah right
What's the API URL?
[19:18]
<jrwr> http://163.172.138.207/00007b5005f865adc26eea2ef79d06607ecc8bc0a460cf3c3fda0ad1cf8417b7
that's an example
OK: is a valid record; anything else (0, -1, -9 [let me know if you get -9]) is a miss
[19:18]
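A quick way to poke at that example URL and the X-MainCache header; whether the processing time sits in the same header value is an assumption about the format:

```python
# Check the X-MainCache header jrwr mentions: HIT means nginx's cache answered,
# MISS means it fell through to the DB. The exact header value format is assumed.
from urllib.request import urlopen

resp = urlopen("http://163.172.138.207/00007b5005f865adc26eea2ef79d06607ecc8bc0a460cf3c3fda0ad1cf8417b7")
print(resp.headers.get("X-MainCache"))  # e.g. HIT or MISS plus the raw processing time (format assumed)
print(resp.read().decode()[:80])        # "OK: ..." on a valid record, otherwise 0/-1/-9
```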
<HCross2> Will do - I'm thinking... for grabbers that are doing a lot of concurrent requests - could we run "local" dedupe caches? [19:20]
<jrwr> of course
I was thinking an hour or two max
to handle the current job
it's got a 2G memcache store, the whole database will fit into it
shouldn't be too bad
[19:21]
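A sketch of that local-cache idea, using a plain dict with a TTL as a stand-in for the per-grabber memcache HCross2 and jrwr discuss:

```python
# Per-grabber "local" dedupe cache: keep answers for an hour or two (per jrwr)
# and only ask the central service on a local miss. A dict stands in for memcache.
import time

LOCAL_TTL = 2 * 3600  # seconds; "an hour or two max" to cover the current job
_local = {}           # payload_sha256 -> (expires_at, result)

def cached_lookup(payload_sha256, remote_lookup):
    """remote_lookup is the central-service call (e.g. the lookup() sketch above)."""
    now = time.time()
    entry = _local.get(payload_sha256)
    if entry and entry[0] > now:
        return entry[1]
    result = remote_lookup(payload_sha256)
    _local[payload_sha256] = (now + LOCAL_TTL, result)
    return result
```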
HCross2: if you use siege I'll give you the command line
I'm making the URLs file now
[19:29]
..... (idle for 23mn)
You know when shit is real: when I'm getting 40% CPU usage on memcache!
1500 Concurrency will do that
[19:52]
..... (idle for 20mn)
<HCross2> jrwr: yes please on the command [20:13]
<jrwr> it's a 400MB URL file
so you might run into RAM issues
https://jrwr.io/urls.txt
I'm doing some more tweaking as I'm only getting about 2000 req/s
[20:14]
siege -b --time=1M -c 200 -f urls.txt -v [20:29]
<HCross2> jrwr: yeah, I will have RAM issues.. I've only got 512MB in Singapore [20:31]
<jrwr> Swap
ALL THE SWAP
8GB does the trick
[20:33]
<HCross2> jrwr: where is jrwr.io hosted out of interest? [20:34]
<jrwr> a Scaleway
Paris, what?
why?
[20:35]
<HCross2> heh, I hit over 200Mbps to Singapore [20:35]
<jrwr> A Scaleway will be hosting the dedupe server as well [20:35]
<HCross2> yep, it just gets killed [20:36]
jrwr: did I just murder it all [20:47]
<jrwr> I did
well I am
[20:48]
<HCross2> it's better now [20:48]
<jrwr> https://hastebin.com/helevowewe.swift
that's my siege of it
I'm testing tweaks currently
the 502s are interesting, I don't know why it's timing out
[20:48]
<HCross2> https://www.irccloud.com/pastebin/il8l1uvG/
hm, I get those stats, not the codes
[20:50]
<jrwr> 850.34 trans/sec
that's not bad!
[20:51]
<HCross2> from Miami as well [20:51]
<jrwr> I'm getting 2500 req/s in the same datacenter
I think it can handle newsgrabber, don't you think?
[20:51]
<HCross2> yep [20:51]
<jrwr> Funny part is.... I'm using MariaDB as the backing store, with PHP writing into memcache
fucking memcache IN nginx was slower!
it's using LevelDB as the database store on disk, but I've got 5m keys in storage right now
[20:52]
<HCross2> I've got a feeling M247 just nuked my port [20:53]
<jrwr> I might shove 100m keys into it overnight
and see how it holds
I kind of want to see how redis might do
[20:53]
<HCross2> do it lol
I'm out of town tomorrow, but I'll be around depending on how well my Surface + 4G tether holds up
jrwr: my sieging VM in Miami crashes before your database
[20:56]
<jrwr> LOL
and it's an ARM64 virtual machine
3GB RAM
40GB swap (dedicated disk, 400Mb/s)
[20:58]
............... (idle for 1h13mn)
moved to redis, storing everything as hashes, working very fast [22:12]
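A sketch of that Redis layout, one hash per payload digest; the redis-py client and the field names are assumptions, since only "stored as hashes" is stated in the channel:

```python
# Dedupe records as Redis hashes: one hash per payload digest.
# Key prefix and field names (url, date) are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store(payload_sha256, url, date):
    r.hset(f"dedupe:{payload_sha256}", mapping={"url": url, "date": date})

def fetch(payload_sha256):
    record = r.hgetall(f"dedupe:{payload_sha256}")
    return record or None  # empty dict means a miss
```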
