#newsgrabber 2017-09-03,Sun



Who What When
arkiverjrwr: million? [01:03]
jrwrBillion with a B [01:03]
............................................................................................................................................ (idle for 11h38mn)
arkiver 1.76 percent of URLs downloaded through the Wayback Machine last month were from newsbuddy [12:41]
....... (idle for 30mn)
HCross2 https://archive.org/details/archiveteam_newssites&tab=about 70 million views. Next stop, 100 million
I reckon we can do that by Christmas
arkiver: do you know if I can get the file size of an item over the API?
[13:11]
..... (idle for 20mn)
JAAHCross2: Not sure if this is what you're talking about, but you get the size of each file in the item's JSON, so you could just sum that up? [13:37]
HCross2ah yea, that may be the best way [13:37]
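
A minimal sketch of the "just sum that up" idea JAA describes. It assumes the item's JSON has a "files" list whose entries carry a "size" field (possibly as a string), and the sample dict below is made up for illustration; fetching the JSON is left out.

```python
def total_item_size(metadata):
    """Sum the per-file sizes listed in an item's metadata JSON (in bytes)."""
    total = 0
    for entry in metadata.get("files", []):
        size = entry.get("size")
        if size is not None:  # skip entries that carry no size
            total += int(size)  # assumption: sizes may arrive as strings
    return total


# Hypothetical sample, shaped like the assumed layout.
example = {
    "files": [
        {"name": "a.warc.gz", "size": "1024"},
        {"name": "a_meta.xml", "size": "512"},
        {"name": "no_size_entry"},
    ]
}
print(total_item_size(example))
```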
.......................... (idle for 2h5mn)
jrwrSo
Dedupe is slowing down, I've reached the limits of my box
im doing 1200 req/s
wait, thats per worker
6900 req/s
[15:42]
........... (idle for 53mn)
HCross2jrwr: what would help? Doing some sort of geo dns splitting the load over a EU and US node? [16:39]
jrwrmaybe
split the databases
ssdb does support that
so far on disk the DB is 35GB
and my box (with 8 xeon cores) is at 8.45 load just from it
so a stupid fast disk would help a ton
[16:39]
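
One way to picture "split the databases": route each key to one of several stores by its position in the hex keyspace. This is a client-side sketch only, not SSDB's own mechanism; shard_for_key and the shard counts are hypothetical names and values.

```python
def shard_for_key(hex_key, shard_count):
    """Map a hex key to a shard index by splitting the keyspace into equal ranges.

    Only the first 32 bits of the key are used, which is enough to pick a range
    when the keys are uniformly distributed hashes.
    """
    prefix = int(hex_key[:8], 16)          # first 32 bits of the key
    return (prefix * shard_count) >> 32    # scale 0..2^32-1 down to 0..shard_count-1


low = "00000056f950809576b4db5c0a2ae928bda9967057ed5163eb473b9ff5a3f61f"
high = "ffffffe5cc356ddd9cfb3f1c0627295402cfc125ab9d2b3816cc911e5197d977"
print(shard_for_key(low, 2), shard_for_key(high, 2))
```

Range-based routing keeps neighbouring keys on the same shard; a modulo on the prefix would spread them evenly too, but without contiguous ranges.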
HCross2so some sort of nvme
jrwr: is it purely disk io bound or also cpu bound?
[16:41]
jrwrdisk bound
CPU is at 5%
[16:42]
HCross2hang on a second... let me check something [16:42]
jrwr: would something like this work? https://usercontent.irccloud-cdn.com/file/5gdqrzrj/this [16:47]
jrwrThis might work [16:47]
HCross2jrwr: what is IO usage like atm? Any ideas what it would peak at please? Im talking to a few provider friends trying to come up with ideas [16:48]
jrwrhttp://158.69.248.17/munin/localdomain/localhost.localdomain/index.html#disk [16:49]
HCross2jrwr: trying to work out a way to do this affordably [16:51]
jrwrhttp://158.69.248.17/log.txt
http://158.69.248.17/log2.txt
thats the stats from my server
[16:52]
HCross2jrwr: is there a good ish way that I can simulate the load and see how well something holds up? [16:59]
jrwrmake a ssdb database
load it with about 600 million K:V
(or about 13GB disk used)
and start randomly accessing the keys as fast as you can
its a redis interface so its not too hard
total_calls
863136302410
thats how many commands in the last week
[16:59]
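
A rough sketch of the load test jrwr outlines: hammer a key/value store with random reads and report requests per second. The store here is a small in-memory dict standing in for the real database; against SSDB you would pass in the get function of whatever Redis-protocol client you use (that part is an assumption and is not shown).

```python
import random
import time


def measure_random_reads(get, keys, duration_s=1.0, rng=None):
    """Call get() on randomly chosen keys for duration_s seconds; return calls per second."""
    rng = rng or random.Random()
    calls = 0
    start = time.perf_counter()
    deadline = start + duration_s
    while time.perf_counter() < deadline:
        get(rng.choice(keys))
        calls += 1
    elapsed = time.perf_counter() - start
    return calls / elapsed


# Stand-in store: far smaller than the ~600 million pairs mentioned above.
store = {"key%06d" % i: b"1" for i in range(10000)}
rate = measure_random_reads(store.get, list(store), duration_s=0.2)
print("%.0f req/s" % rate)
```

A single-threaded loop like this measures one worker; several processes would be needed to approach the multi-worker figures quoted in the channel.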
HCross2jrwr: I might have a solution... [17:03]
jrwrIm doing about 6200 req/s right now [17:04]
HCross2jrwr: im looking at just bunging it into the OVH Public Cloud [17:04]
jrwrmaybe
SSDB supports slaves
and does it very well
[17:06]
Smileyoh god
HCross2 is gonna break _all_ of ovh now ;)
[17:10]
HCross2jrwr: http://158.69.248.17/munin/localdomain/localhost.localdomain/if_eth0.html am I reading this right... avg 200Mbps [17:11]
jrwrya
shove everything into google app engine
jrwr is trying to go to the moon after all
[17:11]
HCross2jrwr: dedicated hardware may be our best bet
OVH PCloud cheap instances lock down to 102Mbps
[17:15]
jrwrmy current box is 34$/mo
but has spinning disks
[17:16]
HCross2jrwr: something like https://www.soyoustart.com/en/offers/143sys10.xml ? [17:17]
jrwrya
thats the same one I have
WAIT
WAAAIIITTT
Oh man
IDEA ITEM IS GREAT
so
HCross2: arkiver
what about switching to DHT
https://en.wikipedia.org/wiki/Distributed_hash_table
[17:18]
HCross2 I don't think the warrior is really set up for that, as it's set up as a "pipeline" to just execute a set of instructions over and over again
unless we had a set of instances purely doing DHT
[17:20]
jrwrits a python script
it can do anything
we will have public nodes ofcourse
[17:21]
HCross2it would be an idea, could we do a small scale test? [17:22]
Smileythat's something maybe my server could actually do :O
it's an ARM server with sloooooow disk :/
[17:22]
jrwrhttps://en.wikipedia.org/wiki/Kademlia
Ill look into it today
for now my server will hold
[17:22]
HCross2so would each client only hold a small part of what is going on? [17:23]
jrwryes
since its very static data
and in theory
they can insert the data INTO the keystore themselves
having them all work together
[17:23]
HCross2ah right, so the instance that first discovers a URL stores that URL [17:24]
jrwrya [17:24]
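
A small sketch of how a Kademlia-style DHT decides which peers hold a given key: the key and the node IDs live in the same number space, and the nodes with the smallest XOR distance to the key store it. Hashing the URL with SHA-256 and the tiny node IDs below are assumptions for illustration; a real library may use a different hash and ID width.

```python
import hashlib


def key_for_url(url):
    """Derive an integer key from a URL (assumption: SHA-256 of the UTF-8 bytes)."""
    return int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest(), "big")


def closest_nodes(key, node_ids, k=2):
    """Return the k node IDs closest to key under the XOR metric."""
    return sorted(node_ids, key=lambda node_id: node_id ^ key)[:k]


# Toy 4-bit example: distances to 0b1010 are 2, 8 and 5 respectively.
print(closest_nodes(0b1010, [0b1000, 0b0010, 0b1111], k=2))
```

Because every peer can compute the same distances, any warrior can work out where a key should live without asking a central server.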
Smileysounds awesome
i just wish i was clever enough to help :/
[17:24]
HCross2^ [17:25]
jrwrhttps://github.com/bmuller/kademlia
ill do some hacking today
ill setup a ton of virtual nodes
feed the ENTIRE k:v database I have now into it
and see how it works out
see what kind of IOPS I can get
[17:25]
HCross2jrwr: would it be an idea to also test latency between nodes [17:26]
jrwrya
its the same tech torrents use
[17:26]
HCross2as say you may have some in EU and some in USA [17:26]
jrwrto figure out peers [17:26]
HCross2ah
except on private trackers :p
[17:26]
jrwrevery warrior is a peer as well
so if your key is copied to a close by guy
its even faster
kv : "00000056f950809576b4db5c0a2ae928bda9967057ed5163eb473b9ff5a3f61f" - "ffffffe5cc356ddd9cfb3f1c0627295402cfc125ab9d2b3816cc911e5197d977"
thats the "range" of keys im storing on SSDB right now
ill make some mods to the warriors and ill have you do some testing for dedup
I want to multi-thread out the dedupe as well
this is a great test before we get started with soundcloud
because fuck me our traffic will be insane
[17:26]
HCross2arkiver: ive got a feeling that if this works... Brewster may be interested (I presume youve read his blogs on distributed web/locking the web open) [17:28]
jrwr: I know its slightly offtopic.. but watch https://archive.org/details/BrewsterKahleTNWConferenceEurope2015 when you get a chance - its about 21 minutes long but I think youll find it interesting [17:34]
jrwralready seen it
this is why I'm rooting for IPFS, but it doesn't solve the interactive web issue
key/value pairs have to be republished every 24 hours.
so I can still keep a central K:V store and dump it to the DHT
to republish everything
[17:35]
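
A sketch of the republish bookkeeping implied here, assuming the central K:V store remembers when each key was last pushed to the DHT. The function name and the timestamp dict are hypothetical, and the actual push to the DHT is left out.

```python
REPUBLISH_INTERVAL_S = 24 * 60 * 60  # the 24 hours mentioned above


def keys_due_for_republish(last_published, now, interval_s=REPUBLISH_INTERVAL_S):
    """Return the keys whose last publish time is at least interval_s seconds old."""
    return [key for key, stamp in last_published.items() if now - stamp >= interval_s]


# Timestamps are plain seconds; in practice they would come from time.time().
last_published = {"fresh": 90000, "stale": 1000}
print(keys_due_for_republish(last_published, now=100000))
```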
HCross2ideally.. id like to be able to take the whole thing and stick it in the IA [17:37]
jrwrya
oh man
get the IA to put their dedupe database onto the DHT network
that would be awesome
thats also stupid large
[17:38]
you can set a TXT record to have some values as well
for the bootstrap
we call them DNS Seeds in the Bitcoin world
[17:49]
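
A sketch of the TXT-record bootstrap idea: the record value lists peers, and a client parses it into addresses to bootstrap from. The space-separated "host:port" format is invented for this example, and the DNS lookup itself is not shown.

```python
def parse_seed_txt(txt_value):
    """Parse a TXT value such as "1.2.3.4:8468 5.6.7.8:8468" into (host, port) pairs.

    Tokens that do not look like host:port are skipped.
    """
    peers = []
    for token in txt_value.split():
        host, _, port = token.rpartition(":")
        if host and port.isdigit():
            peers.append((host, int(port)))
    return peers


print(parse_seed_txt("142.44.174.241:8468 194.187.248.82:8468"))
```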
HCross2jrwr: do I annoy OVH and fill *.harrycross.me with records with all our KVs in it :p [17:51]
jrwroh god
do they have a API?
[17:51]
HCross2yes [17:51]
jrwrhahah
DO IT
that would kill their DNS server
[17:51]
HCross2jrwr: "their dns server" you mean their 20 location strong anycast cluster [17:52]
jrwrwanna try it? [17:52]
HCross2id rather not have every member of OVHs DNS team outside my house with pitchforks [17:52]
.... (idle for 16mn)
jrwrlol
so
I have some basic code
that interfaces a webserver to the DHT network
wanna try it?
[18:08]
HCross2sure [18:09]
jrwrpip install kademlia
https://github.com/bmuller/kademlia/blob/master/examples/webserver.tac
[18:10]
***Aranje has joined #newsgrabber [18:10]
jrwredit bootstrap line to 142.44.174.241
same port
twistd -noy webserver.tac
ls
[18:10]
HCross2 do I need an existing webserver? [18:13]
jrwrno
it spawns one
at the bottom is example usage
https://raw.githubusercontent.com/bmuller/kademlia/master/examples/webserver.tac
find kserver.bootstrap([(
and change it to 142.44.174.241
wow pretty fast
well locally
once I see a few more peers
ill start loading data
[18:13]
HCross2im bringing things on
jrwr: .82 is coming online now, and is at M247 in Manchester, UK
[18:20]
jrwrok
the node can be anywhere, behind firewalls and such
it dont care
[18:21]
HCross2so.. in my house :p
jrwr: do you see me?
[18:21]
jrwrdid you edit the bootstrap line? [18:25]
HCross2yup, and I saw 2017-09-03 19:23:27+0100 [KademliaProtocol] [INFO] got response from 142.44.174.241:8468, adding to router [18:25]
jrwrcool
cu
curl http://localhost:8080/one
what does it say
[18:26]
HCross2"hi there" [18:26]
jrwrbam
you have my data
jrwr starts to load up the ingester
inb4 we crash it
[18:27]
HCross2jrwr: can this work on Windows? [18:28]
jrwryes
go nuts
add as many clients as you want
[18:28]
HCross2do I need to open any ports
Windows fyi - https://www.microsoft.com/en-us/download/details.aspx?id=44266 is needed
[18:31]
jrwrno ports are needed
it has nat punching
curl http://127.0.0.1:8181/4870e9ba8df185bba3a3b43fc4bf9658a614785a39cd75f290a81cef7257f223
example url
to test speed
change port of course
[18:35]
HCross2works :)
the terminal is going nuts
[18:40]
jrwrlol
so
im loading it with data from the IA
on my end, its pegged a CPU
[18:43]
HCross2 20% at M247 [18:44]
jrwrwhat IP [18:44]
HCross2194.187.248.82
did .17 die?
[18:44]
jrwrno
doing science
[18:45]
HCross2now... that .82 is doing things
jrwr: try to make .53.99 cry.. youve got a whole i7 7700 to play with
[18:46]
jrwrfull IP please
and does it have 8080 open :)
on the router
[18:47]
HCross2its a dedi, give me a second [18:47]
jrwrit should
just gimmie the ip
[18:47]
HCross2http://94.130.53.99:8080 [18:48]
jrwrit does not [18:48]
HCross2yeah, its Windows
try now?
Chrome is using more cpu :p
[18:48]
jrwrnot open [18:50]
HCross2I get a page here
http://94.130.53.99:8080/4870e9ba8df185bba3a3b43fc4bf9658a614785a39cd75f290a81cef7257f223 shows me something
[18:50]
jrwrINCOMING
anyway
the more nodes you add
the better the system becomes
[18:51]
HCross2you peaked at 38% but thats overall, with a monero client and Deluge running [18:52]
jrwr its using the same kind of DHT (Kademlia) as bittorrent and IPFS
setup about 4-5 bootstrap servers
I think this could be used for warriors
ill wait for the whole index to be put in
[18:53]
HCross2Ive got Falkenstein DE, Manchester UK, and Los Angeles CA in. Ill find something in Asia [18:58]
jrwrhttp://142.44.174.241:8383/4870e9ba8df185bba3a3b43fc4bf9658a614785a39cd75f290a81cef7257f223
its damn speedy
cold cache pulled it in under 1ms
[19:01]
HCross2Singapore is online [19:02]
jrwrim going to work [19:03]
HCross2ok [19:03]
jrwrill leave this running
afk 2hr
[19:03]
HCross2jrwr: ill probably be asleep by the time you get back.. but ill leave this running [19:04]
jrwrok [19:04]
............ (idle for 58mn)
so I need to do some storage
its doing it IN-MEMORY currently
and is trash
[20:02]
HCross2My instances are low ram [20:02]
jrwrim doing some science hang on
Im going to mod the storage interface to do some leveldb science
[20:03]
***Aranje has quit IRC (Ping timeout: 245 seconds) [20:08]
jrwrgo ahead and purge your clients HCross2
im doing some science
[20:16]
HRM
this is hard
I want something that has only python depends
since its going on warriors
[20:28]
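
One possible direction for on-disk storage with only Python dependencies is the standard library's sqlite3 module. The class below is a plain get/set sketch, not the kademlia library's storage interface; SqliteKV and its method names are hypothetical.

```python
import sqlite3


class SqliteKV:
    """Tiny disk-backed key/value store using only the standard library."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def set(self, key, value):
        # "with" commits on success and rolls back on error.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
            )

    def get(self, key, default=None):
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else default


kv = SqliteKV(":memory:")  # use a file path instead to persist to disk
kv.set("4870e9ba", b"seen")
print(kv.get("4870e9ba"))
```

Whether this keeps up with the request rates discussed earlier would need measuring; it only addresses the no-extra-dependencies constraint.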
......................................... (idle for 3h23mn)
I give up
arkiver: ill need your help
[23:51]
