#newsgrabber 2017-07-13,Thu




***Etamin has quit IRC (Ping timeout: 260 seconds) [04:18]
...................................................................................... (idle for 7h6mn)
gk_1wm_su has joined #newsgrabber
gk_1wm_su has left
[11:24]
................................................................ (idle for 5h15mn)
jrwr: arkiver: I installed ia on the box
ready when you are
[16:39]
arkiver: ah yeah, let's test now
arkiver is ssh'ing in
[16:40]
started
jrwr: data is now being added
[16:46]
jrwr: nice [16:47]
arkiver: the added items can be found in /home/ark/NewsGrabber-Deduplication-Feeder/indexed [16:50]
jrwr: cool
the script is already working on the frontend
[16:51]
arkiver: yes, it's adding entries
how is everything holding up?
this is going to take a bit of time
[16:51]
jrwr: Good
http://163.172.138.207/18ceeefe89d42e22ea26d10582d4a6b87f48ff71a7365af36ce4c2310e25c52a
Right now we are at 116k Keys
[16:52]
arkiver: hmm
sure?
that's low
or wait
it doesn't add revisit records
I'll do some testing locally
[16:52]
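The "revisit records" arkiver mentions are the WARC deduplication mechanism: a response whose payload digest has been seen before is written as a small revisit record instead of a full copy, which is why the key count looked low without them. A minimal sketch of the digest used for that lookup (base32-encoded SHA-1, the format carried in `WARC-Payload-Digest` headers):

```python
import base64
import hashlib

def payload_digest(payload: bytes) -> str:
    """WARC-style payload digest: base32-encoded SHA-1, the value used
    to decide whether a record can be written as a lightweight revisit
    record instead of a full response."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
```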
jrwr: 167k now [16:53]
arkiver: it's looking good
will add this to the warrior project now
[16:58]
jrwr: ok [16:58]
arkiver: jrwr: are you fine with me hardcoding 163.172.138.207 in the script? [16:59]
jrwr: Ya
Oh nuts
[16:59]
http://163.172.138.207/status [17:06]
***trvz has joined #newsgrabber [17:13]
arkiver: it's working great in my test with one URL [17:13]
jrwr: Good [17:14]
trvz: 124143 done + 19297 out + 1117099 to do
those numbers are correct?
[17:14]
jrwr: I would add a small cache on the pipeline
keep a few hot keys in memory
[17:15]
arkiver: what hot keys do you mean?
ah, sorry
I see
URLs often requested
[17:15]
jrwr: if it's within the same job, it shouldn't pull more than once from the dedupe
so it doesn't have to peg every single time for /favicon
[17:15]
HCross2: Would hot keys be stuff like the BBC logo etc? [17:16]
jrwr: :)
ya
Mostly just a small local cache
[17:16]
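The hot-key cache jrwr describes is plain memoization in the pipeline: within one job, each key is looked up against the dedupe server at most once, so repeated assets like /favicon or the BBC logo don't hit it every time. A hypothetical sketch (class and the backend callable are illustrative, not the real NewsGrabber code):

```python
class CachedDedupeClient:
    """Memoizing wrapper around a dedupe lookup: within one job, each
    key reaches the backend (i.e. the dedupe server) at most once."""

    def __init__(self, backend):
        self.backend = backend  # callable performing the real lookup
        self.calls = 0          # how often we actually hit the backend
        self._cache = {}

    def lookup(self, key):
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self.backend(key)
        return self._cache[key]
```

A production version would bound the cache (e.g. `functools.lru_cache` or an LRU dict) so long-running jobs don't grow it without limit.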
arkiver: I see, yes that would be good to do
I first want to get this new version started, and then look into local caching for the warrior
if that's fine with you
jrwr: is it correct that the requests number on http://163.172.138.207/status shows the number of times http://163.172.138.207/status is requested?
[17:17]
jrwr: ya
since it's "totals"
[17:20]
arkiver: ok
cool
[17:20]
.... (idle for 18mn)
scripts are updated!!!
so exciting :D
HCross2: Do you think we should start fresh?
since we have a very large backlog
now that we should be faster, might be nice to start with 0 to do items again
[17:38]
HCross2: I don't see why not [19:40]
arkiver: clearing items now
all gone
newsbuddy being restarted
[17:40]
***newsbuddy has joined #newsgrabber [17:43]
newsbuddy: Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [17:43]
Kaz: uh
ooh
[17:44]
***newsbuddy has quit IRC (Remote host closed the connection) [17:44]
arkiver: sorry, have to clean something up [17:44]
***newsbuddy has joined #newsgrabber [17:50]
newsbuddy: Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [17:50]
arkiver: there we go [17:50]
***newsbuddy has quit IRC (Remote host closed the connection) [17:56]
arkiver: Also increasing this to 50 URLs/item
actually no, current average item size is already pretty high
we might move to 50 later on if needed
[17:56]
***newsbuddy has joined #newsgrabber [17:57]
newsbuddy: Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [17:57]
jrwr: afk 2h [18:00]
arkiver: ok
everything should be running fine
[18:00]
HCross2: arkiver: so it's about an hour and a bit until we see results now [18:04]
arkiver: yes
exciting :D
[18:04]
trvz: on the warrior page, search for "&& install -y" - should be "&& apt-get install -y"
*warrior github page
[18:06]
arkiver: thanks, fixed [18:07]
........... (idle for 53mn)
started :D [19:00]
HCross2: Turning things on now
arkiver: think we can run this as default project?
[19:04]
arkiver: yes, it's the default now [19:12]
trvz: what's a good concurrency for E3's with HT? [19:14]
HCross2: trvz: try 20 and see what CPU performance looks like [19:15]
jrwr: Ah, here comes the storm
look at dem stats
[19:15]
HCross2: It'll use a lot of CPU initially as everything will be grabbing at once, but it'll settle into a rhythm [19:15]
trvz: I've got a few at Online.net, that network fine with you? [19:15]
HCross2: Sure
It's where I am
All you really need is a moderately good throughput to OVH
[19:16]
jrwr: trvz: dedupe is at online.net paris
so
[19:16]
HCross2: jrwr: my rsync ingest is "having fun" [19:16]
jrwr: Bwhahaha [19:16]
jrwr: What kind of "Load" [19:17]
trvz: so I need a backbone that isn't congested to peers who offer free peering [19:17]
HCross2: jrwr: 0.11 load now.. but wait until we megawarc and upload [19:18]
jrwr: Ya
Thats what killed my EG-16
Dedupe is handling it well
[19:18]
HCross2: arkiver: we've got 7.7tb of something on this server [19:18]
jrwr: the cache is already warm! [19:18]
arkiver: HCross2: hmm, strange
maybe old projects
[19:18]
trvz: No HTTP response received from tracker. The tracker is probably overloaded. Retrying after 60 seconds... [19:19]
arkiver: haha, we're doing a few hundred URLs/second [19:19]
trvz: that kcaj kid is bad news [19:20]
HCross2: arkiver: trying to track down the disk usage now [19:22]
arkiver: ok [19:24]
jrwr: dedupe server is doing amazing http://163.172.138.207/status [19:24]
HCross2: Upload just started spewing hundreds of Mbps towards the Archives [19:24]
arkiver: jrwr: it's crazy fast [19:26]
jrwr: Bwhahahha
I made fast things
It's what I do
[19:26]
arkiver: yes :) [19:28]
jrwr: oh man, traffic just doubled! [19:28]
Aoede: weee [19:28]
HCross2: arkiver: we've still got over 3 TB of Panoramio [19:28]
arkiver: really... [19:28]
HCross2: Yep [19:28]
arkiver: I'll get that uploaded now [19:29]
trvz: why did the items to do drop to 30k? [19:29]
HCross2: And a fair bit of FTP government [19:29]
jrwr: trvz: requeueing data
we have a new dedupe we are using, and new code
[19:29]
arkiver: oh and we removed old items
wanted to start clean
[19:29]
trvz: but it'll go up again? [19:30]
HCross2: arkiver: fair bit of Google Code in /var/www
And Photosynth
[19:30]
arkiver: nice [19:31]
jrwr: I will admit, the disks at Scaleway do have really high IOPs [19:31]
HCross2: jrwr: I've got some nvme stuff in London that may be better [19:32]
jrwr: We are at 1.01 load
handling all requests at about 1ms
nginx is spending more time writing to the TCP stack than reading from disk/ram
[19:32]
HCross2: We need a custom TCP stack now :p [19:33]
jrwr: arkiver wasn't kidding on the number of keys, I'm glad I tested with 100 million [19:33]
arkiver: how many do we have now?
HCross2: I moved default project to somewhere else again. I tried it myself and the warrior didn't run it
[19:34]
jrwr: 2171742 [19:35]
arkiver: sounds good [19:35]
jrwr: Oh, your values are larger than my test set
that's fine, I'll monitor
46GB of swap should do the trick :)
[19:36]
arkiver: :D [19:36]
HCross2: Very soon we'll have a Google-sized datacentre.. just for this [19:36]
arkiver: haha
everything is kept in redis for a month
if not requested in that month, it is removed again from redis
else it stays for another month starting from the last request for the data
[19:37]
jrwr: It's last update
not last request
[19:38]
trvz: "you won't believe how big a percentage of our archived news articles are plain clickbait titles that should be purged forever" [19:38]
jrwr: Well, I'm pretty sure [19:38]
trvz: titles->content [19:38]
arkiver: jrwr: this should set expiration to another month, right? https://github.com/ArchiveTeam/NewsGrabber-Deduplication-Feeder/blob/master/indexer.py#L75-L77
expire date*
[19:39]
jrwr: ya [19:39]
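The policy arkiver and jrwr settle on above — keys live for a month, and every update (not every read) pushes the expiry out another month, as the linked indexer.py does with a redis EXPIRE — can be sketched as a toy in-memory model (illustrative only, not the real redis-backed code; the clock parameter exists just to make it testable):

```python
import time

ONE_MONTH = 30 * 24 * 3600  # seconds

class SlidingExpiry:
    """Sliding-window expiry: each write refreshes the key's TTL, so a
    key survives as long as it keeps being updated; a key not updated
    within the TTL falls out. Reads do NOT refresh the TTL."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl=ONE_MONTH):
        # mirrors redis SET followed by EXPIRE: TTL restarts on update
        self.store[key] = (value, self.clock() + ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[key]  # lazily drop expired keys
            return None
        return value
```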
Hecatz: this new deduplication is pretty cool :) [19:50]
arkiver: Hecatz: yes :) [19:54]
jrwr: http://163.172.138.207/stats.php
top number is number of keys
the rest is normal redis stats
[19:55]
........ (idle for 35mn)
arkiver: jrwr: can we increase max file size for https://wiki.newsbuddy.net/Special:Upload to 40 MB? [20:30]
jrwr: Ya
Give me a moment
[20:31]
..... (idle for 22mn)
arkiver: hmm, I created an account and checked the box to send a random password
never got a mail
was just an account for the screencapture bot
[20:53]
....................... (idle for 1h54mn)
trvz: HCross2: stop stealing my internet points [22:47]
...... (idle for 28mn)
arkiver: there's room for plenty more warriors
not much stealing happening at the moment :P
[23:15]
Kaz: ah yes
my OVH box is already screaming at me, perfect
[23:21]
