#newsgrabber 2018-05-31,Thu




***HCross has quit IRC (Read error: Connection reset by peer)
HCross has joined #newsgrabber
[02:44]
.... (idle for 18mn)
HCross has quit IRC (Read error: Connection reset by peer)
HCross has joined #newsgrabber
[03:02]
qw3rty119 has joined #newsgrabber [03:15]
qw3rty118 has quit IRC (Read error: Operation timed out) [03:21]
.... (idle for 17mn)
odemg has quit IRC (Ping timeout: 260 seconds) [03:38]
odemg has joined #newsgrabber [03:50]
matthusby has quit IRC (Remote host closed the connection) [04:00]
........................................................ (idle for 4h39mn)
odemg has quit IRC (Remote host closed the connection) [08:39]
.............. (idle for 1h9mn)
Jens: I think a local proxy for dedupe lookups would really speed things up. Thousands of fbcdn links in every job. [09:48]
Igloo: There is one, Jens
The dedupe runs on a CDN
09:52 < NewsDeDup> [DEDUPE STATUS] Records Added: 0 || DB Size: 47.70GB || Load: 56% || [CDN] 117.1 r/s 46% hit 93.2G BW || EU: London, UK: 17% NA: Los Angeles, CA: 4% EU: Bucharest, RO: 10% EU: Paris, FR: 16% NA: Chicago, IL: 19% EU: Frankfurt, DE: 22% NA: Dallas, TX: 6%
(Also, the dedupe DB lives on my NVMe 1Gbit server in Canada)
[10:02]
Jens: I mean local as in some Squid proxy or whatever on the local machine. [10:03]
Igloo: Oh
The DB is 47GB
Takes a fair chunk of space
Also, Note for odemg when he's next online: Likely nothing.
[10:03]
Jens: Sure, but that's everything that's ever gone through the Newsgrabber project, right? [10:04]
Igloo: Correct [10:07]
Jens: Could even put the local cache on tmpfs or whatever. It's just to catch the twitter, facebook and google nonsense that's a disturbingly large portion of every job. [10:07]
Igloo: Well, since we built it that way, at least 12 - 18 months' worth
Hmmm, good idea
Wonder if we could do it so that it queries local, if miss, queries CDN
[10:07]
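
The "query local, on a miss query the CDN" idea above could look roughly like the sketch below. It is only an illustration: the log does not show how dedupe.py talks to the CDN, so `remote_lookup` is a hypothetical stand-in for that call, and `LocalDedupeCache`, `is_duplicate` and `max_entries` are made-up names. The cache is a plain in-process LRU rather than a Squid proxy, which would keep the archived traffic itself unproxied.

```python
from collections import OrderedDict
from typing import Callable


class LocalDedupeCache:
    """In-memory LRU cache placed in front of a remote dedupe lookup.

    `remote_lookup` is a hypothetical callable that asks the CDN-backed
    dedupe service whether a key (for example a URL or payload digest)
    has already been seen. It is an assumption, not the project's API.
    """

    def __init__(self, remote_lookup: Callable[[str], bool],
                 max_entries: int = 100_000) -> None:
        self._remote_lookup = remote_lookup
        self._max_entries = max_entries
        self._seen: "OrderedDict[str, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def is_duplicate(self, key: str) -> bool:
        # Local hit: answer without any network round trip.
        if key in self._seen:
            self._seen.move_to_end(key)
            self.hits += 1
            return True

        # Local miss: fall back to the remote service.
        self.misses += 1
        duplicate = self._remote_lookup(key)

        # Only positive answers are cached. A "not seen yet" answer can
        # become stale as soon as the record is added remotely.
        if duplicate:
            self._seen[key] = True
            if len(self._seen) > self._max_entries:
                self._seen.popitem(last=False)  # evict least recently used
        return duplicate
```

Used from the dedupe step only, something like `cache = LocalDedupeCache(ask_cdn)` followed by `cache.is_duplicate(key)` would absorb the repeated fbcdn, twitter and google lookups while everything else still goes to the CDN. `ask_cdn` is again a placeholder name.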
Jens: It just seems that whenever I look at my newsgrabber process, it's busy deduping fbcdn. [10:08]
Igloo: It wouldn't add much processing time [10:08]
Jens: That's basically what Squid does. [10:08]
Igloo: But we don't want to run this via a proxy
The data needs to be "clean"
[10:08]
Jens: I looked into setting up a Squid proxy once (unrelated to newsgrabber), but it got really annoying and convoluted with HTTPS. [10:09]
Igloo: Yep, but we can't have the replies being proxied
So we would have to re-write part of the CDN script
[10:09]
Jens: You can tell Squid to only cache certain URLs.
You could run some interesting stats on the dedupe data. I'd like to see the top 100 most deduped files, e.g.
[10:11]
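
A "top 100 most deduped files" report of the kind Jens asks for could be sketched as below. The schema of the dedupe DB is not shown in this log, so the input is assumed to be a plain iterable of URLs, one entry per dedupe hit; `top_deduped` and `top_deduped_hosts` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def top_deduped(urls: Iterable[str], n: int = 100) -> List[Tuple[str, int]]:
    """Return the n URLs that were deduplicated most often, with counts."""
    return Counter(urls).most_common(n)


def top_deduped_hosts(urls: Iterable[str], n: int = 100) -> List[Tuple[str, int]]:
    """Same ranking, but grouped by hostname (e.g. to spot fbcdn hosts)."""
    hosts = (urlsplit(url).hostname or "" for url in urls)
    return Counter(hosts).most_common(n)
```

Grouping by hostname is an addition to what was asked for in the channel. It would show directly how much of each job is fbcdn, twitter and google traffic.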
Igloo: Yeah, we can't do that for the traffic to go into the Wayback Machine
HCross: ping, What stats can we get from bunny
[10:14]
Jens: What can't you do?
For proxying, just ask squid to only cache NewsGrabberDedupe.b-cdn.net.
[10:14]
Igloo: Putting each request via the proxy is something that would stop this project from being in the Wayback Machine, as the request has been man-in-the-middled
It just won't happen, so we have to be more creative
[10:23]
Jens: Make the proxy non-transparent, only call it from dedupe.py. [10:25]
.... (idle for 15mn)
HCross: Igloo: https://bunnycdn.docs.apiary.io/ [10:40]
..................... (idle for 1h41mn)
***odemg has joined #newsgrabber [12:21]
...................................... (idle for 3h7mn)
Kaz: @Igloo> Also, Note for odemg when he's next online: Likely nothing. [15:28]
........................... (idle for 2h10mn)
***visi0n has quit IRC (Remote host closed the connection) [17:38]
........................ (idle for 1h57mn)
Jens: When you download gigabytes of videos, and it gets deduplicated :( [19:35]
............ (idle for 57mn)
arkiver: I can add some local deduplication to the project [20:32]
odemg: Kaz, here briefly, sup? [20:35]
Kaz: nothing, just igloo wanted that passing on [20:35]
odemg: ahh
HCross, still discussing hardware, wanting to get the cpu/ram I listed instead of the lower core count, guy that has final say will be in the office monday but it's looking good thus far
still worth speaking on Igloo's point though, as if we go that route we can get us 200TB storage and 16TB nvme which would be welcome I'm sure
[20:36]
arkiver: odemg: amazing stuff :o [20:42]
odemg: maybe back later anywho, I'm on the road atm [20:43]
HCross: Thanks odemg [20:57]
***matthusby has joined #newsgrabber [21:07]
