#newsgrabber 2017-08-28,Mon

***Fletcher- is now known as Fletcher_ [02:56]
....... (idle for 31mn)
Fletcher is now known as Fletcher-
Fletcher_ is now known as Fletcher
[03:27]
.... (idle for 19mn)
Fletcherokay, grabber running at 20 concurrent [03:46]
..... (idle for 23mn)
***HarryCros has quit IRC (Read error: Connection reset by peer)
HarryCros has joined #newsgrabber
[04:09]
........................................ (idle for 3h15mn)
HCross2Fletcher: you can probably run 100 concurrent on that thing. Use lots of instances of 2 concurrent [07:24]
..... (idle for 20mn)
FletcherHCross2, do I need to run them from different directories or can I just up+enter a bunch of times? [07:44]
HCross2just do --port and change the port
Fletcher:
for i in {8000..8060}; do screen -dm su -c "cd /home/archiveteam/NewsGrabber-Warrior/; run-pipeline pipeline.py --concurrent 2 --port $i --address '185.206.224.134' HCross" archiveteam; sleep 2; done
thats what I do
[07:44]
Fletcherisn't the port just for the webserver? [07:48]
HCross2yeah [07:48]
Fletchernever bothered running that anyway :P [07:48]
HCross2ahh, I do just to keep an eye on it [07:49]
Fletcherbeen a while since I've run the warrior scripts at scale >_> [07:50]
HCross2I find that staggering the launches of the warrior instances ensures there is always something grabbing, so it evens out the load on the hardware
otherwise you get huge disk spikes when all 120 instances are writing to disk at once
[07:51]
Fletcherhow long should I expect a cycle to last? (how far apart should I space the groups)
Though it shouldn't be too bad with the nvme drives
[07:52]
HCross2id say 4-5 minutes per "cycle" but I throw down new instances every 10 seconds
that gives it a good stagger
as in my case I had 1 instance launching every 10 seconds, and 60 instances to launch.. so it was about 10 minutes of "ramp up" which was perfect
[07:53]
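The stagger arithmetic HCross2 describes can be sketched in Python. This is only an illustration: `launch_schedule` and `ramp_up_seconds` are hypothetical names, not part of the NewsGrabber warrior scripts, and the starting port 8000 is simply borrowed from the loop quoted earlier. It assumes one instance per port and a fixed gap between launches.

```python
def launch_schedule(first_port, count, interval_s):
    """Return (port, start_offset_seconds) pairs for a staggered launch.

    Hypothetical helper for illustration: one instance starts every
    `interval_s` seconds, each with its own web-UI port.
    """
    return [(first_port + i, i * interval_s) for i in range(count)]


def ramp_up_seconds(count, interval_s):
    """Seconds from the first launch until the last instance has started."""
    return (count - 1) * interval_s if count > 0 else 0


# 60 instances, one every 10 seconds, as in the chat.
schedule = launch_schedule(first_port=8000, count=60, interval_s=10)
print(schedule[0], schedule[-1])
print(ramp_up_seconds(60, 10) / 60, "minutes of ramp-up")
```

With these numbers the last instance starts 590 seconds after the first, which is just under the "about 10 minutes" of ramp-up mentioned above. Since a cycle is estimated at 4-5 minutes, the launches end up spread across more than one full cycle, which is what keeps the disk writes from lining up.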
FletcherFletcher nods [07:54]
HCross2initially you'll have huge loads and high resource usage but it'll settle back down after a bit
it will settle into a rhythm
[07:55]
Fletcherall good, the server isn't going to be doing anything else :) [07:56]
HCross2haha, ive had some "fun" messages from VPS providers [07:57]
Fletcheryeah I can imagine >_> [07:58]
HCross2M2 "what the hell are you doing Harry" 7
M24 "what the hell are you doing Harry" 7
[07:59]
........... (idle for 54mn)
Fletcher10:53:22 up 5:26, 1 user, load average: 66.67, 54.45, 30.45
lol
[08:53]
............................................................... (idle for 5h11mn)
arkiverwe're making good progress :D
I raised the limit to 200 items/min
I'm going to put a hard limit on the number of URLs that are archived
in case of loops
and I'm thinking of making the max time to download a URL 4 hours
what do you think?
[14:04]
.... (idle for 17mn)
HCross2that would work, is that the "number of urls per item"? [14:22]
arkiverno, the maximum number of URLs downloaded by wpull for an item
including redirects, for example
[14:22]
HCross2ahh yep, that would be nice [14:24]
jrwrYa
Dedupe can handle a ton more traffic
BRING IT
[14:24]
HCross2in terms of the network issues on master.newsbuddy.net - things seem to be calm now.. but I am keeping an eye on it [14:25]
jrwrsomeone is getting me a full dedicated
its one like I can now
we could use it as a rsync target
also arkiver I gave you that list of untrusted rsync targets you can use as well
[14:26]
HCross2jrwr: my issue is "the switch in front of master.* at OVH cant cope with us at full whack" [14:27]
jrwrHA [14:28]
HCross2it's passing packets now.. but it's having to deprioritize ICMP to hell and back.. [14:28]
jrwrhttps://www.youtube.com/watch?v=ygBP7MtT3Ac
what kind of OVH box is it
SYS, OVH, KMS
[14:28]
arkiverjrwr: yes [14:29]
HCross2jrwr: OVH FS-12T
But with a D series CPU
[14:29]
jrwrNot bad
We have access to about 5 feral seedboxes
have a relay of sorts to take ingress there and then upload to master in a UDP stream
OH
OH
I wonder if there is a "rsync" over udp
in theory we could do this with pure python and some foo
Split file into chunks, checksum the chunks, send {metadata}{crc}{fileblock1}{crc}{fileblock2}{crc}
[14:29]
HCross2whoever just sent in around 800Mbps... you just murdered the switch I think [14:34]
jrwrhahah [14:34]
arkiverUDP might be dangerous here? [14:34]
HCross2I just watched it go from responding to 1 in 10 packets to flat out not responding at all [14:34]
arkiverwe can't have corrupted WARC files
yeah we could check of course
[14:34]
jrwrIf you split them into small chunks and checksum them [14:35]
HCross2aannd the file transfer stopped.. and the switch responds like normal now [14:35]
jrwrlike 5MB Chunks with SHA1 hash [14:36]
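A minimal Python sketch of the chunk-and-checksum idea jrwr is describing, assuming 5 MB blocks and SHA-1 as suggested in the chat. It covers only the split, verify and reassemble steps; the UDP transport itself is not shown, and the function names (`iter_chunks`, `verify_chunk`, `reassemble`) are hypothetical, not taken from any existing tool.

```python
import hashlib
import io

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the chunk size floated in the chat


def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield (index, sha1_hex, data) for each block read from a binary stream."""
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield index, hashlib.sha1(data).hexdigest(), data
        index += 1


def verify_chunk(sha1_hex, data):
    """Receiver-side check: True if the block matches its checksum."""
    return hashlib.sha1(data).hexdigest() == sha1_hex


def reassemble(chunks):
    """Rebuild the payload from blocks received in any order.

    Raises ValueError if a block fails its SHA-1 check.
    """
    out = bytearray()
    for index, digest, data in sorted(chunks, key=lambda c: c[0]):
        if not verify_chunk(digest, data):
            raise ValueError(f"chunk {index} failed its SHA-1 check")
        out += data
    return bytes(out)


# Small demo with a tiny chunk size so it runs instantly.
payload = b"WARC/1.0\r\n" * 1000
chunks = list(iter_chunks(io.BytesIO(payload), chunk_size=4096))
restored = reassemble(reversed(chunks))  # simulate out-of-order arrival
print(restored == payload)
```

This only catches corrupted blocks. A real transfer over UDP would also need the total block count in the metadata so the receiver can notice lost datagrams and ask for them again, which is the part that addresses arkiver's concern about corrupted WARC files.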
HCross2thats not the issue.. the switch is duff [14:36]
arkiveryeah [14:36]
jrwrWe need to split up the load [14:36]
arkiveryeah
it looks like at this speed we are getting rid of more items than we are getting
[14:37]
HCross2hallelujah [14:37]
midasHCross2: if it was an OVH box, I've had the DDOS protection kicking in during a busy day on one of my servers. [14:42]
HCross2hmm ill keep an eye - I dont get any under attack emails [14:42]
jrwrThat might be whats happening
you should install munin so I can look at pretty graphs :)
[14:42]
HCross2if I leave an MTR running, I should see it go into VAC then [14:43]
trvzfrom another server, run a mtr report every minute, save to text file, grep over those to look for vac? [14:49]
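The "grep over those" step trvz suggests could look roughly like this in Python. It is a sketch under assumptions: the reports are taken to be plain text saved once a minute by some other job, a mitigation hop is assumed to show "vac" in its hostname, and the two sample reports and their `example.net` hostnames are invented for illustration, not real mtr output.

```python
import re

# Assumption: a hop inside the mitigation path has "vac" in its hostname.
VAC_PATTERN = re.compile(r"vac", re.IGNORECASE)


def hops_matching(report_text, pattern=VAC_PATTERN):
    """Return the lines of one saved report that match the pattern."""
    return [line.strip() for line in report_text.splitlines()
            if pattern.search(line)]


def reports_with_vac(reports):
    """Map report name -> matching hop lines, keeping only reports with a hit.

    `reports` is a dict of {name: report_text}, e.g. one entry per minute.
    """
    hits = {}
    for name, text in reports.items():
        matched = hops_matching(text)
        if matched:
            hits[name] = matched
    return hits


# Invented sample data, shaped loosely like a per-hop report.
SAMPLE_CLEAN = """\
  1.|-- gw.example.net        0.0%    10    0.4
  2.|-- core1.example.net     0.0%    10    1.2
"""
SAMPLE_VAC = """\
  1.|-- gw.example.net        0.0%    10    0.4
  2.|-- vac1.example.net      0.0%    10    3.9
"""

flagged = reports_with_vac({"15:00.txt": SAMPLE_CLEAN, "15:01.txt": SAMPLE_VAC})
print(sorted(flagged))
```

Keeping the report name (here a timestamp) as the dictionary key makes it easy to line up the minutes that were routed through mitigation against the times the switch stopped responding.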
....... (idle for 30mn)
***HCross has joined #newsgrabber
HarryCros has quit IRC (Ping timeout: 268 seconds)
[15:19]
...... (idle for 28mn)
kyan has joined #newsgrabber [15:50]
.......................... (idle for 2h9mn)
kyan has quit IRC (Remote host closed the connection) [17:59]
........................................... (idle for 3h33mn)
kyan has joined #newsgrabber
Wurstsala has joined #newsgrabber
[21:32]
WurstsalaHey guys, I am getting these errors about 50-60% of the time when running the script. It doesn't stop, it just pushes out the errors for 5 minutes and then finds something new: https://pastebin.com/81PxrWM6 [21:37]
..... (idle for 24mn)
***Wurstsala has quit IRC (Ping timeout: 268 seconds) [22:01]
