#newsgrabber 2017-06-23,Fri

Who What When
arkiverjrwr: it might be good to delete the cache after some time
since many static URLs will be grabbed many times by this project
they end up in the wayback machine
and then are deduplicated through the CDX API in future WARCs
problem is that popular URLs will never be refreshed
jrwr: would it be possible to set that pages with certain content will automatically be removed after 48 hours
while pages with not that specific content will stay with what is done now
(not removed if very popular)
[00:30]
..... (idle for 20mn)
jrwrarkiver: its 1D
everything expires after 1D
[00:54]
inactives get purged every day as well [00:59]
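The expiry behaviour described above (every entry expires after one day, and stale entries are purged daily) can be sketched roughly as below. This is only an illustration and not jrwr's actual cache configuration: the `TTLCache` class, its method names, and the injectable `clock` are all hypothetical. The expiry time is fixed when an entry is stored, so even a very popular URL is refetched once its day is up.

```python
import time

DAY = 24 * 60 * 60  # one day in seconds, the "1D" expiry mentioned in the log


class TTLCache:
    """Hypothetical in-memory cache where every entry expires after a fixed TTL."""

    def __init__(self, ttl_seconds=DAY, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._store = {}            # url -> (stored_at, body)

    def put(self, url, body):
        # The timestamp is set once, at store time; reads do not extend it.
        self._store[url] = (self.clock(), body)

    def get(self, url):
        entry = self._store.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if self.clock() - stored_at >= self.ttl:
            # Expired: drop it so the caller fetches a fresh copy.
            del self._store[url]
            return None
        return body

    def purge(self):
        """Remove all expired entries (the daily clean-up); returns how many were removed."""
        now = self.clock()
        expired = [u for u, (t, _) in self._store.items() if now - t >= self.ttl]
        for u in expired:
            del self._store[u]
        return len(expired)
```

A per-content lifetime such as the 48 hours arkiver suggests could be added by passing a different TTL for each entry to `put`; that variant is not shown here.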
.............. (idle for 1h5mn)
***Aranje has joined #newsgrabber [02:04]
....... (idle for 33mn)
Crusher has quit IRC (Bye)
Crusher has joined #newsgrabber
[02:37]
Crusher has quit IRC (Read error: Connection reset by peer)
Crusher has joined #newsgrabber
Crusher has quit IRC (Client Quit)
Crusher has joined #newsgrabber
[02:51]
CrusherWhich file holds the config for the webserver port? [02:55]
I'm trying to run newsgrabber on the same machine as what's currently running imzy
And it won't run because the port is already in use
[03:07]
................... (idle for 1h32mn)
***Crusher_ has joined #newsgrabber
Crusher has quit IRC (Read error: Connection reset by peer)
[04:39]
....... (idle for 33mn)
Crusher has joined #newsgrabber
Crusher_ has quit IRC (Read error: Connection reset by peer)
[05:12]
Crusherjrwr : as soon as imzy is done (sometime tomorrow at this rate), I'll give'er on the newsgrabber should be interesting to see what effect it has on your cache [05:18]
......... (idle for 43mn)
KazCrusher: --disable-web-server, or --port abcd [06:01]
HCross2We're at 98% disk usage on my side
That means 12tb
[06:14]
.......... (idle for 45mn)
218 GB left
arkiver: ^
[07:00]
Kazpause grabbing? [07:09]
............. (idle for 1h2mn)
***blitzed has quit IRC (hub.efnet.us hub.dk)
joepie91 has quit IRC (hub.efnet.us hub.dk)
chfoo has quit IRC (hub.efnet.us hub.dk)
underscor has quit IRC (hub.efnet.us hub.dk)
[08:11]
Kaz!server-status newsbuddy
!server-stats newsbuddy
[08:12]
newsbuddyKaz: Getting server stats... [08:12]
Kazthats the one [08:12]
newsbuddyKaz: CPU usage percent: total 6.0 - user 5.4 - nice 0.0 - system 0.4 - idle 94.1.
Kaz: Virtual memory usage: total 16778424320 - percent 24.0.
Kaz: Disk usage: total 11906045009920 - percent 99.1.
[08:12]
Kazha [08:12]
.................. (idle for 1h26mn)
HCross2arkiver: I need to reboot master at some point soon [09:38]
Kaz!server-stats newsbuddy [09:45]
newsbuddyKaz: Getting server stats...
Kaz: CPU usage percent: total 5.8 - user 5.9 - nice 0.0 - system 0.4 - idle 93.5.
Kaz: Virtual memory usage: total 16778424320 - percent 25.3.
Kaz: Disk usage: total 11906045009920 - percent 99.1.
[09:45]
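The bot reports the disk total in bytes and usage as a percentage, so the remaining space has to be worked out by hand. A small sketch of that arithmetic follows; the function name is hypothetical, and since the percentage is rounded to one decimal place the result is only accurate to within a few GB.

```python
def free_bytes(total_bytes, percent_used):
    """Approximate free space from a total size and a rounded 'percent used' figure."""
    return total_bytes * (100.0 - percent_used) / 100.0


TOTAL = 11906045009920  # bytes, as reported by !server-stats above

free = free_bytes(TOTAL, 99.1)
print(f"total: {TOTAL / 1e12:.1f} TB")      # decimal terabytes
print(f"free:  about {free / 1e9:.0f} GB")  # decimal gigabytes
```

At 99.1% used this comes to roughly 107 GB free out of about 11.9 TB.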
.......................... (idle for 2h6mn)
***chfoo has joined #newsgrabber
joepie91 has joined #newsgrabber
blitzed has joined #newsgrabber
underscor has joined #newsgrabber
hub.dk sets mode: +o chfoo
[11:51]
..... (idle for 21mn)
Crusher_ has joined #newsgrabber [12:12]
Crusher has quit IRC (Read error: Operation timed out) [12:17]
...... (idle for 29mn)
HCross2Kaz: you around? [12:46]
***Crusher has joined #newsgrabber
Crusher_ has quit IRC (Ping timeout: 246 seconds)
[12:58]
HCross2Can any of the tracker admins get to the tracker?
arkiver: hm. We're not uploading for some reason
The packer wasn't running
Can you take a look when you get a second please. We're full and not uploading
[13:06]
arkiver: hmm we're not putting anything into the megawarc incoming folder [13:16]
arkiverhi
jus got up
just*
huh strange
taking a look at that now
[13:17]
HCross2Good morning :p [13:17]
arkiverI do know the packer wasn't running, but WARCs should have been moved to /megawarc/incoming dir in NewsBuddy [13:17]
..... (idle for 22mn)
***newsbuddy has quit IRC (Remote host closed the connection) [13:39]
arkiverfixed the WARCs sorting thing [13:39]
***newsbuddy has joined #newsgrabber [13:40]
newsbuddyHello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [13:40]
..... (idle for 20mn)
KazI can get on the tracker [14:00]
arkivermore space is being freed
stuff is being uploaded
!server-stats newsbuddy\
!server-stats newsbuddy
[14:01]
newsbuddyarkiver: Getting server stats...
arkiver: CPU usage percent: total 29.5 - user 16.6 - nice 0.0 - system 3.3 - idle 68.9.
arkiver: Virtual memory usage: total 16778424320 - percent 19.2.
arkiver: Disk usage: total 11906045009920 - percent 98.6.
[14:01]
HCross2Shall we wait for it to clear down more before we add more? [14:02]
arkiverwell more is constantly being added
maybe wait until we are back to 500+ GB free
[14:03]
HCross2Yep, and until we've started pushing warcs over [14:10]
arkiver82100 MB WARC is being uploaded now [14:23]
Kaznewsbuddy:warrior-videos_143_1498108580.24 newsbuddy:warrior-videos_294_1498013023.02
can someone check that these lists are actually different please - seems really weird that they're 1mb different
[14:28]
arkiverunfortunately not
they've been processed already and lists removed
[14:30]
Kazah okay [14:31]
HCross2arkiver: it's uploading at around 20Mbps [14:39]
arkiveryeah :/
starting more uploads
[14:39]
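For a sense of scale, the time needed to push the 82100 MB WARC at around 20 Mbps can be estimated as below. This is a back-of-the-envelope sketch: it assumes MB means 10^6 bytes and Mbps means 10^6 bits per second, and the `streams` parameter assumes parallel uploads scale linearly, which the log does not confirm.

```python
def upload_hours(size_mb, rate_mbps, streams=1):
    """Estimated hours to upload size_mb megabytes at rate_mbps per stream."""
    megabits = size_mb * 8                     # 8 bits per byte
    seconds = megabits / (rate_mbps * streams)
    return seconds / 3600


single = upload_hours(82100, 20)
print(f"one stream at 20 Mbps: {single:.1f} h")

# Hypothetical: four parallel uploads, assuming perfectly linear scaling.
parallel = upload_hours(82100, 20, streams=4)
print(f"four streams at 20 Mbps each: {parallel:.1f} h")
```

A single stream at that rate would need a little over nine hours, which is why starting more uploads in parallel matters here.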
HCross2The IA have never been good at single threaded.. especially long distance [14:44]
arkiver: hmm. Could we upload our URL lists to the Archives too? [14:53]
arkiveryes [14:54]
........................ (idle for 1h56mn)
CrusherWhat CPUs is the IA using? [16:50]
...... (idle for 27mn)
Kaz!server-stats newsbuddy [17:17]
newsbuddyKaz: Getting server stats...
Kaz: CPU usage percent: total 35.0 - user 13.5 - nice 0.0 - system 3.0 - idle 63.0.
Kaz: Virtual memory usage: total 16778424320 - percent 18.8.
Kaz: Disk usage: total 11906045009920 - percent 98.7.
[17:17]
Kazoh [17:17]
CrusherThat doesn't help :P [17:19]
arkiverI'm creating some articles on wiki.newsbuddy.net
mostly basic articles about how this works, what IA is and does, etc.
[17:33]
....... (idle for 31mn)
Wrote some basic pages
Going to work on an article on how to add a service and the different variables involved
[18:04]
CrusherSo what's this one for? [18:06]
arkiver=====================================
Official NewsGrabber wiki!
https://wiki.newsbuddy.net/Main_Page
=====================================
Have a look at it
if something is not clear, let me know
[18:06]
CrusherWill do. [18:06]
arkiverI'm still improving the wiki to make everything more clear [18:06]
CrusherBtw, the only way to browse currently is "random Page"
xD
[18:07]
arkiverkind of
https://wiki.newsbuddy.net/NewsGrabber has some links
[18:07]
..... (idle for 21mn)
***Crusher_ has joined #newsgrabber
Crusher has quit IRC (Ping timeout: 246 seconds)
[18:28]
HCross2arkiver: I finally upgraded my phone deal.. so I should no longer have a blocked archive.org :) [18:44]
...... (idle for 29mn)
We're uploading at 200-400Mbit [19:13]
make that 700Mbps [19:20]
***Crusher has joined #newsgrabber
Crusher_ has quit IRC (Read error: Connection reset by peer)
Crusher has quit IRC (Read error: Connection reset by peer)
[19:33]
.... (idle for 16mn)
HCross2We're over 400GB free now. Turning this all back on [19:53]
.... (idle for 16mn)
KazHCross2: can you see dedupe stats? [20:09]
HCross2Ill log in and take a look... was just turning 100 concurrent on [20:09]
Kazyeah, I was going to do the same thing
I'm at 9 concurrent atm
[20:10]
HCross2im doing 50x2
39 load lol
[20:15]
Kazjust jumped to 60x1
lets see.
disco process is using a comfy 8gb of ram alone
[20:17]
HCross2same kind of thing here
Kaz: I should have staggered the start of my processes so that there is always something grabbing and something deduping
[20:17]
Kazafter a while it'll all smooth itself out [20:22]
HCross2ah yea, I suppose it'll find its own rhythm [20:22]
arkivereveryone feel free to create an account at https://wiki.newsbuddy.net/Main_Page
and help build the wiki
:D
[20:23]
HCross2ah yea Kaz - its found its beat now and is always downloading something [20:24]
Kazhttps://jrwr.io/nginx_status
i think all of mine are deduping, lmao
[20:25]
HCross2we are doing well over 40-50k urls an hour
in disco.. not sure how in grab as we have a huge backlog
[20:26]
arkiverlooks like the first derived well https://archive.org/details/archiveteam_newssites_20170623154143 [20:26]
HCross2200 concurrent = 10GB RAM usage [20:29]
..... (idle for 24mn)
jrwrWaiting: 234
WHAT DID YOU DO HCROSS
:p
or was it Kaz
I blame Kaz
[20:53]
Kazbugger
his concurrency is higher!
[20:54]
jrwrI've never seen nginx pegged this hard
im at 10% CPU Usage!
https://www.youtube.com/watch?v=sT1bp9ujQEE
PUSH IT TO THE LIMIT
I would say about 30% of the requests are cached
[20:54]
HCross2jrwr: is it an online.net IP? [20:58]
jrwrits a Scaleway instance
http://jrwr.io is the host
[20:58]
HCross2I meant the IP abusing you :p [20:59]
jrwrOH
No idea
wwwb-front3.us.ar
[20:59]
HCross2arkiver: you aware we're using a third of the entire dedupe capacity? [21:00]
jrwrOh
So
Thats how many Keep-Alives I have open with IA
about 100 of them
I've got a TON of wwwb-front3.us.ar:https TIME_WAIT
Right now we are doing 390 Req/s
Thats inbound to me
[21:00]
Kazsmall fire starting in online's datacenter then [21:05]
jrwrThat's more than the 200 Req/s I was getting for being on the number one post on reddit for 27 hours
Requests going to IA are hovering at about 150/s
Anyway, I can take it, Im at only 15% BW
Poor IA doing all the processing on the other hand
[21:05]
JAAMFW MediaWiki doesn't offer an option to have times always displayed in UTC. [21:12]
jrwrYou can
Its a LocalSettings Option
but thats a Admin thing
[21:16]
JAAYeah, I meant in the user preferences panel. [21:17]
jrwr@ANSI 6360300S1330090180101DDB12102012ZMZMAYZMBNZMCNZM [21:17]
.................. (idle for 1h26mn)
***Crusher_ has joined #newsgrabber [22:43]
........ (idle for 39mn)
KazHCross2: lots of failed jobs from you - something up with your youtube-dl or something? haven't seen any particularly big items come in [23:22]
