#newsgrabber 2017-06-24,Sat



Who | What | When
Crusher_ | Oh, btw: the warrior doesn't have a hard-depend on youtube-dl
You might want to enforce that
[00:57]
It let me run it before I realized it wasn't installed [01:08]
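Editorial sketch, not part of the channel log: one way a grabber script could enforce the youtube-dl dependency before starting work. The function name `require_tool` and its `which` parameter are hypothetical; the log does not say how the warrior actually checks its dependencies.

```python
import shutil


def require_tool(name, which=shutil.which):
    """Return the path of an external tool, or raise if it is not on PATH.

    `which` is injectable only so the check can be exercised without the
    tool being installed; by default it is the standard shutil.which.
    """
    path = which(name)
    if path is None:
        raise RuntimeError(f"{name} is required but was not found on PATH")
    return path
```

Calling `require_tool("youtube-dl")` at startup would stop the run immediately instead of letting it proceed without the tool.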
jrwr | Also
you should have it do arch detection
it did not like my ARM CPU Scaleway
[01:15]
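Editorial sketch, not part of the channel log: a minimal architecture check of the kind jrwr suggests. The allow-list `SUPPORTED_ARCHES` is an assumption made for illustration; the log does not state which CPU architectures the grabber supports.

```python
import platform

# Hypothetical allow-list; the real set of supported CPUs is not given in the log.
SUPPORTED_ARCHES = {"x86_64", "amd64"}


def check_arch(machine=None, supported=SUPPORTED_ARCHES):
    """Return the lower-cased machine name, or raise on an unsupported CPU."""
    machine = (machine or platform.machine()).lower()
    if machine not in supported:
        raise RuntimeError(f"unsupported CPU architecture: {machine}")
    return machine
```

On an ARM host, where `platform.machine()` typically reports something like `armv7l` or `aarch64`, this would fail early with a clear message.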
*** Crusher_ has quit IRC (Ping timeout: 246 seconds)
Crusher_ has joined #newsgrabber
[01:23]
.......... (idle for 49mn)
Crusher_ has quit IRC (Read error: Connection reset by peer)
Crusher_ has joined #newsgrabber
[02:15]
.................... (idle for 1h36mn)
HCross2 | Kaz: hm thanks I'll have a look [03:52]
..... (idle for 20mn)
*** kyan has joined #newsgrabber [04:12]
........ (idle for 37mn)
Aranje has quit IRC (Ping timeout: 506 seconds) [04:49]
................................. (idle for 2h42mn)
Crusher_ | Imzy's dead
Their site is timing out
We got most of it, so that's something.
Kaz: he's dead Jim
[07:31]
....... (idle for 34mn)
Oh, wrong tab, this is newsgrabber :P [08:10]
....................... (idle for 1h51mn)
underscor | is it just me or does dedupe feel a lot slower now?
my clients seem to be taking 3-7s per dedup request
[10:01]
..... (idle for 23mn)
Kaz | yep, it's definitely slower
i think due to the load we're putting on it
https://jrwr.io/nginx_status
[10:24]
HCross2 | hm, we need a way of deduping very quickly
I dont think the CDX API was designed for tons of requests, from the other side of the world
I did think of asking for a dump... but does anyone want to host 600 billion URL records for me?
[10:35]
Kaz: weve even broken the deduplication cluster charts [10:43]
Kaz | that's something of an achievement
breaking things seems to be a core part of what we do these days
[10:44]
underscor | HCross2: out of curiosity, where do those charts live?
I know of the a.o/stats/www.php and s3.php ones but those don't seem significantly effected
[10:50]
HCross2 | behind a password :p
its not too bad now as its 4am in Cali its not too bad
[10:50]
Kaz | http://archive.org/~tracey/mrtg/ is always nice to look at, shame dedupe isn't there [10:55]
HCross2 | each dedupe is now taking several seconds for me
arkiver: could we only dedupe files under/over a certain size?
[11:05]
*** jrwr has quit IRC (Read error: Operation timed out)
jrwr has joined #newsgrabber
[11:17]
...... (idle for 25mn)
arkiver | HCross2: sure
we totally can
however, the warriors will grab some static stuff many many times
and that really needs to be deduplicated
[11:42]
jrwr | I'm still getting a 20%hit rate on the cache [11:47]
arkiver | that's pretty good
jrwr: have you seen my messages about keeping the URLs in cache?
and the problems with that
problem is that URLs that are initially not in the wayback machine, but later are in there, will never be deduplicated
if they are popular and not deduplicated first time they are found
[11:49]
.................. (idle for 1h28mn)
HCross2 | Ideally we need an EU cdx that's kept up to date :p [13:20]
arkiver | hmmmm
nice idea actually
[13:21]
jrwr | arkiver: it only keeps the cache for a day
so
[13:24]
arkiver | even if an URL is hit every minute? [13:24]
jrwr | Yes [13:24]
arkiver | ah, cool
didn't know that
[13:25]
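Editorial sketch, not part of the channel log: the behaviour jrwr describes (an entry lives for one day, even if it is hit every minute) corresponds to a fixed expiry that reads do not refresh. The class below only illustrates that idea. It is not jrwr's actual cache setup, and `FixedTTLCache` and its parameters are hypothetical names.

```python
import time


class FixedTTLCache:
    """Cache whose entries expire a fixed time after being stored."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # key -> (value, expiry time)

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Expired: drop it so the next lookup goes back to the source.
            del self._entries[key]
            return default
        # A hit does NOT extend the expiry, so a popular URL is still
        # re-checked once the TTL has passed.
        return value
```

With this policy a cached "not in the Wayback Machine" answer for a popular URL goes stale after at most one TTL, which is the concern arkiver raises earlier in the log.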
..... (idle for 23mn)
Crusher_ | Sigh... I woke up expecting to see some progress on the grabber, but looking on the tracker, my overnight progress was 0 items >_> [13:48]
arkiver | jrwr: how do you want to import our current services into the wiki?
should I help convert the data in some form?
into*
[13:54]
jrwr | that would be nice
a nice CSV
[13:55]
arkiver | lines seperates by \n ?
hmm
[13:55]
jrwr | ya [13:55]
arkiver | I don't think we can do that in a csv
then it would be a new lines
would json be ok?
[13:55]
jrwr | Ya thats fine [13:55]
arkiver | ok [13:56]
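Editorial sketch, not part of the channel log: why JSON is convenient for fields that contain line breaks. The field names `title` and `seed_urls` and the URLs are made up for illustration; the real export format is not shown in the log.

```python
import json

# Hypothetical service record with a multi-line field.
record = {
    "title": "Example Site",
    "seed_urls": "http://example.com/\nhttp://example.com/news",
}

# json.dumps escapes the line break as the two characters \ and n,
# so each record can still sit on a single physical line.
encoded = json.dumps(record)

# json.loads turns the escape back into a real newline.
decoded = json.loads(encoded)
urls = decoded["seed_urls"].splitlines()
```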
..... (idle for 22mn)
Crusher_ | How slow is the newsgrabber usually? [14:18]
It's been running all night and still just deduping
It hasn't actually uploaded any items...
[14:25]
.... (idle for 16mn)
!server-stats newsbuddy [14:41]
newsbuddy | Crusher_: Getting server stats...
Crusher_: CPU usage percent: total 7.6 - user 6.4 - nice 0.0 - system 0.7 - idle 92.5.
Crusher_: Virtual memory usage: total 16778424320 - percent 22.3.
Crusher_: Disk usage: total 11906045009920 - percent 74.4.
[14:41]
Crusher_ | Hmm. [14:41]
jrwr: how's the server holding up [14:51]
jrwr | good
playing a CTF right now
Hack the Arch
[14:51]
arkiver | nice
and?
[14:52]
jrwr | lots of mysql challages
JOKES ON YOU I HAVE A PHPMYADMIN INSTALL
easy mode activated
[14:52]
Crusher_ | Lol [14:52]
...... (idle for 25mn)
*** kyan has quit IRC (Read error: Operation timed out) [15:17]
kyan has joined #newsgrabber [15:28]
arkiver | created https://wiki.newsbuddy.net/index.php?title=Service
let me know if you think anything is missing
and feel free to edit something :)
[15:36]
jrwr: can we put https://wiki.newsbuddy.net/images/6/6b/Newssites_logos.png as logo?
we might take an other logo later on, but this one is pretty nice
really nice compilations of logos from all over the world
[15:44]
*** Aranje has joined #newsgrabber [15:47]
HCross2 | Im going to look at adding more news from smaller african nations n [15:50]
arkiver | that'd be really nice :)
I'm going to improve the front page a little too
[15:55]
HCross2 | hmm https://usercontent.irccloud-cdn.com/file/beOsb0TU/image.png [15:58]
arkiver | huh
what did you try to add?
[15:58]
HCross2 | it doesnt like the straight bar
I removed that and its fine
[15:58]
arkiver | I'm thinking of also adding the option for a screencapture
both logo and screencapture are not and will not be mandatory though
[16:01]
HCross2 | We're getting daily TV from Burundi soon :)
in both English and French
arkiver: Can you give https://wiki.newsbuddy.net/index.php?title=IWACU_Voice_of_Burundi&stable=0 a read over for me please?
[16:03]
arkiver | I fixed it a little
regex can just be text now
doesn't have to be a list of between r"" anymore
or*
HCross2: might be good to write a guide on how to set a title for a webiste
Voice of Burundi is written in French on the logo
but is also more of a 'subtitle'
we do have an english version though
[16:10]
HCross2 | ah yea, I wanted to provide more meaning than IWACU [16:17]
arkiver | I'll add a subtitle field
for these kind of things, so in that case the title would be IWACU
HCross2: just a start, but https://wiki.newsbuddy.net/index.php?title=IWACU
I'll make a subtitle more nicely visible
[16:18]
.... (idle for 15mn)
jrwr | hay arkiver
can I get some regex foo from you
or HCross2
[16:36]
HCross2 | jrwr: we're both on skype atm [16:37]
jrwr | Ah [16:37]
*** Crusher_ has quit IRC (Ping timeout: 492 seconds)
Crusher_ has joined #newsgrabber
[16:50]
..... (idle for 23mn)
_Crusher_ has joined #newsgrabber
Crusher_ has quit IRC (Ping timeout: 492 seconds)
[17:15]
Aranje has quit IRC (Ping timeout: 245 seconds) [17:25]
.... (idle for 17mn)
arkiver | jrwr: working on the json noe
now*
[17:42]
jrwr | cool [17:42]
arkiver | I have another question :)
we have a hidden screenscapture field now
it is possible to automatically trigger the creation of a screencapture and upload and add that to the item?
we can use cutycapt for screencapture creation, works pretty good
HCross2: how does this look? https://wiki.newsbuddy.net/index.php?title=IWACU
[17:43]
HCross2 | that looks good [17:46]
arkiver | only seedURLs should be one per line, will fix that
what do you think of adding a screencapture under the logo automatically?
[17:46]
HCross2 | that would be perfect [17:47]
...... (idle for 27mn)
*** kyan_ has joined #newsgrabber
kyan has quit IRC (Read error: Operation timed out)
[18:14]
.......... (idle for 46mn)
kyan_ has quit IRC (Read error: Operation timed out) [19:03]
................ (idle for 1h19mn)
arkiver | jrwr: https://paste.fedoraproject.org/paste/tRffwWhcJRgdn8oK8zkvaA/raw
lines are seperated by \n
[20:22]
jrwr | Thanks
Im almost done with my CTF
[20:23]
arkiver | haha nice
good luck
I'm afraid we will have to add titles to everything though
[20:23]
..... (idle for 24mn)
jrwr: make sure to import the double \\ as a single \ in the regexes [20:48]
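Editorial sketch, not part of the channel log: in JSON text a literal backslash is written as `\\`, so a standard JSON parser already yields the single backslash arkiver asks for. The key name `regex` and the pattern are invented for illustration and are not taken from the actual paste.

```python
import json
import re

# Raw JSON text as it would appear in a file: backslashes are doubled.
raw = r'{"regex": "example\\.com/news/\\d+"}'

# Decoding collapses each \\ into a single \ automatically.
pattern = json.loads(raw)["regex"]

# The decoded string is directly usable as a regular expression.
match = re.search(pattern, "http://example.com/news/123")
```

If the import step reads the file as plain text instead of parsing it as JSON, the doubled backslashes stay in place and the patterns would no longer mean the same thing.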
................... (idle for 1h32mn)
*** kyan has joined #newsgrabber [22:20]
...... (idle for 26mn)
arkiver | HCross2: jrwr: can you please install https://www.mediawiki.org/wiki/Extension:RegexParserFunctions ? [22:46]
