#newsgrabber 2017-12-03, Sun



Who What When
***kyan has joined #newsgrabber [00:09]
................. (idle for 1h20mn)
CoolCanuk: my grabber only uploads a few megs at a time. How can I get 1000+ MB uploads like others :(? [01:29]
JAA: Be lucky, I guess. [01:32]
CoolCanuk: is it okay to run multiple instances of NewsGrabber?
  e.g. 20 concurrency + 20 concurrency
[01:33]
JAA: I think so. I'm not running this project myself, so I can't confirm it, but I believe others do that.
  It shouldn't be a problem in general.
  NewsGrabber is limited by what your machine can handle.
[01:35]
CoolCanuk: :)
  I was doing URLTeam too, but it complains it only hands out one service or something per IP
  and it kept saying there were no URLs for hours
[01:36]
JAA: Yeah, URLTeam needs a ton of work. We have a huge list of shorteners which haven't yet been added to the project. [01:38]
CoolCanuk: :c
  are you still looking for more shorteners?
[01:38]
JAA: Sure, always. #urlteam is the channel for that project, by the way.
  The real bottleneck is obviously researching each shortener, figuring out how it works, and then adding the relevant rules to scrape it.
[01:38]
............. (idle for 1h1mn)
CoolCanuk: is anyone able to approve this? https://github.com/ArchiveTeam/NewsGrabber-Services/pull/11 [02:40]
***kyan has quit IRC (Read error: Operation timed out) [02:45]
.... (idle for 18mn)
CoolCanuk: if I add something like https://www.southwesternontario.ca , will it also add https://www.southwesternontario.ca/southwest-classifieds/ and https://southwesternontario.save.ca ? [03:03]
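(Editor's note: the question above goes unanswered in this log, and nothing here states how NewsGrabber scopes a service. The sketch below only shows how the three URLs relate to each other by hostname, which is what the question hinges on. The function name `relation_to_seed` and its labels are hypothetical, and the suffix comparison is a simplification that does not consult a public-suffix list.)

```python
from urllib.parse import urlsplit


def _strip_www(host: str) -> str:
    """Drop a leading 'www.' so www and bare hostnames compare equal."""
    return host[4:] if host.startswith("www.") else host


def relation_to_seed(seed: str, candidate: str) -> str:
    """Classify candidate relative to seed by hostname only (hypothetical helper).

    This does NOT describe NewsGrabber's actual scoping rules; it is a plain
    hostname comparison for illustration.
    """
    seed_host = _strip_www((urlsplit(seed).hostname or "").lower())
    cand_host = _strip_www((urlsplit(candidate).hostname or "").lower())
    if cand_host == seed_host:
        return "same-host"   # same site, possibly a different path
    if cand_host.endswith("." + seed_host):
        return "subdomain"   # host nested under the seed's host
    return "other-host"      # different host altogether


seed = "https://www.southwesternontario.ca"
for url in (
    "https://www.southwesternontario.ca/southwest-classifieds/",
    "https://southwesternontario.save.ca",
):
    print(url, "->", relation_to_seed(seed, url))
```

Under these assumptions the classifieds URL is a path on the same host, while southwesternontario.save.ca is a host under save.ca rather than under southwesternontario.ca, so it would likely need to be treated as a separate site.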
....... (idle for 30mn)
***kyan has joined #newsgrabber [03:33]
.......... (idle for 47mn)
CoolCanuk: blah. why do we have to include outbrain :P ?
  (spam ads)
[04:20]
.... (idle for 16mn)
***kyan has quit IRC (Read error: Operation timed out) [04:36]
.......................................................... (idle for 4h47mn)
***netrix has joined #newsgrabber
***netrix has quit IRC (Client Quit)
***CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
[09:23]
.............................................................................. (idle for 6h27mn)
***CoolCanuk has joined #newsgrabber [15:55]
.......... (idle for 46mn)
***kyan has joined #newsgrabber [16:41]
........ (idle for 39mn)
CoolCanuk: is this project still alive? [17:20]
............... (idle for 1h14mn)
jrwr: CoolCanuk: Very well alive [18:34]
CoolCanuk: omg life [18:34]
jrwr: tons and tons of traffic going down, just look at the tracker [18:34]
CoolCanuk: could you check if I'm doing the syntax properly [18:34]
jrwr: http://tracker.archiveteam.org/newsgrabber/ [18:34]
CoolCanuk: https://github.com/ArchiveTeam/NewsGrabber-Services/pulls
  I asked last night but no one replied. I don't want to mess up on all of them; I'd rather know how to do it properly first
[18:34]
jrwr: that's because arkiver is the main dev, he needs to approve that (also he's been a little ill as of late) [18:35]
***figpucker has quit IRC (Ping timeout: 260 seconds) [18:35]
jrwr: it /looks/ right [18:35]
CoolCanuk: I just want to make sure I'm doing it right
  I'm fine waiting (although removing 24hrs ASAP would be best, as they are no longer online)
  is this incorrect (with the www)? https://github.com/ArchiveTeam/NewsGrabber-Services/pull/13/commits/e75ebd9962db3acbc9631d33f6c3cde258fc43a4
[18:35]
jrwr: newsgrabber is not for dying sites
  it's for active sites that we want to archive over LONG timeframes every day
[18:50]
.... (idle for 15mn)
CoolCanuk: I know
  I mean 24hrs is dead now. It redirects to metro
  toronto24hours.ca -> metro
  vancouver.24hrs.ca -> metro
  but we still have it being crawled
  jrwr: it crawls every DAY? every 2nd week is probably fine for the "community" sites I proposed to add, most are not major cities
[19:05]
................. (idle for 1h21mn)
jrwr: oh ya
  discovery has a pretty small revisit time
[20:28]
........................ (idle for 1h57mn)
***kyan has quit IRC (Read error: Operation timed out) [22:26]
....... (idle for 33mn)
***medowar has quit IRC (Ping timeout: 248 seconds)
***medowar has joined #newsgrabber
[22:59]
