#newsgrabber 2017-06-25,Sun




Who | What | When
arkiver | I'm thinking of increasing item URLs to 50 or 100
so we don't grab so many revisit records of static stuff
[00:19]
.............. (idle for 1h7mn)
Might switch deduplication fully to internal deduplication
we'd go through the megaWARCs and use what's in there to deduplicate
I'm afraid even at low speeds IA is simply not ready for this kind of deduplication
HCross2: ^ what do you think?
It will mean we might end up with a copy of something that's already in wayback, but at least we'll end up with max 1 extra copy
(and we can actually go through all discovered URLs, unlike now)
[01:27]
.............. (idle for 1h7mn)
_Crusher_ | So you want to internalize deduping to the tracker
Instead of relying on IA
[02:38]
*** | Crusher_ has joined #newsgrabber
_Crusher_ has quit IRC (Ping timeout: 246 seconds)
Crusher_ has quit IRC (Read error: Connection reset by peer)
[02:38]
..... (idle for 21mn)
Crusher_ has joined #newsgrabber [03:03]
............. (idle for 1h2mn)
jrwr | What about custom de-duping
since we scrape the same sites all the time
set up a little central in-memory store of sha1s
[04:05]
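A minimal Python sketch of the central in-memory sha1 store suggested above. The class name `DedupStore` and its methods are hypothetical and not part of the tracker; it assumes deduplication keys on the SHA-1 of the response payload alone.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory set of payload SHA-1 digests."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, payload: bytes) -> bool:
        """Return True if this payload was seen before; otherwise record it."""
        digest = hashlib.sha1(payload).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False

    def __len__(self) -> int:
        # Number of distinct payloads recorded so far.
        return len(self._seen)
```

Since the same sites are scraped repeatedly, a plain set keeps lookups cheap. Persisting it across restarts is left out of this sketch.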
..... (idle for 23mn)
Crusher_ | Why not include the sha1s with the URLs to the client
Let the client sort it out
[04:29]
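A rough Python sketch of the client-side variant, assuming the tracker ships each item as a mapping from URL to the last known payload SHA-1 (or `None` if unseen). The function name `plan_records` and the `fetch` callable are hypothetical, used only for illustration.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple


def plan_records(
    known_digests: Dict[str, Optional[str]],
    fetch: Callable[[str], bytes],
) -> List[Tuple[str, str]]:
    """Return (url, kind) per URL, where kind is "revisit" or "response"."""
    plan = []
    for url, known in known_digests.items():
        # Hash the freshly fetched payload and compare with the shipped digest.
        digest = hashlib.sha1(fetch(url)).hexdigest()
        unchanged = known is not None and digest == known
        plan.append((url, "revisit" if unchanged else "response"))
    return plan
```

The client still has to download each payload to hash it, so this saves storage and lookups against IA rather than bandwidth.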
..... (idle for 24mn)
*** | Aranje has joined #newsgrabber [04:53]
....... (idle for 32mn)
Aranje has quit IRC (Ping timeout: 245 seconds) [05:25]
.... (idle for 19mn)
HCross2 | arkiver: I'm up for that, a little bit of duplicated data is worth it for more speed [05:44]
*** | Crusher_ has quit IRC (Read error: Connection reset by peer)
Crusher_ has joined #newsgrabber
[05:50]
Crusher_ has quit IRC (Read error: Connection reset by peer) [05:59]
underscor | Yay, finally a big item finished [06:10]
................ (idle for 1h19mn)
HCross2 | Wonderful.. I've got a dead HDD in my crawler
explains why it's been so slow and useless
[07:29]
Anyone around good at interpreting smartctl data for me please? [07:34]
................................................. (idle for 4h0mn)
arkiver | jrwr: yes, that's what I mean
HCross2: ok, nice
let's do that then
[11:34]
.................. (idle for 1h29mn)
jrwr: how can we run commands when new pages are added?
we can run
cutycapt --url=<url> --out=<output.png> --min-width=1920 --min-height=1080
to make a screencapture of the URL and upload that to the wiki
[13:04]
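A small Python sketch of wrapping the cutycapt call above so it could be run whenever a new page is added. The function names are hypothetical, and how the wiki would trigger it and upload the result is not covered here. On a headless server cutycapt typically also needs an X display (for example via Xvfb), which this sketch assumes is already available.

```python
import subprocess
from typing import List


def cutycapt_command(url: str, out_path: str,
                     width: int = 1920, height: int = 1080) -> List[str]:
    """Build the argument list for the cutycapt call quoted in the log."""
    return [
        "cutycapt",
        f"--url={url}",
        f"--out={out_path}",
        f"--min-width={width}",
        f"--min-height={height}",
    ]


def capture(url: str, out_path: str) -> None:
    """Run cutycapt; raises CalledProcessError if it exits non-zero."""
    subprocess.run(cutycapt_command(url, out_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.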
.................. (idle for 1h27mn)
HCross2: jrwr: can you please install https://www.mediawiki.org/wiki/Extension:StringFunctions or enable it by setting $wgPFEnableStringFunctions = true; in LocalSettings.php ? [14:32]
HCross2 | arkiver: doing it now
arkiver: https://wiki.newsbuddy.net/Special:Version done
[14:46]
jrwr | Sorry, I've been AFK
I'll be doing stuff once I get off work at noon today (it's 10am right now)
[14:58]
.... (idle for 19mn)
arkiver | jrwr: no hurry, totally cool :)
thanks HCross2
HCross2: did you also set the variable in LocalSettings.php ?
it looks like #explode: isn't working yet
[15:17]
HCross2 | Hm. I added the line it said to add [15:20]
arkiver | maybe something needs to be restarted? [15:21]
*** | Aranje has joined #newsgrabber [15:26]
............... (idle for 1h13mn)
Aranje has quit IRC (Quit: Three sheets to the wind) [16:39]
........ (idle for 36mn)
Crusher_ has joined #newsgrabber [17:15]
......... (idle for 40mn)
medowar has quit IRC (Ping timeout: 268 seconds) [17:55]
................ (idle for 1h15mn)
arkiver | worked on the service pages
https://wiki.newsbuddy.net/index.php?title=IWACU
what do you think?
I only want to make the URL at the top left plain text instead of a link and remove the http://
jrwr: HCross2: can you please install https://www.mediawiki.org/wiki/Extension:RegexParserFunctions ?
so I can use it for editing that URL
and the seedURLs list needs to be an actual list of course
I'm also going to add a lot more optional fields to the services form
[19:10]
....... (idle for 33mn)
*** | Crusher_ has quit IRC (Ping timeout: 492 seconds) [19:49]
............ (idle for 56mn)
jrwr | Very Nice
Liking Forms arkiver
Installing Regex crap now
Done
[20:45]
................. (idle for 1h22mn)
Anyone want to be a stats nerd
I can give you the last three days of the nginx proxy log :0
[22:11]
arkiver | jrwr: we now have top level domain sorting :) https://wiki.newsbuddy.net/Category:Services [22:12]
jrwr | Nice! [22:12]
arkiver | jrwr: can you please also install https://www.mediawiki.org/wiki/Extension:Graph ?
:)
it's going to be for the statistics part
[22:17]
.......... (idle for 46mn)
jrwr | ya [23:03]
arkiver | and maybe a last one for now, https://www.mediawiki.org/wiki/Extension:Translate
I'd like to see if we can get the services form translated nicely to other languages
not sure how hard it is to install, but the translate extension does not have to be done per se
err
yeah
so we can do that one some other time too
[23:06]
jrwr | So
https://noc.wikimedia.org/conf/highlight.php?file=CommonSettings.php
All the settings Wikipedia uses
It's a damn good read, maybe I'm strange like that
[23:17]
arkiver | that's interesting [23:29]
jrwr | I found the mother lode
https://noc.wikimedia.org/conf/
[23:39]
