#newsgrabber 2017-06-14,Wed




***kyan has joined #newsgrabber [04:33]
...... (idle for 27mn)
kyan has quit IRC (Remote host closed the connection) [05:00]
.................................................................................................................................................................. (idle for 13h28mn)
blitzed has joined #newsgrabber [18:28]
......... (idle for 43mn)
MarkGraha has joined #newsgrabber [19:11]
<MarkGraha> HI [19:12]
.... (idle for 15mn)
<HCross2> Hi Mark [19:27]
..... (idle for 21mn)
***MarkGraha has quit IRC (Quit: Page closed) [19:48]
...... (idle for 29mn)
blitzed has quit IRC (Quit: Leaving) [20:17]
jrwr has joined #newsgrabber
MrRadar has joined #newsgrabber
[20:23]
<arkiver> hi [20:23]
<jrwr> Oh?
Anything I can do?
[20:24]
<arkiver> Currently we add websites through little Python files, which are stored here: https://github.com/ArchiveTeam/NewsGrabber/tree/master/services
each has, for example, a list of URLs https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__cnbc_com.py
that are the seed URLs.
these URLs are checked every so many seconds (the refresh variable) for new URLs
if new URLs are found, they are archived into WARCs and uploaded to the Wayback Machine
[20:24]
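A minimal sketch of what such a service file might look like, assuming only the two pieces described above (a `refresh` interval and a list of seed URLs); the actual files in the repository may use a different schema:

```python
# Hypothetical NewsGrabber-style service definition, modeled on the
# description above. The seed URLs are polled every `refresh` seconds
# and any newly discovered links are archived into WARCs.

refresh = 300  # seconds between checks of the seed URLs for new links

urls = [
    "http://www.cnbc.com/",
    "http://www.cnbc.com/world/",
]
```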
<jrwr> Nice! [20:26]
<arkiver> It might be nice to make adding websites easier for the rest of the world.
Maybe some kind of wiki system
or a kind of wikidata system
[20:26]
<jrwr> Oh! That could be done, matter of fact [20:26]
<arkiver> with information that can be easily read by the scripts, and where the user can add a certain set of variables for each website
awesome :)
[20:27]
<JAA> So if articles are updated, that isn't caught, correct? [20:27]
<arkiver> unless the URL of the articles changes, no
but
[20:27]
<jrwr> We could have a system go back in x days and check
a little back queue
[20:27]
<arkiver> if the URL has 'live' or 'blog' in it, it will be archived every time it is found [20:27]
<JAA> Ah, nice [20:28]
<arkiver> in case of, for example, those live news blogs during large events
like the recent terrorist attacks
not sure if events was the right word
[20:28]
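The re-archive rule described above is a simple substring check; a sketch, with the function name being an illustrative choice rather than the project's actual API:

```python
def should_rearchive(url: str) -> bool:
    """Archive a URL on every discovery (not just the first time)
    when it looks like a live blog, per the rule described above:
    the URL contains 'live' or 'blog'."""
    return "live" in url or "blog" in url
```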
<jrwr> Anyway arkiver, using wiki forms you can make a form to be filled out [20:28]
<JAA> Yeah, I've been archiving the BBC and Guardian live reports a few times today for the London fire [20:28]
<jrwr> and then it could be cross-checked by someone [20:28]
<arkiver> jrwr: exactly [20:29]
<jrwr> maybe even a script to check back and respond to the user with what was returned [20:29]
<arkiver> well
I guess we can make a dump of the wiki for the machines every hour and update the data we are running with with this new dump
[20:29]
<jrwr> There is a method to make the entire thing managed by users of the wiki (mods can approve code)
so gets put into a pending status, then approval status
[20:29]
<arkiver> nice..
that would be perfect
I'm thinking something like https://www.wikidata.org/wiki/Q4787261
with just variables needing to be filled in
[20:30]
<jrwr> have the page contain a subpage that another bot can run and show example data returned
so we can edit and get feedback
[20:30]
<arkiver> yep
after the update it will be a lot easier to get more resources running for the project
[20:30]
<jrwr> Yep [20:31]
<arkiver> the list of discovered URLs will be split into little lists and handed to the warriors
[20:31]
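Splitting the discovered URLs into small work items, one per warrior, is straightforward; a sketch (the chunk size is an arbitrary assumption):

```python
def split_for_warriors(discovered, chunk_size=50):
    """Split the list of discovered URLs into little lists, as
    described above, so each chunk can be handed to one warrior."""
    return [discovered[i:i + chunk_size]
            for i in range(0, len(discovered), chunk_size)]
```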
<jrwr> I will be able to provide a quick version number
if the version number changed, download new defs
and reload
I'll have a single page with all the entries to cut down on requests
If we did this with dokuwiki the overhead is almost nil
So arkiver, where do you want to put this?
We could also have some common templates, like the YouTube channel
or a basic webpage
[20:31]
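The version-check-then-reload idea above can be sketched like this; `fetch_version` and `fetch_defs` stand in for HTTP calls against the (hypothetical) wiki export page:

```python
def maybe_reload(local_version, fetch_version, fetch_defs):
    """Download new site definitions only when the published version
    number has changed, per the scheme described above. Returns the
    (possibly updated) version and the new defs, or None if the
    local copy is already current."""
    remote = fetch_version()
    if remote != local_version:
        return remote, fetch_defs()
    return local_version, None
```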
<arkiver> with the single page for the entries you mean like an admin page with the newly added websites to be approved? [20:36]
<jrwr> ya
and for the scraper to update from
[20:36]
<arkiver> Yes
I have no idea where to put this
[20:36]
<jrwr> I've done systems like that in the past [20:37]
<arkiver> HCross2 might have an idea [20:37]
<HCross2> jrwr: is it really just a small web page/PHP setup? [20:37]
<jrwr> Yes
No database or anything
a 512 MB VPS would do fine
[20:38]
<arkiver> HCross2 set up http://newsgrabber.harrycross.me [20:38]
<jrwr> I can config the full stack so [20:38]
<arkiver> not sure if we want something under archiveteam.org or a totally new website
hmm
HCross2: did we already have a newsgrabber site?
[20:38]
<HCross2> There was something the script made [20:39]
<jrwr> I've got my domain as well [20:39]
<arkiver> jrwr: any idea what this might look like? [20:39]
<jrwr> Ya [20:39]
<arkiver> users being able to search for websites? like a wiki system? [20:40]
<jrwr> Yes [20:40]
<arkiver> nice [20:40]
<jrwr> There will be a master list of sites a user can look over
if they want to, they can submit a new site, have some checkmark boxes for common regex or custom
it will go into a pending queue
[20:40]
<arkiver> nice [20:41]
<jrwr> a bot will go over the pending queue and do a basic scrape of what was submitted, so admins can see how it behaves [20:41]
<arkiver> and a full export of all website data in there might be possible? [20:42]
<jrwr> Correct
I'll have a button for debugging
so an admin can see the full scope of what the regex does
I'll use a mix of dokuwiki bots and dokuwiki itself
I've done a fuckton with dokuwiki to this scope in the past (my main site jrwr.io is dokuwiki)
Admins will have the power to update existing sites at will
anyone else will have the site copied into pending
so if someone submits an update it will copy the submission with the changes
to be approved
[20:42]
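The moderation flow described above (admins edit live defs directly, everyone else's change goes into a pending queue) can be sketched as below; all names are illustrative:

```python
def submit_change(sites, pending, name, new_def, is_admin):
    """Moderation flow sketched above: admins update existing site
    definitions at will; anyone else has their submission copied
    into a pending queue, to be approved by an admin later."""
    if is_admin:
        sites[name] = new_def
    else:
        pending.append((name, new_def))
```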
* arkiver looks up dokuwiki [20:44]
<jrwr> it's like mediawiki, but lighter and flat-file [20:45]
<JAA> Interesting, never heard of it before. [20:46]
<jrwr> really?
it's big in the EVE Online space since its permissions are super easy
and much more powerful than mediawiki's
[20:46]
* JAA looks up Eve Online.
;-)
[20:48]
<jrwr> Oh god [20:48]
<arkiver> :) [20:48]
<jrwr> EVE Online is amazing [20:48]
<MrRadar> Well, at the very least it's amazing to *read* about. [20:49]
<jrwr> the only MMO that gives you ALL the data [20:49]
* arkiver has never been into gaming much [20:49]
<MrRadar> Playing it, I've heard, is a bit of a chore [20:49]
<JAA> Just kidding. I've heard of it before, never played it though. [20:49]
<jrwr> MrRadar: it can be, it's a true sandbox
I've made some cool-ass shit for EVE Online and its API
I'll spin up a VPS for it tonight
I'll do some SCIENCE
[20:49]
<arkiver> :D awesome [20:52]
<jrwr> I fucking love projects and Ive been looking for now
one(
Ugh, stupid phone
Give me 48-72 hours and I'll have something whipped up
I'm going to an infosec meetup tonight
so maybe I'll recruit a few nerds
[20:52]
<JAA> "No database required, it uses plain text files" sounds ... interesting. I wonder how well that scales. [20:55]
<jrwr> pretty well
it does use a flatfile cache
[20:55]
<arkiver> the plan of this is to get more people to contribute websites [20:56]
<jrwr> so the rendered pages are saved in chunks to the file system [20:56]
<arkiver> especially those small local news sites [20:56]
<jrwr> Yep! [20:56]
<JAA> I guess that helps, but what about things like search? [20:57]
<jrwr> As good as mediawiki's
I use it at work for a knowledge base
my users love it
[20:57]
<arkiver> jrwr: later on, when we have something running again, could we do some kind of stats system?
for each website, how many URLs were discovered
[21:00]
<jrwr> Yes [21:00]
<arkiver> maybe even lists of URLs that were recently discovered.
nice
[21:00]
<jrwr> A log of sorts [21:00]
<arkiver> yep [21:00]
<JAA> Sounds good [21:01]
<jrwr> I can have a URL that the workers post details to
or just the ingress discoveries
I'll leave that to you
I'll make the endpoint
[21:01]
<arkiver> nice [21:01]
<jrwr> If you can make me a little stub of code I can feed these defs into in a safe manner and have it report back how it did
arkiver:
that would be the only thing I request
[21:11]
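A hypothetical stub of the kind jrwr asks for: take a submitted site definition, validate it without fetching anything, and report back what was wrong. The field names (`refresh`, `urls`, `regex`) are assumptions based on the discussion above, not the project's actual schema:

```python
import re

def check_def(def_dict):
    """Validate a submitted site definition offline and return a
    list of problems (empty list means the def looks usable)."""
    problems = []
    refresh = def_dict.get("refresh")
    if not isinstance(refresh, int) or refresh <= 0:
        problems.append("refresh must be a positive integer")
    for url in def_dict.get("urls", []):
        if not url.startswith(("http://", "https://")):
            problems.append("not an http(s) URL: %s" % url)
    for pattern in def_dict.get("regex", []):
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append("bad regex %r: %s" % (pattern, exc))
    return problems
```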
<arkiver> I'll make something up [21:12]
<jrwr> Thanks [21:12]
<arkiver> probably just the plain discovery files that are sent back to the main server
* arkiver is afk for an hour
HCross2: I'm going to try to get it running again when I'm back
is newsbuddy ready for (maybe) some action?
(hopefully) I should have put there
[21:12]
<HCross2> Yep. Server is still there [21:16]
..... (idle for 23mn)
<Kaz> someone poke me if we're getting moving / I'm needed
I'll be around
[21:39]
.......... (idle for 45mn)
***johnny5 has quit IRC (ircd.choopa.net hub.efnet.us)
luckcolor has quit IRC (ircd.choopa.net hub.efnet.us)
joepie91 has quit IRC (ircd.choopa.net hub.efnet.us)
chfoo has quit IRC (ircd.choopa.net hub.efnet.us)
MrRadar has quit IRC (ircd.choopa.net hub.efnet.us)
dxrt has quit IRC (ircd.choopa.net hub.efnet.us)
arkiver has quit IRC (ircd.choopa.net hub.efnet.us)
midas has quit IRC (ircd.choopa.net hub.efnet.us)
lainu has quit IRC (ircd.choopa.net hub.efnet.us)
[22:25]
......... (idle for 41mn)
johnny5 has joined #newsgrabber
luckcolor has joined #newsgrabber
[23:06]
.......... (idle for 48mn)
MrRadar has joined #newsgrabber
dxrt has joined #newsgrabber
arkiver has joined #newsgrabber
midas has joined #newsgrabber
lainu has joined #newsgrabber
irc.servercentral.net sets mode: +o arkiver
[23:54]
