#newsgrabber 2017-06-18,Sun

jrwrIts a little broken
Can't get it to make a table for me
Oh well, it's not needed at this time. Once you have the template the way you want, arkiver, I can start importing the current ones we have
[00:12]
arkivercool
jrwr: that video regex is a standard regex for video
[00:17]
jrwrAh
Its defined in the file on github currently
[00:17]
arkiverah [00:21]
jrwrIm making the tester for regex right now
since regex is regex
[00:23]
arkiveryep [00:25]
jrwrWow, lots of broken regex, even for the Python-based one [00:31]
arkiver?
like which one
[00:32]
jrwrhttps://15minut.org/
I got no matches in PHP or Python
[00:32]
arkiverbut for which regex
^https?:\/\/[^\/]*15minut\.org\/
?
[00:32]
jrwr'^https?:\/\/[^\/]*15minut\.org\/'
its the only one in there
looking at the HTML
they moved from having the domain in links to not
so
[00:34]
arkiverno
we extract all URLs from the page
then match the regex on those URLs
[00:34]
jrwrhttps://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__15minut_org.py [00:34]
arkiverthe regex is not matched directly on the HTML [00:34]
jrwrOH
well, color me surprised
This shim is going to be interesting to write then
[00:34]
arkiverwhat are you trying to write exactly [00:37]
jrwrBasic, Takes in a single regex + url and returns the first ten matches [00:38]
arkiveras an example while adding the regex in the wiki?
might be best to just use the discoverer
I can rewrite it a little for this
discoverer scripts I mean
[00:38]
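For context, a minimal sketch of what such a tester could look like, following the flow arkiver describes above (extract all URLs from the page, then match the regex against those URLs, never against the raw HTML). The helper name and the use of the requests library are assumptions for illustration, not the actual NewsGrabber discoverer code:

```python
# Hypothetical tester shim: take a single regex + url, return the first
# ten matches. Per arkiver above, the regex is matched against URLs
# extracted from the page, not against the raw HTML.
import re
from urllib.parse import urljoin

import requests  # assumed available; not necessarily what NewsGrabber uses


def first_matches(page_url, url_regex, limit=10):
    html = requests.get(page_url, timeout=30).text
    # Crude URL extraction from href/src attributes.
    raw = re.findall(r'''(?:href|src)=["']([^"'#]+)["']''', html)
    # Resolve relative links against the page URL before matching.
    candidates = {urljoin(page_url, u) for u in raw}
    pattern = re.compile(url_regex)
    return sorted(u for u in candidates if pattern.match(u))[:limit]


if __name__ == '__main__':
    for url in first_matches('https://15minut.org/',
                             r'^https?:\/\/[^\/]*15minut\.org\/'):
        print(url)
```

Resolving relative links against the page URL before matching is also why a domain-anchored regex like ^https?:\/\/[^\/]*15minut\.org\/ can still match after a site stops putting the domain in its links.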
jrwrIve got a wikibot that will handle the status of the entries and will go over old sites to make sure the regex still works
like once a month or something
[00:39]
arkiverand how are you checking exactly is it still works? [00:40]
jrwrbut when a page is updated, it will scrape it within 5 minutes to return back some nice data so you know your regex works with the system [00:40]
arkiverif*+ [00:40]
jrwrWell, does it still return more than X matches (I know live / video will be wonky)
Mostly keep those as tuneables
[00:40]
arkiverright
if you can use a python script?
[00:41]
jrwrYes of course [00:41]
arkivercool [00:41]
jrwrAll it will do is attach another category to the services so we can go over the list and make sure everything is OK [00:42]
arkiverI'll rewrite the script for the discoverer a bit so it can be used for this [00:42]
jrwrsince if we start to manage 100+ sites
having something to do basic sanity checks will be needed
[00:42]
arkiveryep [00:43]
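A sketch of how the monthly sanity check discussed here might sit on top of that tester, with the match threshold kept as a tunable as jrwr suggests; the function name and threshold are made up, and first_matches is the hypothetical helper sketched earlier:

```python
# Hypothetical monthly health check: re-run a service's regex and flag
# the service if it no longer returns at least X matches.
MIN_MATCHES = 5  # tunable, per jrwr; live/video regexes may need a lower bar


def regex_still_works(page_url, url_regex, min_matches=MIN_MATCHES):
    # first_matches is the helper from the earlier sketch.
    matches = first_matches(page_url, url_regex, limit=min_matches)
    return len(matches) >= min_matches


# The wikibot could then attach a "needs review" category to failing services:
if not regex_still_works('https://15minut.org/',
                         r'^https?:\/\/[^\/]*15minut\.org\/'):
    print('regex may be stale; flag the service page for review')
```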
jrwrIll have any "extra" data you print to screen included with the log [00:43]
arkiverwho is t2t2
sounds good
HCross2: Kaz: do you know who t2t2 is?
ah nvm
[00:43]
jrwrIm going to start importing the current sites in the github [00:45]
arkiveryes, sounds good
we can always add more fields right?
or would that require some rewrite of the database
[00:45]
jrwrNope
We can do any edits needed
its All wiki based
[00:47]
arkiverawesome [00:48]
jrwrwe can change the template design mostly and it will update on all pages [00:48]
arkiverlet's say we want to change 'info' to 'information' later on
that is also possible?
[00:48]
jrwrYa [00:48]
arkiverok [00:48]
jrwrYou can change the form to display anything really
its the backend template that does the lifting
[00:48]
arkiveryep
I'm not going to start the project yet
a problem with t2t2 that needs to be figured out first
I hope he'll get back to me soon
[00:56]
.... (idle for 16mn)
jrwrim working on the "status" bot anyway
where any page edited by someone will go into an unconfirmed state
until an admin confirms it
[01:13]
.... (idle for 18mn)
arkiversounds good
do you think we can add an optional logo for each website?
an image uploaded to the wiki and displayed on the right of the website page
[01:32]
jrwryes
there is a section on the form to do that with
[01:38]
arkiverI don't see it unfortunately
problem with the template, form and the pages is that they all need to be updated separately
and they don't automatically change if for example the form changes
[01:40]
jrwrYep
Anyway, the logo method is in place
[01:55]
https://wiki.newsbuddy.net/ABC_News [02:03]
.... (idle for 15mn)
arkiver:D
thanks
[02:18]
........................................ (idle for 3h16mn)
HCross2I don't know who t2 is [05:34]
.................................... (idle for 2h59mn)
KazI remember the name, haven't seen him in a while though
why do you need him?
[08:33]
......... (idle for 41mn)
HCross2not sure, I have a feeling his script is returning/doing odd things [09:14]
............ (idle for 57mn)
Kazhmm, yeah he's blocked in the tracker now too [10:11]
HCross2we've got a metric ton of backlog now [10:15]
................. (idle for 1h23mn)
arkiverI think we're starting to have something really nice here with the wiki :) [11:38]
HCross2jrwr: im about to log in to the wiki vm and install the stuff for nightly dumps [11:44]
arkiver:D
is it easy to install just any extension?
I'm thinking of a graphs extension
[11:44]
HCross2should be [11:45]
.... (idle for 18mn)
***medowar has quit IRC (Ping timeout: 268 seconds) [12:03]
Kaznot sure if you're going to be able to get hold of t2t2, arkiver: Online for about a month (idle for 12 days) [12:06]
........... (idle for 51mn)
arkiverKaz: yeah, we'll see
I'm going to reset the newsgrabber tracker
and start this all over again
not going to start packing the small WARCs yet
Want to do some random tests on them to see if they are deduplicated correctly
[12:57]
Kazright
shall I start a warrior?
[13:02]
arkiversure
try and see how it runs
I'm going to remove items from the tracker now
[13:04]
***newsbuddy has quit IRC (Remote host closed the connection)
kyan has joined #newsgrabber
newsbuddy has joined #newsgrabber
[13:04]
newsbuddyHello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [13:08]
Kazheh, there's just a 'hi' message when the job starts up https://s.kurt.gg/76Yvv0K.png [13:09]
arkiveruh
and nothing after that?
[13:10]
Kazit continues as normal after that
I can't find the hi in the script
[13:10]
arkiverright
probably some error somewhere
wait
I remember
https://github.com/ArchiveTeam/NewsGrabber-Warrior/blob/master/warcio/__init__.py
:P
it was an early test if stuff was imported correctly
[13:11]
Kazhaha, yeah thats why it's not in pipeline.py [13:12]
arkiverfrom the local warcio that is
that 'hi' visual check is now replaced by https://github.com/ArchiveTeam/NewsGrabber-Warrior/blob/master/pipeline.py#L21-L24
HCross2: do you know how to fix the mkdir permission problem with rsync?
[13:12]
HCross2what was the error again? [13:13]
arkiverah nvm [13:13]
Kazchown -R 777 / [13:13]
arkiverneed to change owner
yeah
[13:13]
Kaz(don't do that) [13:13]
arkiveroh [13:13]
Kazchown archiveteam:archiveteam directoryname [13:14]
JAAYeah, don't do that, use chmod instead. ;-) [13:14]
arkiver:P [13:14]
Kazone of many reasons
also, chown probably wants/needs -R too
[13:14]
JAAYes, and to screw everything up properly, you want the suid/sgid bits as well. [13:15]
Kazhow often are jobs injected? hourly? [13:28]
arkiverevery 30 minutes [13:29]
Kazah okay [13:29]
arkiverMight reduce that to 15 minutes [13:30]
KazI guess it only really matters on first startup, depends if we have enough capacity to keep up with the queue [13:32]
arkiveryep
deduplicating can take a lot of time
[13:36]
Kazalright, we've got items
arkiver: shall I let jobs out?
[13:38]
...... (idle for 28mn)
arkiverKaz: yes [14:10]
Kazwe're away [14:11]
dedupe doesn't seem to be using anything in terms of resources [14:16]
HCross2same here [14:17]
arkivernope, it doesn't use much [14:17]
Kazi'm going to see where my limits are on concurrency [14:19]
HCross2Kaz: are your web interfaces not loading either [14:22]
Kazi run --disable-web-server [14:22]
HCross2https://www.irccloud.com/pastebin/EkcPA8dZ/
arkiver: ^
[14:23]
KazI'm at 15 claims, running 10 concurrent so 5 of them have failed already [14:27]
HCross26x20 here
pegging an i7 2600 at 100% and using 4GB of RAM
load of 20
[14:27]
Kazhmm, across 10 concurrent I've only managed to pull 28MB [14:30]
HCross2someone is feeding something in, albeit slowly
Riight... whose using an AWS instance
[14:31]
KazI don't have youtube-dl installed. That'll be the one
seeing the same issue as HCross2 ^, log incoming
arkiver: http://termbin.com/cydb
[14:32]
arkiverthanks, will have a look at it
arkiver is afk for some hours
[14:39]
KazHCross2: update wpull
first job is now looking happier.. will see how it goes
[14:39]
HCross2Exception: Sorry, Python 2 is not supported. [14:40]
arkiverugh yeah [14:40]
HCross2added a 3 in front and it's happy [14:40]
arkiverfeel free to keep running or pause
I'll fix the problem when I'm back
arkiver is afk
[14:41]
HCross2Kaz: did you have to restart after? [14:41]
Kazyeah, just killed the pipeline and restarted it
will see if jobs end up coming in
[14:41]
HCross2ive done the same.. and written a little bash script to spawn 10 instances [14:44]
Kazhmm, lots of small jobs coming in from you [14:46]
HCross2yep [14:46]
Kazhave you got the python2 version of youtube-dl installed? [14:46]
HCross2something on this hetzner is screwed
tempted to flatten this box and reinstall
[14:47]
Kazi take it that's not master? [14:49]
HCross2nope
this hetzner isnt a happy bunny.. about to get on the phone and ask for an urgent LARA to fix things
[14:52]
Kazwipe it [14:53]
HCross2yea, im getting an KVMoIP sorted [14:53]
***Aranje has joined #newsgrabber [15:01]
Kazah hang on, second job has just failed, similar error to the first
feels like a wpull issue, I have no idea where it's pulling '/home/box' from
oh hey, https://github.com/chfoo/wpull/issues/322
[15:06]
........................... (idle for 2h11mn)
***johnny5 has quit IRC (Ping timeout: 492 seconds) [17:22]
HCross2Kaz: have you turned the tracker off? [17:23]
KazYeah, paused it for now
just realised I have a warrior vm sitting in esxi, I'm going to spin that up and see how it goes
[17:23]
HCross2ive flattened and reinstalled
Kaz: shall I open it up a little bit so we can see if we're stable now?
[17:25]
Kazyeah give it a shot [17:25]
HCross2what did you set it to before? [17:26]
Kaz100 [17:27]
HCross2nope.. died straight away [17:27]
KazI can't remember why it is we're locked to wpull 1.2.3, there was a reason at some point [17:28]
HCross2Kaz: going back to your point about why dedupe is so quick.. all we are doing is making a request to the IA to ask if they have seen the URL before [17:34]
Kazah, makes sense why there's hardly any disk activity then
I thought it was checking against other local archives or something
[17:34]
HCross2The IA have an API for this kind of thing
check your netstat -W and youll see whats going on
youll find we're uploading a fair bit less from now on
[17:35]
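A minimal sketch of that lookup, modeled on the CDX request that appears verbatim in jrwr's proxy log later in this session; the function name is made up and the requests library is assumed rather than whatever the pipeline actually uses:

```python
# Ask the Wayback CDX API whether a capture of this exact URL with this
# payload digest already exists; if so, the new record can be deduplicated.
import json

import requests

CDX_API = 'https://web.archive.org/cdx/search/cdx'


def seen_before(url, sha1_digest):
    params = {
        'url': url,
        'output': 'json',
        'matchType': 'exact',
        'limit': '1',
        'filter': 'digest:' + sha1_digest,
    }
    r = requests.get(CDX_API, params=params, timeout=60)
    r.raise_for_status()
    body = r.text.strip()
    rows = json.loads(body) if body else []
    # JSON output is a list of rows with a header row first, so more than
    # one row means the IA has already seen this exact capture.
    return len(rows) > 1
```

Each check is a single round trip to the IA, which is why transatlantic latency, rather than local disk or CPU, dominates the cost.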
Kazhmm, just installed wpull 1.2.3 myself rather than use the prebuilt one
I'm seeing the same issue, but much less frequently
HCross2: any idea how far back we look for dedupe? articles changing etc could mean we miss some content
[17:44]
HCross2im not too sure
Kaz: does dedupe take a fair bit of time for you?
[17:47]
Kazyeah, feels like it does [17:51]
HCross2its because the request is having to cross the atlantic twice [17:51]
KazI'm tail -f'ing the log, and that's not quite updating in realtime [17:51]
HCross2and then the states [17:52]
Kazit's not too much of an issue.. if it takes ages it just means I can run more concurrent because it doesn't add too much load
waiting for this job to finish then I'm going to *try* to scale up a bit
[17:52]
HCross2am back... just had to ring the chinese place and tell them they brought me someone else's food [17:58]
Kazhaha [17:59]
HCross2Yea... 16 boxes of stuff arrived to feed 4 [17:59]
Kazcouple of weeks ago someone accidentally delivered an indian to us
seems like it was for 3-4 people.. we waiting 15min, nobody came back for it
waited*
Web interface won't load for me, anything I need to add for that?
do I need --address?
[17:59]
HCross2--port 1337 --address '192.99.12.208'
obv change that
[18:03]
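For reference, a hedged example of how those pipeline flags might be combined when launching by hand; the nickname, concurrency, and address are placeholders, and --disable-web-server is the flag Kaz mentions above:

```sh
# Run with the web dashboard exposed on a routable address:
run-pipeline pipeline.py --concurrent 2 --port 1337 --address '192.99.12.208' YOURNICK

# Or skip the dashboard entirely:
run-pipeline pipeline.py --disable-web-server YOURNICK
```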
Kazah yeah thought so, cheers
http://rbx2.kurt.gg:8001/
[18:06]
HCross2Im wondering if theres a way we can have a local cache (ie on master) of urls
Opera hates the dashboard for some reason
I hear my PC fans ramp up when I click it
[18:06]
Kazhaha
I can't hear anything over the fans on this r710
[18:07]
HCross2and thats a water cooled i7 4770k
you should hear my GPUs when I turn Claymore on to mine ETH
[18:08]
Kazmy pc is pretty silent, even under load [18:09]
HCross2stock AMD gpus.. all im going to say [18:10]
Kazhaha, yeah there's your issue
also, looks like only one job can be deduping at a time
[18:10]
HCross2Im wondering if we'll thump the IA APIs too hard [18:11]
Kazone easy way to find out.. [18:13]
HCross2*slack starts lighting up*
*Mark Graham comes in here and asks what weve done*
[18:17]
Kazwe'll see
at the moment it looks like I'm not even uploading anything
i think there's a job trying to upload, but can't open the webui any more
[18:18]
HCross2Try *shudders* IE or Edge
Kaz: idea.. try lots of small concurrent, like 10x2 maybe and that should dedupe a lot more
[18:19]
Kazjust kicked off 30x1
yep, lots of jobs failing, trying to work out why now
it looks like IA just kills connections if you hit too hard https://s.kurt.gg/77CIoLC.png
[18:20]
HCross2let me ask [18:25]
Kazis that slack public or?
assuming not
[18:26]
HCross2it isnt [18:27]
Kazwell, at least grabs are running consistently now
might be worth trying to start yours back up, lightly?
[18:30]
HCross2mine are fine
but being slow at deduping
[18:30]
Kazah, so they are [18:30]
HCross2Kaz: are you able to resolve p3.qpic.cn from OVH? [18:30]
Kazwpull from source, or the zip? [18:31]
HCross2I upgraded after [18:31]
Kazah
that resolves fine for me, cname to p.qpic.cn
[18:31]
HCross2Kaz: me thinks OVH CA DNS is doing some fun things then [18:32]
arkiver: we need an idea of what sort of rate limiting the CDX API does
and then probably get as close to it as we can
especially with the fact most of our requests have to cross an ocean and a continent
[18:37]
Kaz: arkiver ive had an idea that may help things... some sort of proxy server maybe in NYC that we hit which forwards our requests to the IA [18:43]
KazWould work nicely if IA lets it hit as hard as we want to [18:43]
jrwrHCross2: wanna see something neat, go login to the wiki, added a system for reviewing edits so we can mark pages that need work [18:44]
HCross2hm. seems every IA IP no longer responds to ping [18:44]
jrwrIA Crapped out?
Must be trump
[18:44]
Kazhttps://www.irccloud.com/pastebin/6sOzrbEJ/
dedupe still looking fine
I can ping 207.241.224.2 just fine
[18:47]
HCross2Im investigating a proxy atm.. now to find a provider. M247 NYC is out.. 100ms to cross the states [18:53]
jrwrWhat kind of proxy do you need [18:57]
HCross2just something to take the edge off the transatlantic journey that requests to the IA API make [18:57]
jrwrOH
Well
I've got VPNs out the ass
[18:57]
HCross2I want something we can run on a VPS somewhere and then point the grabbers at [18:58]
jrwrRight
How much BW do you need
I might be able to bridge one up
[18:58]
HCross2not sure [18:58]
jrwr200,400,500,1000 Mbit? [18:58]
HCross2a lot less [18:59]
jrwrOh [18:59]
Kazwe need to be able to proxy this: https://github.com/ArchiveTeam/NewsGrabber-Warrior/blob/master/pipeline.py#L164 [18:59]
jrwrAh
Ok
[18:59]
HCross2its literally just flinging json around [18:59]
jrwrcan I get a full example
I might be able to get it down to 75-50ms from Paris to NYC
But its going to be wonky
Ill have to use an off port for it
[19:00]
HCross2and then whatever the hop to the IA on the west coast is [19:00]
jrwrthats about 30ms [19:01]
HCross2to put things into contrast. I get 140ms to my server in LA from London [19:01]
jrwrSo
Im getting 75 from my seedbox out in France
[19:02]
HCross2jrwr: just tested that to my server in Psychz LA from London I get 140ms.. and then another 9ms down to the IA
so that may be a potential
[19:03]
jrwrI get a return time of 0.10s off curl to time curl web.archive.org [19:03]
HCross2so ive got 149ms
Kaz: what is ping from your box to georgina.harrycross.me like?
[19:03]
jrwrhttps://hastebin.com/safelusige.css
here are my response times
[19:04]
HCross2plus weve got to add whatever it takes the IA to give us an answer [19:05]
jrwr153 ms
my pastebin is a full time + response time
[19:05]
HCross2ahh, we'll be making API calls so it may take a tad longer
seattles-best-gw.ip4.gtt.net is a thing :p
[19:05]
KazHCross2: 136ms from RBX2 [19:05]
HCross2nice, so under 150ms still [19:05]
jrwrI get about 156Mx
ms
ldn -> nyk -> las
I get 83ms
to LAS
then it jumps HARD
las-b22-link.telia.net (62.115.121.220) 149.746 ms las-b22-link.telia.net (62.115.134.85) 149.740 ms ash-bb3-link.telia.net (62.115.141.244) 83.270 ms
ae9.telia.lax.us.AS40676.net (23.238.223.13) 158.242 ms 153.228 ms 158.086 ms
dat shitty route
I know 250ms adds up
but I think most of the time will be spent in the API
[19:06]
HCross2premium Psychz networks [19:09]
jrwrWhos? [19:14]
HCross2jrwr: https://www.psychz.net [19:14]
jrwrhave you seen OVHs Weathermap [19:14]
HCross2yep
Kaz: opened the taps on the tracker
[19:15]
Kazhow badly is the ia ratelimiting hurting you? [19:19]
HCross2ive got 30x2 here now, but not enough concurrent to fill my slots
so ive not been limited yet
im waiting for the slots to fill a tad more
aannnd there we go
Kaz: its throttling my inbound down to around 20Mbit
[19:19]
Kazwe need to work out how hard we can hit IA before we try to proxy anything, i guess [19:29]
HCross2I'm going to see if there's a way we can get a higher rate limit
Kaz: the IA have a capacity of 960 concurrent
[19:32]
Kazhmm
what about per-ip limits?
i was getting hit very hard at 30 concurrent
[19:35]
jrwrI can proxy requests and slap the frontend in front of cloudflare for better routing
I can force some major pipelining in nginx
[19:36]
Kazcloudflare might end up being an issue if we hit too hard
HCross2: it feels like the limit is 1 per ip, concurrent
even at 5 concurrent per IP i'm getting killed
[19:38]
HCross2Kaz: ive spoken to our IA contact, hes going to get me together with a few others tomorrow to work on a way to deal with this [19:40]
Kazperfect [19:41]
HCross2Im thinking some sort of proxy to make the network easier, and then some sort of key authentication
Kaz: ive also got access to some stats on how the dedupe servers are holding up now
[19:41]
Kazooh
if we can get a proxy with no limits on the IP, I think that's probably the ideal solution for us
[19:42]
HCross2Many.. many... many... many charts, lots of pretty colours [19:42]
Kazis this hidden somewhere in the depths of the IA munin/cacti or whatever they use [19:43]
HCross2not too sure [19:43]
jrwrI can provide an off-port proxy that has a 1Gbps port on it for this project [19:47]
HCross2Kaz: its not an issue of being rate limited [19:54]
Kazoh? [19:55]
HCross2its more an issue of "the IA can only deal with 960 concurrent dedupe requests and they got more than that" [19:55]
jrwrthats what Im doing
Im doing some proxy_pass caching
full URL of course
10m timeouts
[19:55]
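A hypothetical reconstruction of what that nginx setup could look like, based only on details given in this session (the full-query-URL cache key, 10-minute validity as one reading of the "10m timeouts", the 10GB cache and port 4444 mentioned below, and the 100-connection keepalive cap jrwr sets later); paths, zone names, and sizes not stated in the chat are guesses:

```nginx
# Sketch of a caching proxy in front of the Wayback CDX API.
proxy_cache_path /var/cache/nginx/cdx levels=1:2 keys_zone=cdx:64m max_size=10g;

upstream wayback {
    server web.archive.org:443;
    keepalive 100;  # cap of ~100 kept-alive backend connections
}

server {
    listen 4444;

    location / {
        proxy_cache       cdx;
        proxy_cache_key   $scheme$host$request_uri;  # full query URL
        proxy_cache_valid 200 10m;   # one reading of the "10m timeouts"
        proxy_http_version 1.1;      # needed for backend keepalive
        proxy_set_header  Connection "";
        proxy_set_header  Host web.archive.org;
        proxy_ssl_name    web.archive.org;
        proxy_ssl_server_name on;
        proxy_pass        https://wayback;
    }
}
```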
Kazhmm, are they running close to capacity at the moment then?
weird that 1 concurrent is perfectly fine, but 15 breaks it
in the grand scheme of things, 15 isn't a lot
[19:56]
HCross2Kaz: it depends which dedupe server youre hitting too
as there are a set of them, and each one can only do 90
[19:57]
Kazah [19:58]
it'd be nice if the IA deduped as part of the derive when things got uploaded
would solve all these issues
[20:05]
jrwrhttp://163.172.128.219:4444/
its a direct proxy to Wayback
has 10GB of response caching (full query URL)
[20:10]
Kazany idea how it'll respond if we get a connection reset from IA? [20:18]
HCross2Id like to have a chat with the IA before we use it, as I want to understand a bit more about the setup and how they route requests etc [20:19]
jrwrright
I see someone is testing it right now
[20:21]
Kazgah, OVH dns is being fucky [20:21]
HCross2jrwr: is it a *.virginm.net IP? [20:21]
KazI tested to see what would happen, it kills the connection like IA does [20:21]
jrwr94.23.45.xxx [20:21]
Kazswitching dns over to google [20:22]
HCross2ahh [20:22]
jrwrIm getting good response times
im recording times in my logs
"18/Jun/2017:20:22:32 +0000" client=94.23.45.xxx method=GET request="GET /cdx/search/cdx?url=https%3A%2F%2Fwww.washingtonpost.com%2Fpb%2Fgr%2Fc%2Fdefault%2FrWrxHh1cRcIJmq%2Fheadjs%2F70b9918770.js%3F_%3D64058&output=json&matchType=exact&limit=1&filter=digest:XGAQKM322OOCC4WFEYI2K6SLDYGYETQN HTTP/1.1" request_length=401 status=200 bytes_sent=316 body_bytes_sent=34 referer=- user_agent="python-requests/2.4.3
CPython/2.7.9 Linux/3.16.0-4-amd64" upstream_addr=207.241.225.186:443 upstream_status=200 request_time=0.655 upstream_response_time=0.655 upstream_connect_time=0.308 upstream_header_time=0.655
pipelining is working, I've only got one active TCP connection to IA
[20:22]
Kazit *feels* snappier, but I have absolutely no data to back that up [20:23]
jrwrwell for one, I know pythons SSL is shit
Now I have limited the max connections to the backend to 100 keepalive connections
[20:24]
upstream_addr=207.241.225.186:443 upstream_status=200 request_time=0.495 upstream_response_time=0.495 upstream_connect_time=0.295 upstream_header_time=0.495
Not too bad
Well, Ill keep the proxy alive, let me know if you need anything else, Im going to poke the wiki some more
[20:38]
So HCross2 I've got the review system in place, it will help keep the services in order [20:54]
arkiverhi I'm back
anything need attention?
how's this going?
jrwr: AWESOME revision box :D
trying it out now
does it change status automatically?
[21:07]
HCross2arkiver: wpull issues
and the fact that checking for duplicates takes a very long time as it does 1 at a time - but the IA has some capacity issues sometimes
[21:10]
arkiverah
yeah
[21:10]
HCross2see my discussion with mark [21:11]
..... (idle for 24mn)
***logchfoo0 starts logging #newsgrabber at Sun Jun 18 21:35:17 2017
logchfoo0 has joined #newsgrabber
[21:35]
arkiveryep
do you think we can add https://www.mediawiki.org/wiki/Extension:Graph ?
maybe we can update the services with a new graph on discovered URLs once a month or so
nice examples https://www.mediawiki.org/wiki/Extension:Graph/Demo
[21:36]
jrwrYa
Oh!
Look into Gadgets arkiver
those are much better methods
[21:37]
arkiverthis? https://www.mediawiki.org/wiki/Extension:Gadgets [21:39]
jrwrYa
they are really templates with some power behind them
[21:39]
.... (idle for 19mn)
***logchfoo1 starts logging #newsgrabber at Sun Jun 18 21:58:57 2017
logchfoo1 has joined #newsgrabber
[21:58]
