#newsgrabber 2017-06-22, Thu

*** luckcolor has joined #newsgrabber [00:54]
(idle for 2h1mn)
*** Crusher has joined #newsgrabber
*** Crusher_ has quit IRC (Read error: Connection reset by peer) [02:55]
(idle for 9h43mn)
*** Crusher has quit IRC (Ping timeout: 492 seconds) [12:38]
(idle for 3h4mn)
*** Crusher has joined #newsgrabber [15:42]
(idle for 2h27mn)
*** Crusher_ has joined #newsgrabber
*** Crusher has quit IRC (Bye) [18:09]
(idle for 1h15mn)
<Crusher_> I've got Ubuntu Server up and running; the script is asking me to install or update the requests module
<Crusher_> I'm probably missing something kinda-sorta important xD [19:24]
<JAA> pip install requests [19:24]
<Crusher_> Oh... Well that's easy [19:25]
<JAA> Or if you want to install it in your home directory rather than system-wide, pip install --user requests [19:25]
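A quick way to see which copy of requests a script will actually pick up, useful for telling the system-wide and --user cases apart (a minimal sketch; the ~/.local path in the comment assumes a typical Linux setup):

    # Confirm requests is importable and see where it was installed;
    # a `pip install --user requests` typically lands under ~/.local/lib.
    import requests

    print(requests.__version__)  # the module is available at all
    print(requests.__file__)     # path reveals system-wide vs per-user install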
<Crusher_> This system is only for archiving, so that's fine
<Crusher_> My current desktop is, well, the box... [19:29]
*** Crusher has joined #newsgrabber
*** Crusher_ has quit IRC (Read error: Connection reset by peer)
*** Crusher_ has joined #newsgrabber
*** Crusher has quit IRC (Read error: Connection reset by peer)
*** Crusher has joined #newsgrabber
*** Crusher_ has quit IRC (Read error: Connection reset by peer) [19:30]
<Crusher> Gah
<Crusher> Behave, Wi-Fi, before I use my desktop client again. [19:33]
*** JAA has quit IRC (leaving)
*** JAA has joined #newsgrabber [19:40]
(idle for 33mn)
<Crusher> Is there a maximum size for the WARC files?
<Crusher> The ones generated by the client, not the megawarcs
<Crusher> Or, better question: do the scripts know not to exceed the physically available space? [20:14]
<JAA> I doubt it. [20:17]
<Crusher> I'm just looking at my poor HDD being murdered by all the tiny I/O requests
<Crusher> While CPU, net I/O and memory aren't even sweating [20:21]
(idle for 28mn)
*** kyan has joined #newsgrabber [20:50]
(idle for 44mn)
*** Crusher_ has joined #newsgrabber [21:34]
<Crusher_> Question: where's the part of the script controlling the concurrency limit of 20 [21:35]
<arkiver> not sure
<arkiver> why? [21:35]
<JAA> Somewhere in seesaw, I assume. [21:35]
*** Crusher has quit IRC (Read error: Connection reset by peer)
*** Crusher has joined #newsgrabber [21:36]
<JAA> But I doubt that running more than 20 jobs on a single IP is a good idea, so... [21:36]
<arkiver> yeah [21:36]
*** Crusher_ has quit IRC (Read error: Connection reset by peer) [21:36]
<arkiver> feel free to make a PR, but don't run edited scripts on 'real' projects [21:36]
<Crusher> :P [21:37]
<arkiver> also, we've got quite a bit of stuff now, so I'm going to start the megaWARC processing [21:37]
<Crusher> Alright, I'll leave it at 20
<Crusher> I guess I'll run 20 news threads at the same time [21:40]
<JAA> Crusher: For the record, on almost all projects, we're limited by either the service itself (insufficient server power) or their rate limiting. You'll basically never manage to saturate your machine/network connection with jobs. [21:41]
<Crusher> Eroshare was probably an exception [21:41]
<JAA> Possible; I didn't participate in that grab because of the item size. [21:41]
<Crusher> The only limit there was my HDD [21:41]
<JAA> One exception that comes to mind was MLKSHK when we were grabbing post contents from their CDN, Fastly, which has immense amounts of bandwidth/server power. [21:42]
<arkiver> currently we're limited by the Wayback Machine :P
<arkiver> the CDX API [21:42]
<JAA> Right, but that's a special case for this project. :-P [21:42]
<Kaz> <Crusher_> Question: where's the part of the script controlling the concurrency limit of 20
<Kaz> it's somewhere in (off the top of my head) /usr/local/lib/python2.7/dist-packages/seesaw/
<Kaz> don't change it - there are lots of very good reasons it's there. You should never be doing 20 per host anyway; things get messy. 2x10 is faster than 1x20
<Kaz> HCross2: are you around? Would be interesting to see how dedupe is holding up [21:43]
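For context, a hedged sketch of where that knob lives: in seesaw-kit the per-process concurrency is a parameter of the runner (and, if memory serves, the --concurrent flag of run-pipeline), not of the project script. The Noop task and the exact SimpleRunner signature below are assumptions for illustration, not newsgrabber code:

    # Sketch only: per Kaz's "2x10 is faster than 1x20" advice, run two
    # processes with concurrent_items=10 each instead of editing the cap.
    from seesaw.pipeline import Pipeline
    from seesaw.runner import SimpleRunner
    from seesaw.task import SimpleTask

    class Noop(SimpleTask):
        # trivial placeholder task, just for illustration
        def __init__(self):
            SimpleTask.__init__(self, "Noop")

        def process(self, item):
            pass

    pipeline = Pipeline(Noop())
    runner = SimpleRunner(pipeline, concurrent_items=10)  # the limit in question
    runner.start()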
*** Crusher_ has joined #newsgrabber [21:47]
<Crusher_> Alright then, but put me on the shortlist for high-bandwidth projects :P
<Crusher_> I wonder how many threads of newsgrabber I can get away with... [21:47]
<Kaz> the only limit you'll run into is either a) your own, or b) IA's dedupe [21:49]
<Crusher_> Could you please explain b? [21:49]
<Kaz> every URL you grab is checked against IA's collection [21:50]
*** Crusher has quit IRC (Ping timeout: 492 seconds) [21:50]
<Crusher_> Oh gotcha [21:50]
<Kaz> if it's already there, we don't save the file itself, just a pointer to the version IA has [21:50]
<Crusher_> So either a) saturate myself
<Crusher_> Or b) saturate IA
<Crusher_> Given this is a single i5 on a 300 Mbit line, I think I'll lose first [21:50]
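In practice the check Kaz describes boils down to one CDX query per URL; a sketch with python-requests, modeled on the query format jrwr pastes later in this log (the function name is made up, and the digest argument is the base32 SHA-1 the grabber computes for the payload):

    import requests

    CDX_API = "http://web.archive.org/cdx/search/cdx"

    def already_archived(url, digest):
        """True if IA already holds a capture of url with this payload digest."""
        params = {
            "url": url,
            "output": "json",       # first row of the JSON response is a header
            "matchType": "exact",
            "limit": "1",
            "filter": "digest:" + digest,
        }
        r = requests.get(CDX_API, params=params, timeout=30)
        r.raise_for_status()
        rows = r.json()
        # Any row beyond the header means IA has it, so the grabber can write
        # a small revisit record instead of storing the body again.
        return len(rows) > 1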
<Kaz> we pushed IA over capacity with like 60 concurrent the other day, because it was heavily loaded [21:52]
<Crusher_> Whoops... [21:52]
<Crusher_> So at what point is it more practical to start shipping hard drives? :P [21:57]
*** Crusher_ has quit IRC (Bye)
*** Crusher has joined #newsgrabber [21:58]
<JAA> If I understand it correctly, this is not a bandwidth issue. IA has a *massive* database of URLs which has to be searched for every page in this project.
<JAA> So the better question is: how much do you donate to the IA? :-P [22:01]
(idle for 16mn)
<jrwr> Kaz: You should use my proxy
<jrwr> see if that breaks IA
<jrwr> :) [22:19]
<Kaz> I am [22:19]
<jrwr> Oh
<jrwr> NICE [22:19]
<Kaz> yours caches right? forgot to ask the other day [22:19]
<jrwr> How is it holding up
<jrwr> Ya
<jrwr> Full Query [22:19]
<Kaz> seems fine so far
<Kaz> there's this set of washingtonpost assets that FLY through every time I hit them. Like 20-30 URLs just fly [22:19]
<jrwr> Ya
<jrwr> I figured that would be the use case it would work for
<jrwr> since those never really change much [22:21]
<Kaz> so yeah, definitely working perfectly for me so far [22:21]
<arkiver> very nice jrwr [22:22]
<jrwr> Thanks arkiver [22:22]
<arkiver> maybe we can add it to the script to be used by everyone? [22:22]
<jrwr> I would be fine with that [22:22]
<arkiver> does this work with a single connection?
<arkiver> or are you still making multiple [22:22]
<jrwr> my little Scaleway can take it!
<jrwr> it works by pipelining and caching requests to IA [22:22]
<arkiver> cool
<arkiver> any limits on the number of requests? [22:23]
<jrwr> none so far [22:23]
<arkiver> nice [22:23]
<jrwr> it does have a hard limit of 90 connections to IA [22:23]
*** kyan has quit IRC (Remote host closed the connection) [22:23]
<jrwr> but it will return 500s for that [22:23]
<arkiver> what URL should we use for it?
<arkiver> ah yes [22:23]
<jrwr> but it has never hit it
<jrwr> it's currently at 2 connections right now [22:23]
<arkiver> sounds good [22:24]
<jrwr> it's the same URL but http://jrwr.io:4444/
<jrwr> http://jrwr.io:4444//cdx/search/cdx?url=http%3A%2F%2Fwww.ctvnews.ca%2Fpolopoly_fs%2F7.663650%21%2FhttpImage%2Fimage.png&output=json&matchType=exact&limit=1&filter=digest:2HHZX4PXAGD3XU72DOBC7DUVUS7PFG7G [22:24]
<arkiver> testing it now [22:25]
<jrwr> cached requests show up like this in my logs
<jrwr> "22/Jun/2017:22:26:23 +0000" client=94.23.45.204 method=GET request="GET /cdx/search/cdx?url=https%3A%2F%2Fwww.washingtonpost.com%2Fnotification-sw.js&output=json&matchType=exact&limit=1&filter=digest:SHS74LA7VAIFW4DQP67RTLEKEL2XPFAD HTTP/1.1" request_length=344 status=200 bytes_sent=513 body_bytes_sent=230 referer=- user_agent="python-requests/2.4.3 CPython/2.7.9 Linux/3.16.0-4-amd64" upstream_addr=- upstream_status=- request_time=0.000 upstream_response_time=- upstream_connect_time=- upstream_header_time=- [22:26]
<arkiver> it is definitely faster...
<arkiver> very nice [22:26]
<jrwr> so if we wanted to graph it we could
<jrwr> "22/Jun/2017:22:27:21 +0000" client=94.23.45.204 method=GET request="GET /cdx/search/cdx?url=http%3A%2F%2Fwww.cameroonpostline.com%2Fwp-content%2Fthemes%2Flocalnews%2Ffunctions%2Fshortcodes-ultimate%2Fimages%2Flist-style-link.png&output=json&matchType=exact&limit=1&filter=digest:HDLEBJFVTAPGVTURYDFMENY4YS7ZQMGP HTTP/1.1" request_length=423 status=200 bytes_sent=548 body_bytes_sent=266 referer=- user_agent="python-requests/2.4.3 CPython/2.7.9 Linux/3.16.0-4-amd64" upstream_addr=207.241.225.186:443 upstream_status=200 request_time=0.505 upstream_response_time=0.505 upstream_connect_time=0.294 upstream_header_time=0.505 [22:27]
<arkiver> that'd be nice, if it's not too hard
<arkiver> or taking too much time [22:27]
<jrwr> Timeouts are set to 30s
<jrwr> 72h inactive cache
<jrwr> with 10G store
<jrwr> I am sending proxy_set_header X-Real-IP $remote_addr; [22:28]
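Pieced together, that description maps onto an nginx config roughly like the following; only the figures and the X-Real-IP line come from the conversation, so treat the zone name, paths, and everything else as guesses rather than jrwr's actual setup:

    # Hedged reconstruction, not jrwr's real config.
    proxy_cache_path /var/cache/nginx/iacdx levels=1:2 keys_zone=iacdx:10m
                     max_size=10g inactive=72h;     # "72h inactive cache", "10G store"

    server {
        listen 4444;

        location /cdx/ {
            proxy_pass https://web.archive.org;
            proxy_cache iacdx;
            proxy_read_timeout 30s;                  # "Timeouts are set to 30s"
            proxy_connect_timeout 30s;
            proxy_set_header Host web.archive.org;
            proxy_set_header X-Real-IP $remote_addr; # quoted verbatim above
        }
    }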
<arkiver> very nice
<arkiver> it's really a lot faster... [22:29]
<jrwr> so the remote IP is reported to IA
<jrwr> :)
<jrwr> that is because 2/3rds of the requests are coming out of the cache [22:29]
<arkiver> I see, yeah
<arkiver> I'm going to put this in the repo
<arkiver> and will make it the new default minimum version [22:29]
<jrwr> I'll work on graphing it
<jrwr> I'll just do some PHP + RRD graphs [22:30]
<arkiver> 20170623.01 is now the minimum
<arkiver> with your server in it [22:31]
<jrwr> Cool
<jrwr> Good ol' NGINX
<jrwr> fucking love the old bird [22:32]
<Kaz> HCross2 has 32k claims
<Kaz> hmm [22:35]
<arkiver> haha it's so fast
<arkiver> all the static stuff is going really fast [22:36]
<Kaz> it absolutely flies if you hit the cache [22:36]
<arkiver> yep
<arkiver> just saw 100 nrk.no static URLs go by in a second [22:37]
<jrwr> :)
<jrwr> https://jrwr.io/nginx_status [22:39]
<jrwr> DAT TRAFFIC [22:46]
<Kaz> how's it holding up? [22:47]
<jrwr> good
<jrwr> It's got 200 Mbit/s of play
<jrwr> currently doing 120 KiB/s right now [22:47]
<Kaz> ah, easy then
<Kaz> can you see how many connections are killed by IA easily?
<Kaz> if we hit the limit [22:50]
<jrwr> I can
<jrwr> I'll grep for error codes [22:52]
<Kaz> I'm about to head off for the night; I'll scale up a ton tomorrow and keep an eye on processes, to see how things go [22:54]
<jrwr> I'm getting some 403s already
<jrwr> org.archive.util.io.RuntimeIOException: org.archive.wayback.exception.RobotAccessControlException: Blocked By Robots
<jrwr> wtf is this
<jrwr> /cdx/search/cdx?url=http%3A%2F%2Fwww.lfpress.com%2Fevents%2F832377&output=json&matchType=exact&limit=1&filter=digest:Q3VAWCUCHQYRE2KHHBSTJKS4KLFHBLGX [22:54]
<JAA> Probably because lfpress.com/robots.txt blocks ia_archiver? [22:57]
<jrwr> Ya
<jrwr> That's what it is
<jrwr> I just looked over OpenWayback [22:57]
<Kaz> when we hit the limit, IA just kills connections, so it'll be worth watching out for [22:58]
<JAA> Why can't the IA just get rid of that fucking robots.txt policy already? [22:58]
<Kaz> arkiver: might be worth making the script retry if it fails to dedupe through the IA
<Kaz> as it is at the moment, one failure = failed job [22:59]
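One possible shape for that retry, reusing the already_archived() lookup sketched earlier in this log; the attempt count and backoff are arbitrary choices, not anything the project settled on:

    import time

    import requests

    def dedupe_with_retry(url, digest, attempts=3):
        for attempt in range(attempts):
            try:
                return already_archived(url, digest)
            except requests.RequestException:
                if attempt == attempts - 1:
                    raise                 # out of retries: the job fails as before
                time.sleep(2 ** attempt)  # simple exponential backoff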
<jrwr> oh, I'm getting 502s
<jrwr> it was just a few
<jrwr> tail -f /var/log/nginx/proxy.log | grep -v upstream_status=200 | grep -v upstream_status=- | grep -v upstream_status=403 [22:59]
<jrwr> Kaz, JAA, arkiver: someone's worker is stuck in a loop!
<jrwr> https%3A%2F%2Fopen.http.mp.streamamg.com%2Fhtml5%2Fhtml5lib%2Fv2.42%2FmwEmbedLoader.php%2Fp%2Fresources%2Fuiconf_id%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources%2Fjquery%2Fresources
<jrwr> 94.23.45.XXX
<jrwr> it's like 90 folders deep atm [23:14]
<Kaz> Feels like me
<Kaz> Gah [23:15]
<jrwr> it's getting bigger! [23:15]
<Kaz> What does rbx2.kurt.gg resolve to, please?
<Kaz> Away from PC atm [23:15]
<JAA> 94.23.45.204 [23:16]
<jrwr> Yep!
<jrwr> That's you [23:16]
<Kaz> Bugger [23:16]
<jrwr> Your poor poor worker [23:16]
<Kaz> ah yes https://s.kurt.gg/7bScu9k.png
<Kaz> it seems to have stopped trying to dedupe anything [23:17]
<JAA> Beautiful. [23:18]
<Kaz> I'm just going to let it run, on the basis it'll eventually end itself [23:19]
<JAA> Would a generic ignore pattern like (/.*){4,} be a good idea? [23:19]
<jrwr> Maybe
<jrwr> to prevent infinite recursion [23:19]
<JAA> Yeah [23:20]
<jrwr> up it to 10
<jrwr> maybe 15, you never know, it IS the internet [23:20]
<JAA> Yup
<JAA> Possibly needs some other tweaks as well. I'm not sure how URLs with consecutive slashes are handled, for example. [23:20]
<jrwr> like
<jrwr> help.com/http://help.com/ [23:21]
<JAA> https://example/news///////something [23:22]
<jrwr> URLs in URLs are common
<jrwr> or base64 in a URL
<jrwr> that has slashes [23:22]
<JAA> Oh right.
<JAA> (/[^/].*){10,} or something like that then [23:22]
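Checked against the examples from this very conversation, the refined pattern behaves as intended while the naive one misfires on consecutive slashes (a quick demo; the paths are taken from the URLs above):

    import re

    naive = re.compile(r"(/.*){4,}")         # JAA's first suggestion
    refined = re.compile(r"(/[^/].*){10,}")  # with jrwr's threshold and the [^/] fix

    # The stuck streamamg.com worker: dozens of /resources/jquery segments deep.
    looping = "/html5/html5lib/v2.42/mwEmbedLoader.php" + "/resources/jquery" * 20
    assert refined.search(looping)

    # JAA's consecutive-slash example: the naive pattern counts each bare "/"
    # as a repetition; the refined one requires a real path segment.
    slashes = "/news///////something"
    assert naive.search(slashes)
    assert not refined.search(slashes)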
*** Crusher has quit IRC (Bye)
*** Crusher has joined #newsgrabber [23:32]
