#newsgrabber 2017-07-07,Fri




***MrRadar_ is now known as MrRadr
MrRadr is now known as MrRadar
[01:43]
...................... (idle for 1h47mn)
SmileyG has joined #newsgrabber
Smiley has quit IRC (Read error: Connection reset by peer)
[03:30]
kyan has quit IRC (Ping timeout: 370 seconds)
kyan has joined #newsgrabber
[03:37]
kyan has quit IRC (Ping timeout: 370 seconds) [03:44]
kyan has joined #newsgrabber [03:55]
Aranje has quit IRC (Ping timeout: 245 seconds) [04:04]
............................... (idle for 2h31mn)
luckcolor has quit IRC (Read error: Operation timed out)
luckcolor has joined #newsgrabber
[06:35]
..................... (idle for 1h41mn)
kyan has quit IRC (Ping timeout: 370 seconds)
kyan has joined #newsgrabber
[08:19]
............................ (idle for 2h15mn)
kyan has quit IRC (Ping timeout: 370 seconds) [10:34]
.... (idle for 17mn)
HCross2arkiver: you around? [10:51]
............. (idle for 1h4mn)
***stns4 has joined #newsgrabber [11:55]
.... (idle for 19mn)
gk_1wm_su has joined #newsgrabber
stns4_ has joined #newsgrabber
gk_1wm_su has left
stns4 has quit IRC (Read error: Operation timed out)
[12:14]
ErkDog has joined #newsgrabber [12:27]
................................................................................ (idle for 6h35mn)
kyan has joined #newsgrabber [19:02]
..... (idle for 23mn)
kyan has quit IRC (Ping timeout: 268 seconds) [19:25]
.... (idle for 19mn)
kyan has joined #newsgrabber [19:44]
..................................... (idle for 3h0mn)
arkiverHCross2: yes :) [22:44]
HCross2I'm noticing hardly any inbound traffic on the server at the moment, can you please check that all is well? [22:45]
arkiveryes
probably deduping is just really slow
let's get the internal deduplication going
we only need some way for the warrior to contact some server that returns whether a URL with a certain hash was already downloaded
and I'm not sure how to set that up
I can extract and deliver information to where we want to distribute it from
[22:47]
jrwrarkiver: I can
I can do Science!
[22:56]
arkiverwooh! [22:56]
jrwrshould be a little tracker, have it store the details in a memcache that has a TTL [22:56]
arkiverI think so [22:56]
jrwrso send a sha1 of the url and i can return if its within the TTL
what kind of TTL are we looking at
12/24 hours?
[22:56]
arkivernot that probably [22:57]
jrwra hour or two [22:57]
arkiverBut the idea is
We extract data from the WARCs on the main server
included in this data is the URL and the hash (and other stuff)
this data is saved in the other place (maybe same main server?) where we want to have the database for deduplication
for each URL the warrior makes a request to this place with the URL and the hash
[22:57]
jrwrwell [22:59]
arkiverand then all the data that has this same URL and hash in it is returned [22:59]
jrwrHrm
so our own DeDupe
are we doing 1:1 matching on the urls?
[22:59]
arkiveryes
and the hash
[23:00]
jrwrI suggest something like
get.php?url=sha1(URL)&hash=sha1(FILE)
to keep the keys a little more searchable in a database
[23:00]
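jrwr's get.php idea above could be sketched like this (illustrative only: the endpoint host is made up, and the `url`/`hash` parameter names come straight from his message):

```python
import hashlib
from urllib.parse import urlencode

def build_lookup_url(endpoint: str, url: str, payload: bytes) -> str:
    """Build a dedupe lookup URL in the get.php?url=sha1(URL)&hash=sha1(FILE)
    shape jrwr sketches; the endpoint host is a placeholder, not a real tracker."""
    params = {
        "url": hashlib.sha1(url.encode("utf-8")).hexdigest(),   # sha1 of the URL
        "hash": hashlib.sha1(payload).hexdigest(),              # sha1 of the file body
    }
    return endpoint + "?" + urlencode(params)
```

Hashing the URL (rather than sending it raw) keeps the keys fixed-width, which is jrwr's stated reason: "to keep the keys a little more searchable in a database".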
arkiverright
might use sha256 for the url
[23:01]
jrwrthat would be fine
what kind of TTL?
for the I have the file flag
[23:02]
arkiverI've been thinking about URL agnostic deduplication, but since sha1 is cracked that seems dangerous [23:02]
jrwrsha256 is fine [23:02]
arkivercool [23:02]
jrwrI can make something and host it
Ill dedicated a scaleway VM to it
[23:02]
arkiverat some point we should all switch to sha256, but that would require a massive parse of all WARC in IA [23:03]
jrwrYa [23:03]
arkiverhow heavy do you think it will be on the system? [23:03]
jrwrNot very
if the TTL is small (A day) then we can keep most of it in ram
[23:03]
arkivermaybe we can run it on our main server where the WARCs come in and are uploaded to IA from [23:03]
jrwrWe can
I can also run it on the wiki VM
[23:03]
arkiver1 day is too short
we need quite a bit longer
[23:04]
jrwrI've designed caches before
How many keys?
5M? 10M?
[23:04]
arkiverthat is as in records? [23:05]
jrwrYes [23:05]
arkiverwould 100M+ be possible? [23:05]
JAAHaving it on the main server would probably be best, since the handling of new records could then be done entirely on that machine, rather than transferring it to somewhere else. [23:05]
jrwrUm
JAA: Thats doable
100M+ would be doable
[23:05]
JAAIf the server has the capacity, of course. [23:05]
arkiverok [23:05]
jrwrSince its a simple flag we are storing
KISS
[23:06]
arkiverwell a flag and the actual data that needs to be returned [23:06]
jrwrwhat data? [23:06]
JAAA list of hashes, really [23:06]
arkiverof the record we are deduplicating from [23:06]
jrwrah
So a simple "We Got it from X Record"
or its from X Date
if its just X Timestamp thats easy, I do not suggest much more
[23:06]
arkiverit's optional but I'd like to include the record ID as well [23:07]
jrwrI've saved a billion records for a mining pool before [23:07]
arkiverso that would be [23:07]
jrwrarkiver: thats doable
SHA256:TIMESTAMP:RECORDID
[23:07]
arkiver"we got this from URL, at date with record ID
"
hmm
[23:07]
jrwrwell it knows the URL already (the record ID) [23:08]
arkiveractually it should ignore the protocol in the URLs [23:09]
jrwrya, I would [23:09]
arkivercurrently we're sending the full URLs to the wayback API, not a hash [23:09]
jrwrYa
I suggest hash due to size
[23:09]
arkiverI'm sometimes a little concerned we'll get a collision with URLs
with the hashes*
but I guess that is just a very small chance
right
[23:09]
jrwrSHA256 [23:10]
arkiveryeah, I know [23:10]
jrwrHA, We would win a fucking Google Bug Bounty if we did
I think SHA1 would be fine for this
[23:10]
arkivernah, that's cracked
never know what funny URLs people put on the internet now
[23:10]
jrwrI /guess/ [23:11]
arkiverso that would be
SHA256:TIMESTAMP:RECORDID:PROTOCOL
[23:11]
jrwrThats fine [23:11]
arkiverwhere protocol can just be an s or no s [23:11]
jrwrso a 1 or 0 [23:11]
arkiversure [23:12]
JAAWhy store the protocol? You could strip that before hashing the URL. [23:12]
jrwrI would [23:12]
arkiveryes, but
hold on
[23:12]
jrwrNow, A fun way to do this is nginx+memcache
let the web server handle the lookups
[23:13]
arkiverit's not yet in the official specification
but
https://github.com/iipc/openwayback/wiki/How-OpenWayback-handles-revisit-records-in-WARC-files
proposes a WARC-Refers-To-Target-URI and a WARC-Refers-To-Date
IA is using those
[23:14]
***stns4_ has quit IRC (Remote host closed the connection) [23:14]
arkiverand I'd like to add those as much as is possible [23:14]
***stns4_ has joined #newsgrabber [23:15]
jrwrAnyway, How I plan to store it is Nginx+Redis (We will use redis as the key-store) [23:15]
arkiverI know we used couchdb for a very big database of vine videos [23:15]
jrwrhave the url format of domain.com/lkup/SHA256(URL):SHA256(FILE)
I've done stupid numbers in redis+nginx
[23:16]
arkiverFILE should be sha1, since that's what's currently being used [23:16]
jrwrthats fine
mostly its a direct interface to the redisDB readonly
in the form of a URL
[23:16]
arkiverrght
right
[23:17]
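jrwr's /lkup/ scheme, with arkiver's correction that the file hash stays sha1 and the earlier point about ignoring the protocol, might look like this (a sketch under those assumptions, not the actual service):

```python
import hashlib
import re

def lookup_path(url: str, payload_sha1_hex: str) -> str:
    """Build the /lkup/SHA256(URL):SHA1(FILE) path jrwr describes.
    The scheme is stripped before hashing, per arkiver's note that the
    protocol should be ignored; the FILE hash stays sha1 since that's
    what's currently in use."""
    bare = re.sub(r"^https?://", "", url)
    url_hash = hashlib.sha256(bare.encode("utf-8")).hexdigest()
    return f"/lkup/{url_hash}:{payload_sha1_hex}"
```

With the scheme stripped, the http and https versions of a URL map to the same key, which is what makes the protocol flag in the stored value worth keeping.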
jrwrso, how ever we store keys, it can read [23:17]
JAAIf we only need to detect duplicates, not whether a URL has changed, we could go even further and only store one hash. [23:17]
jrwrMATTER OF FACT
SHA215(URL + SHA1(FILE))
Bam! One key
[23:17]
arkiverI guess [23:17]
JAASHA215, love it. [23:17]
jrwrI typoed [23:17]
JAA:-P [23:17]
arkiverI'll go over this with the wayback team too, see if they think this is good [23:17]
jrwrYa
If we have the file sha and the url and wrap it all up into a single key
that makes this super easy mode
[23:18]
arkiveror we use sha512 :)
I guess 256 is good enough
[23:18]
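The single-combined-key variant JAA and jrwr land on (the "SHA215" above is jrwr's acknowledged typo for SHA256) could be sketched as:

```python
import hashlib

def single_key(url: str, payload: bytes) -> str:
    """One combined key: SHA256(URL + SHA1(FILE)), per jrwr's suggestion.
    A sketch only -- whether the URL is scheme-stripped first is left to
    the earlier discussion."""
    file_sha1 = hashlib.sha1(payload).hexdigest()
    return hashlib.sha256((url + file_sha1).encode("utf-8")).hexdigest()
```

The trade-off JAA names is that this only detects exact duplicates; with the two hashes folded into one key, you can no longer tell "URL seen with different content" apart from "URL never seen".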
jrwrI can have that setup in a day [23:18]
arkiversounds good
how do you see data ingestion happening?
[23:19]
jrwrBackend service dumping keys into the Redis [23:19]
arkiverok [23:19]
jrwrIt runs on the same machine as the RedisDB
have the service set a TTL that makes sense (A Month or something)
[23:19]
arkiverI'll have a second look at this tomorrow, see if we miss any WARC fields
but that sounds good
[23:20]
jrwrJAA: where is this main box, I can setup the front end right now [23:20]
arkiverTTL as in, deleted after a certain amount of time if it has not been requested in that time?
or just everything deleted after this time
[23:21]
JAAjrwr: I have no clue about NewsGrabber actually. [23:21]
jrwrThe service can refresh the TTL
the backend service I say
but if its not refreshed in X time, it is deleted
I cannot do last access with the frontend the way it is
[23:21]
JAAI'm following it because I want to contribute various local news outlets and resources. I know nothing about the internal design, who runs what, etc. [23:22]
arkiverrefreshed as in data overwritten or requested by a warrior machine/
?
[23:22]
jrwrRefreshed as the backend services scraping warcs says Hay! You are important, Stick around for another month [23:23]
arkiverI see [23:23]
jrwrWe can move that all to a external service as well
so Redis doesn't expire keys, something else does
[23:23]
arkiverso in the WARCs we could search for revisit records and let the database know about that [23:24]
jrwrya [23:24]
arkiverthen reset the timer since the data was used again to deduplicate
ok, good
[23:24]
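The ingest-and-refresh behaviour discussed here can be mimicked with a toy store (a plain dict standing in for Redis; a real deployment would use Redis SET with EX for the TTL and reset it on refresh, as jrwr describes):

```python
import time

MONTH = 30 * 24 * 3600  # the "a month or something" TTL from the discussion

class TTLStore:
    """Toy key-value store with expiring entries, mimicking Redis SET ... EX.
    Illustrative only; not the actual backend."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ex=MONTH):
        # Store the value with an absolute expiry time.
        self._data[key] = (value, time.time() + ex)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if time.time() >= expires:  # expired: behave like Redis and drop it
            del self._data[key]
            return None
        return value

    def refresh(self, key, ex=MONTH):
        # Reset the TTL -- jrwr's "You are important, stick around for
        # another month" when the backend sees the record reused in a WARC.
        if key in self._data:
            self._data[key] = (self._data[key][0], time.time() + ex)
```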
jrwrNow I can host all of this on a Scaleway VM, the backend scraper can SSH into the box and use tunnels and such to update the DB [23:24]
arkiverI guess
our newsbuddy server might be best
but I'm not sure if it can handle that
[23:25]
jrwrYa [23:26]
arkivertogether with everything else it's doing [23:26]
jrwrI can tune it to fuck all and it get speedy
get it*
[23:26]
arkiverright
HCross2 runs the newsbuddy server
[23:26]
jrwrThats right [23:27]
arkiverI think this is good
I'll try to set something up tomorrow
to get the data out of the WARCs
[23:27]
jrwrIll make the frontend right now
are we going with single hash or two hashes
I vote for single hash
[23:28]
arkiverI think that's ok
I might think of more important fields tomorrow though
[23:28]
jrwrIll design for that [23:29]
arkiverso be prepared for possible changes
and I need to go over this with IA next week
[23:29]
jrwrya [23:29]
arkiverto make sure it's ok [23:29]
jrwrDo keep the stored values short
keep ONLY the data needed to make a choice on the warriors
[23:29]
arkiverhmm [23:30]
jrwrwe can make a slower disk based DB for bigger items [23:30]
arkiverrecord ID and date and URL are not mandatory in revisit records
but I think they are important information
let's make a decision on those tomorrow
[23:30]
jrwrya thats fine if you store epoch time
no storing crazy json
keep it to 1KB
[23:31]
arkiverah yeah, I'm sure we can do that
example record ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
from which we can just take 16da6da0-bcdc-49c3-927e-57494593bbbb
maybe even remove -
and example date 2007-03-06T00:43:35Z
which will be converted to 20070306004335
that's about as short as I can get it
[23:31]
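The shortening arkiver walks through can be done mechanically; both expected outputs below match his examples:

```python
import re

def short_record_id(record_id: str) -> str:
    # "<urn:uuid:16da6da0-...>" -> bare hex with dashes removed
    return record_id.strip("<>").replace("urn:uuid:", "").replace("-", "")

def short_date(warc_date: str) -> str:
    # "2007-03-06T00:43:35Z" -> "20070306004335"
    return re.sub(r"[-:TZ]", "", warc_date)

short_record_id("<urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>")
# -> "16da6da0bcdc49c3927e57494593bbbb"
short_date("2007-03-06T00:43:35Z")
# -> "20070306004335"
```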
jrwrstore the time as Linux epoch time
its smaller
[23:33]
arkiverah yeah
will do that
[23:34]
JAANitpick: unix/epoch time doesn't handle leap seconds. :-/ [23:34]
jrwrLOOK HERE MISTER
:P
[23:34]
arkiver? [23:34]
JAA;-)
Well, if your computer supports leap seconds and you happen to retrieve a resource at e.g. 2016-12-31T23:59:60Z, stuff might break later on.
[23:34]
arkiverah
maybe stick to the 20070306004335 format
[23:36]
jrwrthats fine [23:36]
JAABut in reality it's really a non-issue. [23:36]
jrwrYa
Edge case of a Edge case
you will find clock drift a bigger issue
[23:36]
JAACorner case? [23:37]
arkiverI think there's a pretty big chance we'll get that edge case [23:37]
JAAIndeed.
Actually, clock drift wouldn't be a problem, because the date would still match the one recorded in the WARC.
[23:37]
jrwranyway, Ill make a auth'd endpoint url for you to inject things into my DBs [23:42]
arkiversounds
sounds good
[23:42]
jrwrIll be doing some heavy caching on the frontend [23:43]
arkiverwe'll probably not start yet though, before I've talked this over
but I'll keep this channel informed on that
[23:43]
jrwrshould be able to handle 40k/req/s [23:43]
arkiverthat's very good [23:43]
jrwrcurrently I'm showing something like 40% of the requests will be cached on the frontend before it even hits the db [23:44]
arkiverwe can go archive almost 3.5 billion URLs/day :) [23:44]
jrwrYAY! [23:44]
arkivernot that that will happen any time soon though
all of IA is not every close to 3.5 billion/day
not even close*
[23:44]
JAAChallenge accepted.
:-P
[23:49]
arkiverhaha [23:51]
