#newsgrabber 2017-07-12,Wed



Who | What | When
*** | kurt_ has quit IRC (ny.us.hub hub.efnet.us)
Igloo_ has quit IRC (ny.us.hub hub.efnet.us)
ErkDog has quit IRC (ny.us.hub hub.efnet.us)
Fletcher| has quit IRC (ny.us.hub hub.efnet.us)
Fletcher- has quit IRC (ny.us.hub hub.efnet.us)
underscor has quit IRC (ny.us.hub hub.efnet.us)
joepie91 has quit IRC (ny.us.hub hub.efnet.us)
chfoo has quit IRC (ny.us.hub hub.efnet.us)
luckcolor has quit IRC (ny.us.hub hub.efnet.us)
MrRadar has quit IRC (ny.us.hub hub.efnet.us)
midas1 has quit IRC (ny.us.hub hub.efnet.us)
dxrt has quit IRC (ny.us.hub hub.efnet.us)
arkiver has quit IRC (ny.us.hub hub.efnet.us)
midas has quit IRC (ny.us.hub hub.efnet.us)
lainu has quit IRC (ny.us.hub hub.efnet.us)
SmileyG has quit IRC (ny.us.hub hub.efnet.us)
stns4_ has quit IRC (ny.us.hub hub.efnet.us)
jrwr has quit IRC (ny.us.hub hub.efnet.us)
ivan has quit IRC (ny.us.hub hub.efnet.us)
JAA has quit IRC (ny.us.hub hub.efnet.us)
SmileyG has joined #newsgrabber
Fletcher| has joined #newsgrabber
Fletcher- has joined #newsgrabber
kurt_ has joined #newsgrabber
Igloo_ has joined #newsgrabber
stns4_ has joined #newsgrabber
luckcolor has joined #newsgrabber
jrwr has joined #newsgrabber
MrRadar has joined #newsgrabber
ivan has joined #newsgrabber
underscor has joined #newsgrabber
midas1 has joined #newsgrabber
joepie91 has joined #newsgrabber
chfoo has joined #newsgrabber
JAA has joined #newsgrabber
lainu has joined #newsgrabber
midas has joined #newsgrabber
arkiver has joined #newsgrabber
dxrt has joined #newsgrabber
hub.efnet.us sets mode: +oo chfoo arkiver
ErkDog has joined #newsgrabber
[12:41]
arkiver | jrwr: HCross2: any idea where we can set this deduplication machine up?
shall we host it at the main newsbuddy server?
[12:48]
HCross2 | arkiver: I don't see why not for now
We can always move it later
[13:00]
.... (idle for 17mn)
*** | r3c0d3x has quit IRC (Ping timeout: 260 seconds)
r3c0d3x has joined #newsgrabber
[13:17]
arkiver | ok [13:24]
there is one problem with the deduplication
we currently only know a WARC is not bad if it derives correctly on IA
There is a very very small possibility that a megaWARC created by us is bad
if we would extract records for deduplication from that bad WARC, but then the WARC is not added to the wayback machine
then some records deduplicated later on will point to records that have never been indexed
so I'm planning on creating this deduplication database from the CDX files on IA
jrwr: how exactly should I feed the data to our deduplication program?
[13:29]
........ (idle for 39mn)
we're going to check for new CDX data every 5 minutes
and I'm thinking of increasing WARC size to 100 GB
or more
so we have less records to check
jrwr: we'll only have a URL, hash and date, so no recordID for example
since the URL, hash and date are the only data for a record in the CDX
example record:
com,gravatar,0)/avatar/0dd53c96b968bcceb63afe919567652a?d=mm&r=g&s=168 20170620213556 http://0.gravatar.com/avatar/0dd53c96b968bcceb63afe919567652a?s=168&d=mm&r=g image/jpeg 200 LXVX664Z6AIZ4LBYSB4F5YHB6ZXBWKJP - - 3018 51246019757 archiveteam_newssites_20170623154212/newssites_20170623154212.megawarc.warc.gz
(that is in the CDX)
[14:12]
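The example record above has eleven space-separated fields, which matches a common CDX layout (URL key, timestamp, original URL, MIME type, status code, digest, redirect, meta tags, compressed size, offset, file name). Below is a minimal Python sketch, assuming every line follows that layout, of pulling out the URL, hash and date that the deduplication would work with. `CdxRecord`, `parse_cdx_line` and `dedup_fields` are hypothetical names for illustration and are not taken from the feeder repository.

```python
from typing import NamedTuple, Tuple


class CdxRecord(NamedTuple):
    """One line of an 11-field CDX file (field order is an assumption)."""
    urlkey: str
    timestamp: str
    original_url: str
    mimetype: str
    status: str
    digest: str
    redirect: str
    meta: str
    length: str
    offset: str
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split a CDX line on whitespace and map the fields by position."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return CdxRecord(*fields)


def dedup_fields(record: CdxRecord) -> Tuple[str, str, str]:
    """Return the (URL, hash, date) triple mentioned in the discussion."""
    return record.original_url, record.digest, record.timestamp


# The example record quoted in the log, split across string literals.
EXAMPLE_LINE = (
    "com,gravatar,0)/avatar/0dd53c96b968bcceb63afe919567652a?d=mm&r=g&s=168 "
    "20170620213556 "
    "http://0.gravatar.com/avatar/0dd53c96b968bcceb63afe919567652a?s=168&d=mm&r=g "
    "image/jpeg 200 LXVX664Z6AIZ4LBYSB4F5YHB6ZXBWKJP - - 3018 51246019757 "
    "archiveteam_newssites_20170623154212/newssites_20170623154212.megawarc.warc.gz"
)

url, digest, date = dedup_fields(parse_cdx_line(EXAMPLE_LINE))
```

Lines with a different number of fields are rejected rather than guessed at, since a misaligned digest would silently poison a deduplication database.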
JAA | But we need the record ID for WARC-Refers-To on the duplicate records, don't we? [14:28]
arkiver | it's not mandatory
and the record ID is not indexed in the CDX, so the current wayback machine is not using it either
and thus also not WARC-Refers-To
currently it's just using the URL and the hash to find the record the revisit record refers to
[14:29]
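As a rough illustration of the lookup arkiver describes (finding the earlier record from the URL and the hash alone, with no record ID), here is a sketch that uses an in-memory dict as a stand-in for the real database. The `DedupIndex` class, its method names, the key layout and the first-write-wins policy are all assumptions made for illustration; the log does not say how the actual program stores its data.

```python
from typing import Dict, Optional, Tuple


class DedupIndex:
    """Maps (URL, payload hash) to the capture date of the first record seen."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str], str] = {}

    def add(self, url: str, digest: str, date: str) -> None:
        # setdefault keeps the existing entry, so a later capture of the
        # same URL and hash never overwrites the original date.
        self._seen.setdefault((url, digest), date)

    def lookup(self, url: str, digest: str) -> Optional[str]:
        """Return the date of the earlier capture, or None if unseen."""
        return self._seen.get((url, digest))


index = DedupIndex()
index.add("http://example.com/logo.png", "HASHA", "20170620213556")
index.add("http://example.com/logo.png", "HASHA", "20170701000000")

# A hit means a revisit record could refer back to this URL and date.
earlier = index.lookup("http://example.com/logo.png", "HASHA")
```

Keying on the URL and hash together means identical content served from a different URL is treated as unseen, which mirrors the URL-plus-hash matching described above.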
JAA | Hm [14:32]
............ (idle for 59mn)
arkiver | I have the scripts ready
the scripts don't have to be run on the main server
[15:31]
...... (idle for 27mn)
jrwr | Ok
Morning arkiver
PM me an SSH key, it's a redis database
[15:58]
arkiver | I don't have much experience with redis [16:00]
............... (idle for 1h13mn)
*** | Etamin has joined #newsgrabber [17:13]
jrwr | watching this: https://www.youtube.com/watch?v=-2ZTmuX3cog
bwhaha
[17:15]
................. (idle for 1h24mn)
HCross2 | Ohh... Here we go. First impeachment proceedings filed against Trump [18:39]
......... (idle for 44mn)
jrwr | It's Watergate all over again
I don't want him to cop out
[19:23]
.......... (idle for 48mn)
arkiver | script for feeding data into redis https://github.com/ArchiveTeam/NewsGrabber-Deduplication-Feeder
going to start testing soon :D
[20:11]
HCross2: do we have a special newsbuddy account on IA?
we need an account for the feeder to login to IA
[20:22]
HCross2 | We don't have a special account so far, we're uploading onto mine [20:22]
arkiver | ok
shall I create a special account for the feeder for the deduplication?
the deduplication stuff is not hosted on the main server currently
[20:23]
HCross2 | Sure
arkiver: won't it need upload perms to put things in our collection?
[20:26]
arkiver | it won't put anything in the collection
we just need an account to get a list of items from the collection through metamgr
for this https://github.com/ArchiveTeam/NewsGrabber-Deduplication-Feeder/blob/master/indexer.py#L25-L26
would it be possible to create a @newsbuddy.net account?
like deduplication@newsbuddy.net or so
[20:27]
jrwr | arkiver: afk 1 hr [20:35]
arkiver | ok [20:36]
jrwr | note there is a 60m hard cache in place
you might hit it on the frontend
[20:36]
arkiver | what does that mean? [20:36]
jrwr | 60 Minutes
if you request a key on the webserver
it will be cached in ram for 60 minutes and not pulled from the db
[20:36]
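The behaviour jrwr describes (a key requested through the webserver is held in RAM for 60 minutes and not pulled from the database again during that time) can be sketched as a small time-to-live cache. This is only an illustration under assumptions: the actual frontend is not shown in the log, and `TTLCache`, `fetch` and `clock` are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve values from memory until they are older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # stand-in for the database read
        self._ttl = ttl_seconds  # 60 minutes by default
        self._clock = clock      # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            # Still fresh: answer from RAM without touching the database.
            return cached[1]
        # Missing or expired: read from the database and remember the time.
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

With a cache like this, a value written to the database can stay invisible through the frontend for up to an hour if an older answer for the same key was already cached, which is why switching it off is convenient while testing.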
arkiver | I see
that should be fine I think
[20:36]
jrwr | nvm
I turned it off for your testing
[20:37]
arkiver | we're not changing values once they're in
ok
yeah, for testing it's better turned off
[20:37]
HCross2 | arkiver: can do. Will take me a little while to get newsbuddy.net hooked up onto my mail server [20:40]
arkiver | if it's too much work, I can just create a random email address
we'll use this one only for feeding redis
[20:41]
....... (idle for 32mn)
made an account with a 20minutemail
we don't need special privs after all
[21:13]
....... (idle for 31mn)
jrwr: ready for a test whenever you are
account and all is set up
[21:45]
................ (idle for 1h15mn)
Kaz | I've got a grandfathered GApps account, if we need it [23:00]
