#newsgrabber 2017-04-29, Sat

*** Chickenhe has joined #newsgrabber [01:50]
(idle for 4h5mn)
*** Aranje has quit IRC (Quit: Three sheets to the wind) [05:55]
(idle for 7h5mn)
<HCross2> arkiver: where do we go in terms of the warrior and uploading data please [13:00]
<arkiver> what do you think
<arkiver> should we be making megaWARCs
<arkiver> or shall we just upload these files as-is into items on IA [13:00]
<HCross2> If we can make megawarcs in RAM then we should do that
<HCross2> to alleviate the disk load [13:04]
<kurt_> we could do like 10-15 GB megawarcs and should be fine even with just 32 GB of RAM
<kurt_> then it takes some of the pain out of the small files
<kurt_> just need to work out if we can megawarc quicker than things come in [13:07]
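For context on why in-RAM megaWARCs are feasible: gzipped WARCs concatenate cleanly, since appending .warc.gz files back to back yields one valid multi-member gzip stream. A minimal sketch of the batching idea in Python, assuming a 15 GB batch limit; illustrative only, not ArchiveTeam's actual megawarc tool (which also records a JSON index and tar metadata):

    import io
    import os
    import shutil

    BATCH_LIMIT = 15 * 1024**3  # ~15 GB per megaWARC, fits comfortably in 32 GB of RAM

    def build_megawarc(warc_paths, out_path):
        """Concatenate small .warc.gz files into one megaWARC held in RAM.

        Appended gzip members remain a valid gzip stream, so readers can
        still iterate every WARC record in the combined file.
        """
        buf = io.BytesIO()
        packed = []
        for path in warc_paths:
            if packed and buf.tell() + os.path.getsize(path) > BATCH_LIMIT:
                break  # batch full; caller re-runs with the remaining files
            with open(path, "rb") as f:
                shutil.copyfileobj(f, buf)
            packed.append(path)
        with open(out_path, "wb") as out:  # one sequential write to disk
            out.write(buf.getbuffer())
        return packed  # the files that made it into this megaWARC

Holding the batch in a BytesIO means each small file is read once and the output hits disk as a single sequential write, which is the disk-load relief being discussed.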
<arkiver> what do you think of deduplication
<arkiver> should we still do deduplication on the main server
<arkiver> or shall we leave it to the discoverer
<arkiver> that will mean we'll get some duplicate articles
<arkiver> but given that we check with the IA CDX API, at least images should not be duplicated
<arkiver> and other static stuff
<arkiver> which makes it a lot less painful to duplicate an article [13:08]
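The IA CDX API check arkiver refers to can be made against the public Wayback Machine endpoint. A minimal sketch, assuming the documented url/output/limit query parameters (the function name is illustrative):

    import requests

    CDX_API = "https://web.archive.org/cdx/search/cdx"

    def already_archived(url):
        """Return True if the Wayback Machine already has a capture of url."""
        resp = requests.get(
            CDX_API,
            params={"url": url, "output": "json", "limit": "1"},
            timeout=30,
        )
        resp.raise_for_status()
        if not resp.text.strip():
            return False  # empty body: no captures on record
        rows = resp.json()
        return len(rows) > 1  # row 0 is the field header; any further row is a capture

A grabber would call this before fetching a static asset and skip the download on a hit, so only the article page itself risks being stored twice.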
<kurt_> we don't want the main server to be doing dedupe, really
<kurt_> it's already going to be rammed out taking data, warcing and uploading [13:10]
<HCross2> ^ it already spends 98% of the time on fire [13:11]
<kurt_> although, I guess the 'main' server only needs to be giving lists to discovery nodes, and hosting for warriors to grab [13:13]
<arkiver> alright [13:13]
<kurt_> the rsync ingest/megawarc/upload could be separate, which we could split off a bit, and scale that way [13:13]
<arkiver> so leave discovered-URL deduplication to the discoverers
<arkiver> and archived-URL deduplication with the IA CDX API to the warriors [13:13]
<kurt_> discovered urls could be deduped on the main server, i guess
<kurt_> that's only a small workload, i guess?
<kurt_> archived urls should be on warriors [13:14]
<arkiver> kurt_: it requires keeping a list of URLs on the main server, which can become quite big
<arkiver> (we might clean it up every now and then though) [13:16]
<kurt_> ah
<kurt_> we could leave that to discovery nodes then [13:16]
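What "keeping a list of URLs" could look like on a discovery node; a hypothetical sketch (the class and the digest trick are illustrative, not the project's actual scripts):

    import hashlib

    class SeenURLs:
        """Deduplicate discovered URLs before they are announced."""

        def __init__(self):
            self._seen = set()

        def add_if_new(self, url):
            # 16-byte digests instead of full URL strings keep the set small
            digest = hashlib.md5(url.encode("utf-8")).digest()
            if digest in self._seen:
                return False  # duplicate: drop before it reaches the main server
            self._seen.add(digest)
            return True

        def clear(self):
            # the occasional cleanup mentioned above, at the cost of
            # re-announcing some URLs afterwards
            self._seen.clear()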
(idle for 2h29mn)
<arkiver> chfoo: we want to be able to submit lists of items to the tracker from the main server of newsbuddy
<arkiver> chfoo: do you have any recommendations on how we should do that? [15:46]
(idle for 1h15mn)
<chfoo> there's actually a backdoor in the tracker that lets you add items without login
<chfoo> i think that might be the easiest without needing to add a new API [17:02]
<HCross2> chfoo: can that be done by just locking down to an IP? [17:10]
(idle for 1h57mn)
<chfoo> it's already part of the tracker [19:07]
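What submitting item lists through that no-login route might look like from the newsbuddy server; a deliberately hypothetical sketch, since the log never names the actual endpoint or payload format (the URL and the "items" field are placeholders):

    import requests

    # Placeholder route: the real tracker path is not given in this log.
    TRACKER_ADD_URL = "http://tracker.example.com/newsgrabber/add-items"

    def submit_items(item_names):
        """POST a batch of item names to the tracker in one request."""
        resp = requests.post(
            TRACKER_ADD_URL,
            data={"items": "\n".join(item_names)},
            timeout=60,
        )
        resp.raise_for_status()

Locking the route down to the main server's IP, as HCross2 asks, would happen in front of the tracker (web server or firewall rules) rather than in this client code.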
(idle for 23mn)
<HCross2> chfoo: does the tracker do IPv6 at all please? [19:30]
<chfoo> it should already support ipv6 [19:34]
(idle for 1h20mn)
*** midas3 has quit IRC (Remote host closed the connection)
*** midas has quit IRC (Read error: Connection reset by peer) [20:54]
(idle for 2h25mn)
*** hook54321 has quit IRC (Connection closed)
*** hook54321 has joined #newsgrabber
*** hook54321 has quit IRC (Connection closed) [23:19]
(idle for 31mn)
*** hook54321 has joined #newsgrabber [23:54]
