#newsgrabber 2017-10-29, Sun




*** GK_1WM-SU has joined #newsgrabber
GK_1WM-SU has left
[03:37]
.................... (idle for 1h35mn)
anonymoos has quit IRC (Quit: Leaving) [05:12]
.................. (idle for 1h25mn)
kyan has quit IRC (Read error: Operation timed out) [06:37]
..................................... (idle for 3h2mn)
mls has joined #newsgrabber
mls has left
[09:39]
............ (idle for 58mn)
GK-1WM-SU has joined #newsgrabber
GK-1WM-SU has left
[10:40]
........................... (idle for 2h10mn)
HCross2: arkiver: does the tracker give you errors when you try to requeue items? [12:50]
*** GK-1WM-SU has joined #newsgrabber
GK-1WM-SU has left
DEEP-BOOK has joined #newsgrabber
DEEP-BOOK has left
[12:51]
............... (idle for 1h11mn)
GK-1WM-SU has joined #newsgrabber
GK-1WM-SU has left
[14:02]
.... (idle for 17mn)
Aoede has quit IRC (Ping timeout: 255 seconds)
Aoede has joined #newsgrabber
Aoede has quit IRC (Connection closed)
Aoede has joined #newsgrabber
[14:19]
...... (idle for 26mn)
arkiver: HCross2: yes, when there's a large number of out items
Try the requeue option in Workarounds
[14:46]
.......... (idle for 47mn)
JAA: arkiver: What a coincidence... Two buffer overflows were announced/fixed in wget yesterday, "which could result in the execution of arbitrary code when connecting to a malicious HTTP server" (obviously something that is impossible to prevent in the context of NewsGrabber or even the warrior): https://www.debian.org/security/2017/dsa-4008
I wasn't able to find any information on whether the version wget-lua is based on is affected though.
Given that Debian fixed 1.13.x on Wheezy, and wget-lua is based on 1.14.something, it probably is.
[15:34]
Yep, just checked, the relevant code is present in the wget-lua version used by the warrior. [15:41]
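The version reasoning above can be sketched as a small check. This is only an illustration: the helper names are made up, and the cutoff assumes 1.19.2 was the first upstream release carrying the fixes for the two CVEs in DSA-4008 (CVE-2017-13089/13090).

```python
# Sketch: flag wget builds older than the (assumed) first fixed upstream release.
FIXED = (1, 19, 2)  # assumption: first release with the CVE-2017-13089/13090 fixes

def parse_version(s: str) -> tuple:
    """Turn '1.14' or '1.19.2' into a comparable tuple of ints."""
    return tuple(int(part) for part in s.split(".") if part.isdigit())

def likely_affected(version: str) -> bool:
    """True if this wget version predates the assumed fixed release."""
    return parse_version(version) < FIXED

print(likely_affected("1.14"))    # wget-lua's approximate base -> True
print(likely_affected("1.19.2"))  # assumed fixed release -> False
```

Tuple comparison makes "1.14.something" sort below "1.19.2" without any extra version-parsing library.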
*** DEEP-BOOK has joined #newsgrabber
DEEP-BOOK has left
GK-1WM-SU has joined #newsgrabber
GK-1WM-SU has left
[15:54]
..... (idle for 23mn)
arkiver: JAA: :/
maybe we can look into wget-lua, see if we can update it to a more recent wget version
could we simply merge the new version of wget into wget-lua?
https://github.com/alard/wget-lua/commit/a8697ba233a26ed16de2399df38b25c290034b7b
"Merge upstream wget version 1.17.1"
[16:17]
........................ (idle for 1h58mn)
JAA: That'd be a start. We'd have to backport those two fixes also. [18:18]
................. (idle for 1h22mn)
arkiver: Yes
I've got a little network of staging servers and crawlers running now
now setting up the crawling part
[19:40]
............... (idle for 1h10mn)
currently connecting a crawler to a max of 5 staging servers [20:50]
the staging servers all know about each other at the moment; if the network becomes very big, we'll have to change that
I'm thinking a job is spread over a certain number of staging servers
the URL list is split across them; staging servers share URLs with crawlers after consulting the other servers that have the same job
a crawler crawls a single URL, deduplicates, and reports the found URLs back to the staging server that gave it the job
the server consults the other servers and shares the URLs equally, making sure there are no duplicates with the URLs on other servers
and then sends out URLs to crawlers again
until the number of URLs in the queue for a job is 0 on all servers
I'm thinking a few privileged servers with an IA account and S3 keys; they receive the megaWARCs and upload them to IA
[20:56]
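The scheme arkiver sketches above can be illustrated as a toy in-memory simulation. All names here (StagingServer, run_job, etc.) are invented for illustration; the real system would be networked services, and the rebalancing and dedup would happen over RPC rather than shared objects:

```python
from collections import deque

class StagingServer:
    """Toy stand-in for one staging server holding part of a job's URL queue."""
    def __init__(self):
        self.queue = deque()  # URLs waiting to be handed to crawlers
        self.seen = set()     # every URL this server has ever queued (dedup)
        self.peers = []       # the other servers that have the same job

    def report(self, urls):
        """Take URLs found by a crawler; dedup against the whole job, then rebalance."""
        for url in urls:
            if any(url in s.seen for s in [self] + self.peers):
                continue  # already queued somewhere for this job
            self.seen.add(url)
            self.queue.append(url)
        self._rebalance()

    def _rebalance(self):
        """Share queued URLs equally across the servers that have this job."""
        servers = [self] + self.peers
        while True:
            longest = max(servers, key=lambda s: len(s.queue))
            shortest = min(servers, key=lambda s: len(s.queue))
            if len(longest.queue) - len(shortest.queue) <= 1:
                return
            shortest.queue.append(longest.queue.pop())

def run_job(seeds, outlinks, servers):
    """Crawl until the number of queued URLs for the job is 0 on all servers."""
    crawled = []
    servers[0].report(seeds)
    while any(s.queue for s in servers):
        for server in servers:
            if not server.queue:
                continue
            url = server.queue.popleft()
            crawled.append(url)  # "crawler crawls single URL"
            # found URLs go back to the staging server that gave out the job
            server.report(outlinks.get(url, []))
    return crawled

# Tiny fake web: each URL maps to the URLs found on it.
web = {"a": ["b", "c"], "b": ["c", "d"], "d": ["a"]}
s1, s2 = StagingServer(), StagingServer()
s1.peers, s2.peers = [s2], [s1]
print(run_job(["a"], web, [s1, s2]))  # each URL is crawled exactly once
```

The dedup check consults every peer's `seen` set, which mirrors the "consult other servers that have the same job" step; rebalancing after each report keeps the queues roughly equal in length.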
