#newsgrabber 2017-10-31,Tue




arkiver: what is currently in:
a job is started in a single staging server. this staging server gets the list of URLs
The staging server chooses up to 5 random other staging servers. The list of URLs is spread among these
Each list of URLs that goes to a staging server is spread to 3 other staging servers again as backup
(if the server goes down, this backup will be used to restore URLs)
Each server that received URLs spreads these URLs randomly across its crawlers
each crawler crawls the URLs it received
it reports it's done with the URL to the server that queued it to the crawler
when the server receives this from the crawler, it spreads this across other servers and the URL is removed from backups
the crawler reports all the new URLs it found back to the server that queued the URL.
the server receives these URLs and the cycle of spreading and backup starts again
when the crawlers report no new URLs and all backups are empty (all URLs have been crawled), the job is done
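The spreading and backup step described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `spread`, the server names, and the round-robin split are hypothetical (the log only says the list is "spread among" the chosen servers), and whether the originating server may hold a backup is not stated, so the sketch allows it.

```python
import random


def spread(urls, servers, origin, max_targets=5, backup_copies=3, rng=random):
    """Plan how one staging server hands out a list of URLs.

    Picks up to max_targets random staging servers other than origin,
    splits the URLs among them, and for each share picks backup_copies
    other servers that would hold a backup copy of that share.
    """
    others = [s for s in servers if s != origin]
    if not others:
        raise ValueError("no other staging servers available")

    # "chooses up to 5 random other staging servers"
    targets = rng.sample(others, min(max_targets, len(others)))

    # Assumption: a simple round-robin split; the log does not say how
    # the list is divided among the chosen servers.
    shares = {t: [] for t in targets}
    for i, url in enumerate(urls):
        shares[targets[i % len(targets)]].append(url)

    plan = {}
    for target, share in shares.items():
        # "spread to 3 other staging servers again as backup"
        candidates = [s for s in servers if s != target]
        backups = rng.sample(candidates, min(backup_copies, len(candidates)))
        plan[target] = {"urls": share, "backups": backups}
    return plan


if __name__ == "__main__":
    servers = [f"staging{i}" for i in range(8)]
    urls = [f"http://example.com/{i}" for i in range(12)]
    for target, entry in spread(urls, servers, origin="staging0").items():
        print(target, len(entry["urls"]), "URLs, backups on", entry["backups"])
```

When a crawler later reports a URL as finished, the server holding that URL would tell the backup servers listed in the plan to drop it, which is what makes "all backups are empty" usable as the completion condition.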
please let me know what you think or if you have comments about this
I'm thinking WARCs from crawler will be sent back to server that queued URL. this server then sends the WARCs to a server with special IA uploading privileges
Still need to figure out how deduplication is going to fit in this picture
and special wpull/wget-lua scripts for archiving and updates to these scripts
[00:32]
also, the server keeps a little backup of URLs sent to a crawler, in case a crawler goes down these URLs will be spread again among other crawlers
and the server again removes a URL from backup when the crawler has finished it
[00:48]
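The crawler-level backup mentioned above might look something like this minimal sketch. The class name `Server`, its methods, and the crawler names are hypothetical and not taken from any existing newsgrabber code; it assumes the server is told explicitly when a crawler goes down.

```python
import random


class Server:
    """Hypothetical staging server tracking URLs handed to its crawlers."""

    def __init__(self, crawlers, rng=random):
        self.crawlers = list(crawlers)
        self.rng = rng
        # The "little backup": every URL sent out, mapped to its crawler.
        self.in_flight = {}

    def queue(self, urls):
        """Hand each URL to a randomly chosen crawler and remember it."""
        if urls and not self.crawlers:
            raise RuntimeError("no crawlers left to take URLs")
        for url in urls:
            self.in_flight[url] = self.rng.choice(self.crawlers)

    def finished(self, url):
        """The crawler reported the URL as done, so drop it from the backup."""
        self.in_flight.pop(url, None)

    def crawler_down(self, crawler):
        """Spread the lost crawler's unfinished URLs among the other crawlers."""
        self.crawlers.remove(crawler)
        lost = [u for u, c in self.in_flight.items() if c == crawler]
        self.queue(lost)
        return lost


if __name__ == "__main__":
    server = Server(["crawler-a", "crawler-b", "crawler-c"])
    server.queue([f"http://example.com/{i}" for i in range(6)])
    server.finished("http://example.com/0")
    requeued = server.crawler_down("crawler-b")
    print("re-queued:", requeued)
    print("still in flight:", server.in_flight)
```

Keeping the mapping keyed by URL makes both operations cheap: finishing a URL is a single removal, and a failed crawler's work is found by filtering on the stored crawler name.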
.......................................... (idle for 3h26mn)
*** CoolChris has quit IRC (se.hub irc.efnet.nl)
hook54321 has quit IRC (se.hub irc.efnet.nl)
MrRadar2 has quit IRC (se.hub irc.efnet.nl)
[04:15]
.................... (idle for 1h38mn)
CoolChris has joined #newsgrabber
hook54321 has joined #newsgrabber
MrRadar2 has joined #newsgrabber
[05:53]
..................... (idle for 1h44mn)
HCross2 has joined #newsgrabber
svchfoo3 sets mode: +o HCross2
[07:37]
.................................................................................................................. (idle for 9h29mn)
kyan has joined #newsgrabber
figpucker has joined #newsgrabber
[17:06]
.................. (idle for 1h28mn)
blitzed has joined #newsgrabber [18:38]
........... (idle for 50mn)
meszi has joined #newsgrabber
figpucker has quit IRC (Ping timeout: 260 seconds)
[19:28]
.... (idle for 16mn)
Fletcher_ has joined #newsgrabber [19:44]
....................... (idle for 1h52mn)
kyan has quit IRC (Quit: Leaving) [21:36]
