#newsgrabber 2017-06-15,Thu



Who What When
*** MrRadar has quit IRC (ny.us.hub irc.servercentral.net)
dxrt has quit IRC (ny.us.hub irc.servercentral.net)
arkiver has quit IRC (ny.us.hub irc.servercentral.net)
midas has quit IRC (ny.us.hub irc.servercentral.net)
lainu has quit IRC (ny.us.hub irc.servercentral.net)
MrRadar has joined #newsgrabber
dxrt has joined #newsgrabber
arkiver has joined #newsgrabber
midas has joined #newsgrabber
lainu has joined #newsgrabber
irc.servercentral.net sets mode: +o arkiver
[00:14]
kyan has joined #newsgrabber
newsbuddy has joined #newsgrabber
[00:26]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:29]
*** newsbuddy has quit IRC (Remote host closed the connection) [00:32]
newsbuddy has joined #newsgrabber [00:39]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:39]
*** newsbuddy has quit IRC (Remote host closed the connection)
newsbuddy has joined #newsgrabber
[00:39]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:41]
<arkiver> URL loading just became a lot faster
looks like filling up the memory also became a lot faster :P
[00:43]
*** newsbuddy has quit IRC (Remote host closed the connection)
newsbuddy has joined #newsgrabber
[00:44]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:44]
*** newsbuddy has quit IRC (Remote host closed the connection) [00:44]
newsbuddy has joined #newsgrabber [00:50]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:50]
*** newsbuddy has quit IRC (Remote host closed the connection)
newsbuddy has joined #newsgrabber
[00:55]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:55]
*** newsbuddy has quit IRC (Remote host closed the connection)
newsbuddy has joined #newsgrabber
[00:57]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:57]
*** newsbuddy has quit IRC (Remote host closed the connection)
newsbuddy has joined #newsgrabber
[00:57]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:57]
*** newsbuddy has quit IRC (Remote host closed the connection) [01:00]
<arkiver> HCross2: looking well
lists uploaded to the tracker as expected
now adding automatic removal of lists as WARCs come in
I'm just going to get the megaWARCs factory going
[01:00]
................................................................. (idle for 5h21mn)
*** nOgAnOo has joined #newsgrabber [06:23]
.............................. (idle for 2h29mn)
chfoo has joined #newsgrabber
joepie91 has joined #newsgrabber
hub.dk sets mode: +o chfoo
[08:52]
........................... (idle for 2h14mn)
<Kaz> looking good here? [11:06]
.......... (idle for 47mn)
*** HCross2 has quit IRC (Quit: Connection closed for inactivity) [11:53]
............ (idle for 56mn)
<arkiver> yep
only need to test uploading part
[12:49]
....... (idle for 33mn)
<Kaz> good stuff
I'm home in an hour or so - might start up the workers again
[13:23]
<arkiver> I saw the rsync targets for the discoverers are still reachable
so that's good
[13:24]
<Kaz> nice
are we at a point where we can automatically inject discovered urls to the tracker?
[13:25]
<arkiver> yes
https://tracker.archiveteam.org/newsgrabber/
those to-do items were added by the script yesterday
[13:27]
<Kaz> nice
I'm going to start some bits up
am I okay to start grabbing, or should I hold off that for now?
master.newsbuddy.net doesn't seem to be correct
[13:31]
.... (idle for 15mn)
<arkiver> it's down currently [13:52]
................................................... (idle for 4h10mn)
*** blitzed has joined #newsgrabber [18:02]
........... (idle for 53mn)
<jrwr> I'm going to work on the wiki today [18:55]
<arkiver> awesome [18:55]
*** HCross2 has joined #newsgrabber [19:02]
<Kaz> HCross2: master.newsbuddy.net doesn't resolve [19:11]
<HCross2> Kaz: yea. I moved newsbuddy.net to my OVH account and forgot it didn't carry the DNS [19:12]
<Kaz> oops [19:12]
<HCross2> Kaz: arkiver DNS should be back [19:16]
<arkiver> yes
I'm in
[19:17]
<Kaz> looks good, I've restarted my disco [19:17]
<arkiver> do you think we can have a URL point to /NewsBuddy/warriorlists ? [19:17]
<HCross2> sure [19:17]
<arkiver> maybe not /NewsBuddy/, as the keys for uploading will be there too
testing uploading now
and setting up megaWARC
[19:18]
<arkiver> HCross2: I'm adding you IA keys to the config again for newssites
your*
fine with that?
[19:30]
<HCross2> that's fine [19:30]
<arkiver> shall we do 50 GB WARCs? [19:37]
<HCross2> I don't see why not
arkiver: my concern is disk IO
[19:37]
<arkiver> I guess we'll just see how that goes
indeed megaWARCs do take up more IO overall
but we'll see I guess
[19:38]
<HCross2> arkiver: http://master.newsbuddy.net [19:46]
<arkiver> yes! [19:46]
<HCross2> arkiver: is https://github.com/ArchiveTeam/NewsGrabber-Discovery still the valid disco scripts? [19:50]
<arkiver> yes
run it with main.py
[19:50]
<HCross2> Ok, discoveries on the way back up. Bangalore is booting, will be followed by Luxembourg and Manchester [19:53]
<arkiver> very very nice
I'm now testing dedup in the warrior again
with the new warcio
the old problem we had should be solved
that is this one https://github.com/webrecorder/warcio/pull/8
fixed with https://github.com/webrecorder/warcio/pull/13
[19:54]
<HCross2> Luxembourg and Bangalore are online [19:57]
..... (idle for 21mn)
<arkiver> awesome
so deduplication is looking good
looks like the error we had previously is gone
[20:18]
<HCross2> fantastic [20:26]
....... (idle for 34mn)
<jrwr> arkiver: so is only gzip supported by IA? [21:00]
<arkiver> for WARCs? [21:00]
<jrwr> ya [21:00]
<arkiver> this is the standard http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
let me find a quote in there for you
[21:00]
<jrwr> For this purpose, the GZIP format with customary "deflate" compression is recommended, as defined in [RFC1950], [RFC1951], and [RFC1952]. Freely available source code implementing this format is available, and the technique is free of patent encumbrances. The GZIP format is also widely used and supported across many free and commercial software packages and operating systems.
This section documents recommended, but optional, practices for compressing WARC files with GZIP
Page 20
[21:01]
<arkiver> A.1 also
ah yes
that is page 20
we always use .warc.gz
[21:02]
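Editor's sketch of the .warc.gz convention discussed above. It assumes the per-record practice described in the standard's compression annex, where each WARC record is compressed as its own gzip member and the members are concatenated into one file. The record bytes and the helper name `compress_records` are illustrative only and are not taken from the NewsGrabber code.

```python
import gzip


def compress_records(records):
    """Compress each record as a separate gzip member and concatenate them.

    The result is still a valid gzip stream, so ordinary gzip tools can
    decompress the whole file, while a reader that knows a record's byte
    offset can decompress that one member on its own.
    """
    return b"".join(gzip.compress(record) for record in records)


# Two toy records (illustrative, not complete real-world WARC records).
records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
]

blob = compress_records(records)

# Decompressing the concatenated members yields the concatenated records.
assert gzip.decompress(blob) == b"".join(records)

# The second member can be decompressed alone, given its offset.
offset = len(gzip.compress(records[0]))
assert gzip.decompress(blob[offset:]) == records[1]
```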
<jrwr> What's the target we are using
there are "methods" to help
[21:04]
<arkiver> target as in rsync target for the warrior? [21:05]
<jrwr> ya [21:05]
<arkiver> master.newsbuddy.net
since that also holds the warrior URL list files
so we can remove those lists easily as WARCs come in
[21:05]
<jrwr> Cool
How much ram does it have
[21:06]
<HCross2> 16GB [21:06]
<jrwr> How much Disk? [21:06]
<HCross2> 12TB split over 3x4TB [21:06]
<jrwr> Overall, since it only has 16GB I would just renice the megawarc factory
make it IO friendly
[21:07]
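Editor's sketch of the "renice, make it IO friendly" suggestion above, assuming a Linux host with the standard `nice` and `ionice` utilities. The script name `./megawarc-factory.sh` is a hypothetical placeholder, not the project's actual entry point, and the idle I/O class only has an effect with I/O schedulers that honour it.

```python
import shlex


def io_friendly_command(command, niceness=19, io_class=3):
    """Prefix `command` so it runs with low CPU priority (nice -n 19)
    and in the idle I/O scheduling class (ionice -c 3)."""
    return ["ionice", "-c", str(io_class), "nice", "-n", str(niceness), *command]


# Hypothetical factory invocation; the real script and arguments may differ.
cmd = io_friendly_command(["./megawarc-factory.sh"])
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

An already running process could instead be adjusted with `renice` and `ionice -p <pid>`, which is closer to what "renice" literally means.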
<HCross2> it's one of these https://www.ovh.co.uk/dedicated_servers/fs/160fs1.xml but with DDR4 RAM and a D-1521 instead of the Atom [21:07]
<arkiver> actually deduplication is not working yet with the new warcio [21:07]
<jrwr> Nice! [21:07]
<arkiver> which sucks
I was using a modified warcio
[21:07]
<jrwr> RAID10? [21:08]
<arkiver> mistook it as using the 'real' warcio [21:08]
<HCross2> 0 (I know, terrible - but it gives IO) [21:08]
<jrwr> That's fine, you can get pretty close with ZFS RAID [21:08]
<arkiver> which means
we'll probably go with the modified warcio
[21:08]
<jrwr> ZFS RAID + 2x parity so you can lose a disk or two without much issue [21:09]
<HCross2> part of the reason I chose the RAID0 is that quite often we run into upload issues and fill space too [21:11]
<jrwr> Ya [21:11]
<JAA> Also, you don't really want to run RAIDZ2 on 3 hard drives. [21:12]
<HCross2> mainly though - the disks are a staging zone on which most data resides for a couple of hours at maximum [21:12]
<jrwr> Right
and that's how it should be
[21:13]
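Editor's sketch of the capacity trade-off behind the RAID0 versus RAIDZ2 exchange above, using the 3x4TB figure from the log. It is a simplified model that ignores filesystem overhead; `usable_tb` is an illustrative helper, not part of any tool mentioned here.

```python
def usable_tb(disks, size_tb, parity_disks):
    """Usable capacity when `parity_disks` disks' worth of space is
    spent on redundancy (0 for plain striping, 2 for RAIDZ2)."""
    if not 0 <= parity_disks < disks:
        raise ValueError("parity_disks must be between 0 and disks - 1")
    return (disks - parity_disks) * size_tb


print(usable_tb(3, 4, 0))  # RAID0 stripe: 12 TB usable, no disk may fail
print(usable_tb(3, 4, 2))  # RAIDZ2: 4 TB usable, two disks may fail
```

On three drives, double parity leaves only one drive's worth of space, which is awkward for a staging area that sometimes fills up when uploads stall.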
.......... (idle for 45mn)
*** ui has joined #newsgrabber
ui has quit IRC (Client Quit)
[21:58]
............. (idle for 1h1mn)
kyan has quit IRC (se.hub ny.us.hub)
MrRadar has quit IRC (se.hub ny.us.hub)
dxrt has quit IRC (se.hub ny.us.hub)
arkiver has quit IRC (se.hub ny.us.hub)
midas has quit IRC (se.hub ny.us.hub)
lainu has quit IRC (se.hub ny.us.hub)
blitzed has quit IRC (se.hub ny.us.hub)
joepie91 has quit IRC (se.hub ny.us.hub)
chfoo has quit IRC (se.hub ny.us.hub)
johnny5 has quit IRC (se.hub ny.us.hub)
luckcolor has quit IRC (se.hub ny.us.hub)
kurt has quit IRC (se.hub ny.us.hub)
SpaffGarg has quit IRC (se.hub ny.us.hub)
randomdes has quit IRC (se.hub ny.us.hub)
JAA has quit IRC (se.hub ny.us.hub)
jrwr has quit IRC (se.hub ny.us.hub)
phuzion has quit IRC (se.hub ny.us.hub)
ivan has quit IRC (se.hub ny.us.hub)
[23:00]