#newsgrabber 2017-10-23,Mon

Igloo: lol, Let me know if you need resources still? [11:37]
........ (idle for 39mn)
JensRex: Just finished a >8GB item. What news article is 8GB?
newsbuddy:warrior_496_1508753601.62
[12:16]
JAA: Multiple videos maybe? [12:16]
JensRex: Still, that's almost 2 DVD's worth of video. [12:20]
JAA: Yeah, true. [12:20]
JensRex: I just noticed this because I got disk usage warning mails from Digital Ocean. [12:21]
................... (idle for 1h34mn)
HCross2: 1080p video will do that
You might have gotten a documentary or something
[13:55]
........ (idle for 38mn)
Igloo: Do you need more firepower HCross2? Or is it OK for the time being? [14:33]
.... (idle for 16mn)
jrwr: glad to see the project back in swing
I'm also happy that my dedupe is working so well
[14:49]
.............. (idle for 1h6mn)
Igloo: hey HCross2 you'll know, I've got the ctrl-v uploading to IA
How do I get someone to add that to wayback?
[15:55]
HCross2: Mediatype wev
Web
[15:56]
Igloo: --header "x-archive-meta-mediatype:web"
Is what i've set
So that should work (hopefully!)
[15:57]
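[Editor's note] The header Igloo mentions sets item metadata through the Internet Archive's S3-like upload API. Below is a minimal sketch of how such an upload request could be assembled. It only builds the URL and headers and sends nothing. The function name, item identifier, file name, and credentials are hypothetical placeholders; the endpoint, the `LOW key:secret` authorization form, and the auto-make-bucket header are taken from that API as the editor understands it, not from this log. Whether an item uploaded this way is actually picked up by the Wayback Machine is not settled anywhere in the conversation.

```python
from typing import Dict, Tuple

# Internet Archive S3-like endpoint (assumption: not stated in the log).
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_ia_upload(
    identifier: str,
    filename: str,
    access_key: str,
    secret_key: str,
    mediatype: str = "web",
) -> Tuple[str, Dict[str, str]]:
    """Return the PUT URL and headers for uploading one file to an item.

    Hypothetical helper: it prepares the request but does not send it.
    """
    url = f"{IA_S3_ENDPOINT}/{identifier}/{filename}"
    headers = {
        # IA S3 credentials use the "LOW accesskey:secret" form.
        "authorization": f"LOW {access_key}:{secret_key}",
        # Create the item if it does not exist yet.
        "x-archive-auto-make-bucket": "1",
        # The header quoted in the log: sets the item's mediatype metadata.
        "x-archive-meta-mediatype": mediatype,
    }
    return url, headers


if __name__ == "__main__":
    url, headers = build_ia_upload("example-item", "grab.warc.gz", "KEY", "SECRET")
    print("PUT", url)
    for name, value in headers.items():
        # Avoid printing the credential header.
        if name != "authorization":
            print(f"{name}: {value}")
```

The returned headers could be passed to any HTTP client that can send a PUT request with a file body, which is what the quoted `--header` flag does on the command line.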
JAA: Igloo: That's ctrlv.in? [15:59]
Igloo: JAA: Yeah
The main grab has ~1million entries to go. It's done 900k. Most of them now are images
I've uploaded the incremental grab from 1,001,800 - 1,001,8499
[16:07]
JAA: Awesome, thanks! I can scratch that off my list then. :-) [16:09]
Igloo: Any more on your list?
I've been absent for a while, I have a fair amount of resource available
Should be completed in about 24 hours for the main grab
Assuming I don't get blocked ;-)
[16:09]
JAA: Hehe
My list consists mainly of crazy ideas for huge projects. Like archiving all of wordpress.com, Disqus, etc. Also tons of small stuff which I'll throw into ArchiveBot once the queue is drained. Can't think of anything intermediate (i.e. feasible but too large or complex for ArchiveBot) at the moment.
[16:11]
Igloo: No worries. I can always add another AB grabber. But I keep being told not required *shrug* [16:16]
JAA: I think the proper terminology would be "too much effort".
We've been in need of pipelines for months. It got worse about 2 weeks ago when two guys joined and threw over 100 jobs into the queue in a few days.
Now it's getting better, around 50 pending jobs left, down from 90 yesterday evening. (I added another pipeline for 5 jobs yesterday.)
[16:16]
Igloo: Fair enough, It's an always standing offer though if required [16:21]
JAA: Personally, I'd highly appreciate additional pipelines. Since I've been given access to the control node, I could add them, too. I'm just not sure if David would be okay with that. (He gave me access so I could add my own machines.) [16:25]
Igloo: I understand that [16:25]
.... (idle for 17mn)
They're still uploading to ctrl-v somehow. Which is a bit of a flipping pain :D [16:42]
JAA: Maybe via email or something? http://ctrlv.in/help [16:44]
Igloo: I've emailed already =] [16:51]
......................... (idle for 2h0mn)
*** kyan has joined #newsgrabber [18:51]
.... (idle for 17mn)
HCross2: I offered to run a pipeline, got told to wait for a bit and then never heard back [19:08]
JAA: Yeah, that sounds familiar. [19:11]
HCross2: Was fine while I had free M24Seven credit [19:20]
...... (idle for 26mn)
Igloo: I've got a couple of 1Gbps servers practically idle from projects [19:46]
JensRex: newsgrabber project is a bit too intense for my Raspberry Pi that I'm usually using for my home connection.
Jobs like the 8GB one from earlier is even pushing the limit of my Digital Ocean node.
I only use it for IRC and OwnCloud. Threw archive jobs at it, because why not.
Perhaps I could use some USB scratch disk for the RPi, but I think deduplication is way too CPU intensive for a gen 1 RPi ARM cpu.
</monologue>
[19:51]
HCross2: I'd like to stabilise discovery a bit more, then go hard adding services [19:58]
JensRex: I think what we really need the most (in my uninformed opinion) is timeout for jobs. Requeue jobs automatically if not returned within N hours.
Saves having to bug admins to requeue things all the time.
[19:58]
HCross2: Yeah, and we've got so many to requeue I can't do it as it's broken the tracker [19:59]
JensRex: Improperly configured clients can pull down and fail hundreds of jobs. [20:00]
HCross2: ^ I've had that [20:00]
JensRex: Outstanding jobs is nearly always some absurd number.
Me too, during Yahoo days.
Yahoo banned me. Client was dumb and kept pulling new work.
Distributed computing projects already do this. F@H jobs have a 1 week deadline or something. Client also knows this, and discards outdated work.
But I'm just some guy on the internet. I'm an electrician, not a coder. I don't know how much of a pain in the ass it is to create this feature.
You need a fire alarm installed, I'm your guy.
[20:00]
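[Editor's note] JensRex's suggestion (requeue a job automatically if it is not returned within N hours, and let the client discard work that is past its deadline) can be illustrated with a small in-memory sketch. This is not how the NewsGrabber tracker is implemented; the `Job` and `Tracker` classes, their method names, and `client_should_discard` are all hypothetical and exist only to show the idea.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    claimed_at: Optional[float] = None  # None means the job is waiting in the queue


class Tracker:
    """Hypothetical tracker that hands out jobs and requeues stale ones."""

    def __init__(self, timeout_hours: float) -> None:
        self.timeout_seconds = timeout_hours * 3600
        self.jobs: Dict[str, Job] = {}

    def add(self, job_id: str) -> None:
        self.jobs[job_id] = Job(job_id)

    def claim(self, now: Optional[float] = None) -> Optional[Job]:
        """Give the first unclaimed job to a client and record the claim time."""
        now = time.time() if now is None else now
        for job in self.jobs.values():
            if job.claimed_at is None:
                job.claimed_at = now
                return job
        return None

    def complete(self, job_id: str) -> None:
        """Remove a job once a client has returned it."""
        self.jobs.pop(job_id, None)

    def requeue_expired(self, now: Optional[float] = None) -> List[str]:
        """Put claimed jobs older than the timeout back in the queue."""
        now = time.time() if now is None else now
        expired = []
        for job in self.jobs.values():
            if job.claimed_at is not None and now - job.claimed_at > self.timeout_seconds:
                job.claimed_at = None
                expired.append(job.job_id)
        return expired


def client_should_discard(claimed_at: float, timeout_seconds: float, now: float) -> bool:
    """Client-side check: drop work whose deadline has already passed."""
    return now - claimed_at > timeout_seconds
```

A periodic call to `requeue_expired()` on the tracker would replace the manual requeue requests described in the log, and the client-side check keeps a slow or misconfigured client from returning work that has already been handed to someone else.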
HCross2: Can you come to the UK and run cat6 round my house please :p [20:04]
JensRex: I've wired entire hospitals with shielded Cat6 :)
I'd get into a huge wreck in less than 5 minutes in the UK, with the left hand driving.
[20:04]
HCross2: Need your mine or school hooked up with IT and backups.. I'm your man [20:07]
Igloo: Need your National Health Service network infrastructure building.. Hoi ;)
Do you need more firepower on this HCross2? I can deploy either a disco or grabber or whatever
[20:09]
HCross2: A discovery would be nice please
And the more grabbers the merrier
[20:10]
Igloo: Sure, Disco coming up shortly [20:10]
HCross2: I've been focusing on exotic discovery locations. We've got Singapore, Los Angeles, Luxembourg and Bangalore, India [20:11]
Igloo: Disco setup instructions?
New York, San Francisco, Amsterdam, Singapore, London, Frankfurt, Toronto, Bangalore?
[20:12]
HCross2: London would be nice - https://github.com/ArchiveTeam/NewsGrabber-Discovery/blob/master/README.md [20:13]
Igloo: Okie dokie [20:14]
HCross2: Create an "assigned_services" directory, make an rsync target and send that to me
I'm going to look at making an LXC container pre setup for this
[20:14]
Igloo: File "/home/newsgrabber/NewsGrabber-Discovery/log.py", line 22, in log [20:25]
HCross2: Also create a file called "target" with this in it "rsync://master.newsbuddy.net/incoming_urls"
Without the speech marks
[20:30]
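[Editor's note] The two filesystem steps HCross2 describes (an "assigned_services" directory and a "target" file containing the rsync URL) could be scripted roughly as below. The function name and the choice of base directory are hypothetical. The sketch assumes both entries live in the Discovery checkout's top-level directory, which the log does not state explicitly, and it does not cover setting up the rsync target that HCross2 asks to be sent.

```python
from pathlib import Path

# URL quoted in the log, written without the surrounding speech marks.
TARGET_URL = "rsync://master.newsbuddy.net/incoming_urls"


def prepare_discovery_dir(base: Path) -> None:
    """Create the directory and file a discovery node is asked to have.

    Assumption: both belong directly under the checkout directory `base`.
    """
    # Directory the master fills with the services assigned to this node.
    (base / "assigned_services").mkdir(parents=True, exist_ok=True)
    # File holding the rsync destination, written without a trailing newline
    # because the log does not say whether the script strips whitespace.
    (base / "target").write_text(TARGET_URL)


if __name__ == "__main__":
    # Path taken from the traceback line in the log.
    prepare_discovery_dir(Path("/home/newsgrabber/NewsGrabber-Discovery"))
```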
Igloo: Done
178.62.4.171/NewsBuddyDisco
will be ready in a moment
[20:32]
HCross2: Thank you. Firing up my tablet and I'll add you
All done
[20:34]
Igloo: I've tested rsync and that works
I'll setup some grabbers shortly, If you let me know where you want them they'll be ready tomorrow night to run
[20:40]
...... (idle for 29mn)
HCross2: Wherever will do
Lots of smaller concurrent
[21:09]