arkiver: I installed ia on the box, ready when you are
ah yeah, let's test now
started
jrwr: data is now being added
nice
the added items can be found in /home/ark/NewsGrabber-Deduplication-Feeder/indexed
cool
the script is already working on the frontend
yes, it's adding entries
how is everything holding up?
this is going to take a bit of time
Good
http://163.172.138.207/18ceeefe89d42e22ea26d10582d4a6b87f48ff71a7365af36ce4c2310e25c52a
Right now we are at 116k keys
hmm
sure?
that's low
or wait
it doesn't add revisit records
I'll do some testing locally
167k now
it's looking good
will add this to the warrior project now
ok
jrwr: are you fine with me hardcoding 163.172.138.207 in the script?
Ya
Oh nuts
http://163.172.138.207/status
it's working great in my test with one URL
Good
124143 done + 19297 out + 1117099 to do
are those numbers correct?
I would add a small cache on the pipeline
keep a few hot keys in memory
what hot keys do you mean?
ah, sorry
I see
URLs that are often requested
if it's within the same job, it shouldn't pull more than once from the dedupe
so it doesn't have to peg every single time for /favicon
Would hot keys be stuff like the BBC logo etc? :)
ya
Mostly just a small local cache
I see, yes that would be good to do
I first want to get this new version started, and then look into local caching for the warrior
if that's fine with you
jrwr: is it correct that the requests number on http://163.172.138.207/status shows the number of times http://163.172.138.207/status is requested?
ya
since it's "totals"
ok cool
scripts are updated!!!
so exciting :D
HCross2: Do you think we should start fresh?
since we have a very large backlog
now that we should be faster, it might be nice to start with 0 to-do items again
I don't see why not
clearing items now
all gone
newsbuddy being restarted
Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot
uh ooh
sorry, have to clean something up
Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot
there we go
Also increasing this to 50 URLs/item
actually no, the current average item size is already pretty high
we might move to 50 later on if needed
Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot
afk 2h
ok
everything should be running fine
arkiver: so it's about an hour and a bit until we see results now
yes
exciting :D
on the warrior page, search for "&& install -y" - should be "&& apt-get install -y"
*warrior github page
thanks, fixed
started :D
Turning things on now
arkiver: think we can run this as the default project?
yes, it's the default now
what's a good concurrency for E3's with HT?
trvz: try 20 and see what CPU performance looks like
Ah, here comes the storm
look at dem stats
It'll use a lot of CPU initially as everything will be grabbing at once, but it'll settle into a rhythm
I've got a few at Online.net, is that network fine with you?
Sure
It's where I am
All you really need is moderately good throughput to OVH
trvz: dedupe is at online.net Paris
so jrwr: my rsync ingest is "having fun"
Bwhahaha
What kind of "Load"?
so I need a backbone that isn't congested to peers who offer free peering
jrwr: 0.11 load now.. but wait until we megawarc and upload
Ya
That's what killed my EG-16
Dedupe is handling it well
arkiver: we've got 7.7 TB of something on this server
the cache is already warm!
HCross2: hmm, strange
maybe old projects
No HTTP response received from tracker. The tracker is probably overloaded. Retrying after 60 seconds...
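The "small local cache" idea discussed above just means keeping a handful of hot dedupe keys in memory on the warrior side, so URLs that recur within a single job (favicons, shared site logos like the BBC one) only hit the dedupe server once. Below is a minimal Python sketch of that, assuming the pipeline asks the dedupe server over HTTP; the GET /<digest> path, function names, and cache size are illustrative guesses based on the URLs pasted in the channel, not the project's actual pipeline code.

    import requests
    from functools import lru_cache

    # Dedupe server hardcoded in the script, as agreed above.
    DEDUPE_SERVER = "http://163.172.138.207"

    @lru_cache(maxsize=1024)  # small in-memory cache of hot keys for this job
    def dedupe_lookup(payload_digest):
        """Ask the dedupe server whether this payload digest has been seen.

        Because the result is cached, a digest that recurs within the same
        job (e.g. /favicon.ico or a shared site logo) only pegs the server
        once.  The GET /<digest> path is assumed from the URL pasted in the
        channel; the real warrior code may query it differently.
        """
        resp = requests.get("{}/{}".format(DEDUPE_SERVER, payload_digest), timeout=10)
        resp.raise_for_status()
        body = resp.text.strip()
        return body or None  # assumed: empty body means "not seen before"

An lru_cache is enough here because the warrior only needs to avoid repeated lookups within one job; cross-job deduplication still lives on the server.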
haha, we're doing a few hundred URLs/second
that kcaj kid is bad news
arkiver: trying to track down the disk usage now
ok
the dedupe server is doing amazing http://163.172.138.207/status
Upload just started spewing hundreds of Mbps towards the Archives
jrwr: it's crazy fast
Bwhahahha
I made fast things
It's what I do
yes :)
oh man, traffic just doubled!
weee
arkiver: we've still got over 3 TB of Panoramio
really...
Yep
I'll get that uploaded now
why did the items to do drop to 30k?
And a fair bit of FTP government
trvz: requeueing data
we have a new dedupe we are using, and new code
oh and we removed old items
wanted to start clean
but it'll go up again?
arkiver: fair bit of Google Code in /var/www
And photosynyh
Photosynth
nice
I will admit, the disks at Scaleway do have really high IOPS
jrwr: I've got some NVMe stuff in London that may be better
We are at 1.01 load
handling all requests at about 1ms
nginx is spending more time writing to the TCP stack than reading from disk/RAM
We need a custom TCP stack now :p
arkiver wasn't kidding on the number of keys, I'm glad I tested with 100 million
how many do we have now?
HCross2: I moved the default project to somewhere else again. I tried it myself and the warrior didn't run it
2171742
sounds good
Oh, your values are larger than my test set
that's fine, I'll monitor
46GB of swap should do the trick :)
:D
Very soon we'll have a Google-sized datacentre.. just for this
haha
everything is kept in redis for a month
if not requested in that month, it is removed again from redis
else it stays for another month
starting from the last request for the data
It's last update, not last request
"you won't believe how big a percentage of our archived news articles are plain clickbait titles that should be purged forever"
Well, I'm pretty sure titles->content
jrwr: this should set the expiration to another month, right? https://github.com/ArchiveTeam/NewsGrabber-Deduplication-Feeder/blob/master/indexer.py#L75-L77
expire date*
ya
this new deduplication is pretty cool :)
Hecatz: yes :)
http://163.172.138.207/stats.php
top number is the number of keys
the rest is normal redis stats
jrwr: can we increase the max file size for https://wiki.newsbuddy.net/Special:Upload to 40 MB?
Ya
Give me a moment
hmm
I created an account and checked the box to send a random password
never got a mail
it was just an account for the screencapture bot
HCross2: stop stealing my internet points
there's room for plenty more warriors
not much stealing happening at the moment :P
ah yes
my OVH box is already screaming at me, perfect
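The month-long retention described above comes down to giving each key a roughly 30-day TTL when it is written and resetting that TTL on every update, so entries that keep getting re-indexed stay alive while stale ones expire on their own. Here is a rough redis-py sketch of that behaviour, assuming the indexer stores one value per payload digest; the key layout and constant are illustrative and not lifted from indexer.py.

    import redis

    ONE_MONTH = 30 * 24 * 60 * 60  # retention window in seconds, per the discussion above

    r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)

    def index_entry(digest, record_info):
        """Store (or refresh) a dedupe entry and push its expiry out a month.

        SETEX writes the value and (re)sets the TTL in one call, so an entry
        that is updated again within the month survives, and one that is not
        is dropped by redis automatically -- expiry is driven by the last
        update, not the last lookup.  Key naming here is an assumption.
        """
        r.setex(digest, ONE_MONTH, record_info)

Plain reads from the warrior side never touch the TTL, which matches the "last update, not last request" correction above.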