[07:10] Does the warrior work in VMware ESXi, and if so, what do I need to change in the OVF to make it load? [17:54] The tracker has been taking a hit lately. Would it be possible for the tracker to have an overloaded signal that causes a longer sleep time or have an exponential backoff on repeated rate limiting? [17:55] maybe where i is number of rate limits in a row do sleep i^2+random? [17:58] with a cap at 5 minutes + random, or so? [18:51] aaaaaaaaa: possibly, but then I think people would just remove it from the code [18:51] we really aren't actually falling over that badly; the 502 rate is up but the tracker still works [18:53] ok, Its just oomed a few times and was thrasing badly for awhile [18:54] so just an idea. [18:54] the OOM is actually Redis [18:54] when people load too many items into the tracker + there's a lot of stuff leftover in done-tracking sets [18:55] rate-limiting the warrior wouldn't address that issue -- for that we probably need to fix up our done-set draining scripts so that e.g. they run with proper supervision [18:55] oh, and redis is completely in memory. [18:55] yes [18:55] ok [18:56] you'd know better than me, but I thought the two were related. [18:56] fack [18:56] used_memory:778689520 [18:56] used_memory_human:742.62M [18:56] so it is thrashing? [18:56] aaaaaaaaa: sort of -- the more requests, the more pressure on the Ruby app, and the more memory it uses [18:57] but the biggest consumer of memory is definitely stuff in Redis [18:57] whats the stack like? [18:57] the C stack isn't 742 MB deep [18:57] it's jemalloc-allocated heap [18:57] oh, did you mean software stack [18:58] I did, could have been more specific, sorry ;) [18:58] the tracker is a Ruby app, written using Sinatra, and deployed using Passenger+nginx [18:58] datastore is Redis 2.6.something [18:59] my goto stack, not much experience with redis though [18:59] Passenger is configured to kill workers after they've handled 10000 requests [18:59] so even if we have leaks in the workers (doesn't seem like any major ones) they can't go too crazy [18:59] i still have the tracker i used for puush if we need it [18:59] I should remove that kill option [19:00] it just causes intermittent timeouts [19:01] you mean passenger killing workers results in delays while spinning up new ones? [19:01] yeah, if no other workers are available to service requests [19:01] when we hit high load that happens a lot [19:01] right, if there are no significant leaks, it might improve things a little [19:03] what kind of hardware is the tracker running on? (just curious, don't want to waste your time) [19:04] The tracker runs on a Linode 1 GB instance according to the wiki [19:07] it has 2 GB now, but yeah [19:07] still at linode [19:15] quite amazed by the throughput tbh :)