Improve efficiency of the FD heartbeat
Instead of letting every node send heartbeats to every other node, we could let each node send heartbeats only to the N nodes that follow it in the node ring. Since every node does this, the monitored window shifts one step at a time around the ring, so the whole cluster is covered, including any jumps between racks or data centers.
If a node is marked as unreachable then that information is gossiped out (to the whole cluster) anyway through the regular gossip protocol.
I can't see why this scheme would be any less effective at detecting failures, but it would be a lot more efficient in terms of resources.
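To make the proposal concrete, here is a minimal sketch in Python. It is not code from Akka: the name `heartbeat_targets`, the list-based ring, and the node names are assumptions made only for illustration. It picks the N successors of a node and compares the number of heartbeats per round with the all-to-all scheme.

```python
def heartbeat_targets(ring, node, n):
    """Return the (up to) n nodes that follow `node` in ring order.

    `ring` is an ordered list of node names (hypothetical representation).
    """
    i = ring.index(node)
    # A node never heartbeats to itself, so cap n at the number of other nodes.
    count = min(n, len(ring) - 1)
    return [ring[(i + k) % len(ring)] for k in range(1, count + 1)]


ring = ["a", "b", "c", "d", "e"]
n = 2

# Who each node sends heartbeats to.
plan = {node: heartbeat_targets(ring, node, n) for node in ring}

# How many senders each node receives heartbeats from.
incoming = {node: 0 for node in ring}
for targets in plan.values():
    for target in targets:
        incoming[target] += 1

ring_messages = sum(len(targets) for targets in plan.values())
all_to_all_messages = len(ring) * (len(ring) - 1)
```

With 5 nodes and N = 2, every node still receives heartbeats from 2 others, while the number of heartbeats per round drops from 20 to 10. In general the cost is N messages per node instead of one per other cluster member.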
Thoughts?
on 2012-06-28 08:02 *
By Patrik Nordwall
That is the same idea that I had with ticket #2283.
Perhaps we should use the deputy nodes in the algorithm also, typically in different datacenters.
When rebalancing (changing buddies) you need to tell the monitor that you will stop heartbeating. Changing buddies for heartbeating shouldn't be done too often, because it resets the heartbeat history.
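The rebalancing concern can be illustrated with a small sketch, again in Python and again not taken from Akka. The names `ring_targets` and `target_changes` are hypothetical. It compares a node's heartbeat targets before and after a membership change, which gives the set of monitors that would need a "stop heartbeating" notice and the set of new monitors that start with an empty heartbeat history.

```python
def ring_targets(ring, node, n):
    """Set of the (up to) n nodes that follow `node` in ring order."""
    i = ring.index(node)
    count = min(n, len(ring) - 1)
    return {ring[(i + k) % len(ring)] for k in range(1, count + 1)}


def target_changes(old_ring, new_ring, node, n):
    """Return (stopped, started) heartbeat targets for `node`.

    `stopped`: former targets that no longer receive heartbeats from `node`.
    `started`: new targets, which have no heartbeat history for `node` yet.
    """
    old = ring_targets(old_ring, node, n)
    new = ring_targets(new_ring, node, n)
    return old - new, new - old


old_ring = ["a", "b", "c", "d", "e"]
new_ring = ["a", "b", "bb", "c", "d", "e"]  # node "bb" joined between b and c

stopped, started = target_changes(old_ring, new_ring, "a", 2)
unchanged = target_changes(old_ring, new_ring, "d", 2)
```

In this example node `a` stops heartbeating to `c` and starts heartbeating to `bb`, while node `d` keeps both of its targets. A real implementation would also have to skip the notice for targets that have left the cluster; that case is left out here.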
on 2012-06-28 08:17 *
By Jonas Bonér
Funny, since the idea came to me while walking back from lunch, before I had seen your ticket.
on 2012-10-07 23:48 *
By Patrik Nordwall
Assigned to set to Patrik Nordwall
Status changed from New to Test
Since this is the number one known scalability bottleneck, I took a stab at it. I use consistent hashing instead of ring order, since it has better rebalancing characteristics. https://github.com/akka/akka/pull/787
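For readers unfamiliar with the technique, the following Python sketch shows one way to select heartbeat targets from a hash ring. It is only an illustration under stated assumptions and does not reproduce the code in the pull request: the SHA-1 based `position` function, the helper names, and the node addresses are all hypothetical.

```python
import hashlib


def position(node):
    """Place a node on the hash ring (assumption: first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(node.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hashed_ring(nodes):
    """Order nodes by their position on the hash ring."""
    return sorted(nodes, key=position)


def heartbeat_targets(nodes, node, n):
    """Return the (up to) n nodes that follow `node` clockwise on the hash ring."""
    ring = hashed_ring(nodes)
    i = ring.index(node)
    count = min(n, len(ring) - 1)
    return [ring[(i + k) % len(ring)] for k in range(1, count + 1)]


nodes = ["10.0.0.%d:2552" % i for i in range(1, 7)]  # hypothetical addresses
n = 2

before = {node: heartbeat_targets(nodes, node, n) for node in nodes}

# A new node joins; count how many existing nodes get a different target list.
joined = nodes + ["10.0.1.1:2552"]
after = {node: heartbeat_targets(joined, node, n) for node in nodes}
changed = [node for node in nodes if before[node] != after[node]]
```

In this sketch a joining node only affects the nodes immediately before its position on the hash ring, so `changed` contains at most `n` entries and all other nodes keep their heartbeat history intact.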
on 2012-10-15 02:54 *
By Patrik Nordwall
Milestone changed from Coltrane to 2.1-RC1
Status changed from Test to Fixed