Gossip merge in large cluster
When state is changed at the same time on different members (such as detecting failures and reaping unreachable members), there will be gossip merge conflicts. That is expected, but it looks like these conflicts can't be resolved when using a large (e.g. 25 node) cluster. There is an ignored failing test in LargeClusterSpec that reproduces this.
I have tried the LargeClusterSpec with 25 nodes and also turned off reapUnreachable for all but 4 nodes. When shutting down 5 nodes, this never-ending merge conflict occurs.
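For illustration, here is a minimal Scala sketch of why simultaneous changes conflict: each member bumps its own entry in the gossip version, and two versions where neither dominates the other are concurrent and have to be merged. This is a simplified stand-in, not Akka's actual VectorClock.

// Simplified illustration of concurrent gossip versions; not Akka's actual VectorClock.
final case class VectorClock(versions: Map[String, Long] = Map.empty) {
  def bump(node: String): VectorClock =
    copy(versions + (node -> (versions.getOrElse(node, 0L) + 1L)))

  // true if every counter in this clock is <= the corresponding counter in `other`
  def <=(other: VectorClock): Boolean =
    versions.forall { case (n, v) => v <= other.versions.getOrElse(n, 0L) }

  // neither clock dominates the other => concurrent updates => merge conflict
  def conflictsWith(other: VectorClock): Boolean =
    !(this <= other) && !(other <= this)
}

object GossipConflictExample extends App {
  val base = VectorClock().bump("nodeA")  // state everyone has seen
  val a = base.bump("nodeA")              // nodeA detects a failure
  val b = base.bump("nodeB")              // nodeB reaps an unreachable member
  println(a conflictsWith b)              // true: the two gossips must be merged
}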
on 2012-06-29 11:21
By bjorn.antonsson@typesafe.com
When you say never ending, do you mean that they just gossip indefinitely, or is the merge code getting stuck?
on 2012-06-29 11:55
By Patrik Nordwall
They gossip indefinitely, or at least for as long as I waited (a few minutes).
on 2012-07-02 21:06
By Patrik Nordwall
Trying to resolve conflicts simultaneously at several nodes creates new conflicts.
Therefore the leader resolves conflicts, to limit divergence. To avoid overload there
is also a configurable rate limit on how many conflicts are handled per second.
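For illustration, a sketch of what such a rate limit could look like on the leader side; the class and numbers here are assumptions for the example, not the actual Akka code or configuration.

// Hypothetical sketch of rate-limiting conflict resolution on the leader.
final class ConflictRateLimiter(maxConflictsPerSecond: Int) {
  private val windowNanos = 1000000000L
  private var windowStart = System.nanoTime()
  private var handledInWindow = 0

  // Returns true if one more conflict may be resolved in the current one-second window.
  def tryAcquire(): Boolean = {
    val now = System.nanoTime()
    if (now - windowStart >= windowNanos) { windowStart = now; handledInWindow = 0 }
    if (handledInWindow < maxConflictsPerSecond) { handledInWindow += 1; true }
    else false
  }
}

The leader would call tryAcquire() before resolving each conflict and otherwise defer it to a later gossip round.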
I also found and "fixed" this problem when shutting down nodes:
Netty blocks when sending to broken connections. The ClusterHeartbeatSender actor
isolates sending to different nodes by using child workers for each target
address, thereby reducing the risk of irregular heartbeats to healthy
nodes due to broken connections to other nodes.
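For illustration, a simplified sketch of that isolation pattern with the classic actor API; actor and path names (e.g. "heartbeatReceiver") are assumptions, and this is not the actual ClusterHeartbeatSender source.

import akka.actor.{ Actor, ActorRef, Address, Props, RootActorPath }

final case class Heartbeat(from: Address)
final case class SendHeartbeat(to: Address)

// Parent: routes each heartbeat to a dedicated child per target address.
class HeartbeatSender(selfAddress: Address) extends Actor {
  private var workers = Map.empty[Address, ActorRef]

  def receive = {
    case SendHeartbeat(to) =>
      val worker = workers.getOrElse(to, {
        val w = context.actorOf(Props(new HeartbeatWorker(to)))
        workers += (to -> w)
        w
      })
      worker ! Heartbeat(selfAddress)
  }
}

// Child: a send that blocks on a broken connection only stalls this worker,
// so heartbeats to healthy nodes keep flowing on schedule.
class HeartbeatWorker(target: Address) extends Actor {
  def receive = {
    case hb: Heartbeat =>
      context.actorSelection(RootActorPath(target) / "user" / "heartbeatReceiver") ! hb
  }
}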
Updating tickets (#939, #940, #1941, #2081, #2126, #2213, #2214, #2215, #2219, #2222, #2223, #2239, #2240, #2249, #2250, #2252, #2253, #2254, #2256, #2259, #2263, #2264, #2265, #2267, #2270, #2271, #2275, #2277, #2286, #2287, #2289, #2290, #2303, #2304, #2308, #2310, #2311, #2317, #2323, #2331, #2374, #2392, #2394, #2405, #2408, #2423, #2424, #2425, #2440, #2444, #2445, #2449, #2453, #2456, #2459, #2461, #2473, #2477, #2485, #2491, #2495, #2498, #2501, #2505, #2515, #2517, #2523, #2534, #2541, #2544, #2545, #2549, #2582, #2583, #2588, #2589, #2598, #2599, #2618, #2623, #2626, #2627, #2630, #2631, #2633, #2634, #2635, #2637, #2638, #2642, #2643, #2646, #2647, #2648, #2649, #2650, #2653, #2655, #2657, #2658)