Gossip merge in large cluster
When state is changed at the same time on different members (such as detecting failures and reaping unreachable members), there will be gossip merge conflicts. That is expected, but it looks like these conflicts can't be resolved when using a large (e.g. 25 node) cluster. There is an ignored failing test in LargeClusterSpec that reproduces this.
I have tried the LargeClusterSpec with 25 nodes and also turned off reapUnreachable for all but 4 nodes. When shutting down 5 nodes, this never-ending merge conflict occurs.
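For illustration, here is a minimal Scala sketch of why simultaneous changes conflict: each member bumps its own entry in the gossip version, and two versions where neither dominates the other are concurrent and have to be merged. This is a simplified stand-in, not Akka's actual VectorClock.

// Simplified illustration of concurrent gossip versions; not Akka's actual VectorClock.
final case class VectorClock(versions: Map[String, Long] = Map.empty) {
  def bump(node: String): VectorClock =
    copy(versions + (node -> (versions.getOrElse(node, 0L) + 1L)))

  // true if every counter in this clock is <= the corresponding counter in `other`
  def <=(other: VectorClock): Boolean =
    versions.forall { case (n, v) => v <= other.versions.getOrElse(n, 0L) }

  // neither clock dominates the other => concurrent updates => merge conflict
  def conflictsWith(other: VectorClock): Boolean =
    !(this <= other) && !(other <= this)
}

object GossipConflictExample extends App {
  val base = VectorClock().bump("nodeA")  // state everyone has seen
  val a = base.bump("nodeA")              // nodeA detects a failure
  val b = base.bump("nodeB")              // nodeB reaps an unreachable member
  println(a conflictsWith b)              // true: the two gossips must be merged
}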
on 2012-06-29 11:21
By bjorn.antonsson@typesafe.com
When you say never ending, do you mean that they just gossip indefinitely, or is the merge code getting stuck?
on 2012-06-29 11:55
By Patrik Nordwall
They gossip indefinitely, or at least for as long as I waited (a few minutes).
on 2012-07-02 21:06
By Patrik Nordwall
Trying to resolve conflicts simultaneously at several nodes creates new conflicts.
Therefore the leader resolves conflicts, to limit divergence. To avoid overload there
is also a configurable rate limit on how many conflicts are handled per second.
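For illustration, a sketch of what such a rate limit could look like on the leader side; the class and numbers here are assumptions for the example, not the actual Akka code or configuration.

// Hypothetical sketch of rate-limiting conflict resolution on the leader.
final class ConflictRateLimiter(maxConflictsPerSecond: Int) {
  private val windowNanos = 1000000000L
  private var windowStart = System.nanoTime()
  private var handledInWindow = 0

  // Returns true if one more conflict may be resolved in the current one-second window.
  def tryAcquire(): Boolean = {
    val now = System.nanoTime()
    if (now - windowStart >= windowNanos) { windowStart = now; handledInWindow = 0 }
    if (handledInWindow < maxConflictsPerSecond) { handledInWindow += 1; true }
    else false
  }
}

The leader would call tryAcquire() before resolving each conflict and otherwise defer it to a later gossip round.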
I also found and "fixed" this problem when shutting down nodes:
Netty blocks when sending to broken connections. The ClusterHeartbeatSender actor
isolates sending to different nodes by using child workers for each target
address, thereby reducing the risk of irregular heartbeats to healthy
nodes due to broken connections to other nodes.
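For illustration, a simplified sketch of that isolation pattern with the classic actor API; actor and path names (e.g. "heartbeatReceiver") are assumptions, and this is not the actual ClusterHeartbeatSender source.

import akka.actor.{ Actor, ActorRef, Address, Props, RootActorPath }

final case class Heartbeat(from: Address)
final case class SendHeartbeat(to: Address)

// Parent: routes each heartbeat to a dedicated child per target address.
class HeartbeatSender(selfAddress: Address) extends Actor {
  private var workers = Map.empty[Address, ActorRef]

  def receive = {
    case SendHeartbeat(to) =>
      val worker = workers.getOrElse(to, {
        val w = context.actorOf(Props(new HeartbeatWorker(to)))
        workers += (to -> w)
        w
      })
      worker ! Heartbeat(selfAddress)
  }
}

// Child: a send that blocks on a broken connection only stalls this worker,
// so heartbeats to healthy nodes keep flowing on schedule.
class HeartbeatWorker(target: Address) extends Actor {
  def receive = {
    case hb: Heartbeat =>
      context.actorSelection(RootActorPath(target) / "user" / "heartbeatReceiver") ! hb
  }
}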
Updating tickets (#939, #940, #1941, #2081, #2126, #2213, #2214, #2215, #2219, #2222, #2223, #2239, #2240, #2249, #2250, #2252, #2253, #2254, #2256, #2259, #2263, #2264, #2265, #2267, #2270, #2271, #2275, #2277, #2286, #2287, #2289, #2290, #2303, #2304, #2308, #2310, #2311, #2317, #2323, #2331, #2374, #2392, #2394, #2405, #2408, #2423, #2424, #2425, #2440, #2444, #2445, #2449, #2453, #2456, #2459, #2461, #2473, #2477, #2485, #2491, #2495, #2498, #2501, #2505, #2515, #2517, #2523, #2534, #2541, #2544, #2545, #2549, #2582, #2583, #2588, #2589, #2598, #2599, #2618, #2623, #2626, #2627, #2630, #2631, #2633, #2634, #2635, #2637, #2638, #2642, #2643, #2646, #2647, #2648, #2649, #2650, #2653, #2655, #2657, #2658)