Re-join cluster after auto-down
Periodic auto join should solve this.
on 2012-06-19 18:05 *
By Jonas Bonér
Good, thanks.
on 2012-06-21 10:52 *
By Patrik Nordwall
We must decide how partition healing is supposed to work. I don't think this auto re-join is the full story.
Assume that deputy nodes == seed nodes.
4 nodes: N1, N2, S1, S2
seed nodes: S1, S2
S1 and S2 crash
N1 and N2 form a new cluster
S1 and S2 are started again and connect to each other
How are N1 and N2 supposed to connect with S1 and S2 again?
Currently we do extra gossip to deputy nodes, which at the moment are the live members among the seed nodes.
Is it this deputy-node gossip that should handle that? I don't think the current gossip merge algorithm supports that.
We also do extra gossip to unreachable nodes. I'm not sure the result of that is what we want with the current implementation: the unreachable state would propagate over to the other cluster.
4 nodes: N1, N2, N3, N4
split into 2 parts
part1: N1-N2
part2: N3-N4
N1(unreachable = {N3,N4})
N3(unreachable = {N1,N2})
N1 gossipTo N3 => N3(unreachable = {N1, N2, N3, N4})
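The scenario above can be modeled in a few lines. This is not Akka code, just a minimal sketch assuming the unreachable sets are merged by plain union, which is what makes the unreachable state leak across the partition:

```python
def merge_gossip(own_unreachable, incoming_unreachable):
    """Naive gossip merge: union of both sides' unreachable sets.

    With this merge rule, gossiping across a partition does not heal
    it; it only spreads each side's suspicions to the other side.
    """
    return own_unreachable | incoming_unreachable

# Split into two parts: part1 = {N1, N2}, part2 = {N3, N4}
n1_unreachable = {"N3", "N4"}  # N1's view
n3_unreachable = {"N1", "N2"}  # N3's view

# N1 gossipTo N3: after the union merge, N3 considers every node
# in the cluster unreachable, including the members of its own part.
merged = merge_gossip(n3_unreachable, n1_unreachable)
print(sorted(merged))  # ['N1', 'N2', 'N3', 'N4']
```

This matches the outcome sketched above: N3 ends up with unreachable = {N1, N2, N3, N4}, which is why a smarter merge (or a separate healing mechanism) would be needed before cross-partition gossip can help.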
Another much simpler scenario:
What should happen if none of the seed nodes are available at startup? Should the singleton cluster retry joining the seed nodes periodically?
After discussion with Jonas and Björn we decided not to implement auto retry of join. If the initial join fails, nothing more is done automatically. The manual remedy is to restart the node or manually join another node. We don't do this automatically because joining two clusters is complicated when they have state (actors).
Updating tickets (#939, #940, #1941, #2213, #2214, #2215, #2219, #2222, #2223, #2239, #2240, #2249, #2250, #2252, #2253, #2254, #2256, #2259, #2263, #2264, #2265, #2267, #2270, #2271, #2275, #2277, #2286, #2287, #2289, #2290, #2303, #2304, #2308, #2310, #2311, #2317, #2323, #2331, #2374, #2392, #2405, #2423, #2425, #2440, #2444, #2445, #2453, #2456, #2459, #2473, #2477, #2491, #2495, #2523, #2534, #2541, #2544, #2545, #2549, #2582, #2583, #2589, #2626)
Updating tickets (#939, #940, #1941, #2081, #2126, #2213, #2214, #2215, #2219, #2222, #2223, #2239, #2240, #2249, #2250, #2252, #2253, #2254, #2256, #2259, #2263, #2264, #2265, #2267, #2270, #2271, #2275, #2277, #2286, #2287, #2289, #2290, #2303, #2304, #2308, #2310, #2311, #2317, #2323, #2331, #2374, #2392, #2394, #2405, #2408, #2423, #2424, #2425, #2440, #2444, #2445, #2449, #2453, #2456, #2459, #2461, #2473, #2477, #2485, #2491, #2495, #2498, #2501, #2505, #2515, #2517, #2523, #2534, #2541, #2544, #2545, #2549, #2582, #2583, #2588, #2589, #2598, #2599, #2618, #2623, #2626, #2627, #2630, #2631, #2633, #2634, #2635, #2637, #2638, #2642, #2643, #2646, #2647, #2648, #2649, #2650, #2653, #2655, #2657, #2658)