Re-join cluster after auto-down
Periodic auto join should solve this.
on 2012-06-19 18:05 *
By Jonas Bonér
Good, thanks.
on 2012-06-21 10:52 *
By Patrik Nordwall
We must decide how partition healing is supposed to work. I don't think this auto re-join is the full story.
Assume that deputy nodes == seed nodes.
4 nodes: N1, N2, S1, S2
seed nodes: S1, S2
S1 and S2 crash
N1 and N2 form a new cluster
S1 and S2 are started again and connect to each other
How are N1 and N2 supposed to connect with S1 and S2 again?
Currently we do extra gossip to deputy nodes, which at the moment are the live members among the seed nodes.
Is it this deputy-node gossip that should handle that? I don't think the current gossip merge algorithm supports that.
We also do extra gossip to unreachable nodes. I'm not sure the result of that is what we want with the current implementation: the unreachable state would propagate over to the other cluster.
4 nodes: N1, N2, N3, N4
split into 2 parts
part1: N1-N2
part2: N3-N4
N1(unreachable = {N3,N4})
N3(unreachable = {N1,N2})
N1 gossipTo N3 => N3(unreachable = {N1, N2, N3, N4})
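The scenario above can be modeled in a few lines. This is not Akka code, just a minimal sketch assuming the unreachable sets are merged by plain union, which is what makes the unreachable state leak across the partition:

```python
def merge_gossip(own_unreachable, incoming_unreachable):
    """Naive gossip merge: union of both sides' unreachable sets.

    With this merge rule, gossiping across a partition does not heal
    it; it only spreads each side's suspicions to the other side.
    """
    return own_unreachable | incoming_unreachable

# Split into two parts: part1 = {N1, N2}, part2 = {N3, N4}
n1_unreachable = {"N3", "N4"}  # N1's view
n3_unreachable = {"N1", "N2"}  # N3's view

# N1 gossipTo N3: after the union merge, N3 considers every node
# in the cluster unreachable, including the members of its own part.
merged = merge_gossip(n3_unreachable, n1_unreachable)
print(sorted(merged))  # ['N1', 'N2', 'N3', 'N4']
```

This matches the outcome sketched above: N3 ends up with unreachable = {N1, N2, N3, N4}, which is why a smarter merge (or a separate healing mechanism) would be needed before cross-partition gossip can help.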
Another much simpler scenario:
What should happen if none of the seed nodes are available at startup? Should the singleton cluster retry joining the seed nodes periodically?
After discussion with Jonas and Björn we decided not to implement auto retry of join. If the initial join fails, nothing more is done automatically. The manual remedy is to restart the node or manually join another node. We don't do this automatically because joining two clusters is complicated when they have state (actors).
Updating tickets (#939, #940, #1941, #2213, #2214, #2215, #2219, #2222, #2223, #2239, #2240, #2249, #2250, #2252, #2253, #2254, #2256, #2259, #2263, #2264, #2265, #2267, #2270, #2271, #2275, #2277, #2286, #2287, #2289, #2290, #2303, #2304, #2308, #2310, #2311, #2317, #2323, #2331, #2374, #2392, #2405, #2423, #2425, #2440, #2444, #2445, #2453, #2456, #2459, #2473, #2477, #2491, #2495, #2523, #2534, #2541, #2544, #2545, #2549, #2582, #2583, #2589, #2626)
Updating tickets (#939, #940, #1941, #2081, #2126, #2213, #2214, #2215, #2219, #2222, #2223, #2239, #2240, #2249, #2250, #2252, #2253, #2254, #2256, #2259, #2263, #2264, #2265, #2267, #2270, #2271, #2275, #2277, #2286, #2287, #2289, #2290, #2303, #2304, #2308, #2310, #2311, #2317, #2323, #2331, #2374, #2392, #2394, #2405, #2408, #2423, #2424, #2425, #2440, #2444, #2445, #2449, #2453, #2456, #2459, #2461, #2473, #2477, #2485, #2491, #2495, #2498, #2501, #2505, #2515, #2517, #2523, #2534, #2541, #2544, #2545, #2549, #2582, #2583, #2588, #2589, #2598, #2599, #2618, #2623, #2626, #2627, #2630, #2631, #2633, #2634, #2635, #2637, #2638, #2642, #2643, #2646, #2647, #2648, #2649, #2650, #2653, #2655, #2657, #2658)