Ignore gossip from unreachable

(No description)

on 2012-06-25 08:19 *

By Patrik Nordwall

Status changed from New to Accepted

on 2012-06-25 09:27 *

By Patrik Nordwall

Status changed from Accepted to Test

on 2012-06-25 15:37 *

By Patrik Nordwall

Status changed from Test to Fixed

on 2012-06-25 16:41 *

By Peter Vlugter

Doesn't gossip from an unreachable node automatically make it reachable again?

on 2012-06-25 17:16 *

By Jonas Bonér

No, it has to be DOWNed and then JOIN again.
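
For illustration, the behaviour in the ticket title can be sketched as below. This is a minimal model and not the actual cluster code: `NodeState`, `receive_gossip`, and the `seen_by` field are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Hypothetical local view of the cluster held by one node."""
    members: set = field(default_factory=set)
    unreachable: set = field(default_factory=set)
    seen_by: set = field(default_factory=set)


def receive_gossip(state: NodeState, sender: str, remote_seen: set) -> bool:
    """Merge gossip from `sender`, unless the sender is marked unreachable.

    Returns True if the gossip was merged and False if it was ignored.
    An ignored sender stays unreachable; in this model it only comes back
    by being downed and joining again.
    """
    if sender in state.unreachable:
        return False
    state.seen_by |= remote_seen
    return True
```

With `unreachable = {"b"}`, a call such as `receive_gossip(state, "b", {"b"})` leaves the state untouched, so receiving gossip alone never moves `b` back to `members`.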

on 2012-06-25 17:51 *

By Peter Vlugter

Okay. So auto-down and auto-rejoin become much more important if the cluster wants to be resilient to transient network issues.

on 2012-06-26 02:16 *

By Patrik Nordwall

We talked about auto re-join yesterday and decided not to implement it for now, see #2252.
If auto-down followed by a re-join does the same thing as a manual fresh reboot would, I think it would be possible to implement auto re-join safely, but using re-join to try to heal network partitions (joining two clusters) is scary business.

on 2012-06-26 02:31 *

By Peter Vlugter

Yes, that's why I wondered why unreachable nodes can't become reachable again, to allow for transient network partitions.

on 2012-06-26 03:28 *

By Patrik Nordwall

I have only been thinking along the lines Jonas described, that a node must be downed and join again after becoming unreachable.
The failure detector has a configurable acceptable-heartbeat-pause, which allows for short transient network partitions.

I wonder what the consequences would be if we continued to send heartbeats to unreachable nodes and did the reverse of reapUnreachableMembers to detect that an unreachable node has become available again (moving it from unreachable back to members).
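
A rough sketch of that "reverse reap" idea, to make the question concrete. Only the name reapUnreachableMembers comes from this thread. The `Membership` class, its two methods, and the `is_available` callback (standing in for the failure detector) are assumptions made for illustration and do not reflect the real implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Membership:
    """Hypothetical membership view: reachable members and unreachable nodes."""
    members: set = field(default_factory=set)
    unreachable: set = field(default_factory=set)

    def reap_unreachable(self, is_available: Callable[[str], bool]) -> set:
        """Move members that the failure detector reports as unavailable to `unreachable`."""
        lost = {node for node in self.members if not is_available(node)}
        self.members -= lost
        self.unreachable |= lost
        return lost

    def reap_reachable(self, is_available: Callable[[str], bool]) -> set:
        """The reverse step: move unreachable nodes that respond again back to `members`.

        This only works if heartbeats keep being sent to unreachable nodes,
        otherwise `is_available` has no fresh information about them.
        """
        recovered = {node for node in self.unreachable if is_available(node)}
        self.unreachable -= recovered
        self.members |= recovered
        return recovered
```

Note that this sketch has no upper bound on how long a node may stay unreachable before it is taken back, which is exactly the open question.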

on 2012-06-26 04:10 *

By Jonas Bonér

I might be overly cautious. I just think it sounds scary to let an unreachable node continue as if nothing has happened. It might have been down for minutes or hours, and the world has moved on. If we allow it to just continue, then we need to have a timeout window of some sort. The idea was, as Patrik says, to use the 'acceptable-heartbeat-pause' window to allow for transient network partitions. But I am not 100% sure what is best. WDYT?

on 2012-06-26 04:33 *

By Jonas Bonér

So my point is:
1. To allow this we need to have a timeout window (in Terracotta we had this and the default was 1 minute).
2. We already have this timeout window in the 'acceptable-heartbeat-pause' FD option, which is there just to allow for transient network partitions.
3. So, what is the difference in practice? Is it not better to just have a single knob to turn?
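
To illustrate the "single knob" argument, here is a deliberately simplified deadline-style detector. It is not the actual failure detector. The class name, the `heartbeat_interval` parameter, and the use of plain seconds are assumptions for this sketch; only the option name acceptable-heartbeat-pause is taken from the discussion.

```python
class DeadlineFailureDetector:
    """Simplified detector: a node is available until its heartbeat is overdue.

    A heartbeat is overdue once more than
    heartbeat_interval + acceptable_heartbeat_pause seconds have passed
    since the last one. The pause is the single knob that decides how long
    a transient network partition is tolerated.
    """

    def __init__(self, heartbeat_interval: float, acceptable_heartbeat_pause: float):
        self.deadline = heartbeat_interval + acceptable_heartbeat_pause
        self.last_heartbeat = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` arrived at time `now`."""
        self.last_heartbeat[node] = now

    def is_available(self, node: str, now: float) -> bool:
        """Nodes never heard from are treated as available (not yet monitored)."""
        last = self.last_heartbeat.get(node)
        if last is None:
            return True
        return now - last <= self.deadline
```

With an interval of 1 second and a pause of 3 seconds, a node whose last heartbeat arrived at t = 0 is still available at t = 3.5 and unavailable at t = 4.5. A separate "allow unreachable to come back" window would play the same role as a larger pause.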

on 2012-06-26 04:37 *

By Patrik Nordwall

I agree. Let's start like that.
For reference, I created a new ticket for auto-restart, see #2273.

on 2012-06-26 17:24 *

By Peter Vlugter

Having any unreachable nodes stops the (cluster) world from moving on. It prevents convergence, so no membership or actor partitioning changes are possible. The down action allows the world to move on, and a downed node always goes back through the joining process for this reason. But maybe we just no longer have the original unreachable/down distinction? If a node gets marked unreachable by the failure detector, then it's always on a path to down.
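
The convergence rule described above can be sketched as follows. This is an assumption-laden model for illustration: `has_converged`, `down`, and the `seen_by` set are hypothetical names, and the real convergence check may differ in detail.

```python
def has_converged(members: set, unreachable: set, seen_by: set) -> bool:
    """Convergence in this model: no unreachable nodes, and every member
    has seen the current gossip state."""
    return not unreachable and members <= seen_by


def down(unreachable: set, node: str) -> set:
    """Downing removes the node from the unreachable set, which unblocks
    convergence. The node has to go through joining again to come back."""
    return unreachable - {node}
```

For example, with `members = {"a", "c"}`, `seen_by = {"a", "c"}` and `unreachable = {"b"}`, `has_converged` is false until `b` is downed, so no leader actions can happen in between.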

on 2012-06-27 03:52 *

By Patrik Nordwall

I think the distinction is still valid when using manual admin (auto-down=off). Then it is good that unreachable prevents convergence, so that leader actions are not performed until a human has decided what to do.

on 2012-06-27 04:58 *

By Peter Vlugter

Yes, the distinction is important. Getting back to the beginning of this conversation: if auto-down is off, then it could be useful to have an unreachable node that starts to gossip become reachable again. While a node is unreachable, the cluster won't change through leader actions, and this allows transient network failures to correct themselves without user intervention. There's probably a big gap for me between thinking about it last year and the changes since, so don't worry if things are different now.

on 2012-06-27 10:28 *

By Jonas Bonér

Status changed from Fixed to New

Yep. I agree. That could work. But I still think there should be a timeout that draws the line. Reopening the ticket until we have discussed this and reached a conclusion.

on 2012-07-03 09:51 *

By Patrik Nordwall

Status changed from New to Fixed

Discussed again: we rely on acceptable-heartbeat-pause for transient network failures. We can investigate whether a manual operation for moving an unreachable node back to reachable is possible.