allow transition from unreachable to reachable
What would the implications be of being able to manually declare that an unreachable node is in fact reachable and should become a member again?
I think it will at least break the current gossip merge algo.
I started to think about this. I see two challenges so far.
1. It will not be possible to just move it from the unreachable set back to the member set, because then we would not be able to merge conflicting gossip.
2. The failure detector and the current gossip of the node are screwed up, since other nodes have stopped heartbeating or the heartbeats really were lost.
What about a node coming back by itself? Consider auto-down=off, someone unplugs a network cable, node becomes UNREACHABLE, that previously clumsy someone reconnects the network (after crimping on a new connector), things start getting heart-beats again, everything should be dandy. No? What is missing to make that happen?
on 2012-12-07 09:19 *
By Jonas Bonér
In Terracotta we had a configurable timeout during which we allowed this, but after it timed out the node had to reset and join as new member.
This is definitely advanced usage. Simplest is just to say, reset and rejoin.
on 2012-12-10 11:44 *
By Patrik Nordwall
Summary from our discussion.
We agreed that it would be a useful feature.
Unreachable is not a Member state (status) in the same way as Joining, Up, Leaving, Down, Removed, etc, (and that is also the reason for current implementation with members and unreachable sets). Unreachable is something that is observed (measured). It's a "flag" that can be changed, including reviving unreachable back to normal.
If one member sends system messages to a node and doesn't get acknowledgement that they have been received, they will be buffered and eventually dropped. Once system messages have been dropped, the node that sent them will veto reviving the previously unreachable node back to reachable.
We would need to keep the "unreachable flag" per observer (member) instead of having a shared Set. This information could be used for deciding if it's alright to change unreachable to reachable.
We should have a configuration option for auto reviving, with an optional fallback to auto downing after a timeout.
Manual reviving could also make sense.
How to handle vector clock versions and merge conflicts is still an open question.
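One possible reading of the summary above, as a minimal sketch only: auto reviving applies when heartbeats have resumed and no observer has vetoed (the dropped system messages case), otherwise an optional timeout falls back to downing. The names ReviveSettings, RevivePolicy, decide and the Action values are hypothetical and are not existing Akka settings or API.

```scala
import scala.concurrent.duration._

// Hypothetical settings; these names are NOT existing Akka configuration keys.
final case class ReviveSettings(autoRevive: Boolean, autoDownAfter: Option[FiniteDuration])

sealed trait Action
case object Revive extends Action
case object KeepUnreachable extends Action
case object Down extends Action

object RevivePolicy {
  // vetoed = some member has dropped system messages destined for the node
  def decide(
      settings: ReviveSettings,
      heartbeatsResumed: Boolean,
      vetoed: Boolean,
      unreachableFor: FiniteDuration): Action =
    if (settings.autoRevive && heartbeatsResumed && !vetoed) Revive
    else if (settings.autoDownAfter.exists(unreachableFor >= _)) Down
    else KeepUnreachable
}
```

With autoDownAfter = None a vetoed node simply stays unreachable until someone downs it manually.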
on 2012-12-10 16:44 *
By Patrik Nordwall
What about this structure:
val reachableInfo: Map[Address, Set[ReachableInfo]] = Map.empty // as seen by each member
case class ReachableInfo(
  version: Long, // sequence number or timestamp, to be able to pick the latest entry when merging
  peer: Member, // the observed member
  reachableStatus: ReachableStatus)
sealed trait ReachableStatus
case object Unreachable extends ReachableStatus
case object Reachable extends ReachableStatus
case object Terminated extends ReachableStatus // not allowed back
reapUnreachableMembers would only change the entry of the self address based on failure detector verdict of monitored members, and increase the version number of that entry when changed. It will never change from Terminated, which should be used for the "dropped system messages case".
A member is considered reachable when all entries for that member are Reachable, i.e. every observer thinks it is reachable.
A member is considered unreachable when there is at least one entry for that member with Unreachable/Terminated, i.e. at least one observer thinks it is unreachable.
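The rules above could be sketched roughly as follows. This is an illustration under assumptions, not the cluster implementation: Address and Member are simplified stand-ins, and ReachabilitySketch, observe, isReachable and merge are hypothetical names. The merge assumes each observer only ever changes its own entries, so the highest version per observer and peer wins.

```scala
// Hypothetical stand-ins for the cluster's own Address and Member types.
final case class Address(host: String, port: Int)
final case class Member(address: Address)

sealed trait ReachableStatus
case object Reachable extends ReachableStatus
case object Unreachable extends ReachableStatus
case object Terminated extends ReachableStatus // not allowed back

final case class ReachableInfo(version: Long, peer: Member, reachableStatus: ReachableStatus)

object ReachabilitySketch {
  type Table = Map[Address, Set[ReachableInfo]] // observer -> its observations

  // Record the observer's own verdict; a Terminated entry is never changed.
  def observe(table: Table, observer: Address, peer: Member, status: ReachableStatus): Table = {
    val own = table.getOrElse(observer, Set.empty[ReachableInfo])
    own.find(_.peer == peer) match {
      case Some(old) if old.reachableStatus == Terminated || old.reachableStatus == status =>
        table
      case Some(old) =>
        table.updated(observer, own - old + ReachableInfo(old.version + 1, peer, status))
      case None =>
        table.updated(observer, own + ReachableInfo(1L, peer, status))
    }
  }

  // Reachable only if no observer reports Unreachable or Terminated.
  def isReachable(table: Table, peer: Member): Boolean =
    table.values.flatten.filter(_.peer == peer).forall(_.reachableStatus == Reachable)

  // Per observer and peer, keep the entry with the highest version.
  def merge(a: Table, b: Table): Table =
    (a.keySet ++ b.keySet).map { observer =>
      val entries =
        a.getOrElse(observer, Set.empty[ReachableInfo]) ++
          b.getOrElse(observer, Set.empty[ReachableInfo])
      observer -> entries.groupBy(_.peer).values.map(_.maxBy(_.version)).toSet
    }.toMap
}
```

Because only the observer bumps the version of its own entry, two gossip states never carry conflicting entries with the same version in this sketch.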
In this way it would still be possible to merge automatically, wouldn't it?
I'm not sure this is a good solution, or whether it covers what we want to do. Suggestions? Alternatives?
on 2013-03-11 10:24 *
By Patrik Nordwall
We should consider how this affects the decision of who is leader. Probably unreachable should not influence who is leader.
Also, the separate unreachable Set might not be needed.
on 2013-07-05 06:01 *
By Patrik Nordwall
on 2013-08-27 06:42 *
By Patrik Nordwall
Assigned to set to Patrik Nordwall
Status changed from New to Accepted