Make quarantine permanent
(No description)
on 2013-12-06 09:44
By Patrik Nordwall
I think this will be fixed by #3765, changing the default to something like 5 days.
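(For readers arriving later: what eventually shipped keeps the quarantine itself permanent per remote-system incarnation, and the configurable period only controls how long the quarantine marker is retained in memory. A hedged sketch, assuming the prune-quarantine-marker-after setting from akka-remote's reference configuration:)

```scala
import com.typesafe.config.ConfigFactory

// Quarantine is permanent for a given incarnation of a remote system;
// this setting only controls how long the tombstone for a quarantined
// system is kept around before being pruned from memory.
val config = ConfigFactory.parseString(
  "akka.remote.prune-quarantine-marker-after = 5 d"
).withFallback(ConfigFactory.load())
```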
Say we have actor systems A and B. Both have fairly long watch-failure-detector.acceptable-heartbeat-pause settings (say 30s). But even then, actor system A has a long GC pause that causes B's death watch to fire Terminated for A's actors. How do A and B re-establish communication with each other now that quarantine is permanent?
Other details: A is a server application with a lot of state in its actors. It is not realistic to restart the application or the actor system that it is running. Some of its clients are relatively stateless, other clients have stateful actors.
on 2013-12-08 11:42
By Patrik Nordwall
When using remote death watch, the failure detector's decision is final, and there is no way back.
Changing the default values also includes increasing the default acceptable-heartbeat-pause.
A GC pause of 30 seconds is a strong indication that something should be re-designed in your application. If your application must tolerate that much unresponsiveness, it cannot also detect failures within the same order of magnitude of time. There is no way to distinguish a dead node from an unresponsive node in a distributed system. Increasing the acceptable-heartbeat-pause is the simple answer. You can still define a different quarantine period and completely change the semantics of Terminated, but then you should stay away from remote deployed actors.
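For reference, a minimal sketch (in Scala, assuming akka-remote on the classpath) of raising the tolerance discussed above; the 60-second value is an arbitrary illustration, not a recommendation:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Raise the remote death-watch failure detector's tolerance so that long
// pauses (GC, very large messages) are less likely to trigger Terminated.
// The trade-off: genuine failures also take at least this long to detect.
val config = ConfigFactory.parseString(
  "akka.remote.watch-failure-detector.acceptable-heartbeat-pause = 60 s"
).withFallback(ConfigFactory.load())

val system = ActorSystem("A", config)
```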
A side note: when using cluster death watch we have more possibilities. There (in 2.3) failure detection is only an indication that something is wrong, and the node might come back; the final decision that triggers Terminated is made by removing the member from the cluster (via the down operation).
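To make the cluster variant concrete, a rough sketch (assuming Akka Cluster with the 2.3-era API; the address is a hypothetical placeholder). Terminated for the unreachable node's actors fires only once the member is actually downed:

```scala
import akka.actor.{ActorSystem, Address}
import akka.cluster.Cluster

// With cluster death watch, an unreachable member is only a suspicion;
// the final decision is taken by downing it (here: manually).
val system = ActorSystem("ClusterSystem")
val unreachable = Address("akka.tcp", "ClusterSystem", "10.0.0.2", 2552)
Cluster(system).down(unreachable) // triggers Terminated for watched actors
```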
on 2013-12-09 14:33
By viktorklang
At some point, important system signals have been lost, and after that the node just cannot be allowed back.
Or irreversible decisions have been made (the signalling of Terminated). The point is, if you want to tolerate 30s delays, then you have to set the FD accordingly. But no matter what, after a while a decision has to be made if there are no signs of life from the remote node.
In the short term we have set acceptable-heartbeat-pause = 30s and that seems to have solved the problem. We're also investigating the root cause of why death watch was firing. It actually wasn't GC (I was just using that as an example; it may have been due to very large messages).
Is restarting one of the actor systems the alternative to very large acceptable-heartbeat-pause values? (I have read that in a few places but wasn't sure if it is an official recommendation.) For example, if A has a lot of state in its actors then we can restart B when it gets a Terminated from A. But restarting A would be painful because of all that state (think of it as the middle of the error kernel).
As for restarting: is the reason for the restart that, in this example, A's actor/remoting system has state about B that is now invalid/corrupt? If that is the case, would it be possible to purge A's actor/remoting system of the data relating to B, so it would be in a state as if B had never been seen in the first place (for example at the end of the quarantine period)? Or is it more fundamental?
I'm aiming for the least invasive/lowest impact way of dealing with Termination, where restarting A or getting humans involved are the worst case scenarios.
Thanks!
The issue is fundamental. The whole idea of using remote death watch is to be notified when the other system is down. Unfortunately "downness" cannot be reliably detected if the other side does not respond. Is it really down? Is it only a networking problem? No matter what, eventually the system HAS to make a decision (since you asked it to watch the other one) that "I now consider the other system down". You can configure this time to be as long as you want, but since you asked for remote death watch there has to be a decision deadline -- an infinite timeout does not make sense, because it is the equivalent of not watching at all. Once the system has decided that the other is down, and signaled Terminated, it cannot allow that other system to come back EVER -- because then all the actors that have been signaled as Terminated might pop back into existence -- very bad.
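To illustrate the "restart B on Terminated" strategy asked about above under this constraint, a minimal sketch (Scala, assuming a recent Akka classic version; the path handling and the decision to shut the whole system down are assumptions, not an official recommendation):

```scala
import akka.actor.{Actor, ActorIdentity, ActorRef, Identify, Terminated}

// B resolves and watches a remote actor on A. Because quarantine is
// permanent, once Terminated arrives the simplest recovery for the
// stateless side is to shut down and restart its whole ActorSystem.
class RemoteWatcher(remotePath: String) extends Actor {
  // Resolve the remote actor via Identify, then watch it.
  context.actorSelection(remotePath) ! Identify("watch")

  def receive: Receive = {
    case ActorIdentity("watch", Some(ref)) =>
      context.watch(ref)
      context.become(watching(ref))
    case ActorIdentity("watch", None) =>
      context.stop(self) // remote actor not found; handle as appropriate
  }

  def watching(ref: ActorRef): Receive = {
    case Terminated(`ref`) =>
      // A is now quarantined from our point of view; restarting this
      // system creates a fresh incarnation that can reconnect to A.
      context.system.terminate()
  }
}
```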
The other possible cause of quarantine is corrupted system state (system messages undelivered for a long time, mismatched sequence numbers between the two systems, etc.). In this case nothing can be done: no system-message traffic can be trusted anymore, and the two systems must isolate themselves from each other. This scenario is rare, and ActorSystems usually do not send system messages to each other unless you remote deploy actors or use remote watch.
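For context on that last point: plain message sends do not create system-message traffic; remote deployment (sketched below; the host, port, and EchoActor are hypothetical placeholders) and remote watch are what open the system-message channel that quarantine protects:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class EchoActor extends Actor {
  def receive: Receive = { case msg => sender() ! msg }
}

// Deploying an actor onto a remote system B makes A and B exchange
// system messages (supervision, lifecycle), so a quarantine between
// them becomes possible.
val config = ConfigFactory.parseString("""
  akka.actor.deployment {
    /echo {
      remote = "akka.tcp://B@10.0.0.2:2552"
    }
  }
""").withFallback(ConfigFactory.load())

val system = ActorSystem("A", config)
val echo = system.actorOf(Props[EchoActor], "echo") // runs on system B
```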
on 2014-01-16 11:26
By Patrik Nordwall
Assigned to set to drewhk
Status changed from New to Accepted
Fixed with #3765, see: https://github.com/akka/akka/pull/1886