REMOTE: Support trap exit of remote actor crash generically

Currently we only track trap exit when sending the reply message after a ? message.

Proposed solution:

A clustered actor has two parts:

1. The client/ClusterActorRef on the "client" cluster node
2. The remote instance(s) (that the client talks to) on another cluster node

Part 1 - The remote instance side of things
1. Create a hidden "system" Supervisor actor on the cluster node
2. Link the remote actor instance to this Supervisor
3. ZooKeeper (ZK) is used to set a Watch (ephemeral node) for the linked actor
4. The path for this Watch is sent back to the ClusterActorRef that represents this instance (see part 2 for how that is managed in the ClusterActorRef)
5. When the remote actor fails, the Watch should be removed (which will trigger the ClusterActorRef on the client side)
6. ... wait for ClusterActorRef to do its job
7. When the Supervisor receives a Restart message from the ClusterActorRef, it sends an Exit message to its linked actor to restart the remote actor.
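
The remote-side steps above could be modelled roughly as follows. This is only a dependency-free sketch: WatchRegistry is an in-memory stand-in for ZooKeeper ephemeral nodes (not the real ZK API), and SystemSupervisor, RemoteActor, the Restart/Exit case classes and the /watches/<node>/<actor> path layout are all hypothetical names chosen for illustration, not existing Akka code.

```scala
import scala.collection.mutable

// Hypothetical system messages; the real ones would live in ClusterProtocol.proto.
sealed trait SystemMessage
final case class Restart(actorId: String) extends SystemMessage
final case class Exit(actorId: String) extends SystemMessage

// In-memory stand-in for ZooKeeper ephemeral nodes (assumption, not the ZK API).
final class WatchRegistry {
  private val live = mutable.Set.empty[String]
  def create(path: String): Unit = live += path
  def exists(path: String): Boolean = live.contains(path)
  def remove(path: String): Unit = live -= path
}

// Simplified remote actor instance that counts how often it was restarted.
final class RemoteActor(val id: String) {
  var alive = true
  var restarts = 0
  def crash(): Unit = alive = false
  def handle(msg: SystemMessage): Unit = msg match {
    case Exit(_) => restarts += 1; alive = true // an Exit triggers the restart
    case _       => ()
  }
}

// Step 1: the hidden "system" Supervisor on the cluster node.
final class SystemSupervisor(nodeName: String, watches: WatchRegistry) {
  private val linked = mutable.Map.empty[String, RemoteActor]

  private def pathFor(actorId: String) = s"/watches/$nodeName/$actorId"

  // Steps 2-4: link the actor, create its Watch, return the path for the ClusterActorRef.
  def link(actor: RemoteActor): String = {
    linked(actor.id) = actor
    val path = pathFor(actor.id)
    watches.create(path)
    path
  }

  // Step 5: when the linked actor fails, its Watch is removed.
  def actorFailed(actorId: String): Unit = {
    linked.get(actorId).foreach(_.crash())
    watches.remove(pathFor(actorId))
  }

  // Step 7: a Restart from the ClusterActorRef becomes an Exit to the linked actor.
  def receive(msg: SystemMessage): Unit = msg match {
    case Restart(id) => linked.get(id).foreach(_.handle(Exit(id)))
    case _           => ()
  }
}
```

In this sketch the Watch is not re-created after the restart; whether it should be is left open by the steps above.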

Part 2 - The client/ClusterActorRef side of things
1. The ClusterActorRef receives the path to the Watch from the remote instance
2. It then adds a Listener to this path (Watch)
3. When the Watch is triggered and our Listener is notified, we send a Restart message for the remote actor to the "system" Supervisor on this node (go to part 1 item 7)
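
The client-side steps above could look something like the sketch below. It is self-contained and makes several assumptions: WatchRegistry is an in-memory stand-in for ZooKeeper (not the real ZK API), the ClusterActorRef shown here is a simplified model rather than the actual class, and sendToSupervisor is a hypothetical hook standing in for the "system" channel to the Supervisor.

```scala
import scala.collection.mutable

// Hypothetical system message sent to the "system" Supervisor.
final case class Restart(actorId: String)

// In-memory stand-in for ZooKeeper Watches (assumption, not the ZK API).
final class WatchRegistry {
  private val live = mutable.Set.empty[String]
  private val listeners = mutable.Map.empty[String, List[() => Unit]]

  def create(path: String): Unit = live += path

  // Register a callback that fires once, when the node at `path` is removed.
  def listen(path: String)(onRemoved: () => Unit): Unit =
    listeners(path) = onRemoved :: listeners.getOrElse(path, Nil)

  // Removing a live node triggers (and clears) its listeners.
  def remove(path: String): Unit =
    if (live.remove(path)) listeners.remove(path).getOrElse(Nil).foreach(f => f())
}

// Simplified model of the client-side reference to one remote instance.
final class ClusterActorRef(
    actorId: String,
    watches: WatchRegistry,
    sendToSupervisor: Restart => Unit) {

  // Steps 1-2: receive the Watch path and add a Listener to it.
  def watchPathReceived(path: String): Unit =
    // Step 3: when the Watch is triggered, ask the Supervisor for a restart.
    watches.listen(path)(() => sendToSupervisor(Restart(actorId)))
}
```

Passing the send function in as a parameter keeps the sketch independent of how the RemoteDaemonActor channel is actually reached.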

Use the RemoteDaemonActor channel (in ClusterNode) for all "system" communication. Add necessary commands to the ClusterProtocol.proto.

All life-cycle messages (Restart, Exit, etc.) should be sent to the EventHandler.


on 2010-07-02 18:11 *

By Jonas Bonér

Milestone changed from 0.10 to 0.11

Updating tickets (#135, #257, #258, #259, #261, #262, #263, #264, #265, #207, #273, #208, #194, #225, #227, #228, #231, #198, #299, #300, #255, #117)

on 2010-08-19 19:08 *

By Jonas Bonér

Milestone changed from 0.11 to 1.0

Updating tickets (#135, #181, #194, #207, #208, #209, #228, #230, #231, #250, #255, #257, #258, #259, #261, #262, #263, #264, #265, #273, #298, #299, #300, #322, #323, #326, #343, #346, #363, #364, #373, #378, #380, #381, #384, #385, #387, #399)

Skipping 0.11. Next release is 1.0.

on 2010-09-02 14:03 *

By Jonas Bonér

Description changed from Currently we only track tra... to Currently we only track tra...

Priority changed from Normal (3) to Highest (1)

on 2010-09-23 13:53 *

By viktorklang

I'm guessing that this comment is linked to this :-)

private def notifySupervisorWithMessage(notification: LifeCycleMessage) = {
  // FIXME to fix supervisor restart of remote actor for oneway calls, inject a supervisor proxy that can send notification back to client
  _supervisor.foreach { sup =>
    if (sup.isShutdown) { // if supervisor is shut down, game over for all linked actors
      shutdownLinkedActors
      stop
    } else sup ! notification // else notify supervisor
  }
}

Couldn't it be solved by implementing "notifySupervisorWithMessage" differently in RemoteActorRef (which I am assuming is what _supervisor is transformed into)?
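
A rough illustration of that idea, using simplified stand-ins rather than the actual Akka classes: SketchActorRef, SketchRemoteActorRef, SupervisorRef and the sendToClient hook are all hypothetical, and the real change might look quite different.

```scala
// Simplified stand-ins; these are not Akka's actual classes.
sealed trait LifeCycleMessage
final case class Exit(actorId: String, cause: String) extends LifeCycleMessage

trait SupervisorRef {
  def isShutdown: Boolean
  def !(message: LifeCycleMessage): Unit
}

// Local behaviour, modelled on the snippet quoted above
// (shutdownLinkedActors is left out of this sketch).
class SketchActorRef(val id: String, supervisor: Option[SupervisorRef]) {
  var stopped = false
  def stop(): Unit = stopped = true

  def notifySupervisorWithMessage(notification: LifeCycleMessage): Unit =
    supervisor.foreach { sup =>
      if (sup.isShutdown) stop() // supervisor is gone, so stop this actor
      else sup ! notification    // otherwise notify the supervisor
    }
}

// Hypothetical remote variant: there is no local supervisor, so the
// notification is pushed back to the client through `sendToClient`.
class SketchRemoteActorRef(actorId: String, sendToClient: LifeCycleMessage => Unit)
    extends SketchActorRef(actorId, None) {

  override def notifySupervisorWithMessage(notification: LifeCycleMessage): Unit =
    sendToClient(notification)
}
```

Note that the quoted method is private, so it would presumably have to be opened up for overriding before anything like this could work.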

on 2010-10-15 12:17 *

By viktorklang

Milestone changed from 1.0-MILESTONE1 to 1.0-MILESTONE2

on 2010-10-28 14:49 *

By Jonas Bonér

Assigned to set to viktorklang

This is highest prio IMHO.

on 2010-10-28 14:54 *

By viktorklang

I'd like some assistance with this. What's needed, and which solution is most viable?

on 2010-11-21 20:11 *

By viktorklang