Improve shutdown facilities
I run in to various problems when trying to shutdown nodes.
First thing I realized was that I needed a special controller node, since I sometime tried to shutdown the first node (running the controller). Probably not much to do about this. Just good to know.
A few surprises in the testConductor api:
testConductor.getNodes only works on controller node
testConductor.shutdown only works on controller node
testConductor.removeNode only works on controller node
Most important is how to do testConductor.shutdown combined with testConductor.removeNode. I never got it to work even though I tried different combinations. Also, I get timeout from testConductor.removeNode.await.
First thing I realized was that I needed a special controller node, since I sometime tried to shutdown the first node (running the controller). Probably not much to do about this. Just good to know.
A few surprises in the testConductor api:
testConductor.getNodes only works on controller node
testConductor.shutdown only works on controller node
testConductor.removeNode only works on controller node
Most important is how to do testConductor.shutdown combined with testConductor.removeNode. I never got it to work even though I tried different combinations. Also, I get timeout from testConductor.removeNode.await.
Leave a comment
I got it working with barriers in the right place.
https://github.com/akka/akka/commit/e3eec7e344c26cc912add339e611f5a8786029e9
One thing that still doesn't work is using await in this combination:
testConductor.shutdown(leader, 0)
testConductor.removeNode(leader).await
Sometimes it is ok, so there is some race.
Not sure if we need it?
Feel free to close the ticket as invalid if you think everything is working as designed and no improvements are needed.
https://github.com/akka/akka/commit/e3eec7e344c26cc912add339e611f5a8786029e9
One thing that still doesn't work is using await in this combination:
testConductor.shutdown(leader, 0)
testConductor.removeNode(leader).await
Sometimes it is ok, so there is some race.
Not sure if we need it?
Feel free to close the ticket as invalid if you think everything is working as designed and no improvements are needed.
Hmm, there is a basic test of the actors implementing this here: https://github.com/akka/akka/blob/master/akka-remote-tests/src/test/scala/akka/remote/testconductor/BarrierSpec.scala#L295
It is meant to be used in the order Remove->Disconnect (where the latter happens as a result of shutdown(node)). What fails if you invert the order in the test you linked to?
It is meant to be used in the order Remove->Disconnect (where the latter happens as a result of shutdown(node)). What fails if you invert the order in the test you linked to?
on 2012-05-28 05:43 *
By Jonas Bonér
A note: You need to grab the 'leader' address before you do shutdown/removeNode and not read it twice in the two operations. But I'm sure you have done that.
on 2012-05-28 09:08 *
By Patrik Nordwall
Jonas, yes, I have learnt that misstake :-)
It fails on a barrier timeout when I changing the order to:
I think it is easiest if you play with the latest LeaderElectionSpec in master.
Perhaps we should provide an "atomic" shutdown-remove thing? I think it will be used a lot.
It fails on a barrier timeout when I changing the order to:
testConductor.removeNode(leader)
testConductor.shutdown(leader, 0)
I think it is easiest if you play with the latest LeaderElectionSpec in master.
Perhaps we should provide an "atomic" shutdown-remove thing? I think it will be used a lot.
on 2012-05-28 12:57 *
By Jonas Bonér
Good idea. I like the atomic ops idea.
on 2012-05-28 12:57 *
By Jonas Bonér
I am not surprised that remove and then shutdown does not work. Since you are removing the role and then use it.
ah, that is the thinko, then. But I will not attempt a fix in my current state of mind.
Background: remove was just meant to calm down the BarrierCoordinator, who would otherwise freak out if a node just drops out (and rightfully so).
Background: remove was just meant to calm down the BarrierCoordinator, who would otherwise freak out if a node just drops out (and rightfully so).
on 2012-06-01 08:00 *
By Patrik Nordwall
Just mentioning, it is definitely some race with the shutdown/remove/enter.
What I think we need something that makes this work reliably:
[node2] testConductor.enter("foo")
[node2] testConductor.enter("bar")
[node3] testConductor.enter("foo")
[node1] testConductor.shutdown(node3)
[node1] testConductor.enter("bar")
// all done
What I think we need something that makes this work reliably:
[node2] testConductor.enter("foo")
[node2] testConductor.enter("bar")
[node3] testConductor.enter("foo")
[node1] testConductor.shutdown(node3)
[node1] testConductor.enter("bar")
// all done
on 2012-06-04 05:01 *
By Patrik Nordwall
(In revision:52f122107c04e88d1a9ef9dee4fe002b5653c05c) Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: wip-2137-remove-shutdown-patriknw
- Skip nodes removal
- Ignore removed client when enter barrier
- Change order of testConductor.shutdown and testConductor.removeNode
Branch: wip-2137-remove-shutdown-patriknw
on 2012-06-04 06:52 *
By Patrik Nordwall
(In revision:52f122107c04e88d1a9ef9dee4fe002b5653c05c) Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: wip-2137-remove-shutdown-patriknw
- Skip nodes removal
- Ignore removed client when enter barrier
- Change order of testConductor.shutdown and testConductor.removeNode
Branch: wip-2137-remove-shutdown-patriknw
on 2012-06-04 08:14 *
By Patrik Nordwall
(In revision:52f122107c04e88d1a9ef9dee4fe002b5653c05c) Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: master
- Skip nodes removal
- Ignore removed client when enter barrier
- Change order of testConductor.shutdown and testConductor.removeNode
Branch: master
on 2012-06-04 08:14 *
By Patrik Nordwall
(In revision:93afc09a7251a0a3e8c14063e79e1f3f3e1408dc) Merge pull request #510 from akka/wip-2137-remove-shutdown-patriknw
Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: master
Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: master
on 2012-06-04 08:14 *
By Patrik Nordwall
(In revision:b1c507f3b95bd69eb75d8fa2ee13adb494c16d23) Shutdown does removeNode, see #2137
Branch: wip-2164-convergence-patriknw
Branch: wip-2164-convergence-patriknw
on 2012-06-04 08:16 *
By Patrik Nordwall
(In revision:b1c507f3b95bd69eb75d8fa2ee13adb494c16d23) Shutdown does removeNode, see #2137
Branch: master
Branch: master
on 2012-06-04 08:18 *
By Patrik Nordwall
Assigned to changed from rkuhn to Patrik Nordwall
Status changed from Accepted to Fixed
on 2012-06-04 16:33 *
By Patrik Nordwall
(In revision:52f122107c04e88d1a9ef9dee4fe002b5653c05c) Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: wip-2162-membership-change-listener-spec-jboner
- Skip nodes removal
- Ignore removed client when enter barrier
- Change order of testConductor.shutdown and testConductor.removeNode
Branch: wip-2162-membership-change-listener-spec-jboner
on 2012-06-04 16:33 *
By Patrik Nordwall
(In revision:b1c507f3b95bd69eb75d8fa2ee13adb494c16d23) Shutdown does removeNode, see #2137
Branch: wip-2162-membership-change-listener-spec-jboner
Branch: wip-2162-membership-change-listener-spec-jboner
on 2012-06-04 16:33 *
By Patrik Nordwall
(In revision:93afc09a7251a0a3e8c14063e79e1f3f3e1408dc) Merge pull request #510 from akka/wip-2137-remove-shutdown-patriknw
Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: wip-2162-membership-change-listener-spec-jboner
Fix shutdown/remove race as described by @rkuhn, see #2137
Branch: wip-2162-membership-change-listener-spec-jboner