activator-akka-cluster-sharding-scala NoSuchElementException
[Tested with Akka 2.3.0, akka-persistence-mongo-casbah 0.4-SNAPSHOT]
As mentioned in this thread, when running the activator akka-cluster-sharding example like this:
1. sbt "run-main sample.blog.BlogApp 2551"
2. sbt "run-main sample.blog.BlogApp 2552"
3. sbt "run-main sample.blog.BlogApp 0"
and then stopping the first seed node (2551), the ShardCoordinator singleton is restarted on node 2552, where it fails with a lot of
NoSuchElementExceptions.
Sometimes this happens right after the first stop of node 2551, sometimes I have to repeat the procedure a few times, but eventually I always end
up in the error condition.
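For reference, the trailing argument is the node's remoting port, and 0 means the OS picks a free port. Roughly, the sample turns that argument into the node configuration like this (a paraphrased sketch, not the sample source verbatim):

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object BlogAppSketch {
  def main(args: Array[String]): Unit = {
    // "2551", "2552" or "0"; 0 lets the OS choose a free port
    val port = if (args.isEmpty) "0" else args(0)
    val config = ConfigFactory
      .parseString(s"akka.remote.netty.tcp.port=$port")
      .withFallback(ConfigFactory.load())
    // Every node joins the same cluster; the ShardCoordinator runs as a
    // cluster singleton on the oldest node, so stopping 2551 moves it to 2552.
    val system = ActorSystem("ClusterSystem", config)
  }
}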
on 2014-04-03 01:18
By Patrik Nordwall
I have tried to reproduce this, but I can't. I have tried with a shared LevelDB journal running in a separate JVM that does not join the cluster. I have also tried with akka-persistence-mongo-casbah 0.4-SNAPSHOT. I tried many shutdown sequences without any such errors.
Looking at the code.
The exception is here.
regions(region) fails because the regions map does not contain the terminated region.
That means that the region is terminated without being registered.
In normal operation this cannot happen, because there is a check here.
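To make the failure mode concrete (this is only an illustration, not the coordinator source): applying a Scala Map to a key that was never added throws NoSuchElementException, which is exactly what a Terminated for an unregistered region triggers here:

object RegionsMapSketch extends App {
  // keys stand in for shard region ActorRefs, values for the shards they host
  val regions = Map("regionA" -> Vector("shard-1"))

  println(regions("regionA")) // fine: Vector(shard-1)
  println(regions("regionB")) // java.util.NoSuchElementException: key not found: regionB
}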
The exception occurs during recovery, so my conclusion is that the events are replayed in the wrong order.
One likely reason is that you have actually been running two clusters at the same time, i.e. two active shard coordinators storing into the same persistent event stream.
That can happen if you start 2551, 2552, 0, 0', stop 2551 and 2552, and then start 2551 again.
Then 0 and 0' are still running in their own cluster, and since the seed nodes are configured to 2551 and 2552, the restarted 2551 will form its own cluster.
Is it possible that you have done that? Otherwise I would need more detailed steps for how to reproduce it. (Remember to start the scenario with a clean MongoDB.)
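To make the two-cluster scenario concrete: the seed-node list is assumed here to match the activator template's application.conf, i.e. only the 2551 and 2552 addresses. With that list, a restarted 2551 knows nothing about the still-running 0 and 0' nodes and forms a second cluster, whose own ShardCoordinator writes to the same journal. A sketch:

import com.typesafe.config.ConfigFactory

object SeedNodesSketch {
  // Assumed to match the activator template's application.conf; shown only to
  // make the two-cluster scenario concrete.
  val seedConfig = ConfigFactory.parseString(
    """
    akka.cluster.seed-nodes = [
      "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "akka.tcp://ClusterSystem@127.0.0.1:2552"]
    """)
}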
Hello Patrik,
thanks for looking into this.
I have just tried it again with the sequence mentioned in the initial post (after first dropping the store database in Mongo),
and it failed again when stopping the 2551 process.
What I did then was drop all databases in my Mongo instance (not only the store database), and since then everything works
like a charm (starting and stopping in various orders, deleting or not deleting the store database).
I will keep observing to see whether this happens again, but for now I think you can close this ticket.
If I split the cluster as you assumed, Mongo raises DuplicateKeyExceptions in processor.persist and the processor is killed.
Thanks again for your help,
michael
Thanks a lot for testing it again. It sounds weird that dropping other databases should influence this (timing?).
There could of course be other reasons:
- bug in akka-persistence (replay in wrong order)
- bug in the mongo journal (replay in wrong order)
- bug in the sharding logic for updating that state (I don't want to put guards in there, because I want to see if something is wrong)
Closing it for now, but don't hesitate to re-open if you can reproduce it.
Yes, it sounds strange.
As we are in the midst of refactoring our application to akka-persistence/cluster sharding, I have tested the replay behaviour with
the mongo plugin (using our own processors) quite extensively. This works without a problem, but at the moment it is a single process; I still
have to write multi-JVM tests.
Anyway, I think akka-persistence and cluster sharding are great concepts, so thanks for the implementation!!
on 2014-04-03 11:20
By rocketraman
I was able to reproduce this from a clean database with a restart of the cluster -- it's not the same as the split-cluster scenario, but similar I guess:
1. Run 2551
2. Run 2552
3. Run 0
4. Wait for some posts to accumulate in the journal
5. Shutdown 0
6. Shutdown 2552
7. Shutdown 2551
8. Run 2551
9. Run 2552
10. Run 0
11. Wait for some posts to accumulate.
12. Shutdown 2551
Great, then it is open again. I will try.
on 2014-04-04 07:27
By benoit.heinrich
Hi Team,
I've also been able to reproduce this defect with the latest Akka 2.3.1 version by running the following sequence:
1. Set seed-nodes to be the 2551 address only (see the config sketch after these steps)
2. Start 2551, 2552 and 2553
3. Wait until you see "Posts by Martin: Post 1 from ClusterSystem@127.0.0.1:2553" on 2553
4. Shutdown 2553, 2552, 2551 rapidly
5. Set seed-nodes to be 2552 address only
6. Start 2552 then 2553
7. Shutdown 2552
8. Wait for 2552 to be detected as down on 2553
9. Errors on 2553
10. Stop 2553
11. Set seed-nodes to be 2553 address only
12. Start 2553
13. Errors on 2553
At that point you can easily repeat steps 12 and 13.
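The "Set seed-nodes" steps (1, 5 and 11) amount to pointing the node at a single seed address before starting it. A hedged sketch of one way to do that without editing application.conf between runs (addresses assume the sample's loopback setup); list elements can also be overridden on the JVM command line with -Dakka.cluster.seed-nodes.0=..., as described in the Akka cluster documentation:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object SingleSeedSketch {
  def main(args: Array[String]): Unit = {
    // e.g. "akka.tcp://ClusterSystem@127.0.0.1:2552" for step 5
    val seed = args(0)
    val config = ConfigFactory
      .parseString(s"""akka.cluster.seed-nodes = ["$seed"]""")
      .withFallback(ConfigFactory.load())
    val system = ActorSystem("ClusterSystem", config)
  }
}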
As you see, I've also used a bot with a predictable port number so we can easily reproduce the same scenario every time (this is 2553).
I've tried this sequence many times and it always reproduced the problem.
I've also tried variations of that sequence, and only that particular sequence reproduced the problem every time.
I hope it'll help you find the root cause.
Cheers,
/Benoit
on 2014-04-04 10:35
By Patrik Nordwall
Thank you. Was that with the MongoDB journal, or with a shared LevelDB journal on another port that is not part of the cluster?
on 2014-04-04 10:42
By rocketraman
Patrik, in my case it was a mongodb journal using this tree: https://github.com/rocketraman/activator-akka-cluster-sharding-scala/tree/mongodb-journal. As I recall, I was also able to replicate this with the other MongoDB journal linked from the community contributions page. If I get a chance later, I'll try it out with Martin's cassandra journal.
on 2014-04-04 11:39
By benoit.heinrich
I've been using the cassandra journal from https://github.com/krasserm/akka-persistence-cassandra/
on 2014-04-04 11:51
By Patrik Nordwall
Thank you for narrowing it down.
on 2014-04-06 14:32
By Patrik Nordwall
Assigned to set to Patrik Nordwall
Milestone changed from 2.3.x to Current
Status changed from New to Accepted
on 2014-04-06 14:50
By Patrik Nordwall
Nailed it! The problem was that some address information was lost in the serialization. Explained in more detail in the pull request: https://github.com/akka/akka/pull/2117
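The pull request has the authoritative explanation and fix. Purely as an illustration of the failure class, and not the actual patch, the sketch below shows how a path string loses its address part; once the address is gone, a node replaying the coordinator's events cannot tell which remote region an entry referred to. All names and ports are made up for the example:

import akka.actor.{ Address, RootActorPath }

object AddressLossSketch extends App {
  val address = Address("akka.tcp", "ClusterSystem", "127.0.0.1", 2552)
  val regionPath = RootActorPath(address) / "user" / "sharding" / "PostRegion"

  // Fully qualified: akka.tcp://ClusterSystem@127.0.0.1:2552/user/sharding/PostRegion
  println(regionPath.toString)

  // Without the address part only /user/sharding/PostRegion remains, which is
  // ambiguous once the coordinator state is replayed on a different node.
  println(regionPath.toStringWithoutAddress)
}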