LeaderDowningNodeThatIsUnreachableSpec failed once
I ran cluster tests 10 times on scalable1. One time LeaderDowningNodeThatIsUnreachableSpec failed.
Full log attached.
git commit: 36385f981b91cc7418567f5da25a55557ab55410
I used -Dakka.test.timefactor=2 so the timeout that fails here is 60 seconds.
Note that it is convergence that is not reached in MultiNodeClusterSpec.awaitUpConvergence
[JVM-Node3] - be able to DOWN a 'last' node that is UNREACHABLE *** FAILED ***
[JVM-Node3] java.lang.AssertionError: assertion failed: timeout 54691814077 nanoseconds expired
[JVM-Node3] at scala.Predef$.assert(Predef.scala:160)
[JVM-Node3] at akka.testkit.TestKit.poll$1(TestKit.scala:207)
[JVM-Node3] at akka.testkit.TestKit.awaitCond(TestKit.scala:213)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$$anonfun$awaitUpConvergence$1.apply$mcV$sp(MultiNodeClusterSpec.scala:67)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$$anonfun$awaitUpConvergence$1.apply(MultiNodeClusterSpec.scala:64)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$$anonfun$awaitUpConvergence$1.apply(MultiNodeClusterSpec.scala:64)
[JVM-Node3] at akka.testkit.TestKit.within(TestKit.scala:244)
[JVM-Node3] at akka.testkit.TestKit.within(TestKit.scala:258)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$class.awaitUpConvergence(MultiNodeClusterSpec.scala:64)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec.awaitUpConvergence(LeaderDowningNodeThatIsUnreachableSpec.scala:35)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$4.apply$mcV$sp(LeaderDowningNodeThatIsUnreachableSpec.scala:80)
[JVM-Node3] at akka.remote.testkit.MultiNodeSpec.runOn(MultiNodeSpec.scala:178)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LeaderDowningNodeThatIsUnreachableSpec.scala:71)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(LeaderDowningNodeThatIsUnreachableSpec.scala:45)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(LeaderDowningNodeThatIsUnreachableSpec.scala:45)
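For context, the assertion in the stack trace comes out of the TestKit awaitCond polling that awaitUpConvergence runs inside a within block: a condition along the lines of "all members are Up and convergence is reached" is re-evaluated until it holds or the deadline passes. Below is a minimal self-contained sketch of that polling pattern, with illustrative names and intervals, not the actual TestKit or MultiNodeClusterSpec code:
import scala.concurrent.duration._

object AwaitCondSketch extends App {
  // Poll a condition until it holds or the deadline passes, then fail with
  // the same kind of "timeout ... expired" message as in the stack trace.
  def awaitCond(p: => Boolean, max: FiniteDuration, interval: FiniteDuration = 100.millis): Unit = {
    val deadline = max.fromNow
    while (!p) {
      assert(deadline.hasTimeLeft(), s"timeout ${max.toNanos} nanoseconds expired")
      Thread.sleep(interval.toMillis)
    }
  }

  // In the spec the timeout is scaled by akka.test.timefactor=2, giving the
  // 60 seconds mentioned above; here we just wait for a trivial condition.
  val start = System.nanoTime()
  awaitCond(System.nanoTime() - start > 300.millis.toNanos, max = 1.second)
  println("condition met before the timeout")
}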
on 2012-06-01 11:07 *
By Patrik Nordwall
Milestone changed from Infrastructure and Framework For Distributed and Load Testing to Coltrane
on 2012-06-04 11:38 *
By Patrik Nordwall
It fails once in a while. Attached new log at DEBUG level.
on 2012-06-04 22:31 *
By Patrik Nordwall
Another failure: http://scalable1:8080/job/akka-multi-node/75/
This is interesting because it is after the shutdown/remove fix, so we can't blame that one any more. Verify that it is the same error as before (I haven't checked).
on 2012-06-04 23:12 *
By Jonas Bonér
What shutdown/remove fix?
on 2012-06-05 00:12 *
By Patrik Nordwall
on 2012-06-05 13:31 *
By Patrik Nordwall
In job 75 (link above) the failure is due to the fourth node becoming unreachable (phi > 4). Two peculiar stats from the log:
Gossiping from fourth is done less frequently than from other nodes:
$ cat log1.txt | grep "\[first\]" | grep "Selecting random node to gossip to" | wc -l
196
$ cat log1.txt | grep "\[second\]" | grep "Selecting random node to gossip to" | wc -l
275
$ cat log1.txt | grep "\[third\]" | grep "Selecting random node to gossip to" | wc -l
272
$ cat log1.txt | grep "\[fourth\]" | grep "Selecting random node to gossip to" | wc -l
26
Gossiping to fourth is done less frequently than to other nodes:
$ cat log1.txt |grep "\[first\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@second" | wc -l
89
$ cat log1.txt |grep "\[first\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@third" | wc -l
85
$ cat log1.txt |grep "\[first\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@fourth" | wc -l
22
$ cat log1.txt |grep "\[second\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@first" | wc -l
95
$ cat log1.txt |grep "\[second\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@third" | wc -l
164
$ cat log1.txt |grep "\[second\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@fourth" | wc -l
16
$ cat log1.txt |grep "\[third\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@first" | wc -l
93
$ cat log1.txt |grep "\[third\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@second" | wc -l
168
$ cat log1.txt |grep "\[third\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@fourth" | wc -l
11
$ cat log1.txt |grep "\[fourth\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@first" | wc -l
4
$ cat log1.txt |grep "\[fourth\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@second" | wc -l
12
$ cat log1.txt |grep "\[fourth\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@third" | wc -l
10
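The same counts can be gathered in one pass with a small script instead of one grep pipeline per pair; a sketch below, with the log file name (log1.txt) and node names taken from the commands above:
import scala.io.Source

object GossipCountSketch extends App {
  // Count "Gossiping to" log lines per (from, to) node pair in one pass,
  // mirroring the grep pipelines above.
  val nodes = Seq("first", "second", "third", "fourth")
  val lines = Source.fromFile("log1.txt").getLines().toVector
  for {
    from <- nodes
    to   <- nodes if to != from
  } {
    val n = lines.count(l =>
      l.contains(s"[$from]") && l.contains(s"Gossiping to [Actor[akka://MultiNodeSpec@$to"))
    println(s"$from -> $to: $n")
  }
}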
deputyNodes could perhaps explain why gossiping to fourth is done less frequently, but not why gossiping from fourth is done less frequently than from other nodes.
I think it is time to do #2163 first.
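About the phi > 4 part: an accrual failure detector turns the time since the last heartbeat into a suspicion level phi, computed from the mean and standard deviation of previously observed heartbeat intervals, so a node whose gossip arrives noticeably less often quickly crosses the threshold. A self-contained sketch of that calculation, using made-up interval numbers and a logistic approximation of the normal CDF (illustrative, in the spirit of the Hayashibara accrual detector, not necessarily what AccrualFailureDetector does at this commit):
object PhiSketch extends App {
  // Hypothetical history of heartbeat inter-arrival times in milliseconds.
  val intervals = Vector(1000.0, 1100.0, 900.0, 1050.0, 950.0)
  val mean   = intervals.sum / intervals.size
  val stdDev = math.sqrt(intervals.map(x => (x - mean) * (x - mean)).sum / intervals.size)

  // phi = -log10(probability that a heartbeat is still merely "late" after t ms),
  // using a logistic approximation of the normal CDF.
  def phi(timeSinceLastHeartbeatMs: Double): Double = {
    val y = (timeSinceLastHeartbeatMs - mean) / stdDev
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (timeSinceLastHeartbeatMs > mean) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }

  println(f"phi after 1050 ms: ${phi(1050)}%.2f") // ~0.6, node still considered reachable
  println(f"phi after 1300 ms: ${phi(1300)}%.2f") // ~5.3, above a threshold of 4 -> unreachable
}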
on 2012-06-08 11:36 *
By Patrik Nordwall
I have verified with the SunnyWeatherSpec that the gossiping distribution is OK.
on 2012-06-08 18:45 *
By Patrik Nordwall
Still failing http://scalable1:8080/job/akka-local/44/
on 2012-06-08 18:54 *
By Jonas Bonér
Still doesn't detect failure. Can it be a timing issue? Do we need to give it more time? Or lower the threshold?
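If someone wants to experiment with "more time" versus "a lower threshold", the failure detector tuning can be overridden in the spec's config. A hedged sketch below; the key names are written from memory and are assumptions that should be checked against akka-cluster's reference.conf at this commit:
import com.typesafe.config.ConfigFactory

object FailureDetectorTuningSketch extends App {
  // Assumed keys: threshold (phi limit, higher = more tolerant, lower = faster
  // but noisier detection), max-sample-size (history used for mean/stddev),
  // gossip-interval (how often nodes gossip/heartbeat).
  val tuned = ConfigFactory.parseString("""
    akka.test.timefactor = 2
    akka.cluster.failure-detector.threshold = 8
    akka.cluster.failure-detector.max-sample-size = 1000
    akka.cluster.gossip-interval = 1 s
    """)
  println(tuned.getDouble("akka.cluster.failure-detector.threshold"))
}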
on 2012-06-12 17:14 *
By Patrik Nordwall
I tried this with FailureDetectorPuppet and it fails fairly often if I add the markNodeAsUnavailable on second and third also.
runOn(second, third) {
  testConductor.enter("down-fourth-node")
  // mark the node as unreachable in the failure detector
  markNodeAsUnavailable(fourthAddress)
}
Then it becomes merge and failure. Don't think it fails every time there is a merge, but might be a combination of merge and a race.
on 2012-06-14 10:04 *
By Patrik Nordwall
Assigned to changed from Jonas Bonér to Patrik Nordwall
Status changed from Accepted to Test
I'm pretty sure this was fixed by the corrections of gossip merge, see #2077
I have run this 20 times with the extra markNodeAsUnavailable. No failures. It was easy to reproduce before 2077.
It has also passed all nightly test runs.
on 2012-06-14 11:45 *
By Jonas Bonér
Awesome job Patrik. Gold star.
All (10) nightly runs pass. Fixed!
on 2012-06-15 12:45 *
By Jonas Bonér
Awesome indeed.