LeaderDowningNodeThatIsUnreachableSpec failed once
I ran cluster tests 10 times on scalable1. One time LeaderDowningNodeThatIsUnreachableSpec failed.
Full log attached.
git commit: 36385f981b91cc7418567f5da25a55557ab55410
I used -Dakka.test.timefactor=2 so the timeout that fails here is 60 seconds.
Note that it is convergence that is not reached in MultiNodeClusterSpec.awaitUpConvergence
[JVM-Node3] - be able to DOWN a 'last' node that is UNREACHABLE *** FAILED ***
[JVM-Node3] java.lang.AssertionError: assertion failed: timeout 54691814077 nanoseconds expired
[JVM-Node3] at scala.Predef$.assert(Predef.scala:160)
[JVM-Node3] at akka.testkit.TestKit.poll$1(TestKit.scala:207)
[JVM-Node3] at akka.testkit.TestKit.awaitCond(TestKit.scala:213)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$$anonfun$awaitUpConvergence$1.apply$mcV$sp(MultiNodeClusterSpec.scala:67)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$$anonfun$awaitUpConvergence$1.apply(MultiNodeClusterSpec.scala:64)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$$anonfun$awaitUpConvergence$1.apply(MultiNodeClusterSpec.scala:64)
[JVM-Node3] at akka.testkit.TestKit.within(TestKit.scala:244)
[JVM-Node3] at akka.testkit.TestKit.within(TestKit.scala:258)
[JVM-Node3] at akka.cluster.MultiNodeClusterSpec$class.awaitUpConvergence(MultiNodeClusterSpec.scala:64)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec.awaitUpConvergence(LeaderDowningNodeThatIsUnreachableSpec.scala:35)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$4.apply$mcV$sp(LeaderDowningNodeThatIsUnreachableSpec.scala:80)
[JVM-Node3] at akka.remote.testkit.MultiNodeSpec.runOn(MultiNodeSpec.scala:178)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LeaderDowningNodeThatIsUnreachableSpec.scala:71)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(LeaderDowningNodeThatIsUnreachableSpec.scala:45)
[JVM-Node3] at akka.cluster.LeaderDowningNodeThatIsUnreachableSpec$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(LeaderDowningNodeThatIsUnreachableSpec.scala:45)
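For context, the assertion in the stack trace comes out of the TestKit awaitCond polling that awaitUpConvergence runs inside a within block: a condition along the lines of "all members are Up and convergence is reached" is re-evaluated until it holds or the deadline passes. Below is a minimal self-contained sketch of that polling pattern, with illustrative names and intervals, not the actual TestKit or MultiNodeClusterSpec code:
import scala.concurrent.duration._

object AwaitCondSketch extends App {
  // Poll a condition until it holds or the deadline passes, then fail with
  // the same kind of "timeout ... expired" message as in the stack trace.
  def awaitCond(p: => Boolean, max: FiniteDuration, interval: FiniteDuration = 100.millis): Unit = {
    val deadline = max.fromNow
    while (!p) {
      assert(deadline.hasTimeLeft(), s"timeout ${max.toNanos} nanoseconds expired")
      Thread.sleep(interval.toMillis)
    }
  }

  // In the spec the timeout is scaled by akka.test.timefactor=2, giving the
  // 60 seconds mentioned above; here we just wait for a trivial condition.
  val start = System.nanoTime()
  awaitCond(System.nanoTime() - start > 300.millis.toNanos, max = 1.second)
  println("condition met before the timeout")
}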
on 2012-06-01 11:07 *
By Patrik Nordwall
Milestone changed from Infrastructure and Framework For Distributed and Load Testing to Coltrane
on 2012-06-04 11:38 *
By Patrik Nordwall
It fails once in a while. Attached new log at DEBUG level.
on 2012-06-04 22:31 *
By Patrik Nordwall
Another failure: http://scalable1:8080/job/akka-multi-node/75/
This is interesting because it is after the shutdown/remove fix, so we can't blame that one any more. Verify that it is the same error as before (I haven't checked).
on 2012-06-04 23:12 *
By Jonas Bonér
What shutdown/remove fix?
on 2012-06-05 00:12 *
By Patrik Nordwall
on 2012-06-05 13:31 *
By Patrik Nordwall
In job 75 (link above) the failure is due to the fourth node becoming unreachable (phi > 4). Two peculiar stats from the log:
Gossiping from fourth is done less frequently than from other nodes:
$ cat log1.txt | grep "\[first\]" | grep "Selecting random node to gossip to" | wc -l
196
$ cat log1.txt | grep "\[second\]" | grep "Selecting random node to gossip to" | wc -l
275
$ cat log1.txt | grep "\[third\]" | grep "Selecting random node to gossip to" | wc -l
272
$ cat log1.txt | grep "\[fourth\]" | grep "Selecting random node to gossip to" | wc -l
26
Gossiping to fourth is done less frequently than to other nodes:
$ cat log1.txt |grep "\[first\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@second" | wc -l
89
$ cat log1.txt |grep "\[first\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@third" | wc -l
85
$ cat log1.txt |grep "\[first\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@fourth" | wc -l
22
$ cat log1.txt |grep "\[second\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@first" | wc -l
95
$ cat log1.txt |grep "\[second\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@third" | wc -l
164
$ cat log1.txt |grep "\[second\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@fourth" | wc -l
16
$ cat log1.txt |grep "\[third\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@first" | wc -l
93
$ cat log1.txt |grep "\[third\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@second" | wc -l
168
$ cat log1.txt |grep "\[third\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@fourth" | wc -l
11
$ cat log1.txt |grep "\[fourth\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@first" | wc -l
4
$ cat log1.txt |grep "\[fourth\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@second" | wc -l
12
$ cat log1.txt |grep "\[fourth\]" | grep "Gossiping to \[Actor\[akka://MultiNodeSpec@third" | wc -l
10
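The same counts can be gathered in one pass with a small script instead of one grep pipeline per pair; a sketch below, with the log file name (log1.txt) and node names taken from the commands above:
import scala.io.Source

object GossipCountSketch extends App {
  // Count "Gossiping to" log lines per (from, to) node pair in one pass,
  // mirroring the grep pipelines above.
  val nodes = Seq("first", "second", "third", "fourth")
  val lines = Source.fromFile("log1.txt").getLines().toVector
  for {
    from <- nodes
    to   <- nodes if to != from
  } {
    val n = lines.count(l =>
      l.contains(s"[$from]") && l.contains(s"Gossiping to [Actor[akka://MultiNodeSpec@$to"))
    println(s"$from -> $to: $n")
  }
}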
deputyNodes could perhaps explain why gossiping to fourth is done less frequently, but not why gossiping from fourth is done less frequently than from other nodes.
I think it is time to do #2163 first.
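About the phi > 4 part: an accrual failure detector turns the time since the last heartbeat into a suspicion level phi, computed from the mean and standard deviation of previously observed heartbeat intervals, so a node whose gossip arrives noticeably less often quickly crosses the threshold. A self-contained sketch of that calculation, using made-up interval numbers and a logistic approximation of the normal CDF (illustrative, in the spirit of the Hayashibara accrual detector, not necessarily what AccrualFailureDetector does at this commit):
object PhiSketch extends App {
  // Hypothetical history of heartbeat inter-arrival times in milliseconds.
  val intervals = Vector(1000.0, 1100.0, 900.0, 1050.0, 950.0)
  val mean   = intervals.sum / intervals.size
  val stdDev = math.sqrt(intervals.map(x => (x - mean) * (x - mean)).sum / intervals.size)

  // phi = -log10(probability that a heartbeat is still merely "late" after t ms),
  // using a logistic approximation of the normal CDF.
  def phi(timeSinceLastHeartbeatMs: Double): Double = {
    val y = (timeSinceLastHeartbeatMs - mean) / stdDev
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (timeSinceLastHeartbeatMs > mean) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }

  println(f"phi after 1050 ms: ${phi(1050)}%.2f") // ~0.6, node still considered reachable
  println(f"phi after 1300 ms: ${phi(1300)}%.2f") // ~5.3, above a threshold of 4 -> unreachable
}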
on 2012-06-08 11:36 *
By Patrik Nordwall
I have verified with the SunnyWeatherSpec that the gossiping distribution is OK.
on 2012-06-08 18:45 *
By Patrik Nordwall
Still failing http://scalable1:8080/job/akka-local/44/
on 2012-06-08 18:54 *
By Jonas Bonér
Still doesn't detect failure. Can it be a timing issue? Do we need to give it more time? Or lower the threshold?
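If someone wants to experiment with "more time" versus "a lower threshold", the failure detector tuning can be overridden in the spec's config. A hedged sketch below; the key names are written from memory and are assumptions that should be checked against akka-cluster's reference.conf at this commit:
import com.typesafe.config.ConfigFactory

object FailureDetectorTuningSketch extends App {
  // Assumed keys: threshold (phi limit, higher = more tolerant, lower = faster
  // but noisier detection), max-sample-size (history used for mean/stddev),
  // gossip-interval (how often nodes gossip/heartbeat).
  val tuned = ConfigFactory.parseString("""
    akka.test.timefactor = 2
    akka.cluster.failure-detector.threshold = 8
    akka.cluster.failure-detector.max-sample-size = 1000
    akka.cluster.gossip-interval = 1 s
    """)
  println(tuned.getDouble("akka.cluster.failure-detector.threshold"))
}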
on 2012-06-12 17:14 *
By Patrik Nordwall
I tried this with FailureDetectorPuppet and it fails fairly often if I add the markNodeAsUnavailable on second and third also.
runOn(second, third) {
  testConductor.enter("down-fourth-node")
  // mark the node as unreachable in the failure detector
  markNodeAsUnavailable(fourthAddress)
}
Then it becomes merge and failure. Don't think it fails every time there is a merge, but might be a combination of merge and a race.
on 2012-06-14 10:04 *
By Patrik Nordwall
Assigned to changed from Jonas Bonér to Patrik Nordwall
Status changed from Accepted to Test
I'm pretty sure this was fixed by the corrections of gossip merge, see #2077
I have run this 20 times with the extra markNodeAsUnavailable. No failures. It was easy to reproduce before 2077.
It has also passed all nightly test runs.
on 2012-06-14 11:45 *
By Jonas Bonér
Awesome job Patrik. Gold star.
All (10) nightly runs pass. Fixed!
on 2012-06-15 12:45 *
By Jonas Bonér
Awesome indeed.