Start sending heartbeats immediately when joining
Scenario: node A joins B.
B receives the join command and initializes its failure detector for A with a first heartbeat.
This first heartbeat is important: it lets B detect that A has become unavailable if A never sends any heartbeats.
However, A does not start sending heartbeats to B until it receives a gossip telling it that it is part of the cluster, i.e. until B exists in gossip.members.
This initial gossip roundtrip makes the initial failure detection a special case, and there is a risk that B considers A unavailable when only the gossip is delayed.
I think it would be better for A to start sending heartbeats to B immediately after sending the join command.
This implies that A must keep track of the fact that it is in a "joining phase" with B, so that it knows it should send heartbeats to B (B is not in gossip.members yet).
This can be handled by a separate Set[Address] in the State (not in Gossip): B is added to it when A joins B, and removed from it when the Gossip containing B arrives.
Even better would be to also keep the join timestamp, i.e. a Map[Address, Timestamp], so that the entry can be evicted if no gossip arrives (B unreachable) and heartbeats to B can stop.
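A minimal sketch of the Map variant, to make the bookkeeping concrete. State here is the local, non-gossiped state of the joining node A. Address, State and all method names are hypothetical stand-ins rather than the actual cluster code, timestamps are assumed to be plain milliseconds, and the timeout is passed in by the caller.

```scala
// Hypothetical stand-in for a node address.
final case class Address(host: String, port: Int)

// Local state of the joining node (kept outside Gossip).
// joinInProgress maps each node we sent a join command to onto the
// time (in milliseconds) at which the join was sent.
final case class State(
    self: Address,
    members: Set[Address] = Set.empty,
    joinInProgress: Map[Address, Long] = Map.empty) {

  // Call right after sending the join command to `target`.
  def joining(target: Address, nowMillis: Long): State =
    copy(joinInProgress = joinInProgress + (target -> nowMillis))

  // Call when a gossip arrives: adopt its members and drop every
  // join-in-progress entry that the gossip now covers.
  def receiveGossip(gossipMembers: Set[Address]): State =
    copy(members = gossipMembers, joinInProgress = joinInProgress -- gossipMembers)

  // Evict joins that are older than the timeout (no gossip arrived),
  // so that heartbeats to those nodes stop.
  def evictExpired(nowMillis: Long, timeoutMillis: Long): State =
    copy(joinInProgress = joinInProgress.filter {
      case (_, joinedAt) => nowMillis - joinedAt <= timeoutMillis
    })

  // Heartbeats go to known members plus nodes we are still joining.
  def heartbeatTargets: Set[Address] =
    (members ++ joinInProgress.keySet) - self
}
```

With this, State(a).joining(b, nowMillis) already lists b in heartbeatTargets before any gossip has arrived, and evictExpired removes b again if the gossip never shows up.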
WDYT?
on 2012-06-19 13:31
By Jonas Bonér
Yep. I like the latter alternative (Map). Good analysis.
on 2012-06-20 11:23
By Patrik Nordwall
(In revision:529c25f3dc22467c9f1b1c0358c13b0d5845797b) Start sending heartbeats immediately when joining, see #2249
Branch: wip-2249-heartbeats-after-join-patriknw
- Keep track of joins that are in progress in State.joinInProgress
- Add test that fails without this feature