Create cluster metrics-aware Adaptive Load Balancing Routers
Create cluster metric-aware Adaptive Load Balancing Routers that gracefully and pro-actively respond to node metric threshold conditions by redirecting traffic to least-loaded nodes when bottleneck conditions arise, so that fault tolerance is increased and throughput is not decreased.
- Each type of Adaptive Load Balancing Router is aware in near-real-time (see metric collection and metric gossip interval settings in cluster conf) of particular conditions on the nodes (network latency, load, memory, etc)
- Proactively handle low resource situations versus responding to failure situations that have already occurred
Leave a comment
on 2012-09-28 05:24 *
By Patrik Nordwall
Helena, will you take the lead on this one?
Should we start next week to discuss what it should do and how?
Should we start next week to discuss what it should do and how?
on 2012-09-30 07:43 *
By Patrik Nordwall
My first thoughts on the dynamic load balancing router.
In the end I think it should be a weighted round-robin or random router. The ratio of messages routed to the different nodes are proportional to the weights. E.g. 3 nodes A:B:C with weights 1:2:12 means that 6.7% of the messages are routed to A, 13.3% of msg to B and 80% to C.
It should be pretty straightforward to implement the weighted round-robin or random router. Inspiration can be found here https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/main/java/org/apache/camel/processor/loadbalancer/
I'm not sure that way of implementing it is what we want, because it's based on a mutable data structure. The routing logic must be thread safe, and therefor that might not be the best way. An alternative way would be to think of it as a ordinary list of routees, but with multiple entries for nodes with larger weight. Then the selection can be done in same simple way as ordinary RoundRobinRouter or RandomRouter.
The weights should be dynamically changed based on the metrics of the nodes. Transform the metrics to utilization value (max 100%). E.g heap 200MB of 512MB => utilization 39%. The utilization is then transformed to weights. Node with highest utilization correspond to weight 1. E.g. nodes A:B:C with utilization 60%:30%:5% => weights 1:2:12
Since we already have smoothing of the metrics on the collection side (exp weighted moving average), it might not be needed with additional damping in the router when changing weights, but if that turns out to be needed it's possible to add that, but let's start without it.
This doesn't cover everything, but I hope it can be a starting point for discussion. WDYT?
In the end I think it should be a weighted round-robin or random router. The ratio of messages routed to the different nodes are proportional to the weights. E.g. 3 nodes A:B:C with weights 1:2:12 means that 6.7% of the messages are routed to A, 13.3% of msg to B and 80% to C.
It should be pretty straightforward to implement the weighted round-robin or random router. Inspiration can be found here https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/main/java/org/apache/camel/processor/loadbalancer/
I'm not sure that way of implementing it is what we want, because it's based on a mutable data structure. The routing logic must be thread safe, and therefor that might not be the best way. An alternative way would be to think of it as a ordinary list of routees, but with multiple entries for nodes with larger weight. Then the selection can be done in same simple way as ordinary RoundRobinRouter or RandomRouter.
The weights should be dynamically changed based on the metrics of the nodes. Transform the metrics to utilization value (max 100%). E.g heap 200MB of 512MB => utilization 39%. The utilization is then transformed to weights. Node with highest utilization correspond to weight 1. E.g. nodes A:B:C with utilization 60%:30%:5% => weights 1:2:12
Since we already have smoothing of the metrics on the collection side (exp weighted moving average), it might not be needed with additional damping in the router when changing weights, but if that turns out to be needed it's possible to add that, but let's start without it.
This doesn't cover everything, but I hope it can be a starting point for discussion. WDYT?
I have concerns about the stability of the actual implementation (overshooting and oscillations). This is quite a bit of controls system theory kind of thing, although I don't think the theory will be too much help here. I strongly recommend running some simplified simulations first.
This is somewhat related to the random weighting, although the scenario is slightly different: http://en.wikipedia.org/wiki/Stochastic_universal_sampling
on 2012-09-30 08:10 *
By Patrik Nordwall
Yes, I agree. I think it should be a matter of not changing the weights too quickly. When we have the basics in place we should do some simulations and compare with damping of weight changes.
/Patrik
/Patrik
on 2012-09-30 10:55 *
By Helena Edelson
Description set to Create cluster metric-aware...
Summary changed from Create cluster metrics-related load balancing routers to Create cluster metrics-aware Adaptive Load Balancing Routers
I added the description and justification
on 2012-09-30 11:00 *
By Helena Edelson
Patrik, how far along are you on the cluster router work in general would you say?
on 2012-09-30 12:09 *
By Patrik Nordwall
That is done. See http://doc.akka.io/docs/akka/snapshot/cluster/cluster-usage.html#cluster-aware-routers
It's akka.cluster.routing.ClusterRouterConfig that manages the routees based on cluster members. This AdaptiveLoadBalancingRouter will be used together with that. It's responsible for selecting which routee to use for each message, and will therefore be similar to all other routers, such as RoundRobinRouter.
It's akka.cluster.routing.ClusterRouterConfig that manages the routees based on cluster members. This AdaptiveLoadBalancingRouter will be used together with that. It's responsible for selecting which routee to use for each message, and will therefore be similar to all other routers, such as RoundRobinRouter.
on 2012-10-01 08:01 *
By Helena Edelson
Assigned to set to Helena Edelson
Status changed from New to Accepted
Branch created. i'll push a basic framework this week then post the diff here for brainstorming; Drew if you want to collab/help that would be great. When do we need this ready to merge?
on 2012-10-01 08:18 *
By Helena Edelson
Do you prefer these to be 'dynamic' vs 'adaptive' by name?
on 2012-10-01 09:09 *
By Patrik Nordwall
Great! I think Adaptive is nice. I had a short discussion with Endre today and clarified the goal, which is not as advanced as he might have thought of initially. One case it will not handle very well is when all all nodes are 100% utilized, because then it will send same amount of messages to all, even though the machines might not be equally powerful. Using a mix of the metrics to calculate utilization might be interesting, and later adding mailbox size would improve on that scenario.
Let's do the basics first.
Let's do the basics first.
on 2012-10-01 09:25 *
By Helena Edelson
Can we simulate spinning up another node If all are !00% utilized, until akka.cluster can talk to the vm apis and actually spin one up?
Agreed, I think it can be seen a quite straight forward, for v1 of this.
The only tricky part I see so far is how the routers will be aware of the threshold values.
Agreed, I think it can be seen a quite straight forward, for v1 of this.
The only tricky part I see so far is how the routers will be aware of the threshold values.
- CPU utilization - already available
- Heap memory, on most OS, heap used and committed can be banked against heap max however on some max will not be available - what will we want to tell the user in that case?
- Network Latency - we have max read/write and tx - this needs to be flushed out
- For the least msg - haven't looked at that use case yet
on 2012-10-01 09:42 *
By Patrik Nordwall
In my proposal there is no threshold values.
Pushed to https://github.com/helena/akka/commit/a6bf53df3ecca31fbb57b2c1ab1716f31f043a4e
If you are interested in it I can do a PR.
All preliminary metric work is completed
- creation of NodeMetricsComparator for ordering of (Address, Long/Double) values in question to iterate through the nodes based on available routees (see the load balancing router) and return the address with min/max depending
- creationg of sealed trait MetricValues and its impls of HeapMemory, NetworkLatency and CPU for clean extraction (conversion) of node.metric. particular metric (heap mem used, system load average, etc) and delegation to the cluster metrics api vs exposing in the cluster router api
- creation of MetricsAwareClusterNodeSelector for evaluation, extraction, and getting of the address of the node that fulfils the criteria of the load balancing router implementation in question.
- the above was created as a trait to allow for the created ClusterAdaptiveMetricsLoadBalancingRouter, which will be able to provide by all metrics vs just one.
- creation of MetricsAwareClusterNodeSelector for
- extraction of the data w/out metric logic in the router package
Creation of the following Router and Router Impls
- ClusterAdaptiveLoadBalancingRouterLike extends RoundRobinLike with LoadBalancer
Status: complete the strategy and getNext() algorithm for round robin selection based on healthiest node
All of these are either completed or well stubbed out in MetricsAwareClusterNodeSelector:
- MemoryLoadBalancingRouter - returns the node address of healthiest by memory - see ClusterAdaptiveLoadBalancingRouterLike
- CpuLoadBalancer - algorithm impl needed for systemLoadAverage / combinedCPU, processors, cores to produce the node address of healthiest by memory - see ClusterAdaptiveLoadBalancingRouterLike
- NetworkLatencyLoadBalancer - finalization of the algorithm (min or max really) to return the node address of healthiest by memory - see ClusterAdaptiveLoadBalancingRouterLike
- ClusterAdaptiveMetricsLoadBalancingRouter - algorithm needed to select by overall health (all monitored metrics: mem, cpu, network)
Specs Created - multi-jvm
- For now the ClusterAdaptiveLoadBalancingRouter (router like and impls) are in multi-jvm/scala/akka.cluster.routing
- stubbed out the router's multi spec which is still in need of config and the rest of the tests. One basic test is in.
Specs Created - unit/integration
- metric value comparator Spec for new ordering by address and min/max value
- MetricsAwareClusterNodeSelector Spec
- Created a stubbed out
If you are interested in it I can do a PR.
All preliminary metric work is completed
- creation of NodeMetricsComparator for ordering of (Address, Long/Double) values in question to iterate through the nodes based on available routees (see the load balancing router) and return the address with min/max depending
- creationg of sealed trait MetricValues and its impls of HeapMemory, NetworkLatency and CPU for clean extraction (conversion) of node.metric. particular metric (heap mem used, system load average, etc) and delegation to the cluster metrics api vs exposing in the cluster router api
- creation of MetricsAwareClusterNodeSelector for evaluation, extraction, and getting of the address of the node that fulfils the criteria of the load balancing router implementation in question.
- the above was created as a trait to allow for the created ClusterAdaptiveMetricsLoadBalancingRouter, which will be able to provide by all metrics vs just one.
- creation of MetricsAwareClusterNodeSelector for
- extraction of the data w/out metric logic in the router package
Creation of the following Router and Router Impls
- ClusterAdaptiveLoadBalancingRouterLike extends RoundRobinLike with LoadBalancer
Status: complete the strategy and getNext() algorithm for round robin selection based on healthiest node
All of these are either completed or well stubbed out in MetricsAwareClusterNodeSelector:
- MemoryLoadBalancingRouter - returns the node address of healthiest by memory - see ClusterAdaptiveLoadBalancingRouterLike
- CpuLoadBalancer - algorithm impl needed for systemLoadAverage / combinedCPU, processors, cores to produce the node address of healthiest by memory - see ClusterAdaptiveLoadBalancingRouterLike
- NetworkLatencyLoadBalancer - finalization of the algorithm (min or max really) to return the node address of healthiest by memory - see ClusterAdaptiveLoadBalancingRouterLike
- ClusterAdaptiveMetricsLoadBalancingRouter - algorithm needed to select by overall health (all monitored metrics: mem, cpu, network)
Specs Created - multi-jvm
- For now the ClusterAdaptiveLoadBalancingRouter (router like and impls) are in multi-jvm/scala/akka.cluster.routing
- stubbed out the router's multi spec which is still in need of config and the rest of the tests. One basic test is in.
Specs Created - unit/integration
- metric value comparator Spec for new ordering by address and min/max value
- MetricsAwareClusterNodeSelector Spec
- Created a stubbed out
I start a new job Monday and unfortunately will not have time for a while to do more work on this for a while; at least not for at least a month or two so someone else may want to pick it up
Note that there is a FIXME comment in the akka.cluster.DataStream scaladoc comment that needs looking at.
Thanks,
Helena
Note that there is a FIXME comment in the akka.cluster.DataStream scaladoc comment that needs looking at.
Thanks,
Helena
on 2012-10-18 13:56 *
By Patrik Nordwall
Thanks a lot for the start of these routers. I'll continue. Good luck with your new job.
I think it's easiest if I just grab a patch of your commit and apply it to my wip branch, keeping you as author of the commit.
Thanks,
Patrik
I think it's easiest if I just grab a patch of your commit and apply it to my wip branch, keeping you as author of the commit.
Thanks,
Patrik
on 2012-10-18 17:41 *
By Helena Edelson
Sounds good. Sorry to bail. Tweet when it's in master?
on 2012-10-19 00:44 *
By Patrik Nordwall
No worries, you had informed about the new job so it was not a surprise.
on 2012-10-19 01:19 *
By Patrik Nordwall
Assigned to set to Patrik Nordwall
Status changed from Test to Accepted