Add internal API for metrics in Akka OSS cluster

Needs to be retrieved from Git history and adapted to the new clustering.

https://github.com/akka/akka/commit/3cee2fc8ec18f6e5aa61e714083b033c5ac21d38

on 2011-06-25 16:41 *

By Jonas Bonér

Updating tickets (#769, #774, #875, #889, #917, #920, #928, #929, #930, #931, #932, #933, #934, #935, #936, #938, #939, #940, #941, #942, #943, #944, #951, #958, #959, #960, #962, #630, #870, #891, #895)

on 2011-06-25 17:03 *

By Jonas Bonér

Milestone changed from 2.0 to 2.1

on 2011-07-06 12:16 *

By Jonas Bonér

Assigned to changed from pveentjer to -none-

Updating tickets (#87, #620, #644, #679, #750, #752, #753, #754, #764, #875, #876, #929, #938, #939, #940, #941, #942, #943, #944, #953, #954, #977, #983, #987, #996, #630, #643, #725, #892, #893)

on 2011-08-17 09:46 *

By vasil.remeniuk

If no one is working on this ticket so far (and also tickets #940-942), I can take care of them. I have past experience building monitoring services (with Sigar).

on 2011-08-17 09:48 *

By Jonas Bonér

Assigned to set to vasil.remeniuk

Awesome. Thanks a lot.

on 2011-08-19 16:24 *

By vasil.remeniuk

Status changed from New to Accepted

on 2011-08-28 15:16 *

By vasil.remeniuk

Status changed from Accepted to Fixed

(In revision:3cee2fc8ec18f6e5aa61e714083b033c5ac21d38) Internal Metrics API. Fixes #939

- Retrieves metrics snapshots of the system the node is running on through JMX monitoring MBeans or Hyperic Sigar (if the Sigar library is plugged in)
- Allows setting metrics alteration monitors that are triggered when specific conditions are satisfied (e.g., not enough memory left on the node)
- Nodes publish their local metrics to ZNodes
- To maintain good performance, the metrics manager caches snapshots internally and periodically refreshes them from ZooKeeper

Branch: master

on 2012-04-24 14:36 *

By Jonas Bonér

Assigned to changed from vasil.remeniuk to -none-

Description set to Needs to be retrieved from ...

Status changed from Fixed to New

on 2012-06-12 12:29 *

By Jonas Bonér

Description changed from Needs to be retrieved from ... to Needs to be retrieved from ...

on 2012-06-25 19:50 *

By Jonas Bonér

Assigned to set to login

on 2012-07-07 22:20 *

By Helena Edelson

Starting this week.

on 2012-07-08 21:23 *

By Helena Edelson

Sum of Child Work remaining changed from 3.0 to 0.0

Is the requirement that Sigar be optionally plugged in (not added in AkkaBuild), which is why this uses reflection (quite a lot of it in the cherry-pick that was pulled in)? What I mean is, do you really want that much reflection in your design? Or is simply adding the Sigar dep really that bad?

on 2012-07-08 22:40 *

By Patrik Nordwall

Sigar must be optional.

What is sigar used for here? Only systemLoadAverage?

Personally I would not use reflection. In Atmos we didn't use reflection and it's still optional. However, since the reflection code is already implemented, why not start with that and we can revise later. Very isolated change, and I don't think it's much code.

We could also discuss whether the dependency should propagate as a transitive dependency or not. I think I vote for letting users add the Sigar dependency themselves, if they use this feature and Sigar is needed for their environment. Sigar might not be in Maven Central.

on 2012-07-09 01:59 *

By Helena Edelson

(Comment removed)

on 2012-07-09 02:09 *

By Helena Edelson

Status changed from New to Accepted

on 2012-07-30 17:40 *

By Helena Edelson

Sum of Child Work remaining changed from 3.0 to 0.0

on 2012-08-02 18:49 *

By Helena Edelson

Which function should I use to update the new gossip that has Gossip.meta for the detected metric alert conditions?
environment.publishLatestGossip(newGossip) or notifyListeners(newGossip) ?

/**
 * Receives metrics alerts and gossips out a flag of indicators if one or more alerts have been detected.
 */
def receiveAlerts(alerts: FailureAlerts): Unit = {
  val localGossip = latestGossip
  val newGossip = localGossip copy (meta = localGossip.meta ++ alerts.meta)
  notifyListeners(newGossip)
}

thanks.

on 2012-08-02 20:36 *

By Patrik Nordwall

I don't know how you have implemented the full thing, so I'm unsure, but publishLatestGossip will be invoked periodically and is probably the way to publish meta (part of gossip) also.

/Patrik

on 2012-08-03 23:36 *

By Helena Edelson

Here is the latest status that is squashed
https://github.com/helena/akka/commit/61be9d41cbb1f111fae772f5b555fccadd2bf4d5
Branch: wip-939-cluster-metrics-helena-unsquashed
Status and To Do items on the commit msg.

on 2012-08-04 07:11 *

By Patrik Nordwall

I look forward to giving feedback on your work, and I will be able to do that at the end of next week.

/Patrik

on 2012-08-07 15:44 *

By Helena Edelson

(Comment removed)

on 2012-08-09 16:17 *

By Helena Edelson

Hi Patrik,
I'm cleaning up and testing the metrics and statistical analysis and have a question:

What are the requirements for the sampled metrics? Do you prefer JVM-specific heap only? With Sigar I can sample the current process, which I think would be Akka (the Akka pid).

With SIGAR I can get system hard limits for cpu, memory, etc as well as the current process cpu, memory, etc.
The process memory stats are what you would see with the ps/top/etc commands, but for a Java process jmx has more detail on the JVM managed heap than sigar of course.

Of course sigar has many stats that the JVM doesn't provide about the system itself.

on 2012-08-09 17:28 *

By Patrik Nordwall

For memory metrics I think JVM heap is most interesting.

GC rate could also be interesting.

For CPU I think load average from sigar is best. 1, 5 or 15 minutes? Perhaps 5? Alternative is "CPU combined" from sigar. Not for the pid.

Perhaps we should support a few alternatives - configurable.

I can provide more details tomorrow.

/Patrik

on 2012-08-09 17:44 *

By Helena Edelson

(Comment removed)

on 2012-08-13 02:20 *

By Helena Edelson

(Comment removed)

on 2012-08-13 07:25 *

By Patrik Nordwall

Did you see my feedback on https://github.com/helena/akka/commit/61be9d41cbb1f111fae772f5b555fccadd2bf4d5 ?

I think we should talk about how this information is going to be used and how to publish it in best way. I'll start thinking more about that today.

Then we should also discuss exactly what metrics we should grab from sigar, but that feels rather easy to change when we know more about the usage.

on 2012-08-13 16:10 *

By Patrik Nordwall

Hi Helena,

We had a discussion about the use case for the metrics. The primary use case, and the only one to start with, is load balancing routers. Such a router would use the metrics to send messages to the least loaded nodes. We skip the threshold and alert conditions (anomalies). Collect absolute metrics and aggregate them into an exponential moving average, which is published to other nodes.
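A minimal sketch of the exponential moving average aggregation described above. The `Ewma` class, its `updated` method, and the smoothing factor `alpha` are hypothetical names chosen for illustration; the thread does not specify an implementation or a smoothing factor.

```scala
// Hypothetical illustration of EWMA aggregation; not the Akka implementation.
// `alpha` is an assumed smoothing factor: higher values weight recent samples more.
final case class Ewma(value: Double, alpha: Double) {
  require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")

  /** Folds a new absolute sample into the average:
    * alpha * sample + (1 - alpha) * previous value. */
  def updated(sample: Double): Ewma =
    copy(value = alpha * sample + (1.0 - alpha) * value)
}

object EwmaExample {
  def main(args: Array[String]): Unit = {
    // Start from an initial value and fold in three CPU samples.
    val samples = Seq(0.9, 0.1, 0.1)
    val smoothed = samples.foldLeft(Ewma(value = 0.5, alpha = 0.5))(_ updated _)
    println(smoothed.value)
  }
}
```

Only the single smoothed value needs to be published to other nodes, so the history of raw samples can stay local.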

We separate the transport of the metrics from the ordinary versioned Gossip, i.e. don't use Gossip.meta for the metrics. Instead a separate metrics message should be used. We start with the same kind of gossip protocol as the membership Gossip, but we might change to something else later (perhaps a more subscription-based model). The metrics gossip can be simplified compared to membership gossip, since it should not be versioned, i.e. no conflicts.
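One possible reading of "not versioned, i.e. no conflicts" is that a received metrics gossip is merged by simply keeping the newest sample per node. The sketch below assumes that reading; `NodeMetrics`, `MetricsGossip`, the `timestamp` field, and the `merge` method are all hypothetical and are not specified anywhere in this ticket.

```scala
// Hypothetical illustration of an unversioned metrics gossip; not the Akka implementation.
// Assumption: each sample carries a timestamp and the newer sample always wins.
final case class NodeMetrics(address: String, timestamp: Long, load: Double)

final case class MetricsGossip(nodes: Map[String, NodeMetrics]) {

  /** Merges another gossip into this one, keeping per node address
    * whichever sample has the newer timestamp. No version vector is needed. */
  def merge(that: MetricsGossip): MetricsGossip =
    MetricsGossip(that.nodes.foldLeft(nodes) { case (acc, (address, theirs)) =>
      acc.get(address) match {
        case Some(ours) if ours.timestamp >= theirs.timestamp => acc
        case _ => acc.updated(address, theirs)
      }
    })
}
```

Because the merge only compares timestamps per node, applying the same gossip twice or in a different order gives the same result.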

The Cluster publishes the metrics to the local event bus, and the routers subscribe to the event bus.

The router will use the metrics to weight the probability of where to send messages, rather than sending only to the best node. It will not make immediate big changes to the routing, but will adjust in baby steps to avoid oscillations.
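A minimal sketch of the two ideas in the paragraph above: picking a node with probability proportional to a weight, and moving weights only a fraction of the way towards their target on each update. The object `WeightedSelection`, its methods, and the `step` parameter are hypothetical; how weights are derived from the metrics is left open here, as it is in the thread.

```scala
import scala.util.Random

// Hypothetical illustration of weighted routing; not the Akka router implementation.
object WeightedSelection {

  /** Picks an entry with probability proportional to its weight.
    * Weights must be non-negative and sum to a positive value. */
  def select[A](weights: Seq[(A, Double)], random: Random): A = {
    require(weights.nonEmpty && weights.forall(_._2 >= 0.0), "weights must be non-negative")
    val total = weights.map(_._2).sum
    require(total > 0.0, "at least one weight must be positive")
    val target = random.nextDouble() * total
    var cumulative = 0.0
    weights
      .find { case (_, w) => cumulative += w; target < cumulative }
      .map(_._1)
      // Fallback for floating-point rounding at the upper edge of the range.
      .getOrElse(weights.filter(_._2 > 0.0).last._1)
  }

  /** Moves each current weight a fraction `step` towards its target weight,
    * so routing changes gradually instead of jumping. */
  def adjust[A](current: Map[A, Double], target: Map[A, Double], step: Double): Map[A, Double] =
    target.map { case (node, t) =>
      val c = current.getOrElse(node, t)
      node -> (c + step * (t - c))
    }
}
```

A small `step` trades responsiveness for stability: the weights follow load changes more slowly, but a single noisy sample cannot redirect most of the traffic at once.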

Metrics to collect:
- cpu combined
- number of processors (cores)
- heap used, committed, max
- network IO read/write (max value?)

Note that we skip the physical memory metrics.
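A minimal sketch of sampling the JVM-side metrics from the list above using only the standard `java.lang.management` MXBeans. `NodeSample` and `JmxSampler` are hypothetical names. CPU combined and network IO are left out because the standard MXBeans used here do not expose them; system load average is included only as the closest value available without Sigar.

```scala
import java.lang.management.ManagementFactory

// Hypothetical illustration of JMX-based sampling; not the Akka implementation.
final case class NodeSample(
  processors: Int,
  heapUsed: Long,
  heapCommitted: Long,
  heapMax: Option[Long],            // None when the JVM reports max heap as undefined
  systemLoadAverage: Option[Double] // None when the platform cannot provide it
)

object JmxSampler {

  /** Takes one snapshot of heap usage and processor count from the local JVM. */
  def sample(): NodeSample = {
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    val os = ManagementFactory.getOperatingSystemMXBean
    NodeSample(
      processors = os.getAvailableProcessors,
      heapUsed = heap.getUsed,
      heapCommitted = heap.getCommitted,
      heapMax = Some(heap.getMax).filter(_ >= 0L),                       // getMax returns -1 if undefined
      systemLoadAverage = Some(os.getSystemLoadAverage).filter(_ >= 0.0) // negative if unavailable
    )
  }
}
```

Calling `JmxSampler.sample()` periodically would yield the absolute values that are then smoothed before being published to other nodes.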

Later we will add:
- network latency (via ping-pong messages)
- mailbox sizes
- ...

Can I do anything to help you with these "new" requirements? Would you like to talk about it over Skype? Would you like me to add a skeleton for the metrics gossip part (and you can fill in the details)?

on 2012-08-13 21:19 *

By Helena Edelson

Hi,

I just saw the review on that (now fairly old) squashed branch https://github.com/helena/akka/commit/61be9d41cbb1f111fae772f5b555fccadd2bf4d5. I've commented and made all the changes. That was most definitely unfinished work rather than work ready for a pull request, hence the commented-out code and the missing docs in some places, which are all cleaned up now.

I'll ponder the new requirements today, which sound good. If you can do the routers that will be a huge help. Work has been very busy so progress on this is slower than I'd like.

Let's do a skype chat Tuesday (tomorrow). 4PM your time? My skype id is holly.edelson

Thanks,
Helena

on 2012-08-13 21:33 *

By Helena Edelson

Aggregating the metrics into one EWMA is far simpler; implementing the initial requirements was getting very inelegant.
Refactoring this now.

Adding latency is excellent. That's something I can easily add when you are ready.

on 2012-08-13 21:44 *

By Patrik Nordwall

Good, talk to you tomorrow, 4pm CET.
My skype id is patrik.nordwall

/Patrik

on 2012-08-30 20:46 *

By Helena Edelson

Sum of Child Work remaining changed from 3.0 to 0.0

on 2012-09-07 00:20 *

By Helena Edelson

Assigned to changed from login to Patrik Nordwall

Status changed from Accepted to Test

on 2012-09-07 00:27 *

By Helena Edelson

(Comment removed)

on 2012-09-07 08:10 *

By Patrik Nordwall

Assigned to changed from Patrik Nordwall to login

You removed the comment? Is the linked commit ready for review? I'll take a look at it and then you can open a pull request.

The metrics-aware, adaptive routers should definitely be done as separate tickets, and this can go in first.

You don't need to assign this to me. You can own it until completed. I'll review anyway.

on 2012-09-07 19:48 *

By Helena Edelson

Pull Request