Distributed & scalable threshold rule evaluation for alarms

Registered by Eoghan Glynn on 2013-03-05

A simple method of detecting threshold breaches for alarms is to do so directly "in-stream" as the metric datapoints are ingested. However, this approach is overly restrictive for wide-dimension metrics, where a datapoint from a single source is insufficient to perform the threshold evaluation. In-stream evaluation is also less suited to detecting missing or delayed data.

An alternative approach is to use a horizontally scaled array of threshold evaluators, partitioning the set of alarm rules across these workers. Each worker would poll for the aggregated metric corresponding to each rule it has been assigned.
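
One polling cycle on a single worker might look like the sketch below. It is illustrative only: the AlarmRule fields, the state names, and the fetch_aggregate callable are hypothetical stand-ins for whatever rule model and statistics query the implementation actually uses.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

# Hypothetical rule model: field names are assumptions, not the real schema.
@dataclass(frozen=True)
class AlarmRule:
    name: str
    metric: str
    comparison: str   # one of "gt", "ge", "lt", "le"
    threshold: float

COMPARATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}

def evaluate_assigned_rules(
    rules: Iterable[AlarmRule],
    fetch_aggregate: Callable[[str], Optional[float]],
) -> Dict[str, str]:
    """Run one polling cycle over the rules assigned to this worker.

    fetch_aggregate(metric) is assumed to return the aggregated value
    for the metric, or None when no data is available.
    """
    states = {}
    for rule in rules:
        value = fetch_aggregate(rule.metric)
        if value is None:
            # Polling makes the absence of data directly observable.
            states[rule.name] = "insufficient data"
        elif COMPARATORS[rule.comparison](value, rule.threshold):
            states[rule.name] = "alarm"
        else:
            states[rule.name] = "ok"
    return states
```

Because the worker asks for an already aggregated value, a rule over a wide-dimension metric needs no special handling here, and a missing aggregate is visible as an explicit state instead of going unnoticed.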

The allocation of rules to evaluation workers could take into account both locality (ensuring rules applying to the same metric are handled by the same workers if possible) and fairness (ensuring the workload is evenly balanced across the current population of workers).
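
As a rough sketch of how locality and fairness could be combined, the example below groups rules by metric and then hands each group to the currently least-loaded worker. This greedy scheme is an assumption chosen for illustration, not necessarily the allocation used by the implementation; the allocate_rules name and the (rule name, metric) input shape are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def allocate_rules(
    rules: Iterable[Tuple[str, str]],
    workers: List[str],
) -> Dict[str, List[str]]:
    """Partition (rule_name, metric) pairs across workers.

    Locality: all rules on the same metric go to the same worker.
    Fairness: each metric group goes to the least-loaded worker so far.
    """
    if not workers:
        raise ValueError("at least one worker is required")

    # Group rule names by the metric they apply to.
    by_metric = defaultdict(list)
    for rule_name, metric in rules:
        by_metric[metric].append(rule_name)

    # Place the largest groups first; ties are broken by metric name
    # so the result is deterministic.
    groups = sorted(by_metric.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    # Min-heap of (current load, worker id).
    heap = [(0, worker) for worker in workers]
    heapq.heapify(heap)

    allocation = {worker: [] for worker in workers}
    for _metric, names in groups:
        load, worker = heapq.heappop(heap)
        allocation[worker].extend(names)
        heapq.heappush(heap, (load + len(names), worker))
    return allocation
```

Keeping a metric's rules together trades some balance for locality: one very popular metric can leave a single worker with more rules than its peers. When the worker population changes, the allocation would need to be recomputed.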

The polling cycle would also provide a logical point to implement policies such as:

  * correcting for metric lag
  * gracefully handling sparse metrics versus detecting missing expected datapoints
  * selectively excluding chaotic data.
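
The policies above could hook into the polling cycle roughly as follows. This is a sketch under stated assumptions: lag is corrected by shifting the query window back by a fixed number of seconds, missing data is detected by comparing the datapoint count against an expected count, and "chaotic" data is interpreted as outliers far from the median. The function names and the min_fraction and spread_limit parameters are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple

def evaluation_window(now: float, period: float, periods: int,
                      lag: float) -> Tuple[float, float]:
    """Return (start, end) of the query window in seconds.

    The window is shifted back by `lag` so that datapoints still in
    flight are not mistaken for missing ones.
    """
    end = now - lag
    start = end - period * periods
    return start, end

def assess_datapoints(values: Sequence[float], expected: int,
                      min_fraction: float = 0.5,
                      spread_limit: float = 3.0) -> Tuple[str, List[float]]:
    """Decide whether a window holds enough usable data.

    `expected` is the datapoint count a dense metric would produce;
    a sparse metric can tolerate gaps by lowering `min_fraction`.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if len(values) < expected * min_fraction:
        return "insufficient data", []

    # Drop values further than spread_limit median absolute
    # deviations from the median (one possible outlier filter).
    mid = median(values)
    mad = median([abs(v - mid) for v in values])
    if mad == 0:
        return "ok", list(values)
    kept = [v for v in values if abs(v - mid) <= spread_limit * mad]
    return "ok", kept
```

For example, with a 60-second period, three periods, and a 30-second lag allowance, a poll at time 1000 would query the interval from 790 to 970 instead of 820 to 1000.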

Blueprint information

Status:
Complete
Approver:
Julien Danjou
Priority:
High
Drafter:
Eoghan Glynn
Direction:
Approved
Assignee:
Eoghan Glynn
Definition:
Drafting
Series goal:
Accepted for havana
Implementation:
Implemented
Milestone target:
2013.2
Started by
Eoghan Glynn on 2013-05-16
Completed by
Julien Danjou on 2013-07-08

Whiteboard

Getting the ball rolling on reviews with:

  https://review.openstack.org/34468
