Distributed & scalable threshold rule evaluation for alarms
A simple method of detecting threshold breaches for alarms is to do so directly "in-stream" as the metric datapoints are ingested. However, this approach is overly restrictive for wide-dimension metrics, where a datapoint from a single source is insufficient to perform the threshold evaluation. In-stream evaluation is also poorly suited to detecting missing or delayed data, since it is driven by arriving datapoints and nothing triggers it when an expected datapoint fails to arrive.
An alternative approach is to use a horizontally scaled array of threshold evaluators, partitioning the set of alarm rules across these workers. Each worker would poll for the aggregated metric corresponding to each rule it has been assigned.
The allocation of rules to evaluation workers could take into account both locality (ensuring rules applying to the same metric are handled by the same workers if possible) and fairness (ensuring the workload is evenly balanced across the current population of workers).
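As a rough illustration of how locality and fairness could be combined, the sketch below first groups rules by metric (so that rules on the same metric land on the same worker) and then hands each group to the currently least-loaded worker, largest groups first. The function name `allocate_rules`, the `(rule_id, metric)` tuple shape and the worker identifiers are all hypothetical; this is one possible greedy scheme, not the allocation the blueprint actually implemented.

```python
from collections import defaultdict


def allocate_rules(rules, workers):
    """Greedy allocation sketch (hypothetical helper, not a real API).

    rules:   iterable of (rule_id, metric_name) tuples
    workers: list of worker identifiers (strings)
    Returns a dict mapping each worker to the list of rule ids it owns.
    """
    if not workers:
        raise ValueError("at least one worker is required")

    # Locality: keep all rules for a given metric together.
    by_metric = defaultdict(list)
    for rule_id, metric in rules:
        by_metric[metric].append(rule_id)

    assignment = {w: [] for w in workers}

    # Fairness: place the largest groups first, each on the worker that
    # currently holds the fewest rules. Ties are broken by name so the
    # result is deterministic.
    groups = sorted(by_metric.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _metric, group in groups:
        target = min(workers, key=lambda w: (len(assignment[w]), w))
        assignment[target].extend(group)

    return assignment


rules = [("r1", "cpu"), ("r2", "cpu"), ("r3", "mem"), ("r4", "disk")]
allocation = allocate_rules(rules, ["w1", "w2"])
```

In this example both `cpu` rules end up on the same worker while the overall rule count stays balanced. When the worker population changes, the allocation would need to be recomputed; how that rebalancing is triggered is outside the scope of this sketch.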
The polling cycle would also provide a logical point to implement policies such as:
* correcting for metric lag
* gracefully handling sparse metrics versus detecting missing expected datapoints
* selectively excluding chaotic data.
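The first two policies can be sketched as a single evaluation step run on each polling cycle. The example below shifts the evaluation window back by a configurable lag, and reports a distinct state when too few datapoints fall inside the window, rather than treating a gap as a breach or as a recovery. All names here (`evaluate`, the rule dictionary keys, the state strings) are assumptions made for illustration, and exclusion of chaotic data is not covered.

```python
import operator

# Hypothetical mapping from a rule's comparison name to a Python operator.
COMPARATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}


def evaluate(rule, datapoints, now):
    """Evaluate one rule against aggregated datapoints (illustrative only).

    rule:       dict with keys threshold, comparison, period,
                evaluation_periods, lag, min_datapoints (all assumed names)
    datapoints: list of (timestamp_seconds, value) tuples
    now:        current time in seconds
    """
    # Correct for metric lag: ignore the most recent `lag` seconds, where
    # aggregates may still be incomplete.
    window_end = now - rule["lag"]
    window_start = window_end - rule["period"] * rule["evaluation_periods"]

    values = [v for ts, v in datapoints if window_start <= ts < window_end]

    # Missing or sparse data: surface it explicitly instead of guessing.
    if len(values) < rule["min_datapoints"]:
        return "insufficient data"

    compare = COMPARATORS[rule["comparison"]]
    # Assumption: alarm only when every datapoint in the window breaches.
    if all(compare(v, rule["threshold"]) for v in values):
        return "alarm"
    return "ok"


rule = {
    "threshold": 80.0,
    "comparison": "gt",
    "period": 60,
    "evaluation_periods": 2,
    "lag": 30,
    "min_datapoints": 2,
}
state = evaluate(rule, [(860, 90.0), (920, 95.0), (990, 10.0)], now=1000)
```

Here the datapoint at timestamp 990 falls inside the lag margin and is ignored, so the rule is judged only on the two settled datapoints. A sparse metric could be accommodated by lowering `min_datapoints`, whereas a metric expected to report every period would keep it equal to `evaluation_periods`.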
Blueprint information
- Status: Complete
- Approver: Julien Danjou
- Priority: High
- Drafter: Eoghan Glynn
- Direction: Approved
- Assignee: Eoghan Glynn
- Definition: Drafting
- Series goal: Accepted for havana
- Implementation: Implemented
- Milestone target: 2013.2
- Started by: Eoghan Glynn
- Completed by: Julien Danjou
Whiteboard
Getting the ball rolling on reviews with: