Add Alarm Inhibition Functionality

Registered by Andrea Adams

Alarm inhibition adds the ability to suppress certain notifications based on the state of other alarms. It is configured by adding an inhibit rule to the API, using a new API resource (see the alarm managers API resource blueprint for more details). A user specifies a source alarm class, a target alarm class, and a set of matchers. When an alarm matching the source alarm class definition transitions to alarm from OK or undetermined, all target alarms that have matchers equal to the source alarm are inhibited and do not send a notification.

Before any notification is sent out, the alarm needs to be checked against the existing inhibition rules. If the alarm is inhibited by some other alarm that is already in the alarm state, no notification is sent. All target alarms wait in a queue to see whether a source alarm matching the inhibition rules comes along. There will also be a tag where exclusions can be added. If an alarm matches an inhibition rule but also matches the exclusions, it is not inhibited.
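The rule check described above can be sketched as follows. This is only an illustration under stated assumptions: the blueprint does not define the rule schema, so the `InhibitRule` class, its field names (`source`, `target`, `equal`, `exclusions`), the function `is_inhibited`, and the representation of alarms as plain dimension dictionaries are all hypothetical. It also assumes that "matchers equal to the source alarm" means the listed dimensions must have identical values on both alarms.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Dimensions = Dict[str, str]


@dataclass
class InhibitRule:
    """Hypothetical in-memory form of an inhibit rule (not the real schema)."""
    source: Dimensions            # selector for the source alarm class
    target: Dimensions            # selector for the target alarm class
    equal: List[str]              # dimensions that must match on both alarms
    exclusions: Dimensions = field(default_factory=dict)


def _matches(dimensions: Dimensions, selector: Dimensions) -> bool:
    """True if every key/value pair of the selector is present in dimensions."""
    return all(dimensions.get(k) == v for k, v in selector.items())


def is_inhibited(rule: InhibitRule, target_dims: Dimensions,
                 sources_in_alarm: List[Dimensions]) -> bool:
    """Decide whether a target alarm is inhibited by any source in alarm state."""
    if not _matches(target_dims, rule.target):
        return False
    # An alarm that matches the exclusions is never inhibited.
    if rule.exclusions and _matches(target_dims, rule.exclusions):
        return False
    for src in sources_in_alarm:
        if not _matches(src, rule.source):
            continue
        # Assumption: every "equal" dimension must exist on the source
        # and carry the same value on the target.
        if all(k in src and src[k] == target_dims.get(k) for k in rule.equal):
            return True
    return False


rule = InhibitRule(
    source={"alarm_name": "host_down"},
    target={"severity": "low"},
    equal=["hostname"],
    exclusions={"service": "billing"},
)
firing = [{"alarm_name": "host_down", "hostname": "node1"}]
print(is_inhibited(rule, {"severity": "low", "hostname": "node1"}, firing))
```

In this sketch a low-severity alarm on `node1` is inhibited while `host_down` is in alarm for the same host, but the same alarm tagged `service=billing` is not, because it matches the exclusions.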

The reason for adding this feature is so that a user can define a set of notifications that all indicate the same problem. If one big failure or bug causes potentially thousands of notifications to alert, this feature would mute all notifications after the one indicating the main problem.
The needed changes include changes to the API, the python-monascaclient, and the Notification Engine.

For API changes, please see the API resources blueprint. The Notification Engine will start by sending any notifications that match the alarm's actions. Then, the Notification Engine will need to check an alarm against all alarm grouping, inhibition, and silencing rules. If an alarm matches any silence, inhibit, or grouping rule, it is sent to a Kafka topic that relates to the group wait time associated with the alarm, or to some default wait time. After the wait time, an Alarm Manager Engine will pull off the batch of alarms for processing.

The Alarm Manager Engine will first check whether an alarm is inhibited by comparing a target alarm to a source alarm. See the "Does the target alarm send a notification?" table for more details. A source alarm in the alarm state always sends a notification, and a source alarm in the OK or undetermined state never sends a notification. If a source alarm transitions to OK or undetermined from alarm, then all target alarms are uninhibited and no notification is sent. Then the Alarm Manager Engine checks the alarm for any silencing and grouping rules.

The Alarm Manager and the Notification Engine will read rules from the database on start up, and at runtime from a Kafka topic called Alarm Manager Rules, and will keep the rules in memory. For more information about the flow, see the diagram. For more information about silencing and grouping, see the silencing and grouping blueprints.
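The routing step, in which an alarm that matches a rule is placed on a topic tied to a wait time, could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the `Rule` class, the `route_alarm` function, the default of 30 seconds, the topic naming scheme, and the choice of the shortest wait when several rules match are not defined by the blueprint. No Kafka client is used; the function only returns the topic name an alarm would be sent to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DEFAULT_WAIT_SECONDS = 30  # hypothetical default wait time


@dataclass
class Rule:
    """Hypothetical rule kept in memory by the Notification Engine."""
    kind: str                         # "silence", "inhibit", or "group"
    selector: Dict[str, str]          # dimensions an alarm must carry
    group_wait_seconds: Optional[int] = None


def route_alarm(alarm_dims: Dict[str, str], rules: List[Rule],
                default_wait: int = DEFAULT_WAIT_SECONDS) -> Optional[str]:
    """Return the wait topic for an alarm, or None if no rule matches."""
    matched = [
        r for r in rules
        if all(alarm_dims.get(k) == v for k, v in r.selector.items())
    ]
    if not matched:
        return None
    waits = [r.group_wait_seconds for r in matched
             if r.group_wait_seconds is not None]
    # Assumption: with several matching rules, use the shortest wait.
    wait = min(waits) if waits else default_wait
    return f"alarm-manager-wait-{wait}s"  # hypothetical topic name


rules = [
    Rule("group", {"service": "db"}, group_wait_seconds=60),
    Rule("inhibit", {"severity": "low"}),
]
print(route_alarm({"service": "db", "severity": "low"}, rules))
```

An alarm that matches no rule yields `None` here, meaning it would not be queued for the Alarm Manager Engine at all.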

Does the target alarm send a notification?

Source State         OK     Alarm   OK      Alarm
Target State         OK     OK      Alarm   Alarm
Notification sent    Yes    No      Yes     No
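The table can be expressed as a small lookup, shown here as a sketch. The state strings and the function name `target_sends_notification` are hypothetical, and only the four combinations listed in the table are covered; the table does not say what happens when either alarm is undetermined, so the function raises a `KeyError` in that case rather than guessing.

```python
# (source_state, target_state) -> does the target alarm send a notification?
NOTIFICATION_TABLE = {
    ("OK", "OK"): True,
    ("ALARM", "OK"): False,
    ("OK", "ALARM"): True,
    ("ALARM", "ALARM"): False,
}


def target_sends_notification(source_state: str, target_state: str) -> bool:
    """Look up the table; raises KeyError for states the table omits."""
    return NOTIFICATION_TABLE[(source_state, target_state)]


for (source, target), sent in NOTIFICATION_TABLE.items():
    print(f"source={source:<5} target={target:<5} notification sent: {sent}")
```

Read this way, the outcome in the table depends only on the source state: the target is muted whenever the source is in the alarm state.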

Please see examples in Monasca wiki page: https://wiki.openstack.org/wiki/Monasca#Alarm_Managers

Blueprint information

Status: Not started
Approver: None
Priority: High
Drafter: Andrea Adams
Direction: Needs approval
Assignee: Andrea Adams
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/alarm-inhibition,n,z

Addressed by: https://review.openstack.org/434537
    [WIP]Modify Notification Engine to allow inhibit, silence, and group

Addressed by: https://review.openstack.org/438032
    Documentation for alarm state transition flow

Addressed by: https://review.openstack.org/447060
    Add alarm rule table in mysql for querying

Gerrit topic: https://review.openstack.org/#q,topic:group_silence_inhibit_rule,n,z
