Add Alarm Grouping Functionality

Registered by Kaiyan

Alarm grouping combines alarms of a similar nature into a single notification. This is especially useful during large outages, when many systems fail at once and thousands of alarms go off simultaneously. For example, when a network partition occurs in your cluster, half of your service instances can no longer reach the database, and hundreds of alarms fire for different services. Receiving hundreds of notifications caused by the same problem is a poor experience for the user. Being able to group alarms by specific fields, such as cluster name or alarm name, and to send one compact notification instead would be very helpful.

To implement alarm grouping, a new resource called alarm grouping manager will be added to the Monasca API. Grouping rules are created through the alarm grouping manager resource, which also lets you query, update, patch or delete them. For more details, see the blueprint: Monasca API Alarm Managers.

Inside Monasca-notification, two checks need to be added. The first checks whether any alarm actions, ok actions or undetermined actions are associated with the alarm transition. If there are, a notification is created and sent immediately using the notification engine. The second reads the alarm manager rules and compares them with the current alarm state transition. If the transition matches any alarm grouping, silencing or inhibition rule, it is published back to Kafka under a new topic name, 'filtered alarm transitions', with a unique key associated with it.

Periodically, a new component called the alarm manager engine wakes up and consumes all the filtered alarm transitions as well as the alarm manager rules. It starts with the inhibition rules and compares them with the filtered alarm transitions. For more details about inhibition, see the alarm inhibition functionality blueprint. The alarm transitions are then compared with the silencing rules. See the alarm silencing functionality blueprint for more details. The last step in the alarm manager engine is to apply the grouping manager rules to the filtered alarm transitions and, based on those rules, send out one compact notification.
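The two checks in Monasca-notification could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the Transition class, the route function, the topic string, the key format and the matching semantics (a transition matches a rule when it carries every field listed in "matchers") are all assumptions made for the example, and no real Kafka client is used. It also assumes the two checks are independent, so a transition can both trigger an immediate notification and be published for later processing.

```python
from dataclasses import dataclass, field

# Hypothetical topic identifier; the blueprint only names it
# 'filtered alarm transitions'.
FILTERED_TOPIC = "filtered-alarm-transitions"


@dataclass
class Transition:
    """Illustrative stand-in for an alarm state transition message."""
    alarm_id: str
    alarm_name: str
    state: str
    dimensions: dict
    alarm_actions: list = field(default_factory=list)
    ok_actions: list = field(default_factory=list)
    undetermined_actions: list = field(default_factory=list)


def has_own_actions(t):
    # Check 1: does the transition carry any notification actions?
    return bool(t.alarm_actions or t.ok_actions or t.undetermined_actions)


def matches_rule(t, rule):
    # Assumed semantics: the transition has every field named in "matchers".
    return all(m in t.dimensions for m in rule["matchers"])


def route(t, rules):
    """Return the (destination, key) pairs this transition is sent to."""
    destinations = []
    if has_own_actions(t):
        destinations.append(("notify", None))
    # Check 2: does any grouping, silencing or inhibition rule apply?
    if any(matches_rule(t, r) for r in rules):
        # Assumed key format; the blueprint only requires it to be unique.
        destinations.append((FILTERED_TOPIC, f"{t.alarm_id}:{t.state}"))
    return destinations
```

A transition with no actions of its own and no matching rule yields an empty list, meaning nothing is sent.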

Grouping example:

GroupingRule1 = '{"alarm-grouping-definition-created": {"name": "group_rule_1", "matchers": ["hostname"], "id": "b7163", "repeat_interval": "2h", "group_wait": "30s", "exclusions": {"alarm_name": "cpu_percent_high"}, "tenantId": "d42bc", "alarm_actions": ["cd892"], "ok_actions": ["ad892"], "undetermined_actions": ["cf892"]}}'

Three alarm transitions: AT1, AT2 and AT3

AT1_hostname = host1
AT2_hostname = host1
AT3_hostname = host2
AT1_alarm_name = cpu_percent_high
AT2_alarm_name = cpu_system_perc_high
AT3_alarm_name = cpu_percent_high
AT1_state = ALARM
AT2_state = ALARM
AT3_state = ALARM

Output:
AT1 and AT3 match the exclusions, so their notifications are sent immediately.
AT2 is grouped: a grouped notification "group_notification_rule_1_host1_alarm[1]" is generated and sent out using alarm_actions ["cd892"].

Note: There are no alarm_actions, ok_actions or undetermined_actions associated with the AT1, AT2 and AT3 alarm definitions.
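The grouping step in this example can be sketched as below. The function name apply_grouping, the flat dictionary layout of the transitions and the exclusion semantics (a transition is excluded when any exclusion field matches) are assumptions for illustration only. The rule is reduced to the fields the sketch needs.

```python
from collections import defaultdict


def apply_grouping(rule, transitions):
    """Split transitions into immediately notified ones and grouped ones."""
    immediate = []
    groups = defaultdict(list)
    exclusions = rule.get("exclusions", {})
    for t in transitions:
        # Assumed semantics: any matching exclusion field bypasses grouping.
        if any(t.get(k) == v for k, v in exclusions.items()):
            immediate.append(t["id"])
            continue
        # Group by the values of the matcher fields, e.g. ('host1',).
        key = tuple(t.get(m) for m in rule["matchers"])
        groups[key].append(t["id"])
    return immediate, dict(groups)


rule = {
    "name": "group_rule_1",
    "matchers": ["hostname"],
    "exclusions": {"alarm_name": "cpu_percent_high"},
    "alarm_actions": ["cd892"],
}
transitions = [
    {"id": "AT1", "hostname": "host1", "alarm_name": "cpu_percent_high", "state": "ALARM"},
    {"id": "AT2", "hostname": "host1", "alarm_name": "cpu_system_perc_high", "state": "ALARM"},
    {"id": "AT3", "hostname": "host2", "alarm_name": "cpu_percent_high", "state": "ALARM"},
]

immediate, groups = apply_grouping(rule, transitions)
print(immediate)  # ['AT1', 'AT3']
print(groups)     # {('host1',): ['AT2']}
```

Each entry in groups would then produce one compact notification sent through the rule's alarm_actions.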

For more examples, see the Monasca wiki page: https://wiki.openstack.org/wiki/Monasca#Alarm_Managers

Blueprint information

Status:
Not started
Approver:
None
Priority:
High
Drafter:
Kaiyan
Direction:
Needs approval
Assignee:
Kaiyan
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/alarm-inhibition,n,z

Addressed by: https://review.openstack.org/434537
    [WIP]Modify Notification Engine to allow inhibit, silence, and group

Addressed by: https://review.openstack.org/438032
    Documentation for alarm state transition flow

Addressed by: https://review.openstack.org/447060
    Add alarm rule table in mysql for querying

Gerrit topic: https://review.openstack.org/#q,topic:group_silence_inhibit_rule,n,z


Work Items
