Templated alarm descriptions for human-readable alerts

Registered by jobrs on 2017-01-30

As an SRE, I want alarm notifications to include a human-readable description and a link to a playbook, so that I can resolve the root cause of the alarm easily.

Proposed Design

The alarm description field already provides enough space to accommodate actionable instructions for the recipients of the alarm. If the description could refer to attributes of the actual alarm, it could describe the root cause of the alarm and link to additional information such as dashboards and playbooks.

The proposed solution consists of the following parts:

API
* add the alarm-description field to alarm objects, in addition to alarm-definition-id and alarm-definition-name (backwards-compatible API extension)
* support Jinja2 syntax in alarm descriptions to allow dynamic content
* make alarm attributes available for use in the description templates (most notably the dimensions of the alarm)
* support simplified Markdown syntax in descriptions to permit hyperlinks and basic formatting
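The description templating above can be sketched with the Jinja2 Python library. This is only an illustration, not the implementation referenced in the whiteboard: the function name `render_description` is hypothetical, and the dimension values are taken from the Slack example further down. `StrictUndefined` is used so that a template referring to a dimension the alarm does not have fails loudly instead of rendering an empty string.

```python
from jinja2 import Environment, StrictUndefined

# Description template as it might be stored on the alarm definition.
DESCRIPTION = (
    "The consumer offsets {{consumer_group}} for {{topic}} are ahead of "
    "the actual queue contents."
)


def render_description(template_text, dimensions):
    """Render an alarm description, exposing the alarm dimensions as variables.

    Hypothetical helper: a missing dimension raises jinja2.UndefinedError.
    """
    env = Environment(undefined=StrictUndefined)
    return env.from_string(template_text).render(**dimensions)


rendered = render_description(
    DESCRIPTION,
    {"consumer_group": "monasca-persister", "topic": "metrics"},
)
print(rendered)
```

Whether a missing dimension should raise an error or render as an empty value is a design choice the blueprint leaves open; the strict variant is shown here because it makes template mistakes visible early.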

Notification
* support configuring Jinja2 templates for notifications
* make notification and alarm attributes available for use in the notification templates (most notably alarm-name, rendered description, state-change date, alarm-state, severity)
* support simplified Markdown syntax in notification templates to permit hyperlinks and basic formatting
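One possible reading of "simplified Markdown" is a small, fixed set of constructs that each notification channel translates into its own markup. The sketch below assumes only links and bold text are supported and converts them to Slack's message formatting, where a link is written as `<url|text>` and bold text uses single asterisks. The function name `markdown_to_slack` and the choice of constructs are assumptions made for illustration.

```python
import re

# [text](url) -> <url|text>
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
# **bold** -> *bold*
BOLD = re.compile(r"\*\*(.+?)\*\*")


def markdown_to_slack(text):
    """Translate a minimal Markdown subset (links, bold) to Slack markup."""
    text = LINK.sub(r"<\2|\1>", text)
    return BOLD.sub(r"*\1*", text)


print(markdown_to_slack(
    "[Dashboard](https://grafana.xyz.org/dashboard/db/monasca-overview)"
))
```

Keeping the subset this small means every channel (e-mail, Slack, webhooks) only needs a handful of substitution rules rather than a full Markdown parser.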

Example:

Here is an example of how Slack alerting would work:

1. You have an alarm description in Markdown syntax with Jinja2 template variables.

The consumer offsets {{consumer_group}} for {{topic}} are ahead of the actual queue contents.\n\n[Dashboard](https://grafana.xyz.org/dashboard/db/monasca-overview)

2. You have a channel template for Slack:

slack:
  timeout: 60
  ca_certs: "/etc/ssl/certs/ca-certificates.crt"
  mime_type: application/json
  template:
    text: |
      {
        "username": "Monasca (de)",
        "icon_url": "…",
        "mrkdwn": true,
        "attachments": [ {
          "fallback": "{{alarm_description}}", "color": "{{ {'ALARM': '#d60000', 'OK': '#36a64f', 'UNDETERMINED': '#fff000'}[state] }}",
          "title": "{{ {'ALARM': '*Alarm triggered*', 'OK': 'Alarm cleared', 'UNDETERMINED':'Missing alarm data'}[state] }} for {{alarm_name}} in de",
          "title_link": "https://dashboard.xyz.org/mydomain/myproject/monitoring/alarms?id={{alarm_id}}",
          "text": "{% if state == 'ALARM' %}:bomb:{{alarm_description}}\n{{message}}{% elif state == 'OK' %}:white_check_mark: Resolved: {{alarm_description}}{% else %}:grey_question:{{alarm_description}}{% endif %}",
          "mrkdwn_in": ["text", "title", "fallback"] } ]
      }

3. You receive alarms like this:

*Monasca (de)*
*Alarm triggered*
The consumer offsets *monasca-persister* for *metrics* are ahead of the actual queue contents.

[https://grafana.xyz.org/dashboard/db/monasca-overview]
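The example above can be sketched end to end in Python. The payload template here is a shortened, hypothetical variant of the channel template, and `render_payload` and the alarm name `kafka-consumer-lag` are made-up names. One difference is deliberate: the channel template above places `{{alarm_description}}` directly between JSON string quotes, so a description containing a double quote or a line break would yield invalid JSON. The sketch uses Jinja2's built-in `tojson` filter instead, which emits a quoted and escaped JSON string.

```python
import json

from jinja2 import Environment

# Shortened, hypothetical payload template; `tojson` adds quotes and escaping.
PAYLOAD_TEMPLATE = """{
  "username": "Monasca (de)",
  "attachments": [ {
    "fallback": {{ alarm_description | tojson }},
    "color": "{{ {'ALARM': '#d60000', 'OK': '#36a64f', 'UNDETERMINED': '#fff000'}[state] }}",
    "title": {{ ({'ALARM': 'Alarm triggered', 'OK': 'Alarm cleared', 'UNDETERMINED': 'Missing alarm data'}[state] ~ ' for ' ~ alarm_name) | tojson }}
  } ]
}"""


def render_payload(state, alarm_name, alarm_description):
    """Render the template and parse it, so invalid JSON is caught early."""
    text = Environment().from_string(PAYLOAD_TEMPLATE).render(
        state=state,
        alarm_name=alarm_name,
        alarm_description=alarm_description,
    )
    return json.loads(text)


payload = render_payload(
    "ALARM",
    "kafka-consumer-lag",
    'The consumer offsets for "metrics" are ahead.\nSee the dashboard.',
)
print(payload["attachments"][0]["title"])
```

Parsing the rendered text with `json.loads` before sending it is a cheap safeguard: a broken template then fails at render time rather than being rejected by the receiving service.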

Blueprint information

Status: Not started
Approver: Roland Hochmuth
Priority: High
Drafter: jobrs
Direction: Needs approval
Assignee: jobrs
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

A first implementation for this is available for Slack here: https://github.com/sapcc/monasca-api, https://github.com/sapcc/monasca-notification

Unfortunately, this implementation does not live on a separate branch, so upstreaming requires an extra step in which the actual diff is extracted.

Gerrit topic: https://review.openstack.org/#q,topic:bp/templated-alarms,n,z

Addressed by: https://review.openstack.org/437532
    Support templated alarm descriptions and notification templates

Addressed by: https://review.openstack.org/437548
    Support templated alarm descriptions and notification templates
