Enable health management for senlin

Registered by Cindia-blue

As a clustering service, Senlin creates, updates, and deletes clusters of nodes from customizable profiles. Who is then responsible for the health of these clusters and nodes afterwards? There is no easy answer or unified solution to this problem, given the complexity of distributed environments and the diversity of user requirements. The top priority for Senlin is to track the health status of the clusters it creates and to expose an interface through which admins can define health management policies, e.g. how to recover. Should Senlin depend on other projects that work on monitoring and alarming to achieve this goal? Definitely: the problems caused by large-scale monitoring and data aggregation are out of scope for Senlin, and cooperation with the projects that focus on resolving them is desirable for flexible and demanding health management scenarios. At the same time, Senlin should be able to determine on its own the basic health status (e.g. node aliveness) of the clusters that have a health management policy attached, and to execute recovery actions on failed nodes. These capabilities are highly desirable for cluster life-cycle maintenance and are the targets of this blueprint.

Blueprint information

Status: Complete
Approver: Qiming Teng
Priority: High
Drafter: Cindia-blue
Direction: Approved
Assignee: Cindia-blue
Definition: Approved
Series goal: Accepted for mitaka
Implementation: Implemented
Milestone target: mitaka-3
Started by: Cindia-blue
Completed by: Cindia-blue

Whiteboard

This blueprint scopes and resolves health management for clusters and nodes created by Senlin. The health management service of Senlin keeps the status of clusters and nodes consistent and recovers nodes in “ERROR” status using operations specified by users. Instead of monitoring the whole infrastructure or cloud applications across a huge domain, the health management service triggers periodic status checks on the clusters. Users can use a health_policy to define the recover operations to run when an error happens and bind the policy to the targeted clusters. Another advantage of policy binding is that it distinguishes the clusters that require higher health consistency from the others. For these clusters, the health management service will enable an embedded listener so that status changes are processed quickly.
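
The periodic check described above can be pictured with a minimal sketch. It is an illustration only, not Senlin's implementation: the `Node`, `Cluster`, and `HealthPolicy` classes, their fields, the `periodic_check` function, and the `is_alive` callback are all hypothetical names introduced here, and the "recovery" is simulated by resetting the status.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    status: str = "ACTIVE"


@dataclass
class HealthPolicy:
    # Hypothetical fields; not the real health_policy schema.
    recover_action: str = "RECREATE"
    check_interval: int = 60  # seconds between periodic checks


@dataclass
class Cluster:
    name: str
    nodes: List[Node] = field(default_factory=list)
    health_policy: Optional[HealthPolicy] = None


def periodic_check(cluster: Cluster,
                   is_alive: Callable[[Node], bool]) -> List[Tuple[str, str]]:
    """Run one round of the status check on a cluster.

    Nodes that fail the aliveness probe are marked ERROR. If a health
    policy is attached, the policy's recover action is applied (simulated
    here by setting the status back to ACTIVE). Returns the list of
    (node name, action) pairs that were recovered.
    """
    recovered = []
    for node in cluster.nodes:
        if not is_alive(node):
            node.status = "ERROR"
        if node.status == "ERROR" and cluster.health_policy is not None:
            action = cluster.health_policy.recover_action
            node.status = "ACTIVE"  # stand-in for the real recover operation
            recovered.append((node.name, action))
    return recovered
```

In this sketch a cluster without a policy still has its node status corrected to ERROR, but no recovery is attempted; only clusters with an attached policy are repaired.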

Use Cases
=========

Two typical use cases are listed below, but the health management service should not be limited to
them:
A) An auto-scaling cluster needs consistent node health status when scaling out or in, so that the
number of nodes to change is calculated accurately.
B) When users list node or cluster status, the status returned is consistent with the underlying nova.
This allows applications or users to rely on the status kept by Senlin.
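
Use case A can be made concrete with a small arithmetic sketch. The function name `nodes_to_create` and the status strings are assumptions made for illustration; the point is only that a stale status record changes the computed count.

```python
def nodes_to_create(desired_capacity, node_statuses):
    """Return how many nodes a scale-out must add to reach the desired capacity.

    Only nodes recorded as ACTIVE count toward the current capacity.
    """
    healthy = sum(1 for status in node_statuses if status == "ACTIVE")
    return max(desired_capacity - healthy, 0)


# Stale record: one node has failed but is still recorded as ACTIVE.
stale = ["ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE"]
# Record after a health check has marked the failed node.
checked = ["ACTIVE", "ACTIVE", "ACTIVE", "ERROR"]

print(nodes_to_create(5, stale))    # 1
print(nodes_to_create(5, checked))  # 2
```

With the stale record the cluster would end up one node short of the desired capacity, which is the inaccuracy the use case refers to.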

Design
======

Three groups of functions should be implemented for the health management design:
A) Detection of status inconsistency: both polling-based and listener-based functions should be
provided.
B) Recovery of clusters: recovery actions should be provided for both clusters and nodes. To keep the
design extensible for different profile types, the detailed recover operations should be implemented and
overridden in the profile.
C) Customization: instead of changing the Senlin engine directly, users can define and override the
health_policy to include the recover operation and attach the policy to the clusters that they think need
more health care than others.
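
The split between a generic recover action and profile-specific operations might look like the following sketch. The method names `do_check` and `do_recover` echo the titles of the reviews listed below, but the signatures, the dict-based node records, the `FakeServerProfile` class, and the `cluster_recover` helper are all assumptions for illustration and do not reproduce Senlin's code.

```python
class Profile:
    """Base profile with generic check and recover hooks (hypothetical)."""

    def do_check(self, node):
        # Default: trust the status recorded for the node.
        return node["status"] == "ACTIVE"

    def do_recover(self, node, operation="RECREATE"):
        raise NotImplementedError("profile types override this")


class FakeServerProfile(Profile):
    """Stand-in for a server profile; `backend` maps physical id to real status."""

    def __init__(self, backend):
        self.backend = backend

    def do_check(self, node):
        # Compare against the backend instead of the recorded status.
        return self.backend.get(node["physical_id"]) == "ACTIVE"

    def do_recover(self, node, operation="RECREATE"):
        if operation == "RECREATE":
            # Drop the old resource and create a replacement.
            self.backend.pop(node["physical_id"], None)
            node["physical_id"] = node["physical_id"] + "-new"
            self.backend[node["physical_id"]] = "ACTIVE"
        elif operation == "REBOOT":
            self.backend[node["physical_id"]] = "ACTIVE"
        else:
            raise ValueError("unsupported operation: %s" % operation)
        node["status"] = "ACTIVE"
        return node


def cluster_recover(profile, nodes, operation):
    """Check every node and recover the unhealthy ones; return their names."""
    recovered = []
    for node in nodes:
        if not profile.do_check(node):
            node["status"] = "ERROR"
            profile.do_recover(node, operation)
            recovered.append(node["name"])
    return recovered
```

The cluster-level helper never needs to know how a given profile type recovers a node, which is what keeps the design extensible; the operation name passed in would come from the attached health_policy.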

Gerrit topic: https://review.openstack.org/#q,topic:bp/support-health-management,n,z

Addressed by: https://review.openstack.org/261990
    Implement do_check method for nova profile

Addressed by: https://review.openstack.org/262933
    Implement node_recover in Profile

Addressed by: https://review.openstack.org/264568
    Add Recover into Node Actions and Node Model

Addressed by: https://review.openstack.org/264727
    Add Description about Recover of Profile

Addressed by: https://review.openstack.org/267419
    Add Recover as a Cluster Action

Addressed by: https://review.openstack.org/267922
    Add Recover into RPC API

Addressed by: https://review.openstack.org/267994
    Expose Check Function in Profile

Addressed by: https://review.openstack.org/270020
    Add Doc for Check and Recover Actions
