Cluster healing

Registered by Ricardo Rocha

Cluster nodes can become unresponsive or unusable because of lost connectivity, crashed daemons, hardware faults, and many other causes. When this happens, the container orchestration engine (COE) marks them as unusable or 'not ready' and reschedules workloads elsewhere, leaving the cluster at reduced capacity.

Magnum could handle cluster recovery (healing) by triggering node replacement or recovery whenever an issue is found. This could be done in two ways:
* triggered by the user, via an openstack coe cluster heal <cluster-id> command
* triggered by a periodic task that monitors the state of the clusters
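
The two trigger paths above can share a single healing routine. The following is only a minimal sketch under stated assumptions, not Magnum's implementation: the Cluster class, the heal and periodic_heal functions, the replace_node callback, and the "HEALTHY"/"UNHEALTHY" values are all hypothetical names chosen for illustration. Only the health_status and health_status_reason field names come from the reviews listed on the whiteboard.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    """Hypothetical, simplified view of a cluster; not Magnum's real model."""
    uuid: str
    # Maps node name -> True if the COE reports the node as ready.
    node_ready: Dict[str, bool] = field(default_factory=dict)
    # Field names follow the whiteboard reviews; the values are assumptions.
    health_status: str = "UNKNOWN"
    health_status_reason: Dict[str, str] = field(default_factory=dict)


def update_health(cluster: Cluster) -> List[str]:
    """Record the cluster health and return the nodes that are not ready."""
    not_ready = sorted(
        name for name, ready in cluster.node_ready.items() if not ready
    )
    cluster.health_status = "UNHEALTHY" if not_ready else "HEALTHY"
    cluster.health_status_reason = {name: "not ready" for name in not_ready}
    return not_ready


def heal(cluster: Cluster,
         replace_node: Callable[[str, str], None]) -> List[str]:
    """Replace every node that is not ready; shared by both trigger paths."""
    not_ready = update_health(cluster)
    for name in not_ready:
        replace_node(cluster.uuid, name)
        # Assumption: the replacement node comes up ready.
        cluster.node_ready[name] = True
    update_health(cluster)
    return not_ready


def periodic_heal(clusters: List[Cluster],
                  replace_node: Callable[[str, str], None]
                  ) -> Dict[str, List[str]]:
    """One pass of the periodic task over all clusters."""
    healed = {}
    for cluster in clusters:
        replaced = heal(cluster, replace_node)
        if replaced:
            healed[cluster.uuid] = replaced
    return healed
```

In this sketch a user-triggered heal would call heal() directly for one cluster, while the periodic task would call periodic_heal() on a timer. Passing replace_node in as a callback keeps the detection logic independent of however node replacement is actually carried out.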

Blueprint information

Status: Not started
Approver: Spyros Trigazis
Priority: Medium
Drafter: Ricardo Rocha
Direction: Approved
Assignee: Ricardo Rocha
Definition: Approved
Series goal: Accepted for rocky
Implementation: Unknown
Milestone target: rocky-final

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/cluster-healing,n,z

Addressed by: https://review.openstack.org/529897
    Add Cluster Healing specification

Addressed by: https://review.openstack.org/570818
    Add health_status and health_status_reason to cluster

Gerrit topic: https://review.openstack.org/#q,topic:story/2002742-24593,n,z

Addressed by: https://review.openstack.org/638319
    Add health_status and health_status_reason to cluster
