Magnum

Cluster healing

Registered by Ricardo Rocha on 2017-08-29

Cluster nodes can become unresponsive or unusable for many reasons, including lost connectivity, crashed daemons and hardware issues. When this happens, the COE marks them as unusable or 'not ready' and reschedules workloads elsewhere, leaving the cluster at reduced capacity.

Magnum could handle cluster recovery (healing) by triggering node replacement or recovery whenever an issue is found. This could be done in two ways:
* triggered by the user, via an openstack coe cluster heal <cluster-id> command
* triggered by a periodic task, monitoring the state of the clusters
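
The two triggers above could share one code path. The sketch below is only an illustration under stated assumptions, not Magnum's implementation: NodeState, heal_cluster, periodic_heal and the replace callbacks are hypothetical names, and node readiness is assumed to have already been read from the COE.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NodeState:
    """Hypothetical view of a node as reported by the COE."""
    name: str
    ready: bool


def heal_cluster(nodes: List[NodeState],
                 replace_node: Callable[[str], None]) -> List[str]:
    """User-triggered path: replace every not-ready node of one cluster.

    Returns the names of the nodes that were handed to replace_node.
    """
    unhealthy = [n.name for n in nodes if not n.ready]
    for name in unhealthy:
        replace_node(name)
    return unhealthy


def periodic_heal(clusters: Dict[str, List[NodeState]],
                  replace: Callable[[str, str], None]) -> Dict[str, List[str]]:
    """Periodic path: one monitoring pass over all clusters.

    Reuses heal_cluster and reports only clusters that needed healing.
    """
    report: Dict[str, List[str]] = {}
    for cluster_id, nodes in clusters.items():
        # The lambda is called inside heal_cluster during this iteration,
        # so it always sees the current cluster_id.
        healed = heal_cluster(nodes, lambda name: replace(cluster_id, name))
        if healed:
            report[cluster_id] = healed
    return report
```

In this sketch the user command would call heal_cluster for a single cluster, while the periodic task would call periodic_heal on a timer. How a node is actually replaced or recovered is left to the callback.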

Blueprint information

Status: Not started

Approver: Spyros Trigazis

Priority: Medium

Drafter: Ricardo Rocha

Direction: Approved

Assignee: Ricardo Rocha

Definition: Approved

Series goal: Accepted for rocky

Implementation: Unknown

Milestone target: rocky-final

Related branches

Related bugs

Sprints

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/cluster-healing,n,z

Addressed by: https://review.openstack.org/529897
Add Cluster Healing specification

Addressed by: https://review.openstack.org/570818
Add health_status and health_status_reason to cluster

Gerrit topic: https://review.openstack.org/#q,topic:story/2002742-24593,n,z

Addressed by: https://review.openstack.org/638319
Add health_status and health_status_reason to cluster


Work Items


Subscribers

Tim Bell