Add "restart" operation for Cluster

Registered by Drago

A restart operation would be useful for resetting the COE (container orchestration engine) setup when something goes wrong. Carina currently offers such an operation; unlike a Nova rebuild, it retains all of the volumes and containers on the hosts.

See also: https://etherpad.openstack.org/p/magnum-newton-midcycle-bay-lifecycle-1

Proposed change

Introduce Cluster Restart, a user-initiated lifecycle operation that reprovisions the COE service on all cluster nodes. This operation would leave the volumes and the user's application containers intact.

Implementation

1. Add a new endpoint, PATCH /clusters/{identifier}/actions/restart [2], which accepts no request body and returns 204 [1] with no response body.
2. Continue leveraging Heat by using Heat SoftwareDeployment (SD) resources. Introduce two SD resources into the cluster templates: one that installs the COE service and one that uninstalls it. Only the install SD resource will be configured to run during cluster creation (via the "actions" resource property). Both SDs will have an otherwise unused input linked to a template parameter, so they can be triggered by issuing a PATCH stack update with a new value for that parameter. This is the same mechanism that TripleO uses.
3. Add the software config agent to images that do not have it.
4. The implementation of the install/uninstall scripts will depend on the specific COE.
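
As a rough illustration of step 2, the sketch below builds the two SD resources as a Python dict that mirrors a Heat template fragment. The resource names, the restart_token parameter, the server reference, and the depends_on ordering are all hypothetical and not taken from any existing Magnum template. The referenced config resources are assumed to be defined elsewhere. Only the resource type and the "actions" and "input_values" properties come from Heat itself.

```python
import uuid


def build_restart_resources(server_ref="cluster_node"):
    """Return a template fragment with an install SD and an uninstall SD.

    All names here are illustrative assumptions, not real Magnum names.
    """
    # Both deployments take the same otherwise unused input, so changing
    # the parameter value makes Heat re-run them on a stack update.
    trigger_input = {"restart_token": {"get_param": "restart_token"}}
    return {
        "parameters": {
            "restart_token": {"type": "string", "default": ""},
        },
        "resources": {
            "coe_uninstall_deployment": {
                "type": "OS::Heat::SoftwareDeployment",
                "properties": {
                    "config": {"get_resource": "coe_uninstall_config"},
                    "server": {"get_resource": server_ref},
                    # UPDATE only: never runs during cluster creation.
                    "actions": ["UPDATE"],
                    "input_values": trigger_input,
                },
            },
            "coe_install_deployment": {
                "type": "OS::Heat::SoftwareDeployment",
                # Assumed ordering: reinstall only after the uninstall ran.
                "depends_on": ["coe_uninstall_deployment"],
                "properties": {
                    "config": {"get_resource": "coe_install_config"},
                    "server": {"get_resource": server_ref},
                    # Runs on creation and again on every restart.
                    "actions": ["CREATE", "UPDATE"],
                    "input_values": trigger_input,
                },
            },
        },
    }


def restart_update_parameters():
    """Return the parameters part of a PATCH stack update that triggers a restart."""
    # A fresh random value guarantees the parameter differs from the last one.
    return {"parameters": {"restart_token": uuid.uuid4().hex}}
```

Using a random token rather than a counter means the caller does not need to read the current parameter value before issuing the update.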

Concerns

The software config agent that lives on a node discovers that there is new software to be run by polling Heat, a Swift Temp URL, or a Zaqar queue.
- Having the agent poll too frequently can cause a lot of extra network traffic, especially if there are many clusters.
- Conversely, an agent that polls too infrequently can lead to an unacceptable user experience because the operation would take too long to complete.

Is there an acceptable polling interval?
- Using Swift Temp URLs may alleviate this because Swift should be able to handle higher traffic.
- Another option is to have the agent listen on a port and only check for an update when it receives a signal. This would allow the agent to respond to the request immediately and would eliminate unnecessary polling altogether. However, it raises a number of security concerns.

The current decision is to use a polling interval of 2 minutes, at least for the initial proof of concept, while this topic is still under discussion.
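
To make the trade-off concrete, here is a back-of-the-envelope estimate. It assumes one agent per node, evenly spread polls, and a restart request that arrives at a random point in the polling cycle. The node count in the example is made up for illustration.

```python
def polling_estimate(total_nodes, interval_seconds):
    """Estimate polling load and the delay before an agent notices new work."""
    return {
        # Aggregate request rate seen by Heat, Swift, or Zaqar.
        "requests_per_second": total_nodes / interval_seconds,
        # An agent that has just polled waits a full interval.
        "worst_case_delay_seconds": interval_seconds,
        # Average wait if the request arrives uniformly within the cycle.
        "mean_delay_seconds": interval_seconds / 2,
    }


# Hypothetical deployment: 1200 nodes across all clusters, 2-minute interval.
estimate = polling_estimate(total_nodes=1200, interval_seconds=120)
print(estimate)
```

With these numbers the polled service sees about 10 requests per second, and a restart begins on a given node after 60 seconds on average and 120 seconds at worst, before counting the time the install and uninstall scripts take to run.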

[1] (drago) I wrote 200 originally. I think 204 makes more sense.
[2] (drago) It was decided during the midcycle to use PATCH instead of PUT for these operations.
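
Putting notes [1] and [2] together with the proposed endpoint, a client call might be built as in the sketch below. The API root and token values are placeholders, and the helper names are hypothetical. The request is only constructed, not sent.

```python
import urllib.request


def build_restart_request(api_root, cluster_ident, token):
    """Build a PATCH request with no body for the proposed restart endpoint."""
    url = f"{api_root.rstrip('/')}/clusters/{cluster_ident}/actions/restart"
    # No data argument: the proposed endpoint does not accept a request body.
    request = urllib.request.Request(url, method="PATCH")
    request.add_header("X-Auth-Token", token)
    return request


def restart_accepted(status_code):
    """The proposal returns 204 with no response body on success."""
    return status_code == 204


# Placeholder values for illustration only.
req = build_restart_request("http://magnum.example.com/v1", "my-cluster", "TOKEN")
print(req.get_method(), req.full_url)
```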

Blueprint information

Status: Started
Approver: hongbin
Priority: Undefined
Drafter: Drago
Direction: Approved
Assignee: Drago
Definition: New
Series goal: None
Implementation: Good progress
Milestone target: None
Started by: Drago

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/cluster-restart-operation,n,z

Addressed by: https://review.openstack.org/368981
    [WIP] Cluster restart lifecycle operation for k8s
