Add "rebuild" operation for Bay

Registered by Jamie Hannaford

This blueprint has been superseded. See the newer blueprint "Add "restart" operation for Cluster" for updated plans.

A rebuild operation would make it possible to reset the COE setup when something goes wrong. Carina currently offers this; unlike a Nova rebuild, a Carina rebuild retains all of the volumes and containers on the hosts.

Proposed change

Introduce Bay Rebuild, a user-initiated lifecycle operation that reprovisions the COE service on all bay nodes. This operation would leave the volumes and user's application containers intact.

Implementation

1. Add a new endpoint, PUT /bays/{identifier}/actions/rebuild, that does not accept a request body and returns 204 [1] with no response body.
2. Continue leveraging Heat by using Heat SoftwareDeployment (SD). Introduce two SD resources into the bay templates: one that installs the COE service and one that uninstalls it. Only the install SD resource will be configured to run during bay creation (via the "actions" resource property). Both SDs will have an otherwise unused input linked to a template parameter, so they can be triggered by issuing a PATCH stack update with a new value for that parameter. This is the same mechanism that TripleO uses.
3. Add the software config agent to images that do not have it.
4. The implementation of the install/uninstall scripts will depend on the specific COE.
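
As a rough illustration of step 2, the conductor only needs to change one parameter value to trigger the SDs. The sketch below builds the arguments for such an update. The parameter name rebuild_trigger and the function name are hypothetical, and the "existing" flag is assumed to correspond to a PATCH-style update that reuses the stored template and the other parameter values; none of this is taken from the actual Magnum code.

```python
import uuid


def build_rebuild_update(trigger_param="rebuild_trigger"):
    """Build arguments for a PATCH-style stack update (illustrative only).

    trigger_param is a hypothetical template parameter that both SD
    resources take as an otherwise unused input.
    """
    # A fresh random value guarantees the parameter differs from the
    # previous one, so the linked deployments see a changed input.
    new_value = uuid.uuid4().hex
    return {
        # Assumption: reuse the stored template and all other parameters.
        "existing": True,
        "parameters": {trigger_param: new_value},
    }


if __name__ == "__main__":
    print(build_rebuild_update())
```

Using a random value rather than a counter avoids having to read the current parameter value before issuing the update.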

Concerns

The software config agent that lives on a node discovers that there is new software to be run by polling Heat, a Swift Temp URL, or a Zaqar queue.
- Having the agent poll too frequently can cause a lot of extra network traffic, especially if there are many bays.
- Conversely, an agent that polls too infrequently can lead to an unacceptable user experience, because the operation would take too long to complete.

Is there an acceptable polling interval?
- Using Swift Temp URLs may alleviate this because Swift should be able to handle higher traffic.
- Another option is to have the agent listen on a port and only check for an update when it receives a signal. This would allow the request to be handled immediately and eliminate unnecessary polling altogether. However, this raises a lot of security concerns.

The current decision is to use a polling interval of 2 minutes, at least for the initial POC while this topic can be discussed.
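
To make the trade-off concrete, the following back-of-the-envelope sketch estimates the aggregate polling load and the delay before an agent notices a rebuild request. It assumes one request per poll per node and polls spread evenly over the interval. The node count is an arbitrary example, not a measured figure, and the delay covers only the time until the agent sees the new deployment, not the time the rebuild itself takes.

```python
def polling_load(total_nodes, interval_seconds):
    """Estimate polling cost for a given node count and polling interval.

    Returns (requests per second across all nodes,
             average seconds before an agent notices a change,
             worst-case seconds before an agent notices a change).
    """
    requests_per_second = total_nodes / interval_seconds
    average_wait = interval_seconds / 2
    worst_wait = interval_seconds
    return requests_per_second, average_wait, worst_wait


if __name__ == "__main__":
    # Example: 1000 nodes in total, polling every 2 minutes.
    rate, avg, worst = polling_load(1000, 120)
    print(f"{rate:.2f} req/s, avg wait {avg:.0f} s, worst wait {worst} s")
```

Halving the interval doubles the request rate while halving both waits, which is the tension the two bullets under the first paragraph of this section describe.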

[1] (drago) I wrote 200 originally. I think 204 makes more sense.

Blueprint information

Status:
Complete
Approver:
hongbin
Priority:
Undefined
Drafter:
Jamie Hannaford
Direction:
Approved
Assignee:
Drago
Definition:
Superseded
Series goal:
None
Implementation:
Started
Milestone target:
None
Started by
Drago
Completed by
hongbin

Whiteboard

thomasem: Wondering about how Heat would support something like this - a partial re-provision?

Gerrit topic: https://review.openstack.org/#q,topic:bp/bay-rebuild-operation,n,z

Addressed by: https://review.openstack.org/368981
    [WIP] Cluster restart lifecycle operation for k8s

Work Items

Make sure all images have the software config agent installed: TODO
Convert the bay templates to use SoftwareDeployment: TODO
Move the COE service install script into its own SD and add a SD with the uninstall script: TODO
Add the endpoint to the Magnum API, and add conductor code that will trigger the stack update: TODO
