Make compute nodes 'kill' friendly

Registered by Brian Elliott

Compute nodes run long-running operations, such as resize. Because of this, we cannot gracefully stop a node without taking its control plane offline for a long period of time. If we make compute nodes safe to 'kill', operators can simply bounce the service.

One way to accomplish this is to tag long-running operations as 'resumable'. Each such operation would have a journal that breaks the process down into steps, and each step would have a forward operation and a cleanup operation.

As a compute node process starts up, it would:
1) Check the journal to determine which operations are in progress.
2) Perform any necessary step cleanup operations.
3) Resume processing of the operations in progress.
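
The startup sequence above can be sketched roughly as follows. This is only an illustration of the journal idea, not code from Nova: the Journal class, run_operation, recover, and the JSON file layout are all hypothetical names and choices made for this example. It assumes each step's cleanup undoes a partially applied forward operation, so the interrupted step can safely be run again.

```python
import json
import os


class Journal:
    """Hypothetical file-backed journal: operation id -> index of the step in progress."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def _save(self, entries):
        # Write to a temp file, then rename, so a kill mid-write
        # cannot leave a half-written journal behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def record(self, op_id, step_index):
        entries = self.load()
        entries[op_id] = step_index
        self._save(entries)

    def finish(self, op_id):
        entries = self.load()
        entries.pop(op_id, None)
        self._save(entries)


def run_operation(journal, op_id, steps, start=0):
    """Run steps (a list of (forward, cleanup) callables) from index `start`."""
    for index in range(start, len(steps)):
        journal.record(op_id, index)  # journal the step before acting on it
        forward, _cleanup = steps[index]
        forward()
    journal.finish(op_id)


def recover(journal, registry):
    """Startup routine; `registry` maps operation id -> its list of steps."""
    for op_id, index in journal.load().items():  # 1) find operations in progress
        steps = registry[op_id]
        _forward, cleanup = steps[index]
        cleanup()                                 # 2) clean up the interrupted step
        run_operation(journal, op_id, steps, start=index)  # 3) resume
```

Recording the step index before running the forward operation means a kill at any point leaves the journal pointing at the step that may be half done, which is the step whose cleanup must run on restart.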

Blueprint information

Status:
Complete
Approver:
None
Priority:
Undefined
Drafter:
None
Direction:
Needs approval
Assignee:
None
Definition:
Superseded
Series goal:
None
Implementation:
Unknown
Milestone target:
None
Completed by
Russell Bryant

Related branches

Sprints

Whiteboard

Per summit discussion, going to proceed with the graceful-shutdown blueprint for the Havana cycle.

https://blueprints.launchpad.net/nova/+spec/graceful-shutdown

In addition to the graceful-shutdown work, there is work going on to centralize the processing of long-running tasks (such as the various forms of migration) in nova-conductor. Once all of that is done, the solution to this problem will look different from what is described here. It will be a matter of ensuring that the conductor reliably tracks state and can gracefully handle what happens if it has to stop in the middle, or if something fails in the middle. --russellb


Work Items
