Make compute nodes 'kill' friendly
Compute nodes have long-running processes, such as resize. As such, we cannot gracefully stop a node without taking its control plane offline for a long period of time. If we make compute nodes safe to 'kill', then operators can simply bounce the service.
One way to accomplish this is to tag long-running operations as being 'resumable'. Such operations would have a journal that breaks the process down into steps. Each step would have a forward operation and a cleanup operation.
As a compute node process starts up, it would:
1) Check the journal to determine which operations are in progress.
2) Perform any necessary step cleanup operations.
3) Resume processing of operations in progress.
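The startup sequence above can be sketched roughly as follows. This is only an illustration of the journal idea, not Nova code: the names Step, Journal, run_operation and recover, the JSON-lines journal format, and the state strings are all assumptions made for the example. It also assumes that a step's cleanup operation returns the node to the state it was in before that step started, so the step can safely be retried.

```python
import json
import os
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One step of a resumable operation (hypothetical structure)."""
    name: str
    forward: Callable[[], None]   # does the work of the step
    cleanup: Callable[[], None]   # undoes a partially completed step


class Journal:
    """Append-only journal, one JSON record per line (assumed format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def record(self, op_id: str, step: int, state: str) -> None:
        # fsync so the record survives the process being killed.
        with open(self.path, "a") as f:
            f.write(json.dumps({"op": op_id, "step": step, "state": state}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def in_progress(self) -> Dict[str, Tuple[int, str]]:
        """Return the last (step, state) of every unfinished operation."""
        latest: Dict[str, Tuple[int, str]] = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    latest[entry["op"]] = (entry["step"], entry["state"])
        return {op: v for op, v in latest.items() if v[1] != "op_done"}


def run_operation(journal: Journal, op_id: str, steps: List[Step], start: int = 0) -> None:
    """Run steps from index `start`, journaling before and after each one."""
    for i in range(start, len(steps)):
        journal.record(op_id, i, "started")
        steps[i].forward()
        journal.record(op_id, i, "done")
    journal.record(op_id, len(steps), "op_done")


def recover(journal: Journal, operations: Dict[str, List[Step]]) -> None:
    """On startup: find unfinished operations, clean up, then resume."""
    for op_id, (index, state) in journal.in_progress().items():
        steps = operations[op_id]
        if state == "started":
            # The step was interrupted: undo its partial work and retry it.
            steps[index].cleanup()
            resume_at = index
        else:
            # The step finished cleanly: continue with the next one.
            resume_at = index + 1
        run_operation(journal, op_id, steps, resume_at)
```

In this sketch, a service would call recover() once at startup, passing a mapping from operation ID to its list of steps, before accepting new work. How the step list for a given operation is rebuilt after a restart is left open here.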
Blueprint information
- Status: Complete
- Approver: None
- Priority: Undefined
- Drafter: None
- Direction: Needs approval
- Assignee: None
- Definition: Superseded
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by:
- Completed by: Russell Bryant
Whiteboard
Per summit discussion, going to proceed with the graceful-shutdown blueprint for the Havana cycle.
https:/
In addition to the graceful-shutdown work, there is work going on to centralize the processing of long-running tasks (such as the various forms of migrations) in nova-conductor. After all of that, solving this problem will look different from what is described here. It will be a matter of ensuring that the conductor reliably tracks state and can gracefully handle what happens if it has to stop in the middle, or if something fails in the middle. --russellb