OpenStack Compute (nova)

Make compute nodes 'kill' friendly

Registered by Brian Elliott on 2013-03-12

Compute nodes run long-running operations, such as resize. As a result, we cannot gracefully stop a node without taking its control plane offline for a long period of time while those operations finish. If we make compute nodes safe to 'kill', operators can simply bounce the service.

One way to accomplish this is to tag long-running operations as being 'resumable'. Each such operation would have a journal that breaks it down into steps. Each step would have a forward operation and a cleanup operation.

As a compute node process starts up, it would:
1) Check the journal to determine which operations are in progress.
2) Perform any necessary step cleanup operations.
3) Resume processing of the operations in progress.
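
The startup sequence above could be sketched roughly as follows. This is only an illustration of the journal idea, not code from nova: the names Step, Journal, run_operation and recover, the JSON file used as the journal, and the registry that maps an operation id to its steps are all hypothetical. It also assumes that each cleanup operation leaves its step safe to run again from the start.

```python
import json
import os
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One journaled step (hypothetical structure, not a nova class)."""
    name: str
    forward: Callable[[], None]  # performs the work for this step
    cleanup: Callable[[], None]  # undoes a partially applied forward()


class Journal:
    """Hypothetical file-backed journal: operation id -> progress record."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> Dict[str, dict]:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def _save(self, records: Dict[str, dict]) -> None:
        # Write to a temporary file and rename it into place, so that a
        # kill in the middle of a write cannot leave a half-written journal.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(records, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def record(self, op_id: str, next_step: int, in_step: bool) -> None:
        records = self.load()
        records[op_id] = {"next_step": next_step, "in_step": in_step}
        self._save(records)

    def finish(self, op_id: str) -> None:
        records = self.load()
        records.pop(op_id, None)
        self._save(records)


def run_operation(journal: Journal, op_id: str, steps: List[Step],
                  start: int = 0) -> None:
    """Run steps[start:], journaling before and after each forward()."""
    for i in range(start, len(steps)):
        journal.record(op_id, i, True)       # step i has started
        steps[i].forward()
        journal.record(op_id, i + 1, False)  # step i has completed
    journal.finish(op_id)


def recover(journal: Journal, registry: Dict[str, List[Step]]) -> List[str]:
    """Startup path: check the journal, clean up, then resume."""
    resumed = []
    for op_id, rec in journal.load().items():  # 1) operations in progress
        steps = registry[op_id]
        i = rec["next_step"]
        if rec["in_step"]:
            steps[i].cleanup()                 # 2) clean up interrupted step
        run_operation(journal, op_id, steps, start=i)  # 3) resume
        resumed.append(op_id)
    return resumed
```

In this sketch the journal entry is written before a step's forward operation runs, so a process killed mid-step always finds that step marked as started on the next startup and knows that its cleanup must run before the step is retried.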

Blueprint information

Status: Complete

Approver: None

Priority: Undefined

Drafter: None

Direction: Needs approval

Assignee: None

Definition: Superseded

Series goal: None

Implementation: Unknown

Milestone target: None

Completed by: Russell Bryant on 2013-05-02

Related branches

Related bugs

Sprints

Whiteboard

Per summit discussion, going to proceed with the graceful-shutdown blueprint for the Havana cycle.

https://blueprints.launchpad.net/nova/+spec/graceful-shutdown

In addition to the graceful-shutdown work, there is work going on to centralize the processing of long-running tasks (such as the various forms of migration) in nova-conductor. After all of that, solving this problem will look different from what is described here. It will be a matter of ensuring that the conductor reliably tracks state and can gracefully handle what happens if it has to stop in the middle, or if something fails in the middle. --russellb


Work Items


Subscribers

QiangGuan

Qiu Yu

Tiantian Gao