Snapshot state consistency between glance and nova

Registered by Joshua Harlow on 2013-11-25

To make forward progress on the summit discussion at https://etherpad.openstack.org/p/icehouse-summit-image-state-consistency, we propose moving the snapshot code that currently lives in the nova-compute manager into the conductor. The conductor would execute the snapshot workflow on behalf of the nova-compute that hosts the VM being snapshotted (and uploaded to glance). This allows the snapshot to be reliably recovered or resumed, so that the snapshotted VM ends up in an agreed-upon, well-defined state (not ERROR or IMAGE-UPLOADING). It also helps avoid the state inconsistency that occurs when an upload only partially completes because of a service outage or network partition, making the interaction between glance and nova a reliable one.

This will likely involve the following steps:

0. Document and understand the current workflow and its deficiencies.

1. Move the orchestration of the snapshot workflow to the conductor (reducing what remains in nova-compute to a smaller set). Handle, in the conductor, both the existing error states and the new ones that result from this modified workflow.

2. Once the basic workflow runs in the conductor, support detection of stalled or errored snapshot uploads into glance by having the nova<->glance interaction go through a better-defined workflow state machine. For the snapshot happy path this will likely involve a sequence of states such as [LOCAL_SNAPSHOT_STARTED, LOCAL_SNAPSHOT_COMPLETE, UPLOAD_BEGIN, UPLOAD_%s_COMPLETE, UPLOAD_COMPLETE, IMAGE_ACTIVE].

2a. For the error path there needs to be a mechanism, queryable via glance or nova, that tells the user which stage of the snapshot process nova is in. If the conductor processing the workflow has stalled, it would be useful for glance to know this via some type of 'last state change' timestamp (which can also tell the user when the last state change occurred). If the conductor has not stalled, then 'UPLOAD_%s_COMPLETE' (this may be a new state, a status of an existing state, or something else entirely), which would carry the percentage of the upload completed so far, would be useful for showing clients that the upload is not yet complete.
2aa. In general, this kind of 'liveness' detection would be better handled by some type of shared liveness storage system (for example, a shared, agreed-upon path in zookeeper that can be used to detect whether nova has died during an upload), but a percent complete (and an associated last state change timestamp?) together with the request connection/socket itself is a good start.

3. Run the above snapshot workflow in the conductor via taskflow, which brings resumption, recovery, state tracking and various other benefits.
  - https://blueprints.launchpad.net/nova/+spec/glance-snapshot-tasks-taskflow
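
To make steps 2 and 2a more concrete, here is a minimal Python sketch of how the happy-path states, the percent-complete status and the 'last state change' timestamp could fit together. It is an illustration only, not nova or glance code: the names SnapshotState, SnapshotTracker, InvalidTransition, UPLOAD_IN_PROGRESS (standing in for 'UPLOAD_%s_COMPLETE') and the timeout-based stall check are all assumptions made for this example.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class SnapshotState(Enum):
    """Happy-path states from step 2 (names taken from the blueprint)."""
    LOCAL_SNAPSHOT_STARTED = 1
    LOCAL_SNAPSHOT_COMPLETE = 2
    UPLOAD_BEGIN = 3
    UPLOAD_IN_PROGRESS = 4  # hypothetical stand-in for UPLOAD_%s_COMPLETE
    UPLOAD_COMPLETE = 5
    IMAGE_ACTIVE = 6


S = SnapshotState

# Allowed transitions; anything else is rejected (assumed policy).
_NEXT = {
    S.LOCAL_SNAPSHOT_STARTED: {S.LOCAL_SNAPSHOT_COMPLETE},
    S.LOCAL_SNAPSHOT_COMPLETE: {S.UPLOAD_BEGIN},
    S.UPLOAD_BEGIN: {S.UPLOAD_IN_PROGRESS, S.UPLOAD_COMPLETE},
    S.UPLOAD_IN_PROGRESS: {S.UPLOAD_IN_PROGRESS, S.UPLOAD_COMPLETE},
    S.UPLOAD_COMPLETE: {S.IMAGE_ACTIVE},
    S.IMAGE_ACTIVE: set(),
}


class InvalidTransition(Exception):
    """Raised when a state change skips or reverses the workflow."""


@dataclass
class SnapshotTracker:
    """Tracks one snapshot's state, upload percentage and last change time."""
    state: SnapshotState = S.LOCAL_SNAPSHOT_STARTED
    percent_complete: int = 0
    last_change: float = field(default_factory=time.time)

    def advance(self, new_state, percent=None, now=None):
        """Move to new_state and refresh the 'last state change' timestamp."""
        if new_state not in _NEXT[self.state]:
            raise InvalidTransition(
                "%s -> %s" % (self.state.name, new_state.name))
        self.state = new_state
        if percent is not None:
            self.percent_complete = percent
        if new_state in (S.UPLOAD_COMPLETE, S.IMAGE_ACTIVE):
            self.percent_complete = 100
        self.last_change = time.time() if now is None else now

    def label(self):
        """Client-visible status, e.g. 'UPLOAD_40_COMPLETE'."""
        if self.state is S.UPLOAD_IN_PROGRESS:
            return "UPLOAD_%s_COMPLETE" % self.percent_complete
        return self.state.name

    def is_stalled(self, timeout, now=None):
        """True if unfinished and no state change for more than timeout seconds."""
        now = time.time() if now is None else now
        return (self.state is not S.IMAGE_ACTIVE
                and (now - self.last_change) > timeout)
```

A caller (for example, whatever component answers status queries) could call is_stalled() with a deployment-specific timeout to distinguish a slow upload from a dead conductor. A shared liveness store, as suggested in 2aa, would replace the timestamp comparison but could keep the same state labels.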

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: Joshua Harlow
Direction: Needs approval
Assignee: Alexander Gorodnev
Definition: Drafting
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

Deferred to icehouse-3 as the blueprint was not approved by the icehouse-2 blueprint approval deadline. --russellb

Gerrit topic: https://review.openstack.org/#q,topic:bp/glance-snapshot-tasks,n,z

Addressed by: https://review.openstack.org/60492
    Add snapshot functionality to Conductor API

Addressed by: https://review.openstack.org/63735
    Split snapshot method in libvirt driver

Removed from next, as next is now reserved for near misses from the last milestone --johnthetubaguy

Marking this blueprint as definition: Drafting. If you are still working on this, please re-submit via nova-specs. If not, please mark as obsolete, and add a quick comment to describe why. --johnthetubaguy (20th April 2014)
