Recover from "stuck" states on compute manager start-up

Registered by David McNally on 2013-11-14

If a compute manager is stopped / fails during certain operations then the instance will be left stuck with a transitional task_state. Ideally during compute manager start-up we would identify instances in these states and transition them to a logical stable state.

Blueprint information

Status:
Started
Approver:
Dan Smith
Priority:
Medium
Drafter:
None
Direction:
Needs approval
Assignee:
David McNally
Definition:
Review
Series goal:
None
Implementation:
Needs Code Review
Milestone target:
next
Started by
Dan Smith on 2013-11-14

Whiteboard

Moved to -next as this has two -2s and no responses --dansmith

Sponsors: John Garbutt and Dan Smith

(taken from https://etherpad.openstack.org/p/NovaCleaningUpStuckInstances):

Cleaning up "Stuck" instance state

What do you mean by "Stuck" ?
"Stuck" state in this context occurs when an action fails to complete in the compute manager.
Typically seen on failure / restart

Why do you care ?
In some cases, as state gates actions, it stops you from being able to move forwards
Relying on the user to clean up is a real pain when you want to migrate an instance
It's confusing for the users (which means we have to spend time diagnosing and helping to fix it)

Isn't this all going to be fixed by the task manager / clean-shutdown ?
Probably - but there are some even quicker wins that also help towards that, and some issues that
are also going to be relevant to task manager.

Basic Premise: The one time you know there is no running thread in the compute manager is during start-up.
At that point there are some task states that can be safely cleared / re-processed. The tricky thing is to
disambiguate between an action which has started and failed to complete, and an action which is actually still
on the message queue (given that the compute manager may have been down for some time)

A bit of history:
We tried to address all of these and disambiguate the "still queued" case by recording the task_state seen on the compute manager at the
start of the action, but that was (rightly) blocked because it involved more DB access and is going to be fixed by task manager.
We are now re-working some easier cases that don't need the disambiguation.
https://review.openstack.org/#/c/47836/

Easy cases:
Deleting: It's always safe to go ahead and rerun the delete.

Building: Can always be put into an error state. If the message was still on the queue, instance.host won't have been set.

Image_pending_upload / Image_uploading: Can be cleared - these are only set in the compute manager.

Powering Off: re-run the power off. If the VM is already off, or the request is in the queue this is a no-op.

Powering On: re-run the power on. If the VM is already on, or the request is in the queue this is a no-op.

All accepted as worth doing - submit as separate patches
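
The easy cases above can be read as a per-task_state recovery rule applied once at start-up. The following is only a minimal sketch under stated assumptions: the Instance class, the lower-case state strings and the returned action names are hypothetical stand-ins for illustration, not Nova's actual objects or constants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record (not Nova's object)."""
    uuid: str
    host: Optional[str]
    task_state: Optional[str]
    vm_state: str = "active"


def recover_easy_case(instance: Instance, this_host: str) -> str:
    """Apply the easy-case rule and return the name of the action chosen.

    The action names are illustrative labels only.
    """
    state = instance.task_state
    if state == "deleting":
        # Always safe to run the delete again.
        return "rerun_delete"
    if state == "building":
        # A request still on the queue has no host set yet, so only
        # instances already assigned to this host are put into error.
        if instance.host != this_host:
            return "leave_queued"
        instance.vm_state = "error"
        instance.task_state = None
        return "set_error"
    if state in ("image_pending_upload", "image_uploading"):
        # Only ever set on the compute manager, so safe to clear.
        instance.task_state = None
        return "clear_task_state"
    if state == "powering_off":
        # No-op if the VM is already off or the request is still queued.
        return "rerun_power_off"
    if state == "powering_on":
        return "rerun_power_on"
    return "skip"
```

Keeping each state in its own branch mirrors the decision to submit the cases as separate patches.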

Harder cases:
Image_snapshot: (Set in API) - could be cleared on start-up and re-asserted on the compute manager at the start
of snapshot to cover the case of a still queued request

Rebooting:
    If the VM isn't running - reboot it (risk is a second reboot)
    If the VM is running - just clear the status (risk is that the user needs to request another reboot)

Accepted to add additional task_state value to be set on compute manager to disambiguate the queued vs started case
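
As a rough illustration of the harder cases, the sketch below encodes the snapshot and reboot rules as plain functions. The function names, state strings and returned action labels are assumptions made for this example; the actual patches may structure this differently.

```python
from typing import Optional


def recover_harder_case(task_state: Optional[str], vm_is_running: bool) -> str:
    """Return an illustrative action label for start-up recovery."""
    if task_state == "image_snapshot":
        # Set in the API, so a queued request may still exist; clearing
        # is safe only because the snapshot code re-asserts the state.
        return "clear_task_state"
    if task_state == "rebooting":
        if not vm_is_running:
            # Risk: a still-queued request causes a second reboot.
            return "rerun_reboot"
        # Risk: the reboot never happened and the user must ask again.
        return "clear_task_state"
    return "skip"


def begin_snapshot(current_task_state: Optional[str]) -> str:
    """Re-assert the snapshot task state at the start of the snapshot.

    This covers a request that was still queued when start-up recovery
    cleared the state; the current value is deliberately ignored.
    """
    return "image_snapshot"
```

Both reboot branches carry a residual risk, which is why an additional task_state set on the compute manager was accepted as the way to tell a queued request from a started one.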

Even harder:
Rebuilding: Would be nice to be able to treat this like Building and go to an error state, but we can't use instance.host to
disambiguate. We could do something here if we add an extra task state (Rebuild_started) that is set immediately on the
compute manager. Could use the same approach to remove the risk of missed / additional reboots.

As above
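
A minimal sketch of how an extra "started" state could separate a queued rebuild from an interrupted one. It assumes the existing rebuilding state is set before the request reaches the compute manager and that Rebuild_started is set first thing on the compute manager; the constant values and action labels are hypothetical.

```python
from typing import Optional

REBUILDING = "rebuilding"            # assumed: set before the request is picked up
REBUILD_STARTED = "rebuild_started"  # proposed: set immediately on the compute manager


def classify_rebuild(task_state: Optional[str]) -> str:
    """Decide what start-up recovery should do with a rebuild."""
    if task_state == REBUILD_STARTED:
        # The compute manager had begun work and was interrupted, so
        # the instance can be treated like Building and put into error.
        return "set_error"
    if task_state == REBUILDING:
        # Never picked up: the request may still be on the message
        # queue, so leave it for normal processing.
        return "leave_queued"
    return "skip"
```

The same two-state split could be applied to reboot to remove the risk of missed or additional reboots.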

Gerrit topic: https://review.openstack.org/#q,topic:bp/recover-stuck-state,n,z

Addressed by: https://review.openstack.org/56223
    Recover from IMAGE-* state on compute manager start-up

Gerrit topic: https://review.openstack.org/#q,topic:bug/1197024,n,z

Addressed by: https://review.openstack.org/47836
    Recover from build state on compute manager start-up

Addressed by: https://review.openstack.org/56272
    Make compute manager _init_instance use native objects

Addressed by: https://review.openstack.org/57967
    Recover from REBOOT-* state on compute manager start-up

[parthipan] how about task_state 'migrating'?

Still depends on https://review.openstack.org/#/c/57967/

Gerrit topic: https://review.openstack.org/#q,topic:bug/1247174,n,z

Addressed by: https://review.openstack.org/55660
    Cleanup 'deleting' instances on restart

Addressed by: https://review.openstack.org/62038
    Recover from POWERING-* state on compute manager start-up

Addressed by: https://review.openstack.org/63170
    Clean IMAGE_SNAPSHOT_PENDING state on compute manager start up

===============
Remaining patches
===============

Addressed by: https://review.openstack.org/62038
    Recover from POWERING-* state on compute manager start-up

Addressed by: https://review.openstack.org/57967
    Recover from REBOOT-* state on compute manager start-up

I would love this to merge, since it's so close; promoting to high --johnthetubaguy

Unapproved - please re-submit via nova-specs --johnthetubaguy (20th March 2014)

Addressed by: https://review.openstack.org/176234
    Recover from POWERING-* state on compute manager start-up
