Automatically recover instances stuck in the DELETING state

Registered by Radomir Dopieralski

Periodically retry the deletion of instances that have been stuck in the DELETING state for too long, instead of only on compute service restart.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
Radomir Dopieralski
Direction:
Needs approval
Assignee:
Radomir Dopieralski
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

When a transient communication error happens while an instance is being deleted, it will often get stuck in the DELETING state. At that point it is not possible, as a user, to retry the delete command or otherwise recover the instance. Resetting the instance state and/or restarting the compute service is required.

A restart of the compute service helps because there is code there that explicitly looks for all instances in the DELETING state and retries their deletion.
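
For context, that startup-time recovery works roughly like the sketch below. This is a simplification, and the helper names (`_get_instances_on_host`, `_retry_delete`) are placeholders rather than the actual Nova compute manager API:

    # Simplified sketch of the recovery that runs when the compute
    # service starts; helper names are placeholders, not Nova's API.
    def init_host(self):
        for instance in self._get_instances_on_host():
            if instance.task_state == 'deleting':
                # The delete was interrupted, so just retry it.
                self._retry_delete(instance)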

We also have similar code in the resource tracker that handles instances in the DELETED state and makes sure their corresponding VMs are not left running on the host. This code is run periodically and checks for instances in the DELETED state that were deleted long enough ago but whose VMs still exist.

I want to propose adding a similar periodic check for the DELETING status, together with recording in the system metadata when the status was changed to DELETING. This way we can tell that a given instance has been stuck for a long time and retry its deletion, without having to worry about messages still sitting in the rabbit queue or pending tasks.
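
A minimal sketch of what the proposed periodic task could look like. The `deleting_at` system metadata key and the timeout value are hypothetical and part of this proposal, not anything that exists in Nova today, and the helper methods are placeholders:

    import time

    # Hypothetical setting: how long an instance may sit in DELETING
    # before we consider it stuck and retry (seconds).
    DELETING_INSTANCE_TIMEOUT = 1800

    def _cleanup_stuck_deleting_instances(self, context):
        """Periodic task: retry deletes that have been stuck too long."""
        for instance in self._get_instances_on_host(context):
            if instance.task_state != 'deleting':
                continue
            deleting_at = instance.system_metadata.get('deleting_at')
            if deleting_at is None:
                # Instance entered DELETING before this feature existed;
                # there is nothing to compare against.
                continue
            if time.time() - float(deleting_at) > DELETING_INSTANCE_TIMEOUT:
                # Long past anything that could still be sitting in the
                # rabbit queue, so it is safe to retry from scratch.
                self._retry_delete(context, instance)

Storing the timestamp in system metadata rather than a new database column keeps the change small, at the cost of string-typed values that need converting back to a number.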


Work Items

Work items:
add code that saves the time when an instance is put into the DELETING state into system metadata (see the sketch after this list): TODO
refactor the code that does the cleanup at startup into a separate function: TODO
call that function from the resource tracker periodic task: TODO
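
The first work item could look roughly like this; `deleting_at` is the same hypothetical system metadata key used in the sketch above:

    import time

    def _set_deleting_state(self, instance):
        # Record when the instance entered DELETING, so the periodic
        # task can later tell how long it has been stuck.
        instance.task_state = 'deleting'
        instance.system_metadata['deleting_at'] = str(time.time())
        instance.save()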
