OpenStack DevOps

assure instance states are correct

Registered by Trey Morris on 2011-01-10

I'm wary of the way we currently handle instance states. Instance states are only set or updated as the result of actions taken through the API. Suppose the state of an instance changes because of something that happens outside the API (the host goes down, the host needs to be updated or rebooted, a hack, the user shuts down the instance manually, etc.). The database would never be updated to reflect the change and would hold stale data. Additionally, from a service provider's point of view, we should know that a customer's instance has gone down before they send us an email about it. A mechanism needs to be in place to ensure that this data stays up to date.

My proposition:
Compute nodes keep track of the states of the instances they host. Every so often, the compute node requests status from the hypervisor and compares the reported states to what it has stored. If there is a conflict, it passes this information to a state arbiter of some sort, possibly the API, which then determines what needs to be done. If an instance that is supposed to be running isn't actually running, the arbiter could either update the database to reflect this or attempt to start the instance to return it to the state it is supposed to be in (I'm not sure which makes more sense at this point).
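
To make the polling step concrete, here is a minimal sketch of one poll cycle. Everything in it is hypothetical and for illustration only: the names `Conflict`, `find_conflicts`, `poll_once`, `query_hypervisor` and `report_to_arbiter`, the plain-string state values, and the "missing" placeholder for an instance the hypervisor does not report are all assumptions, not existing nova code.

```python
from typing import Callable, Dict, List, NamedTuple


class Conflict(NamedTuple):
    """One disagreement between stored and reported state (hypothetical type)."""
    instance_id: str
    expected: str  # state the compute node has stored
    actual: str    # state the hypervisor reported


def find_conflicts(stored: Dict[str, str],
                   reported: Dict[str, str]) -> List[Conflict]:
    """Compare stored states against hypervisor-reported states."""
    conflicts = []
    for instance_id, expected in sorted(stored.items()):
        # Assumption: an instance the hypervisor does not list is "missing".
        actual = reported.get(instance_id, "missing")
        if actual != expected:
            conflicts.append(Conflict(instance_id, expected, actual))
    return conflicts


def poll_once(stored: Dict[str, str],
              query_hypervisor: Callable[[], Dict[str, str]],
              report_to_arbiter: Callable[[Conflict], None]) -> List[Conflict]:
    """Run one poll cycle and pass every conflict up to the arbiter.

    The compute node itself takes no corrective action here; it only reports.
    """
    conflicts = find_conflicts(stored, query_hypervisor())
    for conflict in conflicts:
        report_to_arbiter(conflict)
    return conflicts
```

A compute node would call something like `poll_once` on a timer. How often to poll, and how the report reaches the arbiter (a queue message, an API call), are left open here, as they are in the proposal.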

For this to work, hypervisors should not take any action unless that action is specified by the compute node. For example, if a hypervisor host gets rebooted, it should not attempt to start any instances when it boots, because that would be an action the compute node did not specify. Instead, when the host reboots, the compute node would poll for status, find a conflict, and pass this information up to the arbiter. The arbiter would then handle the situation, either attempting to return instances to their prior states or updating the database to show them as not running (again, whichever makes more sense). Maybe it makes sense to perform certain actions for certain states and different actions for others: a paused instance that is supposed to be running could be unpaused, but it may not make sense for a stopped instance that is supposed to be paused to be started and then paused (or maybe it would).
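
One way to express "certain actions for certain states" is a lookup table keyed on the (expected, actual) pair. The sketch below is only an illustration of that idea. The action names, the state strings, and in particular the choices in `POLICY` are assumptions picked for the example; the proposal deliberately leaves open which action is right in each case.

```python
from typing import Dict, Optional, Tuple

# Hypothetical action names the arbiter could hand back to a compute node.
UPDATE_DB = "update_database"
UNPAUSE = "unpause"
START = "start"

# (expected state, actual state) -> action.
# These entries are illustrative assumptions, not decided policy.
POLICY: Dict[Tuple[str, str], str] = {
    ("running", "paused"): UNPAUSE,    # a paused instance meant to be running
    ("running", "stopped"): START,     # could just as well be UPDATE_DB
    ("paused", "stopped"): UPDATE_DB,  # avoid start-then-pause; record reality
}


def decide(expected: str, actual: str,
           policy: Dict[Tuple[str, str], str] = POLICY) -> Optional[str]:
    """Return the action for a conflict, or None if there is no conflict.

    Pairs not listed in the policy fall back to updating the database,
    which is the option that never touches the instance itself.
    """
    if expected == actual:
        return None
    return policy.get((expected, actual), UPDATE_DB)
```

Keeping the policy in data rather than in branching code means the open questions above (restart or just record?) can be settled per state pair without changing the arbiter's logic.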