OpenStack Compute (nova)

RPC improvements

Registered by Jesse Andrews on 2011-04-18

The messaging system is not currently used in a way that handles errors or exceptions. If a worker is down, messages queue for it until the worker is restarted. This causes both user-experience problems and delays in service restoration, because the restarted worker has to process all queued messages before it can process current ones. For example, when launching a VM, a message is sent from the scheduler to a compute node. If the compute worker is down, the VM is stuck in a scheduled-but-not-launched state until the node returns to service. In the meantime the user has probably launched another VM, and the originally requested VM is then launched hours later.
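One possible way to avoid replaying stale requests is to stamp each message with a deadline and have the worker discard anything that has expired by the time it is read. The sketch below illustrates that idea with an in-process queue only. It is not nova's RPC code and does not talk to RabbitMQ; the names `Message`, `make_message`, and `drain` are hypothetical and exist only for this example.

```python
import time
from dataclasses import dataclass
from queue import Queue, Empty


@dataclass
class Message:
    method: str        # hypothetical RPC method name, e.g. "run_instance"
    args: dict         # arguments for the method
    expires_at: float  # absolute deadline, in seconds since the epoch


def make_message(method, args, ttl_seconds, now=None):
    """Build a message that is only valid for ttl_seconds after 'now'."""
    now = time.time() if now is None else now
    return Message(method=method, args=args, expires_at=now + ttl_seconds)


def drain(queue, handler, now=None):
    """Process everything currently queued.

    Expired messages are dropped instead of being handled, so a worker
    that was down does not act on requests the user gave up on long ago.
    Returns (handled, dropped) as lists of method names.
    """
    handled, dropped = [], []
    while True:
        try:
            msg = queue.get_nowait()
        except Empty:
            break
        current = time.time() if now is None else now
        if current > msg.expires_at:
            dropped.append(msg.method)  # stale: skip it
            continue
        handler(msg)
        handled.append(msg.method)
    return handled, dropped
```

With this approach the sender still needs to learn that the request was dropped, so that the VM does not stay in the scheduled state forever; the sketch leaves that part out.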

Nova's use of the message queue needs to be defined. Changes to both how we use RabbitMQ and the logic that determines whether a task has completed need to ripple through all nova code.
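As a rough illustration of what "determining whether a task has completed" could look like on the sending side, the sketch below records a deadline for each task when its message is sent, clears it when the worker reports completion, and reports tasks whose deadline has passed so the caller can mark them as failed or reschedule them. This is an assumption about one possible design, not a description of nova; `TaskTracker` and its methods are hypothetical names.

```python
import time


class TaskTracker:
    """Tracks tasks that were sent to a worker but not yet confirmed."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._pending = {}  # task_id -> absolute deadline

    def sent(self, task_id, now=None):
        """Record that a message for task_id was just put on the queue."""
        now = time.time() if now is None else now
        self._pending[task_id] = now + self.timeout_seconds

    def completed(self, task_id):
        """Record that the worker confirmed task_id.

        Returns False if the task was unknown or had already timed out.
        """
        return self._pending.pop(task_id, None) is not None

    def expire(self, now=None):
        """Remove and return the ids of tasks whose deadline has passed."""
        now = time.time() if now is None else now
        late = sorted(t for t, deadline in self._pending.items() if now > deadline)
        for task_id in late:
            del self._pending[task_id]
        return late
```

A caller such as the scheduler could invoke `expire()` periodically and move the returned tasks out of the scheduled state, rather than leaving them waiting for a worker that may be down.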