RPC and fault tolerant configurations for RabbitMQ

Registered by Armando Migliaccio

The Austin release of the Nova RPC mappings deals only with intermittent network connectivity. To support RabbitMQ clusters and active/passive brokers, more advanced Nova RPC mappings are needed, such as strategies for handling the failure of nodes that hold queues within a cluster and/or master/slave failover for active/passive replication.

Currently, the message queue configuration variables in nova/flags.py are tied to RabbitMQ. In particular, only one RabbitMQ host can be specified, and it is assumed, for simplicity of deployment, that a single instance is up and running. If the RabbitMQ host fails (e.g. because of a disk or power failure), Nova components cannot send or receive messages through the queueing system until the host recovers.

To provide higher resiliency, RabbitMQ can be run in an active/passive setup, so that persistent messages written to disk on the active node can be recovered by the passive node should the active node fail. If high availability is required, active/passive HA can be achieved with shared disk storage, heartbeat/pacemaker, and possibly a TCP load balancer in front of the service replicas. This solution is largely transparent to clients such as Nova API, Scheduler, and Compute (no or minimal failover strategies are required in the Nova RPC mappings). However, it still represents a bottleneck in the overall architecture and may require expensive hardware to run, so it is far from ideal.
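To make the notion of a client-side failover strategy more concrete, here is a minimal sketch of what the RPC layer might do when no load balancer hides the broker replicas from it. It is not Nova code: `connect_with_failover`, `BrokerUnavailable`, the host names, and the injected `connect` callable are all hypothetical and chosen only for illustration. The sketch assumes a list of candidate brokers rather than the single host configured today, and tries each in turn with a simple backoff between rounds.

```python
import time


class BrokerUnavailable(Exception):
    """Raised when no broker in the list accepts a connection."""


def connect_with_failover(hosts, connect, max_rounds=3, delay=0.5,
                          sleep=time.sleep):
    """Return ((hostname, port), connection) for the first reachable broker.

    hosts   -- ordered list of (hostname, port) tuples (hypothetical setting)
    connect -- callable(hostname, port) returning a connection object and
               raising OSError when the broker cannot be reached
    """
    last_error = None
    for round_number in range(max_rounds):
        for hostname, port in hosts:
            try:
                return (hostname, port), connect(hostname, port)
            except OSError as exc:
                # Remember the failure and fall through to the next broker.
                last_error = exc
        # Every broker failed in this round: back off before trying again,
        # doubling the wait each round.
        if round_number < max_rounds - 1:
            sleep(delay * (2 ** round_number))
    raise BrokerUnavailable("no broker reachable: %r" % (last_error,))
```

A caller would pass something like `[("rabbit-active", 5672), ("rabbit-passive", 5672)]` together with whatever function its AMQP library uses to open a connection. With the active/passive HA setup described above, most of this logic moves out of the client and into heartbeat/pacemaker and the load balancer, which is why that setup needs little or no failover code in the RPC mappings.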

Blueprint information

Rick Clark
Armando Migliaccio
Needs approval
Series goal:
Milestone target:
Completed by
Armando Migliaccio

Tentative agenda:

* Jay's plans for refactoring of Nova RPC
  * current work
  * Celery as Message Queue Manager
* High-performance configurations of RabbitMQ
  * HA - pros and cons
  * Clustering - pros and cons
* Design approaches
* what next

I'll be in this session :) -JayPipes --> This session will include the one from Jay (https://blueprints.launchpad.net/nova/+spec/bexar-message-queue-celery)

Here is the link for Etherpad: http://etherpad.openstack.org/rabbitmq-ha

After the discussion at the Summit, it seems reasonable to kill this blueprint and rely on RabbitMQ HA for the time being. When/if the RabbitMQ dev team solves the issue of replicated queues in a cluster, RabbitMQ clustering could be reconsidered for HA in Nova deployments.

