Highly Available MAAS

Registered by Julian Edwards on 2013-10-09

Discussion around what is required to make MAAS highly available (resistant to failures).

Blueprint information

Status:
Not started
Approver:
Daniel Westervelt
Priority:
Essential
Drafter:
None
Direction:
Approved
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

MAAS HA
=======

Components that *should* be HA:
 * DNS
 * Web app
 * RabbitMQ
 * Region's Squid

SPOFs:
 * Region Celery
  * We would need to run perhaps one Celery worker per appserver instance and use Celery's "Broadcast" queue type, so that a task is sent to all region workers consuming from the broadcast queue.
 * PostgreSQL? HA can be done but is hard; defer responsibility to charms (out of scope for the MAAS project).
 * RabbitMQ does not guarantee that messages are always delivered in HA mode (a server that dies takes its messages with it).
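
A minimal sketch of the "Broadcast" idea for region Celery, assuming the celery and kombu packages are installed and using the lowercase setting names of newer Celery releases (older releases spell these CELERY_QUEUES and CELERY_ROUTES). The app name, broker URL, queue name and task name are all hypothetical and are not taken from MAAS:

```python
from celery import Celery
from kombu.common import Broadcast

# Hypothetical app name and broker URL, for illustration only.
app = Celery("region_sketch", broker="amqp://guest@localhost//")

# Broadcast declares a fanout exchange; each worker that consumes from it
# gets its own queue, so every region worker receives a copy of the task.
app.conf.task_queues = (Broadcast("region_broadcast"),)

# Route the (hypothetical) task to the broadcast exchange.
app.conf.task_routes = {
    "region_sketch.refresh_state": {
        "queue": "region_broadcast",
        "exchange": "region_broadcast",
    },
}


@app.task(name="region_sketch.refresh_state")
def refresh_state():
    """Placeholder body; a real task would refresh per-appserver state."""
    return "refreshed"
```

Each appserver would then run a worker consuming from region_broadcast. This does not address the RabbitMQ delivery caveat above: a broadcast message can still be lost along with the broker node that held it.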

Other problems:
 * If a cluster dies, the region controller does not know and would still try to allocate machines in it.
 * What happens to pending Celery jobs when a cluster dies?
 * We don't look for or handle silent failures, e.g. nodes not netbooting.
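
One possible way to handle the first problem is a heartbeat check on the region side. The sketch below is an assumption about how that could look, not a description of how MAAS works: ClusterRegistry, its timeout and the (node, cluster) pairs are hypothetical names invented for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ClusterRegistry:
    """Tracks the last heartbeat time reported by each cluster."""

    timeout: float = 30.0  # seconds without a heartbeat before a cluster counts as dead
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, cluster, now=None):
        # Record that the cluster reported in at time `now`.
        self.last_seen[cluster] = time.monotonic() if now is None else now

    def is_alive(self, cluster, now=None):
        # A cluster that has never reported in is treated as dead.
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(cluster)
        return seen is not None and now - seen <= self.timeout

    def allocatable(self, nodes, now=None):
        # Keep only nodes whose cluster is currently considered alive.
        return [node for node, cluster in nodes if self.is_alive(cluster, now)]
```

The region controller would call allocatable() before handing out machines, so nodes in a cluster that has stopped sending heartbeats are skipped rather than allocated.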

To do:
 * Find out if we can bin the CD installer and its related Avahi service.
 * Investigate Celery's HA story

Notes on PostgreSQL HA:
 * Switching masters is a manual step. Has to be.
 * Multi-master is coming, according to Herb McNew.


Work Items
