MAAS

Highly Available MAAS

Registered by Julian Edwards on 2013-10-09

Discussion around what is required to make MAAS highly available (resistant to failures).

Blueprint information

Status: Complete

Approver: Daniel Westervelt

Priority: Essential

Drafter: None

Direction: Approved

Assignee: None

Definition: Obsolete

Series goal: None

Implementation: Unknown

Milestone target: None

Completed by: Adam Collard on 2019-10-09

Sprints

cloud-oct-2013

Whiteboard

MAAS HA
=======

Components that *should* be HA:
* DNS
* Web app
* RabbitMQ
* Region's Squid

SPOFs:
* Region Celery
* For Region Celery, we would perhaps need to run one Celery worker per appserver instance and use Celery's "Broadcast" queue type, so that a task is sent to all region workers consuming from the broadcast queue.
* PostgreSQL? HA can be done but is hard; defer responsibility to charms (out of scope for the MAAS project).
* RabbitMQ does not guarantee that messages are always delivered in HA mode (a server that dies takes its messages with it).
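
The broadcast idea above can be illustrated with a toy, in-process model of fanout delivery. This is only a sketch of the semantics, not Celery or RabbitMQ code: `FanoutExchange`, the worker names and the task name are all hypothetical. In a real deployment the queue would be declared through Celery's own broadcast routing support rather than a hand-written class.

```python
from collections import deque


class FanoutExchange:
    """Toy in-process stand-in for a broadcast (fanout) exchange."""

    def __init__(self):
        # Maps a worker name to that worker's private queue.
        self._queues = {}

    def bind(self, worker_name):
        # Each region worker gets its own queue bound to the exchange.
        self._queues[worker_name] = deque()

    def publish(self, task):
        # Fanout: copy the task into every bound queue, so every
        # worker sees it (unlike a shared work queue, where only one would).
        for queue in self._queues.values():
            queue.append(task)

    def drain(self, worker_name):
        # Return and clear everything waiting for one worker.
        queue = self._queues[worker_name]
        tasks = list(queue)
        queue.clear()
        return tasks


workers = ("appserver-1", "appserver-2")  # hypothetical worker names
exchange = FanoutExchange()
for name in workers:
    exchange.bind(name)

exchange.publish("refresh_dns_zone")  # hypothetical task name
received = {name: exchange.drain(name) for name in workers}
```

Note that this model shares the weakness listed above: anything still sitting in a queue when its host dies is lost, so broadcast delivery alone does not make the tasks durable.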

Other problems:
* If a cluster dies, the region controller has no way of knowing and would still try to allocate machines from it.
* What happens to pending Celery jobs when a cluster dies?
* We don't look for & handle silent failures, e.g. nodes not netbooting.
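
One possible way to address the first problem is a heartbeat with a timeout, so that the region stops allocating from clusters it has not heard from recently. The sketch below assumes that approach; it is not how MAAS is known to work, and `ClusterRegistry`, the 30-second timeout and the cluster and node names are all hypothetical.

```python
import time


class ClusterRegistry:
    """Tracks the last heartbeat seen from each cluster controller."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the example is deterministic
        self._last_seen = {}

    def heartbeat(self, cluster_name):
        # Called whenever a cluster controller checks in.
        self._last_seen[cluster_name] = self._clock()

    def is_alive(self, cluster_name):
        # A cluster is alive if it checked in within the timeout window.
        last = self._last_seen.get(cluster_name)
        return last is not None and self._clock() - last <= self._timeout

    def allocatable(self, nodes):
        # nodes: iterable of (node_name, cluster_name) pairs.
        return [node for node, cluster in nodes if self.is_alive(cluster)]


# Usage with a fake clock: cluster-b stops sending heartbeats.
now = [0.0]
registry = ClusterRegistry(timeout_seconds=30.0, clock=lambda: now[0])
registry.heartbeat("cluster-a")
registry.heartbeat("cluster-b")
now[0] = 20.0
registry.heartbeat("cluster-a")
now[0] = 45.0

nodes = [("node-1", "cluster-a"), ("node-2", "cluster-b")]
available = registry.allocatable(nodes)
```

A timeout only detects clusters that go completely quiet. Silent failures, such as nodes that never netboot, would need a separate check.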

To do:
* Find out if we can bin the CD installer and its related Avahi service.
* Investigate Celery's HA story

Notes on postgres HA:
* Switching masters is a manual step. Has to be.
* Multi-master is coming, according to Herb McNew.

Subscribers

Adam Stokes

Gavin Panella

Jeroen T. Vermeulen

Julian Edwards

Raphaël Badin