High availability (HA) is all about being able to continue working without
intervention when any node fails.
Unfortunately computers are not entirely reliable. Sometimes things break. We
need Juju to be able to continue working in situations where important nodes
Juju has a key single point of failure right now, and that is the bootstrap
node. This is the single machine that the client connects to, runs the single
mongodb server, and the provisioner.
HA is given as an arguement against Juju adoption currently as people feel
this is an essential component before using Juju in a production environment.
* In a running Juju environment any instance can fail and the environment can continue to process requests.
* A user when creating an environment wants to start with multiple manager
nodes to have redundancy from the start.
* An HA Juju environment has the initial bootstrap node stop responding. The
user is still able to execute Juju commands and observe changes. The agents
on the instances continue to respond to system events and report status.
* An HA Juju environment on realising that a primary control node has
disappeared will immediately start another instance and create another control
* John is running a Juju environment. He runs 'juju ha' then uses the control panel of his cloud provider stops an instance running a manager node. John is able to continue to use
* HA is more than just mongodb, but also how the clients connect, how the
agents listen, and how new instances are started.
This feature has significant overlap with multi-tenancy; careful scope management
will be necessary.
* Synchronisation between multiple provisioners so they don't race starting
* Moving from an HA mode to a default mode by removing machines/services.
[OUT OF SCOPE]
Automatic healing of environments with failed nodes.
We will know we are done when:
* A new environment can be started in HA mode with multiple nodes capable of
servicing client requests, starting instances, and running mongodb in
* An existing running juju environment can move from being a default
environment to an HA environment
* A Juju environment when in HA mode is able to continue, business as usual,
when a primary node fails.
Bootstrap monogo in replica set size 1: DONE
Publisher job that monitors key addresses in the environment and writes a file into the private bucket containing those addresses size 2: TODO
Change agentconf to contain the URL for the api addresses size 2: TODO
Agentconf uses URL to locate API/mongodb servers size 2: TODO
Agents know how to establish a connection from a set of addresses size 2: TODO
Create, set, update txns to allow for machine job for managing state (see also local environ blueprint) size 8: DONE
Write state manager task, run it when manage-state job is set (also txn resumer? leave space, anyway, because it's part of the above job even if we skip it today) size 4: TODO
Provisioner will stop nodes running `job manage environ` tasks if they fail presence checks (wait, what? why? what's special about manage-environ vs manage-state? anyway, do we trust presence *that* much? I've seen agents "down" but running...) size 4: TODO
Agents will compete to run firewaller/
Add `juju enable-ha` command size 4: TODO
* Blueprints in grey have been implemented.