High Availability

Registered by Antonio Rosales

[GOAL]
High availability (HA) means being able to continue working without
intervention when any node fails.

[RATIONALE]
Unfortunately, computers are not entirely reliable. Sometimes things break. We
need Juju to continue working in situations where important nodes stop
responding.

Juju currently has a key single point of failure: the bootstrap node. This is
the single machine that the client connects to, and it runs the sole mongodb
server and the provisioner.

The lack of HA is currently given as an argument against adopting Juju, as
people feel it is an essential component before using Juju in a production
environment.

Blueprint information

Status:
Complete
Approver:
Mark Ramm
Priority:
Undefined
Drafter:
Tim Penhey
Direction:
Approved
Assignee:
None
Definition:
Superseded
Series goal:
None
Implementation:
Not started
Milestone target:
None
Completed by
Katherine Cox-Buday

Whiteboard

[USER STORIES]
* In a running Juju environment any instance can fail and the environment can continue to process requests.

* A user when creating an environment wants to start with multiple manager
nodes to have redundancy from the start.

* An HA Juju environment has the initial bootstrap node stop responding. The
user is still able to execute Juju commands and observe changes. The agents
on the instances continue to respond to system events and report status.

* An HA Juju environment, on realising that a primary control node has
disappeared, will immediately start another instance and create another
control node.

* John is running a Juju environment. He runs 'juju ha', then uses the control
panel of his cloud provider to stop an instance running a manager node. John is
able to continue to use Juju.

[ASSUMPTIONS]

* HA is about more than just mongodb: it also covers how the clients connect,
  how the agents listen, and how new instances are started (see the connection
  sketch below).
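
A minimal sketch of the connection side of that assumption. This is not Juju's
implementation; it only shows an agent (or client) trying a set of candidate
API addresses in turn and keeping the first one that answers. The addresses and
the dialFirst helper are made up for illustration; 17070 is the usual Juju API
port.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // dialFirst tries each address in turn and returns the first connection
    // that succeeds, so an agent keeps working when one manager node is down.
    func dialFirst(addrs []string, timeout time.Duration) (net.Conn, error) {
        var lastErr error
        for _, addr := range addrs {
            conn, err := net.DialTimeout("tcp", addr, timeout)
            if err == nil {
                return conn, nil
            }
            lastErr = err
        }
        return nil, fmt.Errorf("no API server reachable: %v", lastErr)
    }

    func main() {
        // Hypothetical addresses; in a real deployment these would come from
        // the agent's configuration (see the agentconf work items below).
        addrs := []string{"10.0.0.1:17070", "10.0.0.2:17070", "10.0.0.3:17070"}
        conn, err := dialFirst(addrs, 3*time.Second)
        if err != nil {
            fmt.Println("connect failed:", err)
            return
        }
        defer conn.Close()
        fmt.Println("connected to", conn.RemoteAddr())
    }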

[RISKS]

This feature has significant overlap with multi-tenancy; careful scope management
will be necessary.

[IN SCOPE]

* Synchronisation between multiple provisioners so they don't race starting
  instances (see the lease sketch after this list).

* Moving from an HA mode to a default mode by removing machines/services.
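
The provisioner synchronisation above can be illustrated with a toy, in-process
lease race. The LeaseStore type here is hypothetical and purely in-memory; a
real implementation would use a shared record in mongodb so that only one
manager node runs the provisioner at a time.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // LeaseStore hands out a named lease to at most one holder at a time.
    type LeaseStore struct {
        mu      sync.Mutex
        holder  string
        expires time.Time
    }

    // Claim returns true if machineID now holds the lease for duration d.
    func (s *LeaseStore) Claim(machineID string, d time.Duration) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        now := time.Now()
        if s.holder == "" || now.After(s.expires) || s.holder == machineID {
            s.holder = machineID
            s.expires = now.Add(d)
            return true
        }
        return false
    }

    func main() {
        store := &LeaseStore{}
        var wg sync.WaitGroup
        for _, id := range []string{"machine-0", "machine-1", "machine-2"} {
            wg.Add(1)
            go func(id string) {
                defer wg.Done()
                if store.Claim(id, 30*time.Second) {
                    fmt.Println(id, "won the lease and runs the provisioner")
                } else {
                    fmt.Println(id, "lost the race and stays idle")
                }
            }(id)
        }
        wg.Wait()
    }

The same idea is what the "racing on a lease" work item below describes for the
firewaller and provisioner tasks.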

[OUT OF SCOPE]

Automatic healing of environments with failed nodes.

[USER ACCEPTANCE]

We will know we are done when:

 * A new environment can be started in HA mode with multiple nodes capable of
   servicing client requests, starting instances, and running mongodb in
   replicaset mode (a replica-set initiation sketch follows this list).

 * An existing running Juju environment can move from being a default
   environment to an HA environment.

 * A Juju environment when in HA mode is able to continue, business as usual,
   when a primary node fails.
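
For the replicaset criterion above, a rough sketch of initiating a
single-member replica set using the mgo driver (gopkg.in/mgo.v2). It assumes a
mongod already started with --replSet juju and listening on 127.0.0.1:37017;
the set name and address are illustrative, and Juju's bootstrap performs this
step itself.

    package main

    import (
        "log"
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        // Dial the single member directly: the set is not yet initiated,
        // so it cannot be discovered as a replica set.
        session, err := mgo.DialWithInfo(&mgo.DialInfo{
            Addrs:   []string{"127.0.0.1:37017"},
            Direct:  true,
            Timeout: 10 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()
        session.SetMode(mgo.Monotonic, true)

        cfg := bson.M{
            "_id": "juju",
            "members": []bson.M{
                {"_id": 0, "host": "127.0.0.1:37017"},
            },
        }
        var result bson.M
        // Session.Run issues the command against the admin database.
        if err := session.Run(bson.D{{"replSetInitiate", cfg}}, &result); err != nil {
            log.Fatal(err)
        }
        log.Printf("replSetInitiate result: %v", result)
    }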

[RELEASE NOTE/BLOG]

(?)

Work Items

Work items:
Bootstrap mongo in replica set size 1: DONE
Publisher job that monitors key addresses in the environment and writes a file into the private bucket containing those addresses (sketched after this list) size 2: TODO
Change agentconf to contain the URL for the API addresses size 2: TODO
Agentconf uses URL to locate API/mongodb servers size 2: TODO
Agents know how to establish a connection from a set of addresses size 2: TODO
Create, set, and update txns to allow for a machine job for managing state (see also the local environ blueprint) size 8: DONE
Write the state manager task and run it when the manage-state job is set (also the txn resumer? leave space for it anyway, because it is part of the above job even if we skip it today) size 4: TODO
Provisioner will stop nodes running `job manage environ` tasks if they fail presence checks (open questions: what is special about manage-environ vs manage-state, and do we trust presence checks that much? agents have been seen "down" but still running...) size 4: TODO
Agents will compete to run firewaller/provisioner/[cleaner] via racing on a lease size 4: TODO
Add `juju enable-ha` command size 4: TODO
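
A rough sketch of the publisher work item above. It collects the current
manager addresses and writes them out as a small JSON document; writing a local
file stands in for uploading to the provider's private bucket, and the document
layout and addresses are assumptions, not an agreed format.

    package main

    import (
        "encoding/json"
        "log"
        "os"
    )

    // apiAddresses is a hypothetical document listing the manager nodes'
    // mongodb (state) and API endpoints.
    type apiAddresses struct {
        StateServers []string `json:"state-servers"`
        APIServers   []string `json:"api-servers"`
    }

    func main() {
        doc := apiAddresses{
            // Illustrative addresses; a real publisher would watch the
            // environment and refresh these when manager nodes change.
            StateServers: []string{"10.0.0.1:37017", "10.0.0.2:37017"},
            APIServers:   []string{"10.0.0.1:17070", "10.0.0.2:17070"},
        }
        f, err := os.Create("addresses.json")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        if err := json.NewEncoder(f).Encode(doc); err != nil {
            log.Fatal(err)
        }
        log.Println("wrote addresses.json")
    }

The agents then only need the URL of this document (the agentconf work items
above) to find a live API server, as in the connection sketch under
[ASSUMPTIONS].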
