Tolerance to infrastructure failures

Registered by Renat Akhmerov on 2017-03-09

Now:

Mistral doesn't handle infrastructure failures such as DB, network, or messaging outages. As a result, workflows, tasks, and/or actions can get stuck in the RUNNING state. For example, a message from the engine to an executor may be lost if the message queue goes down while the message is in transit, and the corresponding action then stays in the RUNNING state forever.

The goal:

Mistral should continue to work normally once a failure is fixed. Mistral needs to recover running workflows automatically so that they don't get stuck, and/or provide tools for operators to recover workflows manually.

Solution ideas:

* Identify all possible failures that can cause workflows/tasks/actions to get stuck in the RUNNING state
* Implement automatic handling for those situations that allow it
* Maintenance mode. We can implement a mode in which Mistral does not start new workflows/tasks/actions. In this mode we can easily find objects eligible for recovery (those in the RUNNING state) and fix them manually
* Send notifications to operator(s) about failures and about workflows that have been in the RUNNING state for a long time
* Gather statistics on the duration of certain workflows and detect suspicious objects (those that are taking longer than usual)
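
The statistics idea in the last item could be sketched roughly as follows. This is only an illustration, not Mistral code: the function name `find_suspicious`, the `k` multiplier, and the "mean plus k standard deviations" rule are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_suspicious(history, running, k=3.0):
    """Return names of running objects that look unusually slow.

    history: durations (seconds) of past completed runs of the same workflow.
    running: mapping of object name -> seconds elapsed so far.
    k: hypothetical sensitivity factor (number of standard deviations).
    """
    if len(history) < 2:
        # Not enough data to estimate a spread, so flag nothing.
        return []
    threshold = mean(history) + k * stdev(history)
    return [name for name, elapsed in running.items() if elapsed > threshold]


past_durations = [10, 12, 11, 9, 13]
currently_running = {"wf-a": 12, "wf-b": 60}
suspicious = find_suspicious(past_durations, currently_running)
```

Here only "wf-b" is flagged, because 60 seconds is far above the durations seen so far. A real implementation would need to decide where the history comes from and how to treat workflows whose duration legitimately varies.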

Testing:

* Create a gate where we could imitate failure scenarios

Questions:

* How do we imitate infrastructure failures?
* How do we deal with actions stuck in the RUNNING state after a failure of the whole control plane?
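
One possible answer to the first question, at the unit-test level, is to wrap a transport in a layer that fails on demand. The sketch below is an assumption made for illustration: `FlakyTransport` and its `fail_on` parameter are hypothetical names, and it does not use any real Mistral or messaging library API.

```python
class FlakyTransport:
    """Wrap a send callable and fail selected calls to imitate a messaging outage."""

    def __init__(self, send, fail_on):
        self._send = send
        # 1-based indexes of the calls that should fail (hypothetical parameter).
        self._fail_on = set(fail_on)
        self._calls = 0

    def send(self, message):
        self._calls += 1
        if self._calls in self._fail_on:
            # The message is never handed to the real transport, so it is lost.
            raise ConnectionError(f"simulated outage on call {self._calls}")
        return self._send(message)


delivered = []
transport = FlakyTransport(delivered.append, fail_on={2})

lost = []
for msg in ["run action 1", "run action 2", "run action 3"]:
    try:
        transport.send(msg)
    except ConnectionError:
        lost.append(msg)
```

After this runs, `delivered` holds the first and third messages and `lost` holds the second, which mirrors the case where an action never reaches the executor. Imitating a real DB or network outage in a gate job would need environment-level tooling that this sketch does not cover.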

Links:
* https://blueprints.launchpad.net/mistral/+spec/mistral-maintenance-mode

Notes/Decisions:

* Highest priority: a convenient way to fix workflows/tasks/actions stuck in RUNNING state manually
* See if it's possible to detect executions that are stuck in the RUNNING state (it is not yet clear how exactly)
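
As a starting point for the detection question, one simple heuristic is to treat an execution as a recovery candidate if it is in the RUNNING state and has not been updated for longer than some timeout. The sketch below assumes a simplified record with `id`, `state`, and `updated_at` fields; these names and the `find_stuck` helper are illustrative and are not taken from Mistral's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Execution:
    # Simplified, hypothetical record; not Mistral's actual model.
    id: str
    state: str
    updated_at: datetime


def find_stuck(executions, now, timeout):
    """Return RUNNING executions whose last update is older than `timeout`."""
    return [
        e for e in executions
        if e.state == "RUNNING" and now - e.updated_at > timeout
    ]


now = datetime(2017, 3, 9, 12, 0, 0)
executions = [
    Execution("ex-1", "RUNNING", now - timedelta(hours=5)),
    Execution("ex-2", "RUNNING", now - timedelta(minutes=2)),
    Execution("ex-3", "SUCCESS", now - timedelta(days=1)),
]
stuck_ids = [e.id for e in find_stuck(executions, now, timedelta(hours=1))]
```

With a one-hour timeout only "ex-1" is reported. A timeout alone cannot tell a lost message from a legitimately long action, so the result is better treated as a list for an operator to review than as a trigger for automatic failure.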

Blueprint information

Status: Not started
Approver: Renat Akhmerov
Priority: High
Drafter: Renat Akhmerov
Direction: Approved
Assignee: None
Definition: Approved
Series goal: None
Implementation: Not started
Milestone target: None
