Tolerance to infrastructure failures
Now:
Mistral doesn't handle infrastructure failures such as a DB outage, a network outage, or a messaging outage. As a result, workflows, tasks and/or actions can get stuck in the RUNNING state. For example, a message from the engine to an executor may be lost if the message queue goes down while transferring it, and the corresponding action then stays in the RUNNING state forever.
The goal:
Mistral should continue to work normally once a failure is fixed. Mistral needs to recover running workflows automatically so that they don't get stuck, and/or provide tools for operators to recover workflows manually.
Solution ideas:
* Identify all possible failures that can cause workflows/
* Implement automatic handling for those situations that allow it
* Maintenance mode. We can implement a mode in which Mistral does not start new workflows/
* Send notifications to operator(s) about failures and about workflows that have been in the RUNNING state for a long time
* Gather statistics on the duration of certain workflows and detect suspicious objects (those that are taking longer than usual)
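The statistics idea in the last item could be sketched roughly as follows. This is only an illustration and not Mistral code: the function `find_suspicious`, the `history` and `running` structures, and the "mean plus k standard deviations" rule are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_suspicious(history, running, k=3.0, min_samples=5):
    """Return ids of running executions that exceed a per-workflow threshold.

    history: {workflow_name: [past durations in seconds]}  (hypothetical shape)
    running: {execution_id: (workflow_name, elapsed seconds)}  (hypothetical shape)
    The threshold is mean + k * sample standard deviation, an assumed rule.
    """
    suspicious = []
    for exec_id, (workflow, elapsed) in running.items():
        samples = history.get(workflow, [])
        if len(samples) < min_samples:
            # Not enough history to judge this workflow; skip it.
            continue
        threshold = mean(samples) + k * stdev(samples)
        if elapsed > threshold:
            suspicious.append(exec_id)
    return sorted(suspicious)


history = {"backup": [60, 62, 58, 61, 59, 60]}
running = {
    "ex-1": ("backup", 63),    # within the usual range
    "ex-2": ("backup", 600),   # far longer than any past run
    "ex-3": ("new_wf", 9999),  # no history, so it is not judged
}
suspects = find_suspicious(history, running)
```

A workflow with little or no history is skipped rather than flagged, so an operator would still need a separate fixed timeout for new workflows.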
Testing:
* Create a gate where we could imitate failure scenarios
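One possible way to imitate a messaging outage in such a gate is a test double that drops selected messages between the engine and the executor. The sketch below is an assumption made for illustration: `LossyTransport`, its `send` method, and the message strings are hypothetical and do not correspond to Mistral's actual RPC layer.

```python
class LossyTransport:
    """Test double that silently drops selected messages.

    It imitates a message queue going down mid-transfer: the sender
    gets no error, and the receiver never sees the message.
    """

    def __init__(self, deliver, drop_indices):
        self._deliver = deliver          # callable that receives messages
        self._drop = set(drop_indices)   # zero-based positions to drop
        self._count = 0
        self.dropped = []                # kept so tests can inspect losses

    def send(self, message):
        index = self._count
        self._count += 1
        if index in self._drop:
            self.dropped.append(message)
            return
        self._deliver(message)


received = []
transport = LossyTransport(received.append, drop_indices=[1])
for msg in ["run_action:a1", "run_action:a2", "run_action:a3"]:
    transport.send(msg)
```

Dropping by position keeps the failure deterministic, so a test can assert exactly which action is expected to remain in the RUNNING state.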
Questions:
* How do we imitate infrastructure failures?
* How do we deal with actions stuck in the RUNNING state after a failure of the whole control plane?
Links:
* https:/
Notes/Decisions:
* Highest priority: a convenient way to fix workflows/
* See if it's possible to detect executions that are stuck in RUNNING state (not clear at the moment how exactly)
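As one candidate approach to the detection question, an execution could be treated as stuck when it is in the RUNNING state and has not been updated for longer than some idle limit. This is a sketch under assumptions, not a decided design: the `Execution` record, its `updated_at` field, and the 30-minute default are hypothetical, and a long-running action that legitimately produces no updates would be misreported by this rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Execution:
    # Hypothetical record; not Mistral's actual data model.
    id: str
    state: str
    updated_at: datetime


def find_stuck(executions, now, max_idle=timedelta(minutes=30)):
    """Return ids of RUNNING executions idle for longer than max_idle."""
    return [
        e.id
        for e in executions
        if e.state == "RUNNING" and now - e.updated_at > max_idle
    ]


now = datetime(2015, 1, 1, 12, 0, 0)
executions = [
    Execution("ex-1", "RUNNING", now - timedelta(minutes=5)),
    Execution("ex-2", "RUNNING", now - timedelta(hours=2)),
    Execution("ex-3", "SUCCESS", now - timedelta(hours=5)),
]
stuck = find_stuck(executions, now)
```

Passing `now` explicitly instead of reading the clock inside the function keeps the check easy to test.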
Blueprint information
- Status: Not started
- Approver: Renat Akhmerov
- Priority: High
- Drafter: Renat Akhmerov
- Direction: Approved
- Assignee: None
- Definition: Approved
- Series goal: None
- Implementation: Not started
- Milestone target: None