Automatic Joiner failure recovery/retry
Registered by
David Bennett
We need a facility to enable automatic retries in a joiner node after a State Snapshot Transfer (SST) failure. A node can fail to join a cluster in several scenarios, including:
* Network connection failure
* Underlying process failure (due to service or machine restarts)
* Intermittent failure in the SST provider (xtrabackup, rsync, etc.)
When a joiner fails to apply the SST transfer, it should automatically retry joining the cluster. This may involve:
* Automatic cleanup or recovery of the joiner's data store.
* A 'join attempt' counter.
* A user-configurable join_attempts value to limit the number of retry attempts.
* Donor selection logic to avoid SST creation/transfer issues that are specific to one donor node.
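The items above could fit together roughly as in the sketch below. This is only an illustration of the proposed behaviour, not an existing implementation: apart from join_attempts, every name here (join_with_retry, run_sst, clean_datadir, SSTError, JoinResult) is hypothetical, and the donor selection rule (prefer the donor with the fewest failed transfers so far) is just one possible policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


class SSTError(Exception):
    """Hypothetical error raised by the SST callback when a transfer fails."""


@dataclass
class JoinResult:
    joined: bool
    attempts: int                      # value of the 'join attempt' counter
    donor: Optional[str] = None        # donor that finally served the SST
    failed_donors: List[str] = field(default_factory=list)


def join_with_retry(
    donors: Sequence[str],
    run_sst: Callable[[str], None],    # hypothetical: performs one SST from a donor
    clean_datadir: Callable[[], None], # hypothetical: cleans/recovers the data store
    join_attempts: int = 3,            # user-configurable retry limit
) -> JoinResult:
    if join_attempts < 1:
        raise ValueError("join_attempts must be at least 1")
    if not donors:
        return JoinResult(joined=False, attempts=0)

    failures: Dict[str, int] = {d: 0 for d in donors}

    for attempt in range(1, join_attempts + 1):
        # Donor selection: pick the donor with the fewest failures so far.
        # min() returns the first such donor, so ties keep the given order.
        donor = min(donors, key=lambda d: failures[d])
        try:
            run_sst(donor)
        except SSTError:
            failures[donor] += 1
            # Leave the data store in a clean state before the next attempt.
            clean_datadir()
            continue
        failed = [d for d in donors if failures[d] > 0]
        return JoinResult(True, attempt, donor, failed)

    failed = [d for d in donors if failures[d] > 0]
    return JoinResult(False, join_attempts, None, failed)
```

With this policy, a joiner whose first donor produces a bad SST moves on to a different donor on the next attempt and gives up once join_attempts transfers have failed.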
Blueprint information
- Status:
- Not started
- Approver:
- None
- Priority:
- Undefined
- Drafter:
- David Bennett
- Direction:
- Needs approval
- Assignee:
- None
- Definition:
- New
- Series goal:
- None
- Implementation:
- Unknown
- Milestone target:
- None
- Started by
- Completed by
Related branches
Related bugs
Sprints
Whiteboard