Automatic Joiner failure recovery/retry

Registered by David Bennett

We need a facility that enables automatic retries on a joiner node after an SST (State Snapshot Transfer) failure. A node can fail to join a cluster in several scenarios, including:

* Network connection failure
* Underlying process failure (due to service or machine restarts)
* Intermittent failure in the SST provider (xtrabackup, rsync, etc.)

When a joiner fails to apply the SST, it should automatically retry joining the cluster. This may involve:

* Automatic cleanup or recovery of the joiner's data store
* A 'join attempt' counter
* A user-configurable join_attempts value to limit the number of retry attempts
* Donor selection logic to avoid SST creation/transfer issues that are specific to one donor node
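
The items above could fit together roughly as follows. This is only a minimal sketch of the proposed behaviour, not an existing Galera or Percona XtraDB Cluster API: apart from join_attempts, which is the option name proposed in this blueprint, every name here (join_with_retry, request_sst, clean_datadir, SSTError, JoinResult) is hypothetical, and the donor policy (prefer donors that have not failed yet, then cycle through all of them) is just one possible choice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


class SSTError(Exception):
    """Hypothetical error raised by the SST callback when a transfer fails."""


@dataclass
class JoinResult:
    joined: bool                 # True if an SST was applied successfully
    attempts: int                # value of the join attempt counter at exit
    donor: Optional[str]         # donor that served the successful SST, if any
    failed_donors: List[str] = field(default_factory=list)


def join_with_retry(
    donors: Sequence[str],
    request_sst: Callable[[str], None],
    clean_datadir: Callable[[], None],
    join_attempts: int = 3,
) -> JoinResult:
    """Try to join via SST, retrying up to join_attempts times.

    request_sst and clean_datadir are stand-ins (assumptions) for the real
    SST request and for the data store cleanup/recovery step.
    """
    if join_attempts < 1:
        raise ValueError("join_attempts must be at least 1")
    if not donors:
        raise ValueError("at least one donor is required")

    failed: List[str] = []
    for attempt in range(1, join_attempts + 1):
        # Donor selection: prefer a donor that has not failed yet; once
        # every donor has failed, cycle through the full list again.
        untried = [d for d in donors if d not in failed]
        donor = untried[0] if untried else donors[(attempt - 1) % len(donors)]
        try:
            request_sst(donor)
        except SSTError:
            if donor not in failed:
                failed.append(donor)
            # Leave the data store in a clean state before the next attempt.
            clean_datadir()
            continue
        return JoinResult(True, attempt, donor, failed)

    # The retry limit was reached without a successful SST.
    return JoinResult(False, join_attempts, None, failed)
```

With donors node1 and node2, where only node1 fails, this sketch would clean the data store once and join through node2 on the second attempt. If every donor keeps failing, it stops after join_attempts attempts and reports the donors that failed, so the caller can decide whether to abort or alert an operator.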

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: David Bennett
Direction: Needs approval
Assignee: None
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None
