WBE: handle worker failure more gracefully
Currently, if a worker starts executing a task and then fails unexpectedly, the executor waits forever for the task to finish, which can leave the whole flow hanging. The worker should send periodic liveness messages so the executor knows the task is still being performed; if those messages stop arriving, the executor should mark the request as failed.
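The liveness check described above could be sketched on the executor side roughly as follows. This is only an illustration under assumptions, not existing taskflow code: the class name `LivenessTracker`, its methods, and the timeout value are all hypothetical, and how heartbeats travel from worker to executor is left out.

```python
import time


class LivenessTracker:
    """Hypothetical executor-side helper: remembers when each in-flight
    task request last showed a sign of life from its worker."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout      # seconds of silence tolerated (assumed)
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # task id -> time of last heartbeat

    def task_started(self, task_id):
        # The worker accepted the task; start the silence timer.
        self._last_seen[task_id] = self._clock()

    def heartbeat(self, task_id):
        # Ignore heartbeats for tasks that are unknown or already settled.
        if task_id in self._last_seen:
            self._last_seen[task_id] = self._clock()

    def task_finished(self, task_id):
        # A result arrived; stop tracking this task.
        self._last_seen.pop(task_id, None)

    def expire(self):
        """Return (and forget) the ids of tasks whose worker went silent."""
        now = self._clock()
        dead = [task_id for task_id, seen in self._last_seen.items()
                if now - seen > self._timeout]
        for task_id in dead:
            del self._last_seen[task_id]
        return dead
```

The executor would call `expire()` periodically and mark each returned request as failed instead of waiting on it forever.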
Blueprint information
- Status: Not started
- Approver: None
- Priority: High
- Drafter: Joshua Harlow
- Direction: Needs approval
- Assignee: None
- Definition: Approved
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by:
- Completed by:
Whiteboard
The idea is that when a worker accepts a task (transitions it to the pending state and tells the executor so), it also joins a 'temporary' tooz group that the engine created. If the worker then dies, the tooz group loses that worker as a member, and the engine can notice this and decide what to do. On completion (the happy path) the worker sends back the task result and then removes itself from the tooz group. This may require some tweaking so the engine does not treat that departure as a worker death, since reception of the task result may take a while.
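One way to picture the tweak mentioned above is a grace period: a worker leaving the group is only treated as suspect, and becomes a failure if no result has arrived by the time the engine settles. The sketch below does not call tooz; `TaskGroup` is an in-memory stand-in for the temporary group, and `Engine`, `on_result`, and `settle` are hypothetical names used only for illustration.

```python
class TaskGroup:
    """In-memory stand-in for the temporary per-task group (not tooz)."""

    def __init__(self):
        self._members = set()
        self._on_leave = []

    def watch_leave(self, callback):
        # Register a callback fired whenever a member disappears.
        self._on_leave.append(callback)

    def join(self, member_id):
        self._members.add(member_id)

    def leave(self, member_id):
        # Models both a clean leave and a dead member being dropped;
        # the engine cannot tell the two apart from this event alone.
        if member_id in self._members:
            self._members.discard(member_id)
            for callback in self._on_leave:
                callback(member_id)


class Engine:
    """Hypothetical engine side: decides whether a departure is a failure."""

    def __init__(self, group):
        self.results = {}     # worker id -> task result
        self.suspect = set()  # left the group, result not seen yet
        self.failed = set()   # confirmed failures
        group.watch_leave(self._worker_left)

    def on_result(self, worker_id, result):
        # A (possibly late) result clears any suspicion about the worker.
        self.results[worker_id] = result
        self.suspect.discard(worker_id)

    def _worker_left(self, worker_id):
        # Leaving before the result is received is only suspicious so far.
        if worker_id not in self.results:
            self.suspect.add(worker_id)

    def settle(self):
        # Called after the grace period: remaining suspects have failed.
        self.failed |= self.suspect
        self.suspect.clear()
```

With this shape, a worker whose result is delivered after its group departure is not reported as failed, while a worker that vanishes without ever delivering a result is.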