WBE: handle worker failure more gracefully

Registered by Joshua Harlow

Currently, if a worker starts executing a task and then fails unexpectedly, the executor waits forever for the task to finish, which can leave the whole flow hanging. The worker should send periodic liveness messages so that the executor knows the task is still being performed; if those messages stop arriving, the executor should mark the request as failed.
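A minimal sketch of the executor-side bookkeeping this would need, assuming a fixed timeout between liveness messages. The `RequestMonitor` class, its method names, and the request ids are hypothetical and are not part of the existing executor code; the clock is injectable only so the example runs without sleeping.

```python
import time


class RequestMonitor:
    """Tracks the last liveness message seen for each in-flight request.

    Hypothetical helper for illustration; not an existing TaskFlow class.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_seen = {}

    def started(self, request_id):
        # The worker accepted the request; start the liveness timer.
        self._last_seen[request_id] = self._clock()

    def heartbeat(self, request_id):
        # Ignore liveness messages for requests already finished or expired.
        if request_id in self._last_seen:
            self._last_seen[request_id] = self._clock()

    def finished(self, request_id):
        # A result arrived; stop tracking the request.
        self._last_seen.pop(request_id, None)

    def expire(self):
        """Return (and stop tracking) requests whose worker went silent."""
        now = self._clock()
        dead = [request_id for request_id, seen in self._last_seen.items()
                if now - seen > self._timeout]
        for request_id in dead:
            del self._last_seen[request_id]
        return dead


# Usage with a fake clock: "req-1" keeps sending liveness messages,
# "req-2" goes silent and should be marked as failed.
now = [0.0]
monitor = RequestMonitor(timeout=5.0, clock=lambda: now[0])
monitor.started("req-1")
monitor.started("req-2")
now[0] = 4.0
monitor.heartbeat("req-1")
now[0] = 8.0
failed = monitor.expire()
```

The executor would call `expire()` periodically and mark each returned request as failed instead of waiting on it indefinitely.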

Blueprint information

Status:
Not started
Approver:
None
Priority:
High
Drafter:
Joshua Harlow
Direction:
Needs approval
Assignee:
None
Definition:
Approved
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

The idea is that when a worker accepts a task (transitions it to the pending state and notifies the executor), it also joins a 'temporary' tooz group that the engine created. If the worker then dies, the tooz group loses that worker as a member, and the engine can become aware of this and decide what to do. On completion (the happy path), the worker sends back the task result and then removes itself from the tooz group. This may require some tweaking so that the engine does not mistake the departure for a worker failure, since reception of the task result may take a while.
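The membership idea can be sketched without a real coordination backend. In the example below, `TaskGroup` is an in-memory stand-in for the temporary tooz group (it does not use the tooz API), and `EngineRequest`, `run_worker`, and the state names are hypothetical names chosen for illustration. The engine treats a departure as a failure only when no result has been received yet.

```python
class TaskGroup:
    """In-memory stand-in for a temporary per-task membership group."""

    def __init__(self):
        self._members = set()
        self._watchers = []

    def watch_leave(self, callback):
        # Register a callback invoked whenever a member leaves the group.
        self._watchers.append(callback)

    def join(self, member_id):
        self._members.add(member_id)

    def leave(self, member_id):
        # Covers both a voluntary leave and the coordinator dropping a
        # member whose process died.
        if member_id in self._members:
            self._members.remove(member_id)
            for callback in self._watchers:
                callback(member_id)


class EngineRequest:
    """Engine-side view of one task request (illustrative only)."""

    def __init__(self, group):
        self.state = "PENDING"
        self.result = None
        group.watch_leave(self._on_member_left)

    def on_result(self, result):
        self.result = result
        self.state = "SUCCESS"

    def _on_member_left(self, member_id):
        # A departure counts as a failure only if no result was received.
        if self.state == "PENDING":
            self.state = "FAILURE"


def run_worker(group, request, worker_id, crash):
    group.join(worker_id)
    if crash:
        # Simulates the coordinator noticing that the worker died.
        group.leave(worker_id)
        return
    request.on_result("done")  # happy path: result first...
    group.leave(worker_id)     # ...then leave the group


happy_group = TaskGroup()
happy = EngineRequest(happy_group)
run_worker(happy_group, happy, "worker-1", crash=False)

crashed_group = TaskGroup()
crashed = EngineRequest(crashed_group)
run_worker(crashed_group, crashed, "worker-2", crash=True)
```

In this sketch the result is delivered synchronously, so the ordering always holds. With a real message bus the result may arrive after the departure is observed, so the engine would likely need a grace period before declaring failure.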


Work Items
