Mistral

Partial Workflow Failure Handling

Registered by Zelenevskii Vadim on 2023-11-15

This feature introduces an enhanced error-handling mechanism for workflows, allowing them to handle issues within individual tasks gracefully without causing a complete workflow failure.

Previously, when a task called a subworkflow and passed it an incomplete set of parameters, the entire workflow terminated. With this feature, the workflow continues execution and the error is isolated at the task level. Consequently, a problem in one task no longer affects the other branches of the workflow execution.

In scenarios where workflows are constructed dynamically, having the entire workflow fail because of a typo in a single task can be overly restrictive. Instead of forcibly terminating the workflow execution when a specific task hits an error, the workflow should handle the issue gracefully and proceed with the remaining tasks. This matches the idea that a workflow should keep processing in the presence of localized errors (as it does, for example, when a task runs std.fail) without a cascading failure across the whole workflow execution.

Let's consider the following workflow:
---
version: '2.0'
example:
  tasks:
    t1:
      action: std.noop
      publish:
        task1: some_value
      on-success:
        - t2_l
        - t2
        - t2_r
    t2_l:
      action: std.sleep seconds=2
      publish:
        task2_l: some_value
      on-success:
        - t3_l
    t2:
      description: Fails here
      input:
        wrong_input: true
      workflow: sub_workflow_with_input
      publish:
        task2: some_value
      on-success:
        - t3
    t2_r:
      action: std.sleep seconds=2
      publish:
        task2_r: some_value
      on-success:
        - t3_r
    t3_l:
      action: std.noop
      publish:
        task3_l: some_value
    t3:
      action: std.noop
      publish:
        task3: some_value
    t3_r:
      action: std.noop
      publish:
        task3_r: some_value

sub_workflow_with_input:
  input:
    - my_input
  tasks:
    sub_wf_task:
      action: std.noop
      publish:
        sub_wf_task_result: some_value

It fails on t2 because we pass the wrong input to the subworkflow: t2 supplies wrong_input, while sub_workflow_with_input declares only my_input.
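The mismatch behind the t2 failure can be sketched as a small check. This is a toy model written for illustration, assuming a simple comparison of declared and supplied parameter names; the helper check_input is hypothetical and is not Mistral's actual validation code.

```python
def check_input(declared, provided):
    """Compare declared subworkflow parameters with the ones a task passes.

    Hypothetical helper for illustration only, not part of Mistral.
    Returns (missing, unexpected) lists of parameter names.
    """
    missing = [name for name in declared if name not in provided]
    unexpected = [name for name in provided if name not in declared]
    return missing, unexpected


# sub_workflow_with_input declares my_input; task t2 passes wrong_input.
missing, unexpected = check_input(["my_input"], {"wrong_input": True})
print(missing)     # ['my_input']
print(unexpected)  # ['wrong_input']
```

Either list being non-empty is enough to treat the task's input as invalid in this sketch.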

For this workflow we expect t2 to fail and t3 not to be created. Let's compare the actual behavior before and after the feature.

Before the feature:
The workflow is forcibly failed as soon as t2 fails, and none of the t3* tasks are created, although t3_l and t3_r should be executed independently of t2.

After the feature:
The workflow still ends up failed after t2 fails and t3 is not created, but t3_l and t3_r are executed.
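The difference between the two behaviors can be sketched with a toy simulation of the example workflow. This is not Mistral engine code: the simulate function, the isolate_failures flag, and the level-by-level scheduling are assumptions made only to illustrate which tasks get created in each case.

```python
# Toy model of the example workflow: task name -> tasks under its on-success.
ON_SUCCESS = {
    "t1": ["t2_l", "t2", "t2_r"],
    "t2_l": ["t3_l"],
    "t2": ["t3"],
    "t2_r": ["t3_r"],
    "t3_l": [],
    "t3": [],
    "t3_r": [],
}


def simulate(start, failing, isolate_failures):
    """Return (workflow_state, created_tasks, failed_tasks).

    Hypothetical simulation: tasks are processed one level at a time.
    With isolate_failures=False any failure stops the creation of all
    further tasks; with True only the failing task's branch stops.
    """
    created, failed = [], []
    level = [start]
    while level:
        created.extend(level)
        level_failed = [t for t in level if t in failing]
        failed.extend(level_failed)
        if level_failed and not isolate_failures:
            break  # old behavior: no more tasks are created anywhere
        # Only tasks that succeeded trigger their on-success successors.
        level = [nxt for t in level if t not in failing for nxt in ON_SUCCESS[t]]
    state = "ERROR" if failed else "SUCCESS"
    return state, created, failed


print(simulate("t1", {"t2"}, isolate_failures=False))
# ('ERROR', ['t1', 't2_l', 't2', 't2_r'], ['t2'])
print(simulate("t1", {"t2"}, isolate_failures=True))
# ('ERROR', ['t1', 't2_l', 't2', 't2_r', 't3_l', 't3_r'], ['t2'])
```

In both runs the workflow ends in the error state and t3 is never created; only the second run creates t3_l and t3_r.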