Workflow error analysis

Registered by Renat Akhmerov on 2016-11-09

Now:

When a workflow fails, it can currently be hard to find the root cause quickly.

From the CLI, the only way (without creating a new execution) is to run a sequence of commands like:
* run 'mistral task-list <workflow execution id>' and see which tasks are in ERROR
* for each failed task execution, run 'mistral action-execution-list' and see which actions are in ERROR
* for each failed action, run 'mistral action-execution-get-output <id>' and read the description of the error
* for each failed task execution of type Workflow, find the sub-workflow execution ID and go back to the first bullet
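The manual steps above amount to a recursive walk over task executions. Below is a minimal sketch of that walk over in-memory data, not Mistral's implementation. The class names, field names and the `collect_errors` helper are hypothetical, chosen only for illustration; a real tool would fill these structures from the CLI or REST API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ActionExecution:
    id: str
    state: str
    output: str = ""


@dataclass
class TaskExecution:
    id: str
    name: str
    state: str
    actions: List[ActionExecution] = field(default_factory=list)
    # Hypothetical field: set only for tasks of type Workflow.
    sub_workflow: Optional["WorkflowExecution"] = None


@dataclass
class WorkflowExecution:
    id: str
    state: str
    tasks: List[TaskExecution] = field(default_factory=list)


def collect_errors(wf: WorkflowExecution,
                   path: Tuple[str, ...] = ()) -> List[Tuple[str, str, str]]:
    """Return (task path, action id, output) for every failed action."""
    errors = []
    for task in wf.tasks:
        if task.state != "ERROR":
            continue  # only failed tasks are inspected
        task_path = path + (task.name,)
        for action in task.actions:
            if action.state == "ERROR":
                errors.append(("/".join(task_path), action.id, action.output))
        # A failed task of type Workflow: repeat the same steps one level down.
        if task.sub_workflow is not None:
            errors.extend(collect_errors(task.sub_workflow, task_path))
    return errors


# Made-up sample data: a parent workflow whose sub-workflow task failed.
inner = WorkflowExecution("wf-2", "ERROR", [
    TaskExecution("t-3", "call_api", "ERROR",
                  [ActionExecution("a-2", "ERROR", "Invalid URL")]),
])
outer = WorkflowExecution("wf-1", "ERROR", [
    TaskExecution("t-1", "prepare", "SUCCESS",
                  [ActionExecution("a-1", "SUCCESS")]),
    TaskExecution("t-2", "run_sub", "ERROR", sub_workflow=inner),
])

for task_path, action_id, output in collect_errors(outer):
    print(task_path, action_id, output)
```

Each level of nesting needs its own round of lookups, which is why doing this by hand gets slow for deeply nested workflows.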

It is also possible to create and execute a workflow that does a "publish" of all tasks and all sub-workflow tasks recursively (and also filters by tasks in the ERROR state). Example: http://paste.openstack.org/show/599714/

The goal:

Mistral should provide a single command that shows a report on failed actions and how they affected the entire workflow execution. The report should also account for nested workflows.

Solution ideas/steps:
* Write a spec
* It could be implemented on the client side or the server side. The latter is faster because it avoids making many REST requests.

Testing:

* Functional tests that imitate workflow failures and make sure that we get the right report.

Error examples:

* YAQL expression failed: http://paste.openstack.org/show/600099/
* HTTP action failed because of an invalid URL: http://paste.openstack.org/show/600100/

Notes:

* One of the current problems is the clarity of error info. It's not easy to understand what the precise error is even when we see it.
* Idea: split the actual error info from contextual information (e.g. stack trace)
* Idea: give an option to report inbound context and outbound context for each task
* Idea: use some sort of classification for all possible errors
* Idea: have a separate REST API endpoint to build reports on the current status of the execution and/or error analysis
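Two of the ideas above, splitting the error from its context and classifying errors, can be sketched together. This is only an illustration: it assumes the raw error text is a Python-style traceback whose last non-empty line carries the actual error. The `ErrorInfo` type, the `split_error` helper and the marker-to-category table are hypothetical and are not part of Mistral.

```python
from dataclasses import dataclass


@dataclass
class ErrorInfo:
    message: str   # the actual error, shown first in a report
    context: str   # everything else, e.g. the stack trace
    category: str  # coarse classification of the error


# Hypothetical classification rules: substring marker -> category.
CATEGORIES = (
    ("YaqlEvaluationException", "expression"),
    ("InvalidURL", "network"),
    ("ConnectionError", "network"),
)


def split_error(raw: str) -> ErrorInfo:
    """Split raw error text into the error line and its context."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        return ErrorInfo("", "", "unknown")
    # Assumption: the last non-empty line is the actual error.
    message = lines[-1].strip()
    context = "\n".join(lines[:-1])
    category = next(
        (cat for marker, cat in CATEGORIES if marker in message),
        "unknown",
    )
    return ErrorInfo(message, context, category)


# Made-up sample input in the shape of a Python traceback.
raw = (
    "Traceback (most recent call last):\n"
    '  File "x.py", line 1, in run\n'
    "    do()\n"
    "InvalidURL: Failed to parse: http://"
)
info = split_error(raw)
print(info.category, "-", info.message)
```

A report could then show only `message` and `category` by default and leave `context` for a verbose mode.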

Decisions:

* Write a spec first
* Add a new endpoint to generate "Workflow error analysis" reports. The same endpoint can also generate a report on the current progress of a workflow that has not necessarily failed. It can be used, for example, by a UI to track the current state of an execution.

Blueprint information

Status:
Complete
Approver:
Renat Akhmerov
Priority:
High
Drafter:
Renat Akhmerov
Direction:
Approved
Assignee:
Renat Akhmerov
Definition:
Approved
Series goal:
Accepted for stein
Implementation:
Implemented
Milestone target:
stein-3
Started by
Renat Akhmerov on 2019-01-30
Completed by
Renat Akhmerov on 2019-03-27

Whiteboard

https://bugs.launchpad.net/mistral/+bug/1674722 - the bug created for the same purpose, now closed so that it does not duplicate this blueprint. It has some additional information that can be useful, though.

spec: https://review.openstack.org/#/c/443217/

patches related to it:

https://review.openstack.org/455447
https://review.openstack.org/452901

Gerrit topic: https://review.openstack.org/#q,topic:bp/mistral-error-analysis,n,z

Addressed by: https://review.openstack.org/631163
    WIP: add a workflow execution report endpoint


Work Items
