Asynchronous Container Operations

Registered by Andrew Melton on 2015-04-30

Currently, all of our container operations are synchronous and block both the API and the conductor until the Docker client returns. This is not ideal, especially for potentially long-running calls such as container create, where Docker/Swarm will block while the image is pulled onto the node.

Along with certain calls being changed to casts, there are a few other potential changes:
* Container status: the status of a container should be tracked through operations so that users can see what is going on and so that the API can reject certain calls until the container is ready.
* Async faults: with asynchronous calls, raised exceptions will not bubble up through the API, because the calls no longer block. We will need a way to communicate failures to users that works in an async model.
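
The two points above can be illustrated with a minimal sketch. It is not Magnum code: the `Container` class, the `running` and `error` status names, the `ALLOWED` table, and the `fault` field are all hypothetical, chosen only to show the API rejecting calls based on status and a background task recording its failure on the container instead of raising it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# "create-in-progress" is a state named on this blueprint's whiteboard;
# "running" and "error" are hypothetical names used for illustration.
CREATE_IN_PROGRESS = "create-in-progress"
RUNNING = "running"
ERROR = "error"

# Hypothetical table of operations the API accepts in each status.
ALLOWED = {
    CREATE_IN_PROGRESS: {"show"},
    RUNNING: {"show", "stop", "delete"},
    ERROR: {"show", "delete"},
}


class Conflict(Exception):
    """Raised by the API layer when the container is not ready for a call."""


@dataclass
class Container:
    name: str
    status: str = CREATE_IN_PROGRESS
    fault: Optional[str] = None  # last asynchronous failure, if any


def check_allowed(container: Container, operation: str) -> None:
    """Reject an operation that is not valid for the container's status."""
    if operation not in ALLOWED[container.status]:
        raise Conflict(
            f"cannot {operation} {container.name} while {container.status}"
        )


def run_async_create(container: Container, do_create: Callable[[], None]) -> None:
    """Body of the background task: record the outcome on the container."""
    try:
        do_create()
    except Exception as exc:
        # Nobody is waiting on this call, so store the fault for later reads.
        container.status = ERROR
        container.fault = str(exc)
    else:
        container.status = RUNNING
```

A user who later shows the container would see the `error` status and the stored fault text, which is one way to communicate failures without a blocking call.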

Blueprint information

Status: Complete
Approver: Adrian Otto
Priority: High
Drafter: Andrew Melton
Direction: Approved
Assignee: Surojit Pathak
Definition: Obsolete
Series goal: Accepted for newton
Implementation: Needs Code Review
Milestone target: None
Started by: Surojit Pathak on 2016-01-23
Completed by: Adrian Otto on 2016-10-28

Whiteboard

Implement non-blocking async casts at API level, asynchronous behavior by default in python-magnumclient, and add --poll to the CLI to allow for sync CLI behavior.
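
A minimal sketch of what --poll could do on the client side, assuming the client can fetch a container's current status. The names `wait_for_status` and `get_status`, and the interval and timeout defaults, are hypothetical and not taken from python-magnumclient; the two in-progress state names are the ones proposed on this whiteboard.

```python
import time
from typing import Callable, Collection

# In-progress states proposed on this whiteboard.
IN_PROGRESS = ("create-in-progress", "delete-in-progress")


def wait_for_status(
    get_status: Callable[[], str],
    in_progress: Collection[str] = IN_PROGRESS,
    interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the container leaves an in-progress state.

    get_status is a stand-in for whatever client call returns the
    container's current status string.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status not in in_progress:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"container still {status} after {timeout}s")
        sleep(interval)
```

Without --poll the CLI would return as soon as the API accepts the request; with it, the CLI would call something like `wait_for_status` and print the final status.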

<-------@suro-patz, 12-28-2015
Here is a summary of the proposed design, following discussion on the mailing list [http://lists.openstack.org/pipermail/openstack-dev/2015-December/082524.html]:

1. Magnum-conductor will have a pool of green threads for executing container operations, viz. executor_threadpool. The size of executor_threadpool will be configurable. [Phase0]
2. Every time Magnum-conductor (Mcon) receives a container operation request from Magnum-API (Mapi), it will do the initial validation and housekeeping, then pick a thread from executor_threadpool to execute the rest of the operation. Mcon will thus return from the RPC request context much faster, without blocking Mapi. If executor_threadpool has no free threads, Mcon will execute the operation as it does today, i.e. synchronously. This acts as the rate-limiting mechanism and relays the feedback of exhaustion to the caller. [Phase0]
    How often this scenario is hit may indicate to the operator that more Mcon workers are needed.
3. Blocking class of operations: some operations cannot be made async because they must return their result/content inline, e.g. 'container-logs'. [Phase0]
4. Out-of-order considerations for the non-blocking class of operations: there is a possible race condition when a create is followed by a start/delete of the same container, as the operations would run in parallel. To solve this, we will maintain a map from each container to the thread currently executing an operation on it. If a request arrives for a container that already has an operation in execution, we will block until that thread completes. [Phase0]
    This approach requires that operations for a given container on a given Bay go to the same Magnum-conductor instance. To achieve this, we will use modulo hashing based on <bay-id, container-id>, so that operations for a given container land on the same conductor worker. [Phase1]
    This mechanism can be further refined to achieve more asynchronous behavior. [Phase2]
5. The hand-off between Mcon and a thread from executor_threadpool can be reflected through new states on the 'container' object, viz. create-in-progress, delete-in-progress. [Phase0]
    These states can help with recovery and auditing after an Mcon restart, or even in sync_bay_status. [Phase1]
@suro-patz, 12-28-2015 -------->
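
Items 1, 2, and 4 of the summary above can be sketched as follows. This is an illustration under stated assumptions, not the Magnum implementation: it uses OS threads from the standard library instead of green threads, and the `ContainerExecutor` class and its method names are hypothetical.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ContainerExecutor:
    """Bounded pool that runs at most one operation per container at a time."""

    def __init__(self, pool_size):
        self._pool = ThreadPoolExecutor(max_workers=pool_size)
        self._idle = threading.Semaphore(pool_size)  # counts idle workers
        self._lock = threading.Lock()
        self._in_flight = {}  # container id -> Future of its running operation

    def submit(self, container_id, operation, *args):
        done = Future()
        # Item 4: block until any earlier operation on this container ends.
        while True:
            with self._lock:
                previous = self._in_flight.get(container_id)
                if previous is None or previous.done():
                    self._in_flight[container_id] = done
                    break
            previous.exception()  # waits; returns the error, does not raise it

        if self._idle.acquire(blocking=False):
            # Item 2: hand off to a pool thread and return immediately.
            self._pool.submit(
                self._run, container_id, done, operation, args, True
            )
        else:
            # Pool exhausted: run in the caller, as the conductor does today.
            self._run(container_id, done, operation, args, False)
        return done

    def _run(self, container_id, done, operation, args, from_pool):
        try:
            done.set_result(operation(*args))
        except Exception as exc:  # reported through the Future instead
            done.set_exception(exc)
        finally:
            with self._lock:
                if self._in_flight.get(container_id) is done:
                    del self._in_flight[container_id]
            if from_pool:
                self._idle.release()
```

With a pool of size 1 and a create still running, a request for a different container runs inline in the caller, while a delete for the same container waits until the create has finished. The sketch keeps the map in one process, which is why the summary calls for routing all operations for a container to the same conductor.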

<-------@suro-patz, 1-21-2016
A few more updates from development so far:
- It is desirable to keep the async mode of operation controllable by a config knob. This will help get the code in incrementally, and will let others try it out and provide feedback. To keep both code paths available without duplication, we will use the futurist interface.
- To create the two classes of actions (sync/async), and to achieve https://blueprints.launchpad.net/magnum/+spec/async-rpc-api, we will use the CAST vs. CALL mechanism of oslo_messaging.
- The futurist interface allows tasks to be submitted even when no threads are available, and processes them as threads become available. We will use this facility to absorb bursts of requests.
@suro-patz, 1-21-2016 -------->
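
The config knob idea can be sketched with a single submit-and-get-a-future interface behind which either a synchronous or a pooled executor sits. The sketch below uses only the standard library so that it is self-contained; futurist offers executors with the same style of interface (for example `futurist.SynchronousExecutor` and `futurist.GreenThreadPoolExecutor`). The `make_executor` function and the `async_enabled` option name are hypothetical.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class SynchronousExecutor(Executor):
    """Runs each task in the caller and returns an already-finished Future."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def make_executor(async_enabled, pool_size=4):
    """Pick the executor from a (hypothetical) config option.

    Callers use executor.submit(...) either way, so there is one code path.
    """
    if async_enabled:
        return ThreadPoolExecutor(max_workers=pool_size)
    return SynchronousExecutor()


if __name__ == "__main__":
    executor = make_executor(async_enabled=True, pool_size=2)
    # A burst larger than the pool: submit() queues the extra tasks and
    # returns at once; they run as worker threads become free.
    futures = [executor.submit(pow, 2, n) for n in range(10)]
    print([f.result() for f in futures])
    executor.shutdown()
```

Because both executors return futures, the conductor code that submits a container operation would not need to know which mode is active.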

Gerrit topic: https://review.openstack.org/#q,topic:bp/async-container-operations,n,z

Addressed by: https://review.openstack.org/267134
    [WIP]Magnum asynchronous container operation

Gerrit topic: https://review.openstack.org/#q,topic:bp/s,n,z

Addressed by: https://review.openstack.org/275003
    Spec for asynchronous container operations
