Asynchronous trace collection in OSProfiler

Registered by Rajul Kumar on 2017-07-25

This blueprint adds asynchronous collection of trace data from the tracepoints in a service.

Current status:
Traces generated at each tracepoint are sent to the data store synchronously on the critical path of an API call.

Problem:
Storing traces synchronously to the data store on the critical path of the request execution results in observable overhead and can affect the response time of an API.
Currently this may be acceptable, since a user explicitly selects a specific transaction to be traced. However, it limits OSProfiler's ability to become an always-on tracing service like Zipkin or Jaeger [2][3], where tracing and sampling are abstracted away from the end user; the overhead would then hit random transactions and give users an inconsistent experience.
Also, an API call already does more work than its primary task. Hence, there should be some isolation between performing the task and storing the trace.

Proposed change:
We propose to have an agent running on each node/service. The agent will receive traces from the tracepoints in the service and then write them asynchronously to the data store, off the critical path of the transaction [2][3].
This agent should be added as another driver extending the driver base class in OSProfiler, alongside the existing drivers for data stores such as MongoDB and Ceilometer. It gives users another option for saving tracepoints. The agent can then use the drivers already present in OSProfiler as its backend to save the data to the intended data store.
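As a sketch of how such a driver could look (all class and method names here are hypothetical stand-ins for illustration, not OSProfiler's actual driver API): the driver serializes each trace point and fires it at a local Unix datagram socket in a non-blocking, fire-and-forget fashion, so the traced request never waits on storage.

```python
import json
import socket


class Driver(object):
    """Minimal stand-in for the OSProfiler driver base class."""

    def notify(self, info):
        raise NotImplementedError


class LocalAgentDriver(Driver):
    """Forwards each trace point to a local agent over a Unix datagram
    socket instead of writing to the data store directly.

    The send is fire-and-forget: if the agent is down or its socket
    buffer is full, the trace is dropped rather than delaying the API
    call being traced.
    """

    def __init__(self, socket_path="/var/run/osprofiler-agent.sock"):
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        self._sock.setblocking(False)  # never stall the critical path
        self._path = socket_path

    def notify(self, info):
        payload = json.dumps(info).encode("utf-8")
        try:
            self._sock.sendto(payload, self._path)
        except OSError:
            # Agent unavailable: drop the trace, keep the request fast.
            pass
```

A datagram socket keeps the per-tracepoint cost to a single local syscall; whether to drop or buffer traces when the agent is unreachable would be a policy decision for the real implementation.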
The agent should run as a daemon that accepts traces from the service. It could be initialized as part of the OSProfiler initialization call from a specific service.
An optional central collector may be introduced that receives the data from the agents and runs validation and transformation, if required, before sending it to the data store.
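The agent side of the proposal could be sketched as follows (again, the names and the JSON-over-datagram wire format are illustrative assumptions, not an actual implementation): a daemon that receives traces on the local socket, queues them, and hands them to an existing backend driver from a worker thread, so the network call to the data store happens entirely off the API path. The worker thread is also where an optional central collector could later be slotted in as the backend.

```python
import json
import queue
import socket
import threading


class TraceAgent(object):
    """Sketch of a per-node trace agent daemon.

    Traces arrive over a Unix datagram socket, are buffered in an
    in-process queue, and are passed to a backend callable (e.g. an
    existing OSProfiler driver's notify()) from a worker thread.
    """

    def __init__(self, socket_path, backend_notify):
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        self._sock.bind(socket_path)
        self._queue = queue.Queue()
        self._backend_notify = backend_notify
        self._running = True

    def _receive_loop(self):
        # Drain the socket quickly so senders never block.
        while self._running:
            data, _ = self._sock.recvfrom(65536)
            self._queue.put(json.loads(data))

    def _store_loop(self):
        # The slow network call to the data store happens here,
        # decoupled from the traced API call.
        while self._running:
            info = self._queue.get()
            self._backend_notify(info)

    def serve(self):
        threading.Thread(target=self._receive_loop, daemon=True).start()
        threading.Thread(target=self._store_loop, daemon=True).start()
```

Separating the receive loop from the store loop means a slow data store only grows the agent's queue instead of back-pressuring the service; a real daemon would add bounded queues, batching, and graceful shutdown.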

Benefit:
This will reduce the user-visible overhead of OSProfiler. To store a trace, a tracepoint will make a local IPC call to the agent rather than a network call to the data store on the API's critical path.

Depends on the blueprint "Overhead control in OSProfiler"[1]

[1] https://blueprints.launchpad.net/osprofiler/+spec/osprofiler-overhead-control
[2] http://zipkin.io/pages/architecture.html
[3] http://jaeger.readthedocs.io/en/latest/architecture

Blueprint information

Status:
Not started
Approver:
Tovin Seven
Priority:
Undefined
Drafter:
Rajul Kumar
Direction:
Needs approval
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard


Work Items
