Tail-based coherent sampling [1] in OSProfiler

Registered by Rajul Kumar

This blueprint proposes a sampling strategy for continuous tracing in OSProfiler that is more efficient than head-based sampling alone.

Current status:
There are no sampling strategies implemented yet in OSProfiler.

Problem:
Continuous tracing may require sampling to reduce the amount of data being generated [2].
Sampling can be done at the beginning of a transaction, by deciding whether or not a trace should be generated for it, i.e. head-based sampling [1].
This reduces the amount of data being generated [1]. However, it can miss faulty transactions: because they were not selected at the start, they are never traced, even though they are the ones that should have been saved for analysis.
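
The limitation of head-based sampling can be illustrated with a minimal sketch. The names `should_trace` and `handle_request` are hypothetical and are not part of OSProfiler; the point is only that the sampling decision is taken before the outcome of the transaction is known.

```python
import random


def should_trace(sample_rate, rng=random):
    """Head-based decision: made once, before the transaction runs."""
    return rng.random() < sample_rate


def handle_request(request_fails, sample_rate, rng=random):
    """Return the trace that would be stored, or None if nothing was recorded.

    Hypothetical stand-in for a real request handler.
    """
    traced = should_trace(sample_rate, rng)
    outcome = "error" if request_fails else "ok"
    # The outcome is only known here, after the sampling decision was made,
    # so a failing request that was not selected leaves no trace behind.
    return {"outcome": outcome} if traced else None
```

With a low sample rate, most failing requests return `None` here, which is the loss of faulty transactions described above.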

Proposed Solution:
In addition to head-based sampling, we propose a second option, tail-based sampling. It makes the decision at the end of the transaction, right before the trace is sent to the data store, based on statistics pre-computed from the transactions seen so far.
The analysis can be done either locally, on the partial traces held by each agent, or centrally, by sending traces to a collector that aggregates them and decides whether or not they are anomalous (and hence worth keeping). We consider both options below.
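
As a rough sketch of what a tail-based decision could look like, the example below keeps a trace if the request failed or if its duration is far above what has been seen so far. The blueprint does not specify which statistics are used; the running mean and standard deviation (Welford's algorithm), the threshold `k`, the warm-up size `min_samples`, and the trace fields `error` and `duration` are all assumptions made for illustration.

```python
import math


class RunningStats:
    """Running mean and variance of durations (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def keep_trace(trace, stats, k=3.0, min_samples=30):
    """Tail-based decision, taken after the transaction has finished."""
    if trace["error"]:
        keep = True
    elif stats.count < min_samples:
        keep = False  # not enough history to judge latency yet
    else:
        keep = trace["duration"] > stats.mean + k * stats.stddev
    # Fold this transaction into the statistics used for later decisions.
    stats.update(trace["duration"])
    return keep
```

Unlike the head-based case, a failed request is always kept here, because the decision is made once the outcome is known.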

First, sending the traces to a collector takes the computation overhead off the nodes, helps in making better decisions, and keeps all the required traces. Normal traces could also be clustered by API, path, input parameters, etc. and sampled so that traces of the various possible scenarios are retained. However, this adds network overhead, as every generated trace is sent to the collector, and the collector itself needs resources to run as a service.
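
The clustering idea for the collector option could be sketched as follows. This is only an illustration: the function `select_traces`, the trace fields `anomalous`, `api` and `path`, and the choice of `(api, path)` as the cluster key are assumptions, and a real collector might also cluster on input parameters.

```python
from collections import defaultdict


def select_traces(traces, per_cluster=1):
    """Keep every anomalous trace plus a few normal ones per cluster."""
    kept = []
    seen = defaultdict(int)  # cluster key -> normal traces kept so far
    for trace in traces:
        if trace["anomalous"]:
            kept.append(trace)
            continue
        key = (trace["api"], trace["path"])
        if seen[key] < per_cluster:
            seen[key] += 1
            kept.append(trace)
    return kept
```

Keeping a bounded number of normal traces per cluster preserves one example of each scenario without storing every routine request.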

Second, agents can make the decision themselves using partial traces: if a partial trace looks anomalous to an agent (e.g. a request fails), it can broadcast to the other agents to store that specific trace. This does not require a separate collector. However, each agent must hold its traces until it either receives a notification from another agent or a timeout evicts them. This requires additional resources on each node and introduces some network overhead. Communication with other agents could be reduced if the trace metadata records which agents hold the other parts of the trace.
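
The buffering that the agent-side option requires could look roughly like the sketch below. The class `PartialTraceBuffer` and its methods are hypothetical; the broadcast transport between agents is left out, and only the hold-until-notified-or-timed-out behaviour is shown.

```python
import time


class PartialTraceBuffer:
    """Holds partial traces until another agent asks for them or they expire."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._pending = {}  # trace_id -> (arrival_time, spans)

    def add(self, trace_id, spans):
        self._pending[trace_id] = (self._clock(), spans)

    def on_keep_notification(self, trace_id):
        """Another agent flagged this trace; return local spans to store."""
        entry = self._pending.pop(trace_id, None)
        return entry[1] if entry else None

    def evict_expired(self):
        """Drop traces nobody asked for within the timeout; return their ids."""
        now = self._clock()
        expired = [
            trace_id
            for trace_id, (arrived, _) in self._pending.items()
            if now - arrived >= self._timeout
        ]
        for trace_id in expired:
            del self._pending[trace_id]
        return expired
```

The timeout bounds the memory each node spends on buffered traces, at the cost of losing the local part of a trace whose notification arrives too late.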

Benefit:
Useful traces are not lost in the sampling process.

Depends on blueprint "Overhead control in OSProfiler" [2].

[1] Sambasivan, R. R., Shafer, I., Mace, J., Sigelman, B. H., Fonseca, R., & Ganger, G. R. (2016, October). Principled workflow-centric tracing of distributed systems. In Proceedings of the Seventh ACM Symposium on Cloud Computing (pp. 401-414). ACM.
[2] https://blueprints.launchpad.net/osprofiler/+spec/osprofiler-overhead-control

Blueprint information

Status: Not started
Approver: Tovin Seven
Priority: Undefined
Drafter: Rajul Kumar
Direction: Needs approval
Assignee: None
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None
