StackLight

separate lma-collector pipelines

Registered by Swann Croiset on 2016-03-29

Currently, there is one hekad process per node that processes all the data (logs, notifications, metrics and alerts).

To be more efficient and resilient, we need to have different "policies" depending on the kind of data being processed.

A "policy" is defined by at least 2 buffering options: full_action and buffer size.

Two problems arise when full_action='block' is set for at least one output plugin:
* The whole Heka pipeline is blocked when that output's queue is full for any reason (e.g. metrics are no longer stored when ES is down, and vice versa).
* Heka becomes wedged (no idle packs left) because too many messages are reinjected into the pipeline by the various filters.
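
The difference between the two full_action values can be illustrated with a toy model. This is only a sketch in Python, not Heka code: the BoundedBuffer class and its offer() method are hypothetical names, the buffer size is counted in messages rather than bytes, and it assumes that 'drop' discards the incoming message when the buffer is full.

```python
from collections import deque


class BoundedBuffer:
    """Toy model of an output plugin buffer (hypothetical, not Heka code)."""

    def __init__(self, max_size, full_action):
        if full_action not in ("block", "drop"):
            raise ValueError("full_action must be 'block' or 'drop'")
        self.max_size = max_size        # capacity, counted in messages here
        self.full_action = full_action
        self.queue = deque()
        self.dropped = 0

    def offer(self, message):
        """Try to enqueue a message and report what happened to it."""
        if len(self.queue) < self.max_size:
            self.queue.append(message)
            return "queued"
        if self.full_action == "drop":
            # Assumption: the new message is discarded, the pipeline goes on.
            self.dropped += 1
            return "dropped"
        # 'block': the caller has to wait, which stalls everything upstream.
        return "blocked"


drop_buffer = BoundedBuffer(max_size=2, full_action="drop")
block_buffer = BoundedBuffer(max_size=2, full_action="block")
drop_results = [drop_buffer.offer(i) for i in range(3)]
block_results = [block_buffer.offer(i) for i in range(3)]
```

With 'drop' the third message is lost but the producer never waits; with 'block' nothing is lost but the producer is stuck until the backend drains the queue.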

The proposal is to separate the lma-collector service into at least 2 services (log_collector and metric_collector) with these policies:

* log_collector:
  - data loss should be minimized at all costs.
  - logs/notifications: full_action='block' to support a long ES downtime and minimize data loss. A 'medium' buffer_size should be set since the pending logs and messages are persisted on the filesystem and on the RabbitMQ server, respectively.

* metric_collector:
  - in general, it is acceptable to lose data when one of the backends is down for a long period of time.
  - metrics: full_action='drop' / 'large' buffer_size to be able to handle (long) downtime of InfluxDB.
  - alerting: full_action='drop' / 'small' buffer_size to be able to quickly send the latest checks to Nagios.
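
The policies above can be summarized as a small lookup table. This is a sketch only: the Policy class, the POLICIES mapping and the may_lose_data() helper are hypothetical names, and the buffer sizes stay symbolic ('small', 'medium', 'large') because the proposal does not give numeric values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    full_action: str   # 'block' or 'drop'
    buffer_size: str   # symbolic size; no numeric value is proposed


# (service, kind of data) -> buffering policy, as listed in the proposal.
POLICIES = {
    ("log_collector", "logs/notifications"): Policy("block", "medium"),
    ("metric_collector", "metrics"): Policy("drop", "large"),
    ("metric_collector", "alerting"): Policy("drop", "small"),
}


def may_lose_data(service, data_kind):
    """Return True when the policy accepts data loss on a full buffer."""
    return POLICIES[(service, data_kind)].full_action == "drop"
```

For example, may_lose_data("log_collector", "logs/notifications") is False, while both metric_collector entries return True.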

This proposal should address the case of the metrics derived from logs and notifications. There are several options, and only real tests can confirm the right choice. Currently these options are:

1/ log_collector sends metrics to metric_collector with a TCP output plugin (full_action='block')
* pros: simple and no metric loss
* cons: if the local metric_collector is down or wedged, all logs/notifications are blocked until metric_collector recovers. Alerts on logs are also not evaluated, hence no alarm will be sent.

2/ metric_collector parses logs/notifications
* pros: metrics and alerts on logs/notifications are available even if ES or log_collector are down
* cons: increases the load on metric_collector: log parsing is duplicated, and duplicate queues are needed to consume notifications

3/ ... other options are possible (e.g. log_collector is independent: it sends metrics to InfluxDB, and computes and sends alerts), to be discussed.

Option #1 seems reasonable at first glance and will be the first option explored and tested.
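
The main drawback of option #1 can be shown with a simplified simulation. This is a sketch under stated assumptions, not the actual implementation: BlockingOutput and route() are hypothetical names, every output uses full_action='block', capacities are counted in messages, and a reachable backend is assumed to drain its buffer instantly.

```python
class BlockingOutput:
    """Toy output with full_action='block' (hypothetical, not Heka code)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.pending = []
        self.backend_up = True

    def flush(self):
        # Assumption: a reachable backend drains the buffer instantly.
        if self.backend_up:
            self.pending.clear()

    def has_room(self):
        return len(self.pending) < self.capacity


def route(message, outputs):
    """Send a message to every output, or to none if any output is full.

    Returns the names of the outputs that block the pipeline
    (an empty list means the message was accepted everywhere).
    """
    for output in outputs:
        output.flush()
    full = [output.name for output in outputs if not output.has_room()]
    if full:
        return full
    for output in outputs:
        output.pending.append(message)
    return []


es = BlockingOutput("elasticsearch", capacity=2)
tcp = BlockingOutput("metric_collector_tcp", capacity=2)
tcp.backend_up = False  # the local metric_collector is down

results = [route(f"log {i}", [es, tcp]) for i in range(4)]

tcp.backend_up = True   # metric_collector recovers
after_recovery = route("log 4", [es, tcp])
```

Once the TCP output buffer is full, logs stop flowing even though ES is healthy, and they resume as soon as metric_collector is back.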


Blueprint information

Status: Complete

Approver: None

Priority: High

Drafter: Swann Croiset

Direction: Approved

Assignee: Swann Croiset

Definition: Approved

Series goal: Accepted for 0.10

Implementation: Implemented

Milestone target: 0.10.0

Started by: Swann Croiset on 2016-04-06

Completed by: Swann Croiset on 2016-05-06

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/separate-lma-collector-pipelines,n,z

Addressed by: https://review.openstack.org/300447
Separate the (L)og of the LMA collector

Gerrit topic: https://review.openstack.org/#q,topic:discard-missing-data-logs,n,z

Addressed by: https://review.openstack.org/301568
Add discard_missing_data option for alarm rules

Addressed by: https://review.openstack.org/301496
Increase the Heka poolsize on controllers

Addressed by: https://review.openstack.org/301821
Increase the Heka poolsize on controllers

Addressed by: https://review.openstack.org/302100
Increase the Heka poolsize on controllers

Addressed by: https://review.openstack.org/302193
Add keep_alive configurations for TCP input/output plugins

Addressed by: https://review.openstack.org/301608
Remove dashboard configuration by the heka module

Gerrit topic: https://review.openstack.org/#q,topic:bp/aggregated-http-metrics,n,z

Addressed by: https://review.openstack.org/308980
Avoid to inject common tags twice for log_messages metrics

Addressed by: https://review.openstack.org/308464
Emit aggregated HTTP metrics

Addressed by: https://review.openstack.org/308979
Add metric TCP decoder for the metric_collector

Addressed by: https://review.openstack.org/308463
Modify multivalue_metric implementation

Addressed by: https://review.openstack.org/309338
Enable keep_alive for aggregator connexions

Addressed by: https://review.openstack.org/310797
Decrease the heka poolsize to its default 100 for log_collector

Addressed by: https://review.openstack.org/310692
Use a dedicated directory for Lua libraries

Gerrit topic: https://review.openstack.org/#q,topic:separate-lma-collector-pipelines,n,z

Addressed by: https://review.openstack.org/312048
Prevent using init script to start Heka on controller nodes
