separate lma-collector pipelines

Registered by Swann Croiset

Currently, a single hekad process per node handles all the data (logs, notifications, metrics and alerts).

To be more efficient and resilient, we need to have different "policies" depending on the kind of data being processed.

A "policy" is defined by at least two buffering options: full_action (what an output plugin does when its buffer is full) and the buffer size.

Two problems must be taken into account when full_action='block' is set for at least one output plugin:
* The whole Heka pipeline is blocked when that plugin's queue is full (e.g. metrics are no longer stored when ES is down, and vice versa).
* Heka becomes wedged (no idle packs left) because too many messages are re-injected into the pipeline by the various filters.
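
The first point can be illustrated with a toy model. The sketch below is not Heka code: OutputBuffer and run_pipeline are hypothetical names, and the buffer sizes are arbitrary. It only assumes that a full 'block' buffer stalls the shared pipeline while a full 'drop' buffer discards the message and lets the pipeline continue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class OutputBuffer:
    """Toy bounded buffer for one output plugin (hypothetical, not Heka's API)."""
    max_size: int
    full_action: str  # 'block' or 'drop'
    queue: deque = field(default_factory=deque)
    dropped: int = 0

    def offer(self, message) -> bool:
        """Enqueue a message; return False if the caller has to wait."""
        if len(self.queue) < self.max_size:
            self.queue.append(message)
            return True
        if self.full_action == "drop":
            self.dropped += 1  # message is discarded, pipeline keeps moving
            return True
        return False  # 'block': the pipeline stalls here


def run_pipeline(messages, outputs):
    """Route (kind, payload) pairs through one shared pipeline.

    Stops at the first message that a blocking output refuses and
    returns the number of messages processed before the stall.
    """
    processed = 0
    for kind, payload in messages:
        if not outputs[kind].offer(payload):
            break
        processed += 1
    return processed


def simulate(es_full_action):
    """ES is down (its buffer is never drained); InfluxDB has plenty of room."""
    es = OutputBuffer(max_size=2, full_action=es_full_action)
    influx = OutputBuffer(max_size=100, full_action="drop")
    messages = [("log" if i % 2 == 0 else "metric", i) for i in range(10)]
    processed = run_pipeline(messages, {"log": es, "metric": influx})
    return processed, len(influx.queue), es.dropped


if __name__ == "__main__":
    print("block:", simulate("block"))
    print("drop: ", simulate("drop"))
```

With 'block', the run stops as soon as the ES buffer is full, so later metrics never reach the InfluxDB buffer even though InfluxDB is healthy. With 'drop', every metric gets through, at the cost of discarded logs.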

The proposal is to separate the lma-collector service into at least 2 services (log_collector and metric_collector) with these policies:

* log_collector:
  - data loss should be minimized at all costs.
  - logs/notifications: full_action='block' to survive a long ES downtime and minimize data loss. A 'medium' buffer_size is enough since pending logs are persisted on the filesystem and pending notification messages on the RabbitMQ server.

* metric_collector:
  - in general, it is acceptable to lose data when one of the backends is down for a long period of time.
  - metrics: full_action='drop' / 'large' buffer_size to be able to handle a (long) downtime of InfluxDB.
  - alerting: full_action='drop' / 'small' buffer_size so that the latest checks are sent to Nagios quickly.
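
The policies above can be written down as data, which makes the intent of the split easy to check. This is only a sketch: BufferPolicy, POLICIES and the two helper functions are hypothetical names, and the sizes stay as the qualitative labels used in this proposal because no concrete values are given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferPolicy:
    full_action: str  # 'block' or 'drop'
    buffer_size: str  # qualitative label; the proposal gives no byte values


# One entry per kind of data, grouped by the service that would own it.
POLICIES = {
    "log_collector": {
        "logs/notifications": BufferPolicy("block", "medium"),
    },
    "metric_collector": {
        "metrics": BufferPolicy("drop", "large"),
        "alerting": BufferPolicy("drop", "small"),
    },
}


def can_stall(service: str) -> bool:
    """True if at least one output of the service uses full_action='block'."""
    return any(p.full_action == "block" for p in POLICIES[service].values())


def can_lose_data(service: str) -> bool:
    """True if at least one output of the service uses full_action='drop'."""
    return any(p.full_action == "drop" for p in POLICIES[service].values())


if __name__ == "__main__":
    for name in POLICIES:
        print(name, "stall:", can_stall(name), "lose data:", can_lose_data(name))
```

With this split, the only blocking output lives in log_collector, so a stalled ES output can no longer hold back metrics or alerts.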

This proposal must also address the metrics derived from logs and notifications. There are several options, and only real tests can confirm the right choice. The current candidates are:

1/ log_collector sends metrics to metric_collector with a TCP output plugin (full_action='block')
  * pros: simple and no metric loss
  * cons: if the local metric_collector is down or wedged, all logs/notifications are blocked until it recovers. Alerts on logs are not evaluated either, so no alarm is sent.

2/ metric_collector parses logs/notifications
  * pros: metrics and alerts on logs/notifications are available even if ES or log_collector is down
  * cons: increases the load on metric_collector: log parsing is duplicated, and so are the queues used to consume notifications

3/ Other options are possible (e.g. log_collector is independent: it sends metrics to InfluxDB and computes and sends alerts itself); to be discussed.

Option #1 seems reasonable at first glance and will be the first one explored and tested.

Blueprint information

Status:
Complete
Approver:
None
Priority:
High
Drafter:
Swann Croiset
Direction:
Approved
Assignee:
Swann Croiset
Definition:
Approved
Series goal:
Accepted for 0.10
Implementation:
Implemented
Milestone target:
0.10.0
Started by:
Swann Croiset
Completed by:
Swann Croiset

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/separate-lma-collector-pipelines,n,z

Addressed by: https://review.openstack.org/300447
    Separate the (L)og of the LMA collector

Gerrit topic: https://review.openstack.org/#q,topic:discard-missing-data-logs,n,z

Addressed by: https://review.openstack.org/301568
    Add discard_missing_data option for alarm rules

Addressed by: https://review.openstack.org/301496
    Increase the Heka poolsize on controllers

Addressed by: https://review.openstack.org/301821
    Increase the Heka poolsize on controllers

Addressed by: https://review.openstack.org/302100
    Increase the Heka poolsize on controllers

Addressed by: https://review.openstack.org/302193
    Add keep_alive configurations for TCP input/output plugins

Addressed by: https://review.openstack.org/301608
    Remove dashboard configuration by the heka module

Gerrit topic: https://review.openstack.org/#q,topic:bp/aggregated-http-metrics,n,z

Addressed by: https://review.openstack.org/308980
    Avoid to inject common tags twice for log_messages metrics

Addressed by: https://review.openstack.org/308464
    Emit aggregated HTTP metrics

Addressed by: https://review.openstack.org/308979
    Add metric TCP decoder for the metric_collector

Addressed by: https://review.openstack.org/308463
    Modify multivalue_metric implementation

Addressed by: https://review.openstack.org/309338
    Enable keep_alive for aggregator connexions

Addressed by: https://review.openstack.org/310797
    Decrease the heka poolsize to its default 100 for log_collector

Addressed by: https://review.openstack.org/310692
    Use a dedicated directory for Lua libraries

Gerrit topic: https://review.openstack.org/#q,topic:separate-lma-collector-pipelines,n,z

Addressed by: https://review.openstack.org/312048
    Prevent using init script to start Heka on controller nodes
