separate lma-collector pipelines
Currently, there is one hekad process per node that processes all the data (logs, notifications, metrics, and alerts).
To be more efficient and resilient, we need different "policies" depending on the kind of data being processed.
A "policy" is defined by at least two buffering options: full_action and buffer size.
Two important points should be taken into account when full_action='block' is set for at least one output plugin:
* Heka's whole pipeline is blocked when that plugin's queue is full for any reason (e.g. metrics are no longer stored when Elasticsearch is down, and vice versa).
* Heka can become wedged (idle packs) when too many messages are reinjected into the pipeline by the different filters.
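To make the buffering options concrete, here is a sketch of how Heka exposes them per output plugin; the section names, matcher, server address and sizes below are illustrative assumptions, not the actual lma-collector settings:

```toml
# Hypothetical log output: block the pipeline rather than lose logs.
[elasticsearch_output]
type = "ElasticSearchOutput"
message_matcher = "Type == 'log'"
server = "http://localhost:9200"

[elasticsearch_output.buffering]
max_file_size = 134217728    # 128 MiB per buffer file (illustrative)
max_buffer_size = 1073741824 # 1 GiB on disk before full_action triggers
full_action = "block"        # stall upstream plugins when the buffer is full
```

With full_action = "block", backpressure propagates to every plugin feeding this output, which is exactly the pipeline-wide blocking behavior described above.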
The proposal is to separate the lma-collector service into at least two services (log_collector and metric_collector) with these policies:
* log_collector:
- data loss should be minimized at all costs.
- logs/notifications: full_action='block' to support a long Elasticsearch downtime and minimize data loss. A 'medium' buffer_size should be sufficient since the pending logs and notifications are persisted on the filesystem and on the RabbitMQ server, respectively.
* metric_collector:
- in general, it is acceptable to lose data when one of the backends is down for a long period of time.
- metrics: full_action='drop' with a 'large' buffer_size to be able to handle a (long) InfluxDB downtime.
- alerting: full_action='drop' with a small buffer_size to quickly send the latest check results to Nagios.
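The policies above could translate into per-output buffering sections along these lines (a sketch only; the plugin names and sizes are assumptions to be validated by testing):

```toml
# log_collector: minimize data loss, tolerate long Elasticsearch downtime.
# A 'medium' buffer is enough because pending data is also persisted
# upstream (filesystem for logs, RabbitMQ for notifications).
[log_output.buffering]
max_buffer_size = 268435456  # 'medium' buffer (256 MiB, illustrative)
full_action = "block"

# metric_collector: survive a long InfluxDB downtime; dropping is acceptable.
[influxdb_output.buffering]
max_buffer_size = 1073741824 # 'large' buffer (1 GiB, illustrative)
full_action = "drop"

# metric_collector: freshness matters more than completeness for Nagios.
[nagios_output.buffering]
max_buffer_size = 1048576    # small buffer (1 MiB, illustrative)
full_action = "drop"         # keep only the latest check results moving
```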
This proposal must also address the case of the metrics derived from logs and notifications. There are several options, and only real tests can confirm the right choice. The options currently under consideration are:
1/ log_collector sends metrics to metric_collector with a TCP output plugin (full_action=
* pros: simple and no metric loss
* cons: if the local metric_collector is down or wedged, all logs/notifications are blocked until metric_collector recovers. Also, alerts on logs are not evaluated, so no alarm will be sent.
2/ metric_collector parses logs/notifications
* pros: metrics and alerts on logs/notifications are available even if Elasticsearch or log_collector is down
* cons: increases the load on metric_collector: duplicated log parsing and duplicated queues to consume notifications
3/ other options are possible (e.g. log_collector is fully independent: it sends metrics to InfluxDB itself and computes and sends its own alerts); to be discussed.
Option #1 seems reasonable at first glance and will be the first one explored and tested.
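Option #1 would be wired with Heka's stock TCP plugins, roughly as follows (addresses and section names are illustrative assumptions):

```toml
# log_collector side: forward metrics derived from logs/notifications to
# the local metric_collector over TCP, using Heka's protobuf framing.
[metrics_tcp_output]
type = "TcpOutput"
message_matcher = "Type == 'metric'"
address = "127.0.0.1:5567"
encoder = "ProtobufEncoder"
use_buffering = true

[metrics_tcp_output.buffering]
full_action = "block"  # no metric loss, but logs stall if the peer is down

# metric_collector side: receive those metrics.
[metrics_tcp_input]
type = "TcpInput"
address = "127.0.0.1:5567"
decoder = "ProtobufDecoder"
```

The full_action = "block" on the output is what produces the con listed above: when the local metric_collector is unreachable, the log_collector pipeline stops as a whole.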
Blueprint information
- Status:
- Complete
- Approver:
- None
- Priority:
- High
- Drafter:
- Swann Croiset
- Direction:
- Approved
- Assignee:
- Swann Croiset
- Definition:
- Approved
- Series goal:
- Accepted for 0.10
- Implementation:
- Implemented
- Milestone target:
- 0.10.0
- Started by:
- Swann Croiset
- Completed by:
- Swann Croiset
Whiteboard
Gerrit topic: https:/
Addressed by: https:/
Separate the (L)og of the LMA collector
Gerrit topic: https:/
Addressed by: https:/
Add discard_
Addressed by: https:/
Increase the Heka poolsize on controllers
Addressed by: https:/
Increase the Heka poolsize on controllers
Addressed by: https:/
Increase the Heka poolsize on controllers
Addressed by: https:/
Add keep_alive configurations for TCP input/output plugins
Addressed by: https:/
Remove dashboard configuration by the heka module
Gerrit topic: https:/
Addressed by: https:/
Avoid to inject common tags twice for log_messages metrics
Addressed by: https:/
Emit aggregated HTTP metrics
Addressed by: https:/
Add metric TCP decoder for the metric_collector
Addressed by: https:/
Modify multivalue_metric implementation
Addressed by: https:/
Enable keep_alive for aggregator connexions
Addressed by: https:/
Decrease the heka poolsize to its default 100 for log_collector
Addressed by: https:/
Use a dedicated directory for Lua libraries
Gerrit topic: https:/
Addressed by: https:/
Prevent using init script to start Heka on controller nodes