Apache doesn't handle the load to process passive checks with 200 nodes
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StackLight | Fix Released | Medium | Swann Croiset | |
Bug Description
MOS 8.0 build 589, Infrastructure Alerting plugin from origin/master
Environment:
3 controllers
193 compute (20 of them are also ceph nodes)
3 elasticsearch nodes
3 influxdb nodes
1 infra alerting node (apache2/nagios3)
How to reproduce:
Deploy the environment described above.
Actual result:
* Some service statuses are "UNKNOWN: No data received for at least 130 seconds" (and flap OK -> UNKNOWN -> OK ...)
* The operator receives false alerts
* CPU 100% usage
* high fork rate ~110/s
Expected result:
Service statuses stay OK, or at least remain stable.
Diagnostic:
Apache cannot handle the load: every node sends its status (AFD) directly to Nagios through CGI, and the aggregator sends the cluster status (GSE).
There are 1109 AFD/GSE checks, each posting a message to Apache every 10 seconds: ~111 req/s
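The request rate quoted in the diagnostic can be checked with a short calculation. This is only a sketch: the function name `request_rate` is made up for illustration, and it assumes the posts are spread evenly over the 10-second interval. Since a CGI request normally spawns a new process, a rate of this size would also be consistent with the observed fork rate of ~110/s.

```python
def request_rate(num_checks: int, interval_s: float) -> float:
    """Average requests per second when each check posts once per interval.

    Hypothetical helper for illustration; assumes posts are evenly spread.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return num_checks / interval_s


# Figures from the bug report: 1109 AFD/GSE checks, one post every 10 seconds.
rate = request_rate(1109, 10)
print(f"{rate:.1f} req/s")  # prints "110.9 req/s"
```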
Changed in lma-toolchain: | |
importance: | Undecided → Medium |
description: | updated |
Changed in lma-toolchain: | |
milestone: | none → 0.10.0 |
Changed in lma-toolchain: | |
status: | Fix Committed → Fix Released |
apache conf:
<IfModule mpm_prefork_module>
StartServers 8
MinSpareServers 5
MaxSpareServers 20
ServerLimit 400
MaxClients 1000
MaxRequestsPerChild 4000
MaxRequestWorkers 450
MaxConnectionsPerChild 0