Apache doesn't handle the load to process passive checks with 200 nodes

Bug #1552772 reported by Swann Croiset
This bug affects 1 person
Affects: StackLight
Status: Fix Released
Importance: Medium
Assigned to: Swann Croiset

Bug Description

MOS 8.0 build 589, Infrastructure Alerting plugin from origin/master

Environment:
3 controllers
193 compute (20 of them are also ceph nodes)
3 elasticsearch nodes
3 influxdb nodes
1 infra alerting node (apache2/nagios3)

How to reproduce:
Deploy the environment described above.

Actual result:
* Some service statuses are "UNKNOWN: No data received for at least 130 seconds" (and flap OK -> UNKNOWN -> OK ...)
* The operator receives false alerts
* CPU usage at 100%
* High fork rate (~110/s)

Expected result:
Service statuses stay OK, or are at least stable.

Diagnostic:
Apache cannot handle the load: all nodes send their status (AFD) directly to Nagios through CGI, and the aggregator sends the cluster status (GSE).
There are 1109 AFD/GSE sources, each posting a message to Apache every 10 seconds: ~111 req/s
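
The request rate quoted above follows directly from the number of sources and the posting interval. A minimal sketch of that arithmetic (the function name is illustrative, not part of the plugin):

```python
def request_rate(sources: int, interval_s: float) -> float:
    """Average requests per second when every source posts once per interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return sources / interval_s


# Figures from the diagnostic: 1109 AFD/GSE sources, one POST every 10 seconds.
print(f"{request_rate(1109, 10):.1f} req/s")  # prints "110.9 req/s"
```

Since each CGI request forks a process, this rate is consistent with the observed fork rate of about 110/s.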

Tags: scale apache
Revision history for this message
Swann Croiset (swann-w) wrote :

apache conf:

<IfModule mpm_prefork_module>
 StartServers 8
 MinSpareServers 5
 MaxSpareServers 20
 ServerLimit 400
 MaxClients 1000
 MaxRequestsPerChild 4000
 MaxRequestWorkers 450
 MaxConnectionsPerChild 0
</IfModule>

Changed in lma-toolchain:
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
Revision history for this message
Swann Croiset (swann-w) wrote :

A possible solution would be to use a lightweight interface [0] instead of Apache/CGI

[0] https://github.com/zorkian/nagios-api

Revision history for this message
Swann Croiset (swann-w) wrote :

After further tests and investigation, the issue turns out to be that every lma_collector buffers data until Nagios/Apache is deployed and up. A collector can buffer up to 2MB of data, which represents a large number of messages to catch up on!

Once Apache is up and running, all collectors flood it and Apache never recovers from this extra load.

The solution would be to decrease the buffer to its minimal size (e.g. a few KB)
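
Why a backlog can prevent recovery can be illustrated with a deliberately simplified model: the backlog only drains at the rate by which the server's capacity exceeds the steady incoming rate. The capacity figures below are hypothetical, since the report does not state how many requests per second Apache/CGI could actually serve.

```python
import math


def catch_up_seconds(backlog_msgs: float, arrival_rate: float, capacity: float) -> float:
    """Time needed to drain a backlog while new messages keep arriving.

    arrival_rate and capacity are in messages per second. If the server
    cannot process more than what arrives, the backlog never drains.
    """
    if capacity <= arrival_rate:
        return math.inf
    return backlog_msgs / (capacity - arrival_rate)


# ~111 req/s is the steady rate from the report; capacities are assumptions.
for capacity in (100, 150, 300):
    print(capacity, catch_up_seconds(10_000, 111, capacity))
```

Under this model, shrinking the buffer reduces the backlog term, while a server with too little headroom over the steady rate stays saturated for a long time or indefinitely.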

Swann Croiset (swann-w)
Changed in lma-toolchain:
importance: Undecided → Medium
Revision history for this message
Swann Croiset (swann-w) wrote :

Optimal buffer sizes for nagios outputs
===============================
Knowing that an AFD/GSE message cannot be larger than 2KB and that the buffer size must be greater than the max_message_size of a Heka message (currently 256KB), we can compute the required size.

By default, 27 AFDs are configured per controller and 7 per compute/storage node:
 -> 2KB * 27 = 54KB per controller --> for x2 additional AFD = 114KB
 -> 2KB * 7 = 14KB per compute/storage --> for x2 additional AFD = 28KB

By default, there are 15 global clusters and 6 node clusters: 2KB * 15 * 6 = 180KB --> 200KB

The theoretical size:
for controller = 200 + 114 = 314KB
for compute/storage = 260KB (>max_message_size)

Ideally, we should configure buffer sizing options per output queue type (AFD, GSE_global, GSE_service).

max_buffer_size = 321536 # 314KB
max_file_size = 266240 # 260KB > max_message_size 256KB

Conclusion: we can safely reduce the buffer size from 2MB to 500KB
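
As a quick check of the values above, the following sketch converts the KB figures to bytes (1KB = 1024 bytes, which matches the two configured values) and estimates how many maximum-size messages a full buffer holds. It assumes the 2KB upper bound per AFD/GSE message stated above; smaller messages mean more messages per buffer. The helper names are illustrative only.

```python
KB = 1024
MAX_MESSAGE_BYTES = 2 * KB  # upper bound for one AFD/GSE message, per the comment above


def kb_to_bytes(kb: int) -> int:
    """Convert a size in KB to bytes."""
    return kb * KB


def min_messages_in_full_buffer(buffer_bytes: int, msg_bytes: int = MAX_MESSAGE_BYTES) -> int:
    """Number of messages a full buffer holds if every message has the maximum size."""
    return buffer_bytes // msg_bytes


print(kb_to_bytes(314))  # 321536, the max_buffer_size value
print(kb_to_bytes(260))  # 266240, the max_file_size value

for label, size_kb in (("2MB", 2048), ("500KB", 500), ("400KB", 400)):
    print(label, min_messages_in_full_buffer(kb_to_bytes(size_kb)))
# 2MB 1024
# 500KB 250
# 400KB 200
```

Each collector can therefore replay at least this many messages per output queue once Apache comes back.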

Test results with buffer size 400KB
===========================
Not enough: after 5 minutes of downtime (Apache shut down), the same behavior is observed.
However, the test succeeded with 100 nodes (with a long period of high load on the infra_alerting node while it caught up on the buffered messages).

Short-term recommendation to mitigate this issue
=====================================
1/ decrease the buffer sizes to 500KB
2/ increase the interval of all AFD filters from 10s to 20s (at least on compute nodes)

Long term solution
===============
This issue will be fixed when the CGI script is replaced by a lightweight interface (nagios-api)
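
To illustrate the idea of a lightweight interface, here is a minimal WSGI sketch that accepts a passive check over POST and writes a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file, without forking a CGI process per request. This is not the plugin's implementation: the command file path and the form field names (host, service, plugin_state, plugin_output) are assumptions made for the example.

```python
import time
from urllib.parse import parse_qs

# Assumed path; the real location depends on the Nagios installation.
COMMAND_FILE = "/var/lib/nagios3/rw/nagios.cmd"


def format_check_result(host, service, state, output, now=None):
    """Build one Nagios external command line for a passive service check."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}\n"


def _reply(start_response, status, body):
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]


def make_app(command_file=COMMAND_FILE):
    """Return a WSGI application that writes passive checks to command_file."""

    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            return _reply(start_response, "405 Method Not Allowed", b"POST only\n")
        try:
            size = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            size = 0
        form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        try:
            # Field names are assumptions for this sketch.
            host = form["host"][0]
            service = form["service"][0]
            state = int(form["plugin_state"][0])
        except (KeyError, ValueError):
            return _reply(start_response, "400 Bad Request", b"missing or invalid field\n")
        output = form.get("plugin_output", [""])[0].replace("\n", " ")
        # Reject values that would corrupt the ';'-separated, one-line command.
        bad_name = any(c in name for name in (host, service) for c in ";\n")
        if bad_name or state not in (0, 1, 2, 3):
            return _reply(start_response, "400 Bad Request", b"invalid value\n")
        # The command file is normally a named pipe, so "w" does not truncate
        # anything; a regular file opened this way would be overwritten.
        with open(command_file, "w") as cmd:
            cmd.write(format_check_result(host, service, state, output))
        return _reply(start_response, "200 OK", b"OK\n")

    return app
```

Such an application could be served for a quick trial with wsgiref.simple_server.make_server from the standard library, or run under any WSGI server in production.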

Swann Croiset (swann-w)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-infrastructure-alerting (master)

Fix proposed to branch: master
Review: https://review.openstack.org/315421

Changed in lma-toolchain:
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → Swann Croiset (swann-w)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-infrastructure-alerting (master)

Reviewed: https://review.openstack.org/315421
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-infrastructure-alerting/commit/?id=3ec437b2dc775aaf4388b1a0e5106a24b60f698a
Submitter: Jenkins
Branch: master

commit 3ec437b2dc775aaf4388b1a0e5106a24b60f698a
Author: Swann Croiset <email address hidden>
Date: Thu May 12 10:40:19 2016 +0200

    Implement lightweight WSGI application to replace CGI

    Fixes-bug: #1552772
    Implement: blueprint scalable-nagios-api

    Change-Id: I55613dd650b039142767174d3f19fa9262a2a7bc

Changed in lma-toolchain:
status: In Progress → Fix Committed
Changed in lma-toolchain:
milestone: none → 0.10.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-infrastructure-alerting (stable/0.9)

Fix proposed to branch: stable/0.9
Review: https://review.openstack.org/341540

Changed in lma-toolchain:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-infrastructure-alerting (stable/0.8)

Fix proposed to branch: stable/0.8
Review: https://review.openstack.org/351640

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-infrastructure-alerting (stable/0.8)

Reviewed: https://review.openstack.org/351640
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-infrastructure-alerting/commit/?id=06e390d96f619af39bf29f2c2f5cf11e424a0abf
Submitter: Jenkins
Branch: stable/0.8

commit 06e390d96f619af39bf29f2c2f5cf11e424a0abf
Author: Swann Croiset <email address hidden>
Date: Thu May 12 10:40:19 2016 +0200

    Implement lightweight WSGI application to replace CGI

    Fixes-bug: #1552772
    Implement: blueprint scalable-nagios-api

    Change-Id: I55613dd650b039142767174d3f19fa9262a2a7bc
    (cherry picked from commit 3ec437b2dc775aaf4388b1a0e5106a24b60f698a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-plugin-lma-infrastructure-alerting (stable/0.9)

Change abandoned by Swann Croiset (<email address hidden>) on branch: stable/0.9
Review: https://review.openstack.org/341540
Reason: for history !
