Provide agent/service status which can be queried via init.d script or parent process

Registered by Miguel Angel Ajo

Currently we have no way to query agent or service status locally while they run.
All we have is the pid file, which only tells us whether the process is alive. In HA environments
we also need to health-check the service itself: does it have connectivity? Have child processes
died unexpectedly? Is it working as expected?

I propose exposing the agent/service status through a status file that can be queried via /etc/init.d/neutron-*-agent status or by the agent's parent process.

The status file would contain any status conditions that we want to propagate to
the parent process.

The file will consist of one line per condition, ordered by priority (most critical first), in the format:

<status code>,<priority WARNING|ERROR|CRITICAL|...>,"<message>"
<status code>,<priority WARNING|ERROR|CRITICAL|...>,"<message>"
....
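
As an illustration of how a parent process might read such a file, here is a minimal Python sketch. The names StatusEntry and parse_status are hypothetical and not part of Neutron; the sketch assumes each line is valid CSV with exactly three fields, as in the format above.

```python
import csv
import io
from typing import List, NamedTuple


class StatusEntry(NamedTuple):
    """One line of the proposed status file (hypothetical type)."""
    code: int
    priority: str
    message: str


def parse_status(text: str) -> List[StatusEntry]:
    """Parse '<status code>,<priority>,"<message>"' lines, keeping file order.

    The file is assumed to be ordered most critical first already,
    so no re-sorting is done here.
    """
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        code, priority, message = row
        entries.append(StatusEntry(int(code), priority.strip(), message))
    return entries
```

The csv module handles the quoted message field, so commas inside a message do not split it.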

This status file is not a log file: a condition should be removed from the file
as soon as the agent/service recovers from it.

This extra status querying mechanism will allow the parent to take action in the event
of failures, restarting the agent, sending notifications, or moving the service somewhere else.

Init script recommendation:

When init scripts are used, they would be responsible for translating our
internal codes to LSB ones and exposing the one considered most critical:

http://refspecs.linux-foundation.org/LSB_3.2.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

Codes 0-4 are already handled via the process status and the pid file.

Codes 150-199 could be used, as the LSB reserves them for application-specific use.

150) Agent child processes died unexpectedly and need recovery
151) Broker connectivity is lost
...
...
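
The translation step could look roughly like the following Python sketch. It is only an illustration: the function name lsb_status_code is hypothetical, it assumes the internal codes already fall in the 150-199 range as in the examples above, and it treats an empty file (or a leading code of 0) as healthy.

```python
import csv
import io

# Application-specific status range from the LSB init script specification.
LSB_APP_MIN, LSB_APP_MAX = 150, 199


def lsb_status_code(status_text: str) -> int:
    """Return the code a `status` action could exit with.

    Only the first non-blank line is considered, because the file is
    assumed to be ordered most critical first.
    """
    for row in csv.reader(io.StringIO(status_text)):
        if not row:  # skip blank lines
            continue
        code = int(row[0])
        if code == 0:
            return 0
        if not LSB_APP_MIN <= code <= LSB_APP_MAX:
            raise ValueError("status code %d is outside 150-199" % code)
        return code
    return 0  # no conditions listed: running and healthy
```

An init script would call this only after the usual pid file checks have established that the process is running, since codes 0-4 are decided there.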

Blueprint information

Status:
Complete
Approver:
Mark McClain
Priority:
Undefined
Drafter:
Miguel Angel Ajo
Direction:
Needs approval
Assignee:
Miguel Angel Ajo
Definition:
Obsolete
Series goal:
None
Implementation:
Started
Milestone target:
next
Started by
Miguel Angel Ajo
Completed by
Armando Migliaccio

Whiteboard

Dec-16-2015(armax): If someone is interested in pursuing it, this must be re-submitted according to guidelines defined in [1].

[1] http://docs.openstack.org/developer/neutron/policies/blueprints.html

---------------

How would these codes be set?

Mark McClain

---------------

The errors would be set from inside the agent, under error circumstances,
and cleared when the error circumstances have passed. I'd write a custom
class to handle the output status set/clear operation.

I propose adding a --status-file option to the agents,

for example:
       neutron-dhcp-agent --status-file /var/lib/neutron/dhcp/status

This file would be populated with the current status number plus an optional textual description.

1) In normal conditions
0,OK

2) In error conditions, we could dump a list of the current error conditions

a)

150,"A child process has died unexpectedly and needs recovery"

b)

151,"Message broker connectivity lost"

c) Several errors, one per line, ordered by severity:

151,"Message broker connectivity lost"
150,"A child process has died unexpectedly and needs recovery"

The LSB init script specification reserves status codes 150 to 199 for application-specific use. Alternatively, we could dump internal status codes 0..N, which an init script could translate to 0 and 150..199.

The status file should be deleted at process exit.
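
A rough Python sketch of the set/clear helper described above follows. Everything in it is hypothetical (the class name StatusFile, its methods, and the integer severity argument used only for ordering); it is not existing Neutron code. It writes the two-field format from the examples and replaces the file atomically so a reader never sees a half-written status.

```python
import os
import tempfile


class StatusFile:
    """Hypothetical helper that mirrors active error conditions to a file."""

    def __init__(self, path):
        self.path = path
        self._conditions = {}  # code -> (severity, message)
        self._write()  # start out as "0,OK"

    def set(self, code, message, severity=0):
        """Record an active condition; higher severity is listed first."""
        self._conditions[code] = (severity, message)
        self._write()

    def clear(self, code):
        """Drop a condition once the agent has recovered from it."""
        self._conditions.pop(code, None)
        self._write()

    def remove(self):
        """Delete the status file, e.g. at process exit."""
        try:
            os.unlink(self.path)
        except FileNotFoundError:
            pass

    def _render(self):
        if not self._conditions:
            return "0,OK\n"
        ordered = sorted(self._conditions.items(), key=lambda kv: -kv[1][0])
        return "".join(
            '%d,"%s"\n' % (code, message.replace('"', '""'))
            for code, (_severity, message) in ordered
        )

    def _write(self):
        # Write to a temporary file in the same directory, then rename it
        # over the target so the update is atomic.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            handle.write(self._render())
        os.replace(tmp_path, self.path)
```

An agent could register the remove method with atexit.register so the file is deleted on normal exit. A crash would still leave a stale file behind, which is one reason to keep the pid file check as the first step.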

Miguel Angel Ajo
-------

Overall, I think it is a good idea to be able to query status of the agents at a deeper level than we can now. I think this blueprint has value.

I don't think the status of child processes or of broker connectivity is known internally to the agents now, is it? How do you plan to address this and detect these conditions?

Do you have any more examples besides child processes and broker connectivity? Those are a good start, though.

In the case of multiple status codes, which will be returned when I run "status neutron-dhcp-agent"?

While I understand the need to query status locally from the host (external monitoring/pacemaker), do you have any plans or thoughts about reporting this status in a way that it could be queried through the neutron api? In general, it would be strange to see a ":-)" in the status column of "neutron agent list" but know through other monitoring channels that there is a problem with the agent. This wouldn't necessarily need to be addressed in this blueprint. I think it could be a good follow-on blueprint.

Carl Baldwin
-------

Hi Carl, thanks for the comments,

Currently I think we could easily detect when broker connectivity is gone (that's probably the first thing to go for, as the first type of error in the patch itself).

I was planning to file a second blueprint on top of this one to provide the broken-child status.

I can't come up with more error statuses at this moment, but if we extend this "service status" to neutron-server itself using the same mechanism, I can think of reporting a high-load status, or database connection problems (not sure if this last one provides more value than).

For a multiple-status condition, if the parent is an init.d script, the script should decide
which condition has the highest priority, and expose or translate the internal code to an LSB
standard one. If the parent is something else (neutron code itself doesn't include any init
script, if I'm not mistaken), it could handle the whole list. Maybe it makes sense to include a
field that says how critical the failure is (warning, error, critical....).

Exposing the agents' error conditions to neutron-server via RPC actually makes sense.
It's a good reason to make the code design extensible so it could be used for this kind of situation.

Miguel Ángel Ajo
-----------------------------

Sounds good so far. Maybe you could reflect the results of these discussions with a little more detail in the specification above.

Carl
-------

Done :-),

Thanks a lot for the feedback

Miguel Ángel
--------------

Gerrit topic: https://review.openstack.org/#q,topic:bp/agent-service-status,n,z

Addressed by: https://review.openstack.org/74045
    Provide agent status via status file (WIP)
