Cinder volume service HA improvements

Registered by Mingyan Bao

Summary
------------
Currently, in a multi-backend setup, if one backend fails to start, the whole Cinder volume service fails. This causes problems when the volume service is part of an HA setup such as a Pacemaker cluster. The end result is that a single backend failure will eventually cause the whole monitored cluster to either shut down or fail over to a backup, which could hit the exact same failure if the failing backend isn't fixed.

Details
--------------------
The Cinder volume service starts a child process for each configured storage driver. If a driver's configuration is incorrect, its child process fails to start and an error is logged. Even though the child processes for the other drivers start normally, the Cinder volume service returns a “Failed” status when queried.

The Cinder volume service should report a “Warning” status instead of a “Failed” status when the child process for a driver configuration fails to start. That way the volume service will not be restarted by HA monitoring software, and a single backend failure will not affect the backends that started fine.
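The proposed status mapping can be sketched as follows. This is a minimal illustration, not Cinder code: the `ServiceStatus` enum and the `aggregate_status` function are hypothetical names, and treating "every backend failed" or "no backends configured" as “Failed” is an assumption that the blueprint does not spell out.

```python
from enum import Enum
from typing import Dict, Optional


class ServiceStatus(Enum):
    """Hypothetical service-level states, named after the blueprint text."""
    OK = "OK"
    WARNING = "Warning"
    FAILED = "Failed"


def aggregate_status(backend_errors: Dict[str, Optional[str]]) -> ServiceStatus:
    """Map per-backend startup results to one service-level status.

    backend_errors maps a backend name to its startup error message,
    or to None if that backend started cleanly.
    """
    if not backend_errors:
        # Assumption: with no configured backends nothing can be served.
        return ServiceStatus.FAILED

    failed = [name for name, err in backend_errors.items() if err is not None]
    if not failed:
        return ServiceStatus.OK
    if len(failed) == len(backend_errors):
        # Assumption: if every backend failed, the service is unusable.
        return ServiceStatus.FAILED
    # Some backends are up, so report a warning rather than a failure.
    return ServiceStatus.WARNING
```

With this mapping, an HA monitor that restarts or fails over only on “Failed” would leave a partially working volume service alone.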

When the Cinder volume service fails to start a backend driver configuration, it should send an error event via the message bus in addition to writing the error to the log file.

Right now the only way to figure out which backend is failing is to read the log file. The current behavior makes Cinder management and configuration difficult because it’s hard to determine programmatically which driver configuration caused the issue and whether that is why the volume service is in a bad state.

At a minimum, the Cinder volume service should send a RabbitMQ message when a backend cannot be started. The message body should include the driver configuration information and any error messages encountered. This would allow management UIs or management services to determine which backend drivers are configured incorrectly and alert the user to fix them.
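One possible shape for such a message is sketched below. Everything here is an assumption made for illustration: the event type string, the field names, and the `publish` callable (which stands in for whatever message bus client the service would use) are hypothetical and are not taken from Cinder.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

# Hypothetical event type; not an existing Cinder notification name.
EVENT_TYPE = "volume.backend.start.error"


def build_backend_error_event(host: str, backend: str,
                              driver_config: Dict[str, Any],
                              errors: List[str]) -> Dict[str, Any]:
    """Build an error event describing a backend that failed to start."""
    return {
        "event_type": EVENT_TYPE,
        "priority": "ERROR",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": {
            "host": host,
            "backend": backend,
            # Driver configuration, as the blueprint asks for.
            "driver_config": dict(driver_config),
            # Error messages encountered during startup.
            "errors": list(errors),
        },
    }


def report_backend_failure(publish: Callable[[str], None], host: str,
                           backend: str, driver_config: Dict[str, Any],
                           errors: List[str]) -> Dict[str, Any]:
    """Serialize the event as JSON and hand it to the publish callable."""
    event = build_backend_error_event(host, backend, driver_config, errors)
    publish(json.dumps(event, sort_keys=True))
    return event
```

A management service consuming these messages could decode the body with `json.loads` and read `payload["backend"]` and `payload["errors"]` to identify the misconfigured driver without parsing log files.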

Blueprint information

Status: Complete
Approver: None
Priority: Undefined
Drafter: Mingyan Bao
Direction: Needs approval
Assignee: None
Definition: Obsolete
Series goal: None
Implementation: Unknown
Milestone target: None
Completed by: Sean McGinnis

Whiteboard

(smcginnis): Marking obsolete as this has been sitting out there for a long time. If this is still needed, please submit a new bp.
