Cinder volume service HA improvements
Summary
-------
Currently, in a multi-backend setup, if one backend fails to start, the whole Cinder volume service fails. This causes problems when the volume service is part of an HA setup such as a Pacemaker cluster. The end result is that a single backend failure eventually causes the whole monitored cluster to either shut down or fail over to a backup, which could hit the exact same failure if the failing backend isn't fixed.
Details
-------
The Cinder volume service starts a child process for each configured storage driver. If a driver configuration is incorrect, its child process fails to start and an error is logged. Even though the child processes for the other drivers start successfully, the Cinder volume service returns a “Failed” status when queried.
The Cinder volume service should report a “Warning” status instead of a “Failed” status when the child process for a driver configuration fails to start. That way, HA monitoring software will not restart the volume service, and a single backend failure will not affect the backends that started successfully.
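The proposed status rule can be sketched as a small aggregation function. This is a minimal illustration, not Cinder code: the `ServiceStatus` enum and the `aggregate_status` function are hypothetical names, and treating "no backend started at all" as “Failed” is an assumption that the text above does not spell out.

```python
from enum import Enum


class ServiceStatus(Enum):
    """Hypothetical service-level statuses an HA monitor could query."""
    OK = "ok"
    WARNING = "warning"
    FAILED = "failed"


def aggregate_status(backend_started):
    """Map per-backend start results to one service-level status.

    backend_started: dict of backend name -> True if its child process started.
    """
    if not backend_started or not any(backend_started.values()):
        # Assumption: with no working backend at all, a restart or
        # failover is still justified, so report a hard failure.
        return ServiceStatus.FAILED
    if all(backend_started.values()):
        return ServiceStatus.OK
    # At least one backend is up and at least one is down: degraded,
    # but not a reason to restart the whole service.
    return ServiceStatus.WARNING
```

With this rule, `aggregate_status({"lvm-1": True, "ceph-1": False})` evaluates to `ServiceStatus.WARNING`, so a monitor that only reacts to `FAILED` would leave the service running.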
When the Cinder volume service fails to start a backend driver configuration, it should send an error event over the message bus in addition to writing the error to the log file.
Right now, the only way to find out which backend is failing is to read the log file. This makes Cinder management and configuration difficult, because it is hard to determine programmatically which driver configuration caused the problem and whether that is why the volume service is in a bad state.
At a minimum, the Cinder volume service should send a RabbitMQ message when a backend cannot be started. The message body should include the driver configuration information and any error messages encountered. This would allow management UIs and management services to determine which backend drivers are configured incorrectly and alert the user to fix them.
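One possible shape for such a message is sketched below. Everything here is an assumption made for illustration: the event type string, the field names, and the helper functions are hypothetical and are not an existing Cinder notification. Masking options whose names look like credentials is also an added assumption, since driver configuration often contains passwords. The sketch only builds and inspects the message body; publishing it to the message bus is left out.

```python
import json
from datetime import datetime, timezone

# Hypothetical event type; not an existing Cinder notification name.
EVENT_TYPE = "volume.backend.start.error"

# Assumption: option names containing these markers hold credentials.
SENSITIVE_MARKERS = ("password", "secret", "key")


def build_backend_failure_event(host, backend_name, driver_config, errors):
    """Build a JSON-serializable error event for one failed backend."""
    safe_config = {
        option: "***"
        if any(marker in option.lower() for marker in SENSITIVE_MARKERS)
        else value
        for option, value in driver_config.items()
    }
    return {
        "event_type": EVENT_TYPE,
        "priority": "ERROR",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": {
            "host": host,
            "backend": backend_name,
            "driver_config": safe_config,
            "errors": list(errors),
        },
    }


def failing_backends(events):
    """Return the sorted names of backends reported as failed."""
    return sorted(
        {e["payload"]["backend"] for e in events if e.get("event_type") == EVENT_TYPE}
    )


if __name__ == "__main__":
    event = build_backend_failure_event(
        host="cinder-node-1",
        backend_name="lvm-1",
        driver_config={"volume_group": "cinder-volumes", "san_password": "hunter2"},
        errors=["Volume group cinder-volumes does not exist"],
    )
    print(json.dumps(event, indent=2))
```

A management service that consumes these events could call `failing_backends` on what it has received to list the misconfigured backends without parsing log files.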
Blueprint information
- Status: Complete
- Approver: None
- Priority: Undefined
- Drafter: Mingyan Bao
- Direction: Needs approval
- Assignee: None
- Definition: Obsolete
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by:
- Completed by: Sean McGinnis
Whiteboard
(smcginnis): Marking obsolete as this has been sitting out there for a long time. If this is still needed, please submit a new bp.