Take a node out of service if no active conductor supports the node's driver
The API won't allow clients to register a node with an invalid driver (one that does not exist or is not present on any active conductor). However, conductors can go offline at some point, leaving nodes with a driver that is no longer valid. The intention of this blueprint is to make sure that every node registered with an invalid driver gets marked as out-of-service.
Marking a node as out-of-service should also remove the node from the scheduler immediately, to avoid a retry-fail loop[1].
Here are two ideas for marking the node as out-of-service:
1 (Simpler) - Have a periodic task that gets a list of active drivers and iterates through the list of registered nodes, checking whether each node's driver is still valid.
2 - The consistent hashing algorithm[2] maps conductors to nodes, taking into account the node's driver and the list of drivers that each active conductor has. The algorithm is also responsible for maintaining a list of dead conductors. Every time a conductor goes offline, it should trigger a task that first checks whether the dead conductor's drivers are still present on any other active conductor. If a driver is no longer present, the task should fetch the list of nodes that need that driver and mark them as out-of-service.
[1] https:/
[2] https:/
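As a rough illustration of the two ideas above, here is a minimal sketch in plain Python. It does not use Ironic's real objects or database API: the dictionaries, the function names, and the sample driver names are all assumptions made for the example, and the actual "mark as out-of-service" step is left out.

```python
from typing import Dict, List, Set

# Hypothetical in-memory stand-ins; none of these names come from Ironic.
Conductors = Dict[str, Set[str]]  # conductor hostname -> drivers it has loaded
Nodes = Dict[str, str]            # node UUID -> driver the node requires


def active_drivers(conductors: Conductors) -> Set[str]:
    """Return the union of drivers offered by all active conductors."""
    drivers: Set[str] = set()
    for loaded in conductors.values():
        drivers |= loaded
    return drivers


def nodes_without_driver(nodes: Nodes, conductors: Conductors) -> List[str]:
    """Idea 1: periodic sweep over every registered node."""
    available = active_drivers(conductors)
    return sorted(uuid for uuid, driver in nodes.items()
                  if driver not in available)


def nodes_orphaned_by(dead: str, conductors: Conductors,
                      nodes: Nodes) -> List[str]:
    """Idea 2: react to one conductor going offline.

    `conductors` is the active set as it was *before* `dead` went away.
    """
    survivors = {host: drv for host, drv in conductors.items() if host != dead}
    # Drivers that only the dead conductor was providing.
    lost = conductors.get(dead, set()) - active_drivers(survivors)
    return sorted(uuid for uuid, driver in nodes.items() if driver in lost)


if __name__ == "__main__":
    conductors = {"c1": {"pxe_ipmitool", "fake"}, "c2": {"fake"}}
    nodes = {"n1": "pxe_ipmitool", "n2": "fake", "n3": "pxe_ssh"}
    print(nodes_without_driver(nodes, conductors))
    print(nodes_orphaned_by("c1", conductors, nodes))
```

The first function touches every node on each run, while the second only looks at drivers that disappeared with one conductor. Neither sketch says which process should run the check.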
Blueprint information
- Status:
- Not started
- Approver:
- aeva black
- Priority:
- Undefined
- Drafter:
- Lucas Alvares Gomes
- Direction:
- Needs approval
- Assignee:
- Lucas Alvares Gomes
- Definition:
- New
- Series goal:
- None
- Implementation:
- Unknown
- Milestone target:
- None
- Started by
- Completed by
Related branches
Related bugs
Bug #1526735: [RFE] Take a node out of service if no active conductors supports the node's driver | Confirmed |
Sprints
Whiteboard
"the API won't allow clients to register a node with an invalid driver"
-- I tested this today, and the API still allowed it. So I have filed this review to fix it:
https:/
I see a problem with both your proposed solutions.
[1] where does this periodic_task run? If it runs on all conductors, which one decides what nodes to mark offline?
[2] again, which surviving conductor is responsible for marking the nodes-now-
Take the extreme case -- what if all conductors are offline? Then all nodes are unavailable, since the hash won't map any node anywhere (there will be no drivers in the ring, right?).
I do not think this should be "conductor marks a node inactive in the database". Instead, I think we need to:
1) ensure that the nova driver only gets a list of actually-available nodes, and will remove no-longer-available nodes from its list, during each cycle where it refreshes the view of available resources
2) gracefully handle requests to the API to manage nodes which no longer have any active conductor.
I think that patch https:/
Just my thoughts,
Devananda 2014-01-20
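The two points in the comment above (only hand out actually-available nodes on each refresh, and handle API requests for nodes with no active conductor gracefully) could be sketched roughly as follows. This is plain Python with hypothetical names (`refresh_available_nodes`, `require_conductor`, `NoActiveConductor`); it is not the Nova driver's or the Ironic API's real code.

```python
from typing import Dict, Set, Tuple


class NoActiveConductor(Exception):
    """Hypothetical error: no active conductor supports the node's driver."""


def refresh_available_nodes(previous: Set[str], nodes: Dict[str, str],
                            active_drivers: Set[str]
                            ) -> Tuple[Set[str], Set[str]]:
    """Recompute the available-node view for one refresh cycle.

    Returns (current, removed): `current` holds the nodes whose driver is
    served by some active conductor, and `removed` holds the nodes that
    were in the previous view but are no longer available.
    """
    current = {uuid for uuid, driver in nodes.items()
               if driver in active_drivers}
    return current, previous - current


def require_conductor(node_uuid: str, nodes: Dict[str, str],
                      active_drivers: Set[str]) -> str:
    """Return the node's driver, or fail with a clear error for the client."""
    driver = nodes[node_uuid]
    if driver not in active_drivers:
        raise NoActiveConductor(
            f"node {node_uuid}: no active conductor supports {driver!r}")
    return driver
```

With an empty set of active drivers (the all-conductors-offline case), `refresh_available_nodes` returns an empty view and reports every previously known node as removed, without anything being written to the database.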
We're moving from using blueprints to track features to RFE bugs. I've filed one for your change (see related bugs section). Please track further work there using Closes-Bug, Partial-Bug or Related-Bug in commit messages and use this newly created RFE bug.
//vdrok 2015-12-16