Take a node out of service if no active conductor supports the node's driver

Registered by Lucas Alvares Gomes

First, the API won't allow clients to register a node with an invalid driver (non-existent or not present on any of the active conductors). However, conductors can go offline at any point, leaving already-registered nodes with a driver that is no longer valid. The intention of this blueprint is to make sure that every node registered with an invalid driver gets marked as out-of-service.

Marking a node as out-of-service should also remove the node from the scheduler immediately, to avoid a retry-fail loop [1].

Here are two ideas for marking the node as out-of-service:

1 (Simpler) - Have a periodic task that gets the list of active drivers and iterates through the list of registered nodes, checking whether each node's driver is still valid.

2 - The consistent hashing algorithm [2] maps conductors to nodes, taking into account each node's driver and the list of drivers that each active conductor has. The algorithm is also responsible for maintaining a list of dead conductors. Every time a conductor goes offline, it should trigger a task that first checks whether the drivers the dead conductor had are still present on any other active conductor. If a driver is no longer present, the task should fetch the list of nodes that need that driver and mark them as out-of-service.
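
As a rough illustration of the two ideas above, here is a minimal in-memory sketch in Python. It is not Ironic code: the `Conductor` and `Node` classes, the `out_of_service` flag, and the function names are all hypothetical, the driver names in the usage note are only examples, and a real implementation would read and write this state through the database and the hash ring.

```python
from dataclasses import dataclass


@dataclass
class Conductor:
    # Hypothetical stand-in for a conductor record.
    hostname: str
    drivers: set
    online: bool = True


@dataclass
class Node:
    # Hypothetical stand-in for a registered node.
    uuid: str
    driver: str
    out_of_service: bool = False


def active_drivers(conductors):
    """Return the union of drivers offered by conductors that are online."""
    drivers = set()
    for conductor in conductors:
        if conductor.online:
            drivers |= conductor.drivers
    return drivers


def sweep_nodes(nodes, conductors):
    """Idea 1: periodic sweep over every registered node.

    Marks each node whose driver is not offered by any online conductor
    and returns the UUIDs of the nodes newly marked out-of-service.
    """
    valid = active_drivers(conductors)
    marked = []
    for node in nodes:
        if node.driver not in valid and not node.out_of_service:
            node.out_of_service = True
            marked.append(node.uuid)
    return marked


def on_conductor_offline(dead, nodes, conductors):
    """Idea 2: react to a single conductor going offline.

    Only the drivers that the dead conductor had, and that no other
    online conductor still offers, are considered.
    """
    dead.online = False
    orphaned = dead.drivers - active_drivers(conductors)
    marked = []
    for node in nodes:
        if node.driver in orphaned and not node.out_of_service:
            node.out_of_service = True
            marked.append(node.uuid)
    return marked
```

For example, with conductor `c1` offering `{"pxe_ipmitool", "fake"}` and `c2` offering `{"fake"}`, calling `on_conductor_offline(c1, nodes, [c1, c2])` would mark only the nodes that use `pxe_ipmitool`, because `fake` is still available on `c2`. The sketch does not decide which conductor should run either task.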

[1] https://bugs.launchpad.net/ironic/+bug/1260099
[2] https://blueprints.launchpad.net/ironic/+spec/instance-mapping-by-consistent-hash

Blueprint information

Status:
Not started
Approver:
aeva black
Priority:
Undefined
Drafter:
Lucas Alvares Gomes
Direction:
Needs approval
Assignee:
Lucas Alvares Gomes
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Whiteboard

"the API won't allow clients to register a node with an invalid driver"
-- I tested this today, and the API still allowed it. So I have filed this review to fix it:
   https://review.openstack.org/68018

I see a problem with both your proposed solutions.
[1] Where does this periodic_task run? If it runs on all conductors, which one decides which nodes to mark offline?
[2] Again, which surviving conductor is responsible for marking the nodes-now-owned-by-no-one as dead?

Take the extreme case -- what if all conductors are offline? Then all nodes are unavailable, since the hash won't map any node anywhere (there will be no drivers in the ring, right?).

I do not think this should be "conductor marks a node inactive in the database". Instead, I think we need to:
1) ensure that the nova driver only gets a list of actually-available nodes, and will remove no-longer-available nodes from its list, during each cycle where it refreshes the view of available resources
2) gracefully handle requests to the API to manage nodes which no longer have any active conductor.

I think that patch https://review.openstack.org/68018 goes to some degree to handle (2), but it may need more work. I suspect we have the means already (or most of it) to do (1) as well, but not sure if that's in the Nova driver or not.

Just my thoughts,
Devananda 2014-01-20

We're moving from using blueprints to track features to RFE bugs. I've filed one for your change (see related bugs section). Please track further work there using Closes-Bug, Partial-Bug or Related-Bug in commit messages and use this newly created RFE bug.
//vdrok 2015-12-16

