Distributed host discovery with leader election

Registered by sean mooney

Today Nova supports automatic, periodic host discovery and cell mapping via the Nova scheduler.

This is achieved via a periodic task that can be enabled in at most one scheduler instance at a time.
The reason for this limitation is that, while the DB will enforce correctness, there is no synchronisation
between scheduler instances.
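
For example, the periodic is enabled via the scheduler's host discovery interval in nova.conf
(the interval value below is illustrative; the option is disabled by default):

    [scheduler]
    # Run host discovery and cell mapping every 5 minutes. The default of -1
    # disables the periodic, which is why it is normally turned on in only
    # one scheduler instance today.
    discover_hosts_in_cells_interval = 300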

As a result, if multiple schedulers are deployed with the same config and the periodic is enabled,
then when a new host is added but not yet mapped to a cell,
each scheduler instance can race to map the compute node.

Today, this race is guarded by the DB constraints and results in one or more of
the scheduler instances raising a HostMappingExists exception and logging a warning.

This can be annoying for operators and installer tool authors.

For operators, the warning is more or less noise;
it is complex for them to do the leader election externally,
and it requires that they generate different configurations for different instances of the scheduler.

For installer authors, it is common to want all instances of a stateless service like the scheduler to
share a common config. When Nova is deployed in a k8s environment, generating separate Nova
configs per pod prevents scaling the number of scheduler pods merely by using k8s's native
concepts of Deployments or StatefulSets.

This is because config files shared between the pods that comprise a StatefulSet or Deployment
cannot differ per pod within the replica set (the generic construct that Deployments and StatefulSets are built from).

To this end, this blueprint tracks a minimal change to the Nova scheduler to optimise the behaviour of the host discovery periodic.

To reduce collisions and the overhead of enabling the periodic in multiple schedulers, the periodic will be enhanced to do leader election via lazy consensus.

While there are many ways of doing leader election, such as uptime-based selection, Rendezvous hashing, or a distributed locking or consensus scheme like Raft, a far simpler approach will be taken.

The periodic task will be modified to retrieve a list of the currently up schedulers,
sort them by service.host so that the same stable input yields the same deterministic list,
and select the first service from the list as the leader.
If the current service is not the leader, it will return from the periodic without doing any work.
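
As a rough illustration, a minimal sketch of that leader check might look like the following.
The helper name _is_discovery_leader, its placement, and the use of the servicegroup API for the
liveness check are assumptions for illustration, not the final implementation:

    from nova import objects
    from nova import servicegroup


    def _is_discovery_leader(context, my_host):
        """Return True if this scheduler instance should run host discovery."""
        servicegroup_api = servicegroup.API()
        # All registered nova-scheduler services, excluding disabled ones.
        services = objects.ServiceList.get_by_binary(
            context, 'nova-scheduler', include_disabled=False)
        # Keep only the schedulers that are currently reported as up.
        alive = [svc for svc in services
                 if servicegroup_api.service_is_up(svc)]
        if not alive:
            # No live service records yet; just run the discovery locally.
            return True
        # Sort by host so every scheduler derives the same ordering from
        # the same stable input.
        alive.sort(key=lambda svc: svc.host)
        # The first service in the deterministic ordering is the leader.
        return alive[0].host == my_host

The periodic would then simply return early when this check is False, leaving the actual
discovery and cell mapping logic unchanged.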

While this does not guarantee that two periodic tasks can't run at the same time if there is a split-brain,
that guarantee is not required because of the existing DB constraints. While it is possible that there is a delay
in selecting a new leader while we wait for the current one to be detected as down, any new hosts will be
discovered on a subsequent run. As such, the simple approach is enough to address the pain point with minimal overhead.

Blueprint information

Status:
Started
Approver:
Sylvain Bauza
Priority:
Undefined
Drafter:
sean mooney
Direction:
Approved
Assignee:
sean mooney
Definition:
Approved
Series goal:
Accepted for 2025.1
Implementation:
Started
Milestone target:
None
Started by
Sylvain Bauza

Related branches

Sprints

Whiteboard

[20250107 bauzas] Approved as specless during today's meeting

Gerrit topic: https://review.opendev.org/#/q/topic:bp/distributed-host-discovery

Addressed by: https://review.opendev.org/c/openstack/nova/+/938523
    allow discover host to be enabeld in multiple schedulers


Work Items
