"down" nova-compute service spuriously marked as "up" when disabled/enabled

Bug #1420848 reported by Chris Friesen
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Chris Friesen

Bug Description

I think our usage of the "updated_at" field to determine whether a service is "up" or not is buggy. Consider this scenario:

1) nova-compute is happily running and is up/enabled on compute-0
2) something causes nova-compute to stop (process crash, hardware fault, power failure, network isolation, etc.)
3) a minute later, the nova-compute service is reported as "down"
4) I run "nova service-disable compute-0 nova-compute", then "nova service-enable compute-0 nova-compute"
5) nova-compute is now reported as "up" for the next minute, and the scheduler might try to assign stuff to it. Since it's not actually available, these requests could be delayed by the RPC timeout period.

I wonder if it would make sense to have a separate "last status timestamp" database field that would only get updated when we get a service status update and not when we change any other fields.
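The scenario above can be reproduced with a small simulation. This is only a sketch, not Nova's code: the `ServiceRow` class, the helper functions, and the 60-second threshold are assumptions made for illustration (the threshold is chosen to match the "a minute later" in step 3).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed down-time threshold, matching "a minute later" in the report.
SERVICE_DOWN_TIME = timedelta(seconds=60)


@dataclass
class ServiceRow:
    """Hypothetical stand-in for one row of the services table."""
    disabled: bool
    updated_at: datetime  # bumped on ANY write to the row


def report_state(row, now):
    """Periodic status report sent by a running service."""
    row.updated_at = now


def set_disabled(row, disabled, now):
    """User-triggered disable/enable; this also bumps updated_at."""
    row.disabled = disabled
    row.updated_at = now


def is_up(row, now):
    """Liveness check that relies only on updated_at (the buggy design)."""
    return now - row.updated_at <= SERVICE_DOWN_TIME


t0 = datetime(2015, 1, 1, 12, 0, 0)          # arbitrary start time
row = ServiceRow(disabled=False, updated_at=t0)
report_state(row, t0)                         # step 1: healthy status report
t_later = t0 + timedelta(seconds=95)          # steps 2-3: no reports for 95 s
up_before_toggle = is_up(row, t_later)        # reported "down"
set_disabled(row, True, t_later)              # step 4: disable ...
set_disabled(row, False, t_later + timedelta(seconds=1))  # ... then enable
up_after_toggle = is_up(row, t_later + timedelta(seconds=2))  # step 5
up_a_minute_on = is_up(row, t_later + timedelta(seconds=70))
```

Here `up_before_toggle` is `False`, `up_after_toggle` is `True` even though no status report arrived, and `up_a_minute_on` is `False` again once the threshold has passed since the toggle.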

Tags: compute
Eric Xie (mark-xiett)
Changed in nova:
assignee: nobody → Eric Xie (mark-xiett)
status: New → Incomplete
Chris Friesen (cbf123) wrote :

Just curious, what is "incomplete" about this? Is there more information that I can provide?

Eric Xie (mark-xiett) wrote :

Hi Chris, the "incomplete" is for me. I have already checked the icehouse release, and I still need to check the other branches.

melanie witt (melwitt) wrote :

Eric, the Incomplete bug status means that more information is needed from the reporter before we can triage.

Changed in nova:
importance: Undecided → Low
status: Incomplete → Confirmed
Eric Xie (mark-xiett) wrote :

Sorry, this is my first time fixing a bug.

Changed in nova:
assignee: Eric Xie (mark-xiett) → nobody
melanie witt (melwitt) wrote :

Hi Eric, no worries. You can assign the bug to yourself if you're looking into it. Just be sure to unassign if you decide you don't wish to work on it anymore.

Please see this doc about bug triage to learn how it works: https://wiki.openstack.org/wiki/BugTriage

jichenjc (jichenjc) wrote :

Adding a new field to the service table to record the heartbeat time might be a solution, but it could introduce complexity, such as migrations of the existing DB.
Apart from that, we might not have a good way of checking this. I guess being stale for 1 minute is acceptable?

Chris Friesen (cbf123) wrote :

I don't think it's acceptable, no. Any operation involving the scheduler could end up trying to place an instance on the "down" compute node for that minute.

And if we were enabling the service rather than disabling it (or doing any other operation on the service) then we could end up in a state where the scheduler thinks it's available. That could result in operations taking a long time as they block waiting for the RPC timeout since of course the compute node would never respond.

I think it's clearly a flawed design to rely on automatic database row timestamps for status, when both system and user-triggered operations cause those timestamps to be updated.

Chris Friesen (cbf123)
Changed in nova:
assignee: nobody → Chris Friesen (cbf123)
description: updated
Changed in nova:
status: Confirmed → In Progress
Chris Friesen (cbf123) wrote :

Not sure why it didn't auto-link, but there's a fix proposed at

https://review.openstack.org/163060

Chris Friesen (cbf123)
summary: - nova-compute service spuriously marked as "up" when disabled
+ "down" nova-compute service spuriously marked as "up" when
+ disabled/enabled
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/168418

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/163060
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b9bae02af2168ad64d3b3d28c97c3853cee73272
Submitter: Jenkins
Branch: master

commit b9bae02af2168ad64d3b3d28c97c3853cee73272
Author: Chris Friesen <email address hidden>
Date: Fri Mar 27 09:23:48 2015 -0600

    fix "down" nova-compute service spuriously marked as "up"

    Currently we use the auto-updated "updated_at" field to determine
    whether a service is "up". An end-user can cause the "updated_at"
    field to be updated by disabling or enabling the service, thus
    potentially causing a service that is unavailable to be detected
    as "up". This could result in the scheduler trying to assign
    instances to an unavailable compute node, or in the system
    mistakenly preventing evacuation of an instance.

    The fix is to add a new field to explicitly track the timestamp of
    the last time the service sent in a status report and use that if
    available when testing whether the service is up.

    DocImpact
    This commit will cause a behaviour change for the DB servicegroup
    driver. It will mean that enabling/disabling the service will
    cause the "updated_at" field to change (as before) but that will
    no longer be tied to the "up/down" status of the service. So
    "nova service-list" could show the service as "down" even if it
    shows a recent "updated_at". (But this could happen for the other
    servicegroup drivers already.)

    Closes-Bug: #1420848
    Change-Id: Ied7d47363d0489bca3cf2c711217e1a3b7d24a03
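
The approach described in the commit message (a dedicated timestamp for the last status report, used "if available") might look roughly like the sketch below. It is not the merged code: the field name `last_report_at`, the dict-based row, and the 60-second threshold are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Assumed down-time threshold; not taken from the commit.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(service, now):
    """Prefer the dedicated status-report timestamp when the row has one.

    "last_report_at" is a hypothetical field name. Rows written before
    the new field existed fall back to updated_at, as before the fix.
    """
    last_heartbeat = service.get("last_report_at") or service.get("updated_at")
    if last_heartbeat is None:
        return False
    return now - last_heartbeat <= SERVICE_DOWN_TIME


now = datetime(2015, 1, 1, 12, 0, 0)  # arbitrary reference time

# Crashed five minutes ago, then disabled/enabled one second ago.
crashed_then_toggled = {
    "last_report_at": now - timedelta(minutes=5),
    "updated_at": now - timedelta(seconds=1),
}

# Row without the new field populated, last written ten seconds ago.
legacy_row = {
    "last_report_at": None,
    "updated_at": now - timedelta(seconds=10),
}

toggled_is_up = is_up(crashed_then_toggled, now)
legacy_is_up = is_up(legacy_row, now)
```

With these inputs `toggled_is_up` is `False`, because the recent `updated_at` no longer influences the result, while `legacy_is_up` is `True` through the fallback. This matches the DocImpact note: a service can show as "down" even with a recent "updated_at".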

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Chris Friesen (<email address hidden>) on branch: master
Review: https://review.openstack.org/168418
Reason: No support for the change, abandoning.

Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-1 → 12.0.0