Hypervisor resource usage on source still shows old flavor usage after resize confirm until update_available_resource periodic runs

Bug #1818914 reported by Matt Riedemann
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Low
Matt Riedemann

Bug Description

I actually uncovered this due to some failing functional tests for cross-cell resize:

https://review.openstack.org/#/c/641176/2/nova/compute/resource_tracker.py@503

But this goes back to https://review.openstack.org/#/c/370374/ for bug 1641750 and StarlingX has already fixed it:

https://github.com/starlingx-staging/stx-nova/blob/master/nova/compute/resource_tracker.py#L728

The issue is this: if you set "update_resources_interval" to a higher value, say 10 minutes, then resize an instance and immediately confirm it (for example because "resize_confirm_window" is set to 1 second), the GET /os-hypervisors/{hypervisor_id} results for fields like "local_gb_used", "memory_mb_used" and "vcpus_used" will still show usage for the old flavor, even though the instance is running on the dest host with the new flavor and is gone from the source host. The usage is not corrected until the update_available_resource periodic task runs on the source again (or the source nova-compute service is restarted).
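For reference, a configuration like the following (values taken from the scenario above; the option names are nova's, though the group placement should be double-checked against your release's config reference) maximizes the window in which the stale usage is visible:

```ini
[DEFAULT]
# Long periodic interval: stale usage can persist for up to 10 minutes.
update_resources_interval = 600
# Auto-confirm resizes almost immediately, before the periodic runs.
resize_confirm_window = 1
```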

This is because the source compute's resource tracker is not tracking the migration in its "tracked_migrations" dict. The resize claim happens on the dest host, and that's where the migration is "tracked". The ResourceTracker on the source is tracking the instance in 'tracked_instances' rather than 'tracked_migrations'.

On the source host when the RT code is called here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1063

"tracked = uuid in self.tracked_instances" will be True because the instance is on the source until it gets resized to the dest and then the host value changes here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4500

But in the RT this means we won't get the itype here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1125

So the source RT doesn't track the migration here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1146

This is important because later in confirm_resize (on the source host) when it calls RT.drop_move_claim:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4014

That will only update resource usage and decrement the old flavor if it's a tracked migration:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L478

As noted from the TODO in the elif block below:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L489
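The chain of events above can be condensed into a small sketch. This is a simplified, hypothetical model of the ResourceTracker bookkeeping described in the links, not nova's real implementation; the method names and signatures here are invented for illustration only:

```python
# Hypothetical, heavily simplified model of the source-host ResourceTracker
# behavior described in this bug. Real nova code differs substantially.

class ResourceTracker:
    def __init__(self):
        self.tracked_instances = set()   # instance UUIDs claimed on this host
        self.tracked_migrations = {}     # migration id -> old flavor

    def update_usage_for_migration(self, instance_uuid, migration_id, old_flavor):
        # On the source host the instance is still in tracked_instances, so
        # "tracked = uuid in self.tracked_instances" is True, no itype is
        # computed, and the migration never lands in tracked_migrations.
        if instance_uuid in self.tracked_instances:
            return  # <-- the bug: migration never becomes "tracked" here
        self.tracked_migrations[migration_id] = old_flavor

    def drop_move_claim(self, migration_id):
        # Usage for the old flavor is only decremented for tracked migrations.
        old_flavor = self.tracked_migrations.pop(migration_id, None)
        if old_flavor is not None:
            return old_flavor  # usage freed
        return None            # usage silently left in place until the periodic
```

With the instance still in tracked_instances, drop_move_claim finds nothing to free, which is why the stale old-flavor usage lingers until the next periodic run.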

This is semi-low priority given how latent it is and the fact that it's self-healing, since the next run of the update_available_resource periodic will fix the resource usage on the source host. In a busy cloud, though, it could mean the difference between a server being able to build on that source host or not, based on its tracked resource usage.

Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
tags: added: starlingx
Revision history for this message
Matt Riedemann (mriedem) wrote :

Part of this is also sort of reported in a very old bug 1498126.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/641521

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/641806

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/641521
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=54877e06f13b13134f2030557cd779f947a43c24
Submitter: Zuul
Branch: master

commit 54877e06f13b13134f2030557cd779f947a43c24
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 6 18:46:22 2019 -0500

    Add functional recreate test for bug 1818914

    The confirm resize flow in the compute manager
    runs on the source host. It calls RT.drop_move_claim
    to drop resource usage from the source host for the
    old flavor. The problem with drop_move_claim is it
    only decrements the old flavor from the reported usage
    if the instance is in RT.tracked_migrations, which will
    only be there on the source host if the update_available_resource
    periodic task runs before the resize is confirmed, otherwise
    the instance is still just tracked in RT.tracked_instances on
    the source host. This leaves the source compute incorrectly
    reporting resource usage for the old flavor until the next
    periodic runs, which could be a large window if resizes are
    configured to automatically confirm, e.g. resize_confirm_window=1,
    and the periodic interval is big, e.g. update_resources_interval=600.

    This change adds a functional recreate test for the bug which will
    be updated in the change that fixes the bug.

    Change-Id: I4aac187283c2f341b5c2712be85f722156e14f63
    Related-Bug: #1818914
    Related-Bug: #1498126

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/641806
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Submitter: Zuul
Branch: master

commit ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 7 16:07:18 2019 -0500

    Update usage in RT.drop_move_claim during confirm resize

    The confirm resize flow in the compute manager
    runs on the source host. It calls RT.drop_move_claim
    to drop resource usage from the source host for the
    old flavor. The problem with drop_move_claim is it
    only decrements the old flavor from the reported usage
    if the instance is in RT.tracked_migrations, which will
    only be there on the source host if the update_available_resource
    periodic task runs before the resize is confirmed, otherwise
    the instance is still just tracked in RT.tracked_instances on
    the source host. This leaves the source compute incorrectly
    reporting resource usage for the old flavor until the next
    periodic runs, which could be a large window if resizes are
    configured to automatically confirm, e.g. resize_confirm_window=1,
    and the periodic interval is big, e.g. update_resources_interval=600.

    This fixes the issue by also updating usage in drop_move_claim
    when the instance is not in tracked_migrations but is in
    tracked_instances.

    Because of the tight coupling with the instance.migration_context
    we need to ensure the migration_context still exists before
    drop_move_claim is called during confirm_resize, so a test wrinkle
    is added to enforce that.

    test_drop_move_claim_on_revert also needed some updating for
    reality because of how drop_move_claim is called during
    revert_resize.

    And finally, the functional recreate test is updated to show the
    bug is fixed.

    Change-Id: Ia6d8a7909081b0b856bd7e290e234af7e42a2b38
    Closes-Bug: #1818914
    Related-Bug: #1641750
    Related-Bug: #1498126

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/744958

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/741995
Reason: Abandoning this in favor of https://review.opendev.org/#/c/744958/8 which is the same code change but with a better commit message that provides context on a bug the change also addresses.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/744958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Submitter: Zuul
Branch: master

commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
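The reordering described in the commit above can be sketched as follows. All names here (Migration, driver_confirm, drop_move_claim as passed-in callables) are hypothetical stand-ins, not nova's real API; the point is only the ordering of the unset relative to the slow virt driver call:

```python
# Hypothetical sketch of the race and its fix: unset the stashed old flavor
# only after accounting is done, not before the slow virt driver call.

class Migration:
    def __init__(self, old_flavor):
        self.old_flavor = old_flavor

def confirm_resize_buggy(instance, migration, driver_confirm, drop_move_claim):
    migration.old_flavor = None           # unset too early...
    driver_confirm(instance)              # ...periodic may run in this window
    drop_move_claim(instance, migration)  # finds no old flavor to free

def confirm_resize_fixed(instance, migration, driver_confirm, drop_move_claim):
    driver_confirm(instance)              # slow call first
    drop_move_claim(instance, migration)  # old flavor still available
    migration.old_flavor = None           # unset once accounting is complete
```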

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751352

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/751367

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/751352
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Submitter: Zuul
Branch: stable/ussuri

commit ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
    (cherry picked from commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/751367
Committed: https://opendev.org/openstack/nova/commit/75f9b288f8edfd24affe5ecbc1f3efb6a63726e4
Submitter: "Zuul (22348)"
Branch: stable/train

commit 75f9b288f8edfd24affe5ecbc1f3efb6a63726e4
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    Conflicts:
      nova/compute/manager.py

    NOTE(stephenfin): Conflicts are due to a number of missing cross-cell
    resize changes.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
    (cherry picked from commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e)
    (cherry picked from commit ce95af2caf69cb1b650459718fd4fa5f00ff28f5)

tags: added: in-stable-train