OpenStack Compute (nova)

Bug #1879878
Comment 18 for bug 1879878

OpenStack Infra (hudson-openstack) wrote on 2020-09-11: Fix merged to nova (master)

Reviewed: https://review.opendev.org/747746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc9c7a5ebf11253f86127238d33dff7401465155
Submitter: Zuul
Branch: master

commit dc9c7a5ebf11253f86127238d33dff7401465155
Author: Stephen Finucane <email address hidden>
Date: Fri Aug 21 17:43:36 2020 +0100

Move revert resize under semaphore

    As discussed in change I26b050c402f5721fc490126e9becb643af9279b4, the
    resource tracker's periodic task is reliant on the status of migrations
    to determine whether to include usage from these migrations in the
    total, and races between setting the migration status and decrementing
    resource usage via 'drop_move_claim' can result in incorrect usage.
    That change tackled the confirm resize operation. This one changes the
    revert resize operation, and is a little trickier due to kinks in how
    both the same-cell and cross-cell resize revert operations work.

    For same-cell resize revert, the 'ComputeManager.revert_resize'
    function, running on the destination host, sets the migration status to
    'reverted' before dropping the move claim. This exposes the same race
    that we previously saw with the confirm resize operation. It then calls
    back to 'ComputeManager.finish_revert_resize' on the source host to boot
    up the instance itself. This is kind of weird, because, even ignoring
    the race, we're marking the migration as 'reverted' before we've done
    any of the necessary work on the source host.

    The cross-cell resize revert splits dropping of the move claim and
    setting of the migration status between the source and destination host
    tasks. Specifically, we do cleanup on the destination and drop the move
    claim first, via 'ComputeManager.revert_snapshot_based_resize_at_dest',
    before resuming the instance and setting the migration status on the
    source via
    'ComputeManager.finish_revert_snapshot_based_resize_at_source'. This
    would appear to avoid the weird quirk of same-cell migration; however,
    in typical weird cross-cell fashion, these are actually different
    instances and different migration records.

    The solution is once again to move the setting of the migration status
    and the dropping of the claim under 'COMPUTE_RESOURCE_SEMAPHORE'. This
    introduces the weird setting of migration status before completion to
    the cross-cell resize case and perpetuates it in the same-cell case, but
    this seems like a suitable compromise to avoid attempts to do things
    like unplugging already unplugged PCI devices or unpinning already
    unpinned CPUs. From an end-user perspective, instance state changes are
    what really matter, and once a revert is completed on the destination
    host and the instance has been marked as having returned to the source
    host, hard reboots can help us resolve any remaining issues.

    Change-Id: I29d6f4a78c0206385a550967ce244794e71cef6d
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1879878
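
The race described in the commit message can be illustrated with a toy model. This is only a sketch and not nova's code: 'ToyResourceTracker', 'Migration', 'periodic_update', 'racy_revert' and 'safe_revert' are hypothetical names, and a plain 'threading.Lock' stands in for whatever locking nova applies under 'COMPUTE_RESOURCE_SEMAPHORE'. The model assumes the periodic task rebuilds usage from migrations that are still in progress, and that dropping a claim fails if the CPUs are already unpinned.

```python
import threading
from dataclasses import dataclass

# Stand-in for nova's lock of the same name (assumption: a simple mutex).
COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()


@dataclass
class Migration:
    """Hypothetical, heavily simplified migration record."""
    uuid: str
    cpus: frozenset
    status: str = "reverting"


class ToyResourceTracker:
    """Hypothetical tracker that only accounts for pinned CPUs."""

    def __init__(self):
        self.tracked = {}    # migration uuid -> Migration
        self.pinned = set()  # CPUs currently counted as in use

    def claim(self, migration):
        with COMPUTE_RESOURCE_SEMAPHORE:
            self.tracked[migration.uuid] = migration
            self.pinned |= migration.cpus

    def periodic_update(self):
        """Rebuild usage, counting only migrations still in progress."""
        with COMPUTE_RESOURCE_SEMAPHORE:
            self.pinned = set()
            for m in list(self.tracked.values()):
                if m.status in ("reverted", "confirmed"):
                    del self.tracked[m.uuid]
                else:
                    self.pinned |= m.cpus

    def _drop_move_claim(self, migration):
        # Caller must hold COMPUTE_RESOURCE_SEMAPHORE.
        if not migration.cpus <= self.pinned:
            raise RuntimeError("CPUs already unpinned")
        self.pinned -= migration.cpus
        self.tracked.pop(migration.uuid, None)

    def racy_revert(self, migration, between=lambda: None):
        """Old ordering: status is set outside the lock."""
        migration.status = "reverted"
        between()  # window in which the periodic task may run
        with COMPUTE_RESOURCE_SEMAPHORE:
            self._drop_move_claim(migration)

    def safe_revert(self, migration):
        """New ordering: status change and claim drop are one atomic step."""
        with COMPUTE_RESOURCE_SEMAPHORE:
            migration.status = "reverted"
            self._drop_move_claim(migration)
```

Passing 'periodic_update' as the 'between' callback of 'racy_revert' forces the unlucky interleaving: the periodic task sees the 'reverted' status, stops counting the CPUs, and the later claim drop then tries to unpin CPUs that are already unpinned. With 'safe_revert' the periodic task can only run before or after both steps, so it never observes the intermediate state.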
