libvirt: post_live_migration failures to disconnect volumes result in the rollback of live migrations

Bug #1843639 reported by Lee Yarwood
This bug affects 2 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Medium      Lee Yarwood

Bug Description

Description
===========
At present, any exception raised during post_live_migration on the source after an instance has successfully migrated causes the overall migration to fail, and the instance is listed as running on the source while it is actually running on the destination.

Any such errors should be logged but otherwise ignored, allowing the migration to complete and the instance to continue to be tracked correctly.

Steps to reproduce
==================
- Live migrate an instance from host A to host B, ensuring post_live_migration fails.

Expected result
===============
Any failures on the source encountered by post_live_migration are logged but the overall migration still completes successfully.

Actual result
=============
The instance and the overall migration are left in error states. Additionally, the instance is reported as residing on the source host while it is actually running on the destination.
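
The failure mode described above can be modelled with a small, self-contained sketch. This is not Nova code: `Instance`, `complete_migration` and `failing_post_live_migration` are hypothetical stand-ins, and the model assumes only what the report states, namely that the host record is updated after the source-side cleanup returns.

```python
class Instance:
    """Hypothetical stand-in for the tracked record of an instance."""

    def __init__(self, host):
        self.host = host
        self.vm_state = "active"


def failing_post_live_migration(instance):
    """Hypothetical source-side cleanup that always fails."""
    raise RuntimeError("failed to disconnect volume on source")


def complete_migration(instance, dest, post_live_migration):
    """Run the source cleanup, then record the instance on the destination."""
    try:
        post_live_migration(instance)
    except Exception:
        # The guest already runs on dest, but the record below is never updated.
        instance.vm_state = "error"
        raise
    instance.host = dest


instance = Instance(host="host-a")
try:
    complete_migration(instance, "host-b", failing_post_live_migration)
except RuntimeError:
    # The record still points at the source host and is now in an error state.
    print(instance.host, instance.vm_state)
```

In this model the cleanup error escapes before the host is updated, which is the mismatch between the reported and the actual location of the instance.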

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   ba3147420c0a6f8b17a46b1a493b89bcd67af6f1

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Matt Riedemann (mriedem) wrote :

Not surprised about this, since the _post_live_migration method and the post_live_migration_at_destination method that it calls are both huge and complicated. I've advocated for a long time that we should break those giant methods down into smaller parts so we can handle errors like this more correctly, but for a backportable fix we'd likely just need to handle the volume errors during post processing and refactor the code out later.

Matt Riedemann (mriedem) wrote :

Similar change from Lee here for refactoring volume handling in _rollback_live_migration:

https://review.opendev.org/#/c/656500/

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Lee Yarwood (lyarwood) wrote :

Apologies for the confusion, I was specifically talking about post_live_migration within the Libvirt driver itself and not within the compute layer. There are definitely additional issues there, as you've pointed out above, but this bug is specifically about the lack of error handling in the following method:

https://github.com/openstack/nova/blob/7a18209a81539217a95ab7daad6bc67002768950/nova/virt/libvirt/driver.py#L8800-L8810

Thankfully the fix is pretty straightforward and should be easily backportable. I'll post it shortly once M3 is cut and the gate is in better shape.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/682621

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/682622

Changed in nova:
assignee: Lee Yarwood (lyarwood) → Artom Lifshitz (notartom)
Changed in nova:
assignee: Artom Lifshitz (notartom) → Lee Yarwood (lyarwood)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.opendev.org/682621

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/682622
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ac68cffd43a2f5103c28a2d4b31e087c3f5c24b9
Submitter: Zuul
Branch: master

commit ac68cffd43a2f5103c28a2d4b31e087c3f5c24b9
Author: Lee Yarwood <email address hidden>
Date: Wed Sep 11 19:24:05 2019 +0100

    libvirt: Ignore volume exceptions during post_live_migration

    Previously errors while disconnecting volumes from the source host
    during post_live_migration within LibvirtDriver would result in the
    overall failure of the migration. This would also mean that while the
    instance would be running on the destination it would still be listed as
    running on the source within the db.

    This change simply ignores any exceptions raised while attempting to
    disconnect volumes on the source. These errors can be safely ignored as
    they will have no impact on the running instance on the destination.

    In the future Nova could wire up the force and ignore_errors kwargs when
    calling down into the associated os-brick connectors to help avoid this.

    Closes-Bug: #1843639
    Change-Id: Ieff5243854321ec40f642845e87a0faecaca8721
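
The approach described in the commit message, logging and ignoring errors raised while disconnecting volumes on the source, might look roughly like the sketch below. It is a minimal illustration and not the merged patch: `disconnect_volumes_best_effort`, its `disconnect_volume` callable and the volume ids are hypothetical names chosen for the example.

```python
import logging

LOG = logging.getLogger(__name__)


def disconnect_volumes_best_effort(volume_ids, disconnect_volume):
    """Try to disconnect every volume, logging failures instead of raising.

    Returns the ids of the volumes whose disconnect failed.
    """
    failed = []
    for volume_id in volume_ids:
        try:
            disconnect_volume(volume_id)
        except Exception:
            # Log the traceback and carry on with the remaining volumes; the
            # guest on the destination is unaffected by a source-side failure.
            LOG.exception("Ignoring error while disconnecting volume %s "
                          "from the source host", volume_id)
            failed.append(volume_id)
    return failed


def flaky_disconnect(volume_id):
    """Hypothetical disconnect callable that fails for one volume."""
    if volume_id == "vol-2":
        raise OSError("device busy")


failed = disconnect_volumes_best_effort(["vol-1", "vol-2", "vol-3"],
                                        flaky_disconnect)
```

Catching the exception per volume, rather than around the whole loop, means one stuck device does not prevent the remaining volumes from being cleaned up. Returning the failed ids is an addition made for this sketch so that a caller could follow up on them.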

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/691281

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/691282

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/691283

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/691284

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/691281
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ff36b6d97ff289ddc34d7776f6a9141b09eb3ad9
Submitter: Zuul
Branch: stable/train

commit ff36b6d97ff289ddc34d7776f6a9141b09eb3ad9
Author: Lee Yarwood <email address hidden>
Date: Wed Sep 11 19:24:05 2019 +0100

    libvirt: Ignore volume exceptions during post_live_migration

    Previously errors while disconnecting volumes from the source host
    during post_live_migration within LibvirtDriver would result in the
    overall failure of the migration. This would also mean that while the
    instance would be running on the destination it would still be listed as
    running on the source within the db.

    This change simply ignores any exceptions raised while attempting to
    disconnect volumes on the source. These errors can be safely ignored as
    they will have no impact on the running instance on the destination.

    In the future Nova could wire up the force and ignore_errors kwargs when
    calling down into the associated os-brick connectors to help avoid this.

    Closes-Bug: #1843639
    Change-Id: Ieff5243854321ec40f642845e87a0faecaca8721
    (cherry picked from commit ac68cffd43a2f5103c28a2d4b31e087c3f5c24b9)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/691282
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=022ea2819425b5ab3001791455dda36ed638c22d
Submitter: Zuul
Branch: stable/stein

commit 022ea2819425b5ab3001791455dda36ed638c22d
Author: Lee Yarwood <email address hidden>
Date: Wed Sep 11 19:24:05 2019 +0100

    libvirt: Ignore volume exceptions during post_live_migration

    Previously errors while disconnecting volumes from the source host
    during post_live_migration within LibvirtDriver would result in the
    overall failure of the migration. This would also mean that while the
    instance would be running on the destination it would still be listed as
    running on the source within the db.

    This change simply ignores any exceptions raised while attempting to
    disconnect volumes on the source. These errors can be safely ignored as
    they will have no impact on the running instance on the destination.

    In the future Nova could wire up the force and ignore_errors kwargs when
    calling down into the associated os-brick connectors to help avoid this.

    NOTE(mriedem): The driver.py change is slightly different from Train
    because pep F841 was not enforced starting in Train but is in Stein.

    Closes-Bug: #1843639
    Change-Id: Ieff5243854321ec40f642845e87a0faecaca8721
    (cherry picked from commit ac68cffd43a2f5103c28a2d4b31e087c3f5c24b9)
    (cherry picked from commit ff36b6d97ff289ddc34d7776f6a9141b09eb3ad9)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/691283
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a07c612ea6fb8553effecef7454caa179589e916
Submitter: Zuul
Branch: stable/rocky

commit a07c612ea6fb8553effecef7454caa179589e916
Author: Lee Yarwood <email address hidden>
Date: Wed Sep 11 19:24:05 2019 +0100

    libvirt: Ignore volume exceptions during post_live_migration

    Previously errors while disconnecting volumes from the source host
    during post_live_migration within LibvirtDriver would result in the
    overall failure of the migration. This would also mean that while the
    instance would be running on the destination it would still be listed as
    running on the source within the db.

    This change simply ignores any exceptions raised while attempting to
    disconnect volumes on the source. These errors can be safely ignored as
    they will have no impact on the running instance on the destination.

    In the future Nova could wire up the force and ignore_errors kwargs when
    calling down into the associated os-brick connectors to help avoid this.

    Closes-Bug: #1843639
    Change-Id: Ieff5243854321ec40f642845e87a0faecaca8721
    (cherry picked from commit ac68cffd43a2f5103c28a2d4b31e087c3f5c24b9)
    (cherry picked from commit ff36b6d97ff289ddc34d7776f6a9141b09eb3ad9)
    (cherry picked from commit 022ea2819425b5ab3001791455dda36ed638c22d)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/691284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e2e9f415a977c725835f040d3c0f1c4cef9ad39c
Submitter: Zuul
Branch: stable/queens

commit e2e9f415a977c725835f040d3c0f1c4cef9ad39c
Author: Lee Yarwood <email address hidden>
Date: Wed Sep 11 19:24:05 2019 +0100

    libvirt: Ignore volume exceptions during post_live_migration

    Previously errors while disconnecting volumes from the source host
    during post_live_migration within LibvirtDriver would result in the
    overall failure of the migration. This would also mean that while the
    instance would be running on the destination it would still be listed as
    running on the source within the db.

    This change simply ignores any exceptions raised while attempting to
    disconnect volumes on the source. These errors can be safely ignored as
    they will have no impact on the running instance on the destination.

    In the future Nova could wire up the force and ignore_errors kwargs when
    calling down into the associated os-brick connectors to help avoid this.

    Closes-Bug: #1843639
    Change-Id: Ieff5243854321ec40f642845e87a0faecaca8721
    (cherry picked from commit ac68cffd43a2f5103c28a2d4b31e087c3f5c24b9)
    (cherry picked from commit ff36b6d97ff289ddc34d7776f6a9141b09eb3ad9)
    (cherry picked from commit 022ea2819425b5ab3001791455dda36ed638c22d)
    (cherry picked from commit a07c612ea6fb8553effecef7454caa179589e916)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.1

This issue was fixed in the openstack/nova 20.0.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.1.0

This issue was fixed in the openstack/nova 19.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.3.0

This issue was fixed in the openstack/nova 18.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
