NUMA-aware live migration fails when vCPU pinning is set

Bug #1845146 reported by ya.wang
This bug affects 2 people
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   High        Artom Lifshitz
Train                     Fix Committed  High        Dan Smith

Bug Description

Description
===========

When the vCPU pin policy is dedicated (hw:cpu_policy='dedicated'), NUMA-aware live migration may fail: instances can end up pinned to the wrong host CPUs on the destination.

Steps to reproduce
==================

1. Create two flavors: 2c2g.numa and 4c.4g.numa
   (venv) [root@t1 ~]# openstack flavor show 2c2g.numa
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| disk | 1 |
| id | b4a2df98-82c5-4a53-8ba5-4372f20a98bd |
| name | 2c2g.numa |
| os-flavor-access:is_public | True |
| properties | hw:cpu_policy='dedicated', hw:numa_cpus.0='0', hw:numa_cpus.1='1', hw:numa_mem.0='1024', hw:numa_mem.1='1024', hw:numa_nodes='2' |
| ram | 2048 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 2 |
+----------------------------+----------------------------------------------------------------------------------------------------------------------------------+
   (venv) [root@t1 ~]# openstack flavor show 4c.4g.numa
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| disk | 1 |
| id | cf53f5ea-c036-4a79-8183-6a2389212d02 |
| name | 4c.4g.numa |
| os-flavor-access:is_public | True |
| properties | hw:cpu_policy='dedicated', hw:numa_cpus.0='0', hw:numa_cpus.1='1,2,3', hw:numa_mem.0='3072', hw:numa_mem.1='1024', hw:numa_nodes='2' |
| ram | 4096 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+

2. Create four instances (two with 2c2g.numa, two with 4c.4g.numa)

3. Live migrate the instances one by one

4. After all four instances have been live migrated, check that the vCPU pinning is correct (use 'virsh vcpupin [vm_id]')

5. If the vCPU pinning is correct, go back to step 3 and repeat.
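
As a sanity check on the flavors in step 1, the explicit NUMA layout in the extra specs can be compared against the flavor's vcpus and ram. The sketch below is not Nova code: `parse_cpu_list` and `check_numa_flavor` are hypothetical helpers, and it assumes plain comma-separated CPU lists like the ones shown above (no ranges or exclusions).

```python
def parse_cpu_list(spec):
    """Parse a plain comma-separated vCPU list such as '1,2,3'."""
    return {int(part) for part in spec.split(",") if part.strip()}


def check_numa_flavor(vcpus, ram_mb, extra_specs):
    """Return a list of problems found in a flavor's explicit NUMA layout."""
    problems = []
    seen = set()
    total_mem = 0
    for node in range(int(extra_specs["hw:numa_nodes"])):
        cpus = parse_cpu_list(extra_specs["hw:numa_cpus.%d" % node])
        overlap = cpus & seen
        if overlap:
            problems.append("vCPUs %s are in more than one node" % sorted(overlap))
        seen |= cpus
        total_mem += int(extra_specs["hw:numa_mem.%d" % node])
    if seen != set(range(vcpus)):
        problems.append("hw:numa_cpus.N do not cover vCPUs 0..%d" % (vcpus - 1))
    if total_mem != ram_mb:
        problems.append("hw:numa_mem.N sum to %d, flavor ram is %d" % (total_mem, ram_mb))
    return problems


# Values copied from the 4c.4g.numa flavor in step 1.
FLAVOR_4C4G = {
    "hw:cpu_policy": "dedicated",
    "hw:numa_nodes": "2",
    "hw:numa_cpus.0": "0",
    "hw:numa_cpus.1": "1,2,3",
    "hw:numa_mem.0": "3072",
    "hw:numa_mem.1": "1024",
}

if __name__ == "__main__":
    print(check_numa_flavor(4, 4096, FLAVOR_4C4G))
```

Both flavors in this report pass this check, so the bad pinning is not caused by an inconsistent flavor definition.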

Expected result
===============

The vCPU pinning is correct: no two instances are pinned to the same host CPUs.

Actual result
=============

The vCPU pinning is not correct on compute node t1: both instances are pinned to host CPUs 0 and 15.

(nova-libvirt)[root@t1 /]# virsh list
 Id    Name                           State
----------------------------------------------------
 138   instance-00000012              running
 139   instance-00000011              running

(nova-libvirt)[root@t1 /]# virsh vcpupin 138
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 15

(nova-libvirt)[root@t1 /]# virsh vcpupin 139
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 15
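
Step 4 of the reproduction can be automated by parsing the `virsh vcpupin` output of every domain on a host and flagging host CPUs used by more than one domain. This is a minimal sketch: `expand_cpuset`, `parse_vcpupin` and `find_shared_cpus` are hypothetical helper names, the parser assumes the two-column text layout shown above, and it handles comma lists and ranges but not `^` exclusions.

```python
import re
from collections import defaultdict


def expand_cpuset(spec):
    """Expand a cpuset string such as '0', '0-3' or '0,15' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_vcpupin(output):
    """Map vCPU index -> set of host CPUs from `virsh vcpupin` text output."""
    pins = {}
    for line in output.splitlines():
        # Data rows look like '   1: 15'; the header and rule lines do not match.
        match = re.match(r"^\s*(\d+):\s*(\S+)\s*$", line)
        if match:
            pins[int(match.group(1))] = expand_cpuset(match.group(2))
    return pins


def find_shared_cpus(outputs_by_domain):
    """Return {host_cpu: [domains]} for host CPUs pinned by more than one domain."""
    users = defaultdict(set)
    for domain, output in outputs_by_domain.items():
        for host_cpus in parse_vcpupin(output).values():
            for cpu in host_cpus:
                users[cpu].add(domain)
    return {cpu: sorted(doms) for cpu, doms in users.items() if len(doms) > 1}


# Output copied from the report; both domains show the same pinning.
REPORTED = """VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 15
"""

if __name__ == "__main__":
    print(find_shared_cpus({"138": REPORTED, "139": REPORTED}))
```

For the two domains above, this reports host CPUs 0 and 15 as shared by domains 138 and 139, which should never happen with hw:cpu_policy='dedicated'.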

Environment
===========

Code version: master, 23 Sep

Three compute nodes:
    t1: 16C, 24GB (2 NUMA nodes)
    t2: 12C, 16GB (2 NUMA nodes)
    t3: 8C, 12GB (2 NUMA nodes)

The image has no properties set.

Hypervisor: Libvirt + KVM

Storage: ceph

Networking_type: Neutron + OVS

Logs & Configs
==============

Please see the attached log file.

Revision history for this message
ya.wang (ya.wang) wrote :
Matt Riedemann (mriedem)
tags: added: numa
Revision history for this message
Matt Riedemann (mriedem) wrote :

This may be a duplicate of bug 1829349.

Revision history for this message
Artom Lifshitz (notartom) wrote :

Log analysis notes:

The XML was updated to pin both instances to CPUs 0 and 15, at very different times:

2019-09-24 14:16:14.195 6 DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
  <name>instance-00000012</name>
  <uuid>17bcf040-cf68-4ac3-b365-8a77f93af85b</uuid>
[...]
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="15"/>

2019-09-24 14:16:42.251 6 DEBUG nova.virt.libvirt.migration [-] _update_numa_xml output xml=<domain type="kvm">
  <name>instance-00000011</name>
  <uuid>f1929d75-d6ac-45af-b54b-0e10be75d155</uuid>
[...]
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="15"/>

For the first live migration we create the claims and the NUMAMigrateInfo:

2019-09-24 14:16:08.747 6 DEBUG nova.compute.manager [req-9c210c32-614c-4040-abc8-e8a4138d885b f28f4213c3a14121b6f7fa15140e7aef 25696a57055b4d6bb5428f45f0473a8c - default default] [instance: 17bcf040-cf68-4ac3-b365-8a77f93af85b] Created live migration claim. _live_migration_claim /var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py:6659

2019-09-24 14:16:08.760 6 DEBUG nova.virt.libvirt.driver [req-9c210c32-614c-4040-abc8-e8a4138d885b f28f4213c3a14121b6f7fa15140e7aef 25696a57055b4d6bb5428f45f0473a8c - default default] Built NUMA live migration info: LibvirtLiveMigrateNUMAInfo(cell_pins={0=set([0]),1=set([1])},cpu_pins={0=set([0]),1=set([15])},emulator_pins=set([0,15]),sched_priority=<?>,sched_vcpus=<?>) _get_live_migrate_numa_info /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:8059

Same for the second live migration:

2019-09-24 14:16:35.853 6 DEBUG nova.compute.manager [req-5aeb2f2d-c69f-473a-8da4-59664d87a214 f28f4213c3a14121b6f7fa15140e7aef 25696a57055b4d6bb5428f45f0473a8c - default default] [instance: f1929d75-d6ac-45af-b54b-0e10be75d155] Created live migration claim. _live_migration_claim /var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/manager.py:6659

2019-09-24 14:16:35.861 6 DEBUG nova.virt.libvirt.driver [req-5aeb2f2d-c69f-473a-8da4-59664d87a214 f28f4213c3a14121b6f7fa15140e7aef 25696a57055b4d6bb5428f45f0473a8c - default default] Built NUMA live migration info: LibvirtLiveMigrateNUMAInfo(cell_pins={0=set([0]),1=set([1])},cpu_pins={0=set([0]),1=set([15])},emulator_pins=set([0,15]),sched_priority=<?>,sched_vcpus=<?>) _get_live_migrate_numa_info /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:8059

Both claimed host CPUs 0 and 15 - but how/why? What happened between those 2 claims? Going back in time, we see:

The second live migration's claim claims CPUs 0 and 15:

2019-09-24 14:16:34.290 6 DEBUG nova.virt.hardware [req-5aeb2f2d-c69f-473a-8da4-59664d87a214 f28f4213c3a14121b6f7fa15140e7aef 25696a57055b4d6bb5428f45f0473a8c - default default] Selected cores for pinning: [(0, 0)], in cell 0 _pack_instance_onto_cores /var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/hardware.py:979

[...]

2019-09-24 14:16:34.295 6 DEBUG nova.virt.hardware [req-5aeb2f2d-c69f-473a-8da4-59664d87a214 f28f4213c3a14121b6f7fa15140e7aef 25696a57055b4d6bb5428f45f0473a8c - default default] Selected cores for pinning: [(1,...

Revision history for this message
Artom Lifshitz (notartom) wrote :

Figured it out:

When the update resources periodic task runs, it pulls migrations from the database using [1], which filters out migrations in 'accepted' status. Live migrations are created with an 'accepted' status by the conductor [2], and are only set to 'preparing' by the compute manager here [3], which happens after all the new NUMA-aware live migration claims are made. So there's a time window after the claim but before the migration has been set to 'preparing' during which, if the periodic resource update task kicks in, it will miss the migration, see that the instance is still on the source host according to the database, and free its resources from the destination. The next incoming instance can then claim the same host CPUs, which is how both instances ended up pinned to CPUs 0 and 15.

[1] https://github.com/openstack/nova/blob/0ce66605e16aca85df97acdd8c459802fcdb9aa0/nova/db/sqlalchemy/api.py#L4422
[2] https://github.com/openstack/nova/blob/0ce66605e16aca85df97acdd8c459802fcdb9aa0/nova/conductor/manager.py#L422
[3] https://github.com/openstack/nova/blob/0ce66605e16aca85df97acdd8c459802fcdb9aa0/nova/compute/manager.py#L7020
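
The race described above can be illustrated with a small toy model. This is only a sketch, not Nova's resource tracker: the `Destination` class, the `update_available_resource` function, the dictionary shapes and both status sets are simplified, hypothetical stand-ins. The only thing taken from the analysis is the behaviour: a periodic task that ignores 'accepted' migrations frees a claim that was made while the migration was still 'accepted'.

```python
# Toy status lists for illustration only, not Nova's real filter.
EXCLUDED_WITH_ACCEPTED = {"accepted", "completed", "failed", "error"}
EXCLUDED_WITHOUT_ACCEPTED = {"completed", "failed", "error"}


class Destination:
    """Tracks which host CPUs are pinned to which instance on one host."""

    def __init__(self, host, cpus):
        self.host = host
        self.all_cpus = set(cpus)
        self.pinned = {}  # instance uuid -> set of host CPUs

    def claim(self, uuid, count):
        """Pin `count` free host CPUs to the instance and return them."""
        used = set().union(*self.pinned.values())
        free = sorted(self.all_cpus - used)
        if len(free) < count:
            raise RuntimeError("not enough free CPUs")
        self.pinned[uuid] = set(free[:count])
        return set(free[:count])


def update_available_resource(dest, instances, migrations, excluded):
    """Periodic task: drop usage for instances this host should not track."""
    tracked = {i["uuid"] for i in instances if i["host"] == dest.host}
    tracked |= {
        m["instance_uuid"] for m in migrations
        if m["dest"] == dest.host and m["status"] not in excluded
    }
    for uuid in list(dest.pinned):
        if uuid not in tracked:
            del dest.pinned[uuid]


def run_scenario(excluded):
    """Migrate instances 'a' then 'b' to t1, with the periodic task in between."""
    dest = Destination("t1", range(4))
    instances = [{"uuid": "a", "host": "t2"}, {"uuid": "b", "host": "t2"}]
    migrations = [{"instance_uuid": "a", "dest": "t1", "status": "accepted"}]

    pins_a = dest.claim("a", 2)  # claim is made while still 'accepted'
    update_available_resource(dest, instances, migrations, excluded)
    migrations[0]["status"] = "preparing"  # too late if the claim was freed

    migrations.append({"instance_uuid": "b", "dest": "t1", "status": "accepted"})
    pins_b = dest.claim("b", 2)
    return pins_a, pins_b


if __name__ == "__main__":
    print("filtering 'accepted':", run_scenario(EXCLUDED_WITH_ACCEPTED))
    print("keeping 'accepted': ", run_scenario(EXCLUDED_WITHOUT_ACCEPTED))
```

When 'accepted' is filtered out, both instances receive the same host CPUs. When it is kept, the second claim sees the first one and picks different CPUs.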

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/684409

Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: New → In Progress
Revision history for this message
Artom Lifshitz (notartom) wrote :

Ya, could you retry your tests with [1] applied, to confirm whether it fixes the issue?

[1] https://review.opendev.org/684409

Matt Riedemann (mriedem)
tags: added: train-rc-potential
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/685387

Matt Riedemann (mriedem)
no longer affects: nova/train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/684409
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6ec686c26b2c8b18bcff522633bfe9715e0feec3
Submitter: Zuul
Branch: master

commit 6ec686c26b2c8b18bcff522633bfe9715e0feec3
Author: Artom Lifshitz <email address hidden>
Date: Tue Sep 24 13:22:23 2019 -0400

    Stop filtering out 'accepted' for in-progress migrations

    Live migrations are created with an 'accepted' status. Resource claims
    on the destination are done with the migration in 'accepted' status.
    The status is set to 'preparing' a bit later, right before running
    pre_live_migration(). Migrations with status 'accepted' are filtered
    out by the database layer when getting in-progress migrations. Thus,
    there's a time window after resource claims but before 'preparing'
    during which resources have been claimed but the migration is not
    considered in-progress by the database layer. During that window, the
    instance's host is the source - that's only updated once the live
    migration finishes. If the update available resources periodic task
    runs during that window, it'll free the instance's resource from the
    destination because neither the instance nor any of its in-progress
    migrations are associated with the destination. This means that other
    incoming instances are able to consume resources that should not be
    available. This patch stops filtering out the 'accepted' status in the
    database layer when retrieving in-progress migrations.

    Change-Id: I4c56925ed35bc3275ca1ac6c30d7fd641ad84260
    Closes-bug: 1845146

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/685387
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=45c2ba37bc21370d9814b81cd892abc3cb8a9f04
Submitter: Zuul
Branch: stable/train

commit 45c2ba37bc21370d9814b81cd892abc3cb8a9f04
Author: Artom Lifshitz <email address hidden>
Date: Tue Sep 24 13:22:23 2019 -0400

    Stop filtering out 'accepted' for in-progress migrations

    Live migrations are created with an 'accepted' status. Resource claims
    on the destination are done with the migration in 'accepted' status.
    The status is set to 'preparing' a bit later, right before running
    pre_live_migration(). Migrations with status 'accepted' are filtered
    out by the database layer when getting in-progress migrations. Thus,
    there's a time window after resource claims but before 'preparing'
    during which resources have been claimed but the migration is not
    considered in-progress by the database layer. During that window, the
    instance's host is the source - that's only updated once the live
    migration finishes. If the update available resources periodic task
    runs during that window, it'll free the instance's resource from the
    destination because neither the instance nor any of its in-progress
    migrations are associated with the destination. This means that other
    incoming instances are able to consume resources that should not be
    available. This patch stops filtering out the 'accepted' status in the
    database layer when retrieving in-progress migrations.

    Change-Id: I4c56925ed35bc3275ca1ac6c30d7fd641ad84260
    Closes-bug: 1845146
    (cherry picked from commit 6ec686c26b2c8b18bcff522633bfe9715e0feec3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/687404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc2

This issue was fixed in the openstack/nova 20.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/687404
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=32713a4fe885ee55ef0fefc8ce6c78877f2f03e7
Submitter: Zuul
Branch: master

commit 32713a4fe885ee55ef0fefc8ce6c78877f2f03e7
Author: Artom Lifshitz <email address hidden>
Date: Tue Oct 8 15:23:47 2019 -0400

    NUMA LM: Add func test for bug 1845146

    Bug 1845146 was caused by the update available resources periodic task
    running during a small window in which the migration was in 'accepted'
    but resource claims had been done. 'accepted' migrations were not
    considered in progress before the fix for 1845146 merged as commit
    6ec686c26b, which caused the periodic task to incorrectly free the
    migration's resources from the destination. This patch adds a test
    that triggers this race by wrapping around the compute manager's
    live_migration() (which sets the 'queued' migration status - this was
    actually wrong in 6ec686c26b, as it talks about 'preparing') and
    running the update available resources periodic task while the
    migration is still in 'accepted'.

    Related bug: 1845146

    Change-Id: I78e79112a9c803fb45d828cfb4641456da66364a
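
The wrapping technique mentioned in the commit message above can be sketched with `unittest.mock`. This is not the functional test that was merged: `FakeComputeManager` and `run_periodic_before_live_migration` are hypothetical names, and the fake only records which migration status the periodic task would observe.

```python
from unittest import mock


class FakeComputeManager:
    """Stand-in for a compute manager; records what each call observes."""

    def __init__(self):
        self.migration_status = "accepted"
        self.events = []

    def update_available_resource(self):
        self.events.append(("periodic", self.migration_status))

    def live_migration(self):
        self.migration_status = "queued"
        self.events.append(("live_migration", self.migration_status))


def run_periodic_before_live_migration(manager):
    """Wrap live_migration() so the periodic task runs just before it."""
    original = manager.live_migration  # bound method, captured before patching

    def wrapper(*args, **kwargs):
        # The migration is still 'accepted' here, which is the racy window.
        manager.update_available_resource()
        return original(*args, **kwargs)

    with mock.patch.object(manager, "live_migration", side_effect=wrapper):
        manager.live_migration()
    return manager.events


if __name__ == "__main__":
    print(run_periodic_before_live_migration(FakeComputeManager()))
```

Forcing the periodic task to run inside the wrapper makes the window deterministic, so a test does not depend on timing to hit the race.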
