ovn sriov broken from ussuri onwards

Bug #1931244 reported by Edward Hope-Morley
This bug affects 4 people
Affects                  Status         Importance  Assigned to    Milestone
Ubuntu Cloud Archive     Fix Released   High        Unassigned
  Ussuri                 Fix Released   High        Unassigned
  Victoria               Invalid        Undecided   Unassigned
  Wallaby                Fix Released   High        Unassigned
  Xena                   Fix Released   High        Unassigned
neutron                  Fix Released   High        Terry Wilson
neutron (Ubuntu)         Fix Released   High        Unassigned
  Focal                  Fix Released   High        Unassigned
  Hirsute                Fix Released   High        Unassigned
  Impish                 Fix Released   High        Unassigned

Bug Description

I have an OpenStack Ussuri deployment (neutron 16.3.2) using OVN. When I create a VM with one or more SR-IOV ports, it fails with:

2021-06-08 11:38:31.939 526862 WARNING nova.virt.libvirt.driver [req-c4be797e-7d7e-4e73-8406-f74ae51db192 696c98b722a44d229e16b6d6474a27d4 0b9102977dcc4d4ab662b48494bb3110 - 2e0bf6ec95c047d986a61f7570222149 2e0bf6ec95c047d986a61f7570222149] [instance: 7ab9b374-51eb-4e94-8055-c69e8a7d76b3] Timeout waiting for [('network-vif-plugged', 'c2b7c68d-c465-4ca2-869a-59bc73b2b595'), ('network-vif-plugged', 'a50de16a-29ac-4dca-9cb6-0247a932fbf3')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds
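
When triaging the same symptom, it can help to pull the pending events and port IDs out of the warning so they can be matched against the sriov-agent and neutron-server logs. The following is only a minimal sketch, assuming the message format shown above; `pending_vif_events` is a hypothetical helper and not part of nova.

```python
import re

# Matches tuples such as ('network-vif-plugged', '<port uuid>') in the warning text.
EVENT_RE = re.compile(r"\('(network-vif-[a-z]+)', '([0-9a-f-]{36})'\)")


def pending_vif_events(log_line):
    """Return (event_name, port_id) pairs from a nova-compute timeout warning."""
    if "Timeout waiting for" not in log_line:
        return []
    return EVENT_RE.findall(log_line)


# Shortened copy of the warning above; only the event list matters here.
line = (
    "WARNING nova.virt.libvirt.driver Timeout waiting for "
    "[('network-vif-plugged', 'c2b7c68d-c465-4ca2-869a-59bc73b2b595'), "
    "('network-vif-plugged', 'a50de16a-29ac-4dca-9cb6-0247a932fbf3')] "
    "for instance with vm_state building and task_state spawning."
)

for event_name, port_id in pending_vif_events(line):
    print(event_name, port_id)
```

Each port ID printed this way is one that nova-compute was still waiting on when the 300 second timeout fired.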

A bit of analysis shows that nova-compute did its thing and sits there waiting on network-vif-plugged. The sriov-agent then notices the newly configured VFs and sends a get_devices_details_list() RPC call to neutron, and neutron never responds. Reverting to 16.3.1 fixes the issue. Taking a closer look at 16.3.2 by reverting patches led to [1] as the culprit.

[1] https://github.com/openstack/neutron/commit/7cf9597570f288d27768dc5ff7be04824d09f8bc

=== Ubuntu SRU details ===
[Impact]
[Test Case]
See above.
I think for testing we can run standard regression testing with OVN/neutron deployments plus tempest testing.

For now we are planning to revert the commit as a stop-gap to prevent further upgrades from being regressed.

[Regression Potential]
There is regression potential in that the patch being reverted contributes partial fixes to the following related bugs. Considering that most OpenStack users are on Ussuri at this point and that 16.3.2 has not been available for very long, the revert we are proposing would seem to have the least regression potential.
https://bugs.launchpad.net/neutron/+bug/1894117
https://bugs.launchpad.net/neutron/+bug/1903008

Revision history for this message
Brian Haley (brian-haley) wrote :

Terry - can you take a look since it's related to https://review.opendev.org/c/openstack/neutron/+/765874 (Rely on worker count for HashRing caching)

Changed in neutron:
importance: Undecided → High
assignee: nobody → Terry Wilson (otherwiseguy)
tags: added: ovn
tags: added: sriov-pci-pt
Changed in neutron (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in cloud-archive:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I believe the following bug may also be related - https://bugs.launchpad.net/neutron/+bug/1927977

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I think the issue here is basically that the new code relies on [1] to get the number of worker threads, but that does not include things like RPC workers.

[1] https://github.com/openstack/neutron/blob/df94641b43964834ba14c69eb4fb17cc45349117/neutron/service.py#L313

Changed in neutron (Ubuntu Focal):
importance: Undecided → High
status: New → Triaged
Changed in neutron (Ubuntu Groovy):
importance: Undecided → High
status: New → Triaged
Changed in neutron (Ubuntu Hirsute):
importance: Undecided → High
status: New → Triaged
description: updated
description: updated
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I'm a little confused because I was under the impression that ml2/ovn doesn't actually use RPC workers for anything. What is happening that RPC requests end up calling into ml2/ovn code?

If it's needed, it's needed, but I don't like unconditionally setting up connections for non-API workers because connections are really expensive (each maintains an in-memory copy of the OVN DBs). I have no experience w/ nor hardware for SRIOV stuff, but it should be easy for someone who does to be able to add service.RpcWorker (I think) to https://review.opendev.org/c/openstack/neutron/+/765874/11/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#62 to see if it resolves the issue. If it does, I'd love to be able to find a way to conditionally add the connections only if they are needed--but if it involves multiple ml2 drivers or something, that might be hard.
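
To make the suggested experiment concrete, here is a toy model of gating OVN connections on the worker class. It is only a sketch under stated assumptions: the three classes are empty stand-ins named after the neutron worker types mentioned in this thread, and `OVN_CONNECTED_WORKERS` and `should_connect` are hypothetical names. The real check at the mech_driver.py line linked above may look different.

```python
class WorkerService:
    """Stand-in for the API (WSGI) worker class."""


class MaintenanceWorker:
    """Stand-in for the OVN maintenance worker class."""


class RpcWorker:
    """Stand-in for the RPC worker class."""


# Hypothetical allow-list: worker classes that open their own OVN DB
# connection after fork. Each connection is expensive, so the list is short.
OVN_CONNECTED_WORKERS = (WorkerService, MaintenanceWorker)


def should_connect(worker_class, connected=OVN_CONNECTED_WORKERS):
    """Return True if this worker class should get an OVN connection."""
    return issubclass(worker_class, connected)


# With the allow-list above, an RPC worker never gets a connection.
print(should_connect(RpcWorker))

# The experiment suggested in this comment: add the RPC worker class too.
print(should_connect(RpcWorker, OVN_CONNECTED_WORKERS + (RpcWorker,)))
```

In this model, any request handled by an RPC worker that ends up in OVN driver code has no connection to work with unless the RPC worker class is added to the allow-list.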

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This doesn't affect groovy/victoria in Ubuntu since it's not been included in an upstream point release.

Changed in neutron (Ubuntu Groovy):
status: Triaged → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

^ In an upstream point release for victoria, that is.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Edward, or anyone else affected,

Accepted neutron into hirsute-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:18.0.0-0ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-hirsute to verification-done-hirsute. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-hirsute. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Hirsute):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-hirsute
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Edward, or anyone else affected,

Accepted neutron into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:16.3.2-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in neutron (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed-focal
Mathew Hodson (mhodson)
no longer affects: neutron (Ubuntu Groovy)
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Terry, while I'd love it if Neutron had completed the migration to using OVN services and database for everything as a replacement for the AMQP RPC and agent infrastructure, the reality is that this is not (yet) the case.

For the SR-IOV use case the OVN driver owns ports and provides DHCP and metadata services to the consumers of the sriovnicswitch mechanism driver. At this point in time the sriovnicswitch driver still relies on an agent and RPC.

The RPC code does indeed read and update ports and subsequently will call into the OVN driver.

Thank you for the pointer to the RpcWorker class name. I will put up a proposal for how we could conditionally add OVN IDL connections to the RPC Workers gated on the presence of the sriovnicswitch driver.
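
The gating idea could be sketched roughly as below. This is an illustration only, not the content of the proposal: `rpc_workers_need_ovn` is a hypothetical function, and it assumes the decision can be made from the list of loaded ML2 mechanism driver names (the `mechanism_drivers` option, with `ovn` and `sriovnicswitch` as the driver aliases).

```python
def rpc_workers_need_ovn(mechanism_drivers):
    """Hypothetical gate for giving RPC workers an OVN IDL connection.

    RPC workers only need the connection when an agent-based driver that
    shares OVN-owned ports (here: sriovnicswitch) is loaded next to OVN.
    """
    loaded = {name.strip() for name in mechanism_drivers}
    return "ovn" in loaded and "sriovnicswitch" in loaded


# OVN only: RPC workers can skip the expensive connection.
print(rpc_workers_need_ovn(["ovn"]))

# OVN plus SR-IOV agent: RPC calls from the agent reach OVN driver code.
print(rpc_workers_need_ovn(["ovn", "sriovnicswitch"]))
```

The trade-off is the one raised earlier in the thread: a conditional gate saves connections in pure OVN deployments, but it has to know about every driver combination that routes RPC calls into OVN code.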

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/795474

Changed in neutron:
status: New → In Progress
Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Edward, or anyone else affected,

Accepted neutron into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-archive:
status: Triaged → Fix Committed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Edward, or anyone else affected,

Accepted neutron into wallaby-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:wallaby-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-wallaby-needed to verification-wallaby-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-wallaby-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Verified bionic-ussuri-proposed with output:

# apt-cache policy neutron-common
neutron-common:
  Installed: 2:16.3.2-0ubuntu3~cloud0
  Candidate: 2:16.3.2-0ubuntu3~cloud0
  Version table:
 *** 2:16.3.2-0ubuntu3~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-proposed/ussuri/main amd64 Packages
        100 /var/lib/dpkg/status
     2:16.3.1-0ubuntu1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/ussuri/main amd64 Packages
     2:12.1.1-0ubuntu7 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     2:12.0.1-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

I created a two-port sriov vm on bionic-ussuri and it came up in seconds.
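
For anyone scripting this verification, the installed version can be read from the `apt-cache policy` output instead of being checked by eye. A minimal sketch, assuming the output layout shown above; `installed_and_candidate` is a hypothetical helper, and the comparison is a plain string match rather than a Debian version comparison.

```python
import re

FIELD_RE = re.compile(r"\s*(Installed|Candidate):\s*(\S+)")


def installed_and_candidate(policy_output):
    """Extract the Installed and Candidate versions from apt-cache policy text."""
    fields = {}
    for line in policy_output.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields.get("Installed"), fields.get("Candidate")


# First lines of the output above.
sample = """neutron-common:
  Installed: 2:16.3.2-0ubuntu3~cloud0
  Candidate: 2:16.3.2-0ubuntu3~cloud0
  Version table:
"""

installed, candidate = installed_and_candidate(sample)
# True when the package carrying the revert is the one actually installed.
print(installed == "2:16.3.2-0ubuntu3~cloud0")
```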

tags: added: verification-ussuri-done
removed: verification-ussuri-needed
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Verified on focal-proposed and test case is successful

$ juju run -a neutron-api -- sudo apt-cache policy neutron-common
neutron-common:
  Installed: 2:16.3.2-0ubuntu3
  Candidate: 2:16.3.2-0ubuntu3
  Version table:
 *** 2:16.3.2-0ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2:16.3.2-0ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     2:16.0.0~b3~git2020041516.5f42488a9a-0ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

Created VM with 2 SRIOV ports

$ openstack server list --long
+--------------------------------------+-------------+--------+------------+-------------+----------------------------------------+------------+--------------------------------------+-------------+-----------+-------------------+---------------------+------------+
| ID | Name | Status | Task State | Power State | Networks | Image Name | Image ID | Flavor Name | Flavor ID | Availability Zone | Host | Properties |
+--------------------------------------+-------------+--------+------------+-------------+----------------------------------------+------------+--------------------------------------+-------------+-----------+-------------------+---------------------+------------+
| 0f0c5104-cda8-4b84-95b0-8a713e8a1db6 | sriov-test1 | ACTIVE | None | Running | sriov_net=10.230.58.157, 10.230.58.133 | bionic | 17cca127-b912-444d-bc9a-5e4cf48156b3 | m1.medium | 3 | nova | test.test.test | |
+--------------------------------------+-------------+--------+------------+-------------+----------------------------------------+------------+--------------------------------------+-------------+-----------+-------------------+---------------------+------------+

tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Verified on hirsute-proposed and test case is successful

$ juju run -a neutron-api -- sudo apt-cache policy neutron-common
neutron-common:
  Installed: 2:18.0.0-0ubuntu2.1
  Candidate: 2:18.0.0-0ubuntu2.1
  Version table:
 *** 2:18.0.0-0ubuntu2.1 500
        500 http://archive.ubuntu.com/ubuntu hirsute-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2:18.0.0-0ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu hirsute/main amd64 Packages

Created VM with 2 SRIOV ports
$ openstack server list
+--------------------------------------+-------------+--------+----------------------------------------+--------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-------------+--------+----------------------------------------+--------+-----------+
| 590b45a3-3d93-44cf-b8a5-0ece109c608e | sriov-test2 | ACTIVE | sriov_net=10.230.58.156, 10.230.58.170 | bionic | m1.medium |
+--------------------------------------+-------------+--------+----------------------------------------+--------+-----------+

tags: added: verification-done-hirsute
removed: verification-needed-hirsute
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

SRU team,
As per comment #14, the fix should be available in cloud-archive:wallaby-proposed

But I don't see a new neutron package to upgrade to when wallaby-proposed is enabled.

# cat /etc/apt/sources.list.d/cloudarchive-wallaby-proposed.list
deb http://ubuntu-cloud.archive.canonical.com/ubuntu focal-proposed/wallaby main

# apt list --installed | grep neutron

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

neutron-common/focal-updates,focal-proposed,now 2:18.0.0-0ubuntu2~cloud0 all [installed]
neutron-fwaas-common/focal-updates,now 1:16.0.0-0ubuntu0.20.04.1 all [installed,automatic]
neutron-plugin-ml2/focal-updates,focal-proposed,now 2:18.0.0-0ubuntu2~cloud0 all [installed,automatic]
neutron-server/focal-updates,focal-proposed,now 2:18.0.0-0ubuntu2~cloud0 all [installed]
python3-neutron-dynamic-routing/focal-updates,focal-proposed,now 2:18.0.0-0ubuntu1~cloud0 all [installed]
python3-neutron-fwaas/focal-updates,now 1:16.0.0-0ubuntu0.20.04.1 all [installed]
python3-neutron-lib/focal-updates,focal-proposed,now 2.10.1-0ubuntu1~cloud0 all [installed,automatic]
python3-neutron/focal-updates,focal-proposed,now 2:18.0.0-0ubuntu2~cloud0 all [installed]
python3-neutronclient/focal-updates,focal-proposed,now 1:7.2.1-0ubuntu1~cloud0 all [installed,automatic]

Also, neutron-common in http://ubuntu-cloud.archive.canonical.com/ubuntu/dists/focal-proposed/wallaby/main/binary-arm64/Packages refers to 2:18.0.0-0ubuntu2~cloud0 (which is the same version as in focal-updates/wallaby).

Could you please cross-check whether a new package has been released for SRU verification?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hemanth, the new version should now be available in proposed. Thanks for the note on that.

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Verified on wallaby-proposed and test case is successful

$ juju run -a neutron-api -- sudo apt-cache policy neutron-common
neutron-common:
  Installed: 2:18.0.0-0ubuntu2.1~cloud0
  Candidate: 2:18.0.0-0ubuntu2.1~cloud0
  Version table:
 *** 2:18.0.0-0ubuntu2.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu focal-proposed/wallaby/main amd64 Packages
        100 /var/lib/dpkg/status
     2:18.0.0-0ubuntu2~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu focal-updates/wallaby/main amd64 Packages
     2:16.3.2-0ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     2:16.0.0~b3~git2020041516.5f42488a9a-0ubuntu2 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

Launching of VM with 2 SRIOV ports is successful

$ openstack server list
+--------------------------------------+------------+--------+----------------------------------------+--------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+----------------------------------------+--------+-----------+
| 53d397bf-8233-4852-a5c0-c27835cede67 | sriov-test | ACTIVE | sriov_net=10.230.58.173, 10.230.58.121 | bionic | m1.medium |
+--------------------------------------+------------+--------+----------------------------------------+--------+-----------+

tags: added: verification-done verification-wallaby-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:18.0.0-0ubuntu2.1

---------------
neutron (2:18.0.0-0ubuntu2.1) hirsute; urgency=medium

  * d/gbp.conf: Create stable/wallaby branch.
  * d/p/revert-rely-on-worker-count-for-hashring-caching.patch: Revert
    patch due to SR-IOV regression (LP: #1931244).
  * d/p/remove-leading-zeroes-from-an-ip-address.patch: Cherry-picked from
    upstream to fix failing test (LP: #1930222).

 -- Corey Bryant <email address hidden> Tue, 08 Jun 2021 10:52:23 -0400

Changed in neutron (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:16.3.2-0ubuntu3

---------------
neutron (2:16.3.2-0ubuntu3) focal; urgency=medium

  * d/p/revert-rely-on-worker-count-for-hashring-caching.patch: Revert
    patch from 16.3.2 due to SR-IOV regression (LP: #1931244).

 -- Corey Bryant <email address hidden> Tue, 08 Jun 2021 12:57:47 -0400

Changed in neutron (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:18.0.0-0ubuntu3~cloud0
---------------

 neutron (2:18.0.0-0ubuntu3~cloud0) focal-xena; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:18.0.0-0ubuntu3) impish; urgency=medium
 .
   * d/p/revert-rely-on-worker-count-for-hashring-caching.patch: Revert
     patch due to SR-IOV regression (LP: #1931244).
   * d/p/remove-leading-zeroes-from-an-ip-address.patch: Cherry-picked from
     upstream to fix failing test (LP: #1930222).

Changed in cloud-archive:
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:18.0.0-0ubuntu2.1~cloud0
---------------

 neutron (2:18.0.0-0ubuntu2.1~cloud0) focal-wallaby; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:18.0.0-0ubuntu2.1) hirsute; urgency=medium
 .
   * d/gbp.conf: Create stable/wallaby branch.
   * d/p/revert-rely-on-worker-count-for-hashring-caching.patch: Revert
     patch due to SR-IOV regression (LP: #1931244).
   * d/p/remove-leading-zeroes-from-an-ip-address.patch: Cherry-picked from
     upstream to fix failing test (LP: #1930222).

Changed in neutron (Ubuntu Impish):
status: Triaged → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:16.3.2-0ubuntu3~cloud0
---------------

 neutron (2:16.3.2-0ubuntu3~cloud0) bionic-ussuri; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:16.3.2-0ubuntu3) focal; urgency=medium
 .
   * d/p/revert-rely-on-worker-count-for-hashring-caching.patch: Revert
     patch from 16.3.2 due to SR-IOV regression (LP: #1931244).

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Frode Nordahl <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/795474
Reason: Superseded by unconditionally enabling OVN IDL for RPC workers in https://review.opendev.org/c/openstack/neutron/+/800679

Revision history for this message
Frode Nordahl (fnordahl) wrote :