Allow non-admin users to use hardware offloaded ovs.

Registered by sean mooney

When hardware offloaded ovs was implemented, it was intended to be usable by non-admins simply by setting vnic_type=direct.

This was broken by change I0b5f062bcbf02381bdf4f694fc039f9bb17a2db5 as an attempt to resolve https://bugs.launchpad.net/neutron/+bug/1713590

As noted in https://review.opendev.org/c/openstack/neutron/+/854796 the approach taken in neutron was fundamentally flawed.

The original neutron bug boiled down to the fact that you could not simply deploy hardware offloaded ovs and normal sriov on the same host if the neutron mech driver list was

mechanism_drivers = openvswitch,sriovnicswitch

That is because the ovs mech driver would bind all direct type ports but then fail in port plugging when we booted the VM.

The simple workaround at the time was just to reverse the order of the mech drivers.

mechanism_drivers = sriovnicswitch,openvswitch

https://bugs.launchpad.net/neutron/+bug/1713590/comments/1

Instead, it was chosen to require that '{"capabilities": ["switchdev"]}' be present in the binding profile, which requires the user to set it.

That is broken in 3 ways.

First, the binding profile field is defined as providing information from the hypervisor to the network backend, not from the user to the network backend.

Second, that field is admin only and it's unsafe to allow normal users to write to the binding profile.

Third, neutron just assumes that if that is set the VF is actually in switchdev mode.
That is not true. There is nothing that considers this on the nova side, as
https://specs.openstack.org/openstack/nova-specs/specs/pike/approved/enable-sriov-nic-features.html was not implemented.
So while nova does have the nic features in the database, switchdev is not one of the capabilities we record, and we never implemented the ability to schedule based on the neutron port capability because we changed direction with the creation of the placement service.

Instead, nova should simply be extended to discover if a VF's parent PF is in switchdev mode and report that to neutron in the binding:profile.

This will not by itself enable scheduling based on this capability, but it will allow non-admins to use hardware offloaded ovs transparently, as nova will automatically add the capability if the VF supports it.
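
As a rough illustration of the reporting step, the sketch below merges the switchdev capability into a binding profile dict when the parent PF was found to be in switchdev mode. The function name `add_switchdev_capability` and the `pf_in_switchdev` flag are hypothetical and are not taken from nova; only the `{"capabilities": ["switchdev"]}` shape comes from the description above.

```python
import copy

# Capability string neutron looks for, per the description above.
SWITCHDEV = "switchdev"


def add_switchdev_capability(binding_profile, pf_in_switchdev):
    """Return a copy of the profile, adding "switchdev" when applicable.

    Hypothetical helper: nova's real code path is not shown here.
    """
    profile = copy.deepcopy(binding_profile or {})
    if not pf_in_switchdev:
        return profile
    capabilities = list(profile.get("capabilities", []))
    if SWITCHDEV not in capabilities:
        capabilities.append(SWITCHDEV)
    profile["capabilities"] = capabilities
    return profile
```

Because the capability is added by the compute service rather than by the user, the port can be created by a non-admin with only vnic_type=direct set.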

Blueprint information

Status:
Complete
Approver:
None
Priority:
Undefined
Drafter:
sean mooney
Direction:
Needs approval
Assignee:
None
Definition:
Obsolete
Series goal:
None
Implementation:
Unknown
Milestone target:
None
Completed by
sean mooney

Whiteboard

note that since libvirt does not provide this info

sean@cloud:~$ virsh nodedev-dumpxml net_enp34s0f0np0_b8_ce_f6_48_06_80
<device>
  <name>net_enp34s0f0np0_b8_ce_f6_48_06_80</name>
  <path>/sys/devices/pci0000:20/0000:20:03.0/0000:22:00.0/net/enp34s0f0np0</path>
  <parent>pci_0000_22_00_0</parent>
  <capability type='net'>
    <interface>enp34s0f0np0</interface>
    <address>b8:ce:f6:48:06:80</address>
    <link speed='25000' state='up'/>
    <capability type='80203'/>
  </capability>
</device>

sean@cloud:~$ virsh nodedev-dumpxml pci_0000_22_00_0
<device>
  <name>pci_0000_22_00_0</name>
  <path>/sys/devices/pci0000:20/0000:20:03.0/0000:22:00.0</path>
  <parent>pci_0000_20_03_0</parent>
  <driver>
    <name>mlx5_core</name>
  </driver>
  <capability type='pci'>
    <class>0x020000</class>
    <domain>0</domain>
    <bus>34</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x101d'>MT2892 Family [ConnectX-6 Dx]</product>
    <vendor id='0x15b3'>Mellanox Technologies</vendor>
    <capability type='virt_functions' maxCount='8'/>
    <iommuGroup number='70'>
      <address domain='0x0000' bus='0x22' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='1'/>
    <pci-express>
      <link validity='cap' port='0' speed='16' width='8'/>
      <link validity='sta' speed='8' width='8'/>
    </pci-express>
  </capability>
</device>

and it's not provided in the generic offload features we get from ethtool

sean@cloud:~$ ethtool -k enp34s0f0np0
Features for enp34s0f0np0:
rx-checksumming: on
tx-checksumming: on
 tx-checksum-ipv4: off [fixed]
 tx-checksum-ip-generic: on
 tx-checksum-ipv6: off [fixed]
 tx-checksum-fcoe-crc: off [fixed]
 tx-checksum-sctp: off [fixed]
scatter-gather: on
 tx-scatter-gather: on
 tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
 tx-tcp-segmentation: on
 tx-tcp-ecn-segmentation: off [fixed]
 tx-tcp-mangleid-segmentation: off
 tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: on
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

we will have to directly detect this from sysfs.

we indirectly already do this in os-vif

https://github.com/openstack/os-vif/blob/771dfffcd90dcd7c8c95c41744092f5ad4917be3/vif_plug_ovs/linux_net.py#L411-L423

def _get_phys_switch_id(ifname):
    """Get the interface name and return its phys_switch_id
    :param ifname: The interface name
    :return: The phys_switch_id of the given ifname
    """
    phys_port_name_path = "/sys/class/net/%s/phys_switch_id" % ifname

    if not os.path.isfile(phys_port_name_path):
        return None

    with open(phys_port_name_path, 'r') as fd:
        return fd.readline().strip()

as phys_switch_id will be None if the PF is not in switchdev mode

https://github.com/openstack/os-vif/blob/771dfffcd90dcd7c8c95c41744092f5ad4917be3/vif_plug_ovs/linux_net.py#L309-L317

def _is_switchdev(netdev):
    """Returns True if a netdev has a readable phys_switch_id"""
    try:
        phys_switch_id = _get_phys_switch_id(netdev)
        if phys_switch_id != "" and phys_switch_id is not None:
            return True
    except (OSError, IOError):
        return False
    return False

but we can check it directly in sysfs too.
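
a minimal sketch of what a direct sysfs check could look like, starting from a VF's PCI address. it assumes the usual sysfs layout where a VF has a physfn link to its PF and the PF's netdevs are listed under net/. the function name pf_in_switchdev_mode and the sysfs_root parameter are made up for illustration (the parameter is only there so the check can be pointed at a fake tree for testing).

```python
import os


def pf_in_switchdev_mode(vf_pci_address, sysfs_root="/sys"):
    """Return True if the VF's parent PF has a readable phys_switch_id.

    Hypothetical helper, not nova or os-vif code.
    """
    # <sysfs>/bus/pci/devices/<vf>/physfn points at the parent PF.
    physfn_net = os.path.join(
        sysfs_root, "bus", "pci", "devices", vf_pci_address, "physfn", "net")
    try:
        netdevs = os.listdir(physfn_net)
    except OSError:
        # No physfn link (not a VF) or the PF has no netdev bound.
        return False
    for netdev in netdevs:
        path = os.path.join(physfn_net, netdev, "phys_switch_id")
        try:
            with open(path) as fd:
                if fd.readline().strip():
                    return True
        except OSError:
            # Reading can fail when the device is not in switchdev mode.
            continue
    return False
```

this mirrors the _is_switchdev logic above: an empty or unreadable phys_switch_id is treated as "not in switchdev mode".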

actually, looking at this, hw-tc-offload: on is what we are looking for, i think. i'll need to verify, but we might be able to just use that and avoid the sysfs lookup.
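
if that turns out to be usable, a sketch of pulling the flag out of `ethtool -k` text output could look like the following. the function names are hypothetical, and whether hw-tc-offload: on really implies switchdev mode is still the unverified assumption noted above.

```python
def parse_ethtool_features(output):
    """Parse `ethtool -k <ifname>` text into a {feature: bool} dict."""
    features = {}
    for line in output.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue
        tokens = value.split()
        # The "Features for <ifname>:" header has no value and is skipped.
        if tokens:
            # "on" / "off", optionally followed by "[fixed]".
            features[name] = tokens[0] == "on"
    return features


def hw_tc_offload_enabled(output):
    """Return True if the output reports hw-tc-offload as on."""
    return parse_ethtool_features(output).get("hw-tc-offload", False)
```

the output argument would be the captured stdout of a command like the `ethtool -k enp34s0f0np0` call shown earlier.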

sigh so there might also be a libvirt bug at least on my current host

https://github.com/libvirt/libvirt/commit/8708ca01c0dd38764cad3e483405bdeb05ac2e96
as part of
https://blueprints.launchpad.net/nova/+spec/enable-sriov-nic-features

we also enhanced libvirt to detect the switchdev capability but
libvirt 8.0.0 on ubuntu 22.04 is not reporting any nic features
for ConnectX-6 Dx cards when they are in switchdev mode.

it works fine for intel cards in legacy mode

sean@cloud:~$ virsh nodedev-dumpxml net_enp8s0f0_a0_36_9f_2a_fd_f8
<device>
  <name>net_enp8s0f0_a0_36_9f_2a_fd_f8</name>
  <path>/sys/devices/pci0000:00/0000:00:1c.4/0000:08:00.0/net/enp8s0f0</path>
  <parent>pci_0000_08_00_0</parent>
  <capability type='net'>
    <interface>enp8s0f0</interface>
    <address>a0:36:9f:2a:fd:f8</address>
    <link speed='1000' state='up'/>
    <feature name='rx'/>
    <feature name='tx'/>
    <feature name='sg'/>
    <feature name='tso'/>
    <feature name='gso'/>
    <feature name='gro'/>
    <feature name='rxvlan'/>
    <feature name='txvlan'/>
    <feature name='rxhash'/>
    <feature name='txudptnl'/>
    <capability type='80203'/>
  </capability>
</device>
