Configuring an SR-IOV network device

You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster.

SR-IOV network node configuration object

You specify the SR-IOV network device configuration for a node by creating an SR-IOV network node policy. The API object for the policy is part of the sriovnetwork.openshift.io API group.

The following YAML describes an SR-IOV network node policy:

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: <name> (1)
    namespace: openshift-sriov-network-operator (2)
  spec:
    resourceName: <sriov_resource_name> (3)
    nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true" (4)
    priority: <priority> (5)
    mtu: <mtu> (6)
    needVhostNet: false (7)
    numVfs: <num> (8)
    nicSelector: (9)
      vendor: "<vendor_code>" (10)
      deviceID: "<device_id>" (11)
      pfNames: ["<pf_name>", ...] (12)
      rootDevices: ["<pci_bus_id>", ...] (13)
      netFilter: "<filter_string>" (14)
    deviceType: <device_type> (15)
    isRdma: false (16)
    linkType: <link_type> (17)
    eSwitchMode: <mode> (18)
    excludeTopology: false (19)
(1) The name for the custom resource object.
(2) The namespace where the SR-IOV Network Operator is installed.
(3) The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.

The resourceName value must match the regular expression ^[a-zA-Z0-9_]+$.

(4) The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.

The SR-IOV Network Operator applies node network configuration policies to nodes in sequence. Before applying node network configuration policies, the SR-IOV Network Operator checks if the machine config pool (MCP) for a node is in an unhealthy state such as Degraded or Updating. If a node is in an unhealthy MCP, the process of applying node network configuration policies to all targeted nodes in the cluster pauses until the MCP returns to a healthy state.

To prevent a node in an unhealthy MCP from blocking the application of node network configuration policies to other nodes, including nodes in other MCPs, create a separate node network configuration policy for each MCP.

(5) Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
(6) Optional: The maximum transmission unit (MTU) of the virtual function. The maximum MTU value can vary for different network interface controller (NIC) models.

If you want to create virtual functions on the default network interface, ensure that the MTU is set to a value that matches the cluster MTU.

(7) Optional: Set needVhostNet to true to mount the /dev/vhost-net device in the pod. Use the mounted /dev/vhost-net device with Data Plane Development Kit (DPDK) to forward traffic to the kernel network stack.
(8) The number of virtual functions (VFs) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128.
(9) The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. Identify the network device with enough precision to avoid selecting a device unintentionally.

If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.

(10) Optional: The vendor hexadecimal code of the SR-IOV network device. The only allowed values are 8086 and 15b3.
(11) Optional: The device hexadecimal code of the SR-IOV network device. For example, 101b is the device ID for a Mellanox ConnectX-6 device.
(12) Optional: An array of one or more physical function (PF) names for the device.
(13) Optional: An array of one or more PCI bus addresses for the PF of the device. Provide the address in the following format: 0000:02:00.1.
(14) Optional: The platform-specific network filter. The only supported platform is OpenStack. Acceptable values use the following format: openstack/NetworkID:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with the value from the /var/config/openstack/latest/network_data.json metadata file.
(15) Optional: The driver type for the virtual functions. The only allowed values are netdevice and vfio-pci. The default value is netdevice.

For a Mellanox NIC to work in DPDK mode on bare metal nodes, use the netdevice driver type and set isRdma to true.

(16) Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false.

If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode.

Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.

(17) Optional: The link type for the VFs. The default value is eth for Ethernet. Change this value to ib for InfiniBand.

When linkType is set to ib, isRdma is automatically set to true by the SR-IOV Network Operator webhook. When linkType is set to ib, deviceType should not be set to vfio-pci.

Do not set linkType to eth for SriovNetworkNodePolicy, because this can lead to an incorrect number of available devices reported by the device plugin.

(18) Optional: The NIC device mode. The only allowed values are legacy and switchdev.

When eSwitchMode is set to legacy, the default SR-IOV behavior is enabled.

When eSwitchMode is set to switchdev, hardware offloading is enabled.

(19) Optional: To exclude advertising an SR-IOV network resource’s NUMA node to the Topology Manager, set the value to true. The default value is false.
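
Some of the constraints described above can be checked before you create a policy. The following shell functions are a minimal sketch and are not part of the SR-IOV Network Operator: the names is_valid_resource_name and is_valid_priority are hypothetical, and the checks cover only the resourceName syntax and the 0 to 99 priority range stated above.

```shell
# Hypothetical pre-flight helpers; not provided by the SR-IOV Network Operator.

# Succeeds when the name is non-empty and contains only letters, digits,
# and underscores, which corresponds to ^[a-zA-Z0-9_]+$.
is_valid_resource_name() {
  case "$1" in
    ''|*[!a-zA-Z0-9_]*) return 1 ;;
  esac
  return 0
}

# Succeeds when the value is an integer between 0 and 99.
is_valid_priority() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -le 99 ]
}
```

For example, is_valid_resource_name net1dpdk succeeds, while is_valid_resource_name net-1 fails because of the hyphen.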

SR-IOV network node configuration examples

The following example describes the configuration for an InfiniBand device:

Example configuration for an InfiniBand device

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: policy-ib-net-1
    namespace: openshift-sriov-network-operator
  spec:
    resourceName: ibnic1
    nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
    numVfs: 4
    nicSelector:
      vendor: "15b3"
      deviceID: "101b"
      rootDevices:
        - "0000:19:00.0"
    linkType: ib
    isRdma: true

The following example describes the configuration for an SR-IOV network device in an OpenStack virtual machine:

Example configuration for an SR-IOV device in a virtual machine

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: policy-sriov-net-openstack-1
    namespace: openshift-sriov-network-operator
  spec:
    resourceName: sriovnic1
    nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
    numVfs: 1 (1)
    nicSelector:
      vendor: "15b3"
      deviceID: "101b"
      netFilter: "openstack/NetworkID:ea24bd04-8674-4f69-b0ee-fa0b3bd20509" (2)
(1) The numVfs field is always set to 1 when configuring the node network policy for a virtual machine.
(2) The netFilter field must refer to a network ID when the virtual machine is deployed on OpenStack. Valid values for netFilter are available from an SriovNetworkNodeState object.

Virtual function (VF) partitioning for SR-IOV devices

In some cases, you might want to split virtual functions (VFs) from the same physical function (PF) into multiple resource pools. For example, you might want some of the VFs to load with the default driver and the remaining VFs to load with the vfio-pci driver. In such a deployment, you can use the pfNames selector in your SriovNetworkNodePolicy custom resource (CR) to specify a range of VFs for a pool by using the following format: <pfname>#<first_vf>-<last_vf>.

For example, the following YAML shows the selector for an interface named netpf0 with VFs 2 through 7:

  pfNames: ["netpf0#2-7"]

  • netpf0 is the PF interface name.

  • 2 is the first VF index (0-based) that is included in the range.

  • 7 is the last VF index (0-based) that is included in the range.

You can select VFs from the same PF by using different policy CRs if the following requirements are met:

  • The numVfs value must be identical for policies that select the same PF.

  • The VF index must be in the range of 0 to <numVfs>-1. For example, if you have a policy with numVfs set to 8, then the <first_vf> value must not be smaller than 0, and the <last_vf> must not be larger than 7.

  • The VF ranges in different policies must not overlap.

  • The <first_vf> must not be larger than the <last_vf>.
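
The range requirements above can be expressed as a small check. The following shell sketch uses hypothetical function names and assumes that every argument is a non-negative integer. It is an illustration only, not code from the Operator.

```shell
# Hypothetical helpers that mirror the VF range requirements.

# vf_range_is_valid <numVfs> <first_vf> <last_vf>
# Succeeds when 0 <= first_vf <= last_vf <= numVfs - 1.
vf_range_is_valid() {
  [ "$2" -ge 0 ] && [ "$2" -le "$3" ] && [ "$3" -le $(( $1 - 1 )) ]
}

# vf_ranges_overlap <first_a> <last_a> <first_b> <last_b>
# Succeeds when the two inclusive ranges share at least one VF index.
vf_ranges_overlap() {
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}
```

With numVfs set to 16, the ranges 0-0 and 8-15 are both valid and do not overlap, so two policies can use them for the same PF.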

The following example illustrates NIC partitioning for an SR-IOV device.

The policy policy-net-1 defines a resource pool net1 that contains VF 0 of PF netpf0 with the default VF driver. The policy policy-net-1-dpdk defines a resource pool net1dpdk that contains VFs 8 to 15 of PF netpf0 with the vfio-pci VF driver.

Policy policy-net-1:

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: policy-net-1
    namespace: openshift-sriov-network-operator
  spec:
    resourceName: net1
    nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
    numVfs: 16
    nicSelector:
      pfNames: ["netpf0#0-0"]
    deviceType: netdevice

Policy policy-net-1-dpdk:

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: policy-net-1-dpdk
    namespace: openshift-sriov-network-operator
  spec:
    resourceName: net1dpdk
    nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
    numVfs: 16
    nicSelector:
      pfNames: ["netpf0#8-15"]
    deviceType: vfio-pci

Verifying that the interface is successfully partitioned

Confirm that the interface is partitioned into virtual functions (VFs) for the SR-IOV device by running the following command:

  $ ip link show <interface> (1)
(1) Replace <interface> with the interface that you specified when partitioning to VFs for the SR-IOV device, for example, ens3f1.

Example output

  5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
      link/ether 3c:fd:fe:d1:bc:01 brd ff:ff:ff:ff:ff:ff
      vf 0 link/ether 5a:e7:88:25:ea:a0 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
      vf 1 link/ether 3e:1d:36:d7:3d:49 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
      vf 2 link/ether ce:09:56:97:df:f9 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
      vf 3 link/ether 5e:91:cf:88:d1:38 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
      vf 4 link/ether e6:06:a1:96:2f:de brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
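
To compare the number of VFs on the interface with the numVfs value in the policy, you can count the vf lines in this output. The following sketch assumes the output format shown in the example. The name count_vfs is a hypothetical helper, not an existing command.

```shell
# Hypothetical helper: counts the "vf <index>" lines in `ip link show`
# output that is read from standard input.
count_vfs() {
  # grep -c exits non-zero when the count is 0, so normalize the exit status.
  grep -Ec '^[[:space:]]*vf [0-9]+ ' || true
}

# Usage on a node (replace <interface> with the PF name):
#   ip link show <interface> | count_vfs
```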

Configuring SR-IOV network devices

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io CustomResourceDefinition to OKD. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Network Operator might drain the nodes and, in some cases, reboot them.

It might take several minutes for a configuration change to apply.

Prerequisites

  • You installed the OpenShift CLI (oc).

  • You have access to the cluster as a user with the cluster-admin role.

  • You have installed the SR-IOV Network Operator.

  • You have enough available nodes in your cluster to handle the evicted workload from drained nodes.

  • You have not selected any control plane nodes for SR-IOV network device configuration.

Procedure

  1. Create an SriovNetworkNodePolicy object, and then save the YAML in the <name>-sriov-node-network.yaml file. Replace <name> with the name for this configuration.

  2. Optional: Label the SR-IOV capable cluster nodes with SriovNetworkNodePolicy.Spec.NodeSelector if they are not already labeled. For more information about labeling nodes, see “Understanding how to update labels on nodes”.

  3. Create the SriovNetworkNodePolicy object:

    $ oc create -f <name>-sriov-node-network.yaml

    where <name> specifies the name for this configuration.

    After applying the configuration update, all the pods in the openshift-sriov-network-operator namespace transition to the Running status.

  4. To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.

    $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
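
Because a configuration change can take several minutes to apply, you might prefer to poll the sync status rather than rerun the command by hand. The following shell function is a sketch under stated assumptions: wait_for_sync is a hypothetical name, and the status values Succeeded and Failed are assumed here, because this document does not list the possible values.

```shell
# Hypothetical helper: polls the sync status of a node until it succeeds.
# wait_for_sync <node_name> [attempts] [interval_seconds]
# Returns 0 on Succeeded, 2 on Failed, and 1 on timeout.
wait_for_sync() {
  node=$1
  attempts=${2:-30}
  interval=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(oc get sriovnetworknodestates \
      -n openshift-sriov-network-operator "$node" \
      -o jsonpath='{.status.syncStatus}')
    case "$status" in
      Succeeded) return 0 ;;                       # assumed success value
      Failed) echo "sync failed on $node" >&2
              return 2 ;;                          # assumed failure value
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $node" >&2
  return 1
}
```

For example, wait_for_sync <node_name> 60 10 checks the node every 10 seconds for up to 10 minutes.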

Additional resources

Troubleshooting SR-IOV configuration

The following sections address some error conditions that you might encounter after you follow the procedure to configure an SR-IOV network device.

To display the state of nodes, run the following command:

  $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name>

where <node_name> specifies the name of a node with an SR-IOV network device.

Error output: Cannot allocate memory

  "lastSyncError": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"

When a node indicates that it cannot allocate memory, check the following items:

  • Confirm that global SR-IOV settings are enabled in the BIOS for the node.

  • Confirm that VT-d is enabled in the BIOS for the node.
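
To print only the error message instead of the full node state, you can query the field directly. This sketch assumes that the lastSyncError key shown in the error output sits under .status in the SriovNetworkNodeState object. The function name last_sync_error is hypothetical.

```shell
# Hypothetical helper: prints the last sync error recorded for a node, if any.
# The .status.lastSyncError path is an assumption based on the error output.
last_sync_error() {
  oc get sriovnetworknodestates \
    -n openshift-sriov-network-operator "$1" \
    -o jsonpath='{.status.lastSyncError}'
}
```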

Assigning an SR-IOV network to a VRF

As a cluster administrator, you can assign an SR-IOV network interface to your VRF domain by using the CNI VRF plugin.

To do this, add the VRF configuration to the optional metaPlugins parameter of the SriovNetwork resource.

Applications that use VRFs need to bind to a specific device. The common usage is to use the SO_BINDTODEVICE option for a socket. SO_BINDTODEVICE binds the socket to a device that is specified in the passed interface name, for example, eth1. To use SO_BINDTODEVICE, the application must have CAP_NET_RAW capabilities.

Using a VRF through the ip vrf exec command is not supported in OKD pods. To use VRF, bind applications directly to the VRF interface.

Creating an additional SR-IOV network attachment with the CNI VRF plugin

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To create an additional SR-IOV network attachment with the CNI VRF plugin, perform the following procedure.

Prerequisites

  • Install the OKD CLI (oc).

  • Log in to the OKD cluster as a user with cluster-admin privileges.

Procedure

  1. Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: example-network
      namespace: additional-sriov-network-1
    spec:
      ipam: |
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "routes": [{
            "dst": "0.0.0.0/0"
          }],
          "gateway": "10.56.217.1"
        }
      vlan: 0
      resourceName: intelnics
      metaPlugins: |
        {
          "type": "vrf", (1)
          "vrfname": "example-vrf-name" (2)
        }
    (1) type must be set to vrf.
    (2) vrfname is the name of the VRF that the interface is assigned to. If it does not exist in the pod, it is created.
  2. Create the SriovNetwork resource:

    $ oc create -f sriov-network-attachment.yaml

Verifying that the NetworkAttachmentDefinition CR is successfully created

  • Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command.

    $ oc get network-attachment-definitions -n <namespace> (1)
    (1) Replace <namespace> with the namespace that you specified when configuring the network attachment, for example, additional-sriov-network-1.

    Example output

    NAME                         AGE
    additional-sriov-network-1   14m

    There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network attachment is successful

To verify that the VRF CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

  1. Create an SR-IOV network that uses the VRF CNI.

  2. Assign the network to a pod.

  3. Verify that the pod network attachment is connected to the SR-IOV additional network. Remote shell into the pod and run the following command:

    $ ip vrf show

    Example output

    Name              Table
    -----------------------
    red                 10
  4. Confirm that the VRF interface is the master of the secondary interface:

    $ ip link

    Example output

    ...
    5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master red state UP mode
    ...

Exclude the SR-IOV network topology for NUMA-aware scheduling

You can exclude advertising the Non-Uniform Memory Access (NUMA) node for the SR-IOV network to the Topology Manager for more flexible SR-IOV network deployments during NUMA-aware pod scheduling.

In some scenarios, it is a priority to maximize CPU and memory resources for a pod on a single NUMA node. By not providing a hint to the Topology Manager about the NUMA node for the pod’s SR-IOV network resource, the Topology Manager can deploy the SR-IOV network resource and the pod CPU and memory resources to different NUMA nodes. This can add to network latency because of the data transfer between NUMA nodes. However, it is acceptable in scenarios when workloads require optimal CPU and memory performance.

For example, consider a compute node, compute-1, that features two NUMA nodes: numa0 and numa1. The SR-IOV-enabled NIC is present on numa0. The CPUs available for pod scheduling are present on numa1 only. By setting the excludeTopology specification to true, the Topology Manager can assign CPU and memory resources for the pod to numa1 and can assign the SR-IOV network resource for the same pod to numa0. This is only possible when you set the excludeTopology specification to true. Otherwise, the Topology Manager attempts to place all resources on the same NUMA node.

Excluding the SR-IOV network topology for NUMA-aware scheduling

To exclude advertising the SR-IOV network resource’s Non-Uniform Memory Access (NUMA) node to the Topology Manager, you can configure the excludeTopology specification in the SriovNetworkNodePolicy custom resource. Use this configuration for more flexible SR-IOV network deployments during NUMA-aware pod scheduling.

Prerequisites

  • You have installed the OpenShift CLI (oc).

  • You have configured the CPU Manager policy to static. For more information about CPU Manager, see the Additional resources section.

  • You have configured the Topology Manager policy to single-numa-node.

  • You have installed the SR-IOV Network Operator.

Procedure

  1. Create the SriovNetworkNodePolicy CR:

    1. Save the following YAML in the sriov-network-node-policy.yaml file, replacing values in the YAML to match your environment:

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: <policy_name>
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: sriovnuma0 (1)
        nodeSelector:
          kubernetes.io/hostname: <node_name>
        numVfs: <number_of_Vfs>
        nicSelector: (2)
          vendor: "<vendor_ID>"
          deviceID: "<device_ID>"
        deviceType: netdevice
        excludeTopology: true (3)
      (1) The resource name of the SR-IOV network device plugin. This YAML uses a sample resourceName value.
      (2) Identify the device for the Operator to configure by using the NIC selector.
      (3) To exclude advertising the NUMA node for the SR-IOV network resource to the Topology Manager, set the value to true. The default value is false.

      If multiple SriovNetworkNodePolicy resources target the same SR-IOV network resource, the SriovNetworkNodePolicy resources must have the same value for the excludeTopology specification. Otherwise, the conflicting policy is rejected.

    2. Create the SriovNetworkNodePolicy resource by running the following command:

      $ oc create -f sriov-network-node-policy.yaml

      Example output

      sriovnetworknodepolicy.sriovnetwork.openshift.io/policy-for-numa-0 created
  2. Create the SriovNetwork CR:

    1. Save the following YAML in the sriov-network.yaml file, replacing values in the YAML to match your environment:

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetwork
      metadata:
        name: sriov-numa-0-network (1)
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: sriovnuma0 (2)
        networkNamespace: <namespace> (3)
        ipam: |- (4)
          {
            "type": "<ipam_type>"
          }
      (1) Replace sriov-numa-0-network with the name for the SR-IOV network resource.
      (2) Specify the resource name for the SriovNetworkNodePolicy CR from the previous step. This YAML uses a sample resourceName value.
      (3) Enter the namespace for your SR-IOV network resource.
      (4) Enter the IP address management configuration for the SR-IOV network.
    2. Create the SriovNetwork resource by running the following command:

      $ oc create -f sriov-network.yaml

      Example output

      sriovnetwork.sriovnetwork.openshift.io/sriov-numa-0-network created
  3. Create a pod and assign the SR-IOV network resource from the previous step:

    1. Save the following YAML in the sriov-network-pod.yaml file, replacing values in the YAML to match your environment:

      apiVersion: v1
      kind: Pod
      metadata:
        name: <pod_name>
        annotations:
          k8s.v1.cni.cncf.io/networks: |-
            [
              {
                "name": "sriov-numa-0-network" (1)
              }
            ]
      spec:
        containers:
        - name: <container_name>
          image: <image>
          imagePullPolicy: IfNotPresent
          command: ["sleep", "infinity"]
      (1) This is the name of the SriovNetwork resource that uses the SriovNetworkNodePolicy resource.
    2. Create the Pod resource by running the following command:

      $ oc create -f sriov-network-pod.yaml

      Example output

      pod/example-pod created

Verification

  1. Verify the status of the pod by running the following command, replacing <pod_name> with the name of the pod:

    $ oc get pod <pod_name>

    Example output

    NAME                                     READY   STATUS    RESTARTS   AGE
    test-deployment-sriov-76cbbf4756-k9v72   1/1     Running   0          45h
  2. Open a debug session with the target pod to verify that the SR-IOV network resources are deployed to a different NUMA node than the memory and CPU resources.

    1. Open a debug session with the pod by running the following command, replacing <pod_name> with the target pod name.

      $ oc debug pod/<pod_name>
    2. Set /host as the root directory within the debug shell. The debug pod mounts the root file system from the host in /host within the pod. By changing the root directory to /host, you can run binaries from the host file system:

      $ chroot /host
    3. View information about the CPU allocation by running the following commands:

      $ lscpu | grep NUMA

      Example output

      NUMA node(s):         2
      NUMA node0 CPU(s):    0,2,4,6,8,10,12,14,16,18,...
      NUMA node1 CPU(s):    1,3,5,7,9,11,13,15,17,19,...

      $ cat /proc/self/status | grep Cpus

      Example output

      Cpus_allowed:       aa
      Cpus_allowed_list:  1,3,5,7

      $ cat /sys/class/net/net1/device/numa_node

      Example output

      0

      In this example, CPUs 1, 3, 5, and 7 are allocated to NUMA node1 but the SR-IOV network resource can use the NIC in NUMA node0.

If the excludeTopology specification is set to true, it is still possible that the required resources exist in the same NUMA node.

Additional resources

Next steps