About Single Root I/O Virtualization (SR-IOV) hardware networks

The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.

SR-IOV enables you to segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV device driver for the device determines how the VF is exposed in the container:

  • netdevice driver: A regular kernel network device in the netns of the container

  • vfio-pci driver: A character device mounted in the container

You can use SR-IOV network devices with additional networks on your OKD cluster for applications that require high bandwidth or low latency.
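
For example, which driver a VF is bound to is typically selected through the deviceType field of an SriovNetworkNodePolicy custom resource. The following is a minimal sketch, assuming the SriovNetworkNodePolicy fields resourceName, nodeSelector, numVfs, nicSelector, and deviceType; the policy name, resource name, node label, and interface name are placeholders:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-netdevice                    # placeholder policy name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovnetdevice              # placeholder resource name advertised by the device plug-in
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"   # placeholder node label
  numVfs: 4                                 # number of VFs to create on the selected PF
  nicSelector:
    pfNames: ["ens785f0"]                   # placeholder PF interface name
  deviceType: netdevice                     # set to vfio-pci to expose the VF as a character device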

Components that manage SR-IOV network devices

The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. It performs the following functions:

  • Orchestrates discovery and management of SR-IOV network devices

  • Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI)

  • Creates and updates the configuration of the SR-IOV network device plug-in

  • Creates node-specific SriovNetworkNodeState custom resources

  • Updates the spec.interfaces field in each SriovNetworkNodeState custom resource
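
For illustration, a NetworkAttachmentDefinition generated for the SR-IOV CNI typically resembles the following sketch; the network name, namespace, resource name, and CNI configuration values are placeholders rather than output copied from a cluster:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-network                                              # placeholder network name
  namespace: default                                               # placeholder namespace
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/sriovnetdevice   # placeholder resource name
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "ipam": { "type": "host-local", "subnet": "10.56.217.0/24" }
  }'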

The Operator provisions the following components:

SR-IOV network configuration daemon

A DaemonSet that is deployed on worker nodes when the SR-IOV Network Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.

SR-IOV Operator webhook

A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.

SR-IOV Network resources injector

A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs.
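
For example, when a pod requests an additional network that is backed by SR-IOV VFs, the injector patches the container spec with a request and limit for the matching resource. A minimal sketch of the injected fields, using a placeholder resource name:

resources:
  requests:
    openshift.io/sriovnetdevice: "1"   # placeholder resource name; one VF requested
  limits:
    openshift.io/sriovnetdevice: "1"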

SR-IOV network device plug-in

A device plug-in that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plug-ins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plug-ins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
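
For example, the VFs advertised by the device plug-in appear among a node's allocatable resources, which is what the scheduler consults when placing pods. A minimal sketch of the relevant node status fields, with a placeholder resource name and VF count:

status:
  allocatable:
    openshift.io/sriovnetdevice: "8"   # placeholder resource name; number of VFs available on the node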

SR-IOV CNI plug-in

A CNI plug-in that attaches VF interfaces allocated from the SR-IOV device plug-in directly into a pod.

SR-IOV InfiniBand CNI plug-in

A CNI plug-in that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV device plug-in directly into a pod.

The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig CR.
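
For example, the following sketch of the default SriovOperatorConfig CR disables both components by setting the enableInjector and enableOperatorWebhook fields to false; edit the existing CR named default rather than creating a new one:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: false          # disables the SR-IOV Network resources injector
  enableOperatorWebhook: false   # disables the SR-IOV Network Operator webhook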

Supported devices

OKD supports the following network interface controllers:

Table 1. Supported network interface controllers

Manufacturer   Model                            Vendor ID   Device ID
Intel          X710                             8086        1572
Intel          XXV710                           8086        158b
Mellanox       MT27700 Family [ConnectX-4]      15b3        1013
Mellanox       MT27710 Family [ConnectX-4 Lx]   15b3        1015
Mellanox       MT27800 Family [ConnectX-5]      15b3        1017
Mellanox       MT28908 Family [ConnectX-6]      15b3        101b
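
The vendor and device IDs in the table are the values used to identify a NIC when you target it for configuration. For example, a hedged fragment of an SriovNetworkNodePolicy nicSelector stanza, assuming the nicSelector vendor and deviceID fields, could select ConnectX-5 devices as follows:

nicSelector:
  vendor: "15b3"     # Mellanox
  deviceID: "1017"   # MT27800 Family [ConnectX-5]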

Automated discovery of SR-IOV network devices

The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.

The CR is assigned the same name as the worker node. The status.interfaces list provides information about the network devices on a node.

Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.

Example SriovNetworkNodeState object

The following YAML is an example of a SriovNetworkNodeState object created by the SR-IOV Network Operator:

An SriovNetworkNodeState object

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: node-25 (1)
  namespace: openshift-sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovNetworkNodePolicy
    name: default
spec:
  dpConfigVersion: "39824"
status:
  interfaces: (2)
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f0
    pciAddress: "0000:18:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f1
    pciAddress: "0000:18:00.1"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f0
    pciAddress: 0000:81:00.0
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f1
    pciAddress: 0000:81:00.1
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens803f0
    pciAddress: 0000:86:00.0
    totalvfs: 64
    vendor: "8086"
  syncStatus: Succeeded
(1) The value of the name field is the same as the name of the worker node.
(2) The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node.

Example use of a virtual function in a pod

You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with an SR-IOV VF attached.

This example shows a pod using a virtual function (VF) in RDMA mode:

Pod spec that uses RDMA mode

apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
  containers:
  - name: testpmd
    image: <RDMA_image>
    imagePullPolicy: IfNotPresent
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    command: ["sleep", "infinity"]

The following example shows a pod with a VF in DPDK mode:

Pod spec that uses DPDK mode

apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

An optional library, app-netutil, is available to help an application running in a container gather network information associated with a pod. See the library's source code in the app-netutil GitHub repo.

This library is intended to ease the integration of SR-IOV VFs in DPDK mode into the container. The library provides both a Go API and a C API, as well as examples of using both languages.

There is also a sample Docker image, dpdk-app-centos, which can run one of the following DPDK sample applications based on an environment variable in the pod spec: l2fwd, l3fwd, or testpmd. This Docker image provides an example of integrating app-netutil into the container image itself. The library can also be integrated into an init container that collects the required data and passes it to an existing DPDK workload.

Next steps