About Single Root I/O Virtualization (SR-IOV) hardware networks

The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.

SR-IOV can segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV network device driver for the device determines how the VF is exposed in the container:

  • netdevice driver: A regular kernel network device in the netns of the container

  • vfio-pci driver: A character device mounted in the container

You can use SR-IOV network devices with additional networks on your OKD cluster installed on bare metal or Red Hat OpenStack Platform (RHOSP) infrastructure for applications that require high bandwidth or low latency.

Components that manage SR-IOV network devices

The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. It performs the following functions:

  • Orchestrates discovery and management of SR-IOV network devices

  • Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI)

  • Creates and updates the configuration of the SR-IOV network device plug-in

  • Creates node-specific SriovNetworkNodeState custom resources

  • Updates the spec.interfaces field in each SriovNetworkNodeState custom resource

The Operator provisions the following components:

SR-IOV network configuration daemon

A daemon set that is deployed on worker nodes when the SR-IOV Network Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.

SR-IOV Network Operator webhook

A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.

SR-IOV Network resources injector

A dynamic admission controller webhook that patches Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs. The SR-IOV network resources injector automatically adds the resource field to only the first container in a pod.

SR-IOV network device plug-in

A device plug-in that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plug-ins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plug-ins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.

SR-IOV CNI plug-in

A CNI plug-in that attaches VF interfaces allocated from the SR-IOV network device plug-in directly into a pod.

SR-IOV InfiniBand CNI plug-in

A CNI plug-in that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV network device plug-in directly into a pod.

The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig CR.

Supported platforms

The SR-IOV Network Operator is supported on the following platforms:

  • Bare metal

  • Red Hat OpenStack Platform (RHOSP)

Supported devices

OKD supports the following network interface controllers:

Table 1. Supported network interface controllers

Manufacturer  Model                            Vendor ID  Device ID
Broadcom      BCM57414                         14e4       16d7
Broadcom      BCM57508                         14e4       1750
Intel         X710                             8086       1572
Intel         XL710                            8086       1583
Intel         XXV710                           8086       158b
Intel         E810-CQDA2                       8086       1592
Intel         E810-XXVDA2                      8086       159b
Intel         E810-XXVDA4                      8086       1593
Mellanox      MT27700 Family [ConnectX‑4]      15b3       1013
Mellanox      MT27710 Family [ConnectX‑4 Lx]   15b3       1015
Mellanox      MT27800 Family [ConnectX‑5]      15b3       1017
Mellanox      MT28880 Family [ConnectX‑5 Ex]   15b3       1019
Mellanox      MT28908 Family [ConnectX‑6]      15b3       101b
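
As an illustration of how the vendor ID and device ID pairs identify a supported controller, the following minimal Python sketch looks up a pair in a dictionary transcribed from the table. The names SUPPORTED_NICS and lookup_nic are hypothetical and are not part of the SR-IOV Network Operator. The sketch assumes that 8086, the Intel PCI vendor ID, applies to all Intel entries.

```python
from typing import Optional

# Vendor ID / device ID pairs transcribed from the supported-NIC table.
# Assumption: every Intel entry uses 8086, the Intel PCI vendor ID.
SUPPORTED_NICS = {
    ("14e4", "16d7"): "Broadcom BCM57414",
    ("14e4", "1750"): "Broadcom BCM57508",
    ("8086", "1572"): "Intel X710",
    ("8086", "1583"): "Intel XL710",
    ("8086", "158b"): "Intel XXV710",
    ("8086", "1592"): "Intel E810-CQDA2",
    ("8086", "159b"): "Intel E810-XXVDA2",
    ("8086", "1593"): "Intel E810-XXVDA4",
    ("15b3", "1013"): "Mellanox MT27700 Family [ConnectX-4]",
    ("15b3", "1015"): "Mellanox MT27710 Family [ConnectX-4 Lx]",
    ("15b3", "1017"): "Mellanox MT27800 Family [ConnectX-5]",
    ("15b3", "1019"): "Mellanox MT28880 Family [ConnectX-5 Ex]",
    ("15b3", "101b"): "Mellanox MT28908 Family [ConnectX-6]",
}


def _normalize(pci_id: str) -> str:
    """Lowercase a hex ID and drop an optional 0x prefix."""
    text = pci_id.strip().lower()
    return text[2:] if text.startswith("0x") else text


def lookup_nic(vendor_id: str, device_id: str) -> Optional[str]:
    """Return the model name for a supported pair, or None if not listed."""
    return SUPPORTED_NICS.get((_normalize(vendor_id), _normalize(device_id)))


print(lookup_nic("15b3", "1017"))
print(lookup_nic("8086", "ffff"))
```

The same pair of values appears in the vendor and deviceID fields that the Operator reports for each discovered interface.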

Automated discovery of SR-IOV network devices

The SR-IOV Network Operator searches your cluster for SR-IOV-capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.

The CR is assigned the same name as the worker node. The status.interfaces list provides information about the network devices on a node.

Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.

Example SriovNetworkNodeState object

The following YAML is an example of a SriovNetworkNodeState object created by the SR-IOV Network Operator:

An SriovNetworkNodeState object

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodeState
  metadata:
    name: node-25 # (1)
    namespace: openshift-sriov-network-operator
    ownerReferences:
    - apiVersion: sriovnetwork.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: SriovNetworkNodePolicy
      name: default
  spec:
    dpConfigVersion: "39824"
  status:
    interfaces: # (2)
    - deviceID: "1017"
      driver: mlx5_core
      mtu: 1500
      name: ens785f0
      pciAddress: "0000:18:00.0"
      totalvfs: 8
      vendor: 15b3
    - deviceID: "1017"
      driver: mlx5_core
      mtu: 1500
      name: ens785f1
      pciAddress: "0000:18:00.1"
      totalvfs: 8
      vendor: 15b3
    - deviceID: 158b
      driver: i40e
      mtu: 1500
      name: ens817f0
      pciAddress: 0000:81:00.0
      totalvfs: 64
      vendor: "8086"
    - deviceID: 158b
      driver: i40e
      mtu: 1500
      name: ens817f1
      pciAddress: 0000:81:00.1
      totalvfs: 64
      vendor: "8086"
    - deviceID: 158b
      driver: i40e
      mtu: 1500
      name: ens803f0
      pciAddress: 0000:86:00.0
      totalvfs: 64
      vendor: "8086"
    syncStatus: Succeeded

(1) The value of the name field is the same as the name of the worker node.
(2) The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node.
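
To show how the status.interfaces list can be read, the following minimal Python sketch totals the totalvfs value per driver. It assumes the object has already been parsed into a dictionary; the dictionary here is a hand-copied subset of the fields in the example object, and the function name vf_capacity_by_driver is hypothetical.

```python
from collections import defaultdict

# Subset of the example SriovNetworkNodeState object, copied by hand.
node_state = {
    "metadata": {"name": "node-25"},
    "status": {
        "interfaces": [
            {"name": "ens785f0", "driver": "mlx5_core", "totalvfs": 8},
            {"name": "ens785f1", "driver": "mlx5_core", "totalvfs": 8},
            {"name": "ens817f0", "driver": "i40e", "totalvfs": 64},
            {"name": "ens817f1", "driver": "i40e", "totalvfs": 64},
            {"name": "ens803f0", "driver": "i40e", "totalvfs": 64},
        ],
        "syncStatus": "Succeeded",
    },
}


def vf_capacity_by_driver(state):
    """Sum the totalvfs value of every discovered interface, per driver."""
    totals = defaultdict(int)
    for iface in state.get("status", {}).get("interfaces", []):
        totals[iface["driver"]] += iface.get("totalvfs", 0)
    return dict(totals)


print(vf_capacity_by_driver(node_state))
# prints {'mlx5_core': 16, 'i40e': 192}
```

The sketch only reads the object, in keeping with the guidance not to modify a SriovNetworkNodeState object.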

Example use of a virtual function in a pod

You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with an SR-IOV VF attached.

This example shows a pod using a virtual function (VF) in RDMA mode:

Pod spec that uses RDMA mode

  apiVersion: v1
  kind: Pod
  metadata:
    name: rdma-app
    annotations:
      k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
  spec:
    containers:
    - name: testpmd
      image: <RDMA_image>
      imagePullPolicy: IfNotPresent
      securityContext:
        runAsUser: 0
        capabilities:
          add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
      command: ["sleep", "infinity"]

The following example shows a pod with a VF in DPDK mode:

Pod spec that uses DPDK mode

  apiVersion: v1
  kind: Pod
  metadata:
    name: dpdk-app
    annotations:
      k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
  spec:
    containers:
    - name: testpmd
      image: <DPDK_image>
      securityContext:
        runAsUser: 0
        capabilities:
          add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
      volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
      resources:
        limits:
          memory: "1Gi"
          cpu: "2"
          hugepages-1Gi: "4Gi"
        requests:
          memory: "1Gi"
          cpu: "2"
          hugepages-1Gi: "4Gi"
      command: ["sleep", "infinity"]
    volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
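
In the DPDK pod spec, the hugepages-1Gi request and limit carry the same value, which matches the Kubernetes rule that huge page requests must equal the limits. The following minimal Python sketch checks a container definition for that property. The helper name hugepages_mismatches is hypothetical, and the check compares the quantity strings literally, so it would flag equivalent quantities written in different units.

```python
def hugepages_mismatches(container):
    """Return the hugepages-* resource names whose request differs from the limit.

    A missing request is treated as equal to the limit, because Kubernetes
    defaults the request to the limit when only the limit is set.
    """
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    names = {
        name
        for name in list(requests) + list(limits)
        if name.startswith("hugepages-")
    }
    return sorted(
        name
        for name in names
        if requests.get(name, limits.get(name)) != limits.get(name)
    )


# Container resources copied from the DPDK pod spec example.
testpmd = {
    "name": "testpmd",
    "resources": {
        "limits": {"memory": "1Gi", "cpu": "2", "hugepages-1Gi": "4Gi"},
        "requests": {"memory": "1Gi", "cpu": "2", "hugepages-1Gi": "4Gi"},
    },
}

print(hugepages_mismatches(testpmd))
# prints []
```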

DPDK library for use with container applications

An optional library, app-netutil, provides several API methods for gathering network information about a pod from within a container running within that pod.

This library can assist with integrating SR-IOV virtual functions (VFs) in Data Plane Development Kit (DPDK) mode into the container. The library provides both a Golang API and a C API.

The library currently implements three API methods:

GetCPUInfo()

This function determines which CPUs are available to the container and returns the list.

GetHugepages()

This function determines the amount of huge page memory requested in the Pod spec for each container and returns the values.

GetInterfaces()

This function determines the set of interfaces in the container and returns the list. The return value includes the interface type and type-specific data for each interface.

The repository for the library includes a sample Dockerfile to build a container image, dpdk-app-centos. The container image can run one of the following DPDK sample applications, depending on an environment variable in the pod specification: l2fwd, l3fwd, or testpmd. The container image provides an example of integrating the app-netutil library into the container image itself. The library can also integrate into an init container. The init container can collect the required data and pass the data to an existing DPDK workload.

Huge pages resource injection for Downward API

When a pod specification includes a resource request or limit for huge pages, the Network Resources Injector automatically adds Downward API fields to the pod specification to provide the huge pages information to the container.

The Network Resources Injector adds a volume that is named podnetinfo and is mounted at /etc/podnetinfo for each container in the pod. The volume uses the Downward API and includes a file for huge pages requests and limits. The file naming convention is as follows:

  • /etc/podnetinfo/hugepages_1G_request_<container-name>

  • /etc/podnetinfo/hugepages_1G_limit_<container-name>

  • /etc/podnetinfo/hugepages_2M_request_<container-name>

  • /etc/podnetinfo/hugepages_2M_limit_<container-name>

The paths specified in the previous list are compatible with the app-netutil library. By default, the library is configured to search for resource information in the /etc/podnetinfo directory. If you choose to specify the Downward API path items manually, the app-netutil library searches for the following paths in addition to the paths in the previous list:

  • /etc/podnetinfo/hugepages_request

  • /etc/podnetinfo/hugepages_limit

  • /etc/podnetinfo/hugepages_1G_request

  • /etc/podnetinfo/hugepages_1G_limit

  • /etc/podnetinfo/hugepages_2M_request

  • /etc/podnetinfo/hugepages_2M_limit

As with the paths that the Network Resources Injector can create, the paths in the preceding list can optionally end with a _<container-name> suffix.
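
The file naming convention described above can be sketched in Python as follows. This is not the app-netutil implementation; the function name read_hugepages_info and the returned dictionary keys are hypothetical. The sketch returns each file's content as a raw string, because this section does not state the unit of the stored value.

```python
import re
from pathlib import Path

# hugepages_[<size>_]<request|limit>[_<container-name>]
_HUGEPAGES_FILE = re.compile(
    r"^hugepages_(?:(1G|2M)_)?(request|limit)(?:_(.+))?$"
)


def read_hugepages_info(directory="/etc/podnetinfo"):
    """Collect huge pages entries from files that follow the naming convention."""
    base = Path(directory)
    if not base.is_dir():
        return []
    entries = []
    for path in sorted(base.iterdir()):
        match = _HUGEPAGES_FILE.match(path.name)
        if match is None or not path.is_file():
            continue  # skip unrelated Downward API files
        page_size, kind, container = match.groups()
        entries.append(
            {
                "page_size": page_size,  # "1G", "2M", or None if unspecified
                "kind": kind,            # "request" or "limit"
                "container": container,  # None when there is no suffix
                "value": path.read_text().strip(),
            }
        )
    return entries


if __name__ == "__main__":
    for entry in read_hugepages_info():
        print(entry)
```

When the directory does not exist, for example outside a pod, the function returns an empty list.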

Next steps