Using DPDK and RDMA

The containerized Data Plane Development Kit (DPDK) application is supported on OKD. You can use Single Root I/O Virtualization (SR-IOV) network hardware with DPDK and with remote direct memory access (RDMA).

For information on supported devices, refer to Supported devices.

Using a virtual function in DPDK mode with an Intel NIC

Prerequisites

  • Install the OpenShift CLI (oc).

  • Install the SR-IOV Network Operator.

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the intel-dpdk-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: intel-dpdk-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: intelnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "8086"
        deviceID: "158b"
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: vfio-pci # (1)

    (1) Set the driver type for the virtual functions to vfio-pci.

    See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f intel-dpdk-node-policy.yaml

  3. Create the following SriovNetwork object, and then save the YAML in the intel-dpdk-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: intel-dpdk-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |-
        # ... (1)
      vlan: <vlan>
      resourceName: intelnics

    (1) Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.

    See the “Configuring SR-IOV additional network” section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetwork object by running the following command:

    $ oc create -f intel-dpdk-network.yaml

  5. Create the following Pod spec, and then save the YAML in the intel-dpdk-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-app
      namespace: <target_namespace> # (1)
      annotations:
        k8s.v1.cni.cncf.io/networks: intel-dpdk-network
    spec:
      containers:
      - name: testpmd
        image: <DPDK_image> # (2)
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] # (3)
        volumeMounts:
        - mountPath: /dev/hugepages # (4)
          name: hugepage
        resources:
          limits:
            openshift.io/intelnics: "1" # (5)
            memory: "1Gi"
            cpu: "4" # (6)
            hugepages-1Gi: "4Gi" # (7)
          requests:
            openshift.io/intelnics: "1"
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages

    (1) Specify the same target_namespace where the SriovNetwork object intel-dpdk-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
    (2) Specify the DPDK image that includes your application and the DPDK library used by the application.
    (3) Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    (4) Mount a hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
    (5) Optional: Specify the number of DPDK devices allocated to the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
    (6) Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. To do this, set the CPU Manager policy to static and create a pod with Guaranteed QoS.
    (7) Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes. For example, adding the kernel arguments default_hugepagesz=1GB, hugepagesz=1G, and hugepages=16 results in 16 hugepages of 1Gi each being allocated during system boot.

  6. Create the DPDK pod by running the following command:

    $ oc create -f intel-dpdk-pod.yaml
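
The relationship between the hugepages quantity in the Pod spec and the number of pages a node has to provide can be sketched with a little shell arithmetic. This is only an illustration: the helper names to_mib and hugepages_needed are hypothetical (they are not part of oc or DPDK), only the Mi and Gi suffixes are understood, and the sketch simply rejects a request that is not a whole number of pages.

```shell
#!/bin/bash
# Hypothetical helper: convert a quantity with a Mi or Gi suffix to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print how many pages of size $2 back a request of $1.
hugepages_needed() {
  local request_mib page_mib
  request_mib=$(to_mib "$1") || return 1
  page_mib=$(to_mib "$2") || return 1
  if [ $(( request_mib % page_mib )) -ne 0 ]; then
    echo "request $1 is not a whole number of $2 pages" >&2
    return 1
  fi
  echo $(( request_mib / page_mib ))
}

# The example pod requests hugepages-1Gi: "4Gi", that is, 4 pages of 1Gi.
hugepages_needed 4Gi 1Gi
# The same amount of memory expressed in 2Mi pages.
hugepages_needed 4Gi 2Mi
```

With the kernel arguments from callout (7), a node boots with 16 pages of 1Gi, so under these assumptions it could back four pods shaped like the example.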

Using a virtual function in DPDK mode with a Mellanox NIC

You can create a network node policy and create a Data Plane Development Kit (DPDK) pod using a virtual function in DPDK mode with a Mellanox NIC.

Prerequisites

  • You have installed the OpenShift CLI (oc).

  • You have installed the Single Root I/O Virtualization (SR-IOV) Network Operator.

  • You have logged in as a user with cluster-admin privileges.

Procedure

  1. Save the following SriovNetworkNodePolicy YAML configuration to an mlx-dpdk-node-policy.yaml file:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlx-dpdk-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: mlxnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "15b3"
        deviceID: "1015" # (1)
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: netdevice # (2)
      isRdma: true # (3)

    (1) Specify the device hex code of the SR-IOV network device.
    (2) Set the driver type for the virtual functions to netdevice. A Mellanox SR-IOV Virtual Function (VF) can work in DPDK mode without using the vfio-pci device type. The VF device appears as a kernel network interface inside a container.
    (3) Enable Remote Direct Memory Access (RDMA) mode. This is required for Mellanox cards to work in DPDK mode.

    See Configuring an SR-IOV network device for a detailed explanation of each option in the SriovNetworkNodePolicy object.

    When applying the configuration specified in an SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-dpdk-node-policy.yaml

  3. Save the following SriovNetwork YAML configuration to an mlx-dpdk-network.yaml file:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlx-dpdk-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |- # (1)
        ...
      vlan: <vlan>
      resourceName: mlxnics

    (1) Specify a configuration object for the IP Address Management (IPAM) Container Network Interface (CNI) plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.

    See Configuring an SR-IOV network device for a detailed explanation on each option in the SriovNetwork object.

    The optional app-netutil library provides several API methods for gathering network information about the parent pod of a container.

  4. Create the SriovNetwork object by running the following command:

    $ oc create -f mlx-dpdk-network.yaml

  5. Save the following Pod YAML configuration to an mlx-dpdk-pod.yaml file:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-app
      namespace: <target_namespace> # (1)
      annotations:
        k8s.v1.cni.cncf.io/networks: mlx-dpdk-network
    spec:
      containers:
      - name: testpmd
        image: <DPDK_image> # (2)
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] # (3)
        volumeMounts:
        - mountPath: /dev/hugepages # (4)
          name: hugepage
        resources:
          limits:
            openshift.io/mlxnics: "1" # (5)
            memory: "1Gi"
            cpu: "4" # (6)
            hugepages-1Gi: "4Gi" # (7)
          requests:
            openshift.io/mlxnics: "1"
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages

    (1) Specify the same target_namespace where the SriovNetwork object mlx-dpdk-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
    (2) Specify the DPDK image that includes your application and the DPDK library used by the application.
    (3) Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    (4) Mount the hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
    (5) Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
    (6) Specify the number of CPUs. The DPDK pod usually requires that exclusive CPUs be allocated from the kubelet. To do this, set the CPU Manager policy to static and create a pod with Guaranteed Quality of Service (QoS).
    (7) Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes.

  6. Create the DPDK pod by running the following command:

    $ oc create -f mlx-dpdk-pod.yaml
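
Callout (6) depends on the static CPU Manager policy, which in upstream Kubernetes assigns exclusive CPUs only to containers in Guaranteed pods that request a whole number of CPUs. The following sketch checks that condition for the cpu values of a Pod spec before you create it. The function names are hypothetical, and the check is deliberately narrow: it accepts only plain integers such as "4", it does not understand millicore values, and it does not verify the memory or hugepages entries that Guaranteed QoS also requires to match.

```shell
#!/bin/bash
# Hypothetical helper: succeed only for a plain positive integer such as "4".
is_whole_cpu() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: $1 is the cpu request, $2 is the cpu limit.
# Succeeds when the two are equal and a whole number of CPUs.
exclusive_cpus_possible() {
  [ "$1" = "$2" ] && is_whole_cpu "$1"
}

# Values from the example Pod spec: requests.cpu "4" and limits.cpu "4".
if exclusive_cpus_possible 4 4; then
  echo "cpu request and limit are suitable for exclusive CPUs"
fi
```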

Overview of achieving a specific DPDK line rate

To achieve a specific Data Plane Development Kit (DPDK) line rate, deploy a Node Tuning Operator and configure Single Root I/O Virtualization (SR-IOV). You must also tune the DPDK settings for the following resources:

  • Isolated CPUs

  • Hugepages

  • The topology scheduler

In previous versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OKD applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator.

DPDK test environment

The following diagram shows the components of a traffic-testing environment:

[Figure: DPDK test environment]

  • Traffic generator: An application that can generate high-volume packet traffic.

  • SR-IOV-supporting NIC: A network interface card compatible with SR-IOV. The card runs a number of virtual functions on a physical interface.

  • Physical Function (PF): A PCI Express (PCIe) function of a network adapter that supports the SR-IOV interface.

  • Virtual Function (VF): A lightweight PCIe function on a network adapter that supports SR-IOV. The VF is associated with the PCIe PF on the network adapter. The VF represents a virtualized instance of the network adapter.

  • Switch: A network switch. Nodes can also be connected back-to-back.

  • testpmd: An example application included with DPDK. The testpmd application can be used to test the DPDK in a packet-forwarding mode. The testpmd application is also an example of how to build a fully-fledged application using the DPDK Software Development Kit (SDK).

  • worker 0 and worker 1: OKD nodes.

Using SR-IOV and the Node Tuning Operator to achieve a DPDK line rate

You can use the Node Tuning Operator to configure isolated CPUs, hugepages, and a topology scheduler. You can then use the Node Tuning Operator with Single Root I/O Virtualization (SR-IOV) to achieve a specific Data Plane Development Kit (DPDK) line rate.

Prerequisites

  • You have installed the OpenShift CLI (oc).

  • You have installed the SR-IOV Network Operator.

  • You have logged in as a user with cluster-admin privileges.

  • You have deployed a standalone Node Tuning Operator.

    In previous versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator.

Procedure

  1. Create a PerformanceProfile object based on the following example:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: performance
    spec:
      globallyDisableIrqLoadBalancing: true
      cpu:
        isolated: 21-51,73-103 # (1)
        reserved: 0-20,52-72 # (2)
      hugepages:
        defaultHugepagesSize: 1G # (3)
        pages:
          - count: 32
            size: 1G
      net:
        userLevelNetworking: true
      numa:
        topologyPolicy: "single-numa-node"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""

    (1) If hyperthreading is enabled on the system, allocate the relevant symbolic links to the isolated and reserved CPU groups. If the system contains multiple non-uniform memory access nodes (NUMAs), allocate CPUs from both NUMAs to both groups. You can also use the Performance Profile Creator for this task. For more information, see Creating a performance profile.
    (2) You can also specify a list of devices that will have their queues set to the reserved CPU count. For more information, see Reducing NIC queues using the Node Tuning Operator.
    (3) Allocate the number and size of hugepages needed. You can specify the NUMA configuration for the hugepages. By default, the system allocates an even number to every NUMA node on the system. If needed, you can request the use of a realtime kernel for the nodes. See Provisioning a worker with real-time capabilities for more information.

  2. Save the YAML file as mlx-dpdk-perfprofile-policy.yaml.

  3. Apply the performance profile using the following command:

    $ oc create -f mlx-dpdk-perfprofile-policy.yaml
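
Before applying a profile, it can help to sanity-check the isolated and reserved CPU lists: they should not share any CPU, and together they should account for the CPUs you expect on the node. The sketch below expands lists written in the same range syntax as the example profile. The function names are hypothetical, and the sketch assumes well-formed input (comma-separated numbers or ascending start-end ranges).

```shell
#!/bin/bash
# Hypothetical helper: expand a list such as "0-2,5" into one CPU ID per line.
expand_cpu_list() {
  echo "$1" | tr ',' '\n' | while IFS= read -r part; do
    [ -n "$part" ] || continue
    case "$part" in
      *-*) start=${part%-*}; end=${part#*-} ;;
      *)   start=$part; end=$part ;;
    esac
    cpu=$start
    while [ "$cpu" -le "$end" ]; do
      echo "$cpu"
      cpu=$((cpu + 1))
    done
  done
}

# Hypothetical helper: print the number of CPUs in a list.
count_cpus() {
  expand_cpu_list "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: print CPU IDs present in both lists, one per line.
# Empty output means the lists do not overlap.
overlapping_cpus() {
  { expand_cpu_list "$1" | sort -u; expand_cpu_list "$2" | sort -u; } \
    | sort -n | uniq -d
}

isolated="21-51,73-103"
reserved="0-20,52-72"
echo "isolated: $(count_cpus "$isolated") CPUs"   # 62 for the example profile
echo "reserved: $(count_cpus "$reserved") CPUs"   # 42 for the example profile
if [ -z "$(overlapping_cpus "$isolated" "$reserved")" ]; then
  echo "no CPU appears in both lists"
fi
```

For the example profile the two lists cover 104 CPUs in total, which is the figure to compare against the CPU count of the target worker node.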

Example SR-IOV Network Operator for virtual functions

You can use the Single Root I/O Virtualization (SR-IOV) Network Operator to allocate and configure Virtual Functions (VFs) from SR-IOV-supporting Physical Function NICs on the nodes.

For more information on deploying the Operator, see Installing the SR-IOV Network Operator. For more information on configuring an SR-IOV network device, see Configuring an SR-IOV network device.

There are some differences between running Data Plane Development Kit (DPDK) workloads on Intel VFs and Mellanox VFs. This section provides object configuration examples for both VF types. The following is an example of an sriovNetworkNodePolicy object used to run DPDK applications on Intel NICs:

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: dpdk-nic-1
    namespace: openshift-sriov-network-operator
  spec:
    deviceType: vfio-pci # (1)
    needVhostNet: true # (2)
    nicSelector:
      pfNames: ["ens3f0"]
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
    numVfs: 10
    priority: 99
    resourceName: dpdk_nic_1
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: dpdk-nic-2
    namespace: openshift-sriov-network-operator
  spec:
    deviceType: vfio-pci
    needVhostNet: true
    nicSelector:
      pfNames: ["ens3f1"]
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
    numVfs: 10
    priority: 99
    resourceName: dpdk_nic_2

(1) For Intel NICs, deviceType must be vfio-pci.
(2) If kernel communication with DPDK workloads is required, add needVhostNet: true. This mounts the /dev/net/tun and /dev/vhost-net devices into the container so the application can create a tap device and connect the tap device to the DPDK workload.

The following is an example of an sriovNetworkNodePolicy object for Mellanox NICs:

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: dpdk-nic-1
    namespace: openshift-sriov-network-operator
  spec:
    deviceType: netdevice # (1)
    isRdma: true # (2)
    nicSelector:
      rootDevices:
        - "0000:5e:00.1"
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
    numVfs: 5
    priority: 99
    resourceName: dpdk_nic_1
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: dpdk-nic-2
    namespace: openshift-sriov-network-operator
  spec:
    deviceType: netdevice
    isRdma: true
    nicSelector:
      rootDevices:
        - "0000:5e:00.0"
    nodeSelector:
      node-role.kubernetes.io/worker-cnf: ""
    numVfs: 5
    priority: 99
    resourceName: dpdk_nic_2

(1) For Mellanox devices, deviceType must be netdevice.
(2) For Mellanox devices, isRdma must be true. Mellanox cards are connected to DPDK applications using Flow Bifurcation. This mechanism splits traffic between Linux user space and kernel space, and can enhance line rate processing capability.

Example SR-IOV network operator

The following is an example definition of an sriovNetwork object. In this case, Intel and Mellanox configurations are identical:

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetwork
  metadata:
    name: dpdk-network-1
    namespace: openshift-sriov-network-operator
  spec:
    ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.1.0/24"}]],"dataDir":
      "/run/my-orchestrator/container-ipam-state-1"}' # (1)
    networkNamespace: dpdk-test # (2)
    spoofChk: "off"
    trust: "on"
    resourceName: dpdk_nic_1 # (3)
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetwork
  metadata:
    name: dpdk-network-2
    namespace: openshift-sriov-network-operator
  spec:
    ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.2.0/24"}]],"dataDir":
      "/run/my-orchestrator/container-ipam-state-1"}'
    networkNamespace: dpdk-test
    spoofChk: "off"
    trust: "on"
    resourceName: dpdk_nic_2

(1) You can use a different IP Address Management (IPAM) implementation, such as Whereabouts. For more information, see Dynamic IP address assignment configuration with Whereabouts.
(2) You must request the networkNamespace where the network attachment definition will be created. You must create the sriovNetwork CR under the openshift-sriov-network-operator namespace.
(3) The resourceName value must match that of the resourceName created under the sriovNetworkNodePolicy.
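
The resourceName also determines the environment variable through which a pod learns the PCI addresses of its allocated VFs. The example testpmd script in this document reads PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1 for the resource dpdk_nic_1. The sketch below derives a name of that shape from a full resource name. It is an approximation, not the device plugin's own code: the function name is hypothetical, and replacing every non-alphanumeric character with an underscore is an assumption that happens to reproduce the variable names used in this document.

```shell
#!/bin/bash
# Hypothetical helper: $1 is a full resource name such as
# openshift.io/dpdk_nic_1. Prints the matching PCIDEVICE_* variable name,
# assuming upper-casing plus underscore substitution is sufficient.
pcidevice_env_name() {
  local suffix
  suffix=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -c '[:alnum:]' '_')
  printf 'PCIDEVICE_%s\n' "$suffix"
}

pcidevice_env_name "openshift.io/dpdk_nic_1"
pcidevice_env_name "openshift.io/dpdk_nic_2"
```

Inside a running pod, listing the environment (for example with env) remains the reliable way to confirm the actual variable names.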

Example DPDK base workload

The following is an example of a Data Plane Development Kit (DPDK) container:

  apiVersion: v1
  kind: Namespace
  metadata:
    name: dpdk-test
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      # (1)
      k8s.v1.cni.cncf.io/networks: '[
          {
            "name": "dpdk-network-1",
            "namespace": "dpdk-test"
          },
          {
            "name": "dpdk-network-2",
            "namespace": "dpdk-test"
          }
        ]'
      irq-load-balancing.crio.io: "disable" # (2)
      cpu-load-balancing.crio.io: "disable"
      cpu-quota.crio.io: "disable"
    labels:
      app: dpdk
    name: testpmd
    namespace: dpdk-test
  spec:
    runtimeClassName: performance-performance # (3)
    containers:
      - command:
          - /bin/bash
          - -c
          - sleep INF
        image: registry.redhat.io/openshift4/dpdk-base-rhel8
        imagePullPolicy: Always
        name: dpdk
        resources: # (4)
          limits:
            cpu: "16"
            hugepages-1Gi: 8Gi
            memory: 2Gi
          requests:
            cpu: "16"
            hugepages-1Gi: 8Gi
            memory: 2Gi
        securityContext:
          capabilities:
            add:
              - IPC_LOCK
              - SYS_RESOURCE
              - NET_RAW
              - NET_ADMIN
          runAsUser: 0
        volumeMounts:
          - mountPath: /mnt/huge
            name: hugepages
    terminationGracePeriodSeconds: 5
    volumes:
      - emptyDir:
          medium: HugePages
        name: hugepages

(1) Request the SR-IOV networks you need. Resources for the devices will be injected automatically.
(2) Disable CPU and IRQ load balancing. See Disabling interrupt processing for individual pods for more information.
(3) Set the runtimeClass to performance-performance. Do not set the runtimeClass to HostNetwork or privileged.
(4) Request an equal number of resources for requests and limits to start the pod with Guaranteed Quality of Service (QoS).

Do not start the pod with SLEEP and then exec into the pod to start the testpmd or the DPDK workload. This can add additional interrupts as the exec process is not pinned to any CPU.

Example testpmd script

The following is an example script for running testpmd:

  #!/bin/bash
  set -ex
  export CPU=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
  echo ${CPU}
  dpdk-testpmd -l ${CPU} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_2} -n 4 -- -i --nb-cores=15 --rxd=4096 --txd=4096 --rxq=7 --txq=7 --forward-mode=mac --eth-peer=0,50:00:00:00:00:01 --eth-peer=1,50:00:00:00:00:02

This example uses two different sriovNetwork CRs. The environment variable contains the Virtual Function (VF) PCI address that was allocated for the pod. If you use the same network in the pod definition, you must split the pciAddress. It is important to configure the correct MAC addresses of the traffic generator. This example uses custom MAC addresses.
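
When both VFs come from the same network, a single environment variable carries more than one PCI address and has to be split before it is passed to dpdk-testpmd. The following sketch turns such a value into repeated -a arguments. It assumes the addresses are separated by commas, which you should confirm by inspecting the variable in your own pod; the function name pci_allow_args and the sample addresses are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: $1 is the value of a PCIDEVICE_* variable, assumed to
# be a comma-separated list of PCI addresses. Prints "-a <addr>" for each one.
pci_allow_args() {
  local addr args="" old_ifs="$IFS"
  IFS=','
  for addr in $1; do
    args="${args:+$args }-a $addr"
  done
  IFS="$old_ifs"
  printf '%s\n' "$args"
}

# Sample value with two made-up VF addresses.
pci_allow_args "0000:5e:00.2,0000:5e:00.3"
```

In the testpmd script, the two separate -a options could then be replaced by an unquoted $(pci_allow_args "${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1}") so that each address becomes its own argument.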

Using a virtual function in RDMA mode with a Mellanox NIC

RDMA over Converged Ethernet (RoCE) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.

RDMA over Converged Ethernet (RoCE) is the only supported mode when using RDMA on OKD.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Install the SR-IOV Network Operator.

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-rdma-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlx-rdma-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: mlxnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "15b3"
        deviceID: "1015" # (1)
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: netdevice # (2)
      isRdma: true # (3)

    (1) Specify the device hex code of the SR-IOV network device. The only allowed values for Mellanox cards are 1015 and 1017.
    (2) Set the driver type for the virtual functions to netdevice.
    (3) Enable RDMA mode.

    See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-rdma-node-policy.yaml

  3. Create the following SriovNetwork object, and then save the YAML in the mlx-rdma-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlx-rdma-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |- # (1)
        # ...
      vlan: <vlan>
      resourceName: mlxnics

    (1) Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.

    See the “Configuring SR-IOV additional network” section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetwork object by running the following command:

    $ oc create -f mlx-rdma-network.yaml

  5. Create the following Pod spec, and then save the YAML in the mlx-rdma-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: rdma-app
      namespace: <target_namespace> # (1)
      annotations:
        k8s.v1.cni.cncf.io/networks: mlx-rdma-network
    spec:
      containers:
      - name: testpmd
        image: <RDMA_image> # (2)
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] # (3)
        volumeMounts:
        - mountPath: /dev/hugepages # (4)
          name: hugepage
        resources:
          limits:
            memory: "1Gi"
            cpu: "4" # (5)
            hugepages-1Gi: "4Gi" # (6)
          requests:
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages

    (1) Specify the same target_namespace where the SriovNetwork object mlx-rdma-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
    (2) Specify the RDMA image that includes your application and the RDMA library used by the application.
    (3) Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    (4) Mount the hugepage volume to the RDMA pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being HugePages.
    (5) Specify the number of CPUs. The RDMA pod usually requires exclusive CPUs to be allocated from the kubelet. To do this, set the CPU Manager policy to static and create a pod with Guaranteed QoS.
    (6) Specify the hugepage size, hugepages-1Gi or hugepages-2Mi, and the quantity of hugepages that will be allocated to the RDMA pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes.

  6. Create the RDMA pod by running the following command:

    $ oc create -f mlx-rdma-pod.yaml
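
Because this procedure states that only the device IDs 1015 and 1017 are allowed for Mellanox cards in RDMA mode, a small pre-flight check can catch a mistyped deviceID before the policy is created. The following is a minimal sketch: the function name is hypothetical, and the list of accepted IDs is taken directly from callout (1) of the node policy in this procedure rather than from any API.

```shell
#!/bin/bash
# Hypothetical helper: succeed only for the device IDs that this procedure
# lists as allowed for Mellanox cards in RDMA mode.
is_allowed_mlx_rdma_device_id() {
  case "$1" in
    1015|1017) return 0 ;;
    *) return 1 ;;
  esac
}

device_id="1015"   # value used in the example node policy
if is_allowed_mlx_rdma_device_id "$device_id"; then
  echo "deviceID $device_id is in the allowed list"
else
  echo "deviceID $device_id is not in the allowed list" >&2
fi
```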

A test pod template for clusters that use OVS-DPDK on OpenStack

The following testpmd pod demonstrates container creation with huge pages, reserved CPUs, and the SR-IOV port.

An example testpmd pod

  apiVersion: v1
  kind: Pod
  metadata:
    name: testpmd-dpdk
    namespace: mynamespace
    annotations:
      cpu-load-balancing.crio.io: "disable"
      cpu-quota.crio.io: "disable"
  # ...
  spec:
    containers:
    - name: testpmd
      command: ["sleep", "99999"]
      image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
      securityContext:
        capabilities:
          add: ["IPC_LOCK","SYS_ADMIN"]
        privileged: true
        runAsUser: 0
      resources:
        requests:
          memory: 1000Mi
          hugepages-1Gi: 1Gi
          cpu: '2'
          openshift.io/dpdk1: 1 # (1)
        limits:
          hugepages-1Gi: 1Gi
          cpu: '2'
          memory: 1000Mi
          openshift.io/dpdk1: 1
      volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
          readOnly: False
    runtimeClassName: performance-cnf-performanceprofile # (2)
    volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages

(1) The name dpdk1 in this example is a user-created SriovNetworkNodePolicy resource. You can substitute this name for that of a resource that you create.
(2) If your performance profile is not named cnf-performanceprofile, replace that string with the correct performance profile name.

A test pod template for clusters that use OVS hardware offloading on OpenStack

The following testpmd pod demonstrates Open vSwitch (OVS) hardware offloading on OpenStack.

An example testpmd pod

  apiVersion: v1
  kind: Pod
  metadata:
    name: testpmd-sriov
    namespace: mynamespace
    annotations:
      k8s.v1.cni.cncf.io/networks: hwoffload1
  spec:
    runtimeClassName: performance-cnf-performanceprofile # (1)
    containers:
    - name: testpmd
      command: ["sleep", "99999"]
      image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
      securityContext:
        capabilities:
          add: ["IPC_LOCK","SYS_ADMIN"]
        privileged: true
        runAsUser: 0
      resources:
        requests:
          memory: 1000Mi
          hugepages-1Gi: 1Gi
          cpu: '2'
        limits:
          hugepages-1Gi: 1Gi
          cpu: '2'
          memory: 1000Mi
      volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
          readOnly: False
    volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages

(1) If your performance profile is not named cnf-performanceprofile, replace that string with the correct performance profile name.

Additional resources