Recommended single-node OpenShift cluster configuration for vDU application workloads

Use the following reference information to understand the single-node OpenShift configurations required to deploy virtual distributed unit (vDU) applications in the cluster. Configurations include cluster optimizations for high performance workloads, enabling workload partitioning, and minimizing the number of reboots required post-installation.

Running low latency applications on OKD

OKD enables low latency processing for applications running on commercial off-the-shelf (COTS) hardware by using several technologies and specialized hardware devices:

Real-time kernel for RHCOS

Ensures workloads are handled with a high degree of process determinism.

CPU isolation

Avoids CPU scheduling delays and ensures CPU capacity is available consistently.

NUMA-aware topology management

Aligns memory and huge pages with CPU and PCI devices to pin guaranteed container memory and huge pages to the non-uniform memory access (NUMA) node. Pod resources for all Quality of Service (QoS) classes stay on the same NUMA node. This decreases latency and improves performance of the node.

Huge pages memory management

Using huge page sizes improves system performance by reducing the amount of system resources required to access page tables.

Precision timing synchronization using PTP

Allows synchronization between nodes in the network with sub-microsecond accuracy.

Running vDU application workloads requires a bare-metal host with sufficient resources to run OKD services and production workloads.

Table 1. Minimum resource requirements

Profile    vCPU                 Memory        Storage
Minimum    4 to 8 vCPU cores    32GB of RAM   120GB

One vCPU is equivalent to one physical core when simultaneous multithreading (SMT), or Hyper-Threading, is not enabled. When enabled, use the following formula to calculate the corresponding ratio:

  • (threads per core × cores) × sockets = vCPUs
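
For example, a dual-socket server with 26 cores per socket and SMT enabled provides (2 threads per core × 26 cores) × 2 sockets = 104 vCPUs.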

The server must have a Baseboard Management Controller (BMC) when booting with virtual media.

Configuring host firmware for low latency and high performance

Bare-metal hosts require the firmware to be configured before the host can be provisioned. The firmware configuration is dependent on the specific hardware and the particular requirements of your installation.

Procedure

  1. Set the UEFI/BIOS Boot Mode to UEFI.

  2. In the host boot sequence order, set Hard drive first.

  3. Apply the specific firmware configuration for your hardware. The following table describes a representative firmware configuration for an Intel Xeon Skylake or Intel Cascade Lake server, based on the Intel FlexRAN 4G and 5G baseband PHY reference design.

    The exact firmware configuration depends on your specific hardware and network requirements. The following sample configuration is for illustrative purposes only.

    Table 2. Sample firmware configuration for an Intel Xeon Skylake or Cascade Lake server

    Firmware setting                     Configuration
    CPU Power and Performance Policy     Performance
    Uncore Frequency Scaling             Disabled
    Performance P-limit                  Disabled
    Enhanced Intel SpeedStep® Tech       Enabled
    Intel Configurable TDP               Enabled
    Configurable TDP Level               Level 2
    Intel® Turbo Boost Technology        Enabled
    Energy Efficient Turbo               Disabled
    Hardware P-States                    Disabled
    Package C-State                      C0/C1 state
    C1E                                  Disabled
    Processor C6                         Disabled

Enable global SR-IOV and VT-d settings in the firmware for the host. These settings are relevant to bare-metal environments.

Connectivity prerequisites for managed cluster networks

Before you can install and provision a managed cluster with the zero touch provisioning (ZTP) GitOps pipeline, the managed cluster host must meet the following networking prerequisites:

  • There must be bi-directional connectivity between the ZTP GitOps container in the hub cluster and the Baseboard Management Controller (BMC) of the target bare-metal host.

  • The managed cluster must be able to resolve and reach the API hostname of the hub cluster and the *.apps hostname. Here is an example of the hub cluster API hostname and *.apps hostname:

    • api.hub-cluster.internal.domain.com

    • console-openshift-console.apps.hub-cluster.internal.domain.com

  • The hub cluster must be able to resolve and reach the API and *.apps hostname of the managed cluster. Here is an example of the API hostname of the managed cluster and *.apps hostname:

    • api.sno-managed-cluster-1.internal.domain.com

    • console-openshift-console.apps.sno-managed-cluster-1.internal.domain.com

Workload partitioning in single-node OpenShift with GitOps ZTP

Workload partitioning configures OKD services, cluster management workloads, and infrastructure pods to run on a reserved number of host CPUs.

To configure workload partitioning with GitOps ZTP, you specify cluster management CPU resources with the cpuset field of the SiteConfig custom resource (CR) and the reserved field of the group PolicyGenTemplate CR. The GitOps ZTP pipeline uses these values to populate the required fields in the workload partitioning MachineConfig CR (cpuset) and the PerformanceProfile CR (reserved) that configure the single-node OpenShift cluster.

For maximum performance, ensure that the reserved and isolated CPU sets do not share CPU cores across NUMA zones.

  • The workload partitioning MachineConfig CR pins the OKD infrastructure pods to a defined cpuset configuration.

  • The PerformanceProfile CR pins the systemd services to the reserved CPUs.

The value for the reserved field specified in the PerformanceProfile CR must match the cpuset field in the workload partitioning MachineConfig CR.
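
The following fragment is a minimal sketch of how these two fields relate to each other. The cluster, node, and policy names are hypothetical, and only the fields relevant to workload partitioning are shown; a real SiteConfig and group PolicyGenTemplate require additional fields.

  # SiteConfig CR: the node-level cpuset seeds the workload partitioning MachineConfig
  apiVersion: ran.openshift.io/v1
  kind: SiteConfig
  metadata:
    name: example-sno
    namespace: example-sno
  spec:
    clusters:
    - clusterName: example-sno
      nodes:
      - hostName: example-node1.example.com
        cpuset: "0-1,52-53"
  ---
  # Group PolicyGenTemplate CR: the reserved CPUs populate the PerformanceProfile
  apiVersion: ran.openshift.io/v1
  kind: PolicyGenTemplate
  metadata:
    name: group-du-sno
    namespace: ztp-group
  spec:
    sourceFiles:
    - fileName: PerformanceProfile.yaml
      policyName: "config-policy"
      spec:
        cpu:
          reserved: "0-1,52-53"
          isolated: "2-51,54-103"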

Additional resources

  • For the recommended single-node OpenShift workload partitioning configuration, see Workload partitioning.

The ZTP pipeline applies the following custom resources (CRs) during cluster installation. These configuration CRs ensure that the cluster meets the feature and performance requirements necessary for running a vDU application.

When using the ZTP GitOps plugin and SiteConfig CRs for cluster deployment, the following MachineConfig CRs are included by default.

Use the SiteConfig extraManifests filter to alter the CRs that are included by default. For more information, see Advanced managed cluster configuration with SiteConfig CRs.

Workload partitioning

Single-node OpenShift clusters that run DU workloads require workload partitioning. This limits the cores allowed to run platform services, maximizing the CPU cores available for application payloads.

Workload partitioning can only be enabled during cluster installation. You cannot disable workload partitioning post-installation. However, you can reconfigure workload partitioning by updating the cpu value that you define in the performance profile, and in the related MachineConfig custom resource (CR).

  • The base64-encoded CR that enables workload partitioning contains the CPU set that the management workloads are constrained to. Encode host-specific values for crio.conf and kubelet.conf in base64, as shown in the example after this list. Adjust the content to match the CPU set that is specified in the cluster performance profile, and make sure it is valid for the number of cores in the cluster host.

    Recommended workload partitioning configuration

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: master
      name: 02-master-workload-partitioning
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
          - contents:
              source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMubWFuYWdlbWVudF0KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gInRhcmdldC53b3JrbG9hZC5vcGVuc2hpZnQuaW8vbWFuYWdlbWVudCIKYW5ub3RhdGlvbl9wcmVmaXggPSAicmVzb3VyY2VzLndvcmtsb2FkLm9wZW5zaGlmdC5pbyIKcmVzb3VyY2VzID0geyAiY3B1c2hhcmVzIiA9IDAsICJjcHVzZXQiID0gIjAtMSw1Mi01MyIgfQo=
            mode: 420
            overwrite: true
            path: /etc/crio/crio.conf.d/01-workload-partitioning
            user:
              name: root
          - contents:
              source: data:text/plain;charset=utf-8;base64,ewogICJtYW5hZ2VtZW50IjogewogICAgImNwdXNldCI6ICIwLTEsNTItNTMiCiAgfQp9Cg==
            mode: 420
            overwrite: true
            path: /etc/kubernetes/openshift-workload-pinning
            user:
              name: root
  • When configured in the cluster host, the contents of /etc/crio/crio.conf.d/01-workload-partitioning should look like this:

    [crio.runtime.workloads.management]
    activation_annotation = "target.workload.openshift.io/management"
    annotation_prefix = "resources.workload.openshift.io"
    resources = { "cpushares" = 0, "cpuset" = "0-1,52-53" } (1)
    (1) The cpuset value varies based on the installation. If Hyper-Threading is enabled, specify both threads for each core. The cpuset value must match the reserved CPUs that you define in the spec.cpu.reserved field in the performance profile.
  • When configured in the cluster, the contents of /etc/kubernetes/openshift-workload-pinning should look like this:

    {
      "management": {
        "cpuset": "0-1,52-53" (1)
      }
    }
    (1) The cpuset must match the cpuset value in /etc/crio/crio.conf.d/01-workload-partitioning.
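
For example, assuming the crio.conf and kubelet pinning snippets shown above are saved locally as 01-workload-partitioning and openshift-workload-pinning (the file names are illustrative), you can generate the base64-encoded strings for the MachineConfig CR with a standard base64 command:

  $ base64 -w 0 01-workload-partitioning
  $ base64 -w 0 openshift-workload-pinning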

Verification

Check that the applications and cluster system CPU pinning is correct. Run the following commands:

  1. Open a remote shell connection to the managed cluster:

    $ oc debug node/example-sno-1
  2. Check that the OpenShift infrastructure applications CPU pinning is correct:

    sh-4.4# pgrep ovn | while read i; do taskset -cp $i; done

    Example output

    pid 8481's current affinity list: 0-3
    pid 8726's current affinity list: 0-3
    pid 9088's current affinity list: 0-3
    pid 9945's current affinity list: 0-3
    pid 10387's current affinity list: 0-3
    pid 12123's current affinity list: 0-3
    pid 13313's current affinity list: 0-3
  3. Check that the system applications CPU pinning is correct:

    sh-4.4# pgrep systemd | while read i; do taskset -cp $i; done

    Example output

    pid 1's current affinity list: 0-3
    pid 938's current affinity list: 0-3
    pid 962's current affinity list: 0-3
    pid 1197's current affinity list: 0-3

Reduced platform management footprint

To reduce the overall management footprint of the platform, a MachineConfig custom resource (CR) is required that places all Kubernetes-specific mount points in a new mount namespace, separate from the host operating system. The following base64-encoded example MachineConfig CR illustrates this configuration.

Recommended container mount namespace configuration

  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: master
    name: container-mount-namespace-and-kubelet-conf-master
  spec:
    config:
      ignition:
        version: 3.2.0
      storage:
        files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKCmRlYnVnKCkgewogIGVjaG8gJEAgPiYyCn0KCnVzYWdlKCkgewogIGVjaG8gVXNhZ2U6ICQoYmFzZW5hbWUgJDApIFVOSVQgW2VudmZpbGUgW3Zhcm5hbWVdXQogIGVjaG8KICBlY2hvIEV4dHJhY3QgdGhlIGNvbnRlbnRzIG9mIHRoZSBmaXJzdCBFeGVjU3RhcnQgc3RhbnphIGZyb20gdGhlIGdpdmVuIHN5c3RlbWQgdW5pdCBhbmQgcmV0dXJuIGl0IHRvIHN0ZG91dAogIGVjaG8KICBlY2hvICJJZiAnZW52ZmlsZScgaXMgcHJvdmlkZWQsIHB1dCBpdCBpbiB0aGVyZSBpbnN0ZWFkLCBhcyBhbiBlbnZpcm9ubWVudCB2YXJpYWJsZSBuYW1lZCAndmFybmFtZSciCiAgZWNobyAiRGVmYXVsdCAndmFybmFtZScgaXMgRVhFQ1NUQVJUIGlmIG5vdCBzcGVjaWZpZWQiCiAgZXhpdCAxCn0KClVOSVQ9JDEKRU5WRklMRT0kMgpWQVJOQU1FPSQzCmlmIFtbIC16ICRVTklUIHx8ICRVTklUID09ICItLWhlbHAiIHx8ICRVTklUID09ICItaCIgXV07IHRoZW4KICB1c2FnZQpmaQpkZWJ1ZyAiRXh0cmFjdGluZyBFeGVjU3RhcnQgZnJvbSAkVU5JVCIKRklMRT0kKHN5c3RlbWN0bCBjYXQgJFVOSVQgfCBoZWFkIC1uIDEpCkZJTEU9JHtGSUxFI1wjIH0KaWYgW1sgISAtZiAkRklMRSBdXTsgdGhlbgogIGRlYnVnICJGYWlsZWQgdG8gZmluZCByb290IGZpbGUgZm9yIHVuaXQgJFVOSVQgKCRGSUxFKSIKICBleGl0CmZpCmRlYnVnICJTZXJ2aWNlIGRlZmluaXRpb24gaXMgaW4gJEZJTEUiCkVYRUNTVEFSVD0kKHNlZCAtbiAtZSAnL15FeGVjU3RhcnQ9LipcXCQvLC9bXlxcXSQvIHsgcy9eRXhlY1N0YXJ0PS8vOyBwIH0nIC1lICcvXkV4ZWNTdGFydD0uKlteXFxdJC8geyBzL15FeGVjU3RhcnQ9Ly87IHAgfScgJEZJTEUpCgppZiBbWyAkRU5WRklMRSBdXTsgdGhlbgogIFZBUk5BTUU9JHtWQVJOQU1FOi1FWEVDU1RBUlR9CiAgZWNobyAiJHtWQVJOQU1FfT0ke0VYRUNTVEFSVH0iID4gJEVOVkZJTEUKZWxzZQogIGVjaG8gJEVYRUNTVEFSVApmaQo=
          mode: 493
          path: /usr/local/bin/extractExecStart
        - contents:
            source: data:text/plain;charset=utf-8;base64,IyEvYmluL2Jhc2gKbnNlbnRlciAtLW1vdW50PS9ydW4vY29udGFpbmVyLW1vdW50LW5hbWVzcGFjZS9tbnQgIiRAIgo=
          mode: 493
          path: /usr/local/bin/nsenterCmns
      systemd:
        units:
        - contents: |
            [Unit]
            Description=Manages a mount namespace that both kubelet and crio can use to share their container-specific mounts
            [Service]
            Type=oneshot
            RemainAfterExit=yes
            RuntimeDirectory=container-mount-namespace
            Environment=RUNTIME_DIRECTORY=%t/container-mount-namespace
            Environment=BIND_POINT=%t/container-mount-namespace/mnt
            ExecStartPre=bash -c "findmnt ${RUNTIME_DIRECTORY} || mount --make-unbindable --bind ${RUNTIME_DIRECTORY} ${RUNTIME_DIRECTORY}"
            ExecStartPre=touch ${BIND_POINT}
            ExecStart=unshare --mount=${BIND_POINT} --propagation slave mount --make-rshared /
            ExecStop=umount -R ${RUNTIME_DIRECTORY}
          enabled: true
          name: container-mount-namespace.service
        - dropins:
          - contents: |
              [Unit]
              Wants=container-mount-namespace.service
              After=container-mount-namespace.service
              [Service]
              ExecStartPre=/usr/local/bin/extractExecStart %n /%t/%N-execstart.env ORIG_EXECSTART
              EnvironmentFile=-/%t/%N-execstart.env
              ExecStart=
              ExecStart=bash -c "nsenter --mount=%t/container-mount-namespace/mnt \
                        ${ORIG_EXECSTART}"
            name: 90-container-mount-namespace.conf
          name: crio.service
        - dropins:
          - contents: |
              [Unit]
              Wants=container-mount-namespace.service
              After=container-mount-namespace.service
              [Service]
              ExecStartPre=/usr/local/bin/extractExecStart %n /%t/%N-execstart.env ORIG_EXECSTART
              EnvironmentFile=-/%t/%N-execstart.env
              ExecStart=
              ExecStart=bash -c "nsenter --mount=%t/container-mount-namespace/mnt \
                        ${ORIG_EXECSTART} --housekeeping-interval=30s"
            name: 90-container-mount-namespace.conf
          - contents: |
              [Service]
              Environment="OPENSHIFT_MAX_HOUSEKEEPING_INTERVAL_DURATION=60s"
              Environment="OPENSHIFT_EVICTION_MONITORING_PERIOD_DURATION=30s"
            name: 30-kubelet-interval-tuning.conf
          name: kubelet.service
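
After this MachineConfig is applied and the node reboots, one way to confirm that the mount namespace service is active is to check the systemd unit from a debug shell on the node (the node name is illustrative):

  $ oc debug node/example-sno-1
  sh-4.4# chroot /host
  sh-4.4# systemctl status container-mount-namespace.service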

SCTP

Stream Control Transmission Protocol (SCTP) is a key protocol used in RAN applications. This MachineConfig object adds the SCTP kernel module to the node to enable this protocol.

Recommended SCTP configuration

  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: master
    name: load-sctp-module
  spec:
    config:
      ignition:
        version: 2.2.0
      storage:
        files:
        - contents:
            source: data:,
            verification: {}
          filesystem: root
          mode: 420
          path: /etc/modprobe.d/sctp-blacklist.conf
        - contents:
            source: data:text/plain;charset=utf-8,sctp
          filesystem: root
          mode: 420
          path: /etc/modules-load.d/sctp-load.conf
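
A quick way to confirm that the SCTP module is loaded after the node reboots is to check from a debug shell on the node (the node name is illustrative):

  $ oc debug node/example-sno-1
  sh-4.4# chroot /host
  sh-4.4# lsmod | grep sctp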

Accelerated container startup

The following MachineConfig CR configures core OpenShift processes and containers to use all available CPU cores during system startup and shutdown. This accelerates the system recovery during initial boot and reboots.

Recommended accelerated container startup configuration

  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: master
    name: 04-accelerated-container-startup-master
  spec:
    config:
      ignition:
        version: 3.2.0
      storage:
        files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,#!/bin/bash
#
# Temporarily reset the core system processes's CPU affinity to be unrestricted to accelerate startup and shutdown
#
# The defaults below can be overridden via environment variables
#

# The default set of critical processes whose affinity should be temporarily unbound:
CRITICAL_PROCESSES=${CRITICAL_PROCESSES:-"systemd ovs crio kubelet NetworkManager conmon dbus"}

# Default wait time is 600s = 10m:
MAXIMUM_WAIT_TIME=${MAXIMUM_WAIT_TIME:-600}

# Default steady-state threshold = 2%
# Allowed values:
#  4  - absolute pod count (+/-)
#  4% - percent change (+/-)
#  -1 - disable the steady-state check
STEADY_STATE_THRESHOLD=${STEADY_STATE_THRESHOLD:-2%}

# Default steady-state window = 60s
# If the running pod count stays within the given threshold for this time
# period, return CPU utilization to normal before the maximum wait time has
# expires
STEADY_STATE_WINDOW=${STEADY_STATE_WINDOW:-60}

# Default steady-state allows any pod count to be "steady state"
# Increasing this will skip any steady-state checks until the count rises above
# this number to avoid false positives if there are some periods where the
# count doesn't increase but we know we can't be at steady-state yet.
STEADY_STATE_MINIMUM=${STEADY_STATE_MINIMUM:-0}

#######################################################

KUBELET_CPU_STATE=/var/lib/kubelet/cpu_manager_state
FULL_CPU_STATE=/sys/fs/cgroup/cpuset/cpuset.cpus
unrestrictedCpuset() {
  local cpus
  if [[ -e $KUBELET_CPU_STATE ]]; then
      cpus=$(jq -r '.defaultCpuSet' <$KUBELET_CPU_STATE)
  fi
  if [[ -z $cpus ]]; then
    # fall back to using all cpus if the kubelet state is not configured yet
    [[ -e $FULL_CPU_STATE ]] || return 1
    cpus=$(<$FULL_CPU_STATE)
  fi
  echo $cpus
}

restrictedCpuset() {
  for arg in $(</proc/cmdline); do
    if [[ $arg =~ ^systemd.cpu_affinity= ]]; then
      echo ${arg#*=}
      return 0
    fi
  done
  return 1
}

getCPUCount () {
  local cpuset="$1"
  local cpulist=()
  local cpus=0
  local mincpus=2

  if [[ -z $cpuset || $cpuset =~ [^0-9,-] ]]; then
    echo $mincpus
    return 1
  fi

  IFS=',' read -ra cpulist <<< $cpuset

  for elm in "${cpulist[@]}"; do
    if [[ $elm =~ ^[0-9]+$ ]]; then
      (( cpus++ ))
    elif [[ $elm =~ ^[0-9]+-[0-9]+$ ]]; then
      local low=0 high=0
      IFS='-' read low high <<< $elm
      (( cpus += high - low + 1 ))
    else
      echo $mincpus
      return 1
    fi
  done

  # Return a minimum of 2 cpus
  echo $(( cpus > $mincpus ? cpus : $mincpus ))
  return 0
}

resetOVSthreads () {
  local cpucount="$1"
  local curRevalidators=0
  local curHandlers=0
  local desiredRevalidators=0
  local desiredHandlers=0
  local rc=0

  curRevalidators=$(ps -Teo pid,tid,comm,cmd | grep -e revalidator | grep -c ovs-vswitchd)
  curHandlers=$(ps -Teo pid,tid,comm,cmd | grep -e handler | grep -c ovs-vswitchd)

  # Calculate the desired number of threads the same way OVS does.
  # OVS will set these thread count as a one shot process on startup, so we
  # have to adjust up or down during the boot up process. The desired outcome is
  # to not restrict the number of thread at startup until we reach a steady
  # state.  At which point we need to reset these based on our restricted  set
  # of cores.
  # See OVS function that calculates these thread counts:
  # https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-upcall.c#L635
  (( desiredRevalidators=$cpucount / 4 + 1 ))
  (( desiredHandlers=$cpucount - $desiredRevalidators ))


  if [[ $curRevalidators -ne $desiredRevalidators || $curHandlers -ne $desiredHandlers ]]; then

    logger "Recovery: Re-setting OVS revalidator threads: ${curRevalidators} -> ${desiredRevalidators}"
    logger "Recovery: Re-setting OVS handler threads: ${curHandlers} -> ${desiredHandlers}"

    ovs-vsctl set \
      Open_vSwitch . \
      other-config:n-handler-threads=${desiredHandlers} \
      other-config:n-revalidator-threads=${desiredRevalidators}
    rc=$?
  fi

  return $rc
}

resetAffinity() {
  local cpuset="$1"
  local failcount=0
  local successcount=0
  logger "Recovery: Setting CPU affinity for critical processes \"$CRITICAL_PROCESSES\" to $cpuset"
  for proc in $CRITICAL_PROCESSES; do
    local pids="$(pgrep $proc)"
    for pid in $pids; do
      local tasksetOutput
      tasksetOutput="$(taskset -apc "$cpuset" $pid 2>&1)"
      if [[ $? -ne 0 ]]; then
        echo "ERROR: $tasksetOutput"
        ((failcount++))
      else
        ((successcount++))
      fi
    done
  done

  resetOVSthreads "$(getCPUCount ${cpuset})"
  if [[ $? -ne 0 ]]; then
    ((failcount++))
  else
    ((successcount++))
  fi

  logger "Recovery: Re-affined $successcount pids successfully"
  if [[ $failcount -gt 0 ]]; then
    logger "Recovery: Failed to re-affine $failcount processes"
    return 1
  fi
}

setUnrestricted() {
  logger "Recovery: Setting critical system processes to have unrestricted CPU access"
  resetAffinity "$(unrestrictedCpuset)"
}

setRestricted() {
  logger "Recovery: Resetting critical system processes back to normally restricted access"
  resetAffinity "$(restrictedCpuset)"
}

currentAffinity() {
  local pid="$1"
  taskset -pc $pid | awk -F': ' '{print $2}'
}

within() {
  local last=$1 current=$2 threshold=$3
  local delta=0 pchange
  delta=$(( current - last ))
  if [[ $current -eq $last ]]; then
    pchange=0
  elif [[ $last -eq 0 ]]; then
    pchange=1000000
  else
    pchange=$(( ( $delta * 100) / last ))
  fi
  echo -n "last:$last current:$current delta:$delta pchange:${pchange}%: "
  local absolute limit
  case $threshold in
    *%)
      absolute=${pchange##-} # absolute value
      limit=${threshold%%%}
      ;;
    *)
      absolute=${delta##-} # absolute value
      limit=$threshold
      ;;
  esac
  if [[ $absolute -le $limit ]]; then
    echo "within (+/-)$threshold"
    return 0
  else
    echo "outside (+/-)$threshold"
    return 1
  fi
}

steadystate() {
  local last=$1 current=$2
  if [[ $last -lt $STEADY_STATE_MINIMUM ]]; then
    echo "last:$last current:$current Waiting to reach $STEADY_STATE_MINIMUM before checking for steady-state"
    return 1
  fi
  within $last $current $STEADY_STATE_THRESHOLD
}

waitForReady() {
  logger "Recovery: Waiting ${MAXIMUM_WAIT_TIME}s for the initialization to complete"
  local lastSystemdCpuset="$(currentAffinity 1)"
  local lastDesiredCpuset="$(unrestrictedCpuset)"
  local t=0 s=10
  local lastCcount=0 ccount=0 steadyStateTime=0
  while [[ $t -lt $MAXIMUM_WAIT_TIME ]]; do
    sleep $s
    ((t += s))
    # Re-check the current affinity of systemd, in case some other process has changed it
    local systemdCpuset="$(currentAffinity 1)"
    # Re-check the unrestricted Cpuset, as the allowed set of unreserved cores may change as pods are assigned to cores
    local desiredCpuset="$(unrestrictedCpuset)"
    if [[ $systemdCpuset != $lastSystemdCpuset || $lastDesiredCpuset != $desiredCpuset ]]; then
      resetAffinity "$desiredCpuset"
      lastSystemdCpuset="$(currentAffinity 1)"
      lastDesiredCpuset="$desiredCpuset"
    fi

    # Detect steady-state pod count
    ccount=$(crictl ps | wc -l)
    if steadystate $lastCcount $ccount; then
      ((steadyStateTime += s))
      echo "Steady-state for ${steadyStateTime}s/${STEADY_STATE_WINDOW}s"
      if [[ $steadyStateTime -ge $STEADY_STATE_WINDOW ]]; then
        logger "Recovery: Steady-state (+/- $STEADY_STATE_THRESHOLD) for ${STEADY_STATE_WINDOW}s: Done"
        return 0
      fi
    else
      if [[ $steadyStateTime -gt 0 ]]; then
        echo "Resetting steady-state timer"
        steadyStateTime=0
      fi
    fi
    lastCcount=$ccount
  done
  logger "Recovery: Recovery Complete Timeout"
}

main() {
  if ! unrestrictedCpuset >&/dev/null; then
    logger "Recovery: No unrestricted Cpuset could be detected"
    return 1
  fi

  if ! restrictedCpuset >&/dev/null; then
    logger "Recovery: No restricted Cpuset has been configured.  We are already running unrestricted."
    return 0
  fi

  # Ensure we reset the CPU affinity when we exit this script for any reason
  # This way either after the timer expires or after the process is interrupted
  # via ^C or SIGTERM, we return things back to the way they should be.
  trap setRestricted EXIT

  logger "Recovery: Recovery Mode Starting"
  setUnrestricted
  waitForReady
}

if [[ "${BASH_SOURCE[0]}" = "${0}" ]]; then
  main "${@}"
  exit $?
fi

          mode: 493
          path: /usr/local/bin/accelerated-container-startup.sh
      systemd:
        units:
        - contents: |
            [Unit]
            Description=Unlocks more CPUs for critical system processes during container startup
            [Service]
            Type=simple
            ExecStart=/usr/local/bin/accelerated-container-startup.sh
            # Maximum wait time is 600s = 10m:
            Environment=MAXIMUM_WAIT_TIME=600
            # Steady-state threshold = 2%
            # Allowed values:
            # 4 - absolute pod count (+/-)
            # 4% - percent change (+/-)
            # -1 - disable the steady-state check
            # Note: '%' must be escaped as '%%' in systemd unit files
            Environment=STEADY_STATE_THRESHOLD=2%%
            # Steady-state window = 120s
            # If the running pod count stays within the given threshold for this time
            # period, return CPU utilization to normal before the maximum wait time has
            # expires
            Environment=STEADY_STATE_WINDOW=120
            # Steady-state minimum = 40
            # Increasing this will skip any steady-state checks until the count rises above
            # this number to avoid false positives if there are some periods where the
            # count doesn't increase but we know we can't be at steady-state yet.
            Environment=STEADY_STATE_MINIMUM=40
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: accelerated-container-startup.service
        - contents: |
            [Unit]
            Description=Unlocks more CPUs for critical system processes during container shutdown
            DefaultDependencies=no
            [Service]
            Type=simple
            ExecStart=/usr/local/bin/accelerated-container-startup.sh
            # Maximum wait time is 600s = 10m:
            Environment=MAXIMUM_WAIT_TIME=600
            # Steady-state threshold
            # Allowed values:
            # 4 - absolute pod count (+/-)
            # 4% - percent change (+/-)
            # -1 - disable the steady-state check
            # Note: '%' must be escaped as '%%' in systemd unit files
            Environment=STEADY_STATE_THRESHOLD=-1
            # Steady-state window = 60s
            # If the running pod count stays within the given threshold for this time
            # period, return CPU utilization to normal before the maximum wait time has
            # expires
            Environment=STEADY_STATE_WINDOW=60
            [Install]
            WantedBy=shutdown.target reboot.target halt.target
          enabled: true
          name: accelerated-container-shutdown.service
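
One way to confirm that the startup service ran after a reboot is to review its journal from a debug shell on the node (the node name is illustrative):

  $ oc debug node/example-sno-1
  sh-4.4# chroot /host
  sh-4.4# journalctl -u accelerated-container-startup.service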

Automatic kernel crash dumps with kdump

kdump is a Linux kernel feature that creates a kernel crash dump when the kernel crashes. kdump is enabled with the following MachineConfig CR:

Recommended kdump configuration

  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: master
    name: 06-kdump-enable-master
  spec:
    config:
      ignition:
        version: 3.2.0
      systemd:
        units:
        - enabled: true
          name: kdump.service
    kernelArguments:
    - crashkernel=512M
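
To confirm that kdump is active after the node reboots, you can check the service status and the reserved crash kernel memory from a debug shell on the node (the node name is illustrative):

  $ oc debug node/example-sno-1
  sh-4.4# chroot /host
  sh-4.4# systemctl status kdump.service
  sh-4.4# grep crashkernel /proc/cmdline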

Recommended post-installation cluster configurations

When the cluster installation is complete, the ZTP pipeline applies the following custom resources (CRs) that are required to run DU workloads.

In GitOps ZTP v4.10 and earlier, you configure UEFI secure boot with a MachineConfig CR. This is no longer required in GitOps ZTP v4.11 and later. In v4.11, you configure UEFI secure boot for single-node OpenShift clusters using Performance profile CRs. For more information, see Performance profile.

Operator namespaces and Operator groups

Single-node OpenShift clusters that run DU workloads require the following OperatorGroup and Namespace custom resources (CRs):

  • Local Storage Operator

  • Logging Operator

  • PTP Operator

  • SR-IOV Network Operator

The following YAML summarizes these CRs:

Recommended Operator Namespace and OperatorGroup configuration

  apiVersion: v1
  kind: Namespace
  metadata:
    annotations:
      workload.openshift.io/allowed: management
    name: openshift-local-storage
  ---
  apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    name: openshift-local-storage
    namespace: openshift-local-storage
  spec:
    targetNamespaces:
    - openshift-local-storage
  ---
  apiVersion: v1
  kind: Namespace
  metadata:
    annotations:
      workload.openshift.io/allowed: management
    name: openshift-logging
  ---
  apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    name: cluster-logging
    namespace: openshift-logging
  spec:
    targetNamespaces:
    - openshift-logging
  ---
  apiVersion: v1
  kind: Namespace
  metadata:
    annotations:
      workload.openshift.io/allowed: management
    labels:
      openshift.io/cluster-monitoring: "true"
    name: openshift-ptp
  ---
  apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    name: ptp-operators
    namespace: openshift-ptp
  spec:
    targetNamespaces:
    - openshift-ptp
  ---
  apiVersion: v1
  kind: Namespace
  metadata:
    annotations:
      workload.openshift.io/allowed: management
    name: openshift-sriov-network-operator
  ---
  apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    name: sriov-network-operators
    namespace: openshift-sriov-network-operator
  spec:
    targetNamespaces:
    - openshift-sriov-network-operator

Operator subscriptions

Single-node OpenShift clusters that run DU workloads require the following Subscription CRs. The subscription provides the location to download the following Operators:

  • Local Storage Operator

  • Logging Operator

  • PTP Operator

  • SR-IOV Network Operator

Recommended Operator subscriptions

  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: cluster-logging
    namespace: openshift-logging
  spec:
    channel: "stable" (1)
    name: cluster-logging
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    installPlanApproval: Manual (2)
  ---
  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: local-storage-operator
    namespace: openshift-local-storage
  spec:
    channel: "stable"
    name: local-storage-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    installPlanApproval: Manual
  ---
  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: ptp-operator-subscription
    namespace: openshift-ptp
  spec:
    channel: "stable"
    name: ptp-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    installPlanApproval: Manual
  ---
  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: sriov-network-operator-subscription
    namespace: openshift-sriov-network-operator
  spec:
    channel: "stable"
    name: sriov-network-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    installPlanApproval: Manual
(1) Specify the channel to get the Operator from. stable is the recommended channel.
(2) Specify Manual or Automatic. In Automatic mode, the Operator automatically updates to the latest versions in the channel as they become available in the registry. In Manual mode, new Operator versions are installed only after they are explicitly approved.
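
With installPlanApproval set to Manual, pending install plans must be approved before the Operators are installed or upgraded. A minimal sketch of how this can be done with the oc CLI, where the install plan name install-abcde is hypothetical:

  $ oc get installplan -n openshift-logging
  $ oc patch installplan install-abcde -n openshift-logging --type merge --patch '{"spec":{"approved":true}}'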

Cluster logging and log forwarding

Single-node OpenShift clusters that run DU workloads require logging and log forwarding for debugging. The following example YAML illustrates the required ClusterLogging and ClusterLogForwarder CRs.

Recommended cluster logging and log forwarding configuration

  apiVersion: logging.openshift.io/v1
  kind: ClusterLogging (1)
  metadata:
    name: instance
    namespace: openshift-logging
  spec:
    collection:
      logs:
        fluentd: {}
        type: fluentd
    curation:
      type: "curator"
      curator:
        schedule: "30 3 * * *"
    managementState: Managed
  ---
  apiVersion: logging.openshift.io/v1
  kind: ClusterLogForwarder (2)
  metadata:
    name: instance
    namespace: openshift-logging
  spec:
    inputs:
    - infrastructure: {}
      name: infra-logs
    outputs:
    - name: kafka-open
      type: kafka
      url: tcp://10.46.55.190:9092/test (3)
    pipelines:
    - inputRefs:
      - audit
      name: audit-logs
      outputRefs:
      - kafka-open
    - inputRefs:
      - infrastructure
      name: infrastructure-logs
      outputRefs:
      - kafka-open
(1) Updates the existing ClusterLogging instance or creates the instance if it does not exist.
(2) Updates the existing ClusterLogForwarder instance or creates the instance if it does not exist.
(3) Specifies the URL of the Kafka server where the logs are forwarded to.

Performance profile

Single-node OpenShift clusters that run DU workloads require a Node Tuning Operator performance profile to use real-time host capabilities and services.

In earlier versions of OKD, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OKD 4.11 and later, this functionality is part of the Node Tuning Operator.

The following example PerformanceProfile CR illustrates the required cluster configuration.

Recommended performance profile configuration

  apiVersion: performance.openshift.io/v2
  kind: PerformanceProfile
  metadata:
    name: openshift-node-performance-profile (1)
  spec:
    additionalKernelArgs:
    - "rcupdate.rcu_normal_after_boot=0"
    - "efi=runtime" (2)
    cpu:
      isolated: 2-51,54-103 (3)
      reserved: 0-1,52-53 (4)
    hugepages:
      defaultHugepagesSize: 1G
      pages:
      - count: 32 (5)
        size: 1G (6)
        node: 0 (7)
    machineConfigPoolSelector:
      pools.operator.machineconfiguration.openshift.io/master: ""
    nodeSelector:
      node-role.kubernetes.io/master: ""
    numa:
      topologyPolicy: "restricted"
    realTimeKernel:
      enabled: true (8)
(1) Ensure that the value for name matches that specified in the spec.profile.data field of TunedPerformancePatch.yaml and the status.configuration.source.name field of validatorCRs/informDuValidator.yaml.
(2) Configures UEFI secure boot for the cluster host.
(3) Set the isolated CPUs. Ensure all of the Hyper-Threading pairs match.

The reserved and isolated CPU pools must not overlap and together must span all available cores. CPU cores that are not accounted for cause undefined behaviour in the system.

(4) Set the reserved CPUs. When workload partitioning is enabled, system processes, kernel threads, and system container threads are restricted to these CPUs. All CPUs that are not isolated should be reserved.
(5) Set the number of huge pages.
(6) Set the huge page size.
(7) Set node to the NUMA node where the hugepages are allocated.
(8) Set enabled to true to install the real-time Linux kernel.
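
After the profile is applied, one way to spot-check the resulting CPU partitioning is to inspect the profile spec and the kernel command line on the node (the node name is illustrative):

  $ oc get performanceprofile openshift-node-performance-profile -o jsonpath='{.spec.cpu}'
  $ oc debug node/example-sno-1
  sh-4.4# chroot /host
  sh-4.4# cat /proc/cmdline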

PTP

Single-node OpenShift clusters use Precision Time Protocol (PTP) for network time synchronization. The following example PtpConfig CR illustrates the required PTP slave configuration.

Recommended PTP configuration

  apiVersion: ptp.openshift.io/v1
  kind: PtpConfig
  metadata:
    name: du-ptp-slave
    namespace: openshift-ptp
  spec:
    profile:
    - interface: ens5f0 (1)
      name: slave
      phc2sysOpts: -a -r -n 24
      ptp4lConf: |
        [global]
        #
        # Default Data Set
        #
        twoStepFlag 1
        slaveOnly 0
        priority1 128
        priority2 128
        domainNumber 24
        #utc_offset 37
        clockClass 248
        clockAccuracy 0xFE
        offsetScaledLogVariance 0xFFFF
        free_running 0
        freq_est_interval 1
        dscp_event 0
        dscp_general 0
        dataset_comparison ieee1588
        G.8275.defaultDS.localPriority 128
        #
        # Port Data Set
        #
        logAnnounceInterval -3
        logSyncInterval -4
        logMinDelayReqInterval -4
        logMinPdelayReqInterval -4
        announceReceiptTimeout 3
        syncReceiptTimeout 0
        delayAsymmetry 0
        fault_reset_interval 4
        neighborPropDelayThresh 20000000
        masterOnly 0
        G.8275.portDS.localPriority 128
        #
        # Run time options
        #
        assume_two_step 0
        logging_level 6
        path_trace_enabled 0
        follow_up_info 0
        hybrid_e2e 0
        inhibit_multicast_service 0
        net_sync_monitor 0
        tc_spanning_tree 0
        tx_timestamp_timeout 1
        unicast_listen 0
        unicast_master_table 0
        unicast_req_duration 3600
        use_syslog 1
        verbose 0
        summary_interval 0
        kernel_leap 1
        check_fup_sync 0
        #
        # Servo Options
        #
        pi_proportional_const 0.0
        pi_integral_const 0.0
        pi_proportional_scale 0.0
        pi_proportional_exponent -0.3
        pi_proportional_norm_max 0.7
        pi_integral_scale 0.0
        pi_integral_exponent 0.4
        pi_integral_norm_max 0.3
        step_threshold 2.0
        first_step_threshold 0.00002
        max_frequency 900000000
        clock_servo pi
        sanity_freq_limit 200000000
        ntpshm_segment 0
        #
        # Transport options
        #
        transportSpecific 0x0
        ptp_dst_mac 01:1B:19:00:00:00
        p2p_dst_mac 01:80:C2:00:00:0E
        udp_ttl 1
        udp6_scope 0x0E
        uds_address /var/run/ptp4l
        #
        # Default interface options
        #
        clock_type OC
        network_transport L2
        delay_mechanism E2E
        time_stamping hardware
        tsproc_mode filter
        delay_filter moving_median
        delay_filter_length 10
        egressLatency 0
        ingressLatency 0
        boundary_clock_jbod 0
        #
        # Clock description
        #
        productDescription ;;
        revisionData ;;
        manufacturerIdentity 00:00:00
        userDescription ;
        timeSource 0xA0
      ptp4lOpts: -2 -s --summary_interval -4
    recommend:
    - match:
      - nodeLabel: node-role.kubernetes.io/master
      priority: 4
      profile: slave
(1) Sets the interface used to receive the PTP clock signal.
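
To confirm that the PTP configuration has been picked up, you can list the applied PtpConfig CRs and the linuxptp daemon pods; the exact pod names vary by cluster:

  $ oc get ptpconfig -n openshift-ptp
  $ oc get pods -n openshift-ptp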

Extended Tuned profile

Single-node OpenShift clusters that run DU workloads require additional performance tuning configuration for high-performance workloads. The following example Tuned CR extends the Tuned profile:

Recommended extended Tuned profile configuration

  apiVersion: tuned.openshift.io/v1
  kind: Tuned
  metadata:
    name: performance-patch
    namespace: openshift-cluster-node-tuning-operator
  spec:
    profile:
    - data: |
        [main]
        summary=Configuration changes profile inherited from performance created tuned
        include=openshift-node-performance-openshift-node-performance-profile
        [bootloader]
        cmdline_crash=nohz_full=2-51,54-103
        [sysctl]
        kernel.timer_migration=1
        [scheduler]
        group.ice-ptp=0:f:10:*:ice-ptp.*
        [service]
        service.stalld=start,enable
        service.chronyd=stop,disable
      name: performance-patch
    recommend:
    - machineConfigLabels:
        machineconfiguration.openshift.io/role: master
      priority: 19
      profile: performance-patch
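
One way to confirm that the patch profile is active is to list the Tuned and Profile objects that the Node Tuning Operator manages; this is a sketch and the output depends on the node:

  $ oc get tuned -n openshift-cluster-node-tuning-operator
  $ oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator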

SR-IOV

Single root I/O virtualization (SR-IOV) is commonly used to enable the fronthaul and the midhaul networks. The following YAML example configures SR-IOV for a single-node OpenShift cluster.

Recommended SR-IOV configuration

  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovOperatorConfig
  metadata:
    name: default
    namespace: openshift-sriov-network-operator
  spec:
    configDaemonNodeSelector:
      node-role.kubernetes.io/master: ""
    disableDrain: true
    enableInjector: true
    enableOperatorWebhook: true
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetwork
  metadata:
    name: sriov-nw-du-mh
    namespace: openshift-sriov-network-operator
  spec:
    networkNamespace: openshift-sriov-network-operator
    resourceName: du_mh
    vlan: 150 (1)
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: sriov-nnp-du-mh
    namespace: openshift-sriov-network-operator
  spec:
    deviceType: vfio-pci (2)
    isRdma: false
    nicSelector:
      pfNames:
      - ens7f0 (3)
    nodeSelector:
      node-role.kubernetes.io/master: ""
    numVfs: 8 (4)
    priority: 10
    resourceName: du_mh
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetwork
  metadata:
    name: sriov-nw-du-fh
    namespace: openshift-sriov-network-operator
  spec:
    networkNamespace: openshift-sriov-network-operator
    resourceName: du_fh
    vlan: 140 (5)
  ---
  apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodePolicy
  metadata:
    name: sriov-nnp-du-fh
    namespace: openshift-sriov-network-operator
  spec:
    deviceType: netdevice (6)
    isRdma: true
    nicSelector:
      pfNames:
      - ens5f0 (7)
    nodeSelector:
      node-role.kubernetes.io/master: ""
    numVfs: 8 (8)
    priority: 10
    resourceName: du_fh
(1) Specifies the VLAN for the midhaul network.
(2) Select either vfio-pci or netdevice, as needed.
(3) Specifies the interface connected to the midhaul network.
(4) Specifies the number of VFs for the midhaul network.
(5) Specifies the VLAN for the fronthaul network.
(6) Select either vfio-pci or netdevice, as needed.
(7) Specifies the interface connected to the fronthaul network.
(8) Specifies the number of VFs for the fronthaul network.
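
To check that the node policies have been synced to the host, you can inspect the SriovNetworkNodeState object for the node; the object is named after the node, and example-sno-1 is illustrative:

  $ oc get sriovnetworknodestates -n openshift-sriov-network-operator
  $ oc get sriovnetworknodestates example-sno-1 -n openshift-sriov-network-operator -o yaml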

Console Operator

The console-operator installs and maintains the web console on a cluster. When the node is centrally managed, the Operator is not needed, and removing it frees resources for application workloads. The following Console custom resource (CR) example disables the console.

Recommended console configuration

  apiVersion: operator.openshift.io/v1
  kind: Console
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "false"
      include.release.openshift.io/self-managed-high-availability: "false"
      include.release.openshift.io/single-node-developer: "false"
      release.openshift.io/create-only: "true"
    name: cluster
  spec:
    logLevel: Normal
    managementState: Removed
    operatorLogLevel: Normal
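
After the CR is applied, you can confirm that the console pods have been removed by listing the pods in the console namespace; no console pods are expected:

  $ oc get pods -n openshift-console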

Grafana and Alertmanager

Single-node OpenShift clusters that run DU workloads need to reduce the CPU resources consumed by the OKD monitoring components. The following ConfigMap custom resource (CR) disables Grafana and Alertmanager.

Recommended cluster monitoring configuration

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      grafana:
        enabled: false
      alertmanagerMain:
        enabled: false
      prometheusK8s:
        retention: 24h
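
You can confirm that the Grafana and Alertmanager pods are no longer running by listing the monitoring pods:

  $ oc get pods -n openshift-monitoring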

Network diagnostics

Single-node OpenShift clusters that run DU workloads require fewer inter-pod network connectivity checks to reduce the additional load that these pods create. The following custom resource (CR) disables these checks.

Recommended network diagnostics configuration

  apiVersion: operator.openshift.io/v1
  kind: Network
  metadata:
    name: cluster
  spec:
    disableNetworkDiagnostics: true
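
After the change is applied, the connectivity check pods are removed from the cluster. A quick way to confirm this is to list the pods in the network diagnostics namespace, where none are expected:

  $ oc get pods -n openshift-network-diagnostics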
