Planning your environment according to object maximums

Consider the following tested object maximums when you plan your OKD cluster.

These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower. There are many factors that influence the stated thresholds, including the etcd version or storage data format.

These guidelines apply to OKD with software-defined networking (SDN), not Open Virtual Network (OVN).

In most cases, exceeding these numbers results in lower overall performance. It does not necessarily mean that the cluster will fail.

OKD tested cluster maximums for major releases

Tested Cloud Platforms for OKD 3.x: Red Hat OpenStack Platform (RHOSP), Amazon Web Services and Microsoft Azure. Tested Cloud Platforms for OKD 4.x: Amazon Web Services, Microsoft Azure and Google Cloud Platform.

| Maximum type | 3.x tested maximum | 4.x tested maximum |
|---|---|---|
| Number of nodes | 2,000 | 2,000 [1] |
| Number of pods [2] | 150,000 | 150,000 |
| Number of pods per node | 250 | 500 [3] |
| Number of pods per core | There is no default value. | There is no default value. |
| Number of namespaces [4] | 10,000 | 10,000 |
| Number of builds | 10,000 (Default pod RAM 512 Mi) - Pipeline Strategy | 10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy |
| Number of pods per namespace [5] | 25,000 | 25,000 |
| Number of services [6] | 10,000 | 10,000 |
| Number of services per namespace | 5,000 | 5,000 |
| Number of back-ends per service | 5,000 | 5,000 |
| Number of deployments per namespace [5] | 2,000 | 2,000 |

  1. Pause pods were deployed to stress the control plane components of OpenShift at 2000 node scale.

  2. The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.

  3. This was tested on a cluster with 100 worker nodes and 500 pods per worker node. The default maxPods is still 250. To reach 500 pods per node, the cluster must be created with maxPods set to 500 by using a custom kubelet config. If you need 500 user pods, you need a hostPrefix of 22 because there are 10-15 system pods already running on the node. The maximum number of pods with attached persistent volume claims (PVCs) depends on the storage backend from which the PVCs are allocated. In our tests, only OpenShift Container Storage v4 (OCS v4) was able to satisfy the number of pods per node discussed in this document.

  4. When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, is highly recommended to free etcd storage.

  5. There are a number of control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.

  6. Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impacts the size of the endpoints objects, which in turn impacts the size of the data that is sent throughout the system.
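The hostPrefix requirement in footnote 3 can be checked with simple arithmetic. The following is a minimal sketch, assuming an IPv4 cluster network and counting every address in the per-node subnet as usable (in practice a few addresses are reserved, so the real headroom is slightly smaller). The function names and the choice of 15 system pods (the upper end of the 10-15 range from the footnote) are illustrative assumptions.

```python
def addresses_per_node(host_prefix: int) -> int:
    """Number of IPv4 addresses in a per-node subnet of the given prefix length."""
    return 2 ** (32 - host_prefix)


def fits(host_prefix: int, user_pods: int, system_pods: int = 15) -> bool:
    """True if user pods plus system pods fit in the per-node subnet.

    Assumption: every address in the subnet is usable for a pod.
    """
    return user_pods + system_pods <= addresses_per_node(host_prefix)


for prefix in (23, 22):
    print(prefix, addresses_per_node(prefix), fits(prefix, user_pods=500))
```

Under these assumptions, a hostPrefix of 23 provides 512 addresses per node, which is not enough for 500 user pods plus system pods, while a hostPrefix of 22 provides 1024.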

OKD environment and configuration on which the cluster maximums are tested

AWS cloud platform:

| Node | Flavor | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOPS | Count | Region |
|---|---|---|---|---|---|---|---|
| Master/etcd [1] | r5.4xlarge | 16 | 128 | io1 | 220 / 3000 | 3 | us-west-2 |
| Infra [2] | m5.12xlarge | 48 | 192 | gp2 | 100 | 3 | us-west-2 |
| Workload [3] | m5.4xlarge | 16 | 64 | gp2 | 500 [4] | 1 | us-west-2 |
| Worker | m5.2xlarge | 8 | 32 | gp2 | 100 | 3/25/250/500 [5] | us-west-2 |

  1. io1 disks with 3000 IOPS are used for master/etcd nodes as etcd is I/O intensive and latency sensitive.

  2. Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.

  3. The workload node is dedicated to running performance and scalability workload generators.

  4. A larger disk size is used so that there is enough space to store the large amount of data that is collected during the performance and scalability test run.

  5. Cluster is scaled in iterations and performance and scalability tests are executed at the specified node counts.

IBM Power Systems platform:

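| Node | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOPS | Count |
|---|---|---|---|---|---|
| Master/etcd [1] | 16 | 32 | io1 | 120 / 3 IOPS per GB | 3 |
| Infra [2] | 16 | 64 | gp2 | 120 | 2 |
| Workload [3] | 16 | 256 | gp2 | 120 [4] | 1 |
| Worker | 16 | 64 | gp2 | 120 | 3/25/250/500 [5] |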

  1. io1 disks of 120 GiB with 3 IOPS per GB are used for master/etcd nodes as etcd is I/O intensive and latency sensitive.

  2. Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.

  3. The workload node is dedicated to running performance and scalability workload generators.

  4. A larger disk size is used so that there is enough space to store the large amount of data that is collected during the performance and scalability test run.

  5. Cluster is scaled in iterations and performance and scalability tests are executed at the specified node counts.

How to plan your environment according to tested cluster maximums

Oversubscribing the physical resources on a node affects resource guarantees the Kubernetes scheduler makes during pod placement. Learn what measures you can take to avoid memory swapping.

Some of the tested maximums are stretched only in a single dimension. They will vary when many objects are running on the cluster.

The numbers noted in this documentation are based on Red Hat’s test methodology, setup, configuration, and tunings. These numbers can vary based on your own individual setup and environments.

While planning your environment, determine how many pods are expected to fit per node:

    required pods per cluster / pods per node = total number of nodes needed

The default maximum number of pods per node is 250; 500 pods per node were tested with a custom kubelet configuration. However, the number of pods that fit on a node depends on the application itself. Consider the application’s memory, CPU, and storage requirements, as described in How to plan your environment according to application requirements.

Example scenario

If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes, assuming that there are 500 maximum pods per node:

    2200 / 500 = 4.4

If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:

    2200 / 20 = 110

Where:

    required pods per cluster / total number of nodes = expected pods per node
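The two formulas in this scenario can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical, and the node count is rounded up because a fractional node cannot be provisioned.

```python
import math


def nodes_needed(required_pods: int, max_pods_per_node: int) -> int:
    """required pods per cluster / pods per node, rounded up to whole nodes."""
    return math.ceil(required_pods / max_pods_per_node)


def expected_pods_per_node(required_pods: int, node_count: int) -> float:
    """required pods per cluster / total number of nodes."""
    return required_pods / node_count


print(nodes_needed(2200, 500))            # 4.4 rounded up to 5 nodes
print(expected_pods_per_node(2200, 20))   # 110.0 pods per node
```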

How to plan your environment according to application requirements

Consider an example application environment:

| Pod type | Pod quantity | Max memory | CPU cores | Persistent storage |
|---|---|---|---|---|
| apache | 100 | 500 MB | 0.5 | 1 GB |
| node.js | 200 | 1 GB | 1 | 1 GB |
| postgresql | 100 | 1 GB | 2 | 10 GB |
| JBoss EAP | 100 | 1 GB | 1 | 1 GB |

Extrapolated requirements: 550 CPU cores, 450GB RAM, and 1.4TB storage.
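The extrapolated figures follow from multiplying each pod type's quantity by its per-pod requirements and summing the results. The sketch below reproduces that calculation from the example table; the `PodType` class and `totals` function are illustrative names, and 500 MB is treated as 0.5 GB.

```python
from dataclasses import dataclass


@dataclass
class PodType:
    name: str
    quantity: int
    memory_gb: float
    cpu_cores: float
    storage_gb: float


# Values copied from the example application environment table.
PODS = [
    PodType("apache", 100, 0.5, 0.5, 1),
    PodType("node.js", 200, 1, 1, 1),
    PodType("postgresql", 100, 1, 2, 10),
    PodType("JBoss EAP", 100, 1, 1, 1),
]


def totals(pods):
    """Return (CPU cores, memory in GB, storage in GB) summed over all pods."""
    cpu = sum(p.quantity * p.cpu_cores for p in pods)
    memory = sum(p.quantity * p.memory_gb for p in pods)
    storage = sum(p.quantity * p.storage_gb for p in pods)
    return cpu, memory, storage


cpu, memory, storage = totals(PODS)
print(f"{cpu:g} CPU cores, {memory:g} GB RAM, {storage / 1000:g} TB storage")
```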

Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or fewer larger nodes to provide the same amount of resources. Factors such as operational agility and cost-per-instance should be considered.

| Node type | Quantity | CPUs | RAM (GB) |
|---|---|---|---|
| Nodes (option 1) | 100 | 4 | 16 |
| Nodes (option 2) | 50 | 8 | 32 |
| Nodes (option 3) | 25 | 16 | 64 |
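A quick way to confirm that the three node options provide the same amount of resources is to multiply out each row. The following sketch does only that; the dictionary layout and function name are illustrative.

```python
# (quantity, CPUs per node, RAM in GB per node), copied from the table above.
NODE_OPTIONS = {
    "option 1": (100, 4, 16),
    "option 2": (50, 8, 32),
    "option 3": (25, 16, 64),
}


def cluster_capacity(quantity: int, cpus: int, ram_gb: int):
    """Return the total (CPUs, RAM in GB) that a node option provides."""
    return quantity * cpus, quantity * ram_gb


for name, option in NODE_OPTIONS.items():
    total_cpus, total_ram = cluster_capacity(*option)
    print(f"{name}: {total_cpus} CPUs, {total_ram} GB RAM")
```

Each option yields the same totals, so the choice between them comes down to the operational and cost factors mentioned above.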

Some applications lend themselves well to overcommitted environments, and some do not. Most Java applications and applications that use huge pages are examples of applications that would not allow for overcommitment, because the memory they reserve cannot be used by other applications. In the example above, the environment would be roughly 30 percent overcommitted, a common ratio.

The application pods can access a service either by using environment variables or DNS. If using environment variables, for each active service the variables are injected by the kubelet when a pod is run on a node. A cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS records for each one. If DNS is enabled throughout your cluster, then all pods should automatically be able to resolve services by their DNS name. Use DNS for service discovery if you must go beyond 5000 services. When environment variables are used for service discovery, the argument list exceeds the allowed length after 5000 services in a namespace, and the pods and deployments start to fail. To overcome this, disable the service links in the deployment’s specification file:

    ---
    apiVersion: v1
    kind: Template
    metadata:
      name: deployment-config-template
      creationTimestamp:
      annotations:
        description: This template will create a deploymentConfig with 1 replica, 4 env vars and a service.
        tags: ''
    objects:
    - apiVersion: v1
      kind: DeploymentConfig
      metadata:
        name: deploymentconfig${IDENTIFIER}
      spec:
        template:
          metadata:
            labels:
              name: replicationcontroller${IDENTIFIER}
          spec:
            enableServiceLinks: false
            containers:
            - name: pause${IDENTIFIER}
              image: "${IMAGE}"
              ports:
              - containerPort: 8080
                protocol: TCP
              env:
              - name: ENVVAR1_${IDENTIFIER}
                value: "${ENV_VALUE}"
              - name: ENVVAR2_${IDENTIFIER}
                value: "${ENV_VALUE}"
              - name: ENVVAR3_${IDENTIFIER}
                value: "${ENV_VALUE}"
              - name: ENVVAR4_${IDENTIFIER}
                value: "${ENV_VALUE}"
              resources: {}
              imagePullPolicy: IfNotPresent
              capabilities: {}
              securityContext:
                capabilities: {}
                privileged: false
            restartPolicy: Always
            serviceAccount: ''
        replicas: 1
        selector:
          name: replicationcontroller${IDENTIFIER}
        triggers:
        - type: ConfigChange
        strategy:
          type: Rolling
    - apiVersion: v1
      kind: Service
      metadata:
        name: service${IDENTIFIER}
      spec:
        selector:
          name: replicationcontroller${IDENTIFIER}
        ports:
        - name: serviceport${IDENTIFIER}
          protocol: TCP
          port: 80
          targetPort: 8080
        portalIP: ''
        type: ClusterIP
        sessionAffinity: None
      status:
        loadBalancer: {}
    parameters:
    - name: IDENTIFIER
      description: Number to append to the name of resources
      value: '1'
      required: true
    - name: IMAGE
      description: Image to use for deploymentConfig
      value: gcr.io/google-containers/pause-amd64:3.0
      required: false
    - name: ENV_VALUE
      description: Value to use for environment variables
      generate: expression
      from: "[A-Za-z0-9]{255}"
      required: false
    labels:
      template: deployment-config-template

The number of application pods that can run in a namespace depends on the number of services and the length of the service names when environment variables are used for service discovery. ARG_MAX on the system defines the maximum argument length for a new process, and it is set to 2097152 bytes (2 MiB) by default. The kubelet injects environment variables into each pod scheduled to run in the namespace, including:

  • <SERVICE_NAME>_SERVICE_HOST=<IP>

  • <SERVICE_NAME>_SERVICE_PORT=<PORT>

  • <SERVICE_NAME>_PORT=tcp://<IP>:<PORT>

  • <SERVICE_NAME>_PORT_<PORT>_TCP=tcp://<IP>:<PORT>

  • <SERVICE_NAME>_PORT_<PORT>_TCP_PROTO=tcp

  • <SERVICE_NAME>_PORT_<PORT>_TCP_PORT=<PORT>

  • <SERVICE_NAME>_PORT_<PORT>_TCP_ADDR=<ADDR>

The pods in the namespace start to fail if the argument length exceeds the allowed value, and the number of characters in each service name contributes to that length. For example, in a namespace with 5000 services, the limit on the service name is 33 characters, which enables you to run 5000 pods in the namespace.
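To get a feel for how service name length drives this limit, the seven variables listed above can be generated for a hypothetical service and their sizes summed. This is a rough sketch only: the IP address, port, and service names are made-up sample values, each variable is counted as its string length plus one terminating NUL byte, and the estimate ignores pointer overhead as well as the container's own arguments and other environment variables, so it should not be read as the exact accounting the kernel performs.

```python
ARG_MAX_BYTES = 2097152  # typical Linux default, as discussed above


def service_env_vars(service_name: str, ip: str, port: int) -> list:
    """Build the per-service variables from the list above for one TCP port.

    Service names are upper-cased and dashes become underscores.
    """
    prefix = service_name.upper().replace("-", "_")
    return [
        f"{prefix}_SERVICE_HOST={ip}",
        f"{prefix}_SERVICE_PORT={port}",
        f"{prefix}_PORT=tcp://{ip}:{port}",
        f"{prefix}_PORT_{port}_TCP=tcp://{ip}:{port}",
        f"{prefix}_PORT_{port}_TCP_PROTO=tcp",
        f"{prefix}_PORT_{port}_TCP_PORT={port}",
        f"{prefix}_PORT_{port}_TCP_ADDR={ip}",
    ]


def env_bytes(service_name: str, ip: str, port: int) -> int:
    """Approximate bytes used by one service: string length plus a NUL each."""
    return sum(len(v.encode()) + 1 for v in service_env_vars(service_name, ip, port))


# Hypothetical sample: 5000 services with 33-character names, one port each.
per_service = env_bytes("s" * 33, "172.30.0.1", 80)
estimate = per_service * 5000
print(per_service, estimate, estimate <= ARG_MAX_BYTES)
```

With these sample values the estimate lands just under the limit, which is consistent with the 33-character guidance; longer IP addresses, port numbers, or service names push it over.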