Scheduler Performance Tuning

FEATURE STATE: Kubernetes 1.14 beta

This feature is currently in a beta state, meaning:

  • The version names contain beta (e.g. v2beta3).
  • Code is well tested. Enabling the feature is considered safe. Enabled by default.
  • Support for the overall feature will not be dropped, though details may change.
  • The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens, we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
  • Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have multiple clusters that can be upgraded independently, you may be able to relax this restriction.
  • Please do try our beta features and give feedback on them! After they exit beta, it may not be practical for us to make more changes.

kube-scheduler is the Kubernetes default scheduler. It is responsible for placement of Pods on Nodes in a cluster.

Nodes in a cluster that meet the scheduling requirements of a Pod are called feasible Nodes for the Pod. The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes, picking a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called Binding.

This page explains performance tuning optimizations that are relevant for large Kubernetes clusters.

Percentage of Nodes to Score

Before Kubernetes 1.12, kube-scheduler checked the feasibility of all nodes in a cluster and then scored the feasible ones. Kubernetes 1.12 added a feature that allows the scheduler to stop looking for more feasible nodes once it finds a certain number of them. This improves the scheduler’s performance in large clusters. The number is specified as a percentage of the cluster size and is controlled by a configuration option called percentageOfNodesToScore. Valid values range from 1 to 100. Larger values are treated as 100%. Zero is equivalent to not providing the config option.

Kubernetes 1.14 has logic to find the percentage of nodes to score based on the size of the cluster if it is not specified in the configuration. It uses a linear formula which yields 50% for a 100-node cluster and 10% for a 5000-node cluster. The lower bound for the automatic value is 5%. In other words, the scheduler always scores at least 5% of the cluster no matter how large the cluster is, unless the user provides the config option with a value smaller than 5.
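The automatic percentage can be sketched in a few lines. The text gives only two data points and the 5% floor, so the slope used here (one percentage point per 125 nodes, with integer division) is an assumption chosen to reproduce those figures. The names adaptive_percentage, BASE_PERCENTAGE, MIN_PERCENTAGE and NODES_PER_POINT are hypothetical and not part of any Kubernetes API.

```python
BASE_PERCENTAGE = 50    # starting value; gives 50% for a 100-node cluster
MIN_PERCENTAGE = 5      # lower bound for the automatic value, per the text
NODES_PER_POINT = 125   # ASSUMPTION: slope chosen to match the stated figures


def adaptive_percentage(num_nodes: int) -> int:
    """Return an automatic percentage of nodes to score for a cluster size."""
    # Linear decrease with cluster size, using integer division.
    percentage = BASE_PERCENTAGE - num_nodes // NODES_PER_POINT
    # Never go below the documented 5% floor.
    return max(percentage, MIN_PERCENTAGE)


if __name__ == "__main__":
    for size in (100, 1000, 5000, 10000):
        print(size, adaptive_percentage(size))
```

Under this assumed slope, the floor of 5% takes over once the linear term drops below it, which matches the statement that very large clusters still have at least 5% of their nodes scored.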

Below is an example configuration that sets percentageOfNodesToScore to 50%.

  apiVersion: kubescheduler.config.k8s.io/v1alpha1
  kind: KubeSchedulerConfiguration
  algorithmSource:
    provider: DefaultProvider

  ...

  # Stop searching once 50% of the cluster's nodes have been found feasible.
  percentageOfNodesToScore: 50

Note: In clusters with fewer than 50 feasible nodes, the scheduler still checks all the nodes, simply because there are not enough feasible nodes to stop the scheduler’s search early.

To disable this feature, you can set percentageOfNodesToScore to 100.

Tuning percentageOfNodesToScore

percentageOfNodesToScore must be a value between 1 and 100, with the default value calculated from the cluster size. There is also a hardcoded minimum of 50 nodes. This means that lowering this option in clusters with several hundred nodes will not have much impact on the number of feasible nodes that the scheduler tries to find. This is intentional, as this option is unlikely to improve performance noticeably in smaller clusters. In large clusters with over 1000 nodes, setting this value to lower numbers may show a noticeable performance improvement.
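To see why low percentages have little effect in mid-sized clusters, here is a rough sketch of how a percentage and the 50-node minimum might combine. The function name, the constant name, and the exact rounding are assumptions made for illustration. Only the 1 to 100 range and the minimum of 50 nodes come from the text.

```python
MIN_FEASIBLE_NODES_TO_FIND = 50  # hardcoded minimum described in the text


def num_feasible_nodes_to_find(num_all_nodes: int, percentage: int) -> int:
    """Return how many feasible nodes to look for before stopping (sketch)."""
    # Small clusters, or a setting of 100, mean every node is checked.
    if num_all_nodes < MIN_FEASIBLE_NODES_TO_FIND or percentage >= 100:
        return num_all_nodes
    # ASSUMPTION: the percentage is applied with integer division.
    wanted = num_all_nodes * percentage // 100
    # The minimum keeps very low percentages from shrinking the search further.
    return max(wanted, MIN_FEASIBLE_NODES_TO_FIND)


if __name__ == "__main__":
    for nodes, pct in ((300, 10), (300, 5), (5000, 10), (5000, 5)):
        print(nodes, pct, num_feasible_nodes_to_find(nodes, pct))
```

In this sketch a 300-node cluster looks for 50 feasible nodes whether the option is 10 or 5, while a 5000-node cluster drops from 500 to 250, which is the difference in impact the paragraph above describes.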

An important point to consider when setting this value is that when fewer nodes in a cluster are checked for feasibility, some nodes are not sent to be scored for a given Pod. As a result, a Node which could possibly score a higher value for running the given Pod might not even be passed to the scoring phase. This would result in a less than ideal placement of the Pod. For this reason, the value should not be set to very low percentages. A general rule of thumb is to never set the value to anything lower than 10. Lower values should be used only when the scheduler’s throughput is critical for your application and the score of nodes is not important. In other words, you prefer to run the Pod on any Node as long as it is feasible.
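The trade-off can be illustrated with a toy example. The node names and scores below are made up, every node is assumed to be feasible, and pick_best is a hypothetical helper rather than scheduler code.

```python
from typing import List, Tuple


def pick_best(scored_nodes: List[Tuple[str, int]], nodes_to_find: int) -> Tuple[str, int]:
    """Score only the first nodes_to_find nodes and return the highest scorer."""
    candidates = scored_nodes[:nodes_to_find]
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical (name, score) pairs in the order the scheduler would visit them.
nodes = [("node-a", 40), ("node-b", 55), ("node-c", 30), ("node-d", 90)]

limited = pick_best(nodes, 2)          # only the first two nodes are scored
full = pick_best(nodes, len(nodes))    # every node is scored

if __name__ == "__main__":
    print("limited search picks:", limited)
    print("full search picks:", full)
```

Here the limited search never sees node-d, so it settles for a lower-scoring node. That is the cost being traded for throughput.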

If your cluster has several hundred Nodes or fewer, we do not recommend lowering the default value of this configuration option. It is unlikely to improve the scheduler’s performance significantly.

How the scheduler iterates over Nodes

This section is intended for those who want to understand the internal details of this feature.

In order to give all the Nodes in a cluster a fair chance of being considered for running Pods, the scheduler iterates over the nodes in a round robin fashion. You can imagine that Nodes are in an array. The scheduler starts from the start of the array and checks feasibility of the nodes until it finds enough Nodes as specified by percentageOfNodesToScore. For the next Pod, the scheduler continues from the point in the Node array that it stopped at when checking feasibility of Nodes for the previous Pod.
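The array picture above can be turned into a small sketch. RoundRobinScanner and its method are hypothetical names used only for illustration. The real scheduler's data structures are not described on this page.

```python
from typing import Callable, List


class RoundRobinScanner:
    """Hypothetical helper that remembers where the previous scan stopped."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = nodes
        self.next_start = 0  # index where the next Pod's scan begins

    def find_feasible(self, is_feasible: Callable[[str], bool], wanted: int) -> List[str]:
        """Check nodes in order until `wanted` feasible ones are found."""
        found: List[str] = []
        checked = 0
        total = len(self.nodes)
        # Visit each node at most once per Pod, wrapping around the array.
        while checked < total and len(found) < wanted:
            node = self.nodes[(self.next_start + checked) % total]
            checked += 1
            if is_feasible(node):
                found.append(node)
        if total:
            # The next Pod continues from where this scan stopped.
            self.next_start = (self.next_start + checked) % total
        return found


scanner = RoundRobinScanner(["node-1", "node-2", "node-3", "node-4", "node-5"])
first = scanner.find_feasible(lambda node: True, wanted=2)
second = scanner.find_feasible(lambda node: True, wanted=2)
third = scanner.find_feasible(lambda node: True, wanted=2)
```

With every node feasible, the three scans cover different parts of the array, and the third one wraps around to the beginning.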

If Nodes are in multiple zones, the scheduler iterates over Nodes in various zones to ensure that Nodes from different zones are considered in the feasibility checks. As an example, consider six nodes in two zones:

  Zone 1: Node 1, Node 2, Node 3, Node 4
  Zone 2: Node 5, Node 6

The scheduler evaluates feasibility of the nodes in this order:

  Node 1, Node 5, Node 2, Node 6, Node 3, Node 4

After going over all the Nodes, it goes back to Node 1.
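One way to reproduce the order in this example is to take one node from each zone in turn until every zone is exhausted. The sketch below does that with itertools.zip_longest. The function name interleave_zones is hypothetical, and this is only an illustration of the ordering, not the scheduler's implementation.

```python
from itertools import zip_longest
from typing import Dict, List


def interleave_zones(zones: Dict[str, List[str]]) -> List[str]:
    """Return nodes ordered by taking one node per zone on each pass."""
    order: List[str] = []
    # zip_longest pads shorter zones with None once they run out of nodes.
    for group in zip_longest(*zones.values()):
        order.extend(node for node in group if node is not None)
    return order


zones = {
    "Zone 1": ["Node 1", "Node 2", "Node 3", "Node 4"],
    "Zone 2": ["Node 5", "Node 6"],
}
order = interleave_zones(zones)

if __name__ == "__main__":
    print(", ".join(order))
```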
