Scheduling Framework

FEATURE STATE: Kubernetes 1.15 alpha

This feature is currently in an alpha state, meaning:

  • The version names contain alpha (e.g. v1alpha1).
  • Might be buggy. Enabling the feature may expose bugs. Disabled by default.
  • Support for the feature may be dropped at any time without notice.
  • The API may change in incompatible ways in a later software release without notice.
  • Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.

The scheduling framework is a new pluggable architecture for the Kubernetes scheduler that makes scheduler customizations easy. It adds a new set of “plugin” APIs to the existing scheduler. Plugins are compiled into the scheduler. The APIs allow most scheduling features to be implemented as plugins, while keeping the scheduling “core” simple and maintainable. Refer to the design proposal of the scheduling framework for more technical information on the design of the framework.

Framework workflow

The Scheduling Framework defines a few extension points. Scheduler plugins register to be invoked at one or more extension points. Some of these plugins can change the scheduling decisions and some are informational only.

Each attempt to schedule one Pod is split into two phases, the scheduling cycle and the binding cycle.

Scheduling Cycle & Binding Cycle

The scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster. Together, a scheduling cycle and binding cycle are referred to as a “scheduling context”.

Scheduling cycles are run serially, while binding cycles may run concurrently.

A scheduling or binding cycle can be aborted if the Pod is determined to be unschedulable or if there is an internal error. The Pod will be returned to the queue and retried.
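The split between serial scheduling cycles and concurrent binding cycles can be pictured with goroutines. The following is a minimal sketch, not the scheduler's actual code: pickNode and scheduleAll are hypothetical helpers, and Pods and Nodes are represented as plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// pickNode stands in for a scheduling cycle (hypothetical helper).
// It simply alternates between two made-up node names.
func pickNode(i int) string {
	return fmt.Sprintf("node-%d", i%2)
}

// scheduleAll runs scheduling cycles one at a time and starts a
// binding cycle for each Pod in its own goroutine.
func scheduleAll(pods []string) map[string]string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		bound = make(map[string]string)
	)
	for i, pod := range pods {
		node := pickNode(i) // scheduling cycle: serial
		wg.Add(1)
		// Binding cycle: may overlap with later scheduling cycles.
		go func(pod, node string) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bound[pod] = node
		}(pod, node)
	}
	wg.Wait()
	return bound
}

func main() {
	bound := scheduleAll([]string{"a", "b", "c"})
	fmt.Println(len(bound), bound["a"], bound["b"], bound["c"])
}
```

The mutex is only needed because this sketch shares one map between binding goroutines; it says nothing about how the real scheduler guards its cache.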

Extension points

The following picture shows the scheduling context of a Pod and the extension points that the scheduling framework exposes. In this picture “Filter” is equivalent to “Predicate” and “Scoring” is equivalent to “Priority function”.

One plugin may register at multiple extension points to perform more complex or stateful tasks.

Figure 1: Scheduling framework extension points

Queue sort

These plugins are used to sort Pods in the scheduling queue. A queue sort plugin essentially provides a “less(Pod1, Pod2)” function. Only one queue sort plugin may be enabled at a time.
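As an illustration of what such a “less” function might look like, here is a minimal sketch. The Pod type is a hypothetical stand-in for the real API type, and the ordering rule (higher priority first, ties broken by name) is an assumption chosen for the example, not the scheduler's default.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical stand-in with only the fields this sketch needs.
type Pod struct {
	Name     string
	Priority int32
}

// Less reports whether p1 should be scheduled before p2.
// Assumed rule: higher priority first, then alphabetical by name.
func Less(p1, p2 *Pod) bool {
	if p1.Priority != p2.Priority {
		return p1.Priority > p2.Priority
	}
	return p1.Name < p2.Name
}

// sortQueue orders the queue in place using Less.
func sortQueue(queue []*Pod) {
	sort.SliceStable(queue, func(i, j int) bool {
		return Less(queue[i], queue[j])
	})
}

func main() {
	queue := []*Pod{{"web", 10}, {"batch", 1}, {"db", 10}}
	sortQueue(queue)
	for _, p := range queue {
		fmt.Println(p.Name, p.Priority)
	}
}
```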

Pre-filter

These plugins are used to pre-process info about the Pod, or to check certain conditions that the cluster or the Pod must meet. If a pre-filter plugin returns an error, the scheduling cycle is aborted.

Filter

These plugins are used to filter out nodes that cannot run the Pod. For each node, the scheduler will call filter plugins in their configured order. If any filter plugin marks the node as infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated concurrently.
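The short-circuit behaviour described above can be sketched as follows. Everything here is hypothetical: the Pod and Node types, the FilterPlugin interface shape, and the two example plugins (cpuFit and untainted) exist only to show the control flow.

```go
package main

import "fmt"

// Pod and Node are hypothetical stand-ins for the real API types.
type Pod struct {
	Name       string
	CPURequest int
}

type Node struct {
	Name    string
	FreeCPU int
	Tainted bool
}

// FilterPlugin is an assumed interface shape used only in this sketch.
type FilterPlugin interface {
	Name() string
	Filter(pod *Pod, node *Node) bool // true means the node is feasible
}

type cpuFit struct{}

func (cpuFit) Name() string { return "cpuFit" }
func (cpuFit) Filter(p *Pod, n *Node) bool {
	return p.CPURequest <= n.FreeCPU
}

type untainted struct{}

func (untainted) Name() string { return "untainted" }
func (untainted) Filter(_ *Pod, n *Node) bool {
	return !n.Tainted
}

// runFilters calls plugins in their configured order and stops at the
// first one that marks the node as infeasible.
func runFilters(plugins []FilterPlugin, pod *Pod, node *Node) (feasible bool, rejectedBy string) {
	for _, p := range plugins {
		if !p.Filter(pod, node) {
			return false, p.Name()
		}
	}
	return true, ""
}

func main() {
	plugins := []FilterPlugin{cpuFit{}, untainted{}}
	pod := &Pod{Name: "web", CPURequest: 2}
	nodes := []*Node{
		{Name: "n1", FreeCPU: 4},
		{Name: "n2", FreeCPU: 1, Tainted: true},
	}
	for _, n := range nodes {
		ok, by := runFilters(plugins, pod, n)
		fmt.Println(n.Name, ok, by)
	}
}
```

For node n2 both plugins would reject the Pod, but only cpuFit is reported because untainted is never called.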

Post-filter

This is an informational extension point. Plugins will be called with a list of nodes that passed the filtering phase. A plugin may use this data to update internal state or to generate logs/metrics.

Note: Plugins wishing to perform “pre-scoring” work should use the post-filter extension point.

Scoring

These plugins are used to rank nodes that have passed the filtering phase. The scheduler will call each scoring plugin for each node. There will be a well-defined range of integers representing the minimum and maximum scores. After the normalize scoring phase, the scheduler will combine node scores from all plugins according to the configured plugin weights.
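One way to picture the final combination step is a weighted sum per node. The sketch below assumes exactly that; combineScores and the plugin and node names are hypothetical, and the scores are taken to be already normalized.

```go
package main

import "fmt"

// combineScores adds up weight*score for every node.
// scores is indexed as scores[pluginName][nodeName];
// weights is indexed by plugin name. A plugin with no
// configured weight contributes nothing in this sketch.
func combineScores(scores map[string]map[string]int, weights map[string]int) map[string]int {
	total := make(map[string]int)
	for plugin, perNode := range scores {
		w := weights[plugin]
		for node, s := range perNode {
			total[node] += w * s
		}
	}
	return total
}

func main() {
	scores := map[string]map[string]int{
		"pluginA": {"n1": 10, "n2": 4},
		"pluginB": {"n1": 2, "n2": 8},
	}
	weights := map[string]int{"pluginA": 1, "pluginB": 2}
	total := combineScores(scores, weights)
	fmt.Println(total["n1"], total["n2"]) // 14 20
}
```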

Normalize scoring

These plugins are used to modify scores before the scheduler computes a final ranking of Nodes. A plugin that registers for this extension point will be called with the scoring results from the same plugin. This is called once per plugin per scheduling cycle.

For example, suppose a plugin BlinkingLightScorer ranks Nodes based on how many blinking lights they have.

    // ScoreNode uses the Node's blinking light count as its raw score.
    func ScoreNode(_ *v1.Pod, n *v1.Node) (int, error) {
        return getBlinkingLightCount(n)
    }

However, the maximum count of blinking lights may be small compared to NodeScoreMax. To fix this, BlinkingLightScorer should also register for this extension point.

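    // NormalizeScores rescales scores so the best Node gets NodeScoreMax.
    func NormalizeScores(scores map[string]int) {
        highest := 0
        for _, score := range scores {
            if score > highest {
                highest = score
            }
        }
        // Nothing to rescale (and no division by zero) if every score is 0.
        if highest == 0 {
            return
        }
        for node, score := range scores {
            scores[node] = score * NodeScoreMax / highest
        }
    }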

If any normalize-scoring plugin returns an error, the scheduling cycle is aborted.

Note: Plugins wishing to perform “pre-reserve” work should use the normalize-scoring extension point.

Reserve

This is an informational extension point. Plugins which maintain runtime state (aka “stateful plugins”) should use this extension point to be notified by the scheduler when resources on a node are being reserved for a given Pod. This happens before the scheduler actually binds the Pod to the Node, and it exists to prevent race conditions while the scheduler waits for the bind to succeed.

This is the last step in a scheduling cycle. Once a Pod is in the reserved state, it will either trigger Unreserve plugins (on failure) or Post-bind plugins (on success) at the end of the binding cycle.

Note: This concept used to be referred to as “assume”.

Permit

These plugins are used to prevent or delay the binding of a Pod. A permit plugin can do one of three things:

  • approve: Once all permit plugins approve a Pod, it is sent for binding.

  • deny: If any permit plugin denies a Pod, it is returned to the scheduling queue. This will trigger Unreserve plugins.

  • wait (with a timeout): If a permit plugin returns “wait”, then the Pod is kept in the permit phase until a plugin approves it. If a timeout occurs, wait becomes deny and the Pod is returned to the scheduling queue, triggering Unreserve plugins.

Approving a Pod binding

While any plugin can access the list of “waiting” Pods from the cache and approve them (see FrameworkHandle), we expect only the permit plugins to approve binding of reserved Pods that are in “waiting” state. Once a Pod is approved, it is sent to the pre-bind phase.

Pre-bind

These plugins are used to perform any work required before a Pod is bound. For example, a pre-bind plugin may provision a network volume and mount it on the target node before allowing the Pod to run there.

If any pre-bind plugin returns an error, the Pod is rejected and returned to the scheduling queue.

Bind

These plugins are used to bind a Pod to a Node. Bind plugins will not be called until all pre-bind plugins have completed. Each bind plugin is called in the configured order. A bind plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the remaining bind plugins are skipped.
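The “first plugin that handles the Pod wins” behaviour can be sketched as a simple chain. The BindPlugin interface below, including the boolean used to signal “not handled”, is an assumption made for this example; gpuBinder and defaultBinder are hypothetical plugins.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// BindPlugin is an assumed interface shape for this sketch.
// Bind returns handled=false to pass the Pod to the next plugin.
type BindPlugin interface {
	Name() string
	Bind(pod, node string) (handled bool, err error)
}

// gpuBinder only handles Pods whose name starts with "gpu-".
type gpuBinder struct{}

func (gpuBinder) Name() string { return "gpuBinder" }
func (gpuBinder) Bind(pod, node string) (bool, error) {
	return strings.HasPrefix(pod, "gpu-"), nil
}

// defaultBinder handles every Pod it is offered.
type defaultBinder struct{}

func (defaultBinder) Name() string { return "defaultBinder" }
func (defaultBinder) Bind(pod, node string) (bool, error) {
	return true, nil
}

// runBind calls plugins in configured order and returns the name of
// the plugin that handled the Pod; later plugins are skipped.
func runBind(plugins []BindPlugin, pod, node string) (string, error) {
	for _, p := range plugins {
		handled, err := p.Bind(pod, node)
		if err != nil {
			return p.Name(), err
		}
		if handled {
			return p.Name(), nil
		}
	}
	return "", errors.New("no bind plugin handled the Pod")
}

func main() {
	plugins := []BindPlugin{gpuBinder{}, defaultBinder{}}
	for _, pod := range []string{"gpu-train", "web"} {
		by, err := runBind(plugins, pod, "n1")
		fmt.Println(pod, by, err)
	}
}
```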

Post-bind

This is an informational extension point. Post-bind plugins are called after a Pod is successfully bound. This is the end of a binding cycle, and can be used to clean up associated resources.

Unreserve

This is an informational extension point. If a Pod was reserved and then rejected in a later phase, then unreserve plugins will be notified. Unreserve plugins should clean up state associated with the reserved Pod.

Plugins that use this extension point usually should also use Reserve.

Plugin API

There are two steps to the plugin API. First, plugins must register and get configured, then they use the extension point interfaces. Extension point interfaces have the following form.

    type Plugin interface {
        Name() string
    }

    type QueueSortPlugin interface {
        Plugin
        Less(*v1.Pod, *v1.Pod) bool
    }

    type PreFilterPlugin interface {
        Plugin
        PreFilter(PluginContext, *v1.Pod) error
    }

    // ...
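Because every extension point interface embeds Plugin, one type can satisfy several of them at once. The sketch below uses simplified, hypothetical versions of these interfaces (a local Pod type, and PreFilter without the PluginContext argument) to show how a single plugin could be detected at more than one extension point with type assertions. It is an illustration only, not how the framework performs registration.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for the real API type.
type Pod struct {
	Name     string
	Priority int
}

// Simplified copies of the interface shapes, for this sketch only.
type Plugin interface {
	Name() string
}

type QueueSortPlugin interface {
	Plugin
	Less(a, b *Pod) bool
}

type PreFilterPlugin interface {
	Plugin
	PreFilter(p *Pod) error
}

// priorityPlugin implements both extension points.
type priorityPlugin struct{}

func (priorityPlugin) Name() string        { return "priority" }
func (priorityPlugin) Less(a, b *Pod) bool { return a.Priority > b.Priority }
func (priorityPlugin) PreFilter(p *Pod) error {
	if p.Priority < 0 {
		return errors.New("negative priority")
	}
	return nil
}

// extensionPoints reports which of the sketched interfaces p satisfies.
func extensionPoints(p Plugin) []string {
	var points []string
	if _, ok := p.(QueueSortPlugin); ok {
		points = append(points, "queueSort")
	}
	if _, ok := p.(PreFilterPlugin); ok {
		points = append(points, "preFilter")
	}
	return points
}

func main() {
	fmt.Println(extensionPoints(priorityPlugin{}))
}
```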

Plugin Configuration

Plugins can be enabled in the scheduler configuration. Also, default plugins can be disabled in the configuration. In 1.15, there are no default plugins for the scheduling framework.

The scheduler configuration can include configuration for plugins as well. Such configurations are passed to the plugins at the time the scheduler initializes them. The configuration is an arbitrary value. The receiving plugin should decode and process the configuration.
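What “decode and process” might look like on the plugin side is sketched below. It assumes, purely for illustration, that the arguments reach the plugin as raw JSON bytes; the fooArgs fields, the default value, and the newFoo constructor are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fooArgs is a hypothetical argument struct for a plugin named foo.
type fooArgs struct {
	MaxRetries int    `json:"maxRetries"`
	Zone       string `json:"zone"`
}

type fooPlugin struct {
	args fooArgs
}

// newFoo decodes the raw arguments (assumed to be JSON) on top of
// defaults and rejects input it cannot parse.
func newFoo(raw []byte) (*fooPlugin, error) {
	args := fooArgs{MaxRetries: 3} // default when the field is omitted
	if len(raw) > 0 {
		if err := json.Unmarshal(raw, &args); err != nil {
			return nil, fmt.Errorf("decoding args for plugin foo: %w", err)
		}
	}
	return &fooPlugin{args: args}, nil
}

func main() {
	p, err := newFoo([]byte(`{"zone": "a"}`))
	fmt.Println(p.args.MaxRetries, p.args.Zone, err)

	_, err = newFoo([]byte("not json"))
	fmt.Println(err != nil)
}
```

Failing at initialization time, rather than on the first scheduling attempt, surfaces a bad configuration as early as possible.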

The following example shows a scheduler configuration that enables some plugins at reserve and preBind extension points and disables a plugin. It also provides a configuration to plugin foo.

    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration

    ...

    plugins:
      reserve:
        enabled:
        - name: foo
        - name: bar
        disabled:
        - name: baz
      preBind:
        enabled:
        - name: foo
        disabled:
        - name: baz

    pluginConfig:
    - name: foo
      args: >
        Arbitrary set of args to plugin foo

When an extension point is omitted from the configuration, default plugins for that extension point are used. When an extension point exists and enabled is provided, the enabled plugins are called in addition to default plugins. Default plugins are called first and then the additional enabled plugins are called in the same order specified in the configuration. If a different order of calling default plugins is desired, default plugins must be disabled and enabled in the desired order.

Assuming there is a default plugin called foo at reserve and we are adding plugin bar that we want to be invoked before foo, we should disable foo and enable bar and foo in order. The following example shows the configuration that achieves this:

    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration

    ...

    plugins:
      reserve:
        enabled:
        - name: bar
        - name: foo
        disabled:
        - name: foo
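The ordering rules for one extension point can be summarized in a few lines of code. This is a sketch of the rule as described on this page (remaining default plugins first, then enabled plugins in configured order), with a hypothetical effectiveOrder helper; it is not taken from the scheduler source.

```go
package main

import "fmt"

// effectiveOrder computes the call order for one extension point:
// default plugins that were not disabled come first, followed by the
// enabled plugins in the order given in the configuration.
func effectiveOrder(defaults, enabled, disabled []string) []string {
	off := make(map[string]bool, len(disabled))
	for _, name := range disabled {
		off[name] = true
	}
	var order []string
	for _, name := range defaults {
		if !off[name] {
			order = append(order, name)
		}
	}
	return append(order, enabled...)
}

func main() {
	// Only adding bar: the default foo still runs first.
	fmt.Println(effectiveOrder([]string{"foo"}, []string{"bar"}, nil))

	// Disabling foo and re-enabling it after bar changes the order.
	fmt.Println(effectiveOrder([]string{"foo"}, []string{"bar", "foo"}, []string{"foo"}))
}
```

The second call mirrors the configuration above and yields bar before foo.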
