Capacity Scheduling - Elastic Quota Management

Capacity Scheduling is an ability of koord-scheduler to manage different users' resource usage in a shared cluster.

Introduction

When several users or teams share a cluster, fairness of resource allocation is very important. Koordinator provides a multi-hierarchy elastic quota management mechanism for the scheduler.

  • It supports configuring quota groups in a tree structure, similar to the organizational structure of most companies.
  • It supports borrowing and returning resources between different quota groups for better resource utilization. Busy quota groups can automatically and temporarily borrow resources from idle quota groups, which improves the utilization of the cluster. When an idle quota group turns busy, it can automatically take back its "lent-to" resources.
  • It considers resource fairness between different quota groups. When busy quota groups borrow resources from idle quota groups, the resources are allocated to the busy groups under fair rules.
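The borrowing rules above can be sketched with a toy calculation. This is an illustrative simplification, not Koordinator's actual runtime-quota algorithm: each group is guaranteed its "min", idle groups lend their unused "min", and busy groups split the lent pool in proportion to their unmet demand, capped at "max".

```python
# Toy sketch of elastic quota borrowing for a single resource.
# Not Koordinator's real algorithm; names and shapes are illustrative.

def runtime_quota(groups):
    """groups: {name: {"min": int, "max": int, "request": int}}.
    Returns {name: runtime}, the amount each group may actually use."""
    # Each group is guaranteed the smaller of its request and its min.
    guaranteed = {n: min(g["request"], g["min"]) for n, g in groups.items()}
    # Idle groups lend the unused part of their min into a shared pool.
    lent = sum(g["min"] - guaranteed[n] for n, g in groups.items())
    # Unmet demand of each group, capped by its max.
    demand = {n: min(g["request"], g["max"]) - guaranteed[n]
              for n, g in groups.items()}
    total_demand = sum(demand.values())
    runtime = {}
    for n in groups:
        # Split the lent pool proportionally to unmet demand.
        borrowed = 0 if total_demand == 0 else lent * demand[n] // total_demand
        runtime[n] = guaranteed[n] + min(demand[n], borrowed)
    return runtime

# "a" is busy (requests 30 with min 10), "b" is idle and lends its min of 10:
runtime_quota({"a": {"min": 10, "max": 40, "request": 30},
               "b": {"min": 10, "max": 40, "request": 0}})
# -> {"a": 20, "b": 0}
```

When "b" later becomes busy, its guaranteed amount rises and the lent pool shrinks, which is the "take back" behaviour described above.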

Setup

Prerequisite

  • Kubernetes >= 1.18
  • Koordinator >= 0.71

Installation

Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to Installation.

Configurations

Capacity Scheduling is enabled by default. You can use it without any modification to the koord-scheduler config.

Use Capacity-Scheduling

Quick Start by Label

1.Create an ElasticQuota quota-example with the YAML file below.

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/parent: ""
    quota.scheduling.koordinator.sh/is-parent: "false"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```
```bash
$ kubectl apply -f quota-example.yaml
elasticquota.scheduling.sigs.k8s.io/quota-example created

$ kubectl get eqs -n default
NAME            AGE
quota-example   2s
```

2.Create a pod pod-example with the YAML file below.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
```
```bash
$ kubectl apply -f pod-example.yaml
pod/pod-example created
```

3.Verify quota-example has changed.

```bash
$ kubectl get eqs -n default quota-example -o yaml
```

```yaml
kind: ElasticQuota
metadata:
  annotations:
    quota.scheduling.koordinator.sh/request: '{"cpu":"40m","memory":"40Mi"}'
    quota.scheduling.koordinator.sh/runtime: '{"cpu":"40m","memory":"40Mi"}'
    quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}'
  creationTimestamp: "2022-10-08T09:26:38Z"
  generation: 2
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: root
  managedFields:
  - manager: koord-scheduler
    operation: Update
    time: "2022-10-08T09:26:50Z"
  name: quota-example
  namespace: default
  resourceVersion: "39012008"
spec:
  max:
    cpu: "40"
    memory: 40Gi
  min:
    cpu: "10"
    memory: 20Mi
status:
  used:
    cpu: 40m
    memory: 40Mi
```

Quick Start by Namespace

1.Create a namespace quota-example.

```bash
$ kubectl create ns quota-example
namespace/quota-example created
```

2.Create an ElasticQuota quota-example with the YAML file below.

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: quota-example
  labels:
    quota.scheduling.koordinator.sh/parent: ""
    quota.scheduling.koordinator.sh/is-parent: "false"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```
```bash
$ kubectl apply -f quota-example.yaml
elasticquota.scheduling.sigs.k8s.io/quota-example created

$ kubectl get eqs -n quota-example
NAME            AGE
quota-example   2s
```

3.Create a pod pod-example with the YAML file below.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: quota-example
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
```
```bash
$ kubectl apply -f pod-example.yaml
pod/pod-example created
```

4.Verify quota-example has changed.

```bash
$ kubectl get eqs -n quota-example quota-example -o yaml
```

```yaml
kind: ElasticQuota
metadata:
  annotations:
    quota.scheduling.koordinator.sh/request: '{"cpu":"40m","memory":"40Mi"}'
    quota.scheduling.koordinator.sh/runtime: '{"cpu":"40m","memory":"40Mi"}'
    quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}'
  creationTimestamp: "2022-10-08T09:26:38Z"
  generation: 2
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: root
  managedFields:
  - manager: koord-scheduler
    operation: Update
    time: "2022-10-08T09:26:50Z"
  name: quota-example
  namespace: quota-example
  resourceVersion: "39012008"
spec:
  max:
    cpu: "40"
    memory: 40Gi
  min:
    cpu: "10"
    memory: 20Mi
status:
  used:
    cpu: 40m
    memory: 40Mi
```

Quota Debug API

```bash
$ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}'
10.244.0.64

$ curl 10.244.0.64:10251/apis/v1/plugins/ElasticQuota/quota/quota-example
```
```json
{
  "allowLentResource": true,
  "autoScaleMin": {
    "cpu": "10",
    "memory": "20Mi"
  },
  "isParent": false,
  "max": {
    "cpu": "40",
    "memory": "40Gi"
  },
  "min": {
    "cpu": "10",
    "memory": "20Mi"
  },
  "name": "quota-example",
  "parentName": "root",
  "podCache": {
    "pod-example": {
      "isAssigned": true,
      "resource": {
        "cpu": "40m",
        "memory": "40Mi"
      }
    }
  },
  "request": {
    "cpu": "40m",
    "memory": "40Mi"
  },
  "runtime": {
    "cpu": "40m",
    "memory": "41943040"
  },
  "runtimeVersion": 39,
  "sharedWeight": {
    "cpu": "40",
    "memory": "40Gi"
  },
  "used": {
    "cpu": "40m",
    "memory": "40Mi"
  }
}
```

The main difference from the YAML view is that podCache lists all of the quota's Pods and their status.
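The debug response can be post-processed with a few lines of Python. The field names (`podCache`, `isAssigned`, `resource`) come from the sample response above; the helper itself is an illustrative convenience, not part of Koordinator.

```python
import json

def assigned_pods(debug_json: str):
    """Return {pod_name: resource_dict} for pods the quota considers assigned."""
    quota = json.loads(debug_json)
    return {name: entry["resource"]
            for name, entry in quota.get("podCache", {}).items()
            if entry.get("isAssigned")}

# A trimmed-down version of the debug response shown above:
sample = '''{"podCache": {"pod-example": {"isAssigned": true,
             "resource": {"cpu": "40m", "memory": "40Mi"}}}}'''
print(assigned_pods(sample))
# {'pod-example': {'cpu': '40m', 'memory': '40Mi'}}
```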

Advanced Configurations

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "parent"
    quota.scheduling.koordinator.sh/allow-lent-resource: "true"
    quota.scheduling.koordinator.sh/shared-weight: '{"cpu":"40","memory":"40Gi"}'
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```
  • quota.scheduling.koordinator.sh/is-parent is specified by the user. It marks whether the quota group is a parent or a child. Defaults to child.
  • quota.scheduling.koordinator.sh/parent is specified by the user. It names the parent quota group. Defaults to root.
  • quota.scheduling.koordinator.sh/shared-weight is specified by the user. It controls the quota group's share when borrowing "lent-to" resources. Defaults to the group's "max".
  • quota.scheduling.koordinator.sh/allow-lent-resource is specified by the user. It controls whether the quota group lends its unused "min" to others.
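Based on the label descriptions above, the defaulting behaviour can be sketched as follows. This is an assumption drawn from this page, not Koordinator source code; in particular, the allow-lent-resource default of "true" is assumed.

```python
# Illustrative sketch of how missing quota labels appear to default.
PREFIX = "quota.scheduling.koordinator.sh/"

def resolve_quota_labels(labels: dict, spec_max: dict) -> dict:
    return {
        # Missing is-parent means "child".
        "is_parent": labels.get(PREFIX + "is-parent") == "true",
        # Missing or empty parent means "root" (as in the quick start above).
        "parent": labels.get(PREFIX + "parent") or "root",
        # Missing shared-weight falls back to the group's spec.max.
        "shared_weight": labels.get(PREFIX + "shared-weight", spec_max),
        # Assumed: lending is allowed unless explicitly disabled.
        "allow_lent_resource": labels.get(PREFIX + "allow-lent-resource", "true") == "true",
    }

resolve_quota_labels({PREFIX + "parent": ""}, {"cpu": "40", "memory": "40Gi"})
# -> child of root, shared-weight equal to max, lending allowed
```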

WebHook Verify

1.Except for the first-level quota groups, the sum of the "min" of all child quota groups must be less than or equal to the "min" of their parent group.
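The rule can be sketched as a small validation function. This is illustrative code, not the real `vquota.kb.io` webhook implementation:

```python
# Sketch of the min-sum rule: for every resource key, the children's
# "min" values summed together must not exceed the parent's "min".
def check_min_quota_sum(parent_min: dict, children_min: list) -> bool:
    for key, parent_val in parent_min.items():
        if sum(child.get(key, 0) for child in children_min) > parent_val:
            return False
    return True

# Mirrors the rejected request below: the child asks for cpu min 20
# while the parent only guarantees cpu min 10.
check_min_quota_sum({"cpu": 10, "memory": 20}, [{"cpu": 20, "memory": 20}])
# -> False
```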

first create parent quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-parent-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```

then create child quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "quota-parent-example"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 20
    memory: 20Mi
```
```bash
$ kubectl apply -f quota-example.yaml
Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: checkMinQuotaSum allChildren SumMinQuota > parentMinQuota, parent: quota-parent-example
```

2.A parent's and child's min/max must declare the same resource keys. first create parent quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-parent-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```

then create child quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "quota-parent-example"
spec:
  max:
    cpu: 40
    memory: 40Gi
    test: 200
  min:
    cpu: 10
    memory: 20Mi
```
```bash
$ kubectl apply -f quota-example.yaml
Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: checkSubAndParentGroupMaxQuotaKeySame failed: quota-parent-example's key is not the same with quota-example
```

3.A parent quota group cannot run Pods.

first create parent quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-parent-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```

then create pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-parent-example"
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always
```
```bash
$ kubectl apply -f pod-example.yaml
Error from server: error when creating "pod-example.yaml": admission webhook "vpod.kb.io" denied the request: pod can not be linked to a parentQuotaGroup,quota:quota-parent-example, pod:pod-example
```

4.A quota group's parent can only be a parent group, not a child group.

first create parent quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-parent-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```

then create child quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"
    quota.scheduling.koordinator.sh/parent: "quota-parent-example"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```
```bash
$ kubectl apply -f quota-example.yaml
Error from server: error when creating "quota-example.yaml": admission webhook "vquota.kb.io" denied the request: quota-example has parentName quota-parent-example but the parentQuotaInfo's IsParent is false
```

5.A quota group's parent/child attribute cannot be changed after creation.

first create parent quota:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-parent-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Mi
```

then modify quota.scheduling.koordinator.sh/is-parent to "false" and apply again:

```bash
$ kubectl apply -f quota-parent-example.yaml
Error from server: error when creating "quota-parent-example.yaml": admission webhook "vquota.kb.io" denied the request: IsParent is forbidden modify now, quotaName:quota-parent-example
```

used > runtime revoke

We offer a configuration so that, when a quota's used exceeds its runtime, the scheduler can evict the over-using Pods, from low priority to high priority. Add the following to koord-scheduler-config.yaml in Helm.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-scheduler-config
  namespace: {{ .Values.installation.namespace }}
data:
  koord-scheduler-config: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceLock: leases
      resourceName: koord-scheduler
      resourceNamespace: {{ .Values.installation.namespace }}
    profiles:
      - pluginConfig:
        - name: ElasticQuota
          args:
            apiVersion: kubescheduler.config.k8s.io/v1beta2
            kind: ElasticQuotaArgs
            quotaGroupNamespace: {{ .Values.installation.namespace }}
            enableCheckParentQuota: true
            monitorAllQuotas: true
            revokePodInterval: 60s
            delayEvictTime: 300s
        plugins:
          queueSort:
            disabled:
              - name: "*"
            enabled:
              - name: Coscheduling
          preFilter:
            enabled:
              - name: NodeNUMAResource
              - name: DeviceShare
              - name: Reservation
              - name: Coscheduling
              - name: ElasticQuota
          filter:
          ...
```
  • enableCheckParentQuota also checks the parent quota groups' used and runtime quota. Default is false.
  • monitorAllQuotas enables the "used > runtime revoke" logic. Default is false.
  • revokePodInterval is the time interval of the check loop.
  • delayEvictTime: eviction is triggered only after "used > runtime" has persisted longer than delayEvictTime.
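The interplay of revokePodInterval and delayEvictTime can be sketched as follows. This is an illustrative model of the timing described above, not koord-scheduler code:

```python
# Sketch of "used > runtime" revoke timing: the monitor samples the
# quota every revokePodInterval, and eviction fires only once the
# overuse has persisted for at least delayEvictTime.
def should_evict(overuse_samples, revoke_pod_interval_s=60, delay_evict_time_s=300):
    """overuse_samples: per-check booleans (True = used > runtime at that check).
    Returns True once a trailing run of overuse covers delayEvictTime."""
    run = 0
    for over in overuse_samples:
        run = run + 1 if over else 0  # overuse streak resets when usage drops
        if run * revoke_pod_interval_s >= delay_evict_time_s:
            return True
    return False

should_evict([True] * 5)                 # 5 * 60s = 300s of overuse -> True
should_evict([True, True, False, True])  # streak broken, never 300s -> False
```

With the defaults shown in the ConfigMap, a quota group therefore has to stay over its runtime for five consecutive checks before any Pod is revoked.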

To let the scheduler actually delete Pods, configure rbac/koord-scheduler.yaml in Helm as below.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: koord-scheduler-role
rules:
{{- if semverCompare "<= 1.20-0" .Capabilities.KubeVersion.Version }}
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
{{- end }}
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - update
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - patch
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - pods/eviction
  verbs:
  - create
- apiGroups:
...
```

To prevent a Pod from being revoked, add the label quota.scheduling.koordinator.sh/preemptible: "false" to the Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
    quota.scheduling.koordinator.sh/preemptible: "false"
spec:
  ...
```

In this case, the Pod is not allowed to use resources beyond "min". Since the "min" resources are guaranteed, the Pod will not be evicted.
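Assuming Pods are preemptible unless the label explicitly says otherwise (an assumption based on this section, not Koordinator source), the semantics can be sketched as:

```python
# Illustrative sketch: a Pod labeled preemptible "false" must stay within
# the quota's guaranteed "min" and in exchange is never revoked.
LABEL = "quota.scheduling.koordinator.sh/preemptible"

def may_exceed_min(pod_labels: dict) -> bool:
    """True if the Pod may borrow beyond "min" (and so may be revoked)."""
    return pod_labels.get(LABEL, "true") != "false"

may_exceed_min({})                  # True: preemptible by default (assumed)
may_exceed_min({LABEL: "false"})    # False: capped at min, never evicted
```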