Device Scheduling

We provide a fine-grained mechanism for managing GPUs and other devices such as RDMA and FPGA. It defines a set of APIs to describe device information on nodes, including GPU, RDMA, and FPGA, and a new set of resource names that let users request GPU resources at a finer granularity. This mechanism is the basis for subsequent GPU scheduling capabilities such as GPU Share, GPU Overcommitment, etc.

Introduction

GPU devices provide very strong computing power but are expensive. Making better use of GPU devices to realize their full value and reduce costs is a problem that needs to be solved. In the existing GPU allocation mechanism of the Kubernetes community, GPUs are allocated by the kubelet, and only as whole devices. This method is simple and reliable, but, as with CPU and memory, GPU capacity can still be wasted. Therefore, some users want to use only a portion of a GPU's resources and share the rest with other workloads to save costs. Moreover, GPUs have particularities: for example, the NVLink and oversold scenarios supported by NVIDIA GPUs, mentioned below, both require a central decision by the scheduler to obtain a globally optimal allocation result.

Setup

Prerequisite

  • Kubernetes >= 1.18
  • Koordinator >= 0.71

Installation

Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to Installation.

Runtime Requirements

Binding the scheduled GPU devices to the container requires support from the runtime environment. Currently, there are two solutions to achieve this:

  • Containerd >= 1.7.0 and Koordinator >= 1.3: please make sure NRI is enabled in containerd. If not, please refer to Enable NRI in Containerd.
  • Other runtime environments: please make sure the koord-runtime-proxy component is correctly installed in your cluster. If not, please refer to Installation Runtime Proxy.

Configurations

DeviceScheduling is enabled by default. You can use it without any modification of the koord-scheduler configuration.
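If you ever need to turn the feature off for a profile, the standard KubeSchedulerConfiguration plugin toggles should apply. The sketch below is an assumption, not a documented configuration: it presumes the feature is implemented by a plugin named DeviceShare (the name that also appears in the debug API paths below) registered at the usual filter/score/reserve extension points.

```yaml
# Hypothetical sketch: disabling the DeviceShare plugin in the
# koord-scheduler profile via standard scheduler-config toggles.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    plugins:
      filter:
        disabled:
          - name: DeviceShare
      score:
        disabled:
          - name: DeviceShare
      reserve:
        disabled:
          - name: DeviceShare
```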

Use DeviceScheduling

Quick Start

1. Check the Device CRD:

  $ kubectl get device host04 -o yaml

  apiVersion: scheduling.koordinator.sh/v1alpha1
  kind: Device
  metadata:
    creationTimestamp: "2022-10-08T09:26:42Z"
    generation: 1
    managedFields:
    - apiVersion: scheduling.koordinator.sh/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:ownerReferences: {}
        f:spec:
          .: {}
          f:devices: {}
        f:status: {}
      manager: koordlet
      operation: Update
      time: "2022-10-08T09:26:42Z"
    name: host04
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Node
      name: host04
      uid: 09c4f912-6026-467a-85d2-6b2147c9557e
    resourceVersion: "39011943"
    selfLink: /apis/scheduling.koordinator.sh/v1alpha1/devices/host04
    uid: 5a498e1f-1357-4518-b74c-cab251d6c18c
  spec:
    devices:
    - health: true
      id: GPU-04cea5cd-966f-7116-1d58-1ac34421541b
      minor: 0
      resources:
        kubernetes.io/gpu-core: "100"
        kubernetes.io/gpu-memory: 16Gi
        kubernetes.io/gpu-memory-ratio: "100"
      type: gpu
    - health: true
      id: GPU-3680858f-1753-371e-3c1a-7d8127fc7113
      minor: 1
      resources:
        kubernetes.io/gpu-core: "100"
        kubernetes.io/gpu-memory: 16Gi
        kubernetes.io/gpu-memory-ratio: "100"
      type: gpu
  status: {}

We can see that this node has two GPU cards, along with the detailed information of each card.

2. Check the node's allocatable resources:

  $ kubectl get node host04 -o yaml

  apiVersion: v1
  kind: Node
  metadata:
    annotations:
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"5a:69:48:10:29:25"}'
    creationTimestamp: "2022-08-29T09:12:55Z"
    labels:
      beta.kubernetes.io/os: linux
  status:
    addresses:
    - address: 10.15.0.37
      type: InternalIP
    - address: host04
      type: Hostname
    allocatable:
      cpu: "6"
      ephemeral-storage: "200681483926"
      kubernetes.io/gpu: "200"
      kubernetes.io/gpu-core: "200"
      kubernetes.io/gpu-memory: 32Gi
      kubernetes.io/gpu-memory-ratio: "200"
      memory: 59274552Ki
      nvidia.com/gpu: "2"
      pods: "220"
    capacity:
      cpu: "8"
      kubernetes.io/gpu: "200"
      kubernetes.io/gpu-core: "200"
      kubernetes.io/gpu-memory: 32Gi
      kubernetes.io/gpu-memory-ratio: "200"
      memory: 61678904Ki
      nvidia.com/gpu: "2"
      pods: "220"

We can see that the node's allocatable resources aggregate the resources of the individual GPU cards.
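The aggregation can be sketched as a simple per-resource sum over the cards reported in the Device CRD. This is an illustration with the numbers from the example above (gpu-memory expressed as an integer number of GiB), not Koordinator's actual implementation:

```python
def merge_device_resources(cards):
    """Sum each resource name across all GPU cards on a node."""
    totals = {}
    for card in cards:
        for name, qty in card.items():
            totals[name] = totals.get(name, 0) + qty
    return totals

# Per-card resources copied from the Device CRD example above;
# kubernetes.io/gpu-memory is given in GiB for simplicity.
cards = [
    {"kubernetes.io/gpu-core": 100, "kubernetes.io/gpu-memory": 16,
     "kubernetes.io/gpu-memory-ratio": 100},
    {"kubernetes.io/gpu-core": 100, "kubernetes.io/gpu-memory": 16,
     "kubernetes.io/gpu-memory-ratio": 100},
]

print(merge_device_resources(cards))
# {'kubernetes.io/gpu-core': 200, 'kubernetes.io/gpu-memory': 32, 'kubernetes.io/gpu-memory-ratio': 200}
```

The result matches the node's allocatable above: gpu-core 200, gpu-memory 32Gi, gpu-memory-ratio 200.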

3. Apply a pod:

  apiVersion: v1
  kind: Pod
  metadata:
    name: pod-example
    namespace: default
  spec:
    schedulerName: koord-scheduler
    containers:
    - command:
      - sleep
      - 365d
      image: busybox
      imagePullPolicy: IfNotPresent
      name: curlimage
      resources:
        limits:
          cpu: 40m
          memory: 40Mi
          kubernetes.io/gpu: "100"
        requests:
          cpu: 40m
          memory: 40Mi
          kubernetes.io/gpu: "100"
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
    restartPolicy: Always
  $ kubectl get pod -n default pod-example -o yaml

  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      scheduling.koordinator.sh/device-allocated: '{"gpu":[{"minor":0,"resources":{"kubernetes.io/gpu-core":"100","kubernetes.io/gpu-memory":"12508288Ki","kubernetes.io/gpu-memory-ratio":"100"}}]}'
    creationTimestamp: "2022-10-08T09:33:07Z"
    name: pod-example
    namespace: default
    resourceVersion: "39015044"
    selfLink: /api/v1/namespaces/xlf/pods/gpu-pod7
    uid: 6bf1ac3c-0c9f-472a-8b86-de350bbfa795
  spec:
    containers:
    - command:
      - sleep
      - 365d
      image: busybox
      imagePullPolicy: IfNotPresent
      name: curlimage
      resources:
        limits:
          cpu: "1"
          kubernetes.io/gpu: "100"
          memory: 256Mi
        requests:
          cpu: "1"
          kubernetes.io/gpu: "100"
          memory: 256Mi
  status:
    conditions:
    ...
    hostIP: 10.0.0.149
    phase: Running
    podIP: 10.244.2.45
    podIPs:
    - ip: 10.244.2.45
    qosClass: Guaranteed
    startTime: "2022-10-08T09:33:07Z"

You can find the concrete device allocation result in the annotation scheduling.koordinator.sh/device-allocated.
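The annotation value is plain JSON, so it can be read with any JSON parser. A minimal sketch, using the annotation value from the example above:

```python
import json

# Allocation result written by the scheduler into the
# scheduling.koordinator.sh/device-allocated pod annotation
# (value copied from the example above).
annotation = (
    '{"gpu":[{"minor":0,"resources":{"kubernetes.io/gpu-core":"100",'
    '"kubernetes.io/gpu-memory":"12508288Ki",'
    '"kubernetes.io/gpu-memory-ratio":"100"}}]}'
)

allocated = json.loads(annotation)
for gpu in allocated["gpu"]:
    # "minor" identifies which physical card on the node was assigned.
    print(f"GPU minor {gpu['minor']}: {gpu['resources']}")
```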

4. More request protocols:

  apiVersion: v1
  kind: Pod
  ...
  spec:
    ...
    resources:
      requests:
        cpu: 40m
        memory: 40Mi
        nvidia.com/gpu: "100"

  apiVersion: v1
  kind: Pod
  ...
  spec:
    ...
    resources:
      requests:
        cpu: 40m
        memory: 40Mi
        kubernetes.io/gpu-core: "100"
        kubernetes.io/gpu-memory-ratio: "100"

  apiVersion: v1
  kind: Pod
  ...
  spec:
    ...
    resources:
      requests:
        cpu: 40m
        memory: 40Mi
        kubernetes.io/gpu-core: "100"
        kubernetes.io/gpu-memory: "16Mi"

5. Device resource debug API:

  $ kubectl -n koordinator-system get lease koord-scheduler --no-headers | awk '{print $2}' | cut -d'_' -f1 | xargs -I {} kubectl -n koordinator-system get pod {} -o wide --no-headers | awk '{print $6}'
  10.244.0.64
  $ curl 10.244.0.64:10251/apis/v1/plugins/DeviceShare/nodeDeviceSummaries
  $ curl 10.244.0.64:10251/apis/v1/plugins/DeviceShare/nodeDeviceSummaries/host04
  {
    "allocateSet": {
      "gpu": {
        "xlf/gpu-pod7": {
          "0": {
            "kubernetes.io/gpu-core": "100",
            "kubernetes.io/gpu-memory": "12508288Ki",
            "kubernetes.io/gpu-memory-ratio": "100"
          }
        }
      }
    },
    "deviceFree": {
      "kubernetes.io/gpu-core": "0",
      "kubernetes.io/gpu-memory": "0",
      "kubernetes.io/gpu-memory-ratio": "0"
    },
    "deviceFreeDetail": {
      "gpu": {
        "0": {
          "kubernetes.io/gpu-core": "0",
          "kubernetes.io/gpu-memory": "0",
          "kubernetes.io/gpu-memory-ratio": "0"
        }
      }
    },
    "deviceTotal": {
      "kubernetes.io/gpu-core": "100",
      "kubernetes.io/gpu-memory": "12508288Ki",
      "kubernetes.io/gpu-memory-ratio": "100"
    },
    "deviceTotalDetail": {
      "gpu": {
        "0": {
          "kubernetes.io/gpu-core": "100",
          "kubernetes.io/gpu-memory": "12508288Ki",
          "kubernetes.io/gpu-memory-ratio": "100"
        }
      }
    },
    "deviceUsed": {
      "kubernetes.io/gpu-core": "100",
      "kubernetes.io/gpu-memory": "12508288Ki",
      "kubernetes.io/gpu-memory-ratio": "100"
    },
    "deviceUsedDetail": {
      "gpu": {
        "0": {
          "kubernetes.io/gpu-core": "100",
          "kubernetes.io/gpu-memory": "12508288Ki",
          "kubernetes.io/gpu-memory-ratio": "100"
        }
      }
    }
  }
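When reading a nodeDeviceSummaries response, deviceFree, deviceUsed, and deviceTotal should stay consistent: for every resource, free equals total minus used. A minimal sketch that checks this invariant against the numbers from the response above (the quantity parser is a deliberate simplification that only handles the plain and Ki-suffixed values shown):

```python
def parse_qty(s):
    # Simplified quantity parser: only the plain and "Ki"-suffixed
    # values that appear in the example response are handled.
    return int(s[:-2]) if s.endswith("Ki") else int(s)

# Top-level aggregates copied from the debug API response above.
summary = {
    "deviceFree": {"kubernetes.io/gpu-core": "0",
                   "kubernetes.io/gpu-memory": "0",
                   "kubernetes.io/gpu-memory-ratio": "0"},
    "deviceUsed": {"kubernetes.io/gpu-core": "100",
                   "kubernetes.io/gpu-memory": "12508288Ki",
                   "kubernetes.io/gpu-memory-ratio": "100"},
    "deviceTotal": {"kubernetes.io/gpu-core": "100",
                    "kubernetes.io/gpu-memory": "12508288Ki",
                    "kubernetes.io/gpu-memory-ratio": "100"},
}

for name, total in summary["deviceTotal"].items():
    free = parse_qty(summary["deviceFree"][name])
    used = parse_qty(summary["deviceUsed"][name])
    assert free == parse_qty(total) - used, name
print("free == total - used for every resource")
```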