Resource Reservation

Resource Reservation is an ability of koord-scheduler for reserving node resources for specific pods or workloads.

Introduction

Pods are the fundamental carriers for allocating node resources in Kubernetes, and they bind resource requirements according to business logic. However, we may want to allocate resources for specific pods or workloads that have not been created yet, for example:

  1. Preemption: Existing preemption does not guarantee that only the preempting pods can allocate the preempted resources. We expect the scheduler to lock those resources and prevent them from being allocated by other pods with the same or higher priority.
  2. Descheduling: In the descheduling scenario, it is better to ensure enough resources are reserved before a pod gets rescheduled. Otherwise, the descheduled pod may never be able to run again, and the corresponding application could break down.
  3. Horizontal scaling: To scale out more precisely, we would like to allocate node resources in advance for the pod replicas to be scaled.
  4. Resource pre-allocation: We may want to reserve node resources for future resource demands, even if the resources are not needed right now.

To enhance the resource scheduling capability of Kubernetes, koord-scheduler provides a scheduling API named Reservation, which allows us to reserve node resources in advance for specific pods and workloads that have not been created yet.


For more information, please see the design doc: Resource Reservation.

Setup

Prerequisites

  • Kubernetes >= 1.18
  • Koordinator >= 0.6

Installation

Please make sure the Koordinator components are correctly installed in your cluster. If not, please refer to Installation.
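Before creating any Reservation, a quick sanity check is to confirm that the koord-scheduler pods are running. The namespace below assumes the default `koordinator-system` namespace used by the Helm installation; adjust it if you installed Koordinator elsewhere:

```bash
# the koord-scheduler pods should be Running before reservations can be scheduled
$ kubectl get pods -n koordinator-system | grep scheduler
```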

Configuration

Resource Reservation is enabled by default. You can use it without any modification to the koord-scheduler configuration.

Usage Guide

Quick Start

  1. Reserve resources with the following YAML file: reservation-demo
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  template: # set resource requirements
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 500m cpu and 800Mi memory
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler # use koord-scheduler
  owners: # set the owner specifications
    - object: # owner pods whose name is `default/pod-demo-0`
        name: pod-demo-0
        namespace: default
  ttl: 1h # set the TTL, the reservation will get expired 1 hour later
```
```bash
$ kubectl create -f reservation-demo.yaml
reservation.scheduling.koordinator.sh/reservation-demo created
```
  2. Track the status of reservation-demo until it becomes Available.
```bash
$ kubectl get reservation reservation-demo -o wide
NAME               PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo   Available   88s   node-0   1h
```
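Since the reservation exposes a `Ready` condition in its status (shown in step 5 below), you could also block until it becomes schedulable with `kubectl wait`; this is optional and assumes your kubectl version supports waiting on conditions of custom resources:

```bash
# optional: wait up to 60s for the Reservation's Ready condition to become True
$ kubectl wait reservation/reservation-demo --for=condition=Ready --timeout=60s
```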
  3. Deploy a pod with the following YAML file: pod-demo-0
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo-0 # match the owner spec of `reservation-demo`
spec:
  containers:
    - args:
        - '-c'
        - '1'
      command:
        - stress
      image: polinux/stress
      imagePullPolicy: Always
      name: stress
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 400Mi
  restartPolicy: Always
  schedulerName: koord-scheduler # use koord-scheduler
```
```bash
$ kubectl create -f pod-demo-0.yaml
pod/pod-demo-0 created
```
  4. Check the scheduled result of pod-demo-0.
```bash
$ kubectl get pod pod-demo-0 -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
pod-demo-0   1/1     Running   0          32s   10.17.0.123   node-0   <none>           <none>
```

pod-demo-0 is scheduled to the same node as reservation-demo.
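To double-check this from the API objects rather than the wide output, you can compare the reservation's `status.nodeName` with the pod's `spec.nodeName` (both fields appear in this guide; the jsonpath expressions below are just one way to read them):

```bash
# both commands should print the same node, e.g. node-0
$ kubectl get reservation reservation-demo -o jsonpath='{.status.nodeName}{"\n"}'
$ kubectl get pod pod-demo-0 -o jsonpath='{.spec.nodeName}{"\n"}'
```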

  5. Check the status of reservation-demo.
```bash
$ kubectl get reservation reservation-demo -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
  creationTimestamp: "YYYY-MM-DDT05:24:58Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
  - object:
      name: pod-demo-0
      namespace: default
  template:
    spec:
      containers:
      - args:
        - -c
        - "1"
        command:
        - stress
        image: polinux/stress
        imagePullPolicy: Always
        name: stress
        resources:
          requests:
            cpu: 500m
            memory: 800Mi
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable: # total reserved
    cpu: 500m
    memory: 800Mi
  allocated: # current allocated
    cpu: 200m
    memory: 400Mi
  conditions:
  - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
    lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
    reason: Scheduled
    status: "True"
    type: Scheduled
  - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
    lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
    reason: Available
    status: "True"
    type: Ready
  currentOwners:
  - name: pod-demo-0
    namespace: default
    uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
  nodeName: node-0
  phase: Available
```

Now we can see that reservation-demo has reserved 500m cpu and 800Mi memory, and pod-demo-0 has allocated 200m cpu and 400Mi memory from the reserved resources.
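A compact way to watch how much of the reservation has been consumed is to print only the `status.allocatable` and `status.allocated` fields shown above (the exact formatting of the printed maps depends on your kubectl version):

```bash
# prints the total reserved resources and the portion currently allocated
$ kubectl get reservation reservation-demo \
    -o jsonpath='allocatable: {.status.allocatable}{"\n"}allocated: {.status.allocated}{"\n"}'
```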

  6. Clean up the reserved resources of reservation-demo.
```bash
$ kubectl delete reservation reservation-demo
reservation.scheduling.koordinator.sh "reservation-demo" deleted
$ kubectl get pod pod-demo-0
NAME         READY   STATUS    RESTARTS   AGE
pod-demo-0   1/1     Running   0          110s
```

pod-demo-0 keeps running normally after the reservation is deleted.

Advanced Configurations

The latest API can be found here: reservation_types.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # pod template (required): Reserve resources and play pod/node affinities according to the template.
  # The resource requirements of the pod indicate the resource requirements of the reservation.
  template:
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      # scheduler name (required): use koord-scheduler to schedule the reservation
      schedulerName: koord-scheduler
  # owner spec (required): Specify what kinds of pods can allocate resources of this reservation.
  # Currently three kinds of owner specifications are supported:
  # - object: specify the name, namespace, uid of the owner pods
  # - controller: specify the owner reference of the owner pods, e.g. name, namespace (extended by koordinator), uid, kind
  # - labelSelector: specify the matching labels or matching expressions of the owner pods
  owners:
    - object:
        name: pod-demo-0
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  # TTL (optional): Time-To-Live duration of the reservation. The reservation will get expired after the TTL period.
  # If not set, `24h` is used as the default.
  ttl: 1h
  # Expires (optional): Expired timestamp when the reservation is expected to expire.
  # If both `expires` and `ttl` are set, `expires` is checked first.
  expires: "YYYY-MM-DDTHH:MM:SSZ"
```
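Besides reading reservation_types, you can inspect the schema actually served by your cluster with `kubectl explain`. This assumes the Reservation CRD is installed with a structural OpenAPI schema; if the short name `reservation` is ambiguous in your cluster, use the full form `reservations.scheduling.koordinator.sh`:

```bash
# show the documented fields of the owner spec from the installed CRD
$ kubectl explain reservation.spec.owners
$ kubectl explain reservation.spec.owners.controller
```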

Example: Reserve Resources on the Same Node for Multiple Owners

  1. Check the allocatable resources of each node.
```bash
$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
NAME     CPU     MEMORY
node-0   7800m   28625036Ki
node-1   7800m   28629692Ki
...
$ kubectl describe node node-1 | grep -A 8 "Allocated resources"
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource           Requests      Limits
    --------           --------      ------
    cpu                780m (10%)    7722m (99%)
    memory             1216Mi (4%)   14044Mi (50%)
    ephemeral-storage  0 (0%)        0 (0%)
    hugepages-1Gi      0 (0%)        0 (0%)
    hugepages-2Mi      0 (0%)        0 (0%)
```

As shown above, node-1 still has about 7.0 cpu (7800m allocatable minus 780m requested) and about 26Gi memory (28629692Ki allocatable minus 1216Mi requested) left unallocated.

  2. Reserve resources with the following YAML file: reservation-demo-big
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
spec:
  template:
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 6 cpu and 20Gi memory
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1 # set the expected node name to schedule at
      schedulerName: koord-scheduler
  owners: # set multiple owners
    - object: # owner pods whose name is `default/pod-demo-1`
        name: pod-demo-1
        namespace: default
    - labelSelector: # owner pods who have the label `app=app-demo` can allocate the reserved resources
        matchLabels:
          app: app-demo
  ttl: 1h
```
```bash
$ kubectl create -f reservation-demo-big.yaml
reservation.scheduling.koordinator.sh/reservation-demo-big created
```
  3. Track the status of reservation-demo-big until it becomes Available.
```bash
$ kubectl get reservation reservation-demo-big -o wide
NAME                   PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo-big   Available   37s   node-1   1h
```

reservation-demo-big is scheduled to node-1, the node specified by the `nodeName` field in the pod template.
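As in the quick start, you can confirm the bound node from the reservation status itself, using only the `status.nodeName` field shown in this guide:

```bash
# should print node-1
$ kubectl get reservation reservation-demo-big -o jsonpath='{.status.nodeName}{"\n"}'
```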

  4. Create a deployment with the following YAML file: app-demo
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-demo
  template:
    metadata:
      name: stress
      labels:
        app: app-demo # match the owner spec of `reservation-demo-big`
    spec:
      schedulerName: koord-scheduler # use koord-scheduler
      containers:
        - name: stress
          image: polinux/stress
          args:
            - '-c'
            - '1'
          command:
            - stress
          resources:
            requests:
              cpu: 2
              memory: 10Gi
            limits:
              cpu: 4
              memory: 20Gi
```
```bash
$ kubectl create -f app-demo.yaml
deployment.apps/app-demo created
```
  5. Check the scheduled result of the pods of app-demo.
```bash
$ kubectl get pod -l app=app-demo -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
app-demo-798c66db46-ctnbr   1/1     Running   0          2m    10.17.0.124   node-1   <none>           <none>
app-demo-798c66db46-pzphc   1/1     Running   0          2m    10.17.0.125   node-1   <none>           <none>
```

The pods of app-demo are scheduled to the same node as reservation-demo-big.

  6. Check the status of reservation-demo-big.
```bash
$ kubectl get reservation reservation-demo-big -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
  creationTimestamp: "YYYY-MM-DDT06:28:16Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
  - object:
      name: pod-demo-1
      namespace: default
  - labelSelector:
      matchLabels:
        app: app-demo
  template:
    spec:
      containers:
      - args:
        - -c
        - "1"
        command:
        - stress
        image: polinux/stress
        imagePullPolicy: Always
        name: stress
        resources:
          requests:
            cpu: 6
            memory: 20Gi
      nodeName: node-1
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable:
    cpu: 6
    memory: 20Gi
  allocated:
    cpu: 4
    memory: 20Gi
  conditions:
  - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
    lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
    reason: Scheduled
    status: "True"
    type: Scheduled
  - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
    lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
    reason: Available
    status: "True"
    type: Ready
  currentOwners:
  - name: app-demo-798c66db46-ctnbr
    namespace: default
    uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
  - name: app-demo-798c66db46-pzphc
    namespace: default
    uid: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
  nodeName: node-1
  phase: Available
```

Now we can see that reservation-demo-big has reserved 6 cpu and 20Gi memory, and app-demo has allocated 4 cpu and 20Gi memory from the reserved resources. Allocations from reserved resources do not increase the requested resources of the node; otherwise, the total requests of node-1 would exceed its allocatable capacity. Moreover, a reservation can be allocated by multiple owners at the same time as long as there are enough unallocated reserved resources.
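The multi-owner allocation can also be read directly from the reservation status; the following command uses only the `status.currentOwners` field shown above and should list both app-demo pods:

```bash
# prints namespace/name for every pod currently allocated from the reservation
$ kubectl get reservation reservation-demo-big \
    -o jsonpath='{range .status.currentOwners[*]}{.namespace}/{.name}{"\n"}{end}'
```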