Resource Reservation

Resource Reservation is an ability of koord-scheduler to reserve node resources for specific pods or workloads.

Introduction

Pods are the fundamental unit for allocating node resources in Kubernetes: they bind resource requirements to business logic. However, we may need to allocate resources for specific pods or workloads that have not been created yet, as in the scenarios below:

  1. Preemption: Existing preemption does not guarantee that only the preempting pod can allocate the preempted resources. We expect the scheduler to “lock” resources and prevent other pods from allocating them, even if those pods have the same or higher priority.
  2. De-scheduling: For the descheduler, it is better to ensure sufficient resources before pods get rescheduled. Otherwise, the rescheduled pods may no longer be runnable, disrupting the application they belong to.
  3. Horizontal scaling: To achieve more deterministic horizontal scaling, we expect to allocate node resources for the replicas to be scaled out.
  4. Resource pre-allocation: We may want to pre-allocate node resources for future resource demands even if the resources are not currently allocatable.

To enhance the resource scheduling of Kubernetes, koord-scheduler provides a scheduling API named Reservation, which allows us to reserve node resources for specified pods or workloads even if they haven't been created yet.


For more information, please see Design: Resource Reservation.

Setup

Prerequisite

  • Kubernetes >= 1.18
  • Koordinator >= 0.6

Installation

Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to Installation.

Configurations

Resource Reservation is enabled by default. You can use it without any modification to the koord-scheduler config.
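As a quick sanity check before trying the feature, you can verify that the koord-scheduler is running. The snippet below is a sketch that assumes a default Helm installation; the deployment name and namespace may differ in your cluster.

```bash
# Assumes the default install namespace and deployment name; adjust if your setup differs.
$ kubectl get deployment koord-scheduler -n koordinator-system
```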

Use Resource Reservation

Quick Start

  1. Deploy a reservation reservation-demo with the YAML file below.
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  template: # set resource requirements
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 500m cpu and 800Mi memory
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler # use koord-scheduler
  owners: # set the owner specifications
    - object: # owner pods whose name is `default/pod-demo-0`
        name: pod-demo-0
        namespace: default
  ttl: 1h # set the TTL, the reservation will get expired 1 hour later
```

```bash
$ kubectl create -f reservation-demo.yaml
reservation.scheduling.koordinator.sh/reservation-demo created
```
  2. Watch the reservation status until it becomes available.
```bash
$ kubectl get reservation reservation-demo -o wide
NAME               PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo   Available   88s   node-0   1h
```
  3. Deploy a pod pod-demo-0 with the YAML file below.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo-0 # match the owner spec of `reservation-demo`
spec:
  containers:
    - args:
        - '-c'
        - '1'
      command:
        - stress
      image: polinux/stress
      imagePullPolicy: Always
      name: stress
      resources:
        limits:
          cpu: '1'
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 400Mi
  restartPolicy: Always
  schedulerName: koord-scheduler # use koord-scheduler
```

```bash
$ kubectl create -f pod-demo-0.yaml
pod/pod-demo-0 created
```
  4. Check the scheduling result of the pod pod-demo-0.
```bash
$ kubectl get pod pod-demo-0 -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
pod-demo-0   1/1     Running   0          32s   10.17.0.123   node-0   <none>           <none>
```

pod-demo-0 is scheduled on the same node as reservation-demo.
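If you prefer not to read the wide output, you can compare the node fields of the two objects directly; both field paths come from the objects shown in this walkthrough.

```bash
# Both commands should print the same node name (node-0 in this example).
$ kubectl get pod pod-demo-0 -o jsonpath='{.spec.nodeName}{"\n"}'
$ kubectl get reservation reservation-demo -o jsonpath='{.status.nodeName}{"\n"}'
```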

  5. Check the status of the reservation reservation-demo.
```yaml
$ kubectl get reservation reservation-demo -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
  creationTimestamp: "YYYY-MM-DDT05:24:58Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
    - object:
        name: pod-demo-0
        namespace: default
  template:
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable: # total reserved
    cpu: 500m
    memory: 800Mi
  allocated: # current allocated
    cpu: 200m
    memory: 400Mi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT05:24:58Z"
      lastTransitionTime: "YYYY-MM-DDT05:24:58Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: pod-demo-0
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
  nodeName: node-0
  phase: Available
```

Now we can see that the reservation reservation-demo has reserved 500m cpu and 800Mi memory, and the pod pod-demo-0 has allocated 200m cpu and 400Mi memory from the reserved resources.
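To track just the reserved and allocated amounts over time, you can query those two status fields directly; the field paths are the ones shown in the output above.

```bash
# Print the total reserved resources and the portion currently allocated by owner pods.
$ kubectl get reservation reservation-demo -o jsonpath='allocatable: {.status.allocatable}{"\n"}allocated: {.status.allocated}{"\n"}'
```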

  6. Clean up the reservation reservation-demo.
```bash
$ kubectl delete reservation reservation-demo
reservation.scheduling.koordinator.sh "reservation-demo" deleted
$ kubectl get pod pod-demo-0
NAME         READY   STATUS    RESTARTS   AGE
pod-demo-0   1/1     Running   0          110s
```

After the reservation is deleted, the pod pod-demo-0 is still running.
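If you also want to remove the demo pod once you are done, delete it as usual:

```bash
$ kubectl delete pod pod-demo-0
```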

Advanced Configurations

The latest API can be found in reservation_types.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo
spec:
  # pod template (required): Reserve resources and respect the pod/node affinities specified in the template.
  # The resource requirements of the pod indicate the resource requirements of the reservation.
  template:
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 500m
              memory: 800Mi
      # scheduler name (required): use koord-scheduler to schedule the reservation
      schedulerName: koord-scheduler
  # owner spec (required): Specify what kinds of pods can allocate resources of this reservation.
  # Currently three kinds of owner specifications are supported:
  # - object: specify the name, namespace, uid of the owner pods
  # - controller: specify the owner reference of the owner pods, e.g. name, namespace (extended by koordinator), uid, kind (a sketch follows this example)
  # - labelSelector: specify the matching labels or matching expressions of the owner pods
  owners:
    - object:
        name: pod-demo-0
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  # TTL (optional): Time-To-Live duration of the reservation. The reservation will expire after the TTL period.
  # If not set, `24h` is used as the default.
  ttl: 1h
  # Expires (optional): Timestamp when the reservation is expected to expire.
  # If both `expires` and `ttl` are set, `expires` is checked first.
  expires: "YYYY-MM-DDTHH:MM:SSZ"
```
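For completeness, a controller-based owner spec might look like the sketch below. It only uses the fields named in the comment above (kind, name, namespace), and the ReplicaSet name is a hypothetical example; check reservation_types for the authoritative schema.

```yaml
# Sketch of a controller owner spec; the ReplicaSet name here is hypothetical.
owners:
  - controller:
      kind: ReplicaSet
      name: app-demo-798c66db46
      namespace: default
```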

Example: Reserve on Specified Node, with Multiple Owners

  1. Check the allocatable resources of each node.
```bash
$ kubectl get node -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
NAME     CPU     MEMORY
node-0   7800m   28625036Ki
node-1   7800m   28629692Ki
...
$ kubectl describe node node-1 | grep -A 8 "Allocated resources"
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource           Requests      Limits
    --------           --------      ------
    cpu                780m (10%)    7722m (99%)
    memory             1216Mi (4%)   14044Mi (50%)
    ephemeral-storage  0 (0%)        0 (0%)
    hugepages-1Gi      0 (0%)        0 (0%)
    hugepages-2Mi      0 (0%)        0 (0%)
```

As shown above, the node node-1 has about 7.0 cpu (7800m allocatable minus 780m requested) and about 26Gi memory (roughly 27.3Gi allocatable minus 1.2Gi requested) unallocated.

  2. Deploy a reservation reservation-demo-big with the YAML file below.
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
spec:
  template:
    namespace: default
    spec:
      containers:
        - args:
            - '-c'
            - '1'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources: # reserve 6 cpu and 20Gi memory
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1 # set the expected node name to schedule at
      schedulerName: koord-scheduler
  owners: # set multiple owners
    - object: # owner pods whose name is `default/pod-demo-1`
        name: pod-demo-1
        namespace: default
    - labelSelector: # owner pods with the label `app=app-demo` can allocate the reserved resources
        matchLabels:
          app: app-demo
  ttl: 1h
```

```bash
$ kubectl create -f reservation-demo-big.yaml
reservation.scheduling.koordinator.sh/reservation-demo-big created
```
  3. Watch the reservation status until it becomes available.
```bash
$ kubectl get reservation reservation-demo-big -o wide
NAME                   PHASE       AGE   NODE     TTL   EXPIRES
reservation-demo-big   Available   37s   node-1   1h
```

The reservation reservation-demo-big is scheduled on the node node-1, which matches the nodeName set in the pod template.

  4. Deploy a deployment app-demo with the YAML file below.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-demo
  template:
    metadata:
      name: stress
      labels:
        app: app-demo # match the owner spec of `reservation-demo-big`
    spec:
      schedulerName: koord-scheduler # use koord-scheduler
      containers:
        - name: stress
          image: polinux/stress
          args:
            - '-c'
            - '1'
          command:
            - stress
          resources:
            requests:
              cpu: 2
              memory: 10Gi
            limits:
              cpu: 4
              memory: 20Gi
```

```bash
$ kubectl create -f app-demo.yaml
deployment.apps/app-demo created
```
  5. Check the scheduling result of the pods of deployment app-demo.
```bash
$ kubectl get pod -l app=app-demo -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
app-demo-798c66db46-ctnbr   1/1     Running   0          2m    10.17.0.124   node-1   <none>           <none>
app-demo-798c66db46-pzphc   1/1     Running   0          2m    10.17.0.125   node-1   <none>           <none>
```

The pods of deployment app-demo are scheduled on the same node as reservation-demo-big.

  6. Check the status of the reservation reservation-demo-big.
```yaml
$ kubectl get reservation reservation-demo-big -oyaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: reservation-demo-big
  creationTimestamp: "YYYY-MM-DDT06:28:16Z"
  uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  ...
spec:
  owners:
    - object:
        name: pod-demo-1
        namespace: default
    - labelSelector:
        matchLabels:
          app: app-demo
  template:
    spec:
      containers:
        - args:
            - -c
            - "1"
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            requests:
              cpu: 6
              memory: 20Gi
      nodeName: node-1
      schedulerName: koord-scheduler
  ttl: 1h
status:
  allocatable:
    cpu: 6
    memory: 20Gi
  allocated:
    cpu: 4
    memory: 20Gi
  conditions:
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastProbeTime: "YYYY-MM-DDT06:28:17Z"
      lastTransitionTime: "YYYY-MM-DDT06:28:17Z"
      reason: Available
      status: "True"
      type: Ready
  currentOwners:
    - name: app-demo-798c66db46-ctnbr
      namespace: default
      uid: yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
    - name: app-demo-798c66db46-pzphc
      namespace: default
      uid: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
  nodeName: node-1
  phase: Available
```

Now we can see that the reservation reservation-demo-big has reserved 6 cpu and 20Gi memory, and the pods of deployment app-demo have allocated 4 cpu and 20Gi memory from the reserved resources. Allocating from reserved resources does not increase the requested resources of the node; otherwise, the total requests of node-1 would exceed the node allocatable. Moreover, a reservation can be allocated by multiple owners as long as enough reserved resources remain unallocated.
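To convince yourself that the node's requested resources did not grow past its allocatable amount, you can re-run the earlier commands; the reservation's own request and the owners' allocations are accounted for together rather than added on top of each other.

```bash
# The cpu/memory requests reported for node-1 should still fit within its allocatable resources.
$ kubectl describe node node-1 | grep -A 8 "Allocated resources"
# The reservation status shows how much of the reserved amount is currently handed out.
$ kubectl get reservation reservation-demo-big -o jsonpath='allocatable: {.status.allocatable}{"\n"}allocated: {.status.allocated}{"\n"}'
```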