Schedule based on Cluster Resource Modeling

Overview

When scheduling an application to a specific cluster, the resource state of the destination cluster is a factor that cannot be ignored. When a cluster lacks the resources to run a given replica, we want the scheduler to avoid placing that replica there as much as possible. This article describes how Karmada schedules applications based on cluster resource modeling.

Cluster Resource Modeling

During scheduling, the karmada-scheduler makes decisions based on a number of factors, one of which is the resource state of each cluster. Karmada currently supports two scheduling behaviors based on cluster resources: one uses the general cluster resource model, the other a customized cluster resource model.

General Cluster Modeling

Start to use General Cluster Resource Models

For this purpose, we introduced the ResourceSummary field to the Cluster API.

For example:

```yaml
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: 950m
    memory: 290Mi
    pods: "11"
```

From the example above, we can see the allocatable and allocated resources of the cluster.

Schedule based on General Cluster Resource Models

Assume that there is a Pod which will be scheduled to one of the clusters managed by Karmada.

Member1 is like:

```yaml
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: 950m
    memory: 290Mi
    pods: "11"
```

Member2 is like:

```yaml
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: "2"
    memory: 290Mi
    pods: "11"
```

Member3 is like:

```yaml
resourceSummary:
  allocatable:
    cpu: "4"
    ephemeral-storage: 206291924Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16265856Ki
    pods: "110"
  allocated:
    cpu: "2"
    memory: 290Mi
    pods: "110"
```

Assume that the Pod requests 500m of CPU. member1 and member2 have sufficient resources to run such a replica, but member3 has no Pod quota left. Considering the amount of available resources, the scheduler prefers to schedule the Pod to member1.

| Cluster | member1 | member2 | member3 |
| --- | --- | --- | --- |
| AvailableReplicas | (4 - 0.95) / 0.5 = 6.1 | (4 - 2) / 0.5 = 4 | 0 |
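The table's arithmetic can be sketched as a small helper (an illustrative sketch of the estimate, not the karmada-scheduler's actual code): the replica count is bounded by the tightest requested resource, and each Pod also consumes one unit of the pods quota.

```python
import math

def available_replicas(allocatable, allocated, request):
    # The tightest requested resource bounds the replica count:
    # min over resources of floor((allocatable - allocated) / request).
    return max(0, min(
        math.floor((allocatable[r] - allocated[r]) / request[r])
        for r in request
    ))

# Values from the three ResourceSummary examples above (cpu in cores).
member1 = available_replicas({"cpu": 4.0, "pods": 110}, {"cpu": 0.95, "pods": 11},
                             {"cpu": 0.5, "pods": 1})
member2 = available_replicas({"cpu": 4.0, "pods": 110}, {"cpu": 2.0, "pods": 11},
                             {"cpu": 0.5, "pods": 1})
member3 = available_replicas({"cpu": 4.0, "pods": 110}, {"cpu": 2.0, "pods": 110},
                             {"cpu": 0.5, "pods": 1})
print(member1, member2, member3)  # 6 4 0
```

The fractional 6.1 in the table rounds down to 6 whole replicas; member3 yields 0 because its Pod quota is exhausted.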

Customized Cluster Modeling

Background

ResourceSummary describes the overall available resources of the cluster. However, ResourceSummary is not precise: it mechanically sums the resources across all nodes and ignores fragmentation. For example, consider a cluster with 2000 nodes, each with only 1 CPU core left. The ResourceSummary reports 2000 cores available, but in fact this cluster cannot run a single Pod that requests more than 1 core.
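The fragmentation problem can be made concrete with a few lines of Python (illustrative numbers from the example above):

```python
# 2000 nodes, each with 1 CPU core free.
nodes_free_cpu = [1] * 2000

# The ResourceSummary view simply sums free resources across nodes...
summary_free_cpu = sum(nodes_free_cpu)
# ...but no single node can actually host a Pod requesting 2 cores.
nodes_fitting_2core_pod = sum(1 for free in nodes_free_cpu if free >= 2)

print(summary_free_cpu, nodes_fitting_2core_pod)  # 2000 0
```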

Therefore, we introduced CustomizedClusterResourceModeling, which records a resource portrait of the nodes in each cluster. Karmada collects node and Pod information from each member cluster and, after calculation, assigns each node to the appropriate resource model configured by the user.

Start to use Customized Cluster Resource Models

The CustomizedClusterResourceModeling feature gate has been in Beta since Karmada v1.4 and is enabled by default. If you use Karmada v1.3, you need to enable this feature gate in karmada-scheduler, karmada-aggregated-apiserver, and karmada-controller-manager.

For example, you can use the command below to turn on the feature gate in the karmada-controller-manager.

```shell
kubectl --kubeconfig ~/.kube/karmada.config --context karmada-host edit deploy/karmada-controller-manager -nkarmada-system
```

```yaml
- command:
  - /bin/karmada-controller-manager
  - --kubeconfig=/etc/kubeconfig
  - --bind-address=0.0.0.0
  - --cluster-status-update-frequency=10s
  - --secure-port=10357
  - --feature-gates=CustomizedClusterResourceModeling=true
  - --v=4
```

After that, when a cluster is registered to the Karmada control plane, Karmada automatically sets up a general resource model for the cluster. You can see it in cluster.spec.

By default, the resource model of a cluster looks like this:

```yaml
resourceModels:
- grade: 0
  ranges:
  - max: "1"
    min: "0"
    name: cpu
  - max: 4Gi
    min: "0"
    name: memory
- grade: 1
  ranges:
  - max: "2"
    min: "1"
    name: cpu
  - max: 16Gi
    min: 4Gi
    name: memory
- grade: 2
  ranges:
  - max: "4"
    min: "2"
    name: cpu
  - max: 32Gi
    min: 16Gi
    name: memory
- grade: 3
  ranges:
  - max: "8"
    min: "4"
    name: cpu
  - max: 64Gi
    min: 32Gi
    name: memory
- grade: 4
  ranges:
  - max: "16"
    min: "8"
    name: cpu
  - max: 128Gi
    min: 64Gi
    name: memory
- grade: 5
  ranges:
  - max: "32"
    min: "16"
    name: cpu
  - max: 256Gi
    min: 128Gi
    name: memory
- grade: 6
  ranges:
  - max: "64"
    min: "32"
    name: cpu
  - max: 512Gi
    min: 256Gi
    name: memory
- grade: 7
  ranges:
  - max: "128"
    min: "64"
    name: cpu
  - max: 1Ti
    min: 512Gi
    name: memory
- grade: 8
  ranges:
  - max: "9223372036854775807"
    min: "128"
    name: cpu
  - max: "9223372036854775807"
    min: 1Ti
    name: memory
```

Customize your cluster resource models

In some cases, the default cluster resource model may not match your cluster. You can adjust the granularity of the cluster resource model to better reflect your cluster's resources.

For example, you can use the command below to customize the cluster resource models of member1.

```shell
kubectl --kubeconfig ~/.kube/karmada.config --context karmada-apiserver edit cluster/member1
```

A customized resource model must meet the following requirements:

  • The grade of each model must be unique.
  • Each model must contain the same number of resource types.
  • Only cpu, memory, storage, and ephemeral-storage are currently supported.
  • The max value of each resource must be greater than its min value.
  • The min value of each resource in the first model must be 0.
  • The max value of each resource in the last model must be MaxInt64.
  • The resource types of each model must be the same.
  • The model intervals for each resource must be contiguous and non-overlapping.
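The requirements above can be expressed as a small validation sketch. This is an illustrative helper, not Karmada's actual admission logic; the `validate_models` name and the dict layout are assumptions, with quantities reduced to plain numbers (memory in bytes).

```python
MAX_INT64 = 2**63 - 1
GiB = 2**30

def validate_models(models):
    # Illustrative check of the rules listed above.
    grades = [m["grade"] for m in models]
    assert len(set(grades)) == len(grades), "grades must be unique"
    names = [sorted(r["name"] for r in m["ranges"]) for m in models]
    assert all(n == names[0] for n in names), "same resource types in every model"
    assert all(n in ("cpu", "memory", "storage", "ephemeral-storage") for n in names[0]), \
        "unsupported resource type"
    ordered = sorted(models, key=lambda m: m["grade"])
    for name in names[0]:
        spans = [next((r["min"], r["max"]) for r in m["ranges"] if r["name"] == name)
                 for m in ordered]
        assert all(lo < hi for lo, hi in spans), "max must be greater than min"
        assert spans[0][0] == 0, "first model's min must be 0"
        assert spans[-1][1] == MAX_INT64, "last model's max must be MaxInt64"
        assert all(spans[i][1] == spans[i + 1][0] for i in range(len(spans) - 1)), \
            "intervals must be contiguous and non-overlapping"
    return True

# The three-grade example below, reduced to numbers.
models = [
    {"grade": 0, "ranges": [{"name": "cpu", "min": 0, "max": 1},
                            {"name": "memory", "min": 0, "max": 4 * GiB}]},
    {"grade": 1, "ranges": [{"name": "cpu", "min": 1, "max": 2},
                            {"name": "memory", "min": 4 * GiB, "max": 16 * GiB}]},
    {"grade": 2, "ranges": [{"name": "cpu", "min": 2, "max": MAX_INT64},
                            {"name": "memory", "min": 16 * GiB, "max": MAX_INT64}]},
]
print(validate_models(models))  # True
```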

For example, consider the cluster resource model below:

```yaml
resourceModels:
- grade: 0
  ranges:
  - max: "1"
    min: "0"
    name: cpu
  - max: 4Gi
    min: "0"
    name: memory
- grade: 1
  ranges:
  - max: "2"
    min: "1"
    name: cpu
  - max: 16Gi
    min: 4Gi
    name: memory
- grade: 2
  ranges:
  - max: "9223372036854775807"
    min: "2"
    name: cpu
  - max: "9223372036854775807"
    min: 16Gi
    name: memory
```

This defines three models in the cluster resource models. A node with 0.5 CPU cores and 2Gi of memory will be assigned to Grade 0, and a node with 1.5 cores and 10Gi to Grade 1.
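One way to reproduce this assignment is to pick the highest grade whose per-resource minimums the node's allocatable amounts meet. This is an illustrative rule that matches the examples above; `classify_node` is a hypothetical helper, not Karmada's implementation.

```python
GiB = 2**30

def classify_node(models, allocatable):
    # Highest grade whose minimums the node meets for every resource.
    best = 0
    for m in sorted(models, key=lambda m: m["grade"]):
        if all(allocatable.get(r["name"], 0) >= r["min"] for r in m["ranges"]):
            best = m["grade"]
    return best

# Grade minimums from the three-grade example above (memory in bytes).
models = [
    {"grade": 0, "ranges": [{"name": "cpu", "min": 0}, {"name": "memory", "min": 0}]},
    {"grade": 1, "ranges": [{"name": "cpu", "min": 1}, {"name": "memory", "min": 4 * GiB}]},
    {"grade": 2, "ranges": [{"name": "cpu", "min": 2}, {"name": "memory", "min": 16 * GiB}]},
]
print(classify_node(models, {"cpu": 0.5, "memory": 2 * GiB}))   # 0
print(classify_node(models, {"cpu": 1.5, "memory": 10 * GiB}))  # 1
```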

Schedule based on Customized Cluster Resource Models

The cluster resource models divide nodes into grades covering different resource intervals. When a Pod needs to be scheduled, the scheduler compares, across clusters, the number of nodes whose models satisfy the Pod's resource request, and schedules the Pod to the cluster with more qualified nodes.

Assume that there is a Pod to be scheduled to one of the clusters managed by Karmada, all of which share the same cluster resource models.

Member1 is like:

```yaml
spec:
  ...
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  ...
status:
  ...
  - count: 1
    grade: 2
  - count: 6
    grade: 3
```

Member2 is like:

```yaml
spec:
  ...
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  ...
status:
  ...
  - count: 4
    grade: 2
  - count: 4
    grade: 3
```

Member3 is like:

```yaml
spec:
  ...
  - grade: 6
    ranges:
    - max: "64"
      min: "32"
      name: cpu
    - max: 512Gi
      min: 256Gi
      name: memory
  ...
status:
  ...
  - count: 1
    grade: 6
```

Assume that the Pod requests 3 CPU cores and 20Gi of memory. All nodes in Grade 2 and above can meet this requirement. Considering the amount of available resources, the scheduler prefers to schedule the Pod to member3.

| Cluster | member1 | member2 | member3 |
| --- | --- | --- | --- |
| AvailableReplicas | 1 + 6 = 7 | 4 + 4 = 8 | 1 × min(32/3, 256/20) = 10 |

Assume that the Pod requests 3 CPU cores and 60Gi of memory. Nodes in Grade 2 can no longer satisfy every resource request, so only Grade 3 and above count. Considering the amount of available resources above Grade 2, the scheduler prefers to schedule the Pod to member1.

| Cluster | member1 | member2 | member3 |
| --- | --- | --- | --- |
| AvailableReplicas | 6 × 1 = 6 | 4 × 1 = 4 | 1 × min(32/3, 256/60) = 4 |
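The estimates in both tables can be reproduced with a short sketch. The counting rule is inferred from the tables and the helpers are hypothetical, not Karmada's actual estimator: a grade qualifies when the request fits under its per-resource max, and each qualifying node hosts at least one replica, or more when the grade's guaranteed minimums allow it.

```python
import math

def replicas_per_node(grade, request):
    # Grade qualifies only if the request fits under every resource's max.
    if any(grade["max"][r] < request[r] for r in request):
        return 0
    # A node in this grade is guaranteed at least grade["min"] of each resource.
    guaranteed = min(math.floor(grade["min"][r] / request[r]) for r in request)
    return max(1, guaranteed)

def cluster_available_replicas(grades, node_counts, request):
    return sum(count * replicas_per_node(grades[g], request)
               for g, count in node_counts.items())

# Grade intervals from the examples above (cpu in cores, memory in Gi).
grades = {
    2: {"min": {"cpu": 2, "memory": 16}, "max": {"cpu": 4, "memory": 32}},
    3: {"min": {"cpu": 4, "memory": 32}, "max": {"cpu": 8, "memory": 64}},
    6: {"min": {"cpu": 32, "memory": 256}, "max": {"cpu": 64, "memory": 512}},
}
request = {"cpu": 3, "memory": 20}
print(cluster_available_replicas(grades, {2: 1, 3: 6}, request))  # member1: 7
print(cluster_available_replicas(grades, {2: 4, 3: 4}, request))  # member2: 8
print(cluster_available_replicas(grades, {6: 1}, request))        # member3: 10
```

With the 3C/60Gi request, the same helpers drop Grade 2 entirely (60Gi exceeds its 32Gi max) and yield 6, 4, and 4, matching the second table.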

Disable Cluster Resource Modeling

The scheduler always uses resource modeling to make scheduling decisions when dynamically assigning replicas based on clusters' free resources. Resource modeling collects node and Pod information from all clusters managed by Karmada, which imposes a considerable performance burden in large-scale scenarios.

You can disable cluster resource modeling by setting --enable-cluster-resource-modeling to false in karmada-controller-manager and karmada-agent.