Descheduler For Rescheduling

Users can divide the replicas of a workload among member clusters according to each cluster's available resources. However, the scheduler's decisions are based on its view of Karmada at the point in time when a new ResourceBinding appears for scheduling. Because Karmada multi-cluster environments are highly dynamic and their state changes over time, it may become desirable to move already-running replicas out of a cluster that lacks resources and into other clusters. This can happen when some nodes of a cluster fail and the cluster no longer has enough resources to accommodate their pods, or when the estimators have some estimation deviation, which is inevitable.

The karmada-descheduler inspects all deployments periodically, every 2 minutes by default. In each cycle, it determines how many unschedulable replicas a deployment has in its target scheduled clusters by calling karmada-scheduler-estimator. It then evicts those replicas by decreasing the corresponding replica counts in spec.clusters and triggers karmada-scheduler to perform a 'Scale Schedule' based on the current situation. Note that the descheduler only takes effect when the replica scheduling strategy is dynamic division.
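
The check period is configurable. As a minimal sketch, assuming your karmada-descheduler version exposes the --descheduling-interval flag, you could adjust it by editing the descheduler deployment:

  # edit the deployment of karmada-descheduler (flag availability depends on your version)
  $ kubectl --context karmada-host -n karmada-system edit deployments.apps karmada-descheduler
  # then add, for example, --descheduling-interval=5m to the command of the karmada-descheduler container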

Prerequisites

Karmada has been installed

We can install Karmada by referring to the quick-start guide, or by directly running the hack/local-up-karmada.sh script, which is also used to run our E2E cases.

Member cluster component is ready

Ensure that all member clusters have joined Karmada and that the corresponding karmada-scheduler-estimator for each member cluster has been installed into karmada-host.

Check the member clusters using the following commands:

  # check whether member clusters have joined
  $ kubectl get cluster
  NAME      VERSION   MODE   READY   AGE
  member1   v1.19.1   Push   True    11m
  member2   v1.19.1   Push   True    11m
  member3   v1.19.1   Pull   True    5m12s
  # check whether the karmada-scheduler-estimator of a member cluster has been working well
  $ kubectl --context karmada-host -n karmada-system get pod | grep estimator
  karmada-scheduler-estimator-member1-696b54fd56-xt789   1/1   Running   0   77s
  karmada-scheduler-estimator-member2-774fb84c5d-md4wt   1/1   Running   0   75s
  karmada-scheduler-estimator-member3-5c7d87f4b4-76gv9   1/1   Running   0   72s
  • If a cluster has not joined, use hack/deploy-agent-and-estimator.sh to deploy both karmada-agent and karmada-scheduler-estimator.
  • If a cluster has already joined, use hack/deploy-scheduler-estimator.sh to deploy only karmada-scheduler-estimator.

Scheduler option '--enable-scheduler-estimator'

After all member clusters have joined and all estimators are ready, specify the option --enable-scheduler-estimator=true to enable the scheduler estimator.

  # edit the deployment of karmada-scheduler
  $ kubectl --context karmada-host -n karmada-system edit deployments.apps karmada-scheduler

Add the option --enable-scheduler-estimator=true to the command of the karmada-scheduler container.
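
As an optional sanity check (a sketch using plain kubectl, assuming the flag is set via the container's command field as in the default manifests), print the command afterwards and confirm the flag is present:

  # the output should contain --enable-scheduler-estimator=true
  $ kubectl --context karmada-host -n karmada-system get deployment karmada-scheduler -o jsonpath='{.spec.template.spec.containers[0].command}'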

Descheduler has been installed

Ensure that the karmada-descheduler has been installed onto karmada-host.

  $ kubectl --context karmada-host -n karmada-system get pod | grep karmada-descheduler
  karmada-descheduler-658648d5b-c22qf   1/1   Running   0   80s
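
To watch the descheduler's periodic runs, you can tail its logs. A minimal sketch, assuming the pod carries the app: karmada-descheduler label used by the default manifests:

  # follow recent descheduler log output
  $ kubectl --context karmada-host -n karmada-system logs -l app=karmada-descheduler --tail=20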

Example

Let’s simulate a replica scheduling failure in a member cluster due to lack of resources.

First we create a deployment with 3 replicas and divide them among 3 member clusters.

  apiVersion: policy.karmada.io/v1alpha1
  kind: PropagationPolicy
  metadata:
    name: nginx-propagation
  spec:
    resourceSelectors:
      - apiVersion: apps/v1
        kind: Deployment
        name: nginx
    placement:
      clusterAffinity:
        clusterNames:
          - member1
          - member2
          - member3
      replicaScheduling:
        replicaDivisionPreference: Weighted
        replicaSchedulingType: Divided
        weightPreference:
          dynamicWeight: AvailableReplicas
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: nginx
    labels:
      app: nginx
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: nginx
    template:
      metadata:
        labels:
          app: nginx
      spec:
        containers:
          - image: nginx
            name: nginx
            resources:
              requests:
                cpu: "2"
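
To try it out, save the two manifests above to a file and apply them to the Karmada control plane. A minimal sketch, assuming the control-plane context is named karmada-apiserver (as created by hack/local-up-karmada.sh) and the file is saved as nginx.yaml:

  # propagate the deployment through Karmada
  $ kubectl --context karmada-apiserver apply -f nginx.yaml
  # check how the replicas are spread across the member clusters
  $ kubectl --context member1 get pod -l app=nginx
  $ kubectl --context member2 get pod -l app=nginx
  $ kubectl --context member3 get pod -l app=nginx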

These 3 replicas are expected to be divided evenly among the 3 member clusters, that is, one replica in each cluster. Now we mark all nodes in member1 as unschedulable and evict the replica running there.

  # mark node "member1-control-plane" as unschedulable in cluster member1
  $ kubectl --context member1 cordon member1-control-plane
  # delete the pod in cluster member1
  $ kubectl --context member1 delete pod -l app=nginx
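
Optionally, verify that the node is now unschedulable (a sketch; its status should include SchedulingDisabled):

  # the node STATUS should show SchedulingDisabled
  $ kubectl --context member1 get node member1-control-plane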

A new pod will be created, but it cannot be scheduled by kube-scheduler due to lack of resources.

  # the pod in cluster member1 is in Pending state
  $ kubectl --context member1 get pod
  NAME                     READY   STATUS    RESTARTS   AGE
  nginx-68b895fcbd-fccg4   0/1     Pending   0          80s
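
To confirm the reason, you can inspect the pending pod's events (a sketch using plain kubectl; expect a FailedScheduling event reporting insufficient CPU):

  # look for a FailedScheduling event in the output
  $ kubectl --context member1 describe pod -l app=nginx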

After about 5 to 7 minutes (the descheduler runs every 2 minutes and only evicts replicas that have remained unschedulable longer than its threshold, 5 minutes by default), the pod in member1 will be evicted and rescheduled to another available cluster.

  # get the pod in cluster member1
  $ kubectl --context member1 get pod
  No resources found in default namespace.
  # get a list of pods in cluster member2
  $ kubectl --context member2 get pod
  NAME                     READY   STATUS    RESTARTS   AGE
  nginx-68b895fcbd-dgd4x   1/1     Running   0          6m3s
  nginx-68b895fcbd-nwgjn   1/1     Running   0          4s
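
You can also inspect the scheduling result recorded in the ResourceBinding on the control plane. A sketch, assuming the karmada-apiserver context and the default ResourceBinding naming convention of <name>-<kind> (here nginx-deployment):

  # spec.clusters should no longer assign any replica to member1
  $ kubectl --context karmada-apiserver -n default get resourcebinding nginx-deployment -o yaml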