Failover Analysis

Let’s briefly analyze the Karmada failover feature.

Add taints on the faulty cluster

Once a cluster is determined to be unhealthy, a taint with the effect NoSchedule is added to it as follows:

  • when the cluster’s Ready condition is False, add the following taint:
    key: cluster.karmada.io/not-ready
    effect: NoSchedule
  • when the cluster’s Ready condition is Unknown, add the following taint:
    key: cluster.karmada.io/unreachable
    effect: NoSchedule
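
For reference, here is a rough sketch of how such a taint could appear on the Cluster object; this assumes the cluster.karmada.io/v1alpha1 Cluster API exposes taints under spec.taints, and the cluster name member2 is only an example:

  apiVersion: cluster.karmada.io/v1alpha1
  kind: Cluster
  metadata:
    name: member2
  spec:
    # other fields omitted; only the taint added by the cluster controller is shown
    taints:
    - key: cluster.karmada.io/not-ready
      effect: NoSchedule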

If an unhealthy cluster does not recover within a certain period, which can be configured via the --failover-eviction-timeout flag (default is 5 minutes), a new taint with the effect NoExecute is added to the cluster as follows:

  • when the cluster’s Ready condition is False, add the following taint:
    key: cluster.karmada.io/not-ready
    effect: NoExecute
  • when the cluster’s Ready condition is Unknown, add the following taint:
    key: cluster.karmada.io/unreachable
    effect: NoExecute
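
As a minimal sketch, this timeout could be tuned by passing the flag to the controller component; assuming the taint lifecycle is handled by karmada-controller-manager, its container spec might be extended like this:

  # illustrative fragment of the karmada-controller-manager container spec
  containers:
  - name: karmada-controller-manager
    args:
    - --failover-eviction-timeout=10m   # wait 10 minutes instead of the default 5 before adding NoExecute taints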

Tolerate cluster taints

After a user creates a PropagationPolicy/ClusterPropagationPolicy, Karmada automatically adds the following tolerations through a webhook:

  apiVersion: policy.karmada.io/v1alpha1
  kind: PropagationPolicy
  metadata:
    name: nginx-propagation
    namespace: default
  spec:
    placement:
      clusterTolerations:
      - effect: NoExecute
        key: cluster.karmada.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: cluster.karmada.io/unreachable
        operator: Exists
        tolerationSeconds: 300
    resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
      namespace: default

The tolerationSeconds can be configured via the --default-not-ready-toleration-seconds flag (default is 300) and the --default-unreachable-toleration-seconds flag (default is 300).
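
These defaults could be overridden where the toleration-injecting webhook runs; a sketch, assuming the two flags belong to the karmada-webhook component:

  # illustrative fragment of the karmada-webhook container args
  args:
  - --default-not-ready-toleration-seconds=600      # tolerate not-ready clusters for 10 minutes before eviction
  - --default-unreachable-toleration-seconds=600    # tolerate unreachable clusters for 10 minutes before eviction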

Failover

When Karmada detects that the faulty cluster is no longer tolerated by the PropagationPolicy/ClusterPropagationPolicy, the cluster is removed from the resource scheduling result and the Karmada scheduler reschedules the referenced application.

There are several constraints:

  • Each rescheduled application still needs to meet the restrictions of the PropagationPolicy/ClusterPropagationPolicy, such as ClusterAffinity or SpreadConstraints.
  • Applications already distributed to healthy clusters by the initial scheduling remain there during failover rescheduling.

Duplicated schedule type

For the Duplicated scheduling policy, when the number of candidate clusters that meet the PropagationPolicy restrictions is not less than the number of failed clusters, the application is rescheduled onto as many candidate clusters as there are failed clusters; otherwise, no rescheduling takes place. A candidate cluster here is a cluster that appears in the newly calculated scheduling result of this scheduling run but was not among the scheduled clusters of the previous result.

Take a Deployment as an example:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: nginx
    labels:
      app: nginx
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: nginx
    template:
      metadata:
        labels:
          app: nginx
      spec:
        containers:
        - image: nginx
          name: nginx
  ---
  apiVersion: policy.karmada.io/v1alpha1
  kind: PropagationPolicy
  metadata:
    name: nginx-propagation
  spec:
    resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
    placement:
      clusterAffinity:
        clusterNames:
        - member1
        - member2
        - member3
        - member5
      spreadConstraints:
      - maxGroups: 2
        minGroups: 2
      replicaScheduling:
        replicaSchedulingType: Duplicated

Suppose there are 5 member clusters, and the initial scheduling result is in member1 and member2. When member2 fails, it triggers rescheduling.

Note that rescheduling will not delete the application on the healthy cluster member1. Of the remaining three clusters, only member3 and member5 match the clusterAffinity policy.

Due to the spreadConstraints (maxGroups: 2, minGroups: 2), the final result can be [member1, member3] or [member1, member5].

Divided schedule type

For the Divided scheduling policy, the Karmada scheduler tries to migrate the replicas from the failed cluster to other healthy clusters.

Take a Deployment as an example:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: nginx
    labels:
      app: nginx
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: nginx
    template:
      metadata:
        labels:
          app: nginx
      spec:
        containers:
        - image: nginx
          name: nginx
  ---
  apiVersion: policy.karmada.io/v1alpha1
  kind: PropagationPolicy
  metadata:
    name: nginx-propagation
  spec:
    resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
    placement:
      clusterAffinity:
        clusterNames:
        - member1
        - member2
      replicaScheduling:
        replicaDivisionPreference: Weighted
        replicaSchedulingType: Divided
        weightPreference:
          staticWeightList:
          - targetCluster:
              clusterNames:
              - member1
            weight: 1
          - targetCluster:
              clusterNames:
              - member2
            weight: 2

The Karmada scheduler divides the replicas according to the weightPreference. With 3 replicas and a 1:2 weight ratio, the initial scheduling result is member1 with 1 replica and member2 with 2 replicas.

When member1 fails, rescheduling is triggered. The Karmada scheduler tries to migrate member1's replicas to the remaining healthy clusters, so the final result is member2 with 3 replicas.
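
To make this concrete, the scheduling result recorded in the ResourceBinding could change roughly as follows (a sketch; the exact layout of spec.clusters in the work.karmada.io ResourceBinding API may differ by version):

  # before failover: replicas split 1:2 across member1 and member2
  spec:
    clusters:
    - name: member1
      replicas: 1
    - name: member2
      replicas: 2

  # after member1 fails: all 3 replicas migrated to member2
  spec:
    clusters:
    - name: member2
      replicas: 3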

Graceful eviction feature

To prevent service interruption during cluster failover, Karmada needs to ensure that evicted workloads are not removed until they are available on the new clusters.

The GracefulEvictionTasks field is added to ResourceBinding/ClusterResourceBinding to indicate the eviction task queue.

When the taint-manager removes the faulty cluster from the resource scheduling result, a corresponding task is added to the eviction task queue.
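
For illustration, an entry in that queue might look roughly like this inside the ResourceBinding; the field names (fromCluster, reason, creationTimestamp) follow the gracefulEvictionTasks structure as I understand it, and the values are made up:

  spec:
    # scheduling result and other fields omitted
    gracefulEvictionTasks:
    - fromCluster: member1
      reason: TaintUntolerated
      creationTimestamp: "2023-01-01T00:00:00Z"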

The gracefulEviction controller is responsible for processing tasks in the eviction task queue. It evaluates, one by one, whether each task can be removed from the queue. The judgement conditions are as follows:

  • Check the health status of the resource in the current scheduling result. If the resource is healthy, the condition is met.
  • Check whether the waiting duration of the current task exceeds the timeout, which can be configured via the --graceful-eviction-timeout flag (default is 10 minutes). If it does, the condition is met.