Node Affinity & Cluster Topology

Node Affinity

Kubernetes allows pods to be assigned to nodes based on various criteria through node affinity.

M3DB was built with failure tolerance as a core feature. M3DB’s isolation groups allow shards to be placed across failure domains such that the loss of any single domain cannot cause the cluster to lose quorum. More details on M3DB’s resiliency can be found in the deployment docs.

By leveraging Kubernetes’ node affinity and M3DB’s isolation groups, the operator can guarantee that M3DB pods are distributed across failure domains. For example, in a Kubernetes cluster spread across 3 zones in a cloud region, the isolationGroups configuration below would guarantee that no single zone failure could degrade the M3DB cluster.

M3DB is unaware of the underlying zone topology: it just views the isolation groups as group1, group2, group3 in its placement. Thanks to the Kubernetes scheduler, however, these groups are actually scheduled across separate failure domains.

    apiVersion: operator.m3db.io/v1alpha1
    kind: M3DBCluster
    ...
    spec:
      replicationFactor: 3
      isolationGroups:
      - name: group1
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-b
      - name: group2
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-c
      - name: group3
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-d
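
To make the mapping concrete, the snippet below sketches the kind of node affinity one would expect on the pods in group1 as a result of the configuration above. The exact structure the operator renders is an assumption here, but each nodeAffinityTerm corresponds to a required match expression on the given label key and values:

    # Sketch (assumption): node affinity as it might appear on a group1 pod.
    # Each nodeAffinityTerm becomes a required match expression, so the pod
    # can only be scheduled onto nodes labeled with zone us-east1-b.
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
              - us-east1-b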

Tolerations

In addition to allowing pods to be assigned to certain nodes via node affinity, Kubernetes allows pods to be repelled from nodes through taints if they don’t tolerate the taint. For example, the following config would ensure:

  1. Pods are spread across zones.

  2. Pods are only assigned to nodes in the m3db-dedicated-pool pool.

  3. No other pods could be assigned to those nodes (assuming they were tainted with a taint whose key is m3db-dedicated, matching the toleration below).

    apiVersion: operator.m3db.io/v1alpha1
    kind: M3DBCluster
    ...
    spec:
      replicationFactor: 3
      isolationGroups:
      - name: group1
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-b
        - key: nodepool
          values:
          - m3db-dedicated-pool
      - name: group2
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-c
        - key: nodepool
          values:
          - m3db-dedicated-pool
      - name: group3
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-d
        - key: nodepool
          values:
          - m3db-dedicated-pool
      tolerations:
      - key: m3db-dedicated
        effect: NoSchedule
        operator: Exists
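
For the above to work, the nodes in the dedicated pool need matching labels and a matching taint. The Node below is purely hypothetical and only illustrates what such a node would look like; in practice the labels and taint are usually applied by the cloud provider’s node pool settings or with kubectl label and kubectl taint:

    # Hypothetical node in the dedicated pool (example only).
    apiVersion: v1
    kind: Node
    metadata:
      name: example-m3db-node           # hypothetical node name
      labels:
        failure-domain.beta.kubernetes.io/zone: us-east1-b
        nodepool: m3db-dedicated-pool   # matches the nodepool affinity term
    spec:
      taints:
      - key: m3db-dedicated             # matched by the toleration above
        effect: NoSchedule              # repels pods that lack the toleration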

Example Affinity Configurations

Zonal Cluster

The examples so far have focused on multi-zone Kubernetes clusters. Some users may only have a cluster in a single zone and accept the reduced fault tolerance. The following example shows how to configure the operator in a zonal (single-zone) cluster.

    apiVersion: operator.m3db.io/v1alpha1
    kind: M3DBCluster
    ...
    spec:
      replicationFactor: 3
      isolationGroups:
      - name: group1
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-b
      - name: group2
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-b
      - name: group3
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-b

6 Zone Cluster

In the above examples we created clusters with 1 isolation group in each of 3 zones. Because values within a single NodeAffinityTerm are OR’d, we can also spread an isolation group across multiple zones. For example, if we had 6 zones available to us:

    apiVersion: operator.m3db.io/v1alpha1
    kind: M3DBCluster
    ...
    spec:
      replicationFactor: 3
      isolationGroups:
      - name: group1
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-a
          - us-east1-b
      - name: group2
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-c
          - us-east1-d
      - name: group3
        numInstances: 3
        nodeAffinityTerms:
        - key: failure-domain.beta.kubernetes.io/zone
          values:
          - us-east1-e
          - us-east1-f
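
As a sketch of the OR semantics (assuming, as in the earlier example, that each term is rendered as a required In match expression), the pods in group1 would carry a single expression listing both zones, so a node in either zone satisfies the requirement:

    # Sketch (assumption): node affinity as it might appear on a group1 pod.
    # The In operator accepts nodes whose zone label equals any listed value,
    # so us-east1-a and us-east1-b are interchangeable for this group.
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
              - us-east1-a
              - us-east1-b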

No Affinity

If there are no failure domains available, one can create a cluster with no affinity rules at all, in which case the pods will be scheduled wherever Kubernetes would place them by default:

    apiVersion: operator.m3db.io/v1alpha1
    kind: M3DBCluster
    ...
    spec:
      replicationFactor: 3
      isolationGroups:
      - name: group1
        numInstances: 3
      - name: group2
        numInstances: 3
      - name: group3
        numInstances: 3