PersistentPodState

FEATURE STATE: Kruise v1.2.0

As cloud native adoption grows, more and more companies deploy stateful services (e.g., Etcd, MQ) on Kubernetes. The K8S StatefulSet is a workload for managing stateful services, and it accounts for their deployment characteristics in many respects. However, StatefulSet persists only limited Pod state, such as ordered and stable Pod names and PVC persistence. It does not cover other state, e.g. Pod IP retention or priority scheduling to previously deployed Nodes. Typical cases:

  • Service discovery middleware is exceptionally sensitive to the Pod IP after deployment and requires that the IP not change.

  • Database services persist data to the host disk, so moving a Pod to a different Node results in data loss.

To address these cases, Kruise provides the PersistentPodState CRD, which persists other Pod state, such as “IP Retention”.

For detailed design, please refer to: PPS Proposal.

Usage

Auto-Generate PersistentPodState via Annotations

    apiVersion: apps.kruise.io/v1alpha1
    kind: StatefulSet
    metadata:
      annotations:
        # auto-generate a PersistentPodState for this workload
        kruise.io/auto-generate-persistent-pod-state: "true"
        # preferred node affinity: a rebuilt Pod is preferably scheduled to the same node
        kruise.io/preferred-persistent-topology: kubernetes.io/hostname[,other node labels]
        # required node affinity: a rebuilt Pod is forced onto a node in the same zone
        kruise.io/required-persistent-topology: failure-domain.beta.kubernetes.io/zone[,other node labels]

Common PersistentPodState configurations can be generated from annotations, which covers most scenarios. For more complex scenarios, you can define a PersistentPodState resource directly.

Define PersistentPodState CRD

    apiVersion: apps.kruise.io/v1alpha1
    kind: PersistentPodState
    metadata:
      name: echoserver
      namespace: echoserver
    spec:
      targetRef:
        # native k8s or kruise StatefulSet
        # only StatefulSet is supported
        apiVersion: apps.kruise.io/v1beta1
        kind: StatefulSet
        name: echoserver
      # required node affinity: a rebuilt Pod is forced onto a node in the same zone
      requiredPersistentTopology:
        nodeTopologyKeys:
        # other node label keys may be added as further list items
        - failure-domain.beta.kubernetes.io/zone
      # preferred node affinity: a rebuilt Pod is preferably scheduled to the same node
      preferredPersistentTopology:
      - preference:
          nodeTopologyKeys:
          # other node label keys may be added as further list items
          - kubernetes.io/hostname
        # int [1, 100]
        weight: 100
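
The comment in the listing above notes that targetRef may also point at a native Kubernetes StatefulSet. The following is a minimal sketch of that variant, assuming the native workload is referenced through the standard apps/v1 API group; the name my-database and the namespace default are hypothetical placeholders and do not come from the Kruise documentation.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: PersistentPodState
metadata:
  name: my-database        # hypothetical name
  namespace: default       # hypothetical namespace
spec:
  targetRef:
    # assumption: a native StatefulSet is referenced via apps/v1
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-database      # hypothetical; must match the StatefulSet name
  # only a preference is declared, so scheduling can still fall back
  # to another node if the original one is unavailable
  preferredPersistentTopology:
  - preference:
      nodeTopologyKeys:
      - kubernetes.io/hostname
    weight: 100
```

Declaring only preferredPersistentTopology keeps the Pod schedulable when its previous node is gone, at the cost of losing whatever state was tied to that node.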

IP Retention Practice

“IP Retention” is a common requirement when deploying stateful services on K8S. It does not mean “Specified Pod IP”. It means that the Pod IP does not change after the first deployment, whether the Pod is rebuilt by a service release or by a machine eviction. To achieve this, the K8S network component must support Pod IP retention and keep the IP unchanged as far as possible. In this article, we have modified the Host-local plugin in the flannel network component so that it keeps the Pod IP unchanged as long as the Pod stays on the same Node. The underlying principles are not covered here; please refer to the code: host-local.

If IP retention is provided by the network component, how does it relate to PersistentPodState? The answer is that network components can keep the Pod IP unchanged only within certain limits. For example, flannel can keep the Pod IP unchanged only when the Pod is rebuilt on the same Node. K8S scheduling, however, gives no guarantee about which Node a rebuilt Pod lands on, so “how to ensure that Pods are rebuilt and scheduled to the same Node” is the problem that PersistentPodState solves.

1. Deploy the stateful service echoserver, declaring “IP Retention” via annotations, as follows:

    apiVersion: apps.kruise.io/v1alpha1
    kind: StatefulSet
    metadata:
      name: echoserver
      labels:
        app: echoserver
      annotations:
        kruise.io/auto-generate-persistent-pod-state: "true"
        kruise.io/preferred-persistent-topology: kubernetes.io/hostname
    spec:
      serviceName: echoserver
      replicas: 2
      selector:
        matchLabels:
          app: echoserver
      template:
        metadata:
          labels:
            app: echoserver
          annotations:
            # Tells the flannel network component to keep the Pod IP unchanged across rebuilds.
            # "10" is the maximum time, in minutes, allowed between the Pod being deleted and its next successful scheduling.
            # The time limit mainly accounts for scenarios such as deletion and scale-in.
            io.kubernetes.cri/reserved-ip-duration: "10"
        spec:
          terminationGracePeriodSeconds: 5
          containers:
          - name: echoserver
            image: cilium/echoserver:latest
            imagePullPolicy: IfNotPresent

2. Based on the above configuration, Kruise automatically generates a PersistentPodState and records the Node on which each Pod was first deployed in PersistentPodState.Status.

    apiVersion: apps.kruise.io/v1alpha1
    kind: PersistentPodState
    metadata:
      name: configserver
      namespace: configserver
    spec:
      targetRef:
        apiVersion: apps.kruise.io/v1beta1
        kind: StatefulSet
        name: configserver
      preferredPersistentTopology:
      - preference:
          nodeTopologyKeys:
          - kubernetes.io/hostname
        weight: 100
    status:
      podStates:
        # records that pod-0 is deployed on node worker2 and pod-1 on node worker1
        configserver-0:
          nodeName: worker2
          nodeTopologyLabels:
            kubernetes.io/hostname: worker2
        configserver-1:
          nodeName: worker1
          nodeTopologyLabels:
            kubernetes.io/hostname: worker1

3. After a Pod is rebuilt due to a service release, Node eviction, etc., Kruise injects the recorded node information into the Pod's NodeAffinity, which in turn allows the Pod IP to remain unchanged, as follows:

    apiVersion: v1
    kind: Pod
    metadata:
      name: configserver-0
      namespace: configserver
      annotations:
        # annotation values must be strings, so the number is quoted
        io.kubernetes.cri/reserved-ip-duration: "10"
    spec:
      # injected by the kruise webhook
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - worker2
            weight: 100
      containers:
      ...
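
For comparison, a required topology (such as the zone key used in the Usage section) would presumably be injected as a hard scheduling constraint rather than a weighted preference. The fragment below is a sketch under that assumption: it uses the standard Kubernetes requiredDuringSchedulingIgnoredDuringExecution structure, and the zone value zone-a is a hypothetical placeholder for whatever label value was recorded in the status. The exact output of the Kruise webhook is not confirmed here.

```yaml
# Fragment of a rebuilt Pod; containers and other fields are omitted.
apiVersion: v1
kind: Pod
metadata:
  name: configserver-0
  namespace: configserver
spec:
  affinity:
    nodeAffinity:
      # hard constraint: the Pod stays Pending if no node in the zone fits
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - zone-a   # hypothetical recorded zone value
```

The trade-off is the usual one for node affinity: a required term guarantees placement in the recorded topology but can leave the Pod unschedulable, while a preferred term only biases the scheduler.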
