StatefulSets

StatefulSet is the workload API object used to manage stateful applications.

It manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.

Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

Using StatefulSets

StatefulSets are valuable for applications that require one or more of the following.

  • Stable, unique network identifiers.
  • Stable, persistent storage.
  • Ordered, graceful deployment and scaling.
  • Ordered, automated rolling updates.

In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn’t require any stable identifiers or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.

Limitations

  • The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner based on the requested storage class, or pre-provisioned by an admin.
  • Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
  • StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service.
  • StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
  • When using Rolling Updates with the default Pod Management Policy (OrderedReady), it’s possible to get into a broken state that requires manual intervention to repair.

Components

The example below demonstrates the components of a StatefulSet.

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: nginx
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
    spec:
      selector:
        matchLabels:
          app: nginx # has to match .spec.template.metadata.labels
      serviceName: "nginx"
      replicas: 3 # by default is 1
      template:
        metadata:
          labels:
            app: nginx # has to match .spec.selector.matchLabels
        spec:
          terminationGracePeriodSeconds: 10
          containers:
          - name: nginx
            image: k8s.gcr.io/nginx-slim:0.8
            ports:
            - containerPort: 80
              name: web
            volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
      volumeClaimTemplates:
      - metadata:
          name: www
        spec:
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: "my-storage-class"
          resources:
            requests:
              storage: 1Gi

In the above example:

  • A Headless Service, named nginx, is used to control the network domain.
  • The StatefulSet, named web, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
  • The volumeClaimTemplates will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.

Pod Selector

You must set the .spec.selector field of a StatefulSet to match the labels of its .spec.template.metadata.labels. Prior to Kubernetes 1.8, the .spec.selector field was defaulted when omitted. In 1.8 and later versions, failing to specify a matching Pod Selector will result in a validation error during StatefulSet creation.

Pod Identity

StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it’s (re)scheduled on.

Ordinal Index

For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, from 0 up through N-1, that is unique over the Set.

Stable Network ID

Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal). The example above will create three Pods named web-0, web-1, and web-2.

A StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form $(service name).$(namespace).svc.cluster.local, where “cluster.local” is the cluster domain.

As each Pod is created, it gets a matching DNS subdomain, taking the form $(podname).$(governing service domain), where the governing service is defined by the serviceName field on the StatefulSet.

As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the Pods.

Here are some examples of choices for Cluster Domain, Service name, and StatefulSet name, and how they affect the DNS names for the StatefulSet’s Pods.

| Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
|---|---|---|---|---|---|
| cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
| cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
| kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |

Note: Cluster Domain will be set to cluster.local unless otherwise configured.

Stable Storage

Kubernetes creates one PersistentVolume for each VolumeClaimTemplate. In the nginx example above, each Pod will receive a single PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage. If no StorageClass is specified, then the default StorageClass will be used. When a Pod is (re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with its PersistentVolume Claims. Note that the PersistentVolumes associated with the Pods’ PersistentVolume Claims are not deleted when the Pods or the StatefulSet are deleted. This must be done manually.

Pod Name Label

When the StatefulSet Controller creates a Pod, it adds a label, statefulset.kubernetes.io/pod-name, that is set to the name of the Pod. This label allows you to attach a Service to a specific Pod in the StatefulSet.
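As a minimal sketch of how this label might be used, the Service below selects only the Pod web-0 from the nginx example above. The Service name web-0-direct is a hypothetical name chosen for illustration; only the label key and the Pod name come from the text.

```yaml
# Sketch: a Service that routes traffic to a single Pod of the StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: web-0-direct   # hypothetical name, not part of the original example
spec:
  selector:
    # Label added by the StatefulSet controller; its value is the Pod name.
    statefulset.kubernetes.io/pod-name: web-0
  ports:
  - port: 80
    name: web
```

Because the label value is the Pod name, the selector keeps matching web-0 after the Pod is rescheduled or recreated.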

Deployment and Scaling Guarantees

  • For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
  • When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
  • Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
  • Before a Pod is terminated, all of its successors must be completely shut down.

The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds of 0. This practice is unsafe and strongly discouraged. For further explanation, please refer to force deleting StatefulSet Pods.

When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and becomes Running and Ready.

If a user were to scale the deployed example by patching the StatefulSet such that replicas=1, web-2 would be terminated first. web-1 would not be terminated until web-2 is fully shut down and deleted. If web-0 were to fail after web-2 has been terminated and is completely shut down, but prior to web-1’s termination, web-1 would not be terminated until web-0 is Running and Ready.

Pod Management Policies

In Kubernetes 1.7 and later, StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy field.

OrderedReady Pod Management

OrderedReady pod management is the default for StatefulSets. It implements the behavior described above.

Parallel Pod Management

Parallel pod management tells the StatefulSet controller to launch or terminate all Pods in parallel, and to not wait for Pods to become Running and Ready or completely terminated prior to launching or terminating another Pod. This option only affects the behavior for scaling operations. Updates are not affected.
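For illustration, the fragment below sketches how the nginx example above might opt into parallel pod management. It is not a complete manifest: the selector, template, and volumeClaimTemplates are assumed to be the same as in that example.

```yaml
# Sketch: the example StatefulSet with parallel pod management.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  podManagementPolicy: Parallel   # the default is OrderedReady
  serviceName: "nginx"
  replicas: 3
  # selector, template, and volumeClaimTemplates as in the example above
```

With this setting, web-0, web-1, and web-2 are launched or terminated together during scaling, while each Pod keeps its ordinal, hostname, and storage.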

Update Strategies

In Kubernetes 1.7 and later, StatefulSet’s .spec.updateStrategy field allows you to configure and disable automated rolling updates for containers, labels, resource requests/limits, and annotations for the Pods in a StatefulSet.

On Delete

The OnDelete update strategy implements the legacy (1.6 and prior) behavior. When a StatefulSet’s .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a StatefulSet’s .spec.template.
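A minimal sketch of this setting, shown as a fragment to merge into the spec of the nginx example above (all other fields are assumed unchanged):

```yaml
# Sketch: disable automated updates; Pods pick up template changes
# only after they are deleted manually.
spec:
  updateStrategy:
    type: OnDelete
```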

Rolling Updates

The RollingUpdate update strategy implements automated rolling updates for the Pods in a StatefulSet. It is the default strategy when .spec.updateStrategy is left unspecified. When a StatefulSet’s .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time. It will wait until an updated Pod is Running and Ready prior to updating its predecessor.

Partitions

The RollingUpdate update strategy can be partitioned by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet’s .spec.template is updated. All Pods with an ordinal that is less than the partition will not be updated, and, even if they are deleted, they will be recreated at the previous version. If a StatefulSet’s .spec.updateStrategy.rollingUpdate.partition is greater than its .spec.replicas, updates to its .spec.template will not be propagated to its Pods.

In most cases you will not need to use a partition, but they are useful if you want to stage an update, roll out a canary, or perform a phased roll out.
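As a sketch of a canary-style rollout for the three-replica nginx example above, the fragment below sets the partition to 2. The value 2 is an assumption chosen for illustration; the remaining fields are assumed to match the example.

```yaml
# Sketch: partitioned rolling update for a StatefulSet with 3 replicas.
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Pods with ordinal >= 2 (web-2) are updated when .spec.template changes;
      # web-0 and web-1 stay at, and are recreated at, the previous version.
      partition: 2
```

Lowering the partition step by step (to 1, then 0) would then extend the update to web-1 and web-0, which is one way to perform a phased roll out.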

Forced Rollback

When using Rolling Updates with the default Pod Management Policy (OrderedReady), it’s possible to get into a broken state that requires manual intervention to repair.

If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or application-level configuration error), StatefulSet will stop the rollout and wait.

In this state, it’s not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.

After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods using the reverted template.
