Backing up etcd

Kubernetes relies on etcd for state storage. More details about the usage can be found here and here.

Backup requirement

A Kubernetes cluster deployed with kops stores the etcd state in two different AWS EBS volumes per master node. One volume is used to store the Kubernetes main data, the other one for events. For an HA setup with three master nodes this results in six volumes for etcd data (one of each type per AZ). An EBS volume is designed to have an annual failure rate of 0.1%-0.2%.
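
As an illustrative sketch (using the Python boto3 SDK; the region and the cluster name k8s.mycompany.tld are assumptions), you can list the etcd volumes of a cluster to verify that the expected number exists:

```python
import boto3
from collections import Counter

REGION = "eu-central-1"             # assumption: your cluster's region
CLUSTER_NAME = "k8s.mycompany.tld"  # assumption: your kops cluster name

ec2 = boto3.client("ec2", region_name=REGION)

# kops names the etcd volumes <az>.etcd-main.<cluster> and <az>.etcd-events.<cluster>
# and tags them with KubernetesCluster=<cluster>.
volumes = ec2.describe_volumes(
    Filters=[
        {"Name": "tag:KubernetesCluster", "Values": [CLUSTER_NAME]},
        {"Name": "tag:Name", "Values": ["*.etcd-main.*", "*.etcd-events.*"]},
    ]
)["Volumes"]

per_az = Counter(v["AvailabilityZone"] for v in volumes)
print(f"{len(volumes)} etcd volumes:", dict(per_az))  # expect 2 per master AZ
```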

Create volume backups

Kubernetes does not currently provide any option to take regular backups of etcd out of the box.

Therefore we have to either back up the etcd volumes manually on a regular basis or use other AWS services to do this in an automated, scheduled way. You can, for example, use CloudWatch to trigger an AWS Lambda function on a defined schedule (e.g. once per hour). The Lambda function then creates a new snapshot of all etcd volumes. A complete guide on how to set up automated snapshots can be found here.
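
A minimal sketch of what such a Lambda function could look like with boto3 is shown below (this is not the implementation from the linked guide; the region, cluster name, and tag filters are assumptions):

```python
import boto3
from datetime import datetime, timezone

REGION = "eu-central-1"             # assumption: your cluster's region
CLUSTER_NAME = "k8s.mycompany.tld"  # assumption: your kops cluster name

ec2 = boto3.client("ec2", region_name=REGION)

def handler(event, context):
    """Create a snapshot of every EBS volume holding etcd data for the cluster."""
    volumes = ec2.describe_volumes(
        Filters=[
            {"Name": "tag:KubernetesCluster", "Values": [CLUSTER_NAME]},
            {"Name": "tag:Name", "Values": ["*.etcd-main.*", "*.etcd-events.*"]},
        ]
    )["Volumes"]

    timestamp = datetime.now(timezone.utc).isoformat()
    for volume in volumes:
        snapshot = ec2.create_snapshot(
            VolumeId=volume["VolumeId"],
            Description=f"etcd backup of {volume['VolumeId']} at {timestamp}",
        )
        print("created", snapshot["SnapshotId"], "for", volume["VolumeId"])
```

The function would then be triggered by a scheduled CloudWatch Events rule, e.g. with a rate(1 hour) expression.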

Note: this is one of many examples on how to do scheduled snapshots.

Restore volume backups

If the Kubernetes cluster fails in a way that too many master nodes can't access their etcd volumes, it is impossible to get an etcd quorum.

In this case it is possible to restore the volumes from the snapshots we created earlier. Details about creating a volume from a snapshot can be found in the AWS documentation.
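
As a rough sketch with boto3 (the snapshot ID, availability zone, and volume type are placeholders; the AWS console or CLI works just as well, as described in the linked documentation):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # assumption: cluster region

# Placeholders: the snapshot of the lost etcd volume and the AZ of the
# master node that should attach the restored volume.
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",
    AvailabilityZone="eu-central-1a",
    VolumeType="gp3",
)
print("restored volume:", volume["VolumeId"])
```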

Kubernetes uses protokube to identify the right volumes for etcd. Therefore it is important to tag the EBS volumes with the correct tags after restoring them from an EBS snapshot.

protokube will look for the following tags (a tagging example follows the list):

  • KubernetesCluster containing the cluster name (e.g. k8s.mycompany.tld)
  • Name containing the volume name (e.g. eu-central-1a.etcd-main.k8s.mycompany.tld)
  • k8s.io/etcd/main containing the availability zone of the volume (e.g. eu-central-1a/eu-central-1a)
  • k8s.io/role/master with the value 1
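
For illustration, applying these tags to a restored etcd-main volume could look like this with boto3 (the volume ID is a placeholder, and the tag values must match your cluster, as in the examples above):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # assumption: cluster region

ec2.create_tags(
    Resources=["vol-0123456789abcdef0"],  # placeholder: the restored etcd-main volume
    Tags=[
        {"Key": "KubernetesCluster", "Value": "k8s.mycompany.tld"},
        {"Key": "Name", "Value": "eu-central-1a.etcd-main.k8s.mycompany.tld"},
        {"Key": "k8s.io/etcd/main", "Value": "eu-central-1a/eu-central-1a"},
        {"Key": "k8s.io/role/master", "Value": "1"},
    ],
)
```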

After fully restoring the volume, ensure that the old volume is no longer there, or that you've removed the tags from the old volume. After restarting the master node, Kubernetes should pick up the new volume and start running again.