Node Resource Allocation

Node Resource Handling In Kubernetes

An aspect of Kubernetes clusters that is often overlooked is the resources non-pod components require to run, such as:

  • Operating system components, e.g. sshd and udev.
  • Kubernetes system components, e.g. the kubelet, the container runtime (such as Docker), node problem detector and journald.

As you manage your cluster, it's important that you are cognisant of these components, because if your critical non-pod components don't have enough resources, you might end up with a very unstable cluster.

Understanding Node Resources

Each node in a cluster has resources available to it, and pods scheduled to run on the node may or may not have resource requests or limits set on them. Kubernetes schedules pods on nodes whose resources satisfy the pod's specified requirements. Broadly, pods are bin-packed onto the nodes in a best-effort attempt to use as much of the available resources as possible with as few nodes as possible.
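
To make this concrete, here is a minimal, hypothetical pod spec (the name, image and values are placeholders for illustration only) showing the requests the scheduler bin-packs on and the limits that are enforced at runtime:

  apiVersion: v1
  kind: Pod
  metadata:
    name: example-app            # hypothetical name, for illustration only
  spec:
    containers:
    - name: app
      image: example/app:1.0     # placeholder image
      resources:
        requests:                # the scheduler only considers these
          cpu: 250m
          memory: 256Mi
        limits:                  # enforced at runtime; the pod may burst up to these
          cpu: 500m
          memory: 512Mi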

        Node Capacity
  ---------------------------
  |      kube-reserved      |
  |-------------------------|
  |     system-reserved     |
  |-------------------------|
  |    eviction-threshold   |
  |-------------------------|
  |       allocatable       |
  |  (available for pods)   |
  ---------------------------

Node resources can be categorised into four groups (as shown above); a sketch of the kubelet flags that configure them follows this list:

  • kube-reserved – reserves resources for Kubernetes system daemons, such as the kubelet and the container runtime.
  • system-reserved – reserves resources for operating system components, such as sshd and udev.
  • eviction-threshold – the level of available resources below which the kubelet starts evicting pods.
  • allocatable – the node resources that remain available for scheduling pods once kube-reserved, system-reserved and the eviction-threshold have been accounted for.

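These reservations and thresholds are set via kubelet flags. As a rough sketch (the values here are placeholders, not recommendations), the relevant flags look something like this:

  # With these set, Allocatable = Capacity - kube-reserved - system-reserved - eviction-threshold.
  kubelet \
    --kube-reserved=cpu=100m,memory=500Mi \
    --system-reserved=cpu=100m,memory=200Mi \
    --eviction-hard=memory.available<100Mi
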
For example, on a 30.5 GB, 4 vCPU machine with only an eviction threshold set as --eviction-hard=memory.available<100Mi, we'd get the following Capacity and Allocatable resources:

  $ kubectl describe node/ip-xx-xx-xx-xxx.internal
  ...
  Capacity:
   cpu:     4
   memory:  31402412Ki
  ...
  Allocatable:
   cpu:     4
   memory:  31300012Ki
  ...
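
The two memory figures differ by exactly the eviction threshold: 31402412Ki - 102400Ki (i.e. 100Mi) = 31300012Ki. In other words, Allocatable = Capacity - eviction-threshold here, since nothing else is reserved.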

So, What Could Possibly Go Wrong?

The scheduler ensures that, for each resource type, the sum of the resources scheduled does not surpass the sum of allocatable resources. But suppose you have a couple of applications deployed in your cluster that are constantly using way more resources than set in their resource requests (bursting above requests but below limits under load). You end up with a node whose pods are each attempting to take up more resources than are available on the node!
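
To put some hypothetical numbers on it: picture a node with roughly 31Gi of allocatable memory running seven pods that each request 4Gi but are limited to 8Gi. The scheduler is satisfied (28Gi requested is less than 31Gi allocatable), but if all seven pods burst towards their limits at once they would need up to 56Gi, far more than the node actually has.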

This is particularly an issue with non-compressible resources like memory. For example, in the aforementioned case, with an eviction threshold of only memory.available<100Mi and no kube-reserved or system-reserved reservations set, it is possible for a node to OOM before the kubelet is able to reclaim memory (because it may not observe memory pressure right away, since it polls cAdvisor to collect memory usage stats at a regular interval).

All the while, keep in mind that without kube-reserved or system-reserved reservations set (which is the case in most clusters, e.g. GKE and Kops defaults), the scheduler doesn't account for the resources that non-pod components require to function properly, because Capacity and Allocatable are more or less equal.

Where Do We Go From Here?

It's difficult to give a one-size-fits-all answer to node resource allocation. The behaviour of your cluster depends on the resource requirements of the apps running on it, the pod density and the cluster size. But there is a node performance dashboard that exposes CPU and memory usage profiles of the kubelet and the Docker engine at multiple levels of pod density, which may serve as a guide for what values would be appropriate for your cluster.

But it seems fitting to recommend the following:

  • Always set requests, with some breathing room – do not set requests to match your application's resource profile during idle time too closely.
  • Always set limits – so that your application doesn't hog all the memory on a node during a spike.
  • Don't set your limits for incompressible resources too high – at the end of the day, the Kubernetes scheduler schedules based on resource requests, which must fit within what's available on the node. During a spike, your pod will try to use resources beyond what it's guaranteed to have access to. As explained before, this can be an issue if a bunch of your pods are all bursting at the same time.
  • Increase eviction thresholds if they are too low – squeezing the utmost utilization out of a node sounds ideal, but running that close to the edge means the system may not have enough time to reclaim resources via evictions if usage rises rapidly within that window.
  • Reserve resources for system components (i.e. kube-reserved and system-reserved) once you've been able to profile your nodes – see the sketch after this list.
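
As a rough sketch of what that last recommendation might look like (the values below are made-up placeholders, not profiled numbers), the kubelet on the 30.5 GB node from earlier could be started along these lines:

  # Placeholder reservations; derive real values by profiling your own nodes.
  kubelet \
    --kube-reserved=cpu=200m,memory=1Gi \
    --system-reserved=cpu=100m,memory=500Mi \
    --eviction-hard=memory.available<500Mi

  # Allocatable memory would then be
  # 31402412Ki - 1Gi - 500Mi - 500Mi = 29329836Ki (roughly 28Gi),
  # leaving headroom for the kubelet, the container runtime and the OS.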