GPU Support

kOps managed device driver

Introduced: kOps 1.22

kOps can install the NVIDIA device drivers, device plugin, and container runtime, and configure containerd to use that runtime.

kOps will also install a RuntimeClass named nvidia. Because the nvidia runtime is not the default runtime, you need to add runtimeClassName: nvidia to the spec of any Pod that runs a GPU workload. The RuntimeClass also configures the node selectors and tolerations needed to schedule those Pods on GPU nodes.
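As an illustration, a Pod that opts into the nvidia runtime might look like the sketch below. The Pod name, container name, image, and command are placeholders chosen for this example and are not part of kOps. The nvidia.com/gpu resource limit is the standard way to request a GPU from the NVIDIA device plugin, and the example assumes nvidia-smi is made available inside the container by the NVIDIA runtime.

```yaml
# Hypothetical example Pod; the names and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  # Selects the RuntimeClass installed by kOps.
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    # Replace with any CUDA-capable image you trust.
    image: <CUDA-capable image>
    # Assumes nvidia-smi is exposed in the container by the NVIDIA runtime.
    command: ["nvidia-smi"]
    resources:
      limits:
        # Request one GPU from the device plugin.
        nvidia.com/gpu: 1
```

No explicit nodeSelector or tolerations appear in this Pod, because the RuntimeClass is described as supplying them.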

kOps will add the label kops.k8s.io/gpu="1" to GPU nodes, for use as a node selector, as well as the following taint:

  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu

The taint prevents you from accidentally scheduling non-GPU workloads on GPU nodes.
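To show how the label, the taint, and the RuntimeClass fit together, here is a sketch of what a RuntimeClass with matching scheduling rules could look like. This is an assumption written for illustration using the standard node.k8s.io/v1 RuntimeClass fields. The manifest that kOps actually installs may differ, and the handler name in particular is assumed.

```yaml
# Illustrative sketch only; not copied from the manifest kOps installs.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# Assumed name of the containerd runtime handler.
handler: nvidia
scheduling:
  # Merged into the nodeSelector of Pods that use this RuntimeClass.
  nodeSelector:
    kops.k8s.io/gpu: "1"
  # Appended to the tolerations of Pods that use this RuntimeClass.
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

With scheduling rules like these, a Pod that sets runtimeClassName: nvidia is steered to the labeled nodes and tolerates the taint, while Pods without the RuntimeClass are kept off GPU nodes.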

You can enable NVIDIA support by adding the following to your Cluster spec:

  containerd:
    nvidiaGPU:
      enabled: true
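For orientation, the fragment sits under spec in the Cluster object. The sketch below shows only the relevant fields. A real Cluster spec contains many more settings, which are omitted here, and the cluster name is a placeholder.

```yaml
# Abbreviated Cluster object; all unrelated fields are omitted.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: <cluster name>
spec:
  containerd:
    nvidiaGPU:
      enabled: true
```

As with other Cluster spec changes, the setting typically takes effect after the cluster is updated and the affected nodes are rolled.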

Creating an instance group with GPU nodes

Because GPU instances are expensive, you want to minimize the number of pods running on them. Therefore, start by provisioning a regular cluster following the getting started documentation.

Once the cluster is running, add an instance group with GPUs:

  apiVersion: kops.k8s.io/v1alpha2
  kind: InstanceGroup
  metadata:
    labels:
      kops.k8s.io/cluster: <cluster name>
    name: gpu-nodes
  spec:
    image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
    nodeLabels:
      kops.k8s.io/instancegroup: gpu-nodes
    machineType: g4dn.xlarge
    maxSize: 1
    minSize: 1
    role: Node
    subnets:
    - eu-central-1c

GPUs in OpenStack

On OpenStack, the containerd configuration cannot be enabled at the cluster level. It must be set in the instance group instead:

  apiVersion: kops.k8s.io/v1alpha2
  kind: InstanceGroup
  metadata:
    labels:
      kops.k8s.io/cluster: <cluster name>
    name: gpu-nodes
  spec:
    image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200907
    nodeLabels:
      kops.k8s.io/instancegroup: gpu-nodes
    machineType: g4dn.xlarge
    maxSize: 1
    minSize: 1
    role: Node
    subnets:
    - eu-central-1c
    containerd:
      nvidiaGPU:
        enabled: true

Verifying GPUs

  1. After the new GPU nodes come up, they should appear in the output of kubectl get nodes.
  2. The nodes should have the kops.k8s.io/gpu label and the nvidia.com/gpu:NoSchedule taint.
  3. The kube-system namespace should have an nvidia-device-plugin-daemonset pod scheduled on each GPU node.
  4. If nvidia.com/gpu appears under Capacity in the output of kubectl describe node, everything should work:

     Capacity:
       cpu: 4
       ephemeral-storage: 9983232Ki
       hugepages-1Gi: 0
       hugepages-2Mi: 0
       memory: 32796292Ki
       nvidia.com/gpu: 1 <- this one
       pods: 110