1 Deploying a Kubernetes Cluster with GPUs

1.1 Prerequisites

  • At least one Worker node has an NVIDIA GPU card
  • Kubernetes 1.16.X: Package >= 1.16.4
  • Kubernetes 1.15.X: Package >= 1.15.7
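
As a rough illustration of the version requirement above, the check can be sketched in Python. The names `MINIMUMS`, `parse_version`, and `meets_minimum` are hypothetical helpers introduced here for illustration; the only values taken from this guide are the minimum package versions 1.16.4 and 1.15.7.

```python
# Hypothetical helper: checks a package version against the minimums listed above.
# Keys are (major, minor) release series; values are the lowest supported version.
MINIMUMS = {
    (1, 16): (1, 16, 4),
    (1, 15): (1, 15, 7),
}


def parse_version(text):
    """Turn a string such as 'v1.16.4' or '1.16.4' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(package_version):
    """Return True if the package version satisfies the minimum for its series."""
    version = parse_version(package_version)
    minimum = MINIMUMS.get(version[:2])
    if minimum is None:
        # Series not covered by this guide: report it as not verified.
        return False
    return version >= minimum


if __name__ == "__main__":
    for candidate in ("1.16.4", "1.16.2", "1.15.7", "1.15.5"):
        print(candidate, meets_minimum(candidate))
```

A series that this guide does not list (for example 1.14) is reported as not meeting the requirement, since the guide gives no minimum for it.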

1.2 Cluster Planning

Name      CPU (cores)   Memory (GB)   OS           GPUs   Role
master    4             8             CentOS 7.6   0      Master
worker1   4             20            CentOS 7.6   1      Worker
worker2   4             8             CentOS 7.6   0      Worker
nfs       1             2             CentOS 7.6   0      NFS

1.3 Adding GPU Hosts

[Figure: GPU host list]

[Figure: GPU host details]

1.4 Creating the K8s Cluster

For the deployment steps, see "Deploying a K8s Cluster on Self-Prepared Hosts".

1.5 Verifying GPU Scheduling

Use Webkubectl to open the cluster terminal, create a file named gpu.yml, and enter the following content:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nvidia-deployment
    spec:
      replicas: 2  # number of Pod replicas to create
      selector:
        matchLabels:
          name: nvidia-gpu-deploy
      template:
        metadata:
          labels:
            name: nvidia-gpu-deploy
        spec:
          containers:
            - name: cuda-container
              image: ubuntu
              command: ["sleep"]
              args: ["100000"]
              resources:
                limits:
                  nvidia.com/gpu: 1  # number of GPU cards to use

Apply the manifest and check the Pod status:

    kubectl apply -f gpu.yml
    kubectl get pod
    NAME                                READY   STATUS    RESTARTS   AGE
    nvidia-deployment-64589d94d-8m6rd   1/1     Running   0          119s   # obtained a GPU card: Running
    nvidia-deployment-64589d94d-lzfph   0/1     Pending   0          86s    # did not obtain a GPU card: Pending

Run nvidia-smi inside the Running Pod:

    kubectl exec -it nvidia-deployment-64589d94d-8m6rd -- nvidia-smi
    Mon Jan 13 08:16:36 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P4            Off  | 00000000:00:06.0 Off |                    0 |
    | N/A   33C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    # The Pod can now access the GPU device
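
The Pending Pod follows from simple arithmetic: the cluster plan provides a single GPU (on worker1), while the Deployment asks for two replicas with one GPU each. The Python sketch below models this with a simplified first-fit placement. It is not the actual kube-scheduler algorithm, and `place_pods` and its parameters are hypothetical names used only for illustration. It assumes that a GPU is requested in whole units and is not shared between Pods.

```python
def place_pods(node_gpus, replicas, gpus_per_pod):
    """First-fit placement of identical Pods onto nodes by free GPU count.

    node_gpus: mapping of node name -> number of GPUs on that node.
    Returns (placements, pending): a list of node names, one per scheduled Pod,
    and the number of Pods left Pending.
    """
    free = dict(node_gpus)  # copy so the caller's mapping is not modified
    placements = []
    for _ in range(replicas):
        for node, available in free.items():
            if available >= gpus_per_pod:
                # Reserve the GPUs on this node and move on to the next Pod.
                free[node] = available - gpus_per_pod
                placements.append(node)
                break
    pending = replicas - len(placements)
    return placements, pending


if __name__ == "__main__":
    # GPU counts of the Worker nodes, taken from the cluster plan in section 1.2.
    plan = {"worker1": 1, "worker2": 0}
    running, pending = place_pods(plan, replicas=2, gpus_per_pod=1)
    print("Running on:", running)  # Running on: ['worker1']
    print("Pending:", pending)     # Pending: 1
```

In this model, setting replicas to 1 or adding a second Worker node with a GPU leaves no Pod Pending.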