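Once your application is running, you will inevitably need to debug it. Earlier we described how to use kubectl get pod to retrieve simple status information about your pods. There are, however, several other ways to get more information about an application.

Using kubectl describe pod to fetch details about pods

In this example we use a Deployment to create two pods, similar to the earlier example.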

nginx-dep.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
    spec:
      selector:
        matchLabels:
          app: nginx
      replicas: 2
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                memory: "128Mi"
                cpu: "500m"
            ports:
            - containerPort: 80

Create the deployment with the following command:

    $ kubectl create -f https://k8s.io/docs/tasks/debug-application-cluster/nginx-dep.yaml
    deployment "nginx-deployment" created

    $ kubectl get pods
    NAME                                READY     STATUS    RESTARTS   AGE
    nginx-deployment-1006230814-6winp   1/1       Running   0          11s
    nginx-deployment-1006230814-fmgu3   1/1       Running   0          11s

We can retrieve much more information about each of these pods with kubectl describe pod. For example:

    $ kubectl describe pod nginx-deployment-1006230814-6winp
    Name:           nginx-deployment-1006230814-6winp
    Namespace:      default
    Node:           kubernetes-node-wul5/10.240.0.9
    Start Time:     Thu, 24 Mar 2016 01:39:49 +0000
    Labels:         app=nginx,pod-template-hash=1006230814
    Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16" ...
    Status:         Running
    IP:             10.244.0.6
    Controllers:    ReplicaSet/nginx-deployment-1006230814
    Containers:
      nginx:
        Container ID:   docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
        Image:          nginx
        Image ID:       docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
        Port:           80/TCP
        QoS Tier:
          cpu:      Guaranteed
          memory:   Guaranteed
        Limits:
          cpu:      500m
          memory:   128Mi
        Requests:
          memory:   128Mi
          cpu:      500m
        State:          Running
          Started:      Thu, 24 Mar 2016 01:39:51 +0000
        Ready:          True
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
    Conditions:
      Type          Status
      Initialized   True
      Ready         True
      PodScheduled  True
    Volumes:
      default-token-4bcbi:
        Type:       Secret (a volume populated by a Secret)
        SecretName: default-token-4bcbi
        Optional:   false
    QoS Class:      Guaranteed
    Node-Selectors: <none>
    Tolerations:    <none>
    Events:
      FirstSeen  LastSeen  Count  From                            SubobjectPath           Type      Reason     Message
      ---------  --------  -----  ----                            -------------           --------  ------     -------
      54s        54s       1      {default-scheduler }                                    Normal    Scheduled  Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
      54s        54s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal    Pulling    pulling image "nginx"
      53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal    Pulled     Successfully pulled image "nginx"
      53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal    Created    Created container with docker id 90315cc9f513
      53s        53s       1      {kubelet kubernetes-node-wul5}  spec.containers{nginx}  Normal    Started    Started container with docker id 90315cc9f513

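Here you can see configuration information about the container(s) and Pod (labels, resource requirements, and so on), as well as status information about the container(s) and Pod (state, readiness, restart count, events, and so on).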
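The manifest sets only limits, yet the output reports Requests equal to the limits and a QoS Class of Guaranteed. When a container specifies a limit but no request for a resource, Kubernetes defaults the request to the limit. The Python sketch below mirrors that rule for a single container. The function names effective_requests and qos_class are illustrative and not part of any Kubernetes client library, and the comparison is a plain string match, so it would not treat "500m" and "0.5" as equal the way Kubernetes does.

```python
def effective_requests(resources):
    """Return requests, with missing entries defaulted from limits."""
    limits = resources.get("limits", {})
    requests = dict(resources.get("requests", {}))
    for name, value in limits.items():
        requests.setdefault(name, value)
    return requests


def qos_class(resources):
    """Classify a single-container Pod (simplified: one container only)."""
    limits = resources.get("limits", {})
    requests = effective_requests(resources)
    if not limits and not requests:
        return "BestEffort"
    # Guaranteed needs both cpu and memory limits, each equal to its request.
    guaranteed = all(
        name in limits and requests.get(name) == limits[name]
        for name in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# The resources stanza from nginx-dep.yaml: limits only, no requests.
nginx_resources = {"limits": {"memory": "128Mi", "cpu": "500m"}}
print(effective_requests(nginx_resources))
print(qos_class(nginx_resources))
```

With the nginx manifest this yields requests identical to the limits and the class Guaranteed, which matches what kubectl describe pod shows.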

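The container state is one of Waiting, Running, or Terminated. Depending on the state, additional information is provided; here you can see that for a container in the Running state, the system tells you when the container started.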

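Ready tells you whether the container passed its last readiness probe. (In this case the container has no readiness probe configured; a container is assumed to be ready when no readiness probe is configured.)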

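Restart Count tells you how many times the container has been restarted; this is useful for detecting crash loops in containers whose restart policy is 'Always'.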
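The state, readiness, and restart count described above also appear under status.containerStatuses in the Pod object. As a rough sketch, the following Python function summarizes one such entry. The function name, the restart threshold of 5, and the second sample entry are assumptions made for illustration; the first sample is taken from the nginx Pod in this article.

```python
def summarize_container(status, restart_threshold=5):
    """Return (state_name, detail, suspicious) for one containerStatuses entry."""
    state = status.get("state", {})
    # Exactly one of these keys is expected to be present.
    name = next(
        (k for k in ("waiting", "running", "terminated") if k in state),
        "unknown",
    )
    info = state.get(name, {})
    # Running containers report a start time; the others report a reason.
    detail = info.get("startedAt") if name == "running" else info.get("reason")
    # Many restarts hint at a crash loop (the threshold is arbitrary).
    suspicious = status.get("restartCount", 0) >= restart_threshold
    return name, detail, suspicious


nginx_status = {
    "name": "nginx",
    "ready": True,
    "restartCount": 0,
    "state": {"running": {"startedAt": "2016-03-24T01:39:51Z"}},
}
# Hypothetical entry for a container that keeps crashing.
crashing_status = {
    "name": "app",
    "ready": False,
    "restartCount": 12,
    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
}
print(summarize_container(nginx_status))
print(summarize_container(crashing_status))
```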

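The Conditions section lists the Pod's conditions. The most relevant one here is the binary Ready condition, which indicates that the Pod is able to serve requests and should be added to the load-balancing pools of all matching Services.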

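Lastly, you see a log of recent events related to your Pod. The system compresses multiple identical events by showing the first and last time the event was seen and the number of times it was seen. "From" indicates the component that logged the event, "SubobjectPath" tells you which object (for example, a container within the pod) is being referred to, and "Reason" and "Message" tell you what happened.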
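To make the compression of identical events concrete, here is a small Python sketch that collapses repeated events into a first-seen time, a last-seen time, and a count. It only illustrates the idea and is not the aggregation logic Kubernetes actually uses; the event dictionaries, their keys, and the sample values are invented for the example.

```python
def compress_events(events):
    """Collapse identical events into first_seen, last_seen and count."""
    merged = {}
    for ev in events:
        # Events are "identical" when source, object, reason and message match.
        key = (ev["source"], ev["object"], ev["reason"], ev["message"])
        if key not in merged:
            merged[key] = {"first_seen": ev["time"], "last_seen": ev["time"], "count": 0}
        entry = merged[key]
        entry["first_seen"] = min(entry["first_seen"], ev["time"])
        entry["last_seen"] = max(entry["last_seen"], ev["time"])
        entry["count"] += 1
    return merged


# Hypothetical raw events; "time" is seconds since the first occurrence.
raw_events = [
    {"time": t, "source": "default-scheduler", "object": "pod/example",
     "reason": "FailedScheduling", "message": "failed to fit in any node"}
    for t in (0, 12, 40)
]
raw_events.append(
    {"time": 55, "source": "default-scheduler", "object": "pod/example",
     "reason": "Scheduled", "message": "assigned to node-a"}
)

for key, entry in compress_events(raw_events).items():
    print(key[2], entry)
```

The three FailedScheduling events end up as a single row with a count of 3, in the same way the Events table shows one line per distinct event.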

Example: debugging Pending Pods

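A common scenario that you can detect using events is a Pod that does not fit on any node. For example, the Pod might request more resources than are free on any node, or it might specify a label selector that matches no node. Suppose we created the Deployment above with 5 replicas (instead of 2), each requesting 600 millicores instead of 500, on a four-node cluster where each (virtual) machine has 1 CPU. In that case one of the Pods cannot be scheduled. (Note that because cluster add-on pods such as fluentd and skydns run on each node, none of the Pods could be scheduled if we requested 1000 millicores.)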

    $ kubectl get pods
    NAME                                READY     STATUS    RESTARTS   AGE
    nginx-deployment-1006230814-6winp   1/1       Running   0          7m
    nginx-deployment-1006230814-fmgu3   1/1       Running   0          7m
    nginx-deployment-1370807587-6ekbw   1/1       Running   0          1m
    nginx-deployment-1370807587-fg172   0/1       Pending   0          1m
    nginx-deployment-1370807587-fz9sd   0/1       Pending   0          1m
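The capacity arithmetic in the scenario described above (5 replicas of 600 millicores on four 1-CPU nodes) can be checked with a few lines of Python. This is a deliberately naive first-fit model that looks at CPU only; it is not how the Kubernetes scheduler is implemented, and the 100m per-node add-on overhead is an assumed figure chosen for illustration.

```python
def schedule(pod_requests_m, node_free_m):
    """First-fit placement by CPU millicores; returns (placed, pending)."""
    free = list(node_free_m)
    placed = pending = 0
    for request in pod_requests_m:
        for i, available in enumerate(free):
            if request <= available:
                free[i] -= request
                placed += 1
                break
        else:
            # No node had enough free CPU for this pod.
            pending += 1
    return placed, pending


# Four nodes with 1 CPU (1000m) each, ignoring add-on pods.
print(schedule([600] * 5, [1000] * 4))

# Assumed: add-on pods already use 100m on every node.
ADDON_OVERHEAD_M = 100
free_with_addons = [1000 - ADDON_OVERHEAD_M] * 4
print(schedule([1000] * 5, free_with_addons))
```

Each node has room for only one 600m pod, so four pods are placed and one stays pending. Once any add-on overhead is subtracted, a 1000m request no longer fits on a 1-CPU node at all.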

To find out why the nginx-deployment-1370807587-fz9sd pod is not running, we can run kubectl describe pod on the pending Pod and look at its events:

    $ kubectl describe pod nginx-deployment-1370807587-fz9sd
    Name:           nginx-deployment-1370807587-fz9sd
    Namespace:      default
    Node:           /
    Labels:         app=nginx,pod-template-hash=1370807587
    Status:         Pending
    IP:
    Controllers:    ReplicaSet/nginx-deployment-1370807587
    Containers:
      nginx:
        Image:      nginx
        Port:       80/TCP
        QoS Tier:
          memory:   Guaranteed
          cpu:      Guaranteed
        Limits:
          cpu:      1
          memory:   128Mi
        Requests:
          cpu:      1
          memory:   128Mi
        Environment Variables:
    Volumes:
      default-token-4bcbi:
        Type:       Secret (a volume populated by a Secret)
        SecretName: default-token-4bcbi
    Events:
      FirstSeen  LastSeen  Count  From                  SubobjectPath  Type      Reason            Message
      ---------  --------  -----  ----                  -------------  --------  ------            -------
      1m         48s       7      {default-scheduler }                 Warning   FailedScheduling  pod (nginx-deployment-1370807587-fz9sd) failed to fit in any node
        fit failure on node (kubernetes-node-6ta5): Node didn't have enough resource: CPU, requested: 1000, used: 1420, capacity: 2000
        fit failure on node (kubernetes-node-wul5): Node didn't have enough resource: CPU, requested: 1000, used: 1100, capacity: 2000

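Here you can see the event generated by the scheduler saying that the Pod failed to schedule for reason FailedScheduling (and possibly others). The message tells us that none of the nodes had enough resources for the Pod.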
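The numbers in the fit-failure lines say how far each node is from fitting the Pod. The Python sketch below extracts them with a regular expression and computes the shortfall per node. It matches only the message wording shown in this output; other Kubernetes versions may phrase the message differently, and the function name shortfalls is made up for the example.

```python
import re

FIT_FAILURE = re.compile(
    r"fit failure on node \((?P<node>[^)]+)\): "
    r"Node didn't have enough resource: (?P<resource>\w+), "
    r"requested: (?P<requested>\d+), used: (?P<used>\d+), "
    r"capacity: (?P<capacity>\d+)"
)

# The two fit-failure lines copied from the describe output in this article.
MESSAGE = """
fit failure on node (kubernetes-node-6ta5): Node didn't have enough resource: CPU, requested: 1000, used: 1420, capacity: 2000
fit failure on node (kubernetes-node-wul5): Node didn't have enough resource: CPU, requested: 1000, used: 1100, capacity: 2000
"""


def shortfalls(message):
    """Map each node to the amount by which the request exceeds free capacity."""
    result = {}
    for m in FIT_FAILURE.finditer(message):
        free = int(m["capacity"]) - int(m["used"])
        result[m["node"]] = int(m["requested"]) - free
    return result


print(shortfalls(MESSAGE))
```

For kubernetes-node-6ta5, 2000 - 1420 leaves 580 free, so a request of 1000 is short by 420; kubernetes-node-wul5 is short by 100.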

To correct this situation, you can use kubectl scale to update your Deployment to specify four or fewer replicas. (Alternatively, you can leave the one Pod pending, which is harmless.)

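Events such as the ones you saw at the end of kubectl describe pod are persisted in etcd and provide high-level information about what is happening in the cluster. To list all events, use: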

    $ kubectl get events

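Keep in mind, however, that events are namespaced. If you are interested in events for a namespaced object (for example, what happened to Pods in the namespace my-namespace), you need to pass the namespace to the command explicitly: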

    $ kubectl get events --namespace=my-namespace

To see events from all namespaces, use the --all-namespaces argument.
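If you call these commands from a script, the three variants above differ only in one argument. The following Python helper is a minimal sketch that builds the argument list; the function name events_command is hypothetical, and it only uses the --namespace and --all-namespaces flags shown above.

```python
def events_command(namespace=None, all_namespaces=False):
    """Build the argv list for `kubectl get events`."""
    argv = ["kubectl", "get", "events"]
    if all_namespaces:
        argv.append("--all-namespaces")
    elif namespace:
        argv.append("--namespace=" + namespace)
    return argv


print(events_command())
print(events_command(namespace="my-namespace"))
print(events_command(all_namespaces=True))
```

On a machine where kubectl is installed and configured, the returned list can be passed to subprocess.run(argv, check=True).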

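In addition to kubectl describe pod, another way to get extra information about a pod (beyond what kubectl get pod provides) is to pass the -o yaml output format flag to kubectl get pod. This gives you, in YAML format, even more information than kubectl describe pod: essentially everything the system knows about the Pod. Here you will see things like annotations (key-value metadata without the label restrictions, used internally by Kubernetes system components), restart policy, ports, and volumes.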

    $ kubectl get pod nginx-deployment-1006230814-6winp -o yaml
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        kubernetes.io/created-by: |
          {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1006230814","uid":"4c84c175-f161-11e5-9a78-42010af00005","apiVersion":"extensions","resourceVersion":"133434"}}
      creationTimestamp: 2016-03-24T01:39:50Z
      generateName: nginx-deployment-1006230814-
      labels:
        app: nginx
        pod-template-hash: "1006230814"
      name: nginx-deployment-1006230814-6winp
      namespace: default
      resourceVersion: "133447"
      selfLink: /api/v1/namespaces/default/pods/nginx-deployment-1006230814-6winp
      uid: 4c879808-f161-11e5-9a78-42010af00005
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
          requests:
            cpu: 500m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-4bcbi
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeName: kubernetes-node-wul5
      restartPolicy: Always
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - name: default-token-4bcbi
        secret:
          secretName: default-token-4bcbi
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: 2016-03-24T01:39:51Z
        status: "True"
        type: Ready
      containerStatuses:
      - containerID: docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
        image: nginx
        imageID: docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
        lastState: {}
        name: nginx
        ready: true
        restartCount: 0
        state:
          running:
            startedAt: 2016-03-24T01:39:51Z
      hostIP: 10.240.0.9
      phase: Running
      podIP: 10.244.0.6
      startTime: 2016-03-24T01:39:49Z
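Structured output like this is easier to process in a script than the describe text. kubectl also supports -o json, which Python can read with its standard library alone. The sketch below pulls a few fields out of a Pod object; the embedded JSON is a hand-trimmed version of the Pod shown above rather than real command output, and the function name pod_summary is illustrative.

```python
import json

# Trimmed, hand-written JSON equivalent of the Pod above (not real output).
POD_JSON = """
{
  "metadata": {"name": "nginx-deployment-1006230814-6winp", "namespace": "default"},
  "spec": {"nodeName": "kubernetes-node-wul5", "restartPolicy": "Always"},
  "status": {
    "phase": "Running",
    "podIP": "10.244.0.6",
    "containerStatuses": [{"name": "nginx", "ready": true, "restartCount": 0}]
  }
}
"""


def pod_summary(pod):
    """Pick the fields most often needed when debugging a Pod."""
    status = pod.get("status", {})
    containers = status.get("containerStatuses", [])
    return {
        "name": pod["metadata"]["name"],
        "node": pod.get("spec", {}).get("nodeName"),
        "phase": status.get("phase"),
        "pod_ip": status.get("podIP"),
        "restarts": sum(c.get("restartCount", 0) for c in containers),
    }


summary = pod_summary(json.loads(POD_JSON))
print(summary)
```

A pending Pod has no nodeName or podIP yet, which is why the sketch reads those fields with .get() instead of indexing.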

Example: debugging a down or unreachable node

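Sometimes when debugging it is useful to look at the status of a node, for example because you have noticed strange behavior in a Pod running on that node, or because you want to find out why a Pod will not schedule onto it. As with Pods, you can use kubectl describe node and kubectl get node -o yaml to retrieve detailed information about nodes. For example, here is what you will see if a node is down (disconnected from the network, or the kubelet has died and does not restart, and so on). Notice the events that show the node is NotReady, and also notice that the pods are no longer running (they are evicted after five minutes of NotReady status).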

    $ kubectl get nodes
    NAME                     STATUS     AGE   VERSION
    kubernetes-node-861h     NotReady   1h    v1.6.0+fff5156
    kubernetes-node-bols     Ready      1h    v1.6.0+fff5156
    kubernetes-node-st6x     Ready      1h    v1.6.0+fff5156
    kubernetes-node-unaj     Ready      1h    v1.6.0+fff5156

    $ kubectl describe node kubernetes-node-861h
    Name:               kubernetes-node-861h
    Role
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=kubernetes-node-861h
    Annotations:        node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
    Taints:             <none>
    CreationTimestamp:  Mon, 04 Sep 2017 17:13:23 +0800
    Phase:
    Conditions:
      Type            Status   LastHeartbeatTime                LastTransitionTime               Reason             Message
      ----            ------   -----------------                ------------------               ------             -------
      OutOfDisk       Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
      MemoryPressure  Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
      DiskPressure    Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
      Ready           Unknown  Fri, 08 Sep 2017 16:04:28 +0800  Fri, 08 Sep 2017 16:20:58 +0800  NodeStatusUnknown  Kubelet stopped posting node status.
    Addresses:          10.240.115.55,104.197.0.26
    Capacity:
     cpu:               2
     hugePages:         0
     memory:            4046788Ki
     pods:              110
    Allocatable:
     cpu:               1500m
     hugePages:         0
     memory:            1479263Ki
     pods:              110
    System Info:
     Machine ID:                 8e025a21a4254e11b028584d9d8b12c4
     System UUID:                349075D1-D169-4F25-9F2A-E886850C47E3
     Boot ID:                    5cd18b37-c5bd-4658-94e0-e436d3f110e0
     Kernel Version:             4.4.0-31-generic
     OS Image:                   Debian GNU/Linux 8 (jessie)
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://1.12.5
     Kubelet Version:            v1.6.9+a3d1dfa6f4335
     Kube-Proxy Version:         v1.6.9+a3d1dfa6f4335
    ExternalID:                  15233045891481496305
    Non-terminated Pods:         (9 in total)
      Namespace  Name  CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------  ----  ------------  ----------  ---------------  -------------
    ......
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      CPU Requests  CPU Limits    Memory Requests   Memory Limits
      ------------  ----------    ---------------   -------------
      900m (60%)    2200m (146%)  1009286400 (66%)  5681286400 (375%)
    Events:         <none>

    $ kubectl get node kubernetes-node-861h -o yaml
    apiVersion: v1
    kind: Node
    metadata:
      creationTimestamp: 2015-07-10T21:32:29Z
      labels:
        kubernetes.io/hostname: kubernetes-node-861h
      name: kubernetes-node-861h
      resourceVersion: "757"
      selfLink: /api/v1/nodes/kubernetes-node-861h
      uid: 2a69374e-274b-11e5-a234-42010af0d969
    spec:
      externalID: "15233045891481496305"
      podCIDR: 10.244.0.0/24
      providerID: gce://striped-torus-760/us-central1-b/kubernetes-node-861h
    status:
      addresses:
      - address: 10.240.115.55
        type: InternalIP
      - address: 104.197.0.26
        type: ExternalIP
      capacity:
        cpu: "1"
        memory: 3800808Ki
        pods: "100"
      conditions:
      - lastHeartbeatTime: 2015-07-10T21:34:32Z
        lastTransitionTime: 2015-07-10T21:35:15Z
        reason: Kubelet stopped posting node status.
        status: Unknown
        type: Ready
      nodeInfo:
        bootID: 4e316776-b40d-4f78-a4ea-ab0d73390897
        containerRuntimeVersion: docker://Unknown
        kernelVersion: 3.16.0-0.bpo.4-amd64
        kubeProxyVersion: v0.21.1-185-gffc5a86098dc01
        kubeletVersion: v0.21.1-185-gffc5a86098dc01
        machineID: ""
        osImage: Debian GNU/Linux 7 (wheezy)
        systemUUID: ABE5F6B4-D44B-108B-C46A-24CCE16C8B6E
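Two details of the node output are worth spelling out. First, a node counts as ready only when its Ready condition has status "True"; the "Unknown" status shown above means the kubelet has stopped reporting. Second, the percentages in the Allocated resources table are consistent with being computed against Allocatable and rounded down (900m of 1500m is 60%). The Python sketch below illustrates both under those assumptions; the function names are illustrative, and the CPU parser handles only plain numbers and the "m" suffix.

```python
def node_ready(node):
    """Return True only if the Ready condition has status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def cpu_millis(quantity):
    """Convert a CPU quantity such as "2" or "1500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def percent_of(used, allocatable):
    """Integer percentage of allocatable CPU, rounded down."""
    return cpu_millis(used) * 100 // cpu_millis(allocatable)


# Condition taken from the `kubectl get node -o yaml` output above.
down_node = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "Unknown",
             "reason": "Kubelet stopped posting node status."}
        ]
    }
}
print(node_ready(down_node))
print(percent_of("900m", "1500m"), percent_of("2200m", "1500m"))
```

The second print reproduces the 60% and 146% figures from the CPU columns, which shows how limits can add up to more than 100 percent of what the node can allocate.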

Translator: tianshapjq

Source: http://docs.kubernetes.org.cn/824.html