Troubleshooting commands

Big picture

Use command-line tools to get status information and troubleshoot issues.

Note: The calico-system namespace is used in operator-based commands and examples; for a manifest-based installation, use kube-system instead.

See Calico architecture and components for help with components.

Hosts

Verify number of nodes in a cluster

```
kubectl get nodes
```

```
NAME           STATUS   ROLES    AGE   VERSION
ip-10-0-0-10   Ready    master   27h   v1.18.0
ip-10-0-0-11   Ready    <none>   27h   v1.18.0
ip-10-0-0-12   Ready    <none>   27h   v1.18.0
```

Verify calico-node pods are running on every node, and are in a healthy state

```
kubectl get pods -n calico-system -o wide
```

```
NAME                READY   STATUS    RESTARTS   AGE   IP          NODE
calico-node-77zgj   1/1     Running   0          27h   10.0.0.10   ip-10-0-0-10
calico-node-nz8k2   1/1     Running   0          27h   10.0.0.11   ip-10-0-0-11
calico-node-7trv7   1/1     Running   0          27h   10.0.0.12   ip-10-0-0-12
```

Exec into pod for further troubleshooting

```
kubectl run multitool --image=praqma/network-multitool
kubectl exec -it multitool -- bash
```

```
bash-5.0# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=97 time=6.61 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=97 time=6.64 ms
```

Collect Calico diagnostic logs

```
sudo calicoctl node diags
```

```
Collecting diagnostics
Using temp dir: /tmp/calico194224816
Dumping netstat
Dumping routes (IPv4)
Dumping routes (IPv6)
Dumping interface info (IPv4)
Dumping interface info (IPv6)
Dumping iptables (IPv4)
Dumping iptables (IPv6)
Diags saved to /tmp/calico194224816/diags-20201127_010117.tar.gz
```

Kubernetes

Verify all pods are running

```
kubectl get pods -A
```

```
NAMESPACE     NAME                          READY   STATUS    RESTARTS   AGE
kube-system   coredns-66bff467f8-dxbtl      1/1     Running   0          27h
kube-system   coredns-66bff467f8-n95vq      1/1     Running   0          27h
kube-system   etcd-ip-10-0-0-10             1/1     Running   0          27h
kube-system   kube-apiserver-ip-10-0-0-10   1/1     Running   0          27h
```

Verify Kubernetes API server is running

```
kubectl cluster-info
```

```
Kubernetes master is running at https://10.0.0.10:6443
KubeDNS is running at https://10.0.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
```

Verify Kubernetes kube-dns is working

```
kubectl get svc
```

```
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.49.0.1    <none>        443/TCP   2d2h
```

```
kubectl exec -it multitool -- bash
```

```
bash-5.0# curl -I -k https://kubernetes
HTTP/2 403
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
content-length: 234
```

```
bash-5.0# nslookup google.com
Server:    10.49.0.10
Address:   10.49.0.10#53

Non-authoritative answer:
Name:   google.com
Address: 172.217.14.238
Name:   google.com
Address: 2607:f8b0:400a:804::200e
```

Verify that kubelet is running on the node with the correct flags

```
systemctl status kubelet
```
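
To confirm which flags kubelet was actually started with, you can also inspect the unit files and the running process. A quick sketch; unit and binary locations vary by distribution:

```
systemctl cat kubelet    # show the unit file and any drop-ins that add flags
ps -ef | grep [k]ubelet  # show the full command line of the running kubelet
```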

If there is a problem, check the journal

```
journalctl -u kubelet | head
```
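
To follow the log live or limit it to recent entries, the usual journalctl options apply:

```
journalctl -u kubelet -f                        # follow new log entries
journalctl -u kubelet --since "10 minutes ago"  # recent entries only
```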

Check the status of other system pods

Look especially at coredns; if the coredns pods are not getting an IP address, something is wrong with the CNI.

```
kubectl get pod -n kube-system -o wide
```
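
If the coredns pods are stuck without an IP (for example, in ContainerCreating), their events usually name the underlying CNI error. A sketch using the coredns pod name from the example output above:

```
kubectl describe pod coredns-66bff467f8-dxbtl -n kube-system | tail -15
```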

But if other pods fail, it is likely a different issue; perform normal Kubernetes troubleshooting. For example:

```
kubectl describe pod kube-scheduler-ip-10-0-1-20.eu-west-1.compute.internal -n kube-system | tail -15
```

Calico components

View Calico CNI configuration on a node

```
cat /etc/cni/net.d/10-calico.conflist
```
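
On an operator-based installation the file typically looks something like the following; this is a trimmed sketch, and the exact fields and values vary by installation and Calico version:

```
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "datastore_type": "kubernetes",
      "nodename": "ip-10-0-0-11",
      "ipam": { "type": "calico-ipam" },
      "policy": { "type": "k8s" },
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": { "portMappings": true }
    }
  ]
}
```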

Verify calicoctl matches cluster

The calicoctl version must match the version and type of your cluster.

```
calicoctl version
```

For syntax help:

```
calicoctl version --help
```

Check Tigera operator status

```
kubectl get tigerastatus
```

```
NAME     AVAILABLE   PROGRESSING   DEGRADED   SINCE
calico   True        False         False      27h
```

Check if operator pod is running

```
kubectl get pod -n tigera-operator
```
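
If the operator pod is not healthy, its logs usually explain why. A sketch assuming the default deployment name:

```
kubectl logs -n tigera-operator deployment/tigera-operator | tail -20
```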

View calico nodes

```
kubectl get pod -n calico-system -o wide
```

View Calico installation parameters

```
kubectl get installation -o yaml
```

```
apiVersion: v1
items:
- apiVersion: operator.tigera.io/v1
  kind: Installation
  metadata:
    ...
  spec:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - blockSize: 26
        cidr: 10.48.0.0/16
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
      multiInterfaceMode: None
      nodeAddressAutodetectionV4:
        firstFound: true
    cni:
      ipam:
        type: Calico
      type: Calico
```

Run commands across multiple nodes

```
export THE_COMMAND_TO_RUN=date
for calinode in $(kubectl get pod -o wide -n calico-system | grep calico-node | awk '{print $1}'); do
  echo $calinode
  echo "-----"
  kubectl exec -n calico-system $calinode -- $THE_COMMAND_TO_RUN
  printf "\n"
done
```

```
calico-node-87lpx
-----
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Thu Apr 28 13:48:06 UTC 2022

calico-node-x5fmm
-----
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Thu Apr 28 13:48:07 UTC 2022
```

View pod info

```
kubectl describe pods <pod_name> -n <namespace>
```

```
kubectl describe pods busybox -n default
```

```
Events:
  Type    Reason     Age   From                    Message
  ----    ------     ----  ----                    -------
  Normal  Scheduled  21s   default-scheduler       Successfully assigned default/busybox to ip-10-0-0-11
  Normal  Pulling    20s   kubelet, ip-10-0-0-11   Pulling image "busybox"
  Normal  Pulled     19s   kubelet, ip-10-0-0-11   Successfully pulled image "busybox"
  Normal  Created    19s   kubelet, ip-10-0-0-11   Created container busybox
  Normal  Started    18s   kubelet, ip-10-0-0-11   Started container busybox
```

View logs of a pod

```
kubectl logs <pod_name> -n <namespace>
```

```
kubectl logs busybox -n default
```

View kubelet logs

```
journalctl -u kubelet
```

Routing

Verify routing table on the node

```
ip route
```

```
default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.10 metric 100
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.10
10.0.0.1 dev eth0 proto dhcp scope link src 10.0.0.10 metric 100
10.48.66.128/26 via 10.0.0.12 dev eth0 proto 80 onlink
10.48.231.0/26 via 10.0.0.11 dev eth0 proto 80 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
```
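
To see which route the kernel would actually pick for a particular destination, ip route get is handy; the address here is a hypothetical pod IP from the 10.48.66.128/26 block above:

```
ip route get 10.48.66.130
```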

Verify BGP peer status

```
sudo calicoctl node status
```

```
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 10.0.0.12    | node-to-node mesh | up    | 2020-11-25 | Established |
| 10.0.0.11    | node-to-node mesh | up    | 2020-11-25 | Established |
+--------------+-------------------+-------+------------+-------------+
```

Verify overlay configuration

```
kubectl get ippools default-ipv4-ippool -o yaml
```

```
...
spec:
  ipipMode: Always
  vxlanMode: Never
```
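
Depending on the encapsulation mode, the matching tunnel device should exist on each node (interface names as they appear elsewhere on this page):

```
ip -d link show tunl0          # IP-in-IP tunnel device
ip -d link show vxlan.calico   # VXLAN tunnel device
```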

Verify BGP-learned routes

```
ip r | grep bird
```

```
192.168.66.128/26 via 10.0.0.12 dev tunl0 proto bird onlink
192.168.180.192/26 via 10.0.0.10 dev tunl0 proto bird onlink
blackhole 192.168.231.0/26 proto bird
```

Verify BIRD routing table

Note: The BIRD routing table gets pushed to node routing tables.

```
kubectl exec -it -n calico-system calico-node-8cfc8 -- /bin/bash
```

```
[root@ip-10-0-0-11 /]# birdcl
BIRD v0.3.3+birdv1.6.8 ready.
bird> show route
0.0.0.0/0          via 10.0.0.1 on eth0 [kernel1 18:13:33] * (10)
10.0.0.0/24        dev eth0 [direct1 18:13:32] * (240)
10.0.0.1/32        dev eth0 [kernel1 18:13:33] * (10)
10.48.231.2/32     dev calieb874a8ef0b [kernel1 18:13:41] * (10)
10.48.231.1/32     dev caliaeaa173109d [kernel1 18:13:35] * (10)
10.48.231.0/26     blackhole [static1 18:13:32] * (200)
10.48.231.0/32     dev vxlan.calico [direct1 18:13:32] * (240)
10.48.180.192/26   via 10.0.0.10 on eth0 [Mesh_10_0_0_10 18:13:34] * (100/0) [i]
                   via 10.0.0.10 on eth0 [Mesh_10_0_0_12 18:13:41 from 10.0.0.12] (100/0) [i]
                   via 10.0.0.10 on eth0 [kernel1 18:13:33] (10)
10.48.66.128/26    via 10.0.0.12 on eth0 [Mesh_10_0_0_10 18:13:36 from 10.0.0.10] * (100/0) [i]
                   via 10.0.0.12 on eth0 [Mesh_10_0_0_12 18:13:41] (100/0) [i]
                   via 10.0.0.12 on eth0 [kernel1 18:13:36] (10)
```
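
From the same BIRD shell, show protocols summarizes the state of each BGP session; the Mesh_* protocol names correspond to the node-to-node mesh peers shown above:

```
bird> show protocols
```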

Capture traffic

For example, to capture ICMP traffic on a workload's host-side cali interface:

```
sudo tcpdump -i calicofac0017c3 icmp
```
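
To find the host-side cali interface for a given pod, one approach is to look up its IP in the node routing table; a sketch using the multitool pod from earlier, run on the node hosting the pod:

```
POD_IP=$(kubectl get pod multitool -o jsonpath='{.status.podIP}')
ip route | grep "$POD_IP"   # the local route shows "dev caliXXXXXXXXXXX"
```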

Network policy

Verify existing Kubernetes network policies

```
kubectl get networkpolicy --all-namespaces
```

```
NAMESPACE   NAME             POD-SELECTOR   AGE
client      allow-ui         <none>         20m
client      default-deny     <none>         4h51m
stars       allow-ui         <none>         20m
stars       backend-policy   role=backend   20m
stars       default-deny     <none>         4h51m
```

Verify existing Calico network policies

```
calicoctl get networkpolicy --all-namespaces -o wide
```

```
NAMESPACE     NAME                         ORDER   SELECTOR
calico-demo   allow-busybox                50      app == 'porter'
client        knp.default.allow-ui         1000    projectcalico.org/orchestrator == 'k8s'
client        knp.default.default-deny     1000    projectcalico.org/orchestrator == 'k8s'
stars         knp.default.allow-ui         1000    projectcalico.org/orchestrator == 'k8s'
stars         knp.default.backend-policy   1000    projectcalico.org/orchestrator == 'k8s'
stars         knp.default.default-deny     1000    projectcalico.org/orchestrator == 'k8s'
```

Verify existing Calico global network policies

```
calicoctl get globalnetworkpolicy -o wide
```

```
NAME                  ORDER   SELECTOR
default-app-policy    100
egress-lockdown       600
default-node-policy   100     has(kubernetes.io/hostname)
nodeport-policy       100     has(kubernetes.io/hostname)
```

Check policy selectors and order

For example,

```
calicoctl get np -n yaobank -o wide
```

If a selector should match a workload but the policy is not taking effect, check the endpoint's IP address and the node where it is running. For example:

```
kubectl get pod -l app=customer -n yaobank
```
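
The same workload is also visible from Calico's side as a workload endpoint, which shows the interface and networks Calico has recorded for it; a sketch using the yaobank namespace from the example above:

```
calicoctl get workloadendpoints -n yaobank -o wide
```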