Health check and resource monitoring

A MatrixOne distributed cluster contains multiple components and objects. To keep the cluster running normally and to troubleshoot problems quickly, we need to perform a series of health checks and monitor resource usage.

The health check and resource monitoring environment introduced in this document is based on the MatrixOne Distributed Cluster Deployment environment.

Objects to check

  • Physical resource layer: including the three virtual machines’ CPU, memory, and disk resources. For a mature solution to monitor these resources, see Monitoring Solution.

  • Logical resource layer: including the capacity usage of MinIO, the CPU and memory resource usage of each node and Pod of Kubernetes, the overall status of MatrixOne, and the status of each component (such as LogService, CN, TN).

Resource monitoring

MinIO capacity usage monitoring

MinIO has a management interface through which we can visually monitor its capacity usage, including the remaining space. For more information, see the MinIO official documentation.
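The same information can also be pulled from the command line with the MinIO Client (`mc`). The sketch below is a minimal example, not part of the deployment itself: the alias name `mo-minio` and the credential environment variables are assumptions, and the default endpoint matches the `minio.mostorage:9000` service used elsewhere in this deployment.

```shell
# minio_info: register (or refresh) an mc alias for the MinIO service,
# then print server and capacity/usage information via `mc admin info`.
# Alias name, endpoint, and credential variables are assumptions; adjust
# them to your environment.
minio_info() {
    endpoint="${1:-http://minio.mostorage:9000}"
    mc alias set mo-minio "$endpoint" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" >/dev/null
    mc admin info mo-minio
}

# Usage (credentials come from the same secret used by the cluster):
#   MINIO_ACCESS_KEY=... MINIO_SECRET_KEY=... minio_info
```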

(Figure 1: MinIO management console showing capacity usage)

Node/Pod resource monitoring

To determine whether the MatrixOne service needs to be scaled up or down, users often need to monitor the resources used by the Nodes where the MatrixOne cluster resides and by the Pods corresponding to its components.

You can use the kubectl top command to do this. For more information, see the kubectl top documentation on the Kubernetes official website.

Node Monitoring

  1. Use the following command to view the details of the MatrixOne cluster nodes:

     kubectl get node

     [root@master0 ~]# kubectl get node
     NAME      STATUS   ROLES                  AGE   VERSION
     master0   Ready    control-plane,master   22h   v1.23.17
     node0     Ready    <none>                 22h   v1.23.17

  2. Based on the results returned above, use the following command to view the resource usage of a specific node. According to the previous deployment scheme, the MatrixOne cluster is located on the node named node0:

     NODE="[node to be monitored]" # Based on the results above, this may be an IP, hostname, or alias, such as 10.0.0.1, host-10-0-0-1, node01
     kubectl top node ${NODE}

     [root@master0 ~]# kubectl top node
     NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
     master0   179m         9%     4632Mi          66%
     node0     292m         15%    4115Mi          56%
     [root@master0 ~]# kubectl top node node0
     NAME    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
     node0   299m         15%    4079Mi          56%
  3. You can also view a node's resource allocation and upper limits. Note that allocated resources are not equal to resources actually in use.

  [root@master0 ~]# kubectl describe node
  Name:               master0
  Roles:              control-plane,master
  Labels:             beta.kubernetes.io/arch=amd64
                      beta.kubernetes.io/os=linux
                      kubernetes.io/arch=amd64
                      kubernetes.io/hostname=master0
                      kubernetes.io/os=linux
                      node-role.kubernetes.io/control-plane=
                      node-role.kubernetes.io/master=
                      node.kubernetes.io/exclude-from-external-load-balancers=
  Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                      node.alpha.kubernetes.io/ttl: 0
                      projectcalico.org/IPv4Address: 10.206.134.8/24
                      projectcalico.org/IPv4VXLANTunnelAddr: 10.234.166.0
                      volumes.kubernetes.io/controller-managed-attach-detach: true
  CreationTimestamp:  Sun, 07 May 2023 12:28:57 +0800
  Taints:             node-role.kubernetes.io/master:NoSchedule
  Unschedulable:      false
  Lease:
    HolderIdentity:  master0
    AcquireTime:     <unset>
    RenewTime:       Mon, 08 May 2023 10:56:08 +0800
  Conditions:
    Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----                 ------  -----------------                 ------------------                ------                       -------
    NetworkUnavailable   False   Sun, 07 May 2023 12:30:08 +0800   Sun, 07 May 2023 12:30:08 +0800   CalicoIsUp                   Calico is running on this node
    MemoryPressure       False   Mon, 08 May 2023 10:56:07 +0800   Sun, 07 May 2023 12:28:55 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
    DiskPressure         False   Mon, 08 May 2023 10:56:07 +0800   Sun, 07 May 2023 12:28:55 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
    PIDPressure          False   Mon, 08 May 2023 10:56:07 +0800   Sun, 07 May 2023 12:28:55 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
    Ready                True    Mon, 08 May 2023 10:56:07 +0800   Sun, 07 May 2023 20:47:39 +0800   KubeletReady                 kubelet is posting ready status
  Addresses:
    InternalIP:  10.206.134.8
    Hostname:    master0
  Capacity:
    cpu:                2
    ephemeral-storage:  51473868Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             7782436Ki
    pods:               110
  Allocatable:
    cpu:                1800m
    ephemeral-storage:  47438316671
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             7155748Ki
    pods:               110
  System Info:
    Machine ID:                 fb436be013b5415799d27abf653585d3
    System UUID:                FB436BE0-13B5-4157-99D2-7ABF653585D3
    Boot ID:                    552bd576-56c8-4d22-9549-d950069a5a77
    Kernel Version:             3.10.0-1160.88.1.el7.x86_64
    OS Image:                   CentOS Linux 7 (Core)
    Operating System:           linux
    Architecture:               amd64
    Container Runtime Version:  docker://20.10.23
    Kubelet Version:            v1.23.17
    Kube-Proxy Version:         v1.23.17
  PodCIDR:                      10.234.0.0/23
  PodCIDRs:                     10.234.0.0/23
  Non-terminated Pods:          (12 in total)
    Namespace      Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
    ---------      ----                              ------------  ----------  ---------------  -------------  ---
    default        netchecker-agent-7xnwb            15m (0%)      30m (1%)    64M (0%)         100M (1%)      22h
    default        netchecker-agent-hostnet-bw85f    15m (0%)      30m (1%)    64M (0%)         100M (1%)      22h
    kruise-system  kruise-daemon-xvl8t               0 (0%)        50m (2%)    0 (0%)           128Mi (1%)     20h
    kube-system    calico-node-sbzfc                 150m (8%)     300m (16%)  64M (0%)         500M (6%)      22h
    kube-system    dns-autoscaler-7874cf6bcf-l55q4   20m (1%)      0 (0%)      10Mi (0%)        0 (0%)         22h
    kube-system    kube-apiserver-master0            250m (13%)    0 (0%)      0 (0%)           0 (0%)         22h
    kube-system    kube-controller-manager-master0   200m (11%)    0 (0%)      0 (0%)           0 (0%)         22h
    kube-system    kube-proxy-lfkhk                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         22h
    kube-system    kube-scheduler-master0            100m (5%)     0 (0%)      0 (0%)           0 (0%)         22h
    kube-system    metrics-server-7bd47f88c4-knh9b   100m (5%)     100m (5%)   200Mi (2%)       200Mi (2%)     22h
    kube-system    nodelocaldns-dcffl                100m (5%)     0 (0%)      70Mi (1%)        170Mi (2%)     14h
    kuboard        kuboard-v3-master0                0 (0%)        0 (0%)      0 (0%)           0 (0%)         22h
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource           Requests         Limits
    --------           --------         ------
    cpu                950m (52%)       510m (28%)
    memory             485601280 (6%)   1222190848 (16%)
    ephemeral-storage  0 (0%)           0 (0%)
    hugepages-1Gi      0 (0%)           0 (0%)
    hugepages-2Mi      0 (0%)           0 (0%)
  Events:              <none>

  Name:               node0
  Roles:              <none>
  Labels:             beta.kubernetes.io/arch=amd64
                      beta.kubernetes.io/os=linux
                      kubernetes.io/arch=amd64
                      kubernetes.io/hostname=node0
                      kubernetes.io/os=linux
  Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                      node.alpha.kubernetes.io/ttl: 0
                      projectcalico.org/IPv4Address: 10.206.134.14/24
                      projectcalico.org/IPv4VXLANTunnelAddr: 10.234.60.0
                      volumes.kubernetes.io/controller-managed-attach-detach: true
  CreationTimestamp:  Sun, 07 May 2023 12:29:46 +0800
  Taints:             <none>
  Unschedulable:      false
  Lease:
    HolderIdentity:  node0
    AcquireTime:     <unset>
    RenewTime:       Mon, 08 May 2023 10:56:06 +0800
  Conditions:
    Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----                 ------  -----------------                 ------------------                ------                       -------
    NetworkUnavailable   False   Sun, 07 May 2023 12:30:08 +0800   Sun, 07 May 2023 12:30:08 +0800   CalicoIsUp                   Calico is running on this node
    MemoryPressure       False   Mon, 08 May 2023 10:56:12 +0800   Sun, 07 May 2023 12:29:46 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
    DiskPressure         False   Mon, 08 May 2023 10:56:12 +0800   Sun, 07 May 2023 12:29:46 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
    PIDPressure          False   Mon, 08 May 2023 10:56:12 +0800   Sun, 07 May 2023 12:29:46 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
    Ready                True    Mon, 08 May 2023 10:56:12 +0800   Sun, 07 May 2023 20:48:36 +0800   KubeletReady                 kubelet is posting ready status
  Addresses:
    InternalIP:  10.206.134.14
    Hostname:    node0
  Capacity:
    cpu:                2
    ephemeral-storage:  51473868Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             7782444Ki
    pods:               110
  Allocatable:
    cpu:                1900m
    ephemeral-storage:  47438316671
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             7417900Ki
    pods:               110
  System Info:
    Machine ID:                 a6600151884b44fb9f0bc9af490e44b7
    System UUID:                A6600151-884B-44FB-9F0B-C9AF490E44B7
    Boot ID:                    b7f3357f-44e6-425e-8c90-6ada14e92703
    Kernel Version:             3.10.0-1160.88.1.el7.x86_64
    OS Image:                   CentOS Linux 7 (Core)
    Operating System:           linux
    Architecture:               amd64
    Container Runtime Version:  docker://20.10.23
    Kubelet Version:            v1.23.17
    Kube-Proxy Version:         v1.23.17
  PodCIDR:                      10.234.2.0/23
  PodCIDRs:                     10.234.2.0/23
  Non-terminated Pods:          (20 in total)
    Namespace           Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
    ---------           ----                                                        ------------  ----------  ---------------  -------------  ---
    default             netchecker-agent-6v8rl                                      15m (0%)      30m (1%)    64M (0%)         100M (1%)      22h
    default             netchecker-agent-hostnet-fb2jn                              15m (0%)      30m (1%)    64M (0%)         100M (1%)      22h
    default             netchecker-server-645d759b79-v4bqm                          150m (7%)     300m (15%)  192M (2%)        512M (6%)      22h
    kruise-system       kruise-controller-manager-74847d59cf-295rk                  100m (5%)     200m (10%)  256Mi (3%)       512Mi (7%)     20h
    kruise-system       kruise-controller-manager-74847d59cf-854sq                  100m (5%)     200m (10%)  256Mi (3%)       512Mi (7%)     20h
    kruise-system       kruise-daemon-rz9pj                                         0 (0%)        50m (2%)    0 (0%)           128Mi (1%)     20h
    kube-system         calico-kube-controllers-74df5cd99c-n9qsn                    30m (1%)      1 (52%)     64M (0%)         256M (3%)      22h
    kube-system         calico-node-brqrk                                           150m (7%)     300m (15%)  64M (0%)         500M (6%)      22h
    kube-system         coredns-76b4fb4578-9cqc7                                    100m (5%)     0 (0%)      70Mi (0%)        170Mi (2%)     14h
    kube-system         kube-proxy-rpxb5                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         22h
    kube-system         nginx-proxy-node0                                           25m (1%)      0 (0%)      32M (0%)         0 (0%)         22h
    kube-system         nodelocaldns-qkxhv                                          100m (5%)     0 (0%)      70Mi (0%)        170Mi (2%)     14h
    local-path-storage  local-path-storage-local-path-provisioner-d5bb7f8c9-qfp8h   0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
    mo-hn               matrixone-operator-f8496ff5c-fp6zm                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         20h
    mo-hn               mo-tn-0                                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
    mo-hn               mo-log-0                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
    mo-hn               mo-log-1                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
    mo-hn               mo-log-2                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
    mo-hn               mo-tp-cn-0                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
    mostorage           minio-674ccf54f7-tdglh                                      0 (0%)        0 (0%)      512Mi (7%)       0 (0%)         20h
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource           Requests           Limits
    --------           --------           ------
    cpu                785m (41%)         2110m (111%)
    memory             1700542464 (22%)   3032475392 (39%)
    ephemeral-storage  0 (0%)             0 (0%)
    hugepages-1Gi      0 (0%)             0 (0%)
    hugepages-2Mi      0 (0%)             0 (0%)
  Events:              <none>
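The `kubectl top node` check above is easy to script for periodic alerting. The sketch below is a minimal example; the function name, the 80% default threshold, and the reliance on MEMORY% being the fifth column of the output are all assumptions, not part of kubectl.

```shell
# check_node_mem: read `kubectl top node` output on stdin and print the
# names of nodes whose MEMORY% exceeds the given threshold (default 80).
check_node_mem() {
    threshold="${1:-80}"
    # Skip the header row, strip the '%' from the MEMORY% column, compare.
    awk -v t="$threshold" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > t) print $1 }'
}

# Typical use against a live cluster (requires metrics-server):
#   kubectl top node | check_node_mem 80
```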

Pod Monitoring

  1. Use the following command to view the Pods of the MatrixOne cluster:

     NS="mo-hn"
     kubectl get pod -n ${NS}

  2. Based on the results returned above, use the following command to view the resource usage of a specific Pod:

     POD="[pod name to be monitored]" # Based on the results above, for example: the TN Pod is mo-tn-0; the CN Pods are mo-tp-cn-0, mo-tp-cn-1, ...; the Log Service Pods are mo-log-0, mo-log-1, ...
     kubectl top pod ${POD} -n ${NS}

     The command will display the CPU and memory usage of the specified Pod, similar to the following output:

     [root@master0 ~]# kubectl top pod mo-tp-cn-0 -n mo-hn
     NAME         CPU(cores)   MEMORY(bytes)
     mo-tp-cn-0   20m          214Mi
     [root@master0 ~]# kubectl top pod mo-tn-0 -n mo-hn
     NAME      CPU(cores)   MEMORY(bytes)
     mo-tn-0   36m          161Mi

  3. You can also view the resource declaration of a specific Pod and compare it with the actual resource usage:

     kubectl describe pod ${POD_NAME} -n ${NS}
     kubectl get pod ${POD_NAME} -n ${NS} -o yaml
  [root@master0 ~]# kubectl describe pod mo-tp-cn-0 -n mo-hn
  Name:             mo-tp-cn-0
  Namespace:        mo-hn
  Priority:         0
  Node:             node0/10.206.134.14
  Start Time:       Sun, 07 May 2023 21:01:50 +0800
  Labels:           controller-revision-hash=mo-tp-cn-8666cdfb56
                    lifecycle.apps.kruise.io/state=Normal
                    matrixorigin.io/cluster=mo
                    matrixorigin.io/component=CNSet
                    matrixorigin.io/instance=mo-tp
                    matrixorigin.io/namespace=mo-hn
                    statefulset.kubernetes.io/pod-name=mo-tp-cn-0
  Annotations:      apps.kruise.io/runtime-containers-meta:
                      {"containers":[{"name":"main","containerID":"docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f","restartCount":0,"...
                    cni.projectcalico.org/containerID: 80b286789a2d6fa9e615c3edee79b57edb452eaeafddb9b7b82ec5fb2e339409
                    cni.projectcalico.org/podIP: 10.234.60.53/32
                    cni.projectcalico.org/podIPs: 10.234.60.53/32
                    kruise.io/related-pub: mo
                    lifecycle.apps.kruise.io/timestamp: 2023-05-07T13:01:50Z
                    matrixone.cloud/cn-label: null
                    matrixone.cloud/dns-based-identity: False
  Status:           Running
  IP:               10.234.60.53
  IPs:
    IP:  10.234.60.53
  Controlled By:  StatefulSet/mo-tp-cn
  Containers:
    main:
      Container ID:  docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f
      Image:         matrixorigin/matrixone:nightly-144f3be4
      Image ID:      docker-pullable://matrixorigin/matrixone@sha256:288fe3d626c6aa564684099e4686a9d4b28e16fdd16512bd968a67bb41d5aaa3
      Port:          <none>
      Host Port:     <none>
      Command:
        /bin/sh
        /etc/matrixone/config/start.sh
      Args:
        -debug-http=:6060
      State:          Running
        Started:      Sun, 07 May 2023 21:01:54 +0800
      Ready:          True
      Restart Count:  0
      Environment:
        POD_NAME:               mo-tp-cn-0 (v1:metadata.name)
        NAMESPACE:              mo-hn (v1:metadata.namespace)
        HEADLESS_SERVICE_NAME:  mo-tp-cn-headless
        AWS_ACCESS_KEY_ID:      <set to the key 'AWS_ACCESS_KEY_ID' in secret 'minio'>      Optional: false
        AWS_SECRET_ACCESS_KEY:  <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'minio'>  Optional: false
        AWS_REGION:             us-west-2
      Mounts:
        /etc/matrixone/config from config (ro)
        /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ngpcs (ro)
  Readiness Gates:
    Type                 Status
    InPlaceUpdateReady   True
    KruisePodReady       True
  Conditions:
    Type                 Status
    KruisePodReady       True
    InPlaceUpdateReady   True
    Initialized          True
    Ready                True
    ContainersReady      True
    PodScheduled         True
  Volumes:
    config:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      mo-tp-cn-config-5abf454
      Optional:  false
    kube-api-access-ngpcs:
      Type:                    Projected (a volume that contains injected data from multiple sources)
      TokenExpirationSeconds:  3607
      ConfigMapName:           kube-root-ca.crt
      ConfigMapOptional:       <nil>
      DownwardAPI:             true
  QoS Class:       BestEffort
  Node-Selectors:  <none>
  Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
  Events:          <none>
  [root@master0 ~]# kubectl get pod mo-tp-cn-0 -n mo-hn -o yaml
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      apps.kruise.io/runtime-containers-meta: '{"containers":[{"name":"main","containerID":"docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f","restartCount":0,"hashes":{"plainHash":1670287891}}]}'
      cni.projectcalico.org/containerID: 80b286789a2d6fa9e615c3edee79b57edb452eaeafddb9b7b82ec5fb2e339409
      cni.projectcalico.org/podIP: 10.234.60.53/32
      cni.projectcalico.org/podIPs: 10.234.60.53/32
      kruise.io/related-pub: mo
      lifecycle.apps.kruise.io/timestamp: "2023-05-07T13:01:50Z"
      matrixone.cloud/cn-label: "null"
      matrixone.cloud/dns-based-identity: "False"
    creationTimestamp: "2023-05-07T13:01:50Z"
    generateName: mo-tp-cn-
    labels:
      controller-revision-hash: mo-tp-cn-8666cdfb56
      lifecycle.apps.kruise.io/state: Normal
      matrixorigin.io/cluster: mo
      matrixorigin.io/component: CNSet
      matrixorigin.io/instance: mo-tp
      matrixorigin.io/namespace: mo-hn
      statefulset.kubernetes.io/pod-name: mo-tp-cn-0
    name: mo-tp-cn-0
    namespace: mo-hn
    ownerReferences:
    - apiVersion: apps.kruise.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: StatefulSet
      name: mo-tp-cn
      uid: 891e0453-89a5-45d5-ad12-16ef048c804f
    resourceVersion: "72625"
    uid: 1e3e2df3-f1c2-4444-8694-8d23e7125d35
  spec:
    containers:
    - args:
      - -debug-http=:6060
      command:
      - /bin/sh
      - /etc/matrixone/config/start.sh
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.name
      - name: NAMESPACE
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
      - name: HEADLESS_SERVICE_NAME
        value: mo-tp-cn-headless
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            key: AWS_ACCESS_KEY_ID
            name: minio
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: AWS_SECRET_ACCESS_KEY
            name: minio
      - name: AWS_REGION
        value: us-west-2
      image: matrixorigin/matrixone:nightly-144f3be4
      imagePullPolicy: Always
      name: main
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /etc/matrixone/config
        name: config
        readOnly: true
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-ngpcs
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    hostname: mo-tp-cn-0
    nodeName: node0
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    readinessGates:
    - conditionType: InPlaceUpdateReady
    - conditionType: KruisePodReady
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    subdomain: mo-tp-cn-headless
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - configMap:
        defaultMode: 420
        name: mo-tp-cn-config-5abf454
      name: config
    - name: kube-api-access-ngpcs
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2023-05-07T13:01:50Z"
      status: "True"
      type: KruisePodReady
    - lastProbeTime: null
      lastTransitionTime: "2023-05-07T13:01:50Z"
      status: "True"
      type: InPlaceUpdateReady
    - lastProbeTime: null
      lastTransitionTime: "2023-05-07T13:01:50Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2023-05-07T13:01:54Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2023-05-07T13:01:54Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2023-05-07T13:01:50Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://679d672a330d7318f97a90835dacefcdd03e8a08062b8844d438f8cdd6bcdc8f
      image: matrixorigin/matrixone:nightly-144f3be4
      imageID: docker-pullable://matrixorigin/matrixone@sha256:288fe3d626c6aa564684099e4686a9d4b28e16fdd16512bd968a67bb41d5aaa3
      lastState: {}
      name: main
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2023-05-07T13:01:54Z"
    hostIP: 10.206.134.14
    phase: Running
    podIP: 10.234.60.53
    podIPs:
    - ip: 10.234.60.53
    qosClass: BestEffort
    startTime: "2023-05-07T13:01:50Z"
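As with nodes, the `kubectl top pod` output can be filtered to spot Pods that have grown past an expected memory footprint. This is a minimal sketch; the function name, the Mi-based threshold, and the assumption that memory is the third column are illustrative, not part of kubectl.

```shell
# check_pod_mem: read `kubectl top pod` output on stdin and print the
# names of pods whose memory usage (in Mi) exceeds the given threshold.
check_pod_mem() {
    threshold="${1:-1024}"
    # Skip the header row, strip the 'Mi' unit from MEMORY(bytes), compare.
    awk -v t="$threshold" 'NR > 1 { gsub(/Mi/, "", $3); if ($3 + 0 > t) print $1 }'
}

# Typical use against a live cluster:
#   kubectl top pod -n mo-hn | check_pod_mem 512
```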

MatrixOne Monitoring

View cluster status

During Operator deployment, we created a custom resource of type MatrixOneCluster, named mo, to represent the entire cluster. By checking the MatrixOneCluster object, we can tell whether the cluster is functioning correctly. Use the following commands:

  MO_NAME="mo"
  NS="mo-hn"
  kubectl get matrixonecluster -n ${NS} ${MO_NAME}

If the status is “Ready”, the cluster is healthy. If the status is “NotReady”, further troubleshooting is required.

  [root@master0 ~]# MO_NAME="mo"
  [root@master0 ~]# NS="mo-hn"
  [root@master0 ~]# kubectl get matrixonecluster -n ${NS} ${MO_NAME}
  NAME   LOG   TN   TP   AP   VERSION            PHASE   AGE
  mo     3     1    1         nightly-144f3be4   Ready   13h
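For automation, the same check can be scripted against the PHASE column shown above. A minimal sketch, assuming the cluster object exposes this value at `.status.phase` (the helper name is hypothetical):

```shell
# mo_cluster_ready: succeed (exit 0) only when the phase read from stdin
# is exactly "Ready"; anything else (e.g. "NotReady") fails the check.
mo_cluster_ready() {
    phase=$(cat)
    [ "$phase" = "Ready" ]
}

# Typical use against a live cluster, pulling .status.phase via jsonpath:
#   kubectl get matrixonecluster -n mo-hn mo \
#       -o jsonpath='{.status.phase}' | mo_cluster_ready && echo healthy
```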

To view details of the MatrixOne cluster status, you can run the following command:

  kubectl describe matrixonecluster -n ${NS} ${MO_NAME}
  [root@master0 ~]# kubectl describe matrixonecluster -n ${NS} ${MO_NAME}
  Name:         mo
  Namespace:    mo-hn
  Labels:       <none>
  Annotations:  <none>
  API Version:  core.matrixorigin.io/v1alpha1
  Kind:         MatrixOneCluster
  Metadata:
    Creation Timestamp:  2023-05-07T12:54:17Z
    Finalizers:
      matrixorigin.io/matrixonecluster
    Generation:  2
    Managed Fields:
      API Version:  core.matrixorigin.io/v1alpha1
      Fields Type:  FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .:
            f:kubectl.kubernetes.io/last-applied-configuration:
        f:spec:
          .:
          f:Tn:
            .:
            f:config:
            f:replicas:
          f:imagePullPolicy:
          f:imageRepository:
          f:logService:
            .:
            f:config:
            f:pvcRetentionPolicy:
            f:replicas:
            f:sharedStorage:
              .:
              f:s3:
                .:
                f:endpoint:
                f:secretRef:
                f:type:
            f:volume:
              .:
              f:size:
          f:tp:
            .:
            f:config:
            f:nodePort:
            f:replicas:
            f:serviceType:
          f:version:
      Manager:      kubectl-client-side-apply
      Operation:    Update
      Time:         2023-05-07T12:54:17Z
      API Version:  core.matrixorigin.io/v1alpha1
      Fields Type:  FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .:
            v:"matrixorigin.io/matrixonecluster":
      Manager:      manager
      Operation:    Update
      Time:         2023-05-07T12:54:17Z
      API Version:  core.matrixorigin.io/v1alpha1
      Fields Type:  FieldsV1
      fieldsV1:
        f:spec:
          f:logService:
            f:sharedStorage:
              f:s3:
                f:path:
      Manager:      kubectl-edit
      Operation:    Update
      Time:         2023-05-07T13:00:53Z
      API Version:  core.matrixorigin.io/v1alpha1
      Fields Type:  FieldsV1
      fieldsV1:
        f:status:
          .:
          f:cnGroups:
            .:
            f:desiredGroups:
            f:readyGroups:
            f:syncedGroups:
          f:conditions:
          f:credentialRef:
          f:Tn:
            .:
            f:availableStores:
            f:conditions:
          f:logService:
            .:
            f:availableStores:
            f:conditions:
            f:discovery:
              .:
              f:address:
              f:port:
          f:phase:
      Manager:         manager
      Operation:       Update
      Subresource:     status
      Time:            2023-05-07T13:02:12Z
    Resource Version:  72671
    UID:               be2355c0-0c69-4f0f-95bb-9310224200b6
  Spec:
    Tn:
      Config:
        [dn]
        [dn.Ckp]
        flush-interval = "60s"
        global-interval = "100000s"
        incremental-interval = "60s"
        min-count = 100
        scan-interval = "5s"
        [dn.Txn]
        [dn.Txn.Storage]
        backend = "TAE"
        log-backend = "logservice"
        [log]
        format = "json"
        level = "error"
        max-size = 512
      Replicas:  1
      Resources:
      Service Args:
        -debug-http=:6060
      Shared Storage Cache:
        Memory Cache Size:  0
    Image Pull Policy:      Always
    Image Repository:       matrixorigin/matrixone
    Log Service:
      Config:
        [log]
        format = "json"
        level = "error"
        max-size = 512
      Initial Config:
        TN Shards:           1
        Log Shard Replicas:  3
        Log Shards:          1
      Pvc Retention Policy:  Retain
      Replicas:              3
      Resources:
      Service Args:
        -debug-http=:6060
      Shared Storage:
        s3:
          Endpoint:           http://minio.mostorage:9000
          Path:               minio-mo
          s3RetentionPolicy:  Retain
          Secret Ref:
            Name:  minio
          Type:    minio
      Store Failure Timeout:  10m0s
      Volume:
        Size:  1Gi
    Tp:
      Config:
        [cn]
        [cn.Engine]
        type = "distributed-tae"
        [log]
        format = "json"
        level = "debug"
        max-size = 512
      Node Port:     31474
      Replicas:      1
      Resources:
      Service Args:
        -debug-http=:6060
      Service Type:  NodePort
      Shared Storage Cache:
        Memory Cache Size:  0
    Version:  nightly-144f3be4
  Status:
    Cn Groups:
      Desired Groups:  1
      Ready Groups:    1
      Synced Groups:   1
    Conditions:
      Last Transition Time:  2023-05-07T13:02:14Z
      Message:               the object is synced
      Reason:                empty
      Status:                True
      Type:                  Synced
      Last Transition Time:  2023-05-07T13:02:14Z
      Message:
      Reason:                AllSetsReady
      Status:                True
      Type:                  Ready
    Credential Ref:
      Name:  mo-credential
    Tn:
      Available Stores:
        Last Transition:  2023-05-07T13:01:48Z
        Phase:            Up
        Pod Name:         mo-tn-0
      Conditions:
        Last Transition Time:  2023-05-07T13:01:48Z
        Message:               the object is synced
        Reason:                empty
        Status:                True
        Type:                  Synced
        Last Transition Time:  2023-05-07T13:01:48Z
        Message:
        Reason:                empty
        Status:                True
        Type:                  Ready
    Log Service:
      Available Stores:
        Last Transition:  2023-05-07T13:01:25Z
        Phase:            Up
        Pod Name:         mo-log-0
        Last Transition:  2023-05-07T13:01:25Z
        Phase:            Up
        Pod Name:         mo-log-1
        Last Transition:  2023-05-07T13:01:25Z
        Phase:            Up
        Pod Name:         mo-log-2
      Conditions:
        Last Transition Time:  2023-05-07T13:01:25Z
        Message:               the object is synced
        Reason:                empty
        Status:                True
        Type:                  Synced
        Last Transition Time:  2023-05-07T13:01:25Z
        Message:
        Reason:                empty
        Status:                True
        Type:                  Ready
      Discovery:
        Address:  mo-log-discovery.mo-hn.svc
        Port:     32001
    Phase:  Ready
  Events:
    Type    Reason            Age                From              Message
    ----    ------            ----               ----              -------
    Normal  ReconcileSuccess  29m (x2 over 13h)  matrixonecluster  object is synced

View component status

The current MatrixOne cluster includes the following components: TN, CN, and Log Service, which correspond to the custom resource types TNSet, CNSet, and LogSet, respectively. These objects are all generated by the MatrixOneCluster controller.

To check whether each component is healthy, take TN as an example and run the following commands:

  SET_TYPE="tnset"
  NS="mo-hn"
  kubectl get ${SET_TYPE} -n ${NS}

This will display the status information of the corresponding component. The example below checks all three types in turn:

  [root@master0 ~]# SET_TYPE="tnset"
  [root@master0 ~]# NS="mo-hn"
  [root@master0 ~]# kubectl get ${SET_TYPE} -n ${NS}
  NAME   IMAGE                                     REPLICAS   AGE
  mo     matrixorigin/matrixone:nightly-144f3be4   1          13h
  [root@master0 ~]# SET_TYPE="cnset"
  [root@master0 ~]# kubectl get ${SET_TYPE} -n ${NS}
  NAME    IMAGE                                     REPLICAS   AGE
  mo-tp   matrixorigin/matrixone:nightly-144f3be4   1          13h
  [root@master0 ~]# SET_TYPE="logset"
  [root@master0 ~]# kubectl get ${SET_TYPE} -n ${NS}
  NAME   IMAGE                                     REPLICAS   AGE
  mo     matrixorigin/matrixone:nightly-144f3be4   3          13h
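The three per-type checks above can be collapsed into one loop. A minimal sketch; the helper name is hypothetical, and the set of resource types is taken from this document:

```shell
# list_mo_sets: print each MatrixOne component custom resource
# (TNSet, CNSet, LogSet) in the given namespace, one after another.
list_mo_sets() {
    ns="${1:-mo-hn}"
    for set_type in tnset cnset logset; do
        echo "== ${set_type} =="
        kubectl get "${set_type}" -n "${ns}"
    done
}

# Usage: list_mo_sets mo-hn
```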

View Pod Status

In addition, you can directly inspect the native Kubernetes objects generated in the MatrixOne cluster to confirm its health, usually by querying the Pods:

  NS="mo-hn"
  kubectl get pod -n ${NS}

This will display status information for the Pod.

  [root@master0 ~]# NS="mo-hn"
  [root@master0 ~]# kubectl get pod -n ${NS}
  NAME                                 READY   STATUS    RESTARTS   AGE
  matrixone-operator-f8496ff5c-fp6zm   1/1     Running   0          19h
  mo-tn-0                              1/1     Running   0          13h
  mo-log-0                             1/1     Running   0          13h
  mo-log-1                             1/1     Running   0          13h
  mo-log-2                             1/1     Running   0          13h
  mo-tp-cn-0                           1/1     Running   0          13h

Typically, the Running state indicates that a Pod is working normally. There are special cases, however, where the Pod status is Running but the MatrixOne cluster is still abnormal; for example, you may be unable to connect to the MatrixOne cluster through a MySQL client. In that case, look further into the Pod's logs to check for unusual output:

  NS="mo-hn"
  POD_NAME="[the name of the pod returned above]" # For example, mo-tp-cn-0
  kubectl logs ${POD_NAME} -n ${NS}
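Since the components in this deployment log in JSON format at configurable levels (per the cluster configuration shown earlier), the log output can be pre-filtered for triage. This is a minimal sketch; the exact field layout of the log lines, and the `panic`/`fatal` levels, are assumptions:

```shell
# scan_pod_logs: read pod logs on stdin and print only lines whose JSON
# "level" field indicates a serious problem. Assumes zap-style JSON logs
# such as {"level":"error","msg":"..."}.
scan_pod_logs() {
    grep -E '"level":"(error|panic|fatal)"'
}

# Typical use against a live cluster:
#   kubectl logs mo-tp-cn-0 -n mo-hn | scan_pod_logs
```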

If the Pod status is not Running, for example Pending, check the Pod's events (Events) to determine the cause of the exception. For instance, if the cluster's resources cannot satisfy the request of mo-tp-cn-3, that Pod cannot be scheduled and stays Pending; this can be resolved by expanding node resources.

  kubectl describe pod ${POD_NAME} -n ${NS}

(Figure 2: Pod events showing the cause of the scheduling failure)