Troubleshooting OpenEBS - Overview

General guidelines for troubleshooting

Kubernetes related

Kubernetes node reboots because of an increase in memory consumed by Kubelet

Application and OpenEBS pods terminate/restart under heavy I/O load

Others

Nodes in the cluster reboot frequently, almost every day, in openSUSE CaaS

Kubernetes related

Kubernetes node reboots because of an increase in memory consumed by Kubelet

It is sometimes observed that iscsiadm fails continuously and retries rapidly, and for some reason this causes the memory consumption of kubelet to grow until the node goes out of memory and needs to be rebooted. The following types of errors can be observed in journalctl and in the cstor-istgt container logs.

journalctl logs

  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: failed to send SendTargets PDU
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: connection login retries (reopen_max) 5 exceeded
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: Connection to Discovery Address 10.233.46.76 failed
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: failed to send SendTargets PDU
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: connection login retries (reopen_max) 5 exceeded
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: Connection to Discovery Address 10.233.46.76 failed
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: failed to send SendTargets PDU
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: connection login retries (reopen_max) 5 exceeded
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: Connection to Discovery Address 10.233.46.76 failed
  Feb 06 06:11:38 <hostname> kubelet[1063]: iscsiadm: failed to send SendTargets PDU
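
To confirm this pattern on an affected node, the kubelet journal can be filtered for iscsiadm messages. A minimal check, assuming kubelet runs as the systemd unit named kubelet:

  # Inspect a recent window of kubelet journal entries and keep only iscsiadm messages
  journalctl -u kubelet --since "1 hour ago" | grep -i iscsiadm
  # Or follow the journal live while reproducing the issue
  journalctl -u kubelet -f | grep -i iscsiadm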

cstor-istgt container logs

  2019-02-05/15:43:30.250 worker :6088: c#0.140005771040512.: iscsi_read_pdu() EOF
  2019-02-05/15:43:30.250 sender :5852: s#0.140005666154240.: sender loop ended (0:14:43084)
  2019-02-05/15:43:30.251 worker :6292: c#0.140005771040512.: worker 0/-1/43084 end (c#0.140005771040512/s#0.140005666154240)
  2019-02-05/15:43:30.264 worker :5885: c#1.140005666154240.: con:1/16 [8d614b93:43088->10.233.45.100:3260,1]
  2019-02-05/15:43:30.531 istgt_iscsi_op_log:1923: c#1.140005666154240.: login failed, target not ready
  2019-02-05/15:43:30.782 worker :6088: c#1.140005666154240.: iscsi_read_pdu() EOF
  2019-02-05/15:43:30.782 sender :5852: s#1.140005649413888.: sender loop ended (1:16:43088)
  2019-02-05/15:43:30.783 worker :6292: c#1.140005666154240.: worker 1/-1/43088 end (c#1.140005666154240/s#1.140005649413888)
  2019-02-05/15:43:33.285 worker :5885: c#2.140005649413888.: con:2/18 [8d614b93:43092->10.233.45.100:3260,1]
  2019-02-05/15:43:33.536 istgt_iscsi_op_log:1923: c#2.140005649413888.: login failed, target not ready
  2019-02-05/15:43:33.787 worker :6088: c#2.140005649413888.: iscsi_read_pdu() EOF
  2019-02-05/15:43:33.787 sender :5852: s#2.140005632636672.: sender loop ended (2:18:43092)
  2019-02-05/15:43:33.788 worker :6292: c#2.140005649413888.: worker 2/-1/43092 end (c#2.140005649413888/s#2.140005632636672)
  2019-02-05/15:43:35.251 istgt_remove_conn :7039: c#0.140005771040512.: remove_conn->initiator:147.75.97.141(iqn.2019-02.net.packet:device.7c8ad781) Target: 10.233.109.82(dummy LU0) conn:0x7f55a4c18000:0 tsih:1 connections:0 IOPending=0
  2019-02-05/15:43:36.291 worker :5885: c#0.140005666154240.: con:0/14 [8d614b93:43094->10.233.45.100:3260,1]
  2019-02-05/15:43:36.540 istgt_iscsi_op_log:1923: c#0.140005666154240.: login failed, target not ready
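
The cstor-istgt container logs above can be collected from the cStor target pod. A sketch, assuming OpenEBS runs in the openebs namespace and the target pod and container follow the usual cStor naming (the pod name below is a placeholder):

  # cStor target pods typically have "target" in their name
  kubectl get pods -n openebs | grep target
  # Fetch the istgt container logs from the target pod
  kubectl logs <cstor-target-pod> -n openebs -c cstor-istgt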

Troubleshooting

The high memory consumption of kubelet is mainly caused by the following.

Three modules are involved: cstor-istgt, kubelet, and the iSCSI initiator (iscsiadm). kubelet runs the iscsiadm command to perform discovery on cstor-istgt. If there is any delay in receiving the response to the discovery opcode (either due to the network or to delayed processing on the target side), iscsiadm retries a few times and then gets into an infinite loop, dumping error messages such as the following:

  iscsiadm: Connection to Discovery Address 127.0.0.1 failed
  iscsiadm: failed to send SendTargets PDU
  iscsiadm: connection login retries (reopen_max) 5 exceeded
  iscsiadm: Connection to Discovery Address 127.0.0.1 failed
  iscsiadm: failed to send SendTargets PDU

kubelet keeps consuming this output, and its memory usage keeps accumulating. More details can be seen here.
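
To check whether discovery itself is slow or failing, the same SendTargets discovery that kubelet performs can be run by hand from the affected node against the target portal seen in the error messages:

  # Run an iSCSI SendTargets discovery against the cStor target portal
  iscsiadm -m discovery -t sendtargets -p 10.233.46.76:3260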

Workaround

Restart the corresponding cstor-istgt target pod to stop the excessive memory consumption.
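
A minimal way to do this, assuming OpenEBS is deployed in the openebs namespace (the pod name is a placeholder); the target deployment recreates the pod automatically:

  # Delete the cStor target pod; its deployment will spin up a replacement
  kubectl delete pod <cstor-target-pod> -n openebs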

Application and OpenEBS pods terminate/restart under heavy I/O load

This is caused by a lack of resources on the Kubernetes nodes, which leads to the pods being evicted under load as the node becomes unresponsive. The pods transition from the Running state to an Unknown state, followed by Terminating, before restarting again.

Troubleshooting

The above cause can be confirmed from the kubectl describe pod output, which displays the termination reason as NodeControllerEviction. You can get more information from kube-controller-manager.log on the Kubernetes master.
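
A quick way to confirm this, with placeholder pod and namespace names (the controller-manager log location varies by installation):

  # Look for the termination reason in the pod description
  kubectl describe pod <app-pod> -n <namespace> | grep -i -A 2 reason
  # On the Kubernetes master, search the controller-manager log for evictions
  grep -i eviction /var/log/kube-controller-manager.log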

Workaround:

You can resolve this issue by scaling up the Kubernetes cluster infrastructure resources (memory and CPU) on the nodes.
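
Before and after resizing, node pressure can be checked directly; for example (kubectl top requires metrics-server, and the node name is a placeholder):

  # Current CPU and memory usage per node
  kubectl top nodes
  # Check for MemoryPressure / DiskPressure conditions on a node
  kubectl describe node <node-name> | grep -A 7 "Conditions:"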

Others

Nodes in the cluster reboot frequently, almost every day, in openSUSE CaaS

The cluster was set up using RKE on openSUSE CaaS MicroOS with the Cilium CNI plugin. OpenEBS was installed, a PVC was created and attached to a fio job/busybox pod, and a fio test was run against it. The nodes in the cluster were observed to restart on a scheduled basis.
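
For reference, a representative fio invocation of the kind used in this reproduction, assuming the PVC is mounted at /datadir inside the fio pod (the path and job parameters are illustrative):

  # Mixed random read/write load against the OpenEBS-backed mount
  fio --name=benchtest --directory=/datadir --size=1g \
      --rw=randrw --bs=4k --ioengine=libaio --iodepth=16 \
      --numjobs=1 --time_based --runtime=120 --group_reporting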

Troubleshooting

Check the journalctl logs on each node and see whether similar entries are observed. The following log snippets show the corresponding logs from 3 nodes.

Node1:

  Apr 12 00:21:01 mos2 transactional-update[7302]: /.snapshots/8/snapshot/root/.bash_history
  Apr 12 00:21:01 mos2 transactional-update[7302]: /.snapshots/8/snapshot/var/run/reboot-needed
  Apr 12 00:21:01 mos2 transactional-update[7302]: transactional-update finished - rebooting machine
  Apr 12 00:21:01 mos2 systemd-logind[1045]: System is rebooting.
  Apr 12 00:21:01 mos2 systemd[1]: transactional-update.service: Succeeded.
  Apr 12 00:21:01 mos2 systemd[1]: Stopped Update the system.

Node2:

  01:44:19 mos3 transactional-update[17442]: other mounts and will not be visible to the system:
  Apr 12 01:44:19 mos3 transactional-update[17442]: /.snapshots/8/snapshot/root/.bash_history
  Apr 12 01:44:19 mos3 transactional-update[17442]: /.snapshots/8/snapshot/var/run/reboot-needed
  Apr 12 01:44:19 mos3 transactional-update[17442]: transactional-update finished - rebooting machine
  Apr 12 01:44:20 mos3 systemd-logind[1056]: System is rebooting.
  Apr 12 01:44:20 mos3 systemd[1]: transactional-update.service: Succeeded.
  Apr 12 01:44:20 mos3 systemd[1]: Stopped Update the system.

Node3:

  Apr 12 03:00:13 mos4 systemd[1]: snapper-timeline.service: Succeeded.
  Apr 12 03:30:00 mos4 rebootmgrd[1612]: rebootmgr: reboot triggered now!
  Apr 12 03:30:00 mos4 systemd[1]: rebootmgr.service: Succeeded.
  Apr 12 03:30:00 mos4 systemd-logind[1064]: System is rebooting.
  Apr 12 03:30:00 mos4 systemd[1]: Stopping open-vm-tools: vmtoolsd service for virtual machines hosted on VMware...
  Apr 12 03:30:00 mos4 systemd[1]: Stopped target Timers.
  Apr 12 03:30:00 mos4 systemd[1]: btrfs-scrub.timer: Succeeded.
  Apr 12 03:30:00 mos4 systemd[1]: Stopped Scrub btrfs filesystem, verify block checksums.
  Apr 12 03:30:00 mos4 systemd[1]: transactional-update.timer: Succeeded.

You can use the output of the following commands to get more detail and confirm whether the reboots are caused by transactional-update.

  systemctl status rebootmgr
  systemctl status transactional-update.timer
  cat /etc/rebootmgr.conf
  cat /usr/lib/systemd/system/transactional-update.timer
  cat /usr/etc/transactional-update.conf
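
On openSUSE MicroOS nodes the rebootmgrctl utility, where installed, can also report the reboot strategy and maintenance window directly (a sketch; subcommand availability depends on the rebootmgr version):

  # Show whether rebootmgr reboots immediately or waits for a maintenance window
  rebootmgrctl get-strategy
  # Show the configured maintenance window
  rebootmgrctl get-window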

Workaround:

There are 2 possible solutions.

Approach 1:

Do the following on each node to stop the transactional updates.

  systemctl disable --now rebootmgr.service
  systemctl disable --now transactional-update.timer

This is the preferred approach.
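
After disabling them, the state can be verified on each node:

  # Both should now report "disabled"
  systemctl is-enabled rebootmgr.service transactional-update.timer
  # Confirm the timer is no longer scheduled
  systemctl list-timers transactional-update.timer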

Approach 2:

Set the reboot timer schedule to a different time on each node, i.e. staggered at various intervals of the day, so that only one node is rebooted at a time.
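
As an illustrative sketch, one way to stagger the reboots is to keep transactional-update enabled but give each node a different rebootmgr maintenance window, so that at most one node reboots at a time (the strategy name and window times below are examples and depend on the rebootmgr version):

  # Node 1: only reboot inside a one-hour window starting at 01:00
  rebootmgrctl set-strategy maint-window
  rebootmgrctl set-window 01:00 1h
  # Node 2: use a non-overlapping window, e.g. starting at 03:00
  # rebootmgrctl set-window 03:00 1h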

See Also:

Troubleshooting Install

Troubleshooting Uninstall

Troubleshooting NDM

Troubleshooting Jiva

Troubleshooting cStor

FAQs

Seek support or help

Latest release notes
