Host-based installation

When you install Rainbond from hosts through the Web interface, problems you encounter can be resolved with the help of this document.

Installation procedure

(Figure: host-based installation flow)

The host-based installation process uses the Rainbond console and Cloud Adaptor (https://github.com/goodrain/cloud-adaptor) to install a Kubernetes cluster and then the Rainbond cluster services on top of it.

Process for installing a Kubernetes cluster

  • The installation is based on RKE, wrapped in page operations that simplify the process.

  • This process installs a complete Kubernetes cluster defined through the graphical interface; error messages that occur during installation are displayed on the page. After the installation is complete, you need to manually install the kubectl command line tool to verify that the cluster is available (see the sketch below).
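As a minimal sketch, one common way to install kubectl on a Linux amd64 node (the version below is only an example; choose one matching your cluster):

    # Download a kubectl binary and put it on the PATH
    curl -LO https://dl.k8s.io/release/v1.23.10/bin/linux/amd64/kubectl
    chmod +x kubectl
    mv kubectl /usr/local/bin/kubectl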

Process for installing the Rainbond cluster

The Rainbond components are installed on top of the Kubernetes cluster set up in the previous step. Installation process:

  • First, the Rainbond Operator is deployed, which manages the components of the Rainbond cluster.
  • All subsequent Pods are then created by the Operator in turn.

Kubernetes installation FAQ

Cluster must have at least one etcd plane host

In this case, the node IP address or SSH port you configured is incorrect, or the port is blocked by a firewall policy, so the console cannot connect to the specified node. Reconfigure the correct node IP address and SSH port, or allow the SSH port through the firewall.
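A quick way to verify connectivity from the machine running the console (the node IP 192.168.1.10 and port 22 below are only examples):

    # Check whether the SSH port is reachable
    nc -zv 192.168.1.10 22
    # Try to log in as the docker user used by the installer
    ssh -p 22 docker@192.168.1.10
    # On systems with firewalld, allow the SSH port through the firewall
    firewall-cmd --permanent --add-port=22/tcp && firewall-cmd --reload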

Another possible scenario is that the owner and group of the /home/docker/.ssh directory on the node are not the docker user. Run the following command to correct this and try again:

    chown docker:docker /home/docker/.ssh
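To confirm the ownership before and after the change, you can run, for example:

    # Show the owner and group of the directory
    ls -ld /home/docker/.ssh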

node 192.168.1.11 not found

If node 192.168.1.11 not found is displayed when you install a Kubernetes cluster on the Web page, check the kubelet logs on that node for the following error:

    $ docker logs -f kubelet
    E0329 13:07:24.125847 1061 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1:6443/api/v1/nodes\": x509: certificate has expired or is not yet valid: current time 2023-03-29T13:07:24Z is before 2023-03-29T20:24:14Z" node="192.168.1.191"
    E0329 13:07:24.141600 1061 kubelet.go:2466] "Error getting node" err="node \"192.168.1.11\" not found"
    E0329 13:07:24.242506 1061 kubelet.go:2466] "Error getting node" err="node \"192.168.1.11\" not found"

If the error message matches the above, check whether the time on all nodes is consistent. If not, synchronize the time on all nodes:

    # Synchronize time
    ntpdate -u ntp.aliyun.com
    # Write the system time to the hardware clock
    hwclock -w
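To confirm that the nodes now agree, you can compare their clocks, for example:

    # Print the current UTC time; run this on every node and compare the output
    date -u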
Tip: You are advised to restart the server after synchronizing the time.

rejected: administratively prohibited

In this case, the sshd service on the host restricts TCP forwarding. Edit the /etc/ssh/sshd_config file on the host and confirm that the following configuration exists:

    AllowTcpForwarding yes

After the modification, restart the sshd service:

    systemctl restart sshd
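To verify that the setting has taken effect, you can dump the effective sshd configuration (run as root):

    # Print the effective configuration and filter the forwarding option
    sshd -T | grep -i allowtcpforwarding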

Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node

This is usually caused by an SSH connection failure. By default, the initialization script adds an SSH key for passwordless login; you can verify it with the following commands:

    # Enter the rainbond-allinone container
    docker exec -it rainbond-allinone bash
    # Log in to the node over ssh
    ssh docker@192.168.x.x

If passwordless login fails, check the /home/docker/.ssh/authorized_keys file on the cluster node and verify that the key is the same as the one in the /root/.ssh/id_rsa.pub file in the rainbond-allinone container.
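If the keys do not match, one way to redistribute the key from inside the rainbond-allinone container (the node IP is an example):

    # Append the container's public key to the node's authorized_keys
    ssh-copy-id docker@192.168.1.10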

Failed to bring up Etcd Plane: etcd cluster is unhealthy

This error indicates that the Etcd service is unhealthy, possibly because of a repeated installation. You can view the Etcd logs to troubleshoot:

    docker logs -f etcd

Alternatively, you can choose to clear the cluster and install it again.
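You can also probe Etcd directly. On RKE-based nodes, the etcd container usually ships with a preconfigured etcdctl, so a check like the following may work (certificate flags can vary by version):

    # Check Etcd endpoint health from inside the etcd container
    docker exec etcd etcdctl endpoint health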

Check the health of Kubernetes

The Kubernetes cluster is not necessarily available after the GUI installation process completes. Determine its health status in the following ways.

Determine whether all nodes in the Kubernetes cluster are healthy by using the following command:

    # Check the nodes
    kubectl get node
    # If a node is in `NotReady` state, run the following command on that node to check the `kubelet` logs and resolve the problem based on the output
    docker logs -f kubelet

Determine whether flannel and coredns in the kube-system namespace work properly by using the following command.

    $ kubectl get pod -n kube-system
    NAME                                  READY   STATUS      RESTARTS   AGE
    coredns-8644d6bd8c-s2888              1/1     Running     5          30d
    coredns-autoscaler-74cd6f74d9-dc4vm   1/1     Running     5          30d
    kube-flannel-75nfv                    2/2     Running     14         30d
    rke-coredns-addon-deploy-job-kk7kq    0/1     Completed   0          30d
    rke-network-plugin-deploy-job-xb9bw   0/1     Completed   0          30d

If the kube-flannel or coredns Pods are in any other state, check their logs to troubleshoot:

    # Check coredns logs
    kubectl logs -f coredns-8644d6bd8c-s2888 -n kube-system
    # Check kube-flannel logs
    kubectl logs -f kube-flannel-75nfv -n kube-system -c kube-flannel

Common problems

coredns error:plugin/loop: Loop (127.0.0.1:58477 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting.

This problem indicates that coredns has hit a resolution loop and cannot work properly. Check whether the nameserver defined in the /etc/resolv.conf file on each Kubernetes node is the local loopback address 127.0.0.1. By default, coredns uses this file as its upstream DNS server; a loopback nameserver causes circular resolution and may crash the service. The solutions:

  1. Change the nameserver in /etc/resolv.conf to an available DNS server address (beware that this file may be managed by other services that overwrite your custom values).
  2. Directly modify the coredns ConfigMap so that the forward plugin points to an upstream such as 114.114.114.114 instead of "/etc/resolv.conf", as shown below.
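A sketch of option 2, assuming the default RKE naming (both the ConfigMap and the Deployment are called coredns in the kube-system namespace):

    # Open the coredns ConfigMap for editing
    kubectl edit configmap coredns -n kube-system
    # In the Corefile, replace:
    #     forward . "/etc/resolv.conf"
    # with an explicit upstream, for example:
    #     forward . 114.114.114.114
    # Then restart coredns so the change takes effect
    kubectl rollout restart deployment coredns -n kube-system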

flannel error:Failed to find any valid interface to use: failed to get default interface: Unable to find default route

This error means that the host node where flannel runs has no default route, so flannel cannot work properly. This is common in offline environments. The solution is to add a default route to the operating system.
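For example (the gateway address and interface name below are placeholders; substitute your own):

    # Add a default route via the gateway of your network
    ip route add default via 192.168.1.1 dev eth0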

The kernel version is too high or too low

Operating system kernels older than 3.10.0-514 are not supported by the Docker overlay2 storage driver.

Some kernel versions higher than 5.16 can prevent containers from being created.

It is recommended to upgrade the kernel to the kernel-lt branch, the long-term support version of the kernel.

Rainbond cluster initialization exception analysis

Prompts in the Web page

During cluster initialization, you can click 'view components' to observe the installation progress; any problems encountered are prompted on the page.

(Figure: component installation progress page)

Common problems

rainbond-operator error:open /run/flannel/subnet.env: no such file or directory

This problem means that flannel is not working properly. Refer to the previous section for how to check the corresponding log and resolve it.

Troubleshoot faults on the cluster

During the Rainbond installation process, you also need to know the status of all Pods in the rbd-system namespace.

After confirming that the Kubernetes cluster is healthy, you can check the status of each Pod in the Rainbond cluster.

  • View the status of all Rainbond components. All Rainbond components are in the rbd-system namespace:

    $ kubectl get pods -n rbd-system
    NAME                                  READY   STATUS      RESTARTS   AGE
    mysql-operator-7c858d698d-g6xvt       1/1     Running     0          3d2h
    nfs-provisioner-0                     1/1     Running     0          4d2h
    rainbond-operator-0                   2/2     Running     0          3d23h
    rbd-api-7db9df75bc-dbjn4              1/1     Running     1          4d2h
    rbd-app-ui-75c5f47d87-p5spp           1/1     Running     0          3d5h
    rbd-app-ui-migrations-6crbs           0/1     Completed   0          4d2h
    rbd-chaos-nrlpl                       1/1     Running     0          3d22h
    rbd-db-0                              2/2     Running     0          4d2h
    rbd-etcd-0                            1/1     Running     0          4d2h
    rbd-eventlog-8bd8b988-ntt6p           1/1     Running     0          4d2h
    rbd-gateway-4z9x8                     1/1     Running     0          4d2h
    rbd-hub-5c4b478d5b-j7zrf              1/1     Running     0          4d2h
    rbd-monitor-0                         1/1     Running     0          4d2h
    rbd-mq-57b4fc595b-ljsbf               1/1     Running     0          4d2h
    rbd-node-tpxjj                        1/1     Running     0          4d2h
    rbd-repo-0                            1/1     Running     0          4d2h
    rbd-webcli-5755745bbb-kmg5t           1/1     Running     0          4d2h
    rbd-worker-68c6c97ddb-p68tx           1/1     Running     3          4d2h

  • If the Pod status is not Running, view the Pod logs to locate the problem.

Example:

Viewing component logs

    kubectl logs -f <pod name> -n rbd-system

Common abnormal Pod statuses

Pending

This status usually indicates that the Pod has not yet been scheduled to a node. You can view the current Pod's events with the following command to determine why it has not been scheduled.

    kubectl describe pod <pod name> -n rbd-system

View the Events field and analyze the cause.

Waiting or ContainerCreating

This state usually indicates that the Pod is waiting or being created. If it stays in this state for a long time, view the current Pod's events with the following command.

    kubectl describe pod <pod name> -n rbd-system

View the Events field and analyze the cause.

ImagePullBackOff

This status usually indicates that an image failed to be pulled. Run the following command to check which image failed, then check whether the image exists locally.

    kubectl describe pod <pod name> -n rbd-system

View the Events field to find the image name.

Check whether the image exists locally:

    docker images | grep <image name>

CrashLoopBackOff

The CrashLoopBackOff status indicates that the container started but exited abnormally. The problem is generally in the application itself, so check the container log first.

    kubectl logs --previous <pod name> -n rbd-system

Evicted

In general, this is due to insufficient memory or disk resources. Run df -Th to check the disk usage of the Docker data directory; if usage is greater than 85%, clean up resources promptly, especially large files and unused Docker images.
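For example, assuming the default Docker data directory /var/lib/docker:

    # Check disk usage of the filesystem holding the Docker data directory
    df -Th /var/lib/docker
    # Reclaim space by removing unused images (use with care)
    docker image prune -a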

To clear a pod whose status is Evicted, use the following command:

    kubectl get pods | grep Evicted | awk '{print $1}' | xargs kubectl delete pod

FailedMount

If a volume fails to mount, check whether the required file system client is installed on all host nodes. For example, Rainbond installs NFS as cluster shared storage by default, and you might see the following error in Events: Unable to attach or mount volumes: unmounted volumes=[grdata access region-api-ssl rainbond-operator-token-xxxx]: timed out waiting for the condition. This usually means the host does not have an NFS client package such as nfs-utils or nfs-common installed.
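To install an NFS client, for example:

    # CentOS / RHEL
    yum install -y nfs-utils
    # Debian / Ubuntu
    apt-get install -y nfs-common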