Recover failing disk

YugabyteDB can be configured to use multiple storage disks by setting the --fs_data_dirs configuration option. This introduces the possibility of disk failure and recovery issues.
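As a minimal sketch of how the option is assembled, the helper below joins a list of directories into the comma-separated value that --fs_data_dirs expects. The function name and the mount points are hypothetical and chosen only for illustration.

```python
def fs_data_dirs_flag(data_dirs):
    """Build the --fs_data_dirs flag from a list of directory paths."""
    if not data_dirs:
        raise ValueError("at least one data directory is required")
    # Trim stray whitespace and reject empty entries, which would
    # otherwise produce a malformed comma-separated list.
    cleaned = [d.strip() for d in data_dirs]
    if any(not d for d in cleaned):
        raise ValueError("empty data directory path")
    return "--fs_data_dirs=" + ",".join(cleaned)


# Hypothetical mount points, one per physical disk.
print(fs_data_dirs_flag(["/mnt/d0", "/mnt/d1"]))
# prints: --fs_data_dirs=/mnt/d0,/mnt/d1
```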

Cluster replication recovery

The yb-tserver service automatically detects disk failures and attempts to spread the data from the failed disk to other healthy nodes in the cluster. In a single-zone setup with a replication factor (RF) of 3, if you start with four or more nodes, at least three nodes remain after one fails. In this case, rereplication starts automatically if a YB-TServer or disk is down for 10 minutes.

In a multi-zone setup with a replication factor (RF) of 3, YugabyteDB tries to keep one copy of the data in each zone. For automatic rereplication in this case, each zone needs at least two YB-TServers so that if one fails, its data can be rereplicated to the other. With three zones of two YB-TServers each, this means a cluster of at least six nodes.
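The node-count reasoning for both setups can be expressed as a small check. This is a simplified sketch, not YugabyteDB's actual placement logic: the function name is hypothetical, it considers only the loss of a single YB-TServer, and in the multi-zone case it assumes one replica per zone.

```python
def can_rereplicate_after_one_failure(tservers_per_zone, rf=3):
    """Return True if data can be rereplicated after any single YB-TServer fails.

    tservers_per_zone: list with the number of YB-TServers in each zone.
    """
    zones = [n for n in tservers_per_zone if n > 0]
    if len(zones) == 1:
        # Single zone: RF nodes must remain after one failure.
        return zones[0] - 1 >= rf
    # Multi-zone (assumed one replica per zone): every zone needs a
    # second YB-TServer to take over the data of a failed one.
    return len(zones) >= rf and all(n >= 2 for n in zones)


print(can_rereplicate_after_one_failure([4]))        # True: 3 nodes remain
print(can_rereplicate_after_one_failure([3]))        # False: only 2 remain
print(can_rereplicate_after_one_failure([2, 2, 2]))  # True: six-node cluster
print(can_rereplicate_after_one_failure([1, 1, 1]))  # False: no spare per zone
```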

Failed disk replacement

The steps to replace a failed disk are:

  • Stop the YB-TServer node.
  • Replace the disks that have failed.
  • Restart the yb-tserver service. On restart, the YB-TServer will see the new empty disk and start replicating tablets from other nodes.
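The steps above can be sketched as a small script. It assumes the yb-tserver service is managed by systemd under a unit named yb-tserver, which may not match your deployment. The function names are hypothetical, and the script defaults to a dry run that only prints the commands it would execute.

```python
import subprocess

SERVICE = "yb-tserver"  # assumption: systemd unit name; adjust for your setup


def replacement_plan(service=SERVICE):
    """Return the replacement steps as (description, command) pairs.

    A command of None marks a manual step.
    """
    return [
        ("Stop the YB-TServer", ["systemctl", "stop", service]),
        ("Replace the failed disks (manual step)", None),
        ("Restart the yb-tserver service", ["systemctl", "start", service]),
    ]


def run_plan(plan, dry_run=True):
    """Walk through the plan; with dry_run=True nothing is executed."""
    for description, command in plan:
        print(description)
        if command is None:
            if not dry_run:
                # Wait for the operator to finish the hardware work.
                input("Press Enter once the new disk is installed and mounted... ")
            continue
        if dry_run:
            print("  would run:", " ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_plan(replacement_plan())  # dry run by default
```

Calling run_plan(replacement_plan(), dry_run=False) would execute the commands and typically requires root privileges.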