Maintaining a cluster’s disk subsystem

Tackling cluster failures

The cluster may become inoperable for a number of reasons:

  • The number of failed disks or nodes may exceed what the failure model tolerates, which completely stops data reads and writes to the affected storage group.

  • An unbalanced workload across disks can significantly increase request processing latency. Load balancing methods are described in a separate article.

  • Writes can also stop if multiple physical disks run out of space. To resolve this, free up space or expand the cluster by adding block store volumes.
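
The three causes listed above can be illustrated with a small, self-contained check. This is only a sketch under stated assumptions: `DiskStat`, `diagnose`, and the threshold values are hypothetical names and numbers chosen for illustration, not part of YDB or its tooling, and the input would have to come from whatever monitoring you already use.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DiskStat:
    """Hypothetical per-disk snapshot; not a YDB data structure."""
    disk_id: str
    used_fraction: float  # share of capacity in use, 0.0 to 1.0
    load: float           # share of time the disk is busy, 0.0 to 1.0
    healthy: bool = True


def diagnose(disks, tolerated_failures, full_threshold=0.9, skew_threshold=0.25):
    """Return warnings for the three failure causes described above.

    tolerated_failures is an assumed input: how many disk failures
    the configured failure model is expected to survive.
    """
    warnings = []

    # Cause 1: more failures than the failure model tolerates.
    failed = [d.disk_id for d in disks if not d.healthy]
    if len(failed) > tolerated_failures:
        warnings.append(f"beyond failure model: {len(failed)} failed disks")

    alive = [d for d in disks if d.healthy]
    if alive:
        # Cause 2: some disks are much busier or idler than the average.
        avg_load = mean(d.load for d in alive)
        skewed = [d.disk_id for d in alive
                  if abs(d.load - avg_load) > skew_threshold]
        if skewed:
            warnings.append("unbalanced load: " + ", ".join(skewed))

        # Cause 3: multiple disks are close to running out of space.
        full = [d.disk_id for d in alive if d.used_fraction >= full_threshold]
        if len(full) > 1:
            warnings.append("out of space: " + ", ".join(full))

    return warnings
```

For example, calling `diagnose` with two nearly full healthy disks and one failed disk, with `tolerated_failures=0`, returns a failure-model warning and an out-of-space warning. The thresholds are arbitrary defaults and should be tuned to the actual cluster.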

Taking nodes out of the cluster without coordination can cause the issues described above. To prevent them, make sure to drain nodes correctly before maintenance.

Enabling Scrubbing and SelfHeal is also a good preventive measure.

Editing the cluster configuration

A YDB cluster lets you:

  • Expand the cluster with block store volumes and nodes.
  • Configure the actor system on your nodes.
  • Edit configs via CMS.
  • Add new storage groups.