Data replica management

Beginning with version 0.9.0, Doris introduced an optimized replica management strategy and supported a richer replica status viewing tool. This document focuses on Doris data replica balancing, repair scheduling strategies, and replica management operations and maintenance methods. Help users to more easily master and manage the replica status in the cluster.

Repairing and balancing copies of tables with Collocation attributes can be referred to docs/documentation/cn/administrator-guide/colocation-join.md'.

Noun Interpretation

  1. Tablet: The logical fragmentation of a Doris table, where a table has multiple fragmentations.
  2. Replica: A sliced copy, defaulting to three copies of a slice.
  3. Healthy Replica: A healthy copy that survives at Backend and has a complete version.
  4. Tablet Checker (TC): A resident background thread that scans all Tablets regularly, checks the status of these Tablets, and decides whether to send them to Tablet Scheduler based on the results.
  5. Tablet Scheduler (TS): A resident background thread that handles Tablets sent by Tablet Checker that need to be repaired. At the same time, cluster replica balancing will be carried out.
  6. Tablet SchedCtx (TSC): is a tablet encapsulation. When TC chooses a tablet, it encapsulates it as a TSC and sends it to TS.
  7. Storage Medium: Storage medium. Doris supports specifying different storage media for partition granularity, including SSD and HDD. The replica scheduling strategy is also scheduled for different storage media.
  1. +--------+ +-----------+
  2. | Meta | | Backends |
  3. +---^----+ +------^----+
  4. | | | 3. Send clone tasks
  5. 1. Check tablets | | |
  6. +--------v------+ +-----------------+
  7. | TabletChecker +--------> TabletScheduler |
  8. +---------------+ +-----------------+
  9. 2. Waiting to be scheduled

The figure above is a simplified workflow.

Duplicate status

Multiple copies of a Tablet may cause state inconsistencies due to certain circumstances. Doris will attempt to automatically fix the inconsistent copies of these states so that the cluster can recover from the wrong state as soon as possible.

The health status of a Replica is as follows:

  1. BAD

    That is, the copy is damaged. Includes, but is not limited to, the irrecoverable damaged status of copies caused by disk failures, BUGs, etc.

  2. VERSION_MISSING

    Version missing. Each batch of imports in Doris corresponds to a data version. A copy of the data consists of several consecutive versions. However, due to import errors, delays and other reasons, the data version of some copies may be incomplete.

  3. HEALTHY

    Health copy. That is, a copy of the normal data, and the BE node where the copy is located is in a normal state (heartbeat is normal and not in the offline process).

The health status of a Tablet is determined by the status of all its copies. There are the following categories:

  1. REPLICA_MISSING

    The copy is missing. That is, the number of surviving copies is less than the expected number of copies.

  2. VERSION_INCOMPLETE

    The number of surviving copies is greater than or equal to the number of expected copies, but the number of healthy copies is less than the number of expected copies.

  3. REPLICA_RELOCATING

    Have a full number of live copies of the replication num version, but the BE nodes where some copies are located are in unavailable state (such as decommission)

  4. REPLICA_MISSING_IN_CLUSTER

    When using multi-cluster, the number of healthy replicas is greater than or equal to the expected number of replicas, but the number of replicas in the corresponding cluster is less than the expected number of replicas.

  5. REDUNDANT

    Duplicate redundancy. Healthy replicas are in the corresponding cluster, but the number of replicas is larger than the expected number. Or there’s a spare copy of unavailable.

  6. FORCE_REDUNDANT

    This is a special state. It only occurs when the number of expected replicas is greater than or equal to the number of available nodes, and when the Tablet is in the state of replica missing. In this case, you need to delete a copy first to ensure that there are available nodes for creating a new copy.

  7. COLOCATE_MISMATCH

    Fragmentation status of tables for Collocation attributes. Represents that the distribution of fragmented copies is inconsistent with the specified distribution of Colocation Group.

  8. COLOCATE_REDUNDANT

    Fragmentation status of tables for Collocation attributes. Represents the fragmented copy redundancy of the Colocation table.

  9. HEALTHY

    Healthy fragmentation, that is, conditions [1-5] are not satisfied.

Replica Repair

As a resident background process, Tablet Checker regularly checks the status of all fragments. For unhealthy fragmentation, it will be sent to Tablet Scheduler for scheduling and repair. The actual operation of repair is accomplished by clone task on BE. FE is only responsible for generating these clone tasks.

Note 1: The main idea of replica repair is to make the number of fragmented replicas reach the desired value by creating or completing them first. Then delete the redundant copy.

Note 2: A clone task is to complete the process of copying specified data from a specified remote end to a specified destination.

For different states, we adopt different repair methods:

  1. REPLICA_MISSING/REPLICA_RELOCATING

    Select a low-load, available BE node as the destination. Choose a healthy copy as the source. Clone tasks copy a complete copy from the source to the destination. For replica completion, we will directly select an available BE node, regardless of the storage medium.

  2. VERSION_INCOMPLETE

    Select a relatively complete copy as the destination. Choose a healthy copy as the source. The clone task attempts to copy the missing version from the source to the destination.

  3. REPLICA_MISSING_IN_CLUSTER

    This state processing method is the same as REPLICAMISSING.

  4. REDUNDANT

    Usually, after repair, there will be redundant copies in fragmentation. We select a redundant copy to delete it. The selection of redundant copies follows the following priorities:

    1. The BE where the copy is located has been offline.
    2. The copy is damaged
    3. The copy is lost in BE or offline
    4. The replica is in the CLONE state (which is an intermediate state during clone task execution)
    5. The copy has version missing
    6. The cluster where the copy is located is incorrect
    7. The BE node where the replica is located has a high load
  5. FORCE_REDUNDANT

    Unlike REDUNDANT, because at this point Tablet has a copy missing, because there are no additional available nodes for creating new copies. So at this point, a copy must be deleted to free up a available node for creating a new copy. The order of deleting copies is the same as REDUNDANT.

  6. COLOCATE_MISMATCH

    Select one of the replica distribution BE nodes specified in Colocation Group as the destination node for replica completion.

  7. COLOCATE_REDUNDANT

    Delete a copy on a BE node that is distributed by a copy specified in a non-Colocation Group.

    Doris does not deploy a copy of the same Tablet on a different BE of the same host when selecting a replica node. It ensures that even if all BEs on the same host are deactivated, all copies will not be lost.

Scheduling priority

Waiting for the scheduled fragments in Tablet Scheduler gives different priorities depending on the status. High priority fragments will be scheduled first. There are currently several priorities.

  1. VERY_HIGH

    • REDUNDANT. For slices with duplicate redundancy, we give priority to them. Logically, duplicate redundancy is the least urgent, but because it is the fastest to handle and can quickly release resources (such as disk space, etc.), we give priority to it.
    • FORCE_REDUNDANT. Ditto.
  2. HIGH

    • REPLICA_MISSING and most copies are missing (for example, 2 copies are missing in 3 copies)
    • VERSION_INCOMPLETE and most copies are missing
    • COLOCATE_MISMATCH We hope that the fragmentation related to the Collocation table can be repaired as soon as possible.
    • COLOCATE_REDUNDANT
  3. NORMAL

    • REPLICA_MISSING, but most survive (for example, three copies lost one)
    • VERSION_INCOMPLETE, but most copies are complete
    • REPLICA_RELOCATING and relocate is required for most replicas (e.g. 3 replicas with 2 replicas)
  4. LOW

    • REPLICA_MISSING_IN_CLUSTER
    • REPLICA_RELOCATING most copies stable

Manual priority

The system will automatically determine the scheduling priority. Sometimes, however, users want the fragmentation of some tables or partitions to be repaired faster. So we provide a command that the user can specify that a slice of a table or partition is repaired first:

ADMIN REPAIR TABLE tbl [PARTITION (p1, p2, ...)];

This command tells TC to give VERY HIGH priority to the problematic tables or partitions that need to be repaired first when scanning Tablets.

Note: This command is only a hint, which does not guarantee that the repair will be successful, and the priority will change with the scheduling of TS. And when Master FE switches or restarts, this information will be lost.

Priority can be cancelled by the following commands:

ADMIN CANCEL REPAIR TABLE tbl [PARTITION (p1, p2, ...)];

Priority scheduling

Priority ensures that severely damaged fragments can be repaired first, and improves system availability. But if the high priority repair task fails all the time, the low priority task will never be scheduled. Therefore, we will dynamically adjust the priority of tasks according to the running status of tasks, so as to ensure that all tasks have the opportunity to be scheduled.

  • If the scheduling fails for five consecutive times (e.g., no resources can be obtained, no suitable source or destination can be found, etc.), the priority will be lowered.
  • If not scheduled for 30 minutes, priority will be raised.
  • The priority of the same tablet task is adjusted at least five minutes apart.

At the same time, in order to ensure the weight of the initial priority, we stipulate that the initial priority is VERY HIGH, and the lowest is lowered to NORMAL. When the initial priority is LOW, it is raised to HIGH at most. The priority adjustment here also adjusts the priority set manually by the user.

Duplicate Equilibrium

Doris automatically balances replicas within the cluster. The main idea of balancing is to create a replica of some fragments on low-load nodes, and then delete the replicas of these fragments on high-load nodes. At the same time, because of the existence of different storage media, there may or may not exist one or two storage media on different BE nodes in the same cluster. We require that fragments of storage medium A be stored in storage medium A as far as possible after equalization. So we divide the BE nodes of the cluster according to the storage medium. Then load balancing scheduling is carried out for different BE node sets of storage media.

Similarly, replica balancing ensures that a copy of the same table will not be deployed on the BE of the same host.

BE Node Load

We use Cluster LoadStatistics (CLS) to represent the load balancing of each backend in a cluster. Tablet Scheduler triggers cluster equilibrium based on this statistic. We currently calculate a load Score for each BE as the BE load score by using disk usage and number of copies. The higher the score, the heavier the load on the BE.

Disk usage and number of copies have a weight factor, which is capacityCoefficient and replicaNumCoefficient, respectively. The sum of them is constant to 1. Among them, capacityCoefficient will dynamically adjust according to actual disk utilization. When the overall disk utilization of a BE is below 50%, the capacityCoefficient value is 0.5, and if the disk utilization is above 75% (configurable through the FE configuration item capacity_used_percent_high_water), the value is 1. If the utilization rate is between 50% and 75%, the weight coefficient increases smoothly. The formula is as follows:

capacityCoefficient = 2 * Disk Utilization - 0.5

The weight coefficient ensures that when disk utilization is too high, the backend load score will be higher to ensure that the BE load is reduced as soon as possible.

Tablet Scheduler updates CLS every 1 minute.

Equilibrium strategy

Tablet Scheduler uses Load Balancer to select a certain number of healthy fragments as candidate fragments for balance in each round of scheduling. In the next scheduling, balanced scheduling will be attempted based on these candidate fragments.

Resource control

Both replica repair and balancing are accomplished by replica copies between BEs. If the same BE performs too many tasks at the same time, it will bring a lot of IO pressure. Therefore, Doris controls the number of tasks that can be performed on each node during scheduling. The smallest resource control unit is the disk (that is, a data path specified in be.conf). By default, we configure two slots per disk for replica repair. A clone task occupies one slot at the source and one slot at the destination. If the number of slots is zero, no more tasks will be assigned to this disk. The number of slots can be configured by FE’s schedule_slot_num_per_path parameter.

In addition, by default, we provide two separate slots per disk for balancing tasks. The purpose is to prevent high-load nodes from losing space by balancing because slots are occupied by repair tasks.

Duplicate Status View

Duplicate status view mainly looks at the status of the duplicate, as well as the status of the duplicate repair and balancing tasks. Most of these states exist only in Master FE nodes. Therefore, the following commands need to be executed directly to Master FE.

Duplicate status

  1. Global state checking

    Through SHOW PROC'/ statistic';commands can view the replica status of the entire cluster.

    1. +----------+-----------------------------+----------+--------------+----------+-----------+------------+--------------------+-----------------------+
    2. | DbId | DbName | TableNum | PartitionNum | IndexNum | TabletNum | ReplicaNum | UnhealthyTabletNum | InconsistentTabletNum |
    3. +----------+-----------------------------+----------+--------------+----------+-----------+------------+--------------------+-----------------------+
    4. | 35153636 | default_cluster:DF_Newrisk | 3 | 3 | 3 | 96 | 288 | 0 | 0 |
    5. | 48297972 | default_cluster:PaperData | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
    6. | 5909381 | default_cluster:UM_TEST | 7 | 7 | 10 | 320 | 960 | 1 | 0 |
    7. | Total | 240 | 10 | 10 | 13 | 416 | 1248 | 1 | 0 |
    8. +----------+-----------------------------+----------+--------------+----------+-----------+------------+--------------------+-----------------------+

    The UnhealthyTabletNum column shows how many Tablets are in an unhealthy state in the corresponding database. The Inconsistent Tablet Num column shows how many Tablets are in an inconsistent replica state in the corresponding database. The last Total line counts the entire cluster. Normally Unhealth Tablet Num and Inconsistent Tablet Num should be 0. If it’s not zero, you can further see which Tablets are there. As shown in the figure above, one table in the UM_TEST database is not healthy, you can use the following command to see which one is.

    SHOW PROC '/statistic/5909381';

    Among them `5909381’is the corresponding DbId.

    1. +------------------+---------------------+
    2. | UnhealthyTablets | InconsistentTablets |
    3. +------------------+---------------------+
    4. | [40467980] | [] |
    5. +------------------+---------------------+

    The figure above shows the specific unhealthy Tablet ID (40467980). Later we’ll show you how to view the status of each copy of a specific Tablet.

  2. Table (partition) level status checking

    Users can view the status of a copy of a specified table or partition through the following commands and filter the status through a WHERE statement. If you look at table tbl1, the state on partitions P1 and P2 is a copy of NORMAL:

    ADMIN SHOW REPLICA STATUS FROM tbl1 PARTITION (p1, p2) WHERE STATUS = "NORMAL";

    1. +----------+-----------+-----------+---------+-------------------+--------------------+------------------+------------+------------+-------+--------+--------+
    2. | TabletId | ReplicaId | BackendId | Version | LastFailedVersion | LastSuccessVersion | CommittedVersion | SchemaHash | VersionNum | IsBad | State | Status |
    3. +----------+-----------+-----------+---------+-------------------+--------------------+------------------+------------+------------+-------+--------+--------+
    4. | 29502429 | 29502432 | 10006 | 2 | -1 | 2 | 1 | -1 | 2 | false | NORMAL | OK |
    5. | 29502429 | 36885996 | 10002 | 2 | -1 | -1 | 1 | -1 | 2 | false | NORMAL | OK |
    6. | 29502429 | 48100551 | 10007 | 2 | -1 | -1 | 1 | -1 | 2 | false | NORMAL | OK |
    7. | 29502433 | 29502434 | 10001 | 2 | -1 | 2 | 1 | -1 | 2 | false | NORMAL | OK |
    8. | 29502433 | 44900737 | 10004 | 2 | -1 | -1 | 1 | -1 | 2 | false | NORMAL | OK |
    9. | 29502433 | 48369135 | 10006 | 2 | -1 | -1 | 1 | -1 | 2 | false | NORMAL | OK |
    10. +----------+-----------+-----------+---------+-------------------+--------------------+------------------+------------+------------+-------+--------+--------+

    The status of all copies is shown here. Where IsBad is listed as true, the copy is damaged. The Status column displays other states. Specific status description, you can see help through HELP ADMIN SHOW REPLICA STATUS.

    The ADMIN SHOW REPLICA STATUScommand is mainly used to view the health status of copies. Users can also view additional information about copies of a specified table by using the following commands:

    SHOW TABLET FROM tbl1;

    1. +----------+-----------+-----------+------------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+----------+----------+--------+-------------------------+--------------+----------------------+--------------+----------------------+----------------------+----------------------+
    2. | TabletId | ReplicaId | BackendId | SchemaHash | Version | VersionHash | LstSuccessVersion | LstSuccessVersionHash | LstFailedVersion | LstFailedVersionHash | LstFailedTime | DataSize | RowCount | State | LstConsistencyCheckTime | CheckVersion | CheckVersionHash | VersionCount | PathHash | MetaUrl | CompactionStatus |
    3. +----------+-----------+-----------+------------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+----------+----------+--------+-------------------------+--------------+----------------------+--------------+----------------------+----------------------+----------------------+
    4. | 29502429 | 29502432 | 10006 | 1421156361 | 2 | 0 | 2 | 0 | -1 | 0 | N/A | 784 | 0 | NORMAL | N/A | -1 | -1 | 2 | -5822326203532286804 | url | url |
    5. | 29502429 | 36885996 | 10002 | 1421156361 | 2 | 0 | -1 | 0 | -1 | 0 | N/A | 784 | 0 | NORMAL | N/A | -1 | -1 | 2 | -1441285706148429853 | url | url |
    6. | 29502429 | 48100551 | 10007 | 1421156361 | 2 | 0 | -1 | 0 | -1 | 0 | N/A | 784 | 0 | NORMAL | N/A | -1 | -1 | 2 | -4784691547051455525 | url | url |
    7. +----------+-----------+-----------+------------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+----------+----------+--------+-------------------------+--------------+----------------------+--------------+----------------------+----------------------+----------------------+

    The figure above shows some additional information, including copy size, number of rows, number of versions, where the data path is located.

    Note: The contents of the `State’column shown here do not represent the health status of the replica, but the status of the replica under certain tasks, such as CLONE, SCHEMA CHANGE, ROLLUP, etc.

    In addition, users can check the distribution of replicas in a specified table or partition by following commands.

    ADMIN SHOW REPLICA DISTRIBUTION FROM tbl1;

    1. +-----------+------------+-------+---------+
    2. | BackendId | ReplicaNum | Graph | Percent |
    3. +-----------+------------+-------+---------+
    4. | 10000 | 7 | | 7.29 % |
    5. | 10001 | 9 | | 9.38 % |
    6. | 10002 | 7 | | 7.29 % |
    7. | 10003 | 7 | | 7.29 % |
    8. | 10004 | 9 | | 9.38 % |
    9. | 10005 | 11 | > | 11.46 % |
    10. | 10006 | 18 | > | 18.75 % |
    11. | 10007 | 15 | > | 15.62 % |
    12. | 10008 | 13 | > | 13.54 % |
    13. +-----------+------------+-------+---------+

    Here we show the number and percentage of replicas of table tbl1 on each BE node, as well as a simple graphical display.

  3. Tablet level status checking

    When we want to locate a specific Tablet, we can use the following command to view the status of a specific Tablet. For example, check the tablet with ID 2950253:

    SHOW TABLET 29502553;

    1. +------------------------+-----------+---------------+-----------+----------+----------+-------------+----------+--------+---------------------------------------------------------------------------+
    2. | DbName | TableName | PartitionName | IndexName | DbId | TableId | PartitionId | IndexId | IsSync | DetailCmd |
    3. +------------------------+-----------+---------------+-----------+----------+----------+-------------+----------+--------+---------------------------------------------------------------------------+
    4. | default_cluster:test | test | test | test | 29502391 | 29502428 | 29502427 | 29502428 | true | SHOW PROC '/dbs/29502391/29502428/partitions/29502427/29502428/29502553'; |
    5. +------------------------+-----------+---------------+-----------+----------+----------+-------------+----------+--------+---------------------------------------------------------------------------+

    The figure above shows the database, tables, partitions, roll-up tables and other information corresponding to this tablet. The user can copy the command in the DetailCmd command to continue executing:

    Show Proc'/DBS/29502391/29502428/Partitions/29502427/29502428/29502553;

    1. +-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+--------------+----------------------+
    2. | ReplicaId | BackendId | Version | VersionHash | LstSuccessVersion | LstSuccessVersionHash | LstFailedVersion | LstFailedVersionHash | LstFailedTime | SchemaHash | DataSize | RowCount | State | IsBad | VersionCount | PathHash |
    3. +-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+--------------+----------------------+
    4. | 43734060 | 10004 | 2 | 0 | -1 | 0 | -1 | 0 | N/A | -1 | 784 | 0 | NORMAL | false | 2 | -8566523878520798656 |
    5. | 29502555 | 10002 | 2 | 0 | 2 | 0 | -1 | 0 | N/A | -1 | 784 | 0 | NORMAL | false | 2 | 1885826196444191611 |
    6. | 39279319 | 10007 | 2 | 0 | -1 | 0 | -1 | 0 | N/A | -1 | 784 | 0 | NORMAL | false | 2 | 1656508631294397870 |
    7. +-----------+-----------+---------+-------------+-------------------+-----------------------+------------------+----------------------+---------------+------------+----------+----------+--------+-------+--------------+----------------------+

    The figure above shows all replicas of the corresponding Tablet. The content shown here is the same as SHOW TABLET FROM tbl1;. But here you can clearly see the status of all copies of a specific Tablet.

Duplicate Scheduling Task

  1. View tasks waiting to be scheduled

    SHOW PROC '/cluster_balance/pending_tablets';

    1. +----------+--------+-----------------+---------+----------+----------+-------+---------+--------+----------+---------+---------------------+---------------------+---------------------+----------+------+-------------+---------------+---------------------+------------+---------------------+--------+---------------------+-------------------------------+
    2. | TabletId | Type | Status | State | OrigPrio | DynmPrio | SrcBe | SrcPath | DestBe | DestPath | Timeout | Create | LstSched | LstVisit | Finished | Rate | FailedSched | FailedRunning | LstAdjPrio | VisibleVer | VisibleVerHash | CmtVer | CmtVerHash | ErrMsg |
    3. +----------+--------+-----------------+---------+----------+----------+-------+---------+--------+----------+---------+---------------------+---------------------+---------------------+----------+------+-------------+---------------+---------------------+------------+---------------------+--------+---------------------+-------------------------------+
    4. | 4203036 | REPAIR | REPLICA_MISSING | PENDING | HIGH | LOW | -1 | -1 | -1 | -1 | 0 | 2019-02-21 15:00:20 | 2019-02-24 11:18:41 | 2019-02-24 11:18:41 | N/A | N/A | 2 | 0 | 2019-02-21 15:00:43 | 1 | 0 | 2 | 0 | unable to find source replica |
    5. +----------+--------+-----------------+---------+----------+----------+-------+---------+--------+----------+---------+---------------------+---------------------+---------------------+----------+------+-------------+---------------+---------------------+------------+---------------------+--------+---------------------+-------------------------------+

    The specific meanings of each column are as follows:

    • TabletId: The ID of the Tablet waiting to be scheduled. A scheduling task is for only one Tablet
    • Type: Task type, which can be REPAIR (repair) or BALANCE (balance)
    • Status: The current status of the Tablet, such as REPLICAMISSING (copy missing)
    • State: The status of the scheduling task may be PENDING/RUNNING/FINISHED/CANCELLED/TIMEOUT/UNEXPECTED
    • OrigPrio: Initial Priority
    • DynmPrio: Current dynamically adjusted priority
    • SrcBe: ID of the BE node at the source end
    • SrcPath: hash value of the path of the BE node at the source end
    • DestBe: ID of destination BE node
    • DestPath: hash value of the path of the destination BE node
    • Timeout: When the task is scheduled successfully, the timeout time of the task is displayed here in units of seconds.
    • Create: The time when the task was created
    • LstSched: The last time a task was scheduled
    • LstVisit: The last time a task was accessed. Here “accessed” refers to the processing time points associated with the task, including scheduling, task execution reporting, and so on.
    • Finished: Task End Time
    • Rate: Clone Task Data Copy Rate
    • Failed Sched: Number of Task Scheduling Failures
    • Failed Running: Number of task execution failures
    • LstAdjPrio: Time of last priority adjustment
    • CmtVer/CmtVerHash/VisibleVer/VisibleVerHash: version information for clone tasks
    • ErrMsg: Error messages that occur when tasks are scheduled and run
  2. View running tasks

    SHOW PROC '/cluster_balance/running_tablets';

    The columns in the result have the same meaning as pending_tablets.

  3. View completed tasks

    SHOW PROC '/cluster_balance/history_tablets';

    By default, we reserve only the last 1,000 completed tasks. The columns in the result have the same meaning as pending_tablets. If State is listed as FINISHED, the task is normally completed. For others, you can see the specific reason based on the error information in the ErrMsg column.

Viewing Cluster Load and Scheduling Resources

  1. Cluster load

    You can view the current load of the cluster by following commands:

    SHOW PROC '/cluster_balance/cluster_load_stat';

    First of all, we can see the division of different storage media:

    1. +---------------+
    2. | StorageMedium |
    3. +---------------+
    4. | HDD |
    5. | SSD |
    6. +---------------+

    Click on a storage medium to see the equilibrium state of the BE node that contains the storage medium:

    SHOW PROC '/cluster_balance/cluster_load_stat/HDD';

    1. +----------+-----------------+-----------+---------------+----------------+-------------+------------+----------+-----------+--------------------+-------+
    2. | BeId | Cluster | Available | UsedCapacity | Capacity | UsedPercent | ReplicaNum | CapCoeff | ReplCoeff | Score | Class |
    3. +----------+-----------------+-----------+---------------+----------------+-------------+------------+----------+-----------+--------------------+-------+
    4. | 10003 | default_cluster | true | 3477875259079 | 19377459077121 | 17.948 | 493477 | 0.5 | 0.5 | 0.9284678149967587 | MID |
    5. | 10002 | default_cluster | true | 3607326225443 | 19377459077121 | 18.616 | 496928 | 0.5 | 0.5 | 0.948660871419998 | MID |
    6. | 10005 | default_cluster | true | 3523518578241 | 19377459077121 | 18.184 | 545331 | 0.5 | 0.5 | 0.9843539990641831 | MID |
    7. | 10001 | default_cluster | true | 3535547090016 | 19377459077121 | 18.246 | 558067 | 0.5 | 0.5 | 0.9981869446537612 | MID |
    8. | 10006 | default_cluster | true | 3636050364835 | 19377459077121 | 18.764 | 547543 | 0.5 | 0.5 | 1.0011489897614072 | MID |
    9. | 10004 | default_cluster | true | 3506558163744 | 15501967261697 | 22.620 | 468957 | 0.5 | 0.5 | 1.0228319835582569 | MID |
    10. | 10007 | default_cluster | true | 4036460478905 | 19377459077121 | 20.831 | 551645 | 0.5 | 0.5 | 1.057279369420761 | MID |
    11. | 10000 | default_cluster | true | 4369719923760 | 19377459077121 | 22.551 | 547175 | 0.5 | 0.5 | 1.0964036415787461 | MID |
    12. +----------+-----------------+-----------+---------------+----------------+-------------+------------+----------+-----------+--------------------+-------+

    Some of these columns have the following meanings:

    • Available: True means that BE heartbeat is normal and not offline.
    • UsedCapacity: Bytes, the size of disk space used on BE
    • Capacity: Bytes, the total disk space size on BE
    • UsedPercent: Percentage, disk space utilization on BE
    • ReplicaNum: Number of copies on BE
    • CapCoeff/ReplCoeff: Weight Coefficient of Disk Space and Copy Number
    • Score: Load score. The higher the score, the heavier the load.
    • Class: Classified by load, LOW/MID/HIGH. Balanced scheduling moves copies from high-load nodes to low-load nodes

    Users can further view the utilization of each path on a BE, such as the BE with ID 10001:

    SHOW PROC '/cluster_balance/cluster_load_stat/HDD/10001';

    1. +------------------+------------------+---------------+---------------+---------+--------+----------------------+
    2. | RootPath | DataUsedCapacity | AvailCapacity | TotalCapacity | UsedPct | State | PathHash |
    3. +------------------+------------------+---------------+---------------+---------+--------+----------------------+
    4. | /home/disk4/palo | 498.757 GB | 3.033 TB | 3.525 TB | 13.94 % | ONLINE | 4883406271918338267 |
    5. | /home/disk3/palo | 704.200 GB | 2.832 TB | 3.525 TB | 19.65 % | ONLINE | -5467083960906519443 |
    6. | /home/disk1/palo | 512.833 GB | 3.007 TB | 3.525 TB | 14.69 % | ONLINE | -7733211489989964053 |
    7. | /home/disk2/palo | 881.955 GB | 2.656 TB | 3.525 TB | 24.65 % | ONLINE | 4870995507205544622 |
    8. | /home/disk5/palo | 694.992 GB | 2.842 TB | 3.525 TB | 19.36 % | ONLINE | 1916696897889786739 |
    9. +------------------+------------------+---------------+---------------+---------+--------+----------------------+

    The disk usage of each data path on the specified BE is shown here.

  2. Scheduling resources

    Users can view the current slot usage of each node through the following commands:

    SHOW PROC '/cluster_balance/working_slots';

    1. +----------+----------------------+------------+------------+-------------+----------------------+
    2. | BeId | PathHash | AvailSlots | TotalSlots | BalanceSlot | AvgRate |
    3. +----------+----------------------+------------+------------+-------------+----------------------+
    4. | 10000 | 8110346074333016794 | 2 | 2 | 2 | 2.459007474009069E7 |
    5. | 10000 | -5617618290584731137 | 2 | 2 | 2 | 2.4730105014001578E7 |
    6. | 10001 | 4883406271918338267 | 2 | 2 | 2 | 1.6711402709780257E7 |
    7. | 10001 | -5467083960906519443 | 2 | 2 | 2 | 2.7540126380326536E7 |
    8. | 10002 | 9137404661108133814 | 2 | 2 | 2 | 2.417217089806745E7 |
    9. | 10002 | 1885826196444191611 | 2 | 2 | 2 | 1.6327378456676323E7 |
    10. +----------+----------------------+------------+------------+-------------+----------------------+

    In this paper, data path is used as granularity to show the current use of slot. Among them, `AvgRate’is the copy rate of clone task in bytes/seconds on the path of historical statistics.

  3. Priority repair view

    The following command allows you to view the priority repaired tables or partitions set by the `ADMIN REPAIR TABLE’command.

    SHOW PROC '/cluster_balance/priority_repair';

    Among them, `Remaining TimeMs’indicates that these priority fixes will be automatically removed from the priority fix queue after this time. In order to prevent resources from being occupied due to the failure of priority repair.

Scheduler Statistical Status View

We have collected some statistics of Tablet Checker and Tablet Scheduler during their operation, which can be viewed through the following commands:

SHOW PROC '/cluster_balance/sched_stat';

  1. +---------------------------------------------------+-------------+
  2. | Item | Value |
  3. +---------------------------------------------------+-------------+
  4. | num of tablet check round | 12041 |
  5. | cost of tablet check(ms) | 7162342 |
  6. | num of tablet checked in tablet checker | 18793506362 |
  7. | num of unhealthy tablet checked in tablet checker | 7043900 |
  8. | num of tablet being added to tablet scheduler | 1153 |
  9. | num of tablet schedule round | 49538 |
  10. | cost of tablet schedule(ms) | 49822 |
  11. | num of tablet being scheduled | 4356200 |
  12. | num of tablet being scheduled succeeded | 320 |
  13. | num of tablet being scheduled failed | 4355594 |
  14. | num of tablet being scheduled discard | 286 |
  15. | num of tablet priority upgraded | 0 |
  16. | num of tablet priority downgraded | 1096 |
  17. | num of clone task | 230 |
  18. | num of clone task succeeded | 228 |
  19. | num of clone task failed | 2 |
  20. | num of clone task timeout | 2 |
  21. | num of replica missing error | 4354857 |
  22. | num of replica version missing error | 967 |
  23. | num of replica relocating | 0 |
  24. | num of replica redundant error | 90 |
  25. | num of replica missing in cluster error | 0 |
  26. | num of balance scheduled | 0 |
  27. +---------------------------------------------------+-------------+

The meanings of each line are as follows:

  • num of tablet check round:Tablet Checker 检查次数
  • cost of tablet check(ms):Tablet Checker 检查总耗时
  • num of tablet checked in tablet checker:Tablet Checker 检查过的 tablet 数量
  • num of unhealthy tablet checked in tablet checker:Tablet Checker 检查过的不健康的 tablet 数量
  • num of tablet being added to tablet scheduler:被提交到 Tablet Scheduler 中的 tablet 数量
  • num of tablet schedule round:Tablet Scheduler 运行次数
  • cost of tablet schedule(ms):Tablet Scheduler 运行总耗时
  • num of tablet being scheduled:被调度的 Tablet 总数量
  • num of tablet being scheduled succeeded:被成功调度的 Tablet 总数量
  • num of tablet being scheduled failed:调度失败的 Tablet 总数量
  • num of tablet being scheduled discard:调度失败且被抛弃的 Tablet 总数量
  • num of tablet priority upgraded:优先级上调次数
  • num of tablet priority downgraded:优先级下调次数
  • num of clone task: number of clone tasks generated
  • num of clone task succeeded:clone 任务成功的数量
  • num of clone task failed:clone 任务失败的数量
  • num of clone task timeout:clone 任务超时的数量
  • num of replica missing error: the number of tablets whose status is checked is the missing copy
  • num of replica version missing error:检查的状态为版本缺失的 tablet 的数量(该统计值包括了 num of replica relocating 和 num of replica missing in cluster error) *num of replica relocation 29366; 24577;*replica relocation tablet *
  • num of replica redundant error: Number of tablets whose checked status is replica redundant
  • num of replica missing in cluster error:检查的状态为不在对应 cluster 的 tablet 的数量
  • num of balance scheduled:均衡调度的次数

Note: The above states are only historical accumulative values. We also print these statistics regularly in the FE logs, where the values in parentheses represent the number of changes in each statistical value since the last printing dependence of the statistical information.

Relevant configuration instructions

Adjustable parameters

The following adjustable parameters are all configurable parameters in fe.conf.

  • use_new_tablet_scheduler

    • Description: Whether to enable the new replica scheduling mode. The new replica scheduling method is the replica scheduling method introduced in this document. If turned on, disable_colocate_join must be true. Because the new scheduling strategy does not support data fragmentation scheduling of co-locotion tables for the time being.
    • Default value:true
    • Importance: High
  • tablet_repair_delay_factor_second

    • Note: For different scheduling priorities, we will delay different time to start repairing. In order to prevent a large number of unnecessary replica repair tasks from occurring in the process of routine restart and upgrade. This parameter is a reference coefficient. For HIGH priority, the delay is the reference coefficient * 1; for NORMAL priority, the delay is the reference coefficient * 2; for LOW priority, the delay is the reference coefficient * 3. That is, the lower the priority, the longer the delay waiting time. If the user wants to repair the copy as soon as possible, this parameter can be reduced appropriately.
    • Default value: 60 seconds
    • Importance: High
  • schedule_slot_num_per_path

    • Note: The default number of slots allocated to each disk for replica repair. This number represents the number of replica repair tasks that a disk can run simultaneously. If you want to repair the copy faster, you can adjust this parameter appropriately. The higher the single value, the greater the impact on IO.
    • Default value: 2
    • Importance: High
  • balance_load_score_threshold

    • Description: Threshold of Cluster Equilibrium. The default is 0.1, or 10%. When the load core of a BE node is not higher than or less than 10% of the average load core, we think that the node is balanced. If you want to make the cluster load more even, you can adjust this parameter appropriately.
    • Default value: 0.1
    • Importance:
  • storage_high_watermark_usage_percent 和 storage_min_left_capacity_bytes

    • Description: These two parameters represent the upper limit of the maximum space utilization of a disk and the lower limit of the minimum space remaining, respectively. When the space utilization of a disk is greater than the upper limit or the remaining space is less than the lower limit, the disk will no longer be used as the destination address for balanced scheduling.
    • Default values: 0.85 and 1048576000 (1GB)
    • Importance:
  • disable_balance

    • Description: Control whether to turn off the balancing function. When replicas are in equilibrium, some functions, such as ALTER TABLE, will be banned. Equilibrium can last for a long time. Therefore, if the user wants to do the prohibited operation as soon as possible. This parameter can be set to true to turn off balanced scheduling.
    • Default value:true
    • Importance:

Unadjustable parameters

The following parameters do not support modification for the time being, just for illustration.

  • Tablet Checker scheduling interval

    Tablet Checker schedules checks every 20 seconds.

  • Tablet Scheduler scheduling interval

    Tablet Scheduler schedules every five seconds

  • Number of Tablet Scheduler Schedules per Batch

    Tablet Scheduler schedules up to 50 tablets at a time.

  • Tablet Scheduler Maximum Waiting Schedule and Number of Tasks in Operation

    The maximum number of waiting tasks and running tasks is 2000. When over 2000, Tablet Checker will no longer generate new scheduling tasks to Tablet Scheduler.

  • Tablet Scheduler Maximum Balanced Task Number

    The maximum number of balanced tasks is 500. When more than 500, there will be no new balancing tasks.

  • Number of slots per disk for balancing tasks

    The number of slots per disk for balancing tasks is 2. This slot is independent of the slot used for replica repair.

  • Update interval of cluster equilibrium

    Tablet Scheduler recalculates the load score of the cluster every 20 seconds.

  • Minimum and Maximum Timeout for Clone Tasks

    A clone task timeout time range is 3 minutes to 2 hours. The specific timeout is calculated by the size of the tablet. The formula is (tablet size)/ (5MB/s). When a clone task fails three times, the task terminates.

  • Dynamic Priority Adjustment Strategy

    The minimum priority adjustment interval is 5 minutes. When a tablet schedule fails five times, priority is lowered. When a tablet is not scheduled for 30 minutes, priority is raised.

Relevant issues

  • In some cases, the default replica repair and balancing strategy may cause the network to be full (mostly in the case of gigabit network cards and a large number of disks per BE). At this point, some parameters need to be adjusted to reduce the number of simultaneous balancing and repair tasks.

  • Current balancing strategies for copies of Colocate Table do not guarantee that copies of the same Tablet will not be distributed on the BE of the same host. However, the repair strategy of the copy of Colocate Table detects this distribution error and corrects it. However, it may occur that after correction, the balancing strategy regards the replicas as unbalanced and rebalances them. As a result, the Colocate Group can not achieve stability because of the continuous alternation between the two states. In view of this situation, we suggest that when using Colocate attribute, we try to ensure that the cluster is isomorphic, so as to reduce the probability that replicas are distributed on the same host.