Data Replication with Raft Consensus


YugabyteDB replicates data in order to survive failures while continuing to maintain consistency of data and without requiring operator intervention. Replication Factor (or RF) is the number of copies of data in a Yugabyte universe. Fault Tolerance (or FT) of a Yugabyte universe is the maximum number of node failures it can survive while continuing to preserve correctness of data. FT and RF are highly correlated. To achieve a FT of k nodes, the universe has to be configured with a RF of (2k + 1).
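For example, a universe with RF = 3 can tolerate one node failure, and RF = 5 can tolerate two. The minimal sketch below captures this relationship; the function names are illustrative and not part of any YugabyteDB API.

```go
package main

import "fmt"

// replicationFactorFor returns the replication factor needed to survive
// k simultaneous node failures: RF = 2k + 1.
func replicationFactorFor(k int) int {
	return 2*k + 1
}

// faultToleranceFor returns how many node failures an RF-replicated
// universe can survive while a majority of copies remains available.
func faultToleranceFor(rf int) int {
	return (rf - 1) / 2
}

func main() {
	for _, k := range []int{1, 2} {
		rf := replicationFactorFor(k)
		fmt.Printf("FT=%d requires RF=%d (majority quorum of %d nodes)\n",
			k, rf, rf/2+1)
	}
	fmt.Println("RF=3 tolerates", faultToleranceFor(3), "failure(s)")
}
```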

Strong write consistency

Replication of data in YugabyteDB is achieved at the level of tablets, using tablet-peers. Each tablet comprises a set of tablet-peers, each of which stores one copy of the data belonging to the tablet. There are as many tablet-peers for a tablet as the replication factor, and they form a Raft group. The tablet-peers are hosted on different nodes to provide data redundancy on node failures. Note that the replication of data between the tablet-peers is strongly consistent.
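The sketch below models this structure with illustrative type names (these are not YugabyteDB's internal types): a tablet holds as many tablet-peers as the replication factor, each hosted on a different node, and together they form one Raft group with a single leader.

```go
package main

import "fmt"

// Role of a tablet-peer within its Raft group.
type Role string

const (
	Leader   Role = "LEADER"
	Follower Role = "FOLLOWER"
)

// TabletPeer is one copy of a tablet's data, hosted on a YB-TServer node.
type TabletPeer struct {
	Node string
	Role Role
}

// Tablet groups its tablet-peers into a single Raft group; the number of
// peers equals the replication factor.
type Tablet struct {
	ID    string
	Peers []TabletPeer
}

func main() {
	t := Tablet{
		ID: "tablet-1",
		Peers: []TabletPeer{
			{Node: "node-1", Role: Leader},
			{Node: "node-2", Role: Follower},
			{Node: "node-3", Role: Follower},
		},
	}
	fmt.Printf("%s has %d tablet-peers (RF=%d)\n", t.ID, len(t.Peers), len(t.Peers))
}
```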

The figure below illustrates three tablet-peers that belong to a tablet (tablet 1). The tablet-peers are hosted on different YB-TServers and form a Raft group for leader election, failure detection and replication of the write-ahead logs.

[Figure: Raft replication across the tablet-peers of a tablet, hosted on different YB-TServers]

The first thing that happens when a tablet starts up is to elect one of the tablet-peers as the leader using the Raft protocol. This tablet leader becomes responsible for processing user-facing write requests. It translates the user-issued writes into DocDB updates, which it replicates among the tablet-peers using Raft to achieve strong consistency. The set of DocDB updates depends on the user-issued write, and involves locking a set of keys to establish a strict update order, and optionally reading the older value to modify and update in the case of a read-modify-write operation. The Raft log is used to ensure that the database state machine of a tablet is replicated amongst the tablet-peers with strict ordering and correctness guarantees even in the face of failures or membership changes. This is essential to achieving strong consistency.
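A self-contained toy sketch of this leader-side write path follows. It is not YugabyteDB code: the types, the single mutex standing in for per-key locks, and the simulated acknowledgement count are simplifications used only to illustrate the commit rule, namely that a write succeeds once a majority of tablet-peers has persisted the Raft entry.

```go
package main

import (
	"fmt"
	"sync"
)

type logEntry struct {
	key, value string
}

type tabletLeader struct {
	mu      sync.Mutex         // stands in for per-key locking to establish a strict update order
	raftLog []logEntry         // simplified Raft log
	docdb   map[string]string  // stand-in for the DocDB state machine
	acks    func(logEntry) int // how many peers (including the leader) persisted the entry
	rf      int                // replication factor
}

// write translates a user-issued put into a Raft log entry, "replicates" it,
// and applies it to DocDB once a majority of tablet-peers has persisted it.
func (l *tabletLeader) write(key, value string) error {
	l.mu.Lock()
	defer l.mu.Unlock()

	// A read-modify-write operation would read the older value here.
	_ = l.docdb[key]

	entry := logEntry{key: key, value: value}
	l.raftLog = append(l.raftLog, entry)

	// The write succeeds only if a majority of tablet-peers persisted the entry.
	if l.acks(entry) < l.rf/2+1 {
		return fmt.Errorf("write not acknowledged by a majority")
	}

	// Apply the committed entry to DocDB; it is now available for reads.
	l.docdb[key] = value
	return nil
}

func main() {
	l := &tabletLeader{
		docdb: map[string]string{},
		rf:    3,
		acks:  func(logEntry) int { return 2 }, // pretend 2 of 3 peers persisted the entry
	}
	if err := l.write("k1", "v1"); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("committed:", l.docdb["k1"])
}
```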

Once the Raft log is replicated to a majority of tablet-peers and successfully persisted on that majority, the write is applied to DocDB and is subsequently available for reads. Details of DocDB, which is a log-structured merge-tree (LSM) database, are covered in a subsequent section. Once the write is persisted on disk by DocDB, the write entries can be purged from the Raft log. This is performed as a controlled background operation without any impact on the foreground operations.
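The following toy sketch illustrates the purge step, assuming a simple in-memory log (field names are illustrative, not YugabyteDB internals): once DocDB has persisted the effects of entries up to a given index, those entries can be dropped from the Raft log because they are no longer needed for recovery.

```go
package main

import "fmt"

// raftLog is a toy Raft log whose oldest retained entry has index firstIndex.
type raftLog struct {
	entries    []string
	firstIndex int
}

// purgeUpTo drops all entries with index <= flushedIndex, i.e. entries whose
// effects DocDB has already persisted to disk.
func (l *raftLog) purgeUpTo(flushedIndex int) {
	keepFrom := flushedIndex - l.firstIndex + 1
	if keepFrom <= 0 {
		return // nothing old enough to purge
	}
	if keepFrom > len(l.entries) {
		keepFrom = len(l.entries)
	}
	l.entries = l.entries[keepFrom:]
	l.firstIndex = flushedIndex + 1
}

func main() {
	l := &raftLog{entries: []string{"put k1", "put k2", "put k3"}, firstIndex: 1}
	l.purgeUpTo(2) // DocDB has flushed everything up to index 2
	fmt.Println(l.entries, "starting at index", l.firstIndex) // [put k3] starting at index 3
}
```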

Tunable read consistency

Only the tablet leader can process user-facing write and read requests. Note that while this is the case for strongly consistent reads, YugabyteDB offers reading from followers with relaxed guarantees, which is desirable in some deployment models. All other tablet-peers are called followers; they merely replicate data and are available as hot standbys that can take over quickly in case the leader fails.

[Figure: Read replicas]

In addition to the core distributed consensus based replication, YugabyteDB extends Raft to add read replicas (also known as observer nodes) that do not participate in writes but get a timeline-consistent copy of the data in an asynchronous manner. Nodes in remote datacenters can thus be added in “read-only” mode. This is primarily for cases where the latency of doing a distributed consensus based write is not tolerable for some workloads. This read-only node (or timeline-consistent node) is still strictly better than eventual consistency, because with the latter the application’s view of the data can move back and forth in time and is hard to program against.
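The sketch below, using illustrative types rather than YugabyteDB internals, shows why a read replica that applies committed Raft entries strictly in log order is timeline-consistent: its copy of the data may lag the leader, but its view never moves backwards in time.

```go
package main

import "fmt"

// readReplica holds a local, possibly stale, copy of the data and tracks the
// highest Raft log index it has applied so far.
type readReplica struct {
	appliedIndex int
	data         map[string]string
}

type committedEntry struct {
	index      int
	key, value string
}

// apply consumes committed entries strictly in log order; anything that is not
// the next entry is ignored, so the replica's state only ever advances.
func (r *readReplica) apply(e committedEntry) {
	if e.index != r.appliedIndex+1 {
		return // not the next entry in the timeline; wait for it
	}
	r.data[e.key] = e.value
	r.appliedIndex = e.index
}

func main() {
	r := &readReplica{data: map[string]string{}}
	r.apply(committedEntry{index: 1, key: "k", value: "v1"})
	r.apply(committedEntry{index: 3, key: "k", value: "v3"}) // skipped: index 2 not yet seen
	fmt.Println(r.data["k"], "at applied index", r.appliedIndex) // "v1" at applied index 1
}
```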