Replication with Raft consensus

DocDB replicates data in order to survive failures while continuing to maintain consistency of data and not requiring operator intervention. The replication factor (RF) is the number of copies of data in a YugabyteDB universe. The fault tolerance (FT) of a YugabyteDB universe is the maximum number of node failures it can survive while continuing to preserve correctness of data. FT and RF are highly correlated: to achieve an FT of k node failures, the universe has to be configured with an RF of (2k + 1).
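
As a back-of-the-envelope illustration of the RF and FT relationship, the following sketch (illustrative function names, not part of YugabyteDB) computes the replication factor needed for a desired fault tolerance and the corresponding Raft write quorum:

```python
def replication_factor_for_fault_tolerance(k: int) -> int:
    """A fault tolerance of k node failures requires RF = 2k + 1."""
    return 2 * k + 1


def raft_majority(rf: int) -> int:
    """A write commits once a majority of the RF tablet-peers have persisted it."""
    return rf // 2 + 1


for k in (1, 2):
    rf = replication_factor_for_fault_tolerance(k)
    print(f"FT={k}: RF={rf}, write quorum={raft_majority(rf)}")
# FT=1: RF=3, write quorum=2
# FT=2: RF=5, write quorum=3
```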

Strong write consistency

Replication of data in DocDB is achieved at the level of tablets, using tablet-peers. Each tablet comprises a set of tablet-peers, each of which stores one copy of the data belonging to the tablet. There are as many tablet-peers for a tablet as the replication factor, and they form a Raft group. The tablet-peers are hosted on different nodes to provide data redundancy on node failures. Note that the replication of data between the tablet-peers is strongly consistent.
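
A rough sketch of the placement constraint described above: RF tablet-peers of one tablet are assigned to distinct nodes. The function and placement policy below are purely illustrative, not DocDB's actual placement logic.

```python
def place_tablet_peers(tablet_id: str, nodes: list[str], rf: int) -> list[str]:
    """Choose rf distinct nodes to host the tablet-peers of one tablet."""
    if rf > len(nodes):
        raise ValueError("not enough nodes to satisfy the replication factor")
    # Toy placement: a real policy also balances load across zones and regions.
    start = hash(tablet_id) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


print(place_tablet_peers("tablet-1", ["node-a", "node-b", "node-c", "node-d"], rf=3))
```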

The figure below illustrates three tablet-peers that belong to a tablet (tablet 1). The tablet-peers are hosted on different YB-TServers and form a Raft group for leader election, failure detection, and replication of the write-ahead logs.

[Figure: Raft replication across the tablet-peers of a tablet]

The first thing that happens when a tablet starts up is to elect one of the tablet-peers as the leader using the Raft protocol. This tablet leader now becomes responsible for processing user-facing write requests. It translates the user-issued writes into updates to the document storage layer of DocDB and replicates them among the tablet-peers using Raft to achieve strong consistency.
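
A minimal sketch of the "only the leader accepts writes" rule, assuming a hypothetical TabletPeer abstraction with a role field; this is not the actual DocDB API.

```python
from dataclasses import dataclass


@dataclass
class TabletPeer:
    peer_id: str
    role: str          # "LEADER" or "FOLLOWER"
    leader_hint: str   # last known leader address, used to redirect clients


def handle_write(peer: TabletPeer, key: str, value: str) -> str:
    """Only the elected Raft leader of a tablet may process a user-facing write."""
    if peer.role != "LEADER":
        return f"NOT_LEADER: retry against {peer.leader_hint}"
    # Leader path: translate the write into document-storage updates and
    # replicate the corresponding Raft entry to the other tablet-peers.
    return f"accepted {key}={value} for replication"
```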

The set of DocDB updates depends on the user-issued write, and involves locking a set of keys to establish a strict update order, and optionally reading the older value to modify and update in the case of a read-modify-write operation. The Raft log is used to ensure that the database state machine of a tablet is replicated amongst the tablet-peers with strict ordering and correctness guarantees, even in the face of failures or membership changes. This is essential to achieving strong consistency.
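
The sketch below illustrates the idea of taking per-key locks to establish a strict update order and reading the old value for a read-modify-write. It is a simplified, single-process model with invented names, not DocDB's actual locking code.

```python
import threading
from collections import defaultdict

store: dict[str, int] = {}
key_locks: defaultdict[str, threading.Lock] = defaultdict(threading.Lock)


def read_modify_write(key: str, delta: int) -> int:
    """Lock the key, read the old value, then write the updated value."""
    with key_locks[key]:           # per-key lock establishes a strict update order
        old = store.get(key, 0)    # the optional read of a read-modify-write
        new = old + delta
        store[key] = new           # in DocDB this becomes a replicated Raft entry
        return new
```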

Once the Raft log is replicated to a majority of tablet-peers and successfully persisted on that majority, the write is applied to the DocDB document storage layer and is subsequently available for reads. Once the write is persisted on disk by the document storage layer, the write entries can be purged from the Raft log. This is performed as a controlled background operation without any impact on the foreground operations.
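
Putting the steps together, here is a hedged sketch of the commit path: an entry counts as committed once a majority of tablet-peers have persisted it, is then applied to the document store, and can later be dropped from the log. The names and the simplified purge condition are illustrative, not DocDB internals.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    index: int
    key: str
    value: str
    acks: set = field(default_factory=set)   # peers that have persisted this entry


def is_committed(entry: LogEntry, rf: int) -> bool:
    return len(entry.acks) >= rf // 2 + 1     # persisted on a majority of the group


def apply_committed(log: list[LogEntry], doc_store: dict, rf: int) -> list[LogEntry]:
    """Apply committed entries to the document store; keep the rest in the log.

    Purging is modeled here as simply dropping applied entries; the real system
    purges only after the storage layer has flushed them to disk, and does so
    as a background operation.
    """
    remaining = []
    for entry in log:
        if is_committed(entry, rf):
            doc_store[entry.key] = entry.value   # now visible to reads
        else:
            remaining.append(entry)
    return remaining
```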

Tunable read consistency

Only the tablet leader can process user-facing write and read requests. Note that while this is the case for strongly consistent reads, YugabyteDB offers reading from followers with relaxed guarantees, which is desirable in some deployment models. All other tablet-peers are called followers; they merely replicate data and are available as hot standbys that can take over quickly in case the leader fails.
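
A sketch of the read-routing trade-off: strongly consistent reads go to the tablet leader, while an application that opts into relaxed guarantees may be served by a follower. The routing function and peer representation are illustrative assumptions, not YugabyteDB's client code.

```python
import random


def route_read(peers: list[dict], allow_follower_reads: bool) -> dict:
    """Strongly consistent reads go to the leader; relaxed reads may use a follower."""
    leader = next(p for p in peers if p["role"] == "LEADER")
    if not allow_follower_reads:
        return leader
    followers = [p for p in peers if p["role"] == "FOLLOWER"]
    return random.choice(followers) if followers else leader
```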

Read replicas

In addition to the core distributed consensus-based replication, DocDB extends Raft to add read replicas (also known as observer nodes) that do not participate in writes but get a timeline-consistent copy of the data in an asynchronous manner. Nodes in remote data centers can thus be added in “read-only” mode. This is primarily for cases where the latency of a distributed consensus-based write is not tolerable for some workloads. Such a read-only node (or timeline-consistent node) is still strictly better than eventual consistency, because with the latter the application’s view of the data can move back and forth in time and is hard to program against.
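
To make “timeline consistent” concrete, the sketch below models an observer node that applies replicated entries strictly in log order, so its possibly stale view always corresponds to some prefix of the leader's history and never moves backward in time. This is an illustrative model under that assumption, not the DocDB implementation.

```python
from typing import Optional


class ObserverNode:
    """Receives committed entries asynchronously and applies them in log order."""

    def __init__(self) -> None:
        self.applied_index = 0
        self.data: dict[str, str] = {}

    def receive(self, index: int, key: str, value: str) -> None:
        # Apply only the next entry in sequence, so any read observes a prefix
        # of the leader's history: stale, perhaps, but never out of order.
        if index == self.applied_index + 1:
            self.data[key] = value
            self.applied_index = index

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)   # possibly stale, never moves back in time
```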