Single Row ACID Transactions

YugabyteDB currently offers ACID semantics for mutations involving a single row or rows that fall within the same shard (partition, tablet). These mutations incur only one network round trip between the distributed consensus peers.

Even read-modify-write operations within a single row or single shard, such as the following, incur only one round trip in YugabyteDB.

  1. UPDATE table SET x = x + 1 WHERE ...
  2. INSERT ... IF NOT EXISTS
  3. UPDATE ... IF EXISTS

This is unlike Apache Cassandra, which uses a concept called lightweight transactions to achieve correctness for these read-modify-write operations and hence incurs the latency of four network round trips.

Hybrid time as an MVCC timestamp

YugabyteDB implements MVCC (multiversion concurrency control) and internally keeps track of multiple versions of values corresponding to the same key, e.g. of a particular column in a particular row. The details of how multiple versions of the same key are stored in each replica’s DocDB are described in the Encoding Details section. The last part of each key is a timestamp, which allows quick navigation to a particular version of a key in the RocksDB key-value store.
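
As a toy illustration (not the actual DocDB encoding, which makes the timestamp the last part of each RocksDB key), a multi-version store and a timestamped read might look like this in Python:

    class MVCCStore:
        # Toy model: each key maps to its versions, kept sorted by the
        # hybrid timestamp at which each value was written.
        def __init__(self):
            self.versions = {}  # key -> list of (hybrid_time, value)

        def write(self, key, hybrid_time, value):
            self.versions.setdefault(key, []).append((hybrid_time, value))
            self.versions[key].sort()

        def read(self, key, ht_read):
            # Scan backwards for the latest version at or below ht_read.
            for ht, value in reversed(self.versions.get(key, [])):
                if ht <= ht_read:
                    return value
            return None

    store = MVCCStore()
    store.write("row1.col1", 100, "a")
    store.write("row1.col1", 200, "b")
    assert store.read("row1.col1", 150) == "a"  # version written at ht=100
    assert store.read("row1.col1", 250) == "b"  # version written at ht=200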

The timestamp that we are using for MVCC comes from the HybridTime algorithm, a distributed timestamp assignment algorithm that combines the advantages of local real-time (physical) clocks and Lamport clocks. The HybridTime algorithm ensures that events connected by a causal chain of the form “A happens before B on the same server” or “A happens on one server, which then sends an RPC to another server, where B happens” always get assigned hybrid timestamps in increasing order. This is achieved by propagating a hybrid timestamp with most RPC requests and always updating the hybrid time on the receiving server to the highest value seen, including the current physical time on the server.
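
The following sketch illustrates the idea (a toy model, not YugabyteDB’s actual implementation): a hybrid timestamp is a (physical microseconds, logical counter) pair, and every RPC receipt advances the clock to the highest value seen:

    import time

    class HybridClock:
        # Toy HybridTime clock: a (physical_micros, logical) pair that
        # never goes backwards and absorbs timestamps received over RPC.
        def __init__(self):
            self.physical = 0  # microseconds
            self.logical = 0   # counter used while physical time stands still

        def now(self):
            wall = int(time.time() * 1_000_000)
            if wall > self.physical:
                self.physical, self.logical = wall, 0
            else:
                self.logical += 1
            return (self.physical, self.logical)

        def update(self, received):
            # Called on every RPC receipt: jump to the highest value seen.
            if received > (self.physical, self.logical):
                self.physical, self.logical = received

If server A attaches hybrid time t to an RPC and server B calls update(t) on receipt, every timestamp B assigns afterwards is strictly greater than t, which is exactly the causality guarantee described above. Multiple aspects of YugabyteDB’s transaction model rely on these properties of HybridTime, e.g.: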

  • Hybrid timestamps assigned to committed Raft log entries in the same tablet always keep increasing, even if there are leader changes. This is because the new leader always has all committed entries from previous leaders, and it makes sure to update its hybrid clock with the timestamp of the last committed entry before appending new entries. This property simplifies the logic of selecting a safe hybrid time for single-tablet read requests.

  • A request trying to read data from a tablet at a particular hybrid time needs to make sure that no changes happen in the tablet with timestamps lower than the read timestamp, which could lead to an inconsistent result set. The need to read from a tablet at a particular timestamp arises during transactional reads across multiple tablets. This condition becomes easier to satisfy because the read timestamp is chosen as the current hybrid time on the YB-TServer processing the read request, so the hybrid time on the leader of the tablet we’re reading from is immediately updated to a value at least as high as the read timestamp. The read request then only has to wait for any relevant entries in the Raft queue with timestamps lower than the read timestamp to be replicated and applied to RocksDB, after which it can proceed, as sketched below.
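
Here is a minimal check of that condition (a hypothetical helper, with hybrid times modeled as plain integers):

    def can_serve_read(ht_read, pending_entry_hts):
        # A read at ht_read may proceed only once no Raft entry with a
        # hybrid time at or below ht_read is still waiting to be applied.
        return all(ht > ht_read for ht in pending_entry_hts)

    # An entry at ht=120 is still pending, so a read at ht_read=150 must
    # wait for it, while a read at ht_read=100 can proceed immediately.
    assert not can_serve_read(150, [120, 180])
    assert can_serve_read(100, [120, 180])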

Reading the latest data from a recently elected leader

In a steady state, when the leader is appending and replicating log entries, the latest majority-replicated entry is exactly the committed one. However, it is a bit more complicated right after a leader change. When a new leader is elected in a tablet, it appends a no-op entry to the tablet’s Raft log and replicates it, as described in the Raft protocol. Before this no-op entry is replicated, we consider the tablet unavailable for reading up-to-date values and accepting read-modify-write operations. This is because the new tablet leader needs to be able to guarantee that all previous Raft-committed entries are applied to RocksDB and other persistent and in-memory data structures, and this is only possible once we know that all entries in the new leader’s log are committed.
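
A minimal sketch of this rule (a toy model with a plain Python list standing in for the Raft log):

    class NewLeader:
        # On election, append a no-op entry; up-to-date reads and
        # read-modify-write operations are rejected until it commits.
        def __init__(self, raft_log):
            raft_log.append("no-op")
            self.raft_log = raft_log
            self.noop_index = len(raft_log) - 1
            self.commit_index = self.noop_index - 1  # no-op not yet committed

        def can_serve_up_to_date(self):
            # Committing the no-op implies every earlier entry is also
            # committed, so only then are up-to-date reads safe.
            return self.commit_index >= self.noop_index

    leader = NewLeader(["entry1", "entry2"])
    assert not leader.can_serve_up_to_date()
    leader.commit_index = leader.noop_index  # no-op replicated to a majority
    assert leader.can_serve_up_to_date()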

Leader leases: reading the latest data in case of a network partition

Leader leases are a mechanism for a tablet leader to establish its authority for a certain short time period in order to avoid the following inconsistency:

  • The leader is network-partitioned away from its followers
  • A new leader is elected
  • The client writes a new value and the new leader replicates it
  • The client reads a stale value from the old leader

[Diagram: a potential inconsistency during a network partition when leader leases are not present]

The leader lease mechanism in YugabyteDB prevents this inconsistency. It works as follows:

  • With every leader-to-follower message (AppendEntries in Raft’s terminology), whether replicating new entries or even an empty heartbeat message, the leader sends a “leader lease” request as a time interval, e.g. “I want a 2-second lease”. The lease duration is usually a system-wide parameter. For each peer, the leader also keeps track of the lease expiration time corresponding to each pending request (i.e. the time when the request was sent, plus the lease duration), stored in terms of local monotonic time (CLOCK_MONOTONIC in Linux). The leader considers itself a special case of a “peer” for this purpose. Then, as it receives responses from followers, it maintains the majority-replicated watermark of these expiration times as stored at request sending time. The leader adopts this majority-replicated watermark as its lease expiration time and uses it when deciding whether it can serve consistent read requests or accept writes; see the sketch after this list.

  • When a follower receives the above Raft RPC, it reads the value of its current monotonic clock, adds the provided lease interval to that, and remembers this lease expiration time, also in terms of its local monotonic time. If this follower becomes the new leader, it is not allowed to serve consistent reads or accept writes until any potential old leader’s lease expires.

  • To guarantee that any new leader is aware of any old leader’s lease expiration, another bit of logic is necessary. Each Raft group member records the latest expiration time of an old leader that it knows about (in terms of this server’s local monotonic time). Whenever a server responds to a RequestVote RPC, it includes the largest remaining amount of time of any known old leader’s lease in its response. This is handled similarly to the lease duration in a leader’s AppendEntries request on the receiving server: at least this amount of time has to pass since the receipt of this request before the recipient can service up-to-date requests in case it becomes a leader. This part of the algorithm is needed so that we can prove that a new leader will always know about any old leader’s majority-replicated lease. This is analogous to Raft’s correctness proof: there is always a server (“the voter”) that received a lease request from the old leader and voted for the new leader, because the two majorities must overlap.
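
The watermark computation referenced in the first bullet could look like the following sketch (hypothetical code; lease expirations as monotonic-clock values in seconds):

    def majority_replicated_watermark(acked_expirations, num_peers):
        # acked_expirations: lease expiration times (request send time +
        # lease duration) acknowledged so far, including the leader's own.
        # The leader may rely on the highest expiration acknowledged by a
        # majority, i.e. the majority-th largest acknowledged value.
        majority = num_peers // 2 + 1
        if len(acked_expirations) < majority:
            return None  # no majority-replicated lease yet
        return sorted(acked_expirations, reverse=True)[majority - 1]

    # A 3-server group where the leader (12.0) and one follower (11.5)
    # have acknowledged: the lease is majority-replicated until 11.5.
    assert majority_replicated_watermark([12.0, 11.5], 3) == 11.5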

Note that we are not relying on any kind of clock synchronization for this leader lease implementation, as we’re only sending time intervals over the network, and each server operates in terms of its local monotonic clock. The only two requirements for the clock implementation are:

  • Bounded monotonic clock drift rate between different servers. E.g. if we use the standard Linux assumption of a drift rate of less than 500 µs per second, we could account for it by multiplying all delays mentioned above by 1.001 (two clocks can drift apart from each other by up to 1 ms per second).

  • The monotonic clock does not freeze. E.g. if we’re running on a VM that freezes temporarily, the hypervisor needs to refresh the VM’s clock from the hardware clock when it starts running again.

The leader lease mechanism guarantees that, at any point in time, there is at most one server in any tablet’s Raft group that considers itself to be an up-to-date leader allowed to service consistent reads or accept write requests.

Safe timestamp assignment for a read request

Every read request is assigned a particular MVCC timestamp / hybrid time (let’s call it ht_read), which allows write operations to the same set of keys to happen in parallel with reads. It is crucial, however, that the view of the database as of this timestamp is not updated by concurrently happening writes. In other words, once we’ve picked ht_read for a read request, no further writes to the same set of keys can be assigned timestamps lower than or equal to ht_read. As mentioned above, we assign strictly increasing hybrid times to the Raft log entries of any given tablet. Therefore, one way to assign ht_read safely would be to use the hybrid time of the last committed record. As committed Raft log records are never overwritten by future leaders, and each new leader reads the last log entry and updates its hybrid time, all future records will have strictly higher hybrid times.

However, with this conservative timestamp assignment approach, ht_read can stay the same if there is no write workload on this particular tablet. This will result in a client-observed anomaly if TTL (time-to-live) is being used: no expired values will disappear, as far as the client is concerned, until a new record is written to the tablet. Then, a lot of old expired values could suddenly disappear. To combat this anomaly, we need to assign the read timestamp to be close to the current hybrid time (which is in turn close to the physical time) to preserve natural TTL semantics. We should therefore try to choose ht_read to be the highest possible timestamp for which we can guarantee that all future write operations in the tablet will have a strictly higher hybrid time, even across leader changes.

For this, we need to introduce the concept of “hybrid time leader leases”, similar to the absolute-time leader leases discussed in the previous section. With every Raft AppendEntries request to a follower, whether it is a regular request or an empty / heartbeat request, a tablet leader computes a “hybrid time lease expiration time”, or ht_lease_exp for short, and sends that to the follower. ht_lease_exp is usually computed as the current hybrid time plus a fixed configured duration (e.g. 2 seconds). By replying, followers acknowledge the old leader’s exclusive authority over assigning any hybrid times up to and including ht_lease_exp. Similarly to regular leases, these hybrid time leases are propagated on votes. The leader maintains a majority-replicated watermark and considers itself to have replicated a particular value of a hybrid time leader lease expiration if it has sent that or a higher ht_lease_exp value to a majority of Raft group members. For this purpose, the leader is always considered to have replicated an infinite leader lease to itself.
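
As a toy illustration, this watermark differs from the absolute-time lease sketch above only in that the values are hybrid times and the leader always counts an infinite lease for itself:

    INFINITE_HT = (float("inf"), 0)

    def replicated_ht_lease_exp(follower_acks, num_peers):
        # Same majority-th-largest computation as for regular leases, but
        # the leader is treated as having acknowledged an infinite lease.
        acked = follower_acks + [INFINITE_HT]
        majority = num_peers // 2 + 1
        if len(acked) < majority:
            return None  # no majority-replicated hybrid time lease yet
        return sorted(acked, reverse=True)[majority - 1]

    # 3-server group: one follower acknowledged ht_lease_exp = (2000, 0);
    # combined with the leader's self-lease, that value is majority-replicated.
    assert replicated_ht_lease_exp([(2000, 0)], 3) == (2000, 0)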

Definition of safe time

Now, suppose the current majority-replicated hybrid time leader lease expiration is replicated_ht_lease_exp. Then the safe timestamp for a read request can be computed as the maximum of:

  • Last committed Raft entry’s hybrid time
  • One of:
    • If there are uncommitted entries in the Raft log: the minimum of the first uncommitted entry’s hybrid time - ε (where ε is the smallest possible difference in hybrid time) and replicated_ht_lease_exp.
    • If there are no uncommitted entries in the Raft log: the minimum of the current hybrid time and replicated_ht_lease_exp.

In other words, the last committed entry’s hybrid time is always safe to read at, but for higher hybrid times, the majority-replicated hybrid time leader lease is an upper bound. That is because we can only guarantee that no future leader will commit an entry with hybrid time less than ht if ht < replicated_ht_lease_exp.
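
As a toy illustration, this rule can be written down directly (hybrid times modeled as integers, with ε = 1):

    def safe_time(last_committed_ht, first_uncommitted_ht, current_ht,
                  replicated_ht_lease_exp, epsilon=1):
        # Safe time rule from above: the last committed entry is always
        # safe; anything higher is capped by the hybrid time leader lease.
        if first_uncommitted_ht is not None:
            candidate = min(first_uncommitted_ht - epsilon,
                            replicated_ht_lease_exp)
        else:
            candidate = min(current_ht, replicated_ht_lease_exp)
        return max(last_committed_ht, candidate)

    # No uncommitted entries: safe time tracks the current hybrid time,
    # capped by the majority-replicated hybrid time lease expiration.
    assert safe_time(100, None, 500, 400) == 400
    # An uncommitted entry at ht=200 pins safe time just below it.
    assert safe_time(100, 200, 500, 400) == 199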

Note that when reading from a single tablet, we never have to wait for the chosen ht_read to become safe to read at because it is chosen as such already. However, if we decide to read a consistent view of data across multiple tablets, ht_read could be chosen on one of them, and we’ll have to wait for that timestamp to become safe to read at on the second tablet. This will typically happen very quickly, as the hybrid time on the second tablet’s leader will be instantly updated with the propagated hybrid time from the first tablet’s leader, and in the common case we will just have to wait for pending Raft log entries with hybrid times less than ht_read to be committed.
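
As a rough sketch of this cross-tablet flow (a toy model with integer hybrid times, assuming no pending entries or lease limits on the second tablet):

    class ToyTabletLeader:
        # Stand-in for a tablet leader whose safe time simply equals its
        # hybrid clock value (no pending entries, unlimited lease).
        def __init__(self, ht):
            self.ht = ht

        def update_clock(self, received_ht):
            self.ht = max(self.ht, received_ht)

    t1, t2 = ToyTabletLeader(500), ToyTabletLeader(300)
    ht_read = t1.ht            # chosen on tablet 1; safe there by construction
    t2.update_clock(ht_read)   # propagated hybrid time bumps tablet 2's clock
    assert t2.ht >= ht_read    # tablet 2 now only waits for pending Raft
                               # entries below ht_read to commit, if any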