Distributed ACID transactions

Distributed ACID transactions

Distributed ACID transactions are transactions that modify multiple rows in more than one shard. YugabyteDB supports distributed transactions, enabling features such as strongly consistent secondary indexes and multi-table/row ACID operations in both the YCQL context as well as in the YSQL context. This section provides some common concepts and notions used in Yugabyte’s approach to implementing distributed transactions. Once you are familiar with these concepts, see Transactional IO Path for a walk-through of a distributed transaction’s life cycle.

Provisional records

Just as YugabyteDB stores values written by single-shard ACID transactions intoDocDB, it needs to store uncommitted values written bydistributed transactions in a similar persistent data structure. However, we cannot just write themto DocDB as regular values, because they would then become visible at different times to clientsreading through different tablet servers, allowing a client to see a partially applied transactionand thus breaking atomicity. YugabyteDB therefore writes provisional records to all tabletsresponsible for the keys the transaction is trying to modify. We call them “provisional” as opposedto “regular” (“permanent”) records, because they are invisible to readers until the transactioncommits.

Provisional records are stored in a separate part of the RocksDB key space in the same RocksDB /DocDB instance as regular records, but with a separate prefix that puts all provisional records’RocksDB keys before those of regular records. Compared to other possible design options, such asstoring provisional records inline with the regular records or putting them in a separate RocksDBinstance altogether, the approach we have chosen has the following benefits:

It is easy to scan all provisional records. As we will see, this is very helpful in cleaning upaborted / abandoned transactions.
During the read path, we need to handle provisional records very differently compared to regularrecords, and putting them in a separate section of the RocksDB key space allows to simplify theread path.
Storing provisional records in the same RocksDB instance allows to atomically delete provisionalrecords and write regular records as one RocksDB operation after the transaction is committed.

Encoding details of provisional records

There are three types of RocksDB key/value pairs corresponding to provisional records, omittingthe one-byte prefix that puts these records before all regular records in RocksDB.

1. Primary provisional records

  DocumentKey, SubKey1, ..., SubKeyN, LockType, ProvisionalRecordHybridTime -> TxnId, Value

The DocumentKey, SubKey1, …, SubKey components exactly match those in DocDB’sencoding of “paths” toa particular subdocument (e.g. a row, a column, or an element in a collection-type column) toRocksDB keys.

Each of these primary provisional records also acts as a persistent revocable lock. There are somesimilarities as well as differences when compared to blocking in-memorylocks maintained by every tablet’s lockmanager. These persistent locks can be of any of the same types as for in-memory leader-only locks(SI write, serializable write/read, and a separate “strong”/“weak” classification for handlingnested document changes). However, unlike the leader-side in-memory locks, the locks representedby provisional records can be revoked by another conflicting transaction. The conflict resolutionsubsystem makes sure that for any two conflicting transactions, at least one of them is aborted.

As an example, suppose a snapshot isolation transaction is setting column col1 in row row1 tovalue1. Then DocumentKey is row1 and SubKey1 is col1. Suppose the provisional record waswritten into the tablet with hybrid 1516847525206000, and the transaction ID is7c98406e-2373-499d-88c2-25d72a4a178c. In that case we will end up with the following provisionalrecord values in RocksDB:

  row1, WeakSIWrite, 1516847525206000 -> 7c98406e-2373-499d-88c2-25d72a4a178c
  row1, col1, StrongSIWrite, 1516847525206000 -> 7c98406e-2373-499d-88c2-25d72a4a178c, value1

We can see that we are using WeakSIWrite lock type for the row (the “parent” of the column weare writing), and StrongSIWrite for the column itself. The provisional record for the column isalso where the column’s value being written by the transaction is stored.

2. Transaction metadata records

TxnId -> StatusTabletId, IsolationLevel, Priority
- StatusTabletId is the ID of the tablet that keeps track of this transaction’s status.Unlike the case of tables/tablets holding user data, where we are using a hash-basedmapping from keys to tablets, there is no deterministic wayto compute the transaction status tablet ID by transaction ID, so this information must beexplicitly passed to all components handling a particular transaction.
- Isolation Level Snapshot Isolation orSerializable Isolation.
- Priority This priority is assigned randomly during transaction creation. When a conflictis detected between two transactions, the transaction with lower priority isaborted and restarted.

3. Provisional record keys indexed by transaction ID** (“reverse index”)

TxnId, HybridTime -> primary provisional record key

This mapping allows us to find all RocksDB records belonging to a particular transaction. This isbeing used when cleaning up committed or aborted transactions. Note that because multiple RocksDBkey/value pairs belonging to primary provisional records can we written for the same transactionwith the same hybrid time, we need to use an increasing counter (which we call a write id) atthe end of the encoded representation of hybrid time in order to obtain unique RocksDB keys forthis reverse index. This write ID is shown as .0, .1, etc. in T130.0, T130.1 in the figureabove.

Transaction status tracking

Atomicity (the “A” in “ACID”) means that either all values written by a transaction are visible, ornone are visible at all. YugabyteDB already provides atomicity of single-shard updates byreplicating them via Raft and applying them as one write batch to the underlying RocksDB / DocDBstorage engine. The same approach could be reused to make transaction status changes atomic. The status of transactions is tracked in a “transaction status” table. This table, under the covers, is just another elastic/sharded table in the system. The transaction id (a globally unique id) serves as the key in the table, and updates to a transaction’s status are simple single-shard ACID operations. This allows us to atomically make all values written as part of that transaction visible by setting the status to “committed” in that transaction’s status record in the table.

A transaction status record contains the following fields for a particular transaction id:

Status (pending, committed, or aborted).All transactions start in the “pending” status, and progress to “committed” or “aborted” status,in which they remain permanently until cleaned up.

After a transaction is committed, two more fields are set:

Commit hybrid timestamp. This timestamp is chosen as the current hybrid time at thetransaction status tablet at the moment of appending the “transaction committed” entry to itsRaft log. It is then used as the final MVCC timestamp for regular records that replace thetransaction’s provisional records when provisional records are being applied and cleaned up.
List of IDs of participating tablets. After a transaction commits, we know the final set oftablets that the transaction has written to. The tablet server managing the transaction sendsthis list to the transaction status tablet as part of the commit message, and the transactionstatus tablet makes sure that all participating tablets are notified of the transaction’scommitted status. This process might take multiple retries, and the transaction record can onlybe cleaned up after this is done.

Distributed transactions