Transactional IO Path


Introduction

Review the Distributed ACID Transactions section for an overview of some common concepts used in YugabyteDB’s implementation of distributed transactions. In this section, we will go over the write path of a transaction modifying multiple keys, and the read path for reading a consistent combination of values from multiple tablets.

Write path overview

Let us walk through the lifecycle of a single distributed write-only transaction. Suppose we are trying to modify rows with keys k1 and k2. If they belong to the same tablet, we could execute this transaction as a single-shard transaction, in which case atomicity would be ensured by the fact that both updates would be replicated as part of the same Raft log record. However, in the most general case, these keys would belong to different tablets, and that is what we’ll assume from this point on.

The diagram below shows the high-level steps of a distributed write-only transaction, not including any conflict resolution.

[Diagram: distributed_txn_write_path]

1. Client’s request

The client sends a request to a YB tablet server that requires a distributed transaction. We currently only support transactions that can be expressed by a single client request. Here is an example using our extension to CQL:

  START TRANSACTION;
  UPDATE t1 SET v = 'v1' WHERE k = 'k1';
  UPDATE t2 SET v = 'v2' WHERE k = 'k2';
  COMMIT;

The tablet server that receives the transactional write request becomes responsible for driving all the steps involved in this transaction, as described below. This orchestration of transaction steps is performed by a component we call a transaction manager. Every transaction is handled by exactly one transaction manager.

2. Creating a transaction record

We assign a transaction id and select a transaction status tablet that would keep track of a transaction status record that has the following fields (see the sketch after this list):

  • Status (pending, committed, or aborted)
  • Commit hybrid timestamp, if committed
  • List of ids of participating tablets, if committed
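
The status record can be pictured as a small structure keyed by transaction id. Below is a minimal sketch in Python; the class and field names are illustrative only and do not reflect YugabyteDB’s actual internal encoding.

  from dataclasses import dataclass, field
  from enum import Enum
  from typing import List, Optional

  class TxnStatus(Enum):
      PENDING = "pending"
      COMMITTED = "committed"
      ABORTED = "aborted"

  @dataclass
  class TransactionStatusRecord:
      txn_id: str
      status: TxnStatus = TxnStatus.PENDING
      commit_hybrid_time: Optional[int] = None          # set only once committed
      participant_tablet_ids: List[str] = field(default_factory=list)  # set only once committed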

It makes sense to select a transaction status tablet in a way such that the transaction manager’s tablet server is also the leader of its Raft group, because this allows cutting the RPC latency of querying and updating the transaction status. But in the most general case, the transaction status tablet might not be hosted on the same tablet server that initiates the transaction.

3. Writing provisional records

We start writing provisional records to tablets containing the rows we need to modify. These provisional records contain the transaction id, the values we are trying to write, and the provisional hybrid timestamp. This provisional hybrid timestamp is not the final commit timestamp, and will in general be different for different provisional records within the same transaction. In contrast, there is only one commit hybrid timestamp for the entire transaction.
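
As a rough illustration of what a single provisional record carries, here is a sketch using the same conventions as above; the field names are assumptions made for explanation, not DocDB’s actual key-value encoding.

  from dataclasses import dataclass

  @dataclass
  class ProvisionalRecord:
      doc_key: bytes                 # the row/column being modified
      txn_id: str                    # the owning transaction
      value: bytes                   # the value we are trying to write
      provisional_hybrid_time: int   # per-record write time, not the commit time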

As we write the provisional records, we might encounter conflicts with other transactions. In this case we would have to abort and restart the transaction. These restarts still happen transparently to the client up to a certain number of retries.
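
Conceptually, the transparent restart loop looks something like the sketch below; the retry limit and exception name are hypothetical and only meant to illustrate the control flow.

  class ConflictError(Exception):
      """Raised when a provisional write conflicts with another transaction (illustrative)."""

  def run_with_restarts(attempt_txn, max_retries=5):
      # attempt_txn writes the provisional records and commits, raising
      # ConflictError if it loses a conflict with another transaction.
      for _ in range(max_retries):
          try:
              return attempt_txn()
          except ConflictError:
              continue                  # abort and restart, transparently to the client
      raise ConflictError("retries exhausted; the error is surfaced to the client")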

4. Committing the transaction

When the transaction manager has written all the provisional records, it commits the transaction by sending an RPC request to the transaction status tablet. The commit operation will only succeed if the transaction has not yet been aborted due to conflicts. The atomicity and durability of the commit operation is guaranteed by the transaction status tablet’s Raft group. Once the commit operation is complete, all provisional records immediately become visible to clients.

The commit request the transaction manager sends to the status tablet includes the list of tablet ids of all tablets that participate in the transaction. Clearly, no new tablets can be added to this set by this point. The status tablet needs this information to orchestrate cleaning up provisional records in participating tablets.
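
Putting the two paragraphs above together, the status tablet’s commit handling could be sketched as follows, reusing the TransactionStatusRecord sketch from step 2 (again, illustrative pseudo-logic rather than actual YugabyteDB code):

  def handle_commit(record: TransactionStatusRecord, commit_hybrid_time, participant_tablet_ids):
      if record.status is TxnStatus.ABORTED:
          return "aborted"                  # conflict resolution got there first; the commit fails
      record.status = TxnStatus.COMMITTED
      record.commit_hybrid_time = commit_hybrid_time
      record.participant_tablet_ids = list(participant_tablet_ids)
      # In the real system this state change is Raft-replicated by the status
      # tablet's group before the commit is acknowledged.
      return "committed"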

5. Sending the response back to client

The YQL engine sends the response back to the client. If any client (either the same one or a different one) sends a read request for the keys that were written, the new values are guaranteed to be reflected in the response, because the transaction is already committed. This property of a database is sometimes called the “read your own writes” guarantee.

6. Asynchronously applying and cleaning up provisional records

This step is coordinated by the transaction status tablet after it receives the commit message for our transaction and successfully replicates a change to the transaction’s status in its Raft group. The transaction status tablet already knows which tablets are participating in this transaction, so it sends cleanup requests to them. Each participating tablet records a special “apply” record into its Raft log, containing the transaction id and commit timestamp. When this record is Raft-replicated in the participating tablet, the tablet removes the provisional records belonging to the transaction, and writes regular records with the correct commit timestamp to its RocksDB databases. These records are now virtually indistinguishable from those written by regular single-row operations.
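
The effect of processing an “apply” record can be sketched with two plain dictionaries standing in for the tablet’s provisional data and its regular data; this is only a model of the behavior described above, not DocDB’s actual RocksDB layout.

  def apply_committed_txn(provisional_store, regular_store, txn_id, commit_hybrid_time):
      # provisional_store: doc_key -> {"txn_id": ..., "value": ...}
      # regular_store:     (doc_key, hybrid_time) -> value
      for doc_key in [k for k, rec in provisional_store.items() if rec["txn_id"] == txn_id]:
          rec = provisional_store.pop(doc_key)                         # remove the provisional record
          regular_store[(doc_key, commit_hybrid_time)] = rec["value"]  # rewrite it at the commit time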

Once all participating tablets have successfully processed these “apply” requests, the status tablet can delete the transaction status record, because all replicas of participating tablets that have not yet cleaned up provisional records (e.g. slow followers) will do so based on information available locally within those tablets. The deletion of the status record happens by writing a special “applied everywhere” entry to the Raft log of the status tablet. Raft log entries belonging to this transaction will be cleaned up from the status tablet’s Raft log as part of regular garbage collection of old Raft logs soon after this point.

Read path overview

YugabyteDB is an MVCC database, which means it internally keeps track of multiple versions of the same value. Read operations don’t take any locks, and rely on the MVCC timestamp in order to read a consistent snapshot of the data. A long-running read operation, either single-shard or cross-shard, can proceed concurrently with write operations modifying the same key.
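
To make the MVCC read model concrete, here is a toy multi-version store; it is purely illustrative and is not how DocDB stores versions, but it shows how a read at a chosen timestamp can proceed without locks.

  import bisect

  class ToyMvccStore:
      def __init__(self):
          self.versions = {}    # key -> sorted list of (hybrid_time, value)

      def write(self, key, hybrid_time, value):
          bisect.insort(self.versions.setdefault(key, []), (hybrid_time, value))

      def read(self, key, ht_read):
          # Latest value written at or before ht_read; concurrent writers only
          # append newer versions and never disturb this snapshot.
          older = [v for ht, v in self.versions.get(key, []) if ht <= ht_read]
          return older[-1] if older else None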

In the ACID Transactions section, we talked about how up-to-date reads are performed from a single shard (tablet). In that case, the most recent value of a key is simply the value written by the last committed Raft log record that the Raft leader knows about. For reading multiple keys from different tablets, though, we have to make sure that the values we read come from a recent consistent snapshot of the database. Here is a clarification of these two properties of the snapshot we have to choose:

  • Consistent snapshot. The snapshot must show any transaction’s records fully, or not show them at all. It cannot contain half of the values written by a transaction and omit the other half. We ensure the snapshot is consistent by performing all reads at a particular hybrid time (we will call it ht_read), and ignoring any records with a higher hybrid time.

  • Recent snapshot. The snapshot includes any value that any client might have already seen, which means all values that were written or read before this read operation was initiated. This also includes all previously written values that other components of the client application might have written to or read from the database. The client performing the current read might rely on the presence of those values in the result set, because those other components of the client application might have communicated this data to the current client through asynchronous communication channels. We ensure the snapshot is recent by restarting the read operation when it is determined that the chosen hybrid time was too low, i.e. that there are some records that could have been written before the read operation was initiated but have a hybrid time higher than the currently chosen ht_read.

[Diagram: distributed transaction read path]

1. Client’s request handling and read transaction initialization

The client’s request (Cassandra, Redis, or PostgreSQL (beta)) arrives at the YQL engine of a tablet server. The YQL engine detects that the query requests rows from multiple tablets and starts a read-only transaction. A hybrid time ht_read is selected for the request, which could be either the current hybrid time on the YQL engine’s tablet server, or the safe time on one of the involved tablets. The latter case would reduce waiting for safe time for at least that tablet and is therefore better for performance. Typically, due to our load-balancing policy, the YQL engine receiving the request will also host some of the tablets that the request is reading, allowing the more performant second option to be implemented without an additional RPC round-trip.

We also select a point in time we call global_limit, computed as physical_time + max_clock_skew, which allows us to determine whether a particular record was definitely written after our read request started. max_clock_skew is a globally configured bound on clock skew between different YugabyteDB servers. (We’ve also designed an adaptive clock skew tracking algorithm that avoids the need to specify a global clock skew bound; it is part of YugabyteDB Enterprise Edition.)
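
A small numeric sketch of how these two points in time relate; all values are made up, and the helper names are not real YugabyteDB identifiers.

  MAX_CLOCK_SKEW = 500      # illustrative global clock skew bound

  def choose_read_times(physical_time):
      ht_read = physical_time               # or the safe time of one of the involved tablets
      global_limit = physical_time + MAX_CLOCK_SKEW
      return ht_read, global_limit

  ht_read, global_limit = choose_read_times(physical_time=100_000)
  # Any record with hybrid time > global_limit (here 100_500) was definitely written
  # after the read started; records in (ht_read, global_limit] are ambiguous (see below).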

2. Read from all tablets at the chosen hybrid time

The YQL engine sends requests to all tablets the transaction needs to read from. Each tablet waits for ht_read to become a safe time to read at according to our definition of safe time, and then starts executing its part of the read request from its local DocDB.

When a tablet server sees a relevant record with a hybrid time ht_record, it executes the following logic:

  • If ht_record ≤ ht_read, include the record in the result.
  • If ht_record > definitely_future_ht, exclude the record from the result. Here definitely_future_ht is a hybrid time such that any record with a higher hybrid time was definitely written after our read request started. For now, you can assume definitely_future_ht is simply global_limit; we will clarify exactly how it is computed in a moment.
  • If ht_read < ht_record ≤ definitely_future_ht, we don’t know whether this record was written before or after our read request started. But we can’t just omit it from the result, because if it was in fact written before the read request, this may produce a client-observed inconsistency. Therefore, we have to restart the entire read operation with an updated value of ht_read = ht_record.

To prevent an infinite loop of these read restarts, we also return a tablet-dependent hybrid time value local_limit_tablet to the YQL engine, computed as the current safe time in this tablet. We now know that any record (regular or provisional) written to this tablet with a hybrid time higher than local_limit_tablet could not possibly have been written before our read request started. Therefore, we won’t have to restart the read transaction if we see a record with a hybrid time higher than local_limit_tablet in a later attempt to read from this tablet within the same transaction, and we set definitely_future_ht = min(global_limit, local_limit_tablet) on future attempts.
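
The per-record decision described in the list above, together with the tightening of definitely_future_ht across restarts, can be summarized in a short sketch (illustrative only):

  def classify_record(ht_record, ht_read, definitely_future_ht):
      if ht_record <= ht_read:
          return "include"            # part of the chosen snapshot
      if ht_record > definitely_future_ht:
          return "exclude"            # definitely written after the read started
      return "restart"                # ambiguous: restart the read with ht_read = ht_record

  def next_definitely_future_ht(global_limit, local_limit_tablet):
      # On the first attempt definitely_future_ht is just global_limit; once a tablet
      # has reported its local limit, later attempts use the tighter bound.
      return min(global_limit, local_limit_tablet)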

3. Tablets query the transaction status

As each participating tablet reads from its local DocDB, it might encounter provisional records for which it does not yet know the final transaction status and commit time. In these cases, it would send a transaction status request to the transaction status tablet. If a transaction is committed, it is treated as if DocDB already contained permanent records with hybrid time equal to the transaction’s commit time. The cleanup of provisional records happens independently and asynchronously.
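
A sketch of that status check from the reading tablet’s point of view; the get_status helper and its return values are assumptions made for illustration, and pending/aborted handling is omitted.

  def committed_hybrid_time(provisional_record, status_tablet):
      # Ask the status tablet; returns the commit hybrid time if the owning
      # transaction committed, else None.
      status, commit_ht = status_tablet.get_status(provisional_record.txn_id)
      if status == "committed":
          # The record is then treated as a permanent record written at commit_ht.
          return commit_ht
      return None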

4. Tablets respond to the YQL engine

Each tablet’s response to the YQL engine contains the following (see the sketch after this list):

  • Whether or not read restart is required.
  • local_limit_tablet, to be used to restrict future read restarts caused by this tablet.
  • The actual values that have been read from this tablet.
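
Those three pieces of information might be modeled as a per-tablet response structure like the following; this is a hedged sketch, and the field names are not YugabyteDB’s actual wire format.

  from dataclasses import dataclass, field
  from typing import Any, Dict

  @dataclass
  class TabletReadResponse:
      restart_required: bool                                # the YQL engine must restart the read
      local_limit_tablet: int                               # caps future restarts caused by this tablet
      rows: Dict[str, Any] = field(default_factory=dict)    # the values actually read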

5. The YQL engine sends the response to the client

As soon as all read operations from all participating tablets succeed and it has been determined that there is no need to restart the read transaction, a response is sent to the client using the appropriate wire protocol (e.g. Cassandra, Redis, or PostgreSQL (beta)).

See also

See the Distributed ACID Transactions section to review some common concepts relevant to YugabyteDB’s implementation of distributed transactions.