RocksDB

RocksDB is a persistent key-value store for fast storage environments. Here are some of its highlight features:

  • RocksDB uses a log-structured database engine, written entirely in C++, for maximum performance. Keys and values are just arbitrarily-sized byte streams.
  • RocksDB is optimized for fast, low-latency storage such as flash drives and high-speed disk drives. RocksDB exploits the full potential of the high read/write rates offered by flash or RAM.
  • RocksDB is adaptable to different workloads. From database storage engines such as MyRocks to application data caching to embedded workloads, RocksDB can be used for a variety of data needs.
  • RocksDB provides basic operations such as opening and closing a database and reading and writing, as well as more advanced operations such as merging and compaction filters.

TiKV uses RocksDB because RocksDB is mature and high-performance. In this section, we will explore how TiKV uses RocksDB. We won't talk about basic features like Get, Put, Delete, and Iterate here because their usage is simple and clear and they work well. Instead, we'll focus on some of the special features used in TiKV below.

Prefix Bloom Filter

A Bloom Filter is a magical data structure that uses very few resources but helps a lot. We won't explain the whole algorithm here. If you are not familiar with Bloom Filters, you can think of one as a black box attached to a dataset, which can tell you whether a key probably exists or definitely does not exist without actually searching the dataset. Sometimes a Bloom Filter gives you a false-positive answer, although that rarely happens.

TiKV uses Bloom Filters as well as a variant called the Prefix Bloom Filter (PBF). Instead of telling you whether a whole key exists in a dataset, PBF tells you whether there are other keys with the same prefix. Since PBF only stores the unique prefixes instead of all unique whole keys, it can save some memory too, with the downside of a larger false positive rate.

TiKV supports MVCC, which means that there can be multiple versions of the same row stored in RocksDB. All versions of the same row share the same prefix (the row key) but have different timestamps as a suffix. When we want to read a row, we usually don't know the exact version to read, but only want the latest version at a specific timestamp. This is where PBF shines. PBF can filter out data that cannot contain any keys with the same prefix as the row key we provided. Then we just need to search the data that may contain different versions of the row key and locate the specific version we want.
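As a rough sketch of how this can be wired up with the RocksDB C++ API (the 8-byte prefix length, the database path, and the row key below are illustrative assumptions, not TiKV's actual key encoding):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

#include <memory>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Treat the first 8 bytes of every key as its prefix (the "row key" part).
  // The prefix length here is purely illustrative.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

  rocksdb::BlockBasedTableOptions table_options;
  // Build bloom filters over prefixes rather than whole keys.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  table_options.whole_key_filtering = false;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/pbf_demo", &db);
  if (!s.ok()) return 1;

  // Seeking with prefix_same_as_start lets the iterator use the prefix bloom
  // filter to skip SST files that cannot contain any version of this row.
  rocksdb::ReadOptions read_options;
  read_options.prefix_same_as_start = true;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->Seek("row-0001"); it->Valid(); it->Next()) {
    // Each entry seen here is one MVCC version of the row.
  }

  delete db;
  return 0;
}
```

With whole-key filtering disabled, the filter is built over the extracted prefixes, so a seek to a row key can skip every SST file that holds no version of that row at all.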

TableProperties

RocksDB allows us to register table properties collectors. When RocksDB builds an SST file, it passes the sorted key-values one by one to the callback of each collector so that we can collect whatever we want. Then, when the SST file is finished, the collected properties are stored inside the SST file too.
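A minimal sketch of such a collector with the RocksDB C++ API might look like the following; the property names and the statistics counted here are illustrative, not TiKV's actual collectors:

```cpp
#include <rocksdb/slice.h>
#include <rocksdb/status.h>
#include <rocksdb/table_properties.h>
#include <rocksdb/types.h>

#include <cstdint>
#include <string>

// Counts entries and tombstones in an SST file and stores the totals as
// user-collected properties persisted inside the file.
class SizePropertiesCollector : public rocksdb::TablePropertiesCollector {
 public:
  rocksdb::Status AddUserKey(const rocksdb::Slice& key,
                             const rocksdb::Slice& value,
                             rocksdb::EntryType type,
                             rocksdb::SequenceNumber /*seq*/,
                             uint64_t /*file_size*/) override {
    num_entries_ += 1;
    total_size_ += key.size() + value.size();
    if (type == rocksdb::kEntryDelete) num_deletes_ += 1;
    return rocksdb::Status::OK();
  }

  // Called when the SST file is finished; the returned map is written into
  // the file as table properties.
  rocksdb::Status Finish(rocksdb::UserCollectedProperties* props) override {
    (*props)["demo.num_entries"] = std::to_string(num_entries_);
    (*props)["demo.num_deletes"] = std::to_string(num_deletes_);
    (*props)["demo.total_size"] = std::to_string(total_size_);
    return rocksdb::Status::OK();
  }

  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return {};
  }

  const char* Name() const override { return "SizePropertiesCollector"; }

 private:
  uint64_t num_entries_ = 0;
  uint64_t num_deletes_ = 0;
  uint64_t total_size_ = 0;
};

class SizePropertiesCollectorFactory
    : public rocksdb::TablePropertiesCollectorFactory {
 public:
  rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
      rocksdb::TablePropertiesCollectorFactory::Context /*ctx*/) override {
    return new SizePropertiesCollector();
  }
  const char* Name() const override {
    return "SizePropertiesCollectorFactory";
  }
};

// Registration on the DB options:
//   options.table_properties_collector_factories.emplace_back(
//       std::make_shared<SizePropertiesCollectorFactory>());
```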

We use this feature to optimize two functionalities.

The first one is for Split Check. Split check is a worker that checks whether regions are large enough to split. In the original implementation, we had to scan all the data within a region to calculate its size, which was resource consuming. With the TableProperties feature, we record the size of small sub-ranges in each SST file so that we can calculate the approximate size of a region from the table properties without scanning any data at all.
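For illustration, assuming a collector like the sketch above has stored a "demo.total_size" property in each SST file, the approximate size of a key range could be read back without scanning, roughly like this:

```cpp
#include <rocksdb/db.h>

#include <cstdint>
#include <string>

// Estimate the data size of [start, end) from the properties persisted in the
// SST files overlapping the range, without scanning any keys. Since files that
// only partially overlap the range are counted whole, this is an approximation.
uint64_t ApproximateRangeSize(rocksdb::DB* db, const std::string& start,
                              const std::string& end) {
  rocksdb::Range range(start, end);
  rocksdb::TablePropertiesCollection collection;
  rocksdb::Status s = db->GetPropertiesOfTablesInRange(
      db->DefaultColumnFamily(), &range, 1, &collection);
  if (!s.ok()) return 0;

  uint64_t size = 0;
  for (const auto& entry : collection) {
    const auto& user_props = entry.second->user_collected_properties;
    auto it = user_props.find("demo.total_size");
    if (it != user_props.end()) size += std::stoull(it->second);
  }
  return size;
}
```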

Another one is for MVCC Garbage Collection (GC). GC is a process that cleans up garbage versions (versions older than the configured lifetime) of each row. If we have no idea whether a region contains garbage versions or not, we have to check all regions periodically. To skip unnecessary garbage collection, we record some MVCC statistics (e.g. the number of rows and the number of versions) in each SST file. So before checking a region row by row, we check its table properties to see whether garbage collection is necessary for that region at all.

CompactRange

From time to time, some regions may contain a lot of tombstone entries because of GC or other delete operations. Tombstone entries are not good for scan performance and waste disk space as well.

So with the TableProperties feature, we can check every region periodically to see if it contains a lot of tombstones. If it does, we will compact the region's range manually to drop tombstone entries and release disk space.
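Such a manual compaction is issued with CompactRange. A small sketch with hypothetical range bounds, forcing the bottommost level to be rewritten so that stale entries are actually dropped:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Manually compact a key range so tombstones in it are dropped and the disk
// space is reclaimed. The range bounds come from the caller (e.g. a region).
rocksdb::Status CompactRegionRange(rocksdb::DB* db,
                                   const rocksdb::Slice& start,
                                   const rocksdb::Slice& end) {
  rocksdb::CompactRangeOptions options;
  // Rewrite the bottommost level even if the data already lives there, so
  // stale entries and tombstones are really removed.
  options.bottommost_level_compaction =
      rocksdb::BottommostLevelCompaction::kForce;
  return db->CompactRange(options, &start, &end);
}
```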

We also use CompactRange to recover RocksDB from some mistakes, like incompatible table properties across different TiKV versions.

EventListener

EventListener allows us to listen for some special events, like flushes, compactions, or write stall condition changes. When a specific event is triggered or finished, RocksDB will invoke our callbacks with some information about the event.

TiKV listens to compaction events to observe region size changes. As mentioned above, we calculate the approximate size of each region from the table properties. The approximate size is recorded in memory so that we don't need to calculate it again and again if nothing has changed. However, during compactions, some entries are dropped, so the approximate size of some regions should be updated. That's why we listen to compaction events and recalculate the approximate size of the affected regions when necessary.
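A minimal sketch of such a listener: it only logs the finished compaction, and a real implementation (like TiKV's) would use the job information to invalidate or update its cached region sizes.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/listener.h>

#include <iostream>

// Invoked by RocksDB when a compaction job finishes.
class CompactionSizeListener : public rocksdb::EventListener {
 public:
  void OnCompactionCompleted(rocksdb::DB* /*db*/,
                             const rocksdb::CompactionJobInfo& info) override {
    std::cout << "compaction finished on cf " << info.cf_name
              << ", input files: " << info.input_files.size()
              << ", output files: " << info.output_files.size() << std::endl;
    // Here one could mark the overlapping regions as needing a size
    // recalculation from the table properties.
  }
};

// Registration on the DB options:
//   options.listeners.emplace_back(std::make_shared<CompactionSizeListener>());
```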

IngestExternalFile

RocksDB allows us to generate an SST file outside the database and then ingest the file into RocksDB directly. This feature can potentially save a lot of IO because RocksDB is smart enough to ingest a file to the bottom level if possible, which reduces write amplification because the ingested file doesn't need to be compacted again and again.
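A sketch with the C++ API, using SstFileWriter to build the file and IngestExternalFile to ingest it; the file path and key-values are placeholders:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/sst_file_writer.h>

#include <string>

// Build an SST file outside the database and ingest it directly.
// Keys must be added to the writer in sorted order.
rocksdb::Status BuildAndIngest(rocksdb::DB* db,
                               const rocksdb::Options& options) {
  const std::string sst_path = "/tmp/external.sst";

  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
  rocksdb::Status s = writer.Open(sst_path);
  if (!s.ok()) return s;
  s = writer.Put("k1", "v1");
  if (!s.ok()) return s;
  s = writer.Put("k2", "v2");
  if (!s.ok()) return s;
  s = writer.Finish();
  if (!s.ok()) return s;

  rocksdb::IngestExternalFileOptions ingest_options;
  // Move the file into the DB directory instead of copying it when possible.
  ingest_options.move_files = true;
  return db->IngestExternalFile({sst_path}, ingest_options);
}
```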

We use this feature to handle Raft snapshots. For example, when we want to add a replica on a new server, we can first generate a snapshot file from another server and then send the file to the new server. The new server can then ingest that file into its RocksDB directly, which saves a lot of work!

We also use this feature to import a huge amount of data into TiKV. We have tools to generate sorted SST files from different data sources and then ingest those files into different TiKV servers. This is super fast compared to writing key-values to a TiKV cluster in the usual way.

DeleteFilesInRange

Previously, TiKV deleted a range of data in the straightforward way: scan all the keys in the range and then delete them one by one. However, the disk space would not be released until the tombstones had been compacted away. Even worse, disk space usage would actually increase temporarily because of the newly written tombstones.

As time goes on, users store more and more data in TiKV until their disk space is insufficient. Then users try to drop some tables or add more stores and expect disk space usage to decrease in a short time, but TiKV didn't meet that expectation with this method. We first tried to solve this by using the DeleteRange feature in RocksDB. However, DeleteRange turned out to be unstable and could not release disk space fast enough.

A faster way to release disk space is to delete some files directly, which leads us to the DeleteFilesInRange feature. But this feature is not perfect: it is quite dangerous because it breaks snapshot consistency. If you acquire a snapshot from RocksDB, use DeleteFilesInRange to delete some files, and then try to read that data, you will find that some of it is missing. So we should use this feature carefully.
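A sketch of how the feature is invoked; the range bounds are hypothetical, and this must only be run on a range that will never be read again:

```cpp
#include <rocksdb/convenience.h>
#include <rocksdb/db.h>

// Quickly reclaim the space of a key range that will never be read again by
// dropping whole SST files contained in it.
rocksdb::Status DestroyRange(rocksdb::DB* db, const rocksdb::Slice& start,
                             const rocksdb::Slice& end) {
  // Dangerous: files are removed immediately, so any snapshot or reader that
  // still covers this range will see data disappear.
  rocksdb::Status s = rocksdb::DeleteFilesInRange(
      db, db->DefaultColumnFamily(), &start, &end);
  if (!s.ok()) return s;

  // Files that only partially overlap the range are not removed, so the keys
  // remaining at the range boundaries still have to be deleted separately.
  return rocksdb::Status::OK();
}
```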

TiKV uses DeleteFilesInRange to destroy tombstone regions and to GC dropped tables. Both cases have the prerequisite that the dropped range must not be accessed anymore.