Iterator: iterators are used by users to query keys in a range in sorted order.

    Point lookup: In RocksDB, point lookup means reading one key using Get() or MultiGet().

    Range lookup: Range lookup means reading a range of keys using an Iterator.

    SST File (Data file / SST table): SST stands for Sorted Sequence Table. SST files are persistent files storing data. In a file, keys are usually organized in sorted order so that a key or an iterating position can be located through a binary search.
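As a toy illustration (plain Python, not RocksDB code), binary search over a sorted key list is enough to serve both a point lookup and the seek that positions an iterator for a range scan:

```python
import bisect

# A toy "SST file": keys kept in sorted order, paired with values.
keys = ["apple", "banana", "cherry", "grape", "melon"]
values = ["1", "2", "3", "4", "5"]

def point_lookup(key):
    """Find one key via binary search; None if absent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

def seek(key):
    """Position at the first key >= target, as an iterator seek would."""
    return bisect.bisect_left(keys, key)

print(point_lookup("cherry"))  # -> 3
i = seek("c")
print(keys[i:])                # keys >= "c", in sorted order
```

The same idea applies at two granularities in the real format: the index block locates the data block, and a second binary search inside the block locates the key.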

    Index: the index on the data blocks in an SST file. It is persisted as an index block in the SST file. The default index format is the binary search index.

    Partitioned Index: the binary search index block, partitioned into multiple smaller blocks.

    LSM-tree: the log-structured merge-tree. RocksDB is an LSM-tree-based storage engine.

    Write-Ahead-Log (WAL) or log: a log file used during DB recovery to recover data that was not yet flushed to SST files.

    memtable / write buffer: the in-memory data structure that stores the most recent updates to the database. It is usually organized in sorted order and includes a binary-searchable index.

    memtable switch: During this process, the current active memtable (the one current writes go to) is closed and turned into an immutable memtable. At the same time, we will close the current WAL file and start a new one.

    immutable memtable: A closed memtable that is waiting to be flushed.

    sequence number (SeqNum / Seqno): each write to the database is assigned an auto-incremented ID number. The number is attached to the key-value pair in WAL files, memtables, and SST files. Sequence numbers are used to implement snapshot reads, garbage collection in compactions, MVCC in transactions, and some other purposes.
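A toy sketch (plain Python, not RocksDB code) of how sequence numbers enable snapshot reads: every write gets an auto-incremented seqno, a snapshot is essentially a remembered seqno, and a read at snapshot S sees the newest version of each key whose seqno is <= S:

```python
class ToyDB:
    def __init__(self):
        self.seqno = 0
        self.records = []  # list of (key, seqno, value)

    def put(self, key, value):
        self.seqno += 1                      # auto-incremented write ID
        self.records.append((key, self.seqno, value))

    def snapshot(self):
        return self.seqno                    # a snapshot is just a seqno

    def get(self, key, snapshot=None):
        limit = snapshot if snapshot is not None else self.seqno
        best = None
        for k, s, v in self.records:
            if k == key and s <= limit and (best is None or s > best[0]):
                best = (s, v)                # newest version visible at `limit`
        return best[1] if best else None

db = ToyDB()
db.put("k", "v1")
snap = db.snapshot()
db.put("k", "v2")
print(db.get("k", snap))  # -> v1 (snapshot view)
print(db.get("k"))        # -> v2 (latest)
```

The same visibility rule is what lets compaction garbage-collect old versions: a version can be dropped once no live snapshot can still see it.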

    recovery: the process of restarting a database after it failed or was closed.

    flush: background jobs that write out data in memtables into SST files.

    compaction: background jobs that merge some SST files into some other SST files. LevelDB’s compaction also includes flush; in RocksDB, the two are further distinguished.
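A toy sketch of the merging step (plain Python, not RocksDB code): input files are sorted lists of (key, seqno, value), and the output keeps keys sorted while, assuming no snapshot still needs older seqnos, only the newest version of each key survives:

```python
import heapq

# Two toy sorted "SST files" of (key, seqno, value) records.
file1 = [("a", 7, "a2"), ("c", 5, "c1")]
file2 = [("a", 3, "a1"), ("b", 4, "b1")]

def compact(*files):
    # Merge by key; within a key, newer (higher) seqnos come first.
    merged = heapq.merge(*files, key=lambda r: (r[0], -r[1]))
    out, last_key = [], None
    for key, seqno, value in merged:
        if key != last_key:          # first record per key is the newest
            out.append((key, seqno, value))
            last_key = key           # older versions are garbage-collected
    return out

print(compact(file1, file2))
# -> [('a', 7, 'a2'), ('b', 4, 'b1'), ('c', 5, 'c1')]
```

Real compactions must additionally honor live snapshots, tombstones, merge operands, and compaction filters, but the sorted k-way merge is the core.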

    (LSM) level: a logical organization of the DB’s physical data for maintaining the desired LSM-tree shape and structure.

    Leveled Compaction or Level-Based Compaction Style: the default compaction style of RocksDB.

    Universal Compaction Style: an alternative compaction algorithm.

    Comparator: a plug-in class that defines the order of keys.

    Column Family: a column family is a separate key space in one DB. In spite of the misleading name, it has nothing to do with the “column family” concept in other storage systems; RocksDB doesn’t even have the concept of a “column”.

    snapshot: a logically consistent point-in-time view of a running DB.

    checkpoint: A checkpoint is a physical mirror of the database in another directory in the file system.

    backup: RocksDB has a backup tool to help users back up the DB state to a different location, such as HDFS.

    Version: an internal concept of RocksDB. A version consists of all the live SST files at one point in time. Once a flush or compaction finishes, a new “version” is created because the list of live SST files has changed. An old “version” can continue to be used by ongoing read requests or compaction jobs. Old versions are eventually garbage-collected.

    Super Version: an internal concept of RocksDB. A super version consists of the list of SST files (a “version”) and the list of live memtables at one point in time. A compaction, a flush, or a memtable switch will cause a new “super version” to be created. An old “super version” can continue to be used by ongoing read requests, and is eventually garbage-collected once it is no longer needed.

    block cache: an in-memory data structure that caches hot data blocks from SST files.

    Statistics: an in-memory data structure that contains cumulative stats of live databases.

    perf context: an in-memory data structure used to measure thread-local stats, usually per-query stats.

    DB properties: running status of the database that can be returned by the function DB::GetProperty().

    Table Properties: metadata stored in each SST file. It includes system properties generated by RocksDB and user-defined table properties computed by user-defined callbacks.

    write stall: when flush or compaction is backlogged, RocksDB may actively slow down writes to make sure flush and compaction can catch up.

    bloom filter: a probabilistic data structure used to test whether a key may be present in an SST file or memtable without reading it. It can return false positives but never false negatives.

    prefix bloom filter: a special bloom filter that can be used, with limitations, in iterators. Some file reads are avoided if an SST file or memtable does not contain the prefix of the lookup key, as extracted by the prefix extractor.

    prefix extractor: a callback class that extracts the prefix part of a key. It is most frequently used to derive the prefix for the prefix bloom filter.
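A toy sketch of the two working together (plain Python, not RocksDB code; a real bloom filter is probabilistic, so a plain set stands in for it here). A fixed-length prefix extractor is assumed; the filter stores the prefixes of a file's keys, letting a read skip the file when the lookup prefix cannot match:

```python
def prefix_extractor(key, length=3):
    """Assumed fixed-length prefix transform (hypothetical choice)."""
    return key[:length]

# Build the per-file prefix "filter" from the file's keys.
file_keys = ["usr001", "usr002", "img777"]
prefix_filter = {prefix_extractor(k) for k in file_keys}

def may_contain_prefix(lookup_key):
    """True if the file might hold keys sharing the lookup key's prefix."""
    return prefix_extractor(lookup_key) in prefix_filter

print(may_contain_prefix("usr999"))  # -> True  (must read the file)
print(may_contain_prefix("log123"))  # -> False (file read avoided)
```

Note the limitation this illustrates: the filter can only answer prefix-scoped questions, which is why prefix bloom filters constrain how iterators may be used.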

    block-based bloom filter or full bloom filter: two different approaches to storing bloom filters in SST files. Both are features of the block-based table format.

    Partitioned Filters: a full bloom filter partitioned into multiple smaller blocks.

    compaction filter: a user plug-in that can modify or drop existing keys during a compaction.
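A toy sketch of the idea (plain Python, not the RocksDB CompactionFilter API): a user callback is invoked for each key as compaction rewrites it and may keep, change, or drop the entry. The "expired" convention below is a hypothetical example policy:

```python
def compaction_filter(key, value):
    """Return (keep, new_value); keep=False drops the key during compaction."""
    if value == "expired":       # hypothetical user policy: drop stale entries
        return False, None
    return True, value

entries = [("a", "1"), ("b", "expired"), ("c", "3")]
compacted = [(k, compaction_filter(k, v)[1])
             for k, v in entries
             if compaction_filter(k, v)[0]]
print(compacted)  # -> [('a', '1'), ('c', '3')]
```

This is a common way to implement TTL-style expiry: deletion costs nothing extra because the data was being rewritten by compaction anyway.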

    merge operator: RocksDB supports a special Merge() operation, which writes a delta record (a merge operand) against the existing value. The merge operator is a user-defined callback class that can merge the merge operands.
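A toy sketch of what the callback does (plain Python, not the RocksDB MergeOperator API), assuming simple integer-counter semantics: Merge() queues operands cheaply, and the user-defined merge folds them into the base value later, during reads or compaction:

```python
def counter_merge(existing_value, operands):
    """Fold queued merge operands into the existing value (toy counter)."""
    total = int(existing_value) if existing_value is not None else 0
    for op in operands:          # each operand is a delta written by Merge()
        total += int(op)
    return str(total)

# Base value written with Put(), then two Merge() operands queued:
print(counter_merge("10", ["5", "2"]))  # -> 17
# Merge() against a key that was never Put() also works:
print(counter_merge(None, ["3"]))       # -> 3
```

The design point: without a merge operator, incrementing a counter requires a read-modify-write; with one, each increment is a blind write and the fold happens lazily.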

    Block-Based Table: The default SST file format.

    Block: data block of SST files. In block-based table SST files, a block is always checksummed and usually compressed for storage.

    PlainTable: an alternative SST file format, optimized for ramfs.

    forward iterator / tailing iterator: A special iterator option that optimizes for very specific use cases.

    single delete: a special delete operation that only works when users never update an existing key.

    rate limiter: used to limit the rate of bytes written to the file system by flush and compaction.

    Pessimistic Transactions: using locks to provide isolation between multiple concurrent transactions. The default write policy is WriteCommitted.

    2PC (Two-phase commit): pessimistic transactions can commit in two phases: first Prepare, then the actual Commit.

    WriteCommitted: the default write policy in pessimistic transactions, which buffers the writes in memory and writes them into the DB upon commit of the transaction.

    WritePrepared: a write policy in pessimistic transactions that buffers the writes in memory and writes them into the DB upon Prepare if it is a 2PC transaction, or upon Commit otherwise.

    WriteUnprepared: a write policy in pessimistic transactions that avoids the need for large memory buffers by writing data into the DB as it is sent by the transaction.