Overview

RocksDB is file system and storage medium agnostic. File system operations are not atomic and are susceptible to inconsistencies in the event of a system failure. Even with journaling turned on, file systems do not guarantee consistency after an unclean restart. POSIX file systems do not support atomic batching of operations either. Hence, it is not possible to rely on metadata embedded in RocksDB datastore files to reconstruct the last consistent state of RocksDB on restart.

RocksDB has a built-in mechanism to overcome these limitations of POSIX file systems: it keeps a transactional log of RocksDB state changes called the MANIFEST. The MANIFEST is used to restore RocksDB to the latest known consistent state on restart.

Terminology

  • MANIFEST refers to the system that keeps track of RocksDB state changes in a transactional log
  • Manifest log refers to an individual log file that contains RocksDB state snapshot/edits
  • CURRENT refers to the file that points to the latest manifest log

How does it work?

MANIFEST is a transactional log of RocksDB state changes. It consists of manifest log files and a pointer to the latest manifest file. Manifest logs are rolling log files named MANIFEST-(seq number). The sequence number is always increasing. CURRENT is a special file that identifies the latest manifest log file.

On system (re)start, the latest manifest log contains the consistent state of RocksDB. Any subsequent change to RocksDB state is logged to the manifest log file. When a manifest log file exceeds a certain size, a new manifest log file is created with a snapshot of the RocksDB state. The latest manifest file pointer is then updated and the file system is synced. After CURRENT has been updated successfully, the redundant manifest logs are purged.

  MANIFEST = { CURRENT, MANIFEST-<seq-no>* }
  CURRENT = File pointer to the latest manifest log
  MANIFEST-<seq-no> = Contains snapshot of RocksDB state and subsequent modifications
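The pointer update described above can be sketched as follows. This is a minimal illustration, not RocksDB's actual implementation: it assumes CURRENT is a small text file holding the manifest file name followed by a newline, and the helper names `set_current` and `read_current` as well as the temporary file name are hypothetical.

```python
import os


def set_current(db_dir: str, manifest_name: str) -> None:
    """Point CURRENT at manifest_name via write-temp, sync, rename."""
    tmp_path = os.path.join(db_dir, "CURRENT.tmp")  # hypothetical temp name
    with open(tmp_path, "w", encoding="ascii") as f:
        f.write(manifest_name + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the new pointer durable before the swap
    # os.replace is an atomic rename on POSIX: a reader sees either the old
    # or the new CURRENT, never a partially written one.
    os.replace(tmp_path, os.path.join(db_dir, "CURRENT"))


def read_current(db_dir: str) -> str:
    """Return the path of the manifest log that CURRENT points to."""
    with open(os.path.join(db_dir, "CURRENT"), "r", encoding="ascii") as f:
        name = f.read().strip()
    if not name.startswith("MANIFEST-"):
        raise ValueError(f"unexpected CURRENT content: {name!r}")
    return os.path.join(db_dir, name)
```

Only after `set_current` returns would it be safe, in this sketch, to delete older manifest logs, since a crash before the rename leaves the previous pointer intact.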

Version Edit

The state of RocksDB at a given point in time is referred to as a version (aka snapshot). Any modification to a version is considered a version edit. A version (or RocksDB state snapshot) is constructed by applying a sequence of version edits in order. Essentially, a manifest log file is a sequence of version edits.

  version-edit      = Any RocksDB state change
  version           = { version-edit* }
  manifest-log-file = { version, version-edit* }
                    = { version-edit* }
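As a rough illustration of "version = sequence of version edits", the sketch below replays edits over an empty state. The `VersionEdit` and `Version` classes here are hypothetical and heavily simplified: they only track which (level, file number) pairs are live, whereas real edits carry many more fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VersionEdit:
    """Hypothetical, simplified edit: files added and deleted."""
    added: tuple = ()    # (level, file_number) pairs
    deleted: tuple = ()  # (level, file_number) pairs


@dataclass
class Version:
    """Hypothetical snapshot: the set of live (level, file_number) pairs."""
    files: set = field(default_factory=set)

    def apply(self, edit: VersionEdit) -> "Version":
        # Produce a new snapshot; the previous one is left untouched.
        return Version((self.files - set(edit.deleted)) | set(edit.added))


def replay(edits) -> Version:
    """Rebuild the latest version by applying edits in log order."""
    version = Version()
    for edit in edits:
        version = version.apply(edit)
    return version
```

For example, replaying two flushes to level 0 followed by a compaction that deletes both files and adds one file at level 1 leaves only the level 1 file live.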

Version Edit Layout

A manifest log is a sequence of version edit records. The type of each version edit record is identified by its edit identification number.

We use the following datatypes for encoding/decoding.

Data Types

Simple data types

  VarX   - Variable-length encoding of intX
  FixedX - Fixed-width encoding of intX
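The text does not spell out the byte layout of these types. The sketch below assumes the LevelDB-style scheme: a base-128 varint (7 payload bits per byte, least significant group first, high bit set on every byte except the last) and little-endian fixed-width integers. The function names are illustrative.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint."""
    if value < 0:
        raise ValueError("varint encodes unsigned integers only")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits + continuation flag
        value >>= 7
    out.append(value)  # final byte has the high bit clear
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0, max_bits: int = 64):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while shift < max_bits:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7
    raise ValueError("varint too long")


def encode_fixed32(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in 4 little-endian bytes."""
    return struct.pack("<I", value)
```

Under these assumptions the value 300 takes two bytes as a varint (`0xAC 0x02`) and always four bytes as a Fixed32, which is why small numbers such as levels and record IDs are stored as varints.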

Complex data types

  String - Length prefixed string data
  +-----------+--------------------+
  | size (n)  | content of string  |
  +-----------+--------------------+
  |<- Var32 ->|<------- n -------->|
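A length-prefixed string as laid out above could be written and read like this. The sketch assumes the Var32 prefix is a base-128 varint; `put_string` and `get_string` are illustrative names.

```python
def encode_varint32(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint (assumed layout)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint32(buf: bytes, pos: int):
    """Decode a varint at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def put_string(data: bytes) -> bytes:
    """Prefix data with its length encoded as a varint."""
    return encode_varint32(len(data)) + data


def get_string(buf: bytes, pos: int = 0):
    """Read a length-prefixed string at pos; return (data, next_pos)."""
    size, pos = decode_varint32(buf, pos)
    end = pos + size
    if end > len(buf):
        raise ValueError("truncated string")
    return buf[pos:end], end
```

Because the length comes first, a decoder can skip over a string without inspecting its content, which is what makes variable-size keys practical inside a record.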

Version Edit Record Format

Version edit records have the following format. The decoder identifies the record type using the record identification number.

  +-------------+------ ......... ----------+
  | Record ID   | Variable size record data |
  +-------------+------ ......... ----------+
  |<-- Var32 -->|<---- varies by type ----->|
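A decoder for this format reads a record ID, then hands the following bytes to a type-specific payload decoder. The sketch below covers only four record types. The numeric record IDs are assumptions (the text does not list them) and should be checked against db/version_edit.h; the function and table names are illustrative.

```python
# Assumed record IDs; verify against db/version_edit.h for your release.
K_COMPARATOR = 1
K_LOG_NUMBER = 2
K_NEXT_FILE_NUMBER = 3
K_LAST_SEQUENCE = 4


def get_varint(buf: bytes, pos: int):
    """Decode a base-128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def get_string(buf: bytes, pos: int):
    """Decode a Var32 length followed by that many bytes."""
    size, pos = get_varint(buf, pos)
    if pos + size > len(buf):
        raise ValueError("truncated string")
    return buf[pos:pos + size], pos + size


# Record ID -> (field name, payload decoder)
DECODERS = {
    K_COMPARATOR: ("comparator", get_string),
    K_LOG_NUMBER: ("log_number", get_varint),
    K_NEXT_FILE_NUMBER: ("next_file_number", get_varint),
    K_LAST_SEQUENCE: ("last_sequence", get_varint),
}


def decode_records(buf: bytes) -> dict:
    """Decode consecutive (record ID, payload) pairs from buf."""
    fields, pos = {}, 0
    while pos < len(buf):
        record_id, pos = get_varint(buf, pos)
        if record_id not in DECODERS:
            raise ValueError(f"unknown record ID {record_id}")
        name, decode_payload = DECODERS[record_id]
        fields[name], pos = decode_payload(buf, pos)
    return fields
```

A table-driven dispatch keeps the loop independent of the individual record layouts, so supporting another record type only means adding one table entry.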

Version Edit Record Types and Layout

There is a variety of edit records, corresponding to different state changes of RocksDB.

Comparator edit record:

  Captures the comparator name
  +-------------+----------------+
  | kComparator | data           |
  +-------------+----------------+
  |<-- Var32 -->|<--- String --->|

Log number edit record:

  Latest WAL log file number
  +-------------+----------------+
  | kLogNumber  | log number     |
  +-------------+----------------+
  |<-- Var32 -->|<--- Var64 ---->|

Previous File Number edit record:

  Previous manifest file number
  +------------------+----------------+
  | kPrevFileNumber  | log number     |
  +------------------+----------------+
  |<---- Var32 ----->|<--- Var64 ---->|

Next File Number edit record:

  Next manifest file number
  +------------------+----------------+
  | kNextFileNumber  | file number    |
  +------------------+----------------+
  |<---- Var32 ----->|<--- Var64 ---->|

Last Sequence Number edit record:

  Last sequence number of RocksDB
  +------------------+-----------------+
  | kLastSequence    | sequence number |
  +------------------+-----------------+
  |<---- Var32 ----->|<---- Var64 ---->|

Max Column Family edit record:

  Adjust the maximum number of column families allowed.
  +---------------------+-------------------+
  | kMaxColumnFamily    | max column family |
  +---------------------+-------------------+
  |<------ Var32 ------>|<----- Var32 ----->|

Deleted File edit record:

  Mark a file as deleted from database.
  +-----------------+-------------+--------------+
  | kDeletedFile    | level       | file number  |
  +-----------------+-------------+--------------+
  |<---- Var32 ---->|<-- Var32 -->|<-- Var64 --->|
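As a worked instance of a multi-field record, the deleted file record above could be encoded and decoded as follows. The record ID value 6 is an assumption to be checked against db/version_edit.h, and the varint layout is assumed to be base-128; the helper names are illustrative.

```python
K_DELETED_FILE = 6  # assumed record ID; verify against db/version_edit.h


def put_varint(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def get_varint(buf: bytes, pos: int):
    """Decode a base-128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_deleted_file(level: int, file_number: int) -> bytes:
    """Record ID (Var32), level (Var32), file number (Var64)."""
    return (put_varint(K_DELETED_FILE)
            + put_varint(level)
            + put_varint(file_number))


def decode_deleted_file(buf: bytes, pos: int = 0):
    """Return ((level, file_number), next_pos)."""
    record_id, pos = get_varint(buf, pos)
    if record_id != K_DELETED_FILE:
        raise ValueError(f"not a deleted file record: {record_id}")
    level, pos = get_varint(buf, pos)
    file_number, pos = get_varint(buf, pos)
    return (level, file_number), pos
```

With these assumptions, deleting file 300 from level 2 costs four bytes on disk.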

New File edit record:

Mark a file as newly added to the database and provide RocksDB meta information.

  • File edit record with compaction information

  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  | kNewFile4    | level       | file number  | file size  | smallest_key   | largest_key  | smallest_seqno | largest_seq_no |
  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  |<--- Var32 -->|<-- Var32 -->|<--- Var64 -->|<-- Var64 ->|<--- String --->|<-- String -->|<---- Var64 --->|<---- Var64 --->|
  +--------------+------------------+---------+------+----------------+--------------------+---------+------------+
  | CustomTag1   | Field 1 size n1  | field1  | ...  | CustomTag(m)   | Field m size n(m)  | field(m)| kTerminate |
  +--------------+------------------+---------+------+----------------+--------------------+---------+------------+
  |<--- Var32 -->|<----- Var32 ---->|<- n1 -->|      |<---- Var32 --->|<------ Var32 ----->|<- n(m)->|<-- Var32 ->|

Several optional customized fields can be written there. Each field's tag has a special bit indicating whether the field can be safely ignored. This is for compatibility reasons: an older RocksDB release may see a field it cannot identify. By checking the bit, RocksDB knows whether it should stop opening the DB or ignore the field.

The following optional customized fields are supported:

  • kNeedCompaction: Whether the file should be compacted to the next level.
  • kMinLogNumberToKeepHack: WAL file number that is still needed for recovery after this entry.
  • kPathId: The Path ID in which the file lives. This can't be ignored by an old release.
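The compatibility rule for customized fields can be sketched as a loop that reads (tag, size, payload) triples until kTerminate. The numeric tag values and the position of the "cannot be ignored" bit are assumptions here (the text names the tags but gives no numbers); check db/version_edit.h before relying on them. `parse_custom_fields` is an illustrative name.

```python
# Assumed custom tag values and mask; verify against db/version_edit.h.
K_TERMINATE = 1
K_NEED_COMPACTION = 2
K_MIN_LOG_NUMBER_TO_KEEP_HACK = 3
K_PATH_ID = 65
NOT_SAFE_TO_IGNORE_MASK = 1 << 6  # assumed: bit set means "must understand"

KNOWN_TAGS = {
    K_NEED_COMPACTION: "need_compaction",
    K_MIN_LOG_NUMBER_TO_KEEP_HACK: "min_log_number_to_keep",
    K_PATH_ID: "path_id",
}


def get_varint(buf: bytes, pos: int):
    """Decode a base-128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def parse_custom_fields(buf: bytes, pos: int = 0, known=None):
    """Return ({field name: raw payload}, next_pos)."""
    known = KNOWN_TAGS if known is None else known
    fields = {}
    while True:
        tag, pos = get_varint(buf, pos)
        if tag == K_TERMINATE:
            return fields, pos
        size, pos = get_varint(buf, pos)
        if pos + size > len(buf):
            raise ValueError("truncated custom field")
        payload, pos = buf[pos:pos + size], pos + size
        if tag in known:
            fields[known[tag]] = payload
        elif tag & NOT_SAFE_TO_IGNORE_MASK:
            # An unknown field that must be understood: refuse to continue.
            raise ValueError(f"cannot ignore unknown custom tag {tag}")
        # Otherwise the field is unknown but ignorable: skip it silently.
```

Passing a smaller `known` table simulates an older release: under the assumed values, such a reader skips an unrecognized low-numbered tag but refuses a file carrying kPathId, matching the rule that the path ID cannot be ignored.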

  • File edit record backward compatible

  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  | kNewFile2    | level       | file number  | file size  | smallest_key   | largest_key  | smallest_seqno | largest_seq_no |
  +--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
  |<--- Var32 -->|<-- Var32 -->|<--- Var64 -->|<-- Var64 ->|<--- String --->|<-- String -->|<---- Var64 --->|<---- Var64 --->|

  • File edit record with path information

  +--------------+-------------+--------------+-------------+-------------+----------------+--------------+
  | kNewFile3    | level       | file number  | Path ID     | file size   | smallest_key   | largest_key  |
  +--------------+-------------+--------------+-------------+-------------+----------------+--------------+
  |<--- Var32 -->|<-- Var32 -->|<--- Var64 -->|<-- Var32 -->|<-- Var64 -->|<--- String --->|<-- String -->|
  +----------------+----------------+
  | smallest_seqno | largest_seq_no |
  +----------------+----------------+
  |<---- Var64 --->|<---- Var64 --->|

Column family status edit record:

  Note the status of column family feature (enabled/disabled)
  +------------------+----------------+
  | kColumnFamily    | 0/1            |
  +------------------+----------------+
  |<---- Var32 ----->|<--- Var32 ---->|

Column family add edit record:

  Add a column family
  +---------------------+----------------+
  | kColumnFamilyAdd    | cf name        |
  +---------------------+----------------+
  |<------ Var32 ------>|<--- String --->|

Column family drop edit record:

  Drop a column family
  +---------------------+
  | kColumnFamilyDrop   |
  +---------------------+
  |<------ Var32 ------>|

Record as part of an atomic group (since RocksDB 5.16):

There are cases in which an 'all-or-nothing', multi-column-family version change is desirable. For example, atomic flush ensures that either all or none of the column families get flushed successfully, and external SST ingestion into multiple column families guarantees that either all or none of the column families ingest SSTs successfully. Since writing multiple version edits is not atomic, extra measures are needed to achieve atomicity (not necessarily instantaneity from the user's perspective). Therefore a new record field, kInAtomicGroup, indicates that a record is part of a group of version edits that follow the 'all-or-nothing' property. The format is as follows.

  +-----------------+--------------------------------------------+
  | kInAtomicGroup  | #remaining version edits in the same group |
  +-----------------+--------------------------------------------+
  |<--- Var32 ----->|<----------------- Var32 ------------------>|

During recovery, RocksDB buffers version edits of an atomic group without applying them until the last version edit of the atomic group is decoded successfully from the MANIFEST file. Then RocksDB applies all the version edits in this atomic group. RocksDB never applies partial atomic groups.
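The buffering rule can be sketched as follows. The representation is hypothetical: each decoded edit is modeled as a `(payload, remaining)` pair, where `remaining` is `None` for an ordinary edit and otherwise the kInAtomicGroup count. Treating an ordinary edit inside an unfinished group as an error is this sketch's own choice, not something the text specifies.

```python
def recover(edits):
    """Return the payloads that would be applied, in order.

    edits: iterable of (payload, remaining) pairs as described above.
    """
    applied = []
    pending = []  # edits of the atomic group currently being read
    for payload, remaining in edits:
        if remaining is None:
            if pending:
                raise ValueError("ordinary edit inside an atomic group")
            applied.append(payload)
            continue
        pending.append(payload)
        if remaining == 0:
            # Last edit of the group decoded: apply the whole group at once.
            applied.extend(pending)
            pending = []
    # Anything left in `pending` belongs to a group whose tail never made
    # it to the log; it is discarded rather than partially applied.
    return applied
```

If the log ends after the first edit of a two-edit group, for instance because of a crash mid-write, that edit is dropped and the recovered state is the one before the group started.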

Version Edit ignorable record types

A special bit is reserved in the record type. If the bit is set, the record can be safely ignored. Safely ignorable records have a standard general format:

  +---------+----------------+----------------+
  | kTag    | field length n | fields ...     |
  +---------+----------------+----------------+
  |<-Var32->|<--- Var32 ---->|<----- n ------>|

This was introduced in 6.0, and no customized ignorable record has been created yet.
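Because every ignorable record carries its own length, a reader that does not recognize the tag can step over it. The sketch below shows that skip. The position of the reserved bit (bit 13) is an assumption to be verified against db/version_edit.h, and `skip_unknown_record` is an illustrative name.

```python
SAFE_IGNORE_MASK = 1 << 13  # assumed position of the "safely ignorable" bit


def put_varint(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def get_varint(buf: bytes, pos: int):
    """Decode a base-128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def skip_unknown_record(tag: int, buf: bytes, pos: int) -> int:
    """Skip the body of an unrecognized record; return the next position.

    pos must point just past the tag. Raises if the tag is not ignorable.
    """
    if not tag & SAFE_IGNORE_MASK:
        raise ValueError(f"unknown record type {tag} cannot be ignored")
    size, pos = get_varint(buf, pos)
    if pos + size > len(buf):
        raise ValueError("truncated ignorable record")
    return pos + size
```

A decoder would call this only after failing to match the tag against the record types it knows, then continue with the next record at the returned position.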