RocksDB provides APIs for creating SST files that can be ingested later. This is useful when data must be loaded quickly but can be generated offline.

Creating SST files

rocksdb::SstFileWriter can be used to create an SST file. After creating an SstFileWriter object, you can open a file, insert rows into it, and finish it.

This is an example of how to create an SST file at /home/usr/file1.sst:

  Options options;
  SstFileWriter sst_file_writer(EnvOptions(), options);
  // Path to where we will write the SST file
  std::string file_path = "/home/usr/file1.sst";
  // Open the file for writing
  Status s = sst_file_writer.Open(file_path);
  if (!s.ok()) {
    printf("Error while opening file %s, Error: %s\n", file_path.c_str(),
           s.ToString().c_str());
    return 1;
  }
  // Insert rows into the SST file, note that inserted keys must be
  // strictly increasing (based on options.comparator)
  for (...) {
    s = sst_file_writer.Put(key, value);
    if (!s.ok()) {
      printf("Error while adding Key: %s, Error: %s\n", key.c_str(),
             s.ToString().c_str());
      return 1;
    }
  }
  // Close the file
  s = sst_file_writer.Finish();
  if (!s.ok()) {
    printf("Error while finishing file %s, Error: %s\n", file_path.c_str(),
           s.ToString().c_str());
    return 1;
  }
  return 0;

Now we have our SST file located at /home/usr/file1.sst.

Please note that:

  • Options passed to SstFileWriter determine the table type, compression options, etc. that will be used to create the SST file.
  • The Comparator that is passed to the SstFileWriter must be exactly the same as the Comparator used in the DB that this file will be ingested into.
  • Rows must be inserted in a strictly increasing order.
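
Because rows must arrive in strictly increasing key order, input that is unsorted or contains duplicate keys has to be prepared before it is handed to SstFileWriter. The following is a minimal sketch of that preparation step using only the standard library. It assumes the default bytewise comparator, whose order is assumed here to match std::string's operator<. PrepareRows and the Row alias are hypothetical names for illustration and are not part of RocksDB. When a key appears more than once, this sketch keeps the last value supplied, which is one possible policy among several.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper types, not part of RocksDB.
using Row = std::pair<std::string, std::string>;  // (key, value)

// Returns the rows sorted by key in strictly increasing order.
// Assumption: keys are ordered bytewise, as std::string's operator< does.
// When a key appears more than once, the last occurrence in the input wins.
std::vector<Row> PrepareRows(std::vector<Row> rows) {
  // A stable sort keeps rows with equal keys in their input order.
  std::stable_sort(rows.begin(), rows.end(),
                   [](const Row& a, const Row& b) { return a.first < b.first; });
  std::vector<Row> out;
  for (Row& row : rows) {
    if (!out.empty() && out.back().first == row.first) {
      // A later duplicate replaces the value kept so far.
      out.back().second = std::move(row.second);
    } else {
      out.push_back(std::move(row));
    }
  }
  return out;
}
```

The returned rows can then be passed one by one to the Put() call in the loop of the example above.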

You can learn more about SstFileWriter by checking include/rocksdb/sst_file_writer.h.

Ingesting SST files

Ingesting SST files is simple: call DB::IngestExternalFile() and pass the file paths as a vector of std::string.

  IngestExternalFileOptions ifo;
  // Paths of the two SST files to ingest
  std::string file_path1 = "/home/usr/file1.sst";
  std::string file_path2 = "/home/usr/file2.sst";
  // Ingest the 2 passed SST files into the DB
  Status s = db_->IngestExternalFile({file_path1, file_path2}, ifo);
  if (!s.ok()) {
    printf("Error while adding file %s and %s, Error %s\n",
           file_path1.c_str(), file_path2.c_str(), s.ToString().c_str());
    return 1;
  }

You can learn more by checking DB::IngestExternalFile() and DB::IngestExternalFiles() in include/rocksdb/db.h. DB::IngestExternalFiles() ingests a collection of external SST files for multiple column families with an "all-or-nothing" guarantee. If the function returns Status::OK, all files were ingested successfully for all column families of interest. If the function returns a non-OK status, none of the files were ingested into any of the column families.

What happens when you ingest a file

When you call DB::IngestExternalFile(), RocksDB will:

  • Copy or link the file into the DB directory
  • Block (not skip) writes to the DB, because the DB state must stay consistent while the right sequence number is safely assigned to all the keys in the file being ingested
  • Flush the memtable if the file's key range overlaps with the memtable's key range
  • Assign the file to the best level possible in the LSM-tree
  • Assign the file a global sequence number
  • Resume writes to the DB

We pick the lowest level in the LSM-tree that satisfies these conditions:

  • The file can fit in the level
  • The file's key range does not overlap with any keys in upper layers
  • The file does not overlap with the outputs of running compactions going to this level
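
The level choice described above can be sketched as a toy model in plain C++. This is not RocksDB's implementation: KeyRange, Level, and PickLevel are hypothetical names, keys are compared bytewise, and overlap between key ranges is used as a conservative stand-in for overlap between individual keys. The sketch walks from level 0 downward and remembers the deepest level that is still acceptable.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy model only; these types are hypothetical, not RocksDB's.
struct KeyRange {
  std::string smallest;
  std::string largest;  // inclusive
};

struct Level {
  std::vector<KeyRange> files;               // ranges of files already in the level
  std::vector<KeyRange> compaction_outputs;  // ranges running compactions will write here
};

bool Overlaps(const KeyRange& a, const KeyRange& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

bool OverlapsAny(const std::vector<KeyRange>& ranges, const KeyRange& r) {
  return std::any_of(ranges.begin(), ranges.end(),
                     [&](const KeyRange& other) { return Overlaps(other, r); });
}

// Returns the index of the deepest level that can take the file.
// Level 0 is the fallback in this model.
int PickLevel(const std::vector<Level>& levels, const KeyRange& file) {
  int target = 0;
  for (int lvl = 0; lvl < static_cast<int>(levels.size()); ++lvl) {
    if (OverlapsAny(levels[lvl].files, file)) {
      // The file does not fit here, and for every deeper level this level
      // would be an overlapping upper layer, so stop searching.
      break;
    }
    if (OverlapsAny(levels[lvl].compaction_outputs, file)) {
      continue;  // a running compaction will write here; try deeper levels
    }
    target = lvl;
  }
  return target;
}
```

For example, with an empty level 0, a file covering a..c in level 1, and a file covering m..p in level 2, an ingested file covering e..g lands in level 2, while one covering n..o stops at level 1.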

Global sequence number

Files created using SstFileWriter have a special field in their metablock called the global sequence number. When this field is used, all the keys inside the file act as if they have that sequence number. When we ingest a file, we assign a sequence number to all the keys in it. Before RocksDB 5.16, RocksDB always updated this global sequence number field in the metablock of the SST file using a random write. Starting with RocksDB 5.16, users can choose whether to update this field via IngestExternalFileOptions::write_global_seqno. If this option is false during ingestion, RocksDB uses the information in the MANIFEST to deduce the global sequence number when accessing the file. This can be useful if the underlying file system does not support random writes or if users wish to minimize sync operations. If backward compatibility is a concern, set this option to true so that external SST files ingested by RocksDB 5.16 or newer can be opened by RocksDB 5.15 or older.
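
To illustrate what it means for every key in a file to act as if it had the global sequence number, here is a toy read-path model in plain C++. It is not RocksDB code: Entry, TableFile, and Lookup are hypothetical names, and the model assumes only that a lookup returns the value with the highest effective sequence number. A file with a global sequence number overrides whatever sequence number is stored with each of its keys.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model only; these types are hypothetical, not RocksDB's.
struct Entry {
  std::string value;
  uint64_t seqno;  // sequence number stored with the key
};

struct TableFile {
  std::map<std::string, Entry> entries;
  bool has_global_seqno = false;  // true only for ingested files in this model
  uint64_t global_seqno = 0;

  // Every key in a file with a global sequence number acts as if it had
  // that number, regardless of what is stored with the key.
  uint64_t EffectiveSeqno(const Entry& e) const {
    return has_global_seqno ? global_seqno : e.seqno;
  }
};

// Stores the newest visible value for `key` in *value and returns true,
// or returns false if no file contains the key.
bool Lookup(const std::vector<TableFile>& files, const std::string& key,
            std::string* value) {
  bool found = false;
  uint64_t best_seqno = 0;
  for (const TableFile& f : files) {
    auto it = f.entries.find(key);
    if (it == f.entries.end()) continue;
    const uint64_t seqno = f.EffectiveSeqno(it->second);
    if (!found || seqno > best_seqno) {
      found = true;
      best_seqno = seqno;
      *value = it->second.value;
    }
  }
  return found;
}
```

In this model, an ingested file whose keys are stored with sequence number 0 loses to an existing key with sequence number 7 until the file is given a global sequence number such as 10, after which its value is the one returned.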

Ingestion Behind

Starting from 5.5, IngestExternalFile() can load a list of external SST files in ingest-behind mode, which is enabled when ingest_behind==true. In this mode files are always ingested at the bottommost level, and duplicate keys in the file being ingested are skipped rather than overwriting existing data under that key.

Use case

Back-filling historical data into the database without overwriting existing newer versions of the data. This option can only be used if the DB has been running with allow_ingest_behind=true since the dawn of time. All files will be ingested at the bottommost level with seqno=0.
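
The visible outcome of such a back-fill can be sketched with ordinary sorted maps. This models only the result that readers see, under the assumption that existing keys always win. RocksDB gets this effect by placing the files at the bottommost level with seqno=0 rather than by checking keys one by one. KeyValues and BackFill are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy model: the DB contents and the ingested file are both plain sorted maps.
using KeyValues = std::map<std::string, std::string>;

// Merges `historical` into `db` so that keys already present in `db` keep
// their current value. Returns how many keys were actually added.
std::size_t BackFill(KeyValues& db, const KeyValues& historical) {
  std::size_t added = 0;
  for (const auto& kv : historical) {
    // std::map::insert leaves an existing key untouched and reports whether
    // an insertion happened through the bool in its return value.
    if (db.insert(kv).second) {
      ++added;
    }
  }
  return added;
}
```

If the DB already holds keys a and c and the historical data holds a and b, only b is added and the existing value for a is kept.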