RocksDB may configure a certain amount of main memory as a block cache to accelerate data access. Understanding the efficiency of block cache is very important. The block cache analysis and simulation tools help a user to collect block cache access traces, analyze its access pattern, and evaluate alternative caching policies.

Table of Contents

Quick Start

db_bench supports tracing block cache accesses. This section demonstrates how to trace accesses when running db_bench. It also shows how to analyze and evaluate caching policies using the generated trace file.

Create a database:

  1. ./db_bench --benchmarks="fillseq" \
  2. --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
  3. --cache_index_and_filter_blocks --cache_size=1048576 \
  4. --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
  5. --min_level_to_compress=-1 --compression_ratio=1 --num=10000000

To trace block cache accesses when running readrandom benchmark:

  1. ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
  2. --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
  3. --cache_index_and_filter_blocks --cache_size=1048576 \
  4. --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
  5. --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
  6. --threads=16 \
  7. -block_cache_trace_file="/tmp/binary_trace_test_example" \
  8. -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
  9. -block_cache_trace_sampling_frequency=1

Convert the trace file to human readable format:

  1. ./block_cache_trace_analyzer \
  2. -block_cache_trace_path=/tmp/binary_trace_test_example \
  3. -human_readable_trace_file_path=/tmp/human_readable_block_trace_test_example

Evaluate alternative caching policies:

  1. bash /tmp/human_readable_block_trace_test_example /tmp/sim_results/bench 1 0 30

Plot graphs:

  1. python /tmp/sim_results /tmp/sim_results_graphs

Tracing Block Cache Accesses

RocksDB supports block cache tracing APIs StartBlockCacheTrace and EndBlockCacheTrace. When tracing starts, RocksDB logs detailed information of block cache accesses into a trace file. A user must specify a trace option and trace file path when start tracing block cache accesses.

A trace option contains max_trace_file_size and sampling_frequency.

  • max_trace_file_size specifies the maximum size of the trace file. The tracing stops when the trace file size exceeds the specified max_trace_file_size.
  • sampling_frequency determines how frequent should RocksDB trace an access. RocksDB uses spatial downsampling such that it traces all accesses to sampled blocks. A sampling_frequency of 1 means tracing all block cache accesses. A sampling_frequency of 100 means tracing all accesses on ~1% blocks.

An example to start tracing block cache accesses:

  1. Env* env = rocksdb::Env::Default();
  2. EnvOptions env_options;
  3. std::string trace_path = "/tmp/binary_trace_test_example"
  4. std::unique_ptr<TraceWriter> trace_writer;
  5. DB* db = nullptr;
  6. std::string db_name = "/tmp/rocksdb"
  7. /*Create the trace file writer*/
  8. NewFileTraceWriter(env, env_options, trace_path, &trace_writer);
  9. DB::Open(options, dbname, &db);
  10. /*Start tracing*/
  11. db->StartBlockCacheTrace(trace_opt, std::move(trace_writer));
  12. /* Your call of RocksDB APIs */
  13. /*End tracing*/
  14. db->EndBlockCacheTrace()

Trace Format

We can convert the generated binary trace file into human readable trace file in csv format. It contains the following columns:

Column NameValuesComment
Access timestamp in microsecondsunsigned long
Block IDunsigned longA unique block ID.
Block type7: Index block
8: Filter block
9: Data block
10: Uncompressed dictionary block
11: Range deletion block
Block sizeunsigned longBlock size may be 0 when
1) compaction observes cache misses and does not insert the missing blocks into the cache.
2) IO error when fetching a block.
3) prefetching filter blocks but the SST file does not have filter blocks.
Column family IDunsigned longA unique column family ID.
Column family namestring
Levelunsigned longThe LSM tree level of this block.
SST file numberunsigned longThe SST file this block belongs to.
CallerSee CallerThe caller that accesses this block, e.g., Get, Iterator, Compaction, etc.
No insert0: do not insert the block upon a miss
1: insert the block upon a cache miss
Get IDunsigned longA unique ID associated with the Get request.
Get key IDunsigned longThe referenced key of the Get request.
Get referenced data sizeunsigned longThe referenced data (key+value) size of the Get request.
Is a cache hit0: A cache hit
1: A cache miss
The running RocksDB instance observes a cache hit/miss on this block.
Get Does get referenced key exist in this block0: Does not exist
1: Exist
Data block only: Whether the referenced key is found in this block.
Get Approximate number of keys in this blockunsigned longData block only.
Get table IDunsigned longThe table ID of the Get request. We treat the first four bytes of the Get request as table ID.
Get sequence numberunsigned longThe sequence number associated with the Get request.
Block key sizeunsigned long
Get referenced key sizeunsigned long
Block offset in the SST fileunsigned long

Cache Simulations

We support running cache simulators using both RocksDB built-in caches and caching policies written in python. The cache simulator replays the trace and reports the miss ratio given a cache capacity and a caching policy.

RocksDB Cache Simulations

To replay the trace and evaluate alternative policies, we first need to provide a cache configuration file. An example file contains the following content:

  1. lru,0,0,16M,256M,1G,2G,4G,8G,12G,16G,1T

Cache configuration file format:

Column NameValues
Cache namelru: LRU
lrupriority: LRU with midpoint insertion
lru_hybrid: LRU that also caches row keys
*: A ghost cache for admission control. It admits an entry on its second access.
  • Specifically, the ghost cache only maintains keys and is managed by LRU.
  • Upon an access, the cache inserts the key into the ghost cache.
  • Upon a cache miss, the cache inserts the missing entry only if it observes a hit in the ghost cache.
Number of shard bitsunsigned long
Ghost cache capacityunsigned long
Cache sizesA list of comma separated cache sizes

Next, we can start simulating caches.

  1. ./block_cache_trace_analyzer -mrc_only=true \
  2. -block_cache_trace_downsample_ratio=100 \
  3. -block_cache_trace_path=/tmp/binary_trace_test_example \
  4. -block_cache_sim_config_path=/tmp/cache_config \
  5. -block_cache_analysis_result_dir=/tmp/binary_trace_test_example_results \
  6. -cache_sim_warmup_seconds=3600

It contains two important parameters:

block_cache_trace_downsample_ratio: The sampling frequency used to collect the trace. The simulator scales down the given cache size by this factor. For example, with downsample_ratio of 100, the cache simulator creates a 1 GB cache to simulate a 100 GB cache.

cache_sim_warmup_seconds: The number of seconds used for warmup. The reported miss ratio does NOT include the number of misses/accesses during the warmup.

The analyzer outputs a few files:

  • A miss ratio curve file: {trace_duration_in_seconds}_{total_accesses}_mrc.
  • Three miss ratio timeline files per second (1), per minute (60), and per hour (3600).
  • Three number of misses timeline files per second (1), per minute (60), and per hour (3600).

Python Cache Simulations

We also support a more diverse set of caching policies written in python. In addition to LRU, it provides replacement policies using reinforcement learning, cost class, and more. To use the python cache simulator, we need to first convert the binary trace file into human readable trace file.

  1. ./block_cache_trace_analyzer \
  2. -block_cache_trace_path=/tmp/binary_trace_test_example \
  3. -human_readable_trace_file_path=/tmp/human_readable_block_trace_test_example options:

  1. 1) Cache type (ts, linucb, arc, lru, opt, pylru, pymru, pylfu, pyhb, gdsize, trace).
  2. One may evaluate the hybrid row_block cache by appending '_hybrid' to a cache_type, e.g., ts_hybrid.
  3. Note that hybrid is not supported with opt and trace.
  4. 2) Cache size (xM, xG, xT).
  5. 3) The sampling frequency used to collect the trace.
  6. (The simulation scales down the cache size by the sampling frequency).
  7. 4) Warmup seconds (The number of seconds used for warmup).
  8. 5) Trace file path.
  9. 6) Result directory (A directory that saves generated results)
  10. 7) Max number of accesses to process. (Replay the entire trace if set to -1.)
  11. 8) The target column family. (The simulation will only run accesses on the target column family.
  12. If it is set to all, it will run against all accesses.)

One example:

  1. python lru 16M 100 3600 /tmp/human_readable_block_trace_test_example /tmp/results 10000000 0 all

We also provide a bash script to simulate a batch of cache configurations:

  1. Usage: ./ trace_file_path result_dir downsample_size warmup_seconds max_jobs
  2. -max_jobs: The maximum number of simulators to run at a time.

One example:

  1. bash /tmp/human_readable_block_trace_test_example /tmp/sim_results/bench 1 0 30 output the following files:

  • A miss ratio curve file: data-ml-mrc-{cache_type}-{cache_size}-{target_cf_name}.
  • Two files on the timeline of miss ratios per minute (60), and per hour (3600).
  • Two files on the timeline of number of misses per second (1), per minute (60), and per hour (3600). For ts and linucb, it also outputs the following files:
  • Two files on the timeline of percentage of times a policy is selected: per minute (60) and per hour (3600).
  • Two files on the timeline of number of times a policy is selected: per minute (60) and per hour (3600). combines the outputs of into following files:

  • One miss ratio curve file per target column family: ml_{target_cf_name}_mrc
  • One files on the timeline of number of misses per {target_cf_name}{capacity}{time_unit}: ml_{target_cf_name}{capacity}{time_unit}miss_timeline
  • One files on the timeline of miss ratios per {target_cf_name}{capacity}{time_unit}: ml_{target_cf_name}{capacity}{time_unit}miss_ratio_timeline
  • One files on the timeline of number of times a policy is selected per {target_cf_name}{capacity}{time_unit}: ml_{target_cf_name}{capacity}{time_unit}policy_timeline
  • One files on the timeline of percentage of times a policy is selected per {target_cf_name}{capacity}{time_unit}: ml_{target_cf_name}{capacity}{time_unit}policy_ratio_timeline

Supported Cache Simulators

Cache NameComment
lruStrict (Least recently used) LRU cache. The cache maintains an LRU queue.
gdsizeGreedyDual Size.
N. Young. The k-server dual and loose competitiveness for paging. Algorithmica, June 1994, vol. 11,(no.6):525-41. Rewritten version of ‘’On-line caching as cache size varies’’, in The 2nd Annual ACM-SIAM Symposium on Discrete Algorithms, 241-250, 1991.
optThe Belady MIN algorithm.
L. A. Belady. 1966. A Study of Replacement Algorithms for a Virtual-storage Computer. IBM Syst. J. 5, 2 (June 1966), 78-101. DOI=
arcAdaptive replacement cache.
Nimrod Megiddo and Dharmendra S. Modha. 2003. ARC: A Self-Tuning, Low Overhead Replacement Cache. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST ‘03). USENIX Association, Berkeley, CA, USA, 115-130.
pylruLRU cache with random sampling.
pymru(Most recently used) MRU cache with random sampling.
pylfu(Least frequently used) LFU cache with random sampling.
pyhbHyperbolic Caching.
Aaron Blankstein, Siddhartha Sen, and Michael J. Freedman. 2017. Hyperbolic caching: flexible caching for web applications. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ‘17). USENIX Association, Berkeley, CA, USA, 499-511.
pyccbtCost class: block type
pycccfbtCost class: column family + block type
tsThompson sampling
Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. A Tutorial on Thompson Sampling. Found. Trends Mach. Learn. 11, 1 (July 2018), 1-96. DOI:
linucbLinear UCB
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (WWW ‘10). ACM, New York, NY, USA, 661-670. DOI=
*_hybridA hybrid cache that also caches row keys.

py* caches use random sampling at eviction time. It samples 64 random entries in the cache, sorts these entries based on a priority function, e.g., LRU, and evicts from the lowest priority entry until the cache has enough capacity to insert the new entry.

pycc* caches group cached entries by a cost class. The cache maintains aggregated statistics for each cost class such as number of hits, total size. A cached entry is also tagged with one cost class. At eviction time, the cache samples 64 random entries and group them by their cost class. It then evicts entries based on their cost class’s statistics.

ts and linucb are two caches using reinforcement learning. The cache is configured with N policies, e.g., LRU, MRU, LFU, etc. The cache learns which policy is the best overtime and selects the best policy for eviction. The cache rewards the selected policy if the policy has not evicted the missing key before. ts does not use any feature of a block while linucb uses three features: a block’s level, column family, and block type.

trace reports the misses observed in the collected trace.

Analyzing Block Cache Traces

The block_cache_trace_analyzer analyzes a trace file and outputs useful statistics of the access pattern. It provides insights into how to tune and improve a caching policy.

The block_cache_trace_analyzer may output statistics into multiple csv files saved in a result directory. We can plot graphs on these statistics using

Analyzer options:

  1. -access_count_buckets (Group number of blocks by their access count given
  2. these buckets. If specified, the analyzer will output a detailed analysis
  3. on the number of blocks grouped by their access count break down by block
  4. type and column family.) type: string default: ""
  5. -analyze_blocks_reuse_k_reuse_window (Analyze the percentage of blocks that
  6. are accessed in the [k, 2*k] seconds are accessed again in the next [2*k,
  7. 3*k], [3*k, 4*k],...,[k*(n-1), k*n] seconds. ) type: int32 default: 0
  8. -analyze_bottom_k_access_count_blocks (Print out detailed access
  9. information for blocks with their number of accesses are the bottom k
  10. among all blocks.) type: int32 default: 0
  11. -analyze_callers (The list of callers to perform a detailed analysis on. If
  12. speicfied, the analyzer will output a detailed percentage of accesses for
  13. each caller break down by column family, level, and block type. A list of
  14. available callers are: Get, MultiGet, Iterator, ApproximateSize,
  15. VerifyChecksum, SSTDumpTool, ExternalSSTIngestion, Repair, Prefetch,
  16. Compaction, CompactionRefill, Flush, SSTFileReader, Uncategorized.)
  17. type: string default: ""
  18. -analyze_correlation_coefficients_labels (Analyze the correlation
  19. coefficients of features such as number of past accesses with regard to
  20. the number of accesses till the next access.) type: string default: ""
  21. -analyze_correlation_coefficients_max_number_of_values (The maximum number
  22. of values for a feature. If the number of values for a feature is larger
  23. than this max, it randomly selects 'max' number of values.) type: int32
  24. default: 1000000
  25. -analyze_get_spatial_locality_buckets (Group data blocks by their
  26. statistics using these buckets.) type: string default: ""
  27. -analyze_get_spatial_locality_labels (Group data blocks using these
  28. labels.) type: string default: ""
  29. -analyze_top_k_access_count_blocks (Print out detailed access information
  30. for blocks with their number of accesses are the top k among all blocks.)
  31. type: int32 default: 0
  32. -block_cache_analysis_result_dir (The directory that saves block cache
  33. analysis results.) type: string default: ""
  34. -block_cache_sim_config_path (The config file path. One cache configuration
  35. per line. The format of a cache configuration is
  36. cache_name,num_shard_bits,ghost_capacity,cache_capacity_1,...,cache_capacity_N.
  37. Supported cache names are lru, lru_priority, lru_hybrid. User may also add
  38. a prefix 'ghost_' to a cache_name to add a ghost cache in front of the real
  39. cache. ghost_capacity and cache_capacity can be xK, xM or xG where
  40. x is a positive number.)
  41. type: string default: ""
  42. -block_cache_trace_downsample_ratio (The trace collected accesses on one in
  43. every block_cache_trace_downsample_ratio blocks. We scale down the
  44. simulated cache size by this ratio.) type: int32 default: 1
  45. -block_cache_trace_path (The trace file path.) type: string default: ""
  46. -cache_sim_warmup_seconds (The number of seconds to warmup simulated
  47. caches. The hit/miss counters are reset after the warmup completes.)
  48. type: int32 default: 0
  49. -human_readable_trace_file_path (The filt path that saves human readable
  50. access records.) type: string default: ""
  51. -mrc_only (Evaluate alternative cache policies only. When this flag is
  52. true, the analyzer does NOT maintain states of each block in memory for
  53. analysis. It only feeds the accesses into the cache simulators.)
  54. type: bool default: false
  55. -print_access_count_stats (Print access count distribution and the
  56. distribution break down by block type and column family.) type: bool
  57. default: false
  58. -print_block_size_stats (Print block size distribution and the distribution
  59. break down by block type and column family.) type: bool default: false
  60. -print_data_block_access_count_stats (Print data block accesses by user Get
  61. and Multi-Get.) type: bool default: false
  62. -reuse_distance_buckets (Group blocks by their reuse distances given these
  63. buckets. For example, if 'reuse_distance_buckets' is '1K,1M,1G', we will
  64. create four buckets. The first three buckets contain the number of blocks
  65. with reuse distance less than 1KB, between 1K and 1M, between 1M and 1G,
  66. respectively. The last bucket contains the number of blocks with reuse
  67. distance larger than 1G. ) type: string default: ""
  68. -reuse_distance_labels (Group the reuse distance of a block using these
  69. labels. Reuse distance is defined as the cumulated size of unique blocks
  70. read between two consecutive accesses on the same block.) type: string
  71. default: ""
  72. -reuse_interval_buckets (Group blocks by their reuse interval given these
  73. buckets. For example, if 'reuse_distance_buckets' is '1,10,100', we will
  74. create four buckets. The first three buckets contain the number of blocks
  75. with reuse interval less than 1 second, between 1 second and 10 seconds,
  76. between 10 seconds and 100 seconds, respectively. The last bucket
  77. contains the number of blocks with reuse interval longer than 100
  78. seconds.) type: string default: ""
  79. -reuse_interval_labels (Group the reuse interval of a block using these
  80. labels. Reuse interval is defined as the time between two consecutive
  81. accesses on the same block.) type: string default: ""
  82. -reuse_lifetime_buckets (Group blocks by their reuse lifetime given these
  83. buckets. For example, if 'reuse_lifetime_buckets' is '1,10,100', we will
  84. create four buckets. The first three buckets contain the number of blocks
  85. with reuse lifetime less than 1 second, between 1 second and 10 seconds,
  86. between 10 seconds and 100 seconds, respectively. The last bucket
  87. contains the number of blocks with reuse lifetime longer than 100
  88. seconds.) type: string default: ""
  89. -reuse_lifetime_labels (Group the reuse lifetime of a block using these
  90. labels. Reuse lifetime is defined as the time interval between the first
  91. access on a block and the last access on the same block. For blocks that
  92. are only accessed once, its lifetime is set to kMaxUint64.) type: string
  93. default: ""
  94. -skew_buckets (Group the skew labels using these buckets.) type: string
  95. default: ""
  96. -skew_labels (Group the access count of a block using these labels.)
  97. type: string default: ""
  98. -timeline_labels (Group the number of accesses per block per second using
  99. these labels. Possible labels are a combination of the following: cf
  100. (column family), sst, level, bt (block type), caller, block. For example,
  101. label "cf_bt" means the number of acccess per second is grouped by unique
  102. pairs of "cf_bt". A label "all" contains the aggregated number of
  103. accesses per second across all possible labels.) type: string default: ""

An example that outputs a statistics summary of the access pattern:

  1. ./block_cache_trace_analyzer -block_cache_trace_path=/tmp/test_trace_file_path

Another example:

  1. ./block_cache_trace_analyzer \
  2. -block_cache_trace_path=/tmp/test_trace_file_path \
  3. -block_cache_analysis_result_dir=/tmp/sim_results/test_trace_results \
  4. -print_block_size_stats \
  5. -print_access_count_stats \
  6. -print_data_block_access_count_stats \
  7. -timeline_labels=cf,level,bt,caller \
  8. -analyze_callers=Get,Iterator,Compaction \
  9. -access_count_buckets=1,2,3,4,5,6,7,8,9,10,100,10000,100000,1000000 \
  10. -analyze_bottom_k_access_count_blocks=10 \
  11. -analyze_top_k_access_count_blocks=10 \
  12. -reuse_lifetime_labels=cf,level,bt \
  13. -reuse_lifetime_buckets=1,10,100,1000,10000,100000,1000000 \
  14. -reuse_interval_labels=cf,level,bt,caller \
  15. -reuse_interval_buckets=1,10,100,1000,10000,100000,1000000 \
  16. -analyze_blocks_reuse_k_reuse_window=3600 \
  17. -analyze_get_spatial_locality_labels=cf,level,all \
  18. -analyze_get_spatial_locality_buckets=10,20,30,40,50,60,70,80,90,100,101 \
  19. -analyze_correlation_coefficients_labels=all,cf,level,bt,caller \
  20. -skew_labels=block,bt,table,sst,cf,level \
  21. -skew_buckets=10,20,30,40,50,60,70,80,90,100

Next, we can plot graphs using the following command:

  1. python /tmp/sim_results /tmp/sim_results_graphs