Cache pool

Purpose

Use a pool of fast storage devices (probably SSDs) as a cache for an existing slower and larger pool.

Use a replicated pool as a front-end to service most I/O, and destage cold data to a separate erasure coded pool that does not currently (and cannot efficiently) handle the workload.

We should be able to create and add a cache pool to an existing pool of data, and later remove it, without disrupting service or migrating data around.

Use cases

Read-write pool, writeback

We have an existing data pool and put a fast cache pool “in front” of it. Writes go to the cache pool and are acknowledged immediately. We flush them back to the data pool based on the defined policy.

Read-only pool, weak consistency

We have an existing data pool and add one or more read-only cache pools. We copy data to the cache pool(s) on read. Writes are forwarded to the original data pool. Stale data is expired from the cache pools based on the defined policy.

This is likely only useful for specific applications with specific data access patterns. It may be a match for rgw, for example.

Interface

Set up a read/write cache pool foo-hot for pool foo:

  1. ceph osd tier add foo foo-hot
  2. ceph osd tier cache-mode foo-hot writeback

Direct all traffic for foo to foo-hot:

  1. ceph osd tier set-overlay foo foo-hot
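
You can check that the tier and overlay are in place by inspecting the pool entries in the OSD map (in recent Ceph releases the base pool shows read_tier and write_tier fields, and the cache pool shows tier_of and cache_mode):

  1. ceph osd dump | grep foo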

Set the target size and enable the tiering agent for foo-hot:

  1. ceph osd pool set foo-hot hit_set_type bloom
  2. ceph osd pool set foo-hot hit_set_count 1
  3. ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
  4. ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
  5. ceph osd pool set foo-hot min_read_recency_for_promote 1
  6. ceph osd pool set foo-hot min_write_recency_for_promote 1

Drain the cache in preparation for turning it off:

  1. ceph osd tier cache-mode foo-hot forward
  2. rados -p foo-hot cache-flush-evict-all
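
The flush may need to be repeated if some objects are still in use; you can confirm that the cache pool is empty before removing it:

  1. rados -p foo-hot ls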

When the cache pool is finally empty, disable it:

  1. ceph osd tier remove-overlay foo
  2. ceph osd tier remove foo foo-hot

Read-only pools with lazy consistency:

  1. ceph osd tier add foo foo-east
  2. ceph osd tier cache-mode foo-east readonly
  3. ceph osd tier add foo foo-west
  4. ceph osd tier cache-mode foo-west readonly

Tiering agent

The tiering policy is defined as properties on the cache pool itself.

HitSet metadata

First, the agent requires HitSet information to be tracked on the cache pool in order to determine which objects in the pool are being accessed. This is enabled with:

  1. ceph osd pool set foo-hot hit_set_type bloom
  2. ceph osd pool set foo-hot hit_set_count 1
  3. ceph osd pool set foo-hot hit_set_period 3600 # 1 hour

The supported HitSet types include ‘bloom’ (a bloom filter, the default), ‘explicit_hash’, and ‘explicit_object’. The latter two explicitly enumerate accessed objects and are less memory efficient; they exist primarily for debugging and to demonstrate the pluggability of the infrastructure. For the ‘bloom’ type, you can additionally define the false positive probability of the filter (the default is 0.05):

  1. ceph osd pool set foo-hot hit_set_fpp 0.15

The hit_set_count and hit_set_period define how many HitSets to store and how much time each one should cover. Binning accesses over time allows Ceph to independently determine whether an object was accessed at least once and whether it was accessed more than once over some time period (“age” vs “temperature”), as the toy example below illustrates.
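
As a toy illustration of this binning (plain Python sets standing in for the HitSets; the object names and helper are invented for the example):

  # Four one-hour bins: hit_set_count=4, hit_set_period=3600, newest first.
  hit_sets = [{"obj.a"}, {"obj.a"}, set(), {"obj.b"}]

  def hit_count(obj):
      # In how many of the stored intervals was the object accessed?
      return sum(obj in hs for hs in hit_sets)

  print(hit_count("obj.a"))  # 2 -> hit in both recent hours ("hot")
  print(hit_count("obj.b"))  # 1 -> last touched hours ago ("aged but cold")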

The min_read_recency_for_promote setting defines how many HitSets to check for the existence of an object when handling a read operation. The result is used to decide whether to promote the object asynchronously. Its value should be between 0 and hit_set_count. If it is set to 0, the object is always promoted. If it is set to 1, the current HitSet is checked, and the object is promoted only if it is present there. For higher values, that exact number of HitSets (the current one plus archives) is checked: the object is promoted if it is found in any of the most recent min_read_recency_for_promote HitSets.

A similar parameter, min_write_recency_for_promote, can be set for write operations:

  1. ceph osd pool set {cachepool} min_read_recency_for_promote 1
  2. ceph osd pool set {cachepool} min_write_recency_for_promote 1
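
A rough sketch of the promotion decision these settings drive (illustrative pseudocode, not the OSD implementation; HitSets are modelled as plain sets rather than bloom filters):

  def should_promote(obj, hit_sets, min_recency):
      # hit_sets[0] is the current HitSet; archived ones follow, newest first.
      if min_recency == 0:
          return True                        # 0 means: always promote
      recent = hit_sets[:min_recency]        # the N most recent HitSets
      return any(obj in hs for hs in recent)

  hit_sets = [{"obj.a"}, {"obj.b"}]          # current HitSet, then one archive
  print(should_promote("obj.b", hit_sets, 1))  # False: not in current HitSet
  print(should_promote("obj.b", hit_sets, 2))  # True: found in an archive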

Note that the longer the hit_set_period and the higher the min_read_recency_for_promote/min_write_recency_for_promote values, the more RAM the ceph-osd process will consume. In particular, when the agent is active to flush or evict cache objects, all hit_set_count HitSets are loaded into RAM.
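
For a rough sense of the scale involved, standard bloom filter sizing gives the per-HitSet cost; the one-million-object figure below is an invented example, and the real ceph-osd overhead will differ:

  import math

  def bloom_filter_bytes(n_objects, fpp):
      # Classic bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
      bits = -n_objects * math.log(fpp) / (math.log(2) ** 2)
      return bits / 8

  # One HitSet tracking ~1 million objects at the default 0.05 fpp:
  print(f"{bloom_filter_bytes(1_000_000, 0.05) / 2**20:.2f} MiB")  # ~0.74 MiB
  # Worst case scales with hit_set_count, since the agent may load all
  # stored HitSets at once while flushing or evicting.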

Cache mode

The most important policy is the cache mode:

  1. ceph osd tier cache-mode foo-hot writeback

The supported modes are ‘none’, ‘writeback’, ‘forward’, and ‘readonly’. Most installations want ‘writeback’, which will write into the cache tier and only later flush updates back to the base tier. Similarly, any object that is read will be promoted into the cache tier.

The ‘forward’ mode is intended for when the cache is being disabled and needs to be drained. No new objects will be promoted or written to the cache pool unless they are already present. A background operation can then do something like:

  1. rados -p foo-hot cache-try-flush-evict-all
  2. rados -p foo-hot cache-flush-evict-all

to force all data to be flushed back to the base tier.

The ‘readonly’ mode is intended for read-only workloads that do not require consistency to be enforced by the storage system. Writes will be forwarded to the base tier, but objects that are read will get promoted to the cache. No attempt is made by Ceph to ensure that the contents of the cache tier(s) are consistent in the presence of object updates.

Cache sizing

The agent performs two basic functions: flushing (writing ‘dirty’ cache objects back to the base tier) and evicting (removing cold and clean objects from the cache).

The thresholds at which Ceph will flush or evict objects are specified relative to a ‘target size’ of the pool. For example:

  1. ceph osd pool set foo-hot cache_target_dirty_ratio .4
  2. ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
  3. ceph osd pool set foo-hot cache_target_full_ratio .8

will begin flushing dirty objects when 40% of the target size is dirty, flush dirty objects more aggressively once 60% is dirty, and begin evicting clean objects when the pool reaches 80% of the target size.
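
Concretely, with the 1 TB target size used in the examples, those ratios work out as follows (simple arithmetic, not Ceph source code):

  target_max_bytes = 1_000_000_000_000            # 1 TB target size

  print(0.4 * target_max_bytes / 1e9)  # 400.0 GB dirty -> flushing begins
  print(0.6 * target_max_bytes / 1e9)  # 600.0 GB dirty -> flushing speeds up
  print(0.8 * target_max_bytes / 1e9)  # 800.0 GB used  -> eviction begins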

The target size can be specified either in terms of objects or bytes:

  1. ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
  2. ceph osd pool set foo-hot target_max_objects 1000000 # 1 million objects

Note that if both limits are specified, Ceph will begin flushing or evicting when either threshold is triggered.
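
A sketch of that rule (illustrative only; the function and variable names here are invented):

  def cache_fullness(used_bytes, used_objects,
                     target_max_bytes=None, target_max_objects=None):
      # Effective fullness is whichever configured limit is closest to
      # being reached, so either one alone can trigger flushing/eviction.
      ratios = []
      if target_max_bytes:
          ratios.append(used_bytes / target_max_bytes)
      if target_max_objects:
          ratios.append(used_objects / target_max_objects)
      return max(ratios, default=0.0)

  # Only 20% full by bytes but 90% full by object count: well past the
  # 0.8 cache_target_full_ratio above, so eviction would begin.
  print(cache_fullness(200e9, 900_000, 1e12, 1_000_000))  # 0.9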

Other tunables

You can specify a minimum object age before a recently updated object is flushed to the base tier:

  1. ceph osd pool set foo-hot cache_min_flush_age 600 # 10 minutes

You can specify the minimum age of an object before it will be evicted from the cache tier:

  1. ceph osd pool set foo-hot cache_min_evict_age 1800 # 30 minutes