Cache Tiering

A cache tier provides Ceph Clients with better I/O performance for a subset of the data stored in a backing storage tier. Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g., solid state drives) configured to act as a cache tier, and a backing pool of either erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier. The Ceph objecter handles where to place the objects and the tiering agent determines when to flush objects from the cache to the backing storage tier. As a result, the cache tier and the backing storage tier are completely transparent to Ceph clients.

Figure 1: Cache Tiering

The cache tiering agent handles the migration of data between the cache tier and the backing storage tier automatically. However, admins have the ability to configure how this migration takes place by setting the cache-mode. There are two main scenarios:

  • writeback mode: When admins configure tiers with writeback mode, Ceph clients write data to the cache tier and receive an ACK from the cache tier. In time, the data written to the cache tier migrates to the storage tier and gets flushed from the cache tier. Conceptually, the cache tier is overlaid “in front” of the backing storage tier. When a Ceph client needs data that resides in the storage tier, the cache tiering agent migrates the data to the cache tier on read, then it is sent to the Ceph client. Thereafter, the Ceph client can perform I/O using the cache tier, until the data becomes inactive. This is ideal for mutable data (e.g., photo/video editing, transactional data, etc.).

  • readproxy mode: This mode will use any objects that already exist in the cache tier, but if an object is not present in the cache the request will be proxied to the base tier. This is useful for transitioning from writeback mode to a disabled cache as it allows the workload to function properly while the cache is drained, without adding any new objects to the cache.

Other cache modes are:

  • readonly promotes objects to the cache on read operations only; write operations are forwarded to the base tier. This mode is intended for read-only workloads that do not require consistency to be enforced by the storage system. (Warning: when objects are updated in the base tier, Ceph makes no attempt to sync these updates to the corresponding objects in the cache. Since this mode is considered experimental, a --yes-i-really-mean-it option must be passed in order to enable it.)

  • none is used to completely disable caching.

A word of caution

Cache tiering will degrade performance for most workloads. Users should exercise extreme caution before using this feature.

  • Workload dependent: Whether a cache will improve performance is highly dependent on the workload. Because there is a cost associated with moving objects into or out of the cache, it can only be effective when there is a large skew in the access pattern in the data set, such that most of the requests touch a small number of objects. The cache pool should be large enough to capture the working set for your workload to avoid thrashing.

  • Difficult to benchmark: Most benchmarks that users run to measure performance will show terrible performance with cache tiering, in part because very few of them skew requests toward a small set of objects, because it can take a long time for the cache to “warm up,” and because the warm-up cost can be high.

  • Usually slower: For workloads that are not cache tiering-friendly, performance is often slower than a normal RADOS pool without cache tiering enabled.

  • librados object enumeration: The librados-level object enumeration API is not meant to be coherent in the presence of the cache. If your application is using librados directly and relies on object enumeration, cache tiering will probably not work as expected. (This is not a problem for RGW, RBD, or CephFS.)

  • Complexity: Enabling cache tiering means that a lot of additional machinery and complexity within the RADOS cluster is being used. This increases the probability that you will encounter a bug in the system that other users have not yet encountered and will put your deployment at a higher level of risk.

Known Good Workloads

  • RGW time-skewed: If the RGW workload is such that almost all read operations are directed at recently written objects, a simple cache tiering configuration that destages recently written objects from the cache to the base tier after a configurable period can work well.

Known Bad Workloads

The following configurations are known to work poorly with cache tiering.

  • RBD with replicated cache and erasure-coded base: This is a common request, but usually does not perform well. Even reasonably skewed workloads still send some small writes to cold objects, and because small writes are not yet supported by the erasure-coded pool, entire (usually 4 MB) objects must be migrated into the cache in order to satisfy a small (often 4 KB) write. Only a handful of users have successfully deployed this configuration, and it only works for them because their data is extremely cold (backups) and they are not in any way sensitive to performance.

  • RBD with replicated cache and base: RBD with a replicated base tier does better than when the base is erasure coded, but it is still highly dependent on the amount of skew in the workload, and very difficult to validate. The user will need to have a good understanding of their workload and will need to tune the cache tiering parameters carefully.

Setting Up Pools

To set up cache tiering, you must have two pools. One will act as the backing storage and the other will act as the cache.

Setting Up a Backing Storage Pool

Setting up a backing storage pool typically involves one of two scenarios:

  • Standard Storage: In this scenario, the pool stores multiple copies of an object in the Ceph Storage Cluster.

  • Erasure Coding: In this scenario, the pool uses erasure coding to store data much more efficiently with a small performance tradeoff.

In the standard storage scenario, you can set up a CRUSH rule to establish the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD Daemons perform optimally when all storage drives in the rule are of the same size, speed (both RPMs and throughput) and type. See CRUSH Maps for details on creating a rule. Once you have created a rule, create a backing storage pool.

In the erasure coding scenario, the pool creation arguments will generate the appropriate rule automatically. See Create a Pool for details.

In subsequent examples, we will refer to the backing storage pool as cold-storage.

Setting Up a Cache Pool

Setting up a cache pool follows the same procedure as the standard storage scenario, but with this difference: the drives for the cache tier are typically high performance drives that reside in their own servers and have their own CRUSH rule. When setting up such a rule, it should take account of the hosts that have the high performance drives while omitting the hosts that don’t. See Placing Different Pools on Different OSDs for details.

In subsequent examples, we will refer to the cache pool as hot-storage and the backing pool as cold-storage.

For cache tier configuration and default values, see Pools - Set Pool Values.

Creating a Cache Tier

Setting up a cache tier involves associating a backing storage pool with a cache pool:

  ceph osd tier add {storagepool} {cachepool}

For example:

  ceph osd tier add cold-storage hot-storage

To set the cache mode, execute the following:

  ceph osd tier cache-mode {cachepool} {cache-mode}

For example:

  ceph osd tier cache-mode hot-storage writeback

The cache tiers overlay the backing storage tier, so they require one additional step: you must direct all client traffic from the storage pool to the cache pool. To do so, execute the following:

  ceph osd tier set-overlay {storagepool} {cachepool}

For example:

  ceph osd tier set-overlay cold-storage hot-storage
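The three commands above always run in the same order, so a deployment script might generate them from the pool names. The following is a minimal sketch under that assumption; cache_tier_setup_commands is a hypothetical helper, not part of any Ceph tool, and it only builds the command strings shown above without executing them.

```python
def cache_tier_setup_commands(storage_pool, cache_pool, cache_mode="writeback"):
    """Return, in order, the commands that attach cache_pool to storage_pool.

    Hypothetical helper: it only formats the command lines from this section.
    """
    return [
        # 1. Associate the backing storage pool with the cache pool.
        f"ceph osd tier add {storage_pool} {cache_pool}",
        # 2. Choose how the cache behaves (writeback, readproxy, ...).
        f"ceph osd tier cache-mode {cache_pool} {cache_mode}",
        # 3. Direct client traffic for the storage pool to the cache pool.
        f"ceph osd tier set-overlay {storage_pool} {cache_pool}",
    ]


for command in cache_tier_setup_commands("cold-storage", "hot-storage"):
    print(command)
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without a cluster; the output can be reviewed before anything is executed.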

Configuring a Cache Tier

Cache tiers have several configuration options. You may set cache tier configuration options with the following usage:

  ceph osd pool set {cachepool} {key} {value}

See Pools - Set Pool Values for details.

Target Size and Type

Ceph’s production cache tiers use a Bloom Filter for the hit_set_type:

  ceph osd pool set {cachepool} hit_set_type bloom

For example:

  ceph osd pool set hot-storage hit_set_type bloom

The hit_set_count and hit_set_period define how many such HitSets to store, and how much time each HitSet should cover.

  ceph osd pool set {cachepool} hit_set_count 12
  ceph osd pool set {cachepool} hit_set_period 14400
  ceph osd pool set {cachepool} target_max_bytes 1000000000000

Note

A larger hit_set_count results in more RAM consumed by the ceph-osd process.

Binning accesses over time allows Ceph to determine whether a Ceph client accessed an object at least once, or more than once over a time period (“age” vs “temperature”).

The min_read_recency_for_promote defines how many HitSets to check for the existence of an object when handling a read operation. The result is used to decide whether to promote the object asynchronously. Its value should be between 0 and hit_set_count. If it is set to 0, the object is always promoted. If it is set to 1, only the current HitSet is checked, and the object is promoted if it is in that HitSet. For other values, that number of archived HitSets is checked, and the object is promoted if it is found in any of the most recent min_read_recency_for_promote HitSets.

A similar parameter, min_write_recency_for_promote, can be set for the write operation.

  ceph osd pool set {cachepool} min_read_recency_for_promote 2
  ceph osd pool set {cachepool} min_write_recency_for_promote 2

Note

The longer the period and the higher the min_read_recency_for_promote and min_write_recency_for_promote values, the more RAM the ceph-osd daemon consumes. In particular, when the agent is active to flush or evict cache objects, all hit_set_count HitSets are loaded into RAM.
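The promotion rule described above can be sketched as a small function. This is a model of the prose in this section only, not of the OSD source code, whose exact behavior may differ. It assumes the HitSets are ordered newest first and uses plain Python sets in place of Bloom filters (which can also report false positives); should_promote is a hypothetical name.

```python
def should_promote(object_name, hit_sets, min_recency_for_promote):
    """Decide whether an object is promoted, following the text above.

    hit_sets: sequence of sets, newest first (hit_sets[0] is the current
    HitSet). Plain sets stand in for Bloom filters here (assumption).
    """
    if min_recency_for_promote == 0:
        # A value of 0 means the object is always promoted.
        return True
    # Look at the N most recent HitSets; with N == 1 this is only the
    # current HitSet.
    recent = hit_sets[:min_recency_for_promote]
    return any(object_name in hit_set for hit_set in recent)


hit_sets = [{"obj-a"}, {"obj-b"}, {"obj-c"}]  # newest first
print(should_promote("obj-b", hit_sets, 1))
print(should_promote("obj-b", hit_sets, 2))
```

With these sample HitSets, obj-b is absent from the current HitSet but present in the previous one, so it is promoted only when the recency value is raised from 1 to 2.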

Cache Sizing

The cache tiering agent performs two main functions:

  • Flushing: The agent identifies modified (or dirty) objects and forwards them to the storage pool for long-term storage.

  • Evicting: The agent identifies objects that haven’t been modified (or clean) and evicts the least recently used among them from the cache.

Absolute Sizing

The cache tiering agent can flush or evict objects based upon the total number of bytes or the total number of objects. To specify a maximum number of bytes, execute the following:

  ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TiB (1099511627776 bytes), execute the following:

  ceph osd pool set hot-storage target_max_bytes 1099511627776

To specify the maximum number of objects, execute the following:

  ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following:

  ceph osd pool set hot-storage target_max_objects 1000000

Note

Ceph is not able to determine the size of a cache pool automatically, so an absolute size must be configured here; otherwise flushing and evicting will not work. If you specify both limits, the cache tiering agent will begin flushing or evicting when either threshold is triggered.

Note

All client requests are blocked only when target_max_bytes or target_max_objects is reached.
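As a rough illustration of the two absolute limits, the sketch below converts a size in TiB to the byte count passed to target_max_bytes and checks whether either limit has been reached. The helper names are hypothetical, and treating a value of 0 as "limit not set" is an assumption made for this example.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def tib_to_bytes(tib):
    """Convert a whole number of TiB to bytes for target_max_bytes."""
    return tib * TIB


def limit_reached(used_bytes, used_objects,
                  target_max_bytes=0, target_max_objects=0):
    """Return True when either configured limit is reached.

    A limit of 0 is treated as unset (assumption for this sketch).
    """
    bytes_hit = target_max_bytes > 0 and used_bytes >= target_max_bytes
    objects_hit = target_max_objects > 0 and used_objects >= target_max_objects
    return bytes_hit or objects_hit


print(tib_to_bytes(1))  # 1099511627776, the value used in the example above
```

Because the check is an "or", a cache holding many small objects can hit target_max_objects long before it approaches target_max_bytes, which is why setting both limits can be useful.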

Relative Sizing

The cache tiering agent can flush or evict objects relative to the size of the cache pool (specified by target_max_bytes / target_max_objects in Absolute Sizing). When the cache pool consists of a certain percentage of modified (or dirty) objects, the cache tiering agent will flush them to the storage pool. To set the cache_target_dirty_ratio, execute the following:

  ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to 0.4 will begin flushing modified (dirty) objects when they reach 40% of the cache pool’s capacity:

  ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When dirty objects reach a higher percentage of the cache pool’s capacity, the agent flushes them at a higher speed. To set the cache_target_dirty_high_ratio:

  ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to 0.6 will begin aggressively flushing dirty objects when they reach 60% of the cache pool’s capacity. This value should lie between cache_target_dirty_ratio and cache_target_full_ratio:

  ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6

When the cache pool reaches a certain percentage of its capacity, the cache tiering agent will evict objects to maintain free capacity. To set the cache_target_full_ratio, execute the following:

  ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to 0.8 will begin evicting unmodified (clean) objects when the cache pool reaches 80% of its capacity:

  ceph osd pool set hot-storage cache_target_full_ratio 0.8
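To see how the three ratios interact, the following sketch classifies what the agent would be doing for a given amount of dirty and total data. It is a simplified model of the description above, measured in bytes only; agent_actions and its action labels are hypothetical names, and the real agent's scheduling is more involved.

```python
def agent_actions(used_bytes, dirty_bytes, target_max_bytes,
                  dirty_ratio=0.4, dirty_high_ratio=0.6, full_ratio=0.8):
    """Return the actions implied by the ratios for the current usage."""
    if not 0.0 <= dirty_ratio <= dirty_high_ratio <= full_ratio <= 1.0:
        # The high ratio is expected to sit between the other two.
        raise ValueError("expected dirty_ratio <= dirty_high_ratio <= full_ratio")

    actions = []
    if dirty_bytes >= dirty_high_ratio * target_max_bytes:
        actions.append("flush-fast")   # aggressive flushing of dirty objects
    elif dirty_bytes >= dirty_ratio * target_max_bytes:
        actions.append("flush")        # normal flushing of dirty objects
    if used_bytes >= full_ratio * target_max_bytes:
        actions.append("evict")        # evict clean objects to free space
    return actions


print(agent_actions(used_bytes=500, dirty_bytes=450, target_max_bytes=1000))
print(agent_actions(used_bytes=900, dirty_bytes=650, target_max_bytes=1000))
```

The small target of 1000 bytes is used only to make the percentages easy to read; the same comparison applies to a real target_max_bytes value.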

Cache Age

You can specify the minimum age of an object before the cache tiering agent flushes a recently modified (or dirty) object to the backing storage pool:

  ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute the following:

  ceph osd pool set hot-storage cache_min_flush_age 600

You can specify the minimum age of an object before it will be evicted from the cache tier:

  ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following:

  ceph osd pool set hot-storage cache_min_evict_age 1800
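The two age settings act as lower bounds on when the agent may touch an object. A minimal sketch of that gate follows, assuming "age" means seconds since the object was last modified (the text does not define it precisely); both function names are hypothetical.

```python
def may_flush(is_dirty, age_seconds, cache_min_flush_age=600):
    """A dirty object is eligible for flushing once it is old enough."""
    return is_dirty and age_seconds >= cache_min_flush_age


def may_evict(is_dirty, age_seconds, cache_min_evict_age=1800):
    """A clean object is eligible for eviction once it is old enough."""
    return (not is_dirty) and age_seconds >= cache_min_evict_age


# A dirty object modified 15 minutes ago can be flushed but not yet evicted.
print(may_flush(True, 900), may_evict(True, 900))
```

Eligibility here only means the age condition is satisfied; whether the agent actually flushes or evicts still depends on the sizing thresholds configured for the pool.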

Removing a Cache Tier

Removing a cache tier differs depending on whether it is a writeback cache or a read-only cache.

Removing a Read-Only Cache

Since a read-only cache does not have modified data, you can disable and remove it without losing any recent changes to objects in the cache.

  • Change the cache-mode to none to disable it.

      ceph osd tier cache-mode {cachepool} none

    For example:

      ceph osd tier cache-mode hot-storage none

  • Remove the cache pool from the backing pool.

      ceph osd tier remove {storagepool} {cachepool}

    For example:

      ceph osd tier remove cold-storage hot-storage

Removing a Writeback Cache

Since a writeback cache may have modified data, you must take steps to ensure that you do not lose any recent changes to objects in the cache before you disable and remove it.

  • Change the cache mode to proxy so that new and modified objects will flush to the backing storage pool.

      ceph osd tier cache-mode {cachepool} proxy

    For example:

      ceph osd tier cache-mode hot-storage proxy

  • Ensure that the cache pool has been flushed. This may take a few minutes:

      rados -p {cachepool} ls

    If the cache pool still has objects, you can flush them manually. For example:

      rados -p {cachepool} cache-flush-evict-all

  • Remove the overlay so that clients will not direct traffic to the cache.

      ceph osd tier remove-overlay {storagepool}

    For example:

      ceph osd tier remove-overlay cold-storage

  • Finally, remove the cache tier pool from the backing storage pool.

      ceph osd tier remove {storagepool} {cachepool}

    For example:

      ceph osd tier remove cold-storage hot-storage
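The drain check in the steps above (listing the cache pool and confirming that nothing is left) can be scripted. The sketch below assumes the listing from rados -p {cachepool} ls has already been captured as text, one object name per line; cache_is_drained is a hypothetical helper and does not call rados itself.

```python
def cache_is_drained(rados_ls_output):
    """Return True when the captured object listing contains no objects.

    rados_ls_output: text captured from listing the cache pool, assumed
    to hold one object name per line. Blank lines are ignored.
    """
    remaining = [line for line in rados_ls_output.splitlines() if line.strip()]
    return len(remaining) == 0


print(cache_is_drained(""))                                   # nothing listed
print(cache_is_drained("rbd_data.1234.0000\nrbd_header.1234\n"))  # objects remain
```

A removal script could repeat this check after running the manual flush, and only proceed to removing the overlay once it returns True.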