BlueStore Config Reference

Devices

BlueStore manages either one, two, or (in certain cases) three storage devices.

In the simplest case, BlueStore consumes a single (primary) storage device. The storage device is normally used as a whole, occupying the full device that is managed directly by BlueStore. This primary device is normally identified by a block symlink in the data directory.

The data directory is a tmpfs mount which gets populated (at boot time, or when ceph-volume activates it) with all the common OSD files that hold information about the OSD, such as its identifier, which cluster it belongs to, and its private keyring.

It is also possible to deploy BlueStore across two additional devices:

  • A WAL device (identified as block.wal in the data directory) can be used for BlueStore’s internal journal or write-ahead log. It is only useful to use a WAL device if the device is faster than the primary device (e.g., when it is on an SSD and the primary device is an HDD).

  • A DB device (identified as block.db in the data directory) can be used for storing BlueStore’s internal metadata. BlueStore (or rather, the embedded RocksDB) will put as much metadata as it can on the DB device to improve performance. If the DB device fills up, metadata will spill back onto the primary device (where it would have been otherwise). Again, it is only helpful to provision a DB device if it is faster than the primary device.

If there is only a small amount of fast storage available (e.g., less than a gigabyte), we recommend using it as a WAL device. If there is more, provisioning a DB device makes more sense. The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fit).
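The rule of thumb above can be written down as a small helper. This is only a sketch: the function name is hypothetical, and treating "a gigabyte" as 1 GiB is an assumption made for illustration.

```python
GIB = 1024 ** 3  # assumption: "a gigabyte" is taken to mean 1 GiB here


def suggest_fast_device_role(fast_bytes, faster_than_primary=True):
    """Return "none", "block.wal", or "block.db" for a given amount of fast storage.

    Hypothetical helper that mirrors the guidance in the text; it is not
    part of ceph-volume or any Ceph tool.
    """
    # A separate WAL or DB device only helps if it is faster than the primary.
    if not faster_than_primary or fast_bytes <= 0:
        return "none"
    # Very little fast storage: use it for the write-ahead log only.
    if fast_bytes < GIB:
        return "block.wal"
    # Otherwise a DB device gives the WAL benefit and also holds metadata.
    return "block.db"


print(suggest_fast_device_role(512 * 1024 ** 2))
print(suggest_fast_device_role(50 * GIB))
```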

A single-device BlueStore OSD can be provisioned with:

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device:

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

Note

--data can be a logical volume specified in vg/lv notation. Other devices can be existing logical volumes or GPT partitions.

Provisioning strategies

Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore, which had only one), here are two common use cases that should help clarify the initial deployment strategy:

block (data) only

If all the devices are the same type, for example all are spinning drives, and there are no fast devices to combine them with, it makes sense to deploy with block only and not try to separate block.db or block.wal. The lvm call for a single /dev/sda device would look like:

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (1 LV using 100% of the device), then the lvm call for an LV named ceph-vg/block-lv would look like:

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

block and block.db

If there is a mix of fast and slow devices (spinning and solid state), it is recommended to place block.db on the faster device while block (data) lives on the slower (spinning) drive. block.db should be sized as large as possible to avoid performance penalties. The ceph-volume tool is currently not able to create these automatically, so the volume groups and logical volumes need to be created manually.

For the example below, let's assume 4 spinning drives (sda, sdb, sdc, and sdd) and 1 solid state drive (sdx). First create the volume groups:

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for block:

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB SSD in /dev/sdx we will create 4 logical volumes, each of 50GB:

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ceph-volume:

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating 4 OSDs, with block on the slower spinning drives and a 50GB logical volume for each coming from the solid state drive.
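For a different number of drives, the same sequence of commands can be generated rather than typed by hand. The sketch below only builds command strings and does not run anything; the function name is hypothetical, and the volume group and logical volume naming simply follows the walkthrough above.

```python
def plan_block_db_osds(data_devs, db_dev, db_size_gb):
    """Return the shell commands for one OSD per data device, all sharing one DB device.

    Hypothetical helper: it reproduces the naming used in the walkthrough
    (ceph-block-N, block-N, ceph-db-0, db-N) and executes nothing.
    """
    count = len(data_devs)
    # One volume group and one full-size logical volume per slow device.
    cmds = [f"vgcreate ceph-block-{i} {dev}" for i, dev in enumerate(data_devs)]
    cmds += [f"lvcreate -l 100%FREE -n block-{i} ceph-block-{i}" for i in range(count)]
    # One volume group on the fast device, carved into one DB volume per OSD.
    cmds.append(f"vgcreate ceph-db-0 {db_dev}")
    cmds += [f"lvcreate -L {db_size_gb}GB -n db-{i} ceph-db-0" for i in range(count)]
    # Finally, one OSD per block/db pair.
    cmds += [
        f"ceph-volume lvm create --bluestore --data ceph-block-{i}/block-{i} "
        f"--block.db ceph-db-0/db-{i}"
        for i in range(count)
    ]
    return cmds


commands = plan_block_db_osds(
    ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"], "/dev/sdx", 50
)
print("\n".join(commands))
```

Printing the plan first and reviewing it before running any of the commands keeps the destructive steps under manual control.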

Sizing

When using a mixed spinning and solid state drive setup, it is important to make the block.db logical volume large enough for BlueStore. Generally, block.db logical volumes should be as large as possible.

The general recommendation is for block.db to be between 1% and 4% of the block size. For RGW workloads, it is recommended that block.db be no smaller than 4% of block, because RGW uses it heavily to store its metadata. For example, if the block size is 1TB, then block.db shouldn’t be less than 40GB. For RBD workloads, 1% to 2% of the block size is usually enough.
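The percentages above translate into sizes as follows. This is a minimal sketch with a hypothetical function name; it uses decimal units (1TB = 10^12 bytes) so that the 1TB example works out to exactly 40GB.

```python
TB = 10 ** 12  # decimal terabyte, matching the 1TB -> 40GB example in the text
GB = 10 ** 9


def block_db_size_bytes(block_bytes, percent):
    """Return the block.db size for a block device, given a whole-number percentage."""
    return block_bytes * percent // 100


# RGW guidance: no smaller than 4% of block.
print(block_db_size_bytes(1 * TB, 4) // GB, "GB")
# RBD guidance: 1% to 2% of block is usually enough.
print(block_db_size_bytes(1 * TB, 1) // GB, "GB")
print(block_db_size_bytes(1 * TB, 2) // GB, "GB")
```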

If not using a mix of fast and slow devices, it isn’t required to create separate logical volumes for block.db (or block.wal). BlueStore will automatically manage these within the space of block.

Automatic Cache Sizing

BlueStore can be configured to automatically resize its caches when tcmalloc is configured as the memory allocator and the bluestore_cache_autotune setting is enabled. This option is currently enabled by default. BlueStore will attempt to keep OSD heap memory usage under a designated target size via the osd_memory_target configuration option. This is a best-effort algorithm, and caches will not shrink smaller than the amount specified by osd_memory_cache_min. Cache ratios will be chosen based on a hierarchy of priorities. If priority information is not available, the bluestore_cache_meta_ratio and bluestore_cache_kv_ratio options are used as fallbacks.

bluestore_cache_autotune

  • Description
  • Automatically tune the ratios assigned to different bluestore caches while respecting minimum values.

  • Type

  • Boolean

  • Required

  • Yes

  • Default

  • True

osd_memory_target

  • Description
  • When tcmalloc is available and cache autotuning is enabled, try to keep this many bytes mapped in memory. Note: This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped. During initial development, it was found that some kernels result in the OSD’s RSS memory exceeding the mapped memory by up to 20%. It is hypothesised, however, that the kernel may generally be more aggressive about reclaiming unmapped memory when there is a high amount of memory pressure. Your mileage may vary.

  • Type

  • Unsigned Integer

  • Required

  • Yes

  • Default

  • 4294967296

bluestore_cache_autotune_chunk_size

  • Description
  • The chunk size in bytes to allocate to caches when cache autotune is enabled. When the autotuner assigns memory to different caches, it will allocate memory in chunks. This is done to avoid evictions when there are minor fluctuations in the heap size or autotuned cache ratios.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 33554432

bluestore_cache_autotune_interval

  • Description
  • The number of seconds to wait between rebalances when cache autotune is enabled. This setting changes how quickly the ratios of the different caches are recomputed. Note: Setting the interval too small can result in high CPU usage and lower performance.

  • Type

  • Float

  • Required

  • No

  • Default

  • 5

osd_memory_base

  • Description
  • When tcmalloc and cache autotuning are enabled, estimate the minimum amount of memory in bytes the OSD will need. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 805306368

osd_memory_expected_fragmentation

  • Description
  • When tcmalloc and cache autotuning are enabled, estimate the percent of memory fragmentation. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches.

  • Type

  • Float

  • Required

  • No

  • Default

  • 0.15

osd_memory_cache_min

  • Description
  • When tcmalloc and cache autotuning are enabled, set the minimum amount of memory used for caches. Note: Setting this value too low can result in significant cache thrashing.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 134217728

osd_memory_cache_resize_interval

  • Description
  • When tcmalloc and cache autotuning are enabled, wait this many seconds between resizing caches. This setting changes the total amount of memory available for BlueStore to use for caching. Note: Setting the interval too small can result in memory allocator thrashing and lower performance.

  • Type

  • Float

  • Required

  • No

  • Default

  • 1
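The byte-valued defaults listed for the options in this section are easier to reason about in binary units. The sketch below only converts the numbers quoted above; the helper name is hypothetical and nothing here queries a running cluster.

```python
# Defaults copied from the option descriptions in this section (bytes).
DEFAULTS = {
    "osd_memory_target": 4294967296,
    "bluestore_cache_autotune_chunk_size": 33554432,
    "osd_memory_base": 805306368,
    "osd_memory_cache_min": 134217728,
}


def human(n):
    """Format a byte count in the largest binary unit that divides it evenly."""
    for unit, size in (("GiB", 1024 ** 3), ("MiB", 1024 ** 2), ("KiB", 1024)):
        if n >= size and n % size == 0:
            return f"{n // size} {unit}"
    return f"{n} B"


for name, value in DEFAULTS.items():
    print(f"{name} = {value} ({human(value)})")
```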

Manual Cache Sizing

The amount of memory consumed by each OSD for BlueStore’s cache is determined by the bluestore_cache_size configuration option. If that config option is not set (i.e., remains at 0), a different default value is used depending on whether an HDD or SSD is used for the primary device (set by the bluestore_cache_size_ssd and bluestore_cache_size_hdd config options).

BlueStore and the rest of the Ceph OSD currently do their best to stay within the budgeted memory. Note that on top of the configured cache size, there is also memory consumed by the OSD itself, and generally some overhead due to memory fragmentation and other allocator overhead.

The configured cache memory budget can be used in a few different ways:

  • Key/Value metadata (i.e., RocksDB’s internal cache)

  • BlueStore metadata

  • BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the following options: bluestore_cache_meta_ratio and bluestore_cache_kv_ratio. The fraction of the cache devoted to data is governed by the effective BlueStore cache size (which depends on the bluestore_cache_size[_ssd|_hdd] settings and the device class of the primary device) as well as the meta and kv ratios. The data fraction can be calculated by:

  <effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)
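As a worked example of the calculation, the sketch below first resolves the effective cache size and then splits it by the two ratios. The function names are hypothetical, the data share is taken as the remainder (which equals the formula up to rounding), and the bluestore_cache_kv_max cap is deliberately not modelled.

```python
GIB = 1024 ** 3


def effective_cache_size(cache_size, cache_size_hdd, cache_size_ssd, primary_is_ssd):
    """Use bluestore_cache_size if non-zero, else the HDD or SSD default."""
    if cache_size != 0:
        return cache_size
    return cache_size_ssd if primary_is_ssd else cache_size_hdd


def cache_split(effective, meta_ratio, kv_ratio):
    """Split the cache budget into metadata, key/value, and data shares (bytes)."""
    meta = int(effective * meta_ratio)
    kv = int(effective * kv_ratio)
    # Whatever the two ratios leave over goes to object data.
    return {"meta": meta, "kv": kv, "data": effective - meta - kv}


# HDD-backed OSD with bluestore_cache_size left at 0 and the documented defaults.
size = effective_cache_size(0, 1 * GIB, 3 * GIB, primary_is_ssd=False)
print(size, cache_split(size, 0.4, 0.4))
```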

bluestore_cache_size

  • Description
  • The amount of memory BlueStore will use for its cache. If zero, bluestore_cache_size_hdd or bluestore_cache_size_ssd will be used instead.

  • Type

  • Unsigned Integer

  • Required

  • Yes

  • Default

  • 0

bluestore_cache_size_hdd

  • Description
  • The default amount of memory BlueStore will use for its cache when backed by an HDD.

  • Type

  • Unsigned Integer

  • Required

  • Yes

  • Default

  • 1 * 1024 * 1024 * 1024 (1 GB)

bluestore_cache_size_ssd

  • Description
  • The default amount of memory BlueStore will use for its cache when backed by an SSD.

  • Type

  • Unsigned Integer

  • Required

  • Yes

  • Default

  • 3 * 1024 * 1024 * 1024 (3 GB)

bluestore_cache_meta_ratio

  • Description
  • The ratio of cache devoted to metadata.

  • Type

  • Floating point

  • Required

  • Yes

  • Default

  • .4

bluestore_cache_kv_ratio

  • Description
  • The ratio of cache devoted to key/value data (rocksdb).

  • Type

  • Floating point

  • Required

  • Yes

  • Default

  • .4

bluestore_cache_kv_max

  • Description
  • The maximum amount of cache devoted to key/value data (rocksdb).

  • Type

  • Unsigned Integer

  • Required

  • Yes

  • Default

  • 512 * 1024 * 1024 (512 MB)

Checksums

BlueStore checksums all metadata and data written to disk. Metadata checksumming is handled by RocksDB and uses crc32c. Data checksumming is done by BlueStore and can make use of crc32c, xxhash32, or xxhash64. The default is crc32c and should be suitable for most purposes.

Full data checksumming does increase the amount of metadata that BlueStore must store and manage. When possible, e.g., when clients hint that data is written and read sequentially, BlueStore will checksum larger blocks, but in many cases it must store a checksum value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the checksum to two or one byte, reducing the metadata overhead. The trade-off is that the probability that a random error will not be detected is higher with a smaller checksum, going from about one in four billion with a 32-bit (4 byte) checksum to one in 65,536 for a 16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. The smaller checksum values can be used by selecting crc32c_16 or crc32c_8 as the checksum algorithm.
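The trade-off can be put into numbers. The sketch below assumes the worst case described above, one checksum per 4 kilobyte block with no larger checksum blocks, and uses hypothetical function names.

```python
def csum_metadata_bytes(data_bytes, csum_bytes=4, block_bytes=4096):
    """Checksum metadata needed if every block of data carries its own checksum."""
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * csum_bytes


def undetected_error_odds(csum_bits):
    """Return N such that a random error goes undetected about one time in N."""
    return 2 ** csum_bits


TIB = 1024 ** 4
for name, nbytes in (("crc32c", 4), ("crc32c_16", 2), ("crc32c_8", 1)):
    overhead_mib = csum_metadata_bytes(TIB, csum_bytes=nbytes) // 1024 ** 2
    odds = undetected_error_odds(nbytes * 8)
    print(f"{name}: {overhead_mib} MiB per TiB of data, 1 in {odds} undetected")
```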

The checksum algorithm can be set either via a per-pool csum_type property or the global config option. For example:

  ceph osd pool set <pool-name> csum_type <algorithm>

bluestore_csum_type

  • Description
  • The default checksum algorithm to use.

  • Type

  • String

  • Required

  • Yes

  • Valid Settings

  • none, crc32c, crc32c_16, crc32c_8, xxhash32, xxhash64

  • Default

  • crc32c

Inline Compression

BlueStore supports inline compression using snappy, zlib, or lz4. Please note that the lz4 compression plugin is not distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination of the compression mode and any hints associated with a write operation. The modes are:

  • none: Never compress data.

  • passive: Do not compress data unless the write operation has a compressible hint set.

  • aggressive: Compress data unless the write operation has an incompressible hint set.

  • force: Try to compress data no matter what.

For more information about the compressible and incompressible IO hints, see rados_set_alloc_hint().
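The four modes amount to a small decision table over the mode and the hint carried by the write. The sketch below is illustrative only: the function name and the hint strings "compressible" and "incompressible" are hypothetical labels, not the flag names used by the librados API.

```python
def wants_compression(mode, hint=None):
    """Return True if a write should be handed to the compressor.

    hint is None, "compressible", or "incompressible" (labels chosen for
    this sketch only).
    """
    if mode == "none":
        return False
    if mode == "passive":
        return hint == "compressible"
    if mode == "aggressive":
        return hint != "incompressible"
    if mode == "force":
        return True
    raise ValueError(f"unknown compression mode: {mode!r}")


for mode in ("none", "passive", "aggressive", "force"):
    row = [wants_compression(mode, h) for h in (None, "compressible", "incompressible")]
    print(mode, row)
```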

Note that regardless of the mode, if the size of the data chunk is not reduced sufficiently it will not be used and the original (uncompressed) data will be stored. For example, if the bluestore compression required ratio is set to .7 then the compressed data must be 70% of the size of the original (or smaller).
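The required-ratio check can be sketched as a single comparison. This is a simplification with a hypothetical function name: it compares raw byte lengths and ignores any rounding to allocation units that the real implementation may apply.

```python
def keep_compressed(original_len, compressed_len, required_ratio=0.875):
    """Return True if the compressed chunk is small enough to be stored."""
    return compressed_len <= original_len * required_ratio


# With a required ratio of .7, a 4096-byte chunk must shrink to 70% or less.
print(keep_compressed(4096, 2867, required_ratio=0.7))
print(keep_compressed(4096, 2868, required_ratio=0.7))
```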

The compression mode, compression algorithm, compression required ratio, min blob size, and max blob size can be set either via a per-pool property or a global config option. Pool properties can be set with:

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

bluestore compression algorithm

  • Description
  • The default compressor to use (if any) if the per-pool property compression_algorithm is not set. Note that zstd is not recommended for BlueStore due to high CPU overhead when compressing small amounts of data.

  • Type

  • String

  • Required

  • No

  • Valid Settings

  • lz4, snappy, zlib, zstd

  • Default

  • snappy

bluestore compression mode

  • Description
  • The default policy for using compression if the per-pool property compression_mode is not set. none means never use compression. passive means use compression when clients hint that data is compressible. aggressive means use compression unless clients hint that data is not compressible. force means use compression under all circumstances even if the clients hint that the data is not compressible.

  • Type

  • String

  • Required

  • No

  • Valid Settings

  • none, passive, aggressive, force

  • Default

  • none

bluestore compression required ratio

  • Description
  • The ratio of the size of the data chunk after compression relative to the original size must be at least this small in order to store the compressed version.

  • Type

  • Floating point

  • Required

  • No

  • Default

  • .875

bluestore compression min blob size

  • Description
  • Chunks smaller than this are never compressed. The per-pool property compression_min_blob_size overrides this setting.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 0

bluestore compression min blob size hdd

  • Description
  • Default value of bluestore compression min blob size for rotational media.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 128K

bluestore compression min blob size ssd

  • Description
  • Default value of bluestore compression min blob size for non-rotational (solid state) media.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 8K

bluestore compression max blob size

  • Description
  • Chunks larger than this are broken into smaller blobs of at most bluestore compression max blob size bytes before being compressed. The per-pool property compression_max_blob_size overrides this setting.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 0

bluestore compression max blob size hdd

  • Description
  • Default value of bluestore compression max blob size for rotational media.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 512K

bluestore compression max blob size ssd

  • Description
  • Default value of bluestore compression max blob size for non-rotational (solid state) media.

  • Type

  • Unsigned Integer

  • Required

  • No

  • Default

  • 64K
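The min and max blob size options interact roughly as sketched below. Two things here are assumptions rather than documented behaviour: that a value of 0 means "fall back to the hdd or ssd default" (by analogy with bluestore_cache_size), and that a large chunk is cut into max-size pieces with one smaller remainder. The function names are hypothetical.

```python
KIB = 1024


def effective_blob_size(configured, hdd_default, ssd_default, rotational):
    """Resolve a blob size option; 0 is assumed to mean 'use the media default'."""
    if configured != 0:
        return configured
    return hdd_default if rotational else ssd_default


def plan_blobs(chunk_len, min_blob, max_blob):
    """Return the blob lengths to compress, or [] if the chunk is too small."""
    if chunk_len < min_blob:
        return []  # chunks below the minimum are never compressed
    full, rest = divmod(chunk_len, max_blob)
    return [max_blob] * full + ([rest] if rest else [])


# Rotational media with both global options left at 0.
min_blob = effective_blob_size(0, 128 * KIB, 8 * KIB, rotational=True)
max_blob = effective_blob_size(0, 512 * KIB, 64 * KIB, rotational=True)
print(plan_blobs(1280 * KIB, min_blob, max_blob))
print(plan_blobs(64 * KIB, min_blob, max_blob))
```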

SPDK Usage

If you want to use the SPDK driver for NVMe SSDs, you need to prepare your system first. Please refer to the SPDK documentation for more details.

SPDK offers a script to configure the device automatically. Users can run the script as root:

  $ sudo src/spdk/scripts/setup.sh

Then you need to specify the NVMe device’s device selector, with the “spdk:” prefix, for bluestore_block_path.

For example, users can find the device selector of an Intel PCIe SSD with:

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form DDDD:BB:DD.FF or DDDD.BB.DD.FF.

Then set:

  bluestore block path = spdk:0000:01:00.0

Where 0000:01:00.0 is the device selector found in the output of the lspci command above.
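A selector can be sanity-checked before it is written into the configuration. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the pattern is inferred from the DDDD:BB:DD.FF form and the 0000:01:00.0 example above (hex digits, with a one- or two-digit function field), not taken from the Ceph or SPDK sources.

```python
import re

# Four hex digits (domain), two (bus), two (device), one or two (function).
# Both ':' and '.' are accepted between the first three fields, as in the text.
_SELECTOR = re.compile(
    r"^[0-9a-fA-F]{4}[:.][0-9a-fA-F]{2}[:.][0-9a-fA-F]{2}\.[0-9a-fA-F]{1,2}$"
)


def spdk_block_path(selector):
    """Return the 'spdk:'-prefixed value for a device selector, or raise ValueError."""
    if not _SELECTOR.match(selector):
        raise ValueError(f"not a PCI device selector: {selector!r}")
    return f"spdk:{selector}"


print("bluestore_block_path =", spdk_block_path("0000:01:00.0"))
```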

If you want to run multiple SPDK instances per node, you must specify the amount of DPDK memory (in MB) that each instance will use, to make sure each instance uses its own DPDK memory.

In most cases, a single device serves as the data, DB, and WAL device. The following settings make sure that all I/O is issued through SPDK:

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will set up a symlink to a kernel file system location and use the kernel driver to issue DB/WAL I/O.