Storage Devices

There are two Ceph daemons that store data on disk:

  • Ceph OSDs (or Object Storage Daemons) are where most of the data is stored in Ceph. Generally speaking, each OSD is backed by a single storage device, such as a traditional hard disk drive (HDD) or solid-state drive (SSD). OSDs can also be backed by a combination of devices: for example, an HDD for most data and an SSD (or a partition of an SSD) for some metadata. The number of OSDs in a cluster is generally a function of how much data will be stored, how big each storage device will be, and the level and type of redundancy (replication or erasure coding); a rough sizing sketch follows this list.

  • Ceph Monitor daemons manage critical cluster state such as cluster membership and authentication information. For smaller clusters a few gigabytes is all that is needed, although for larger clusters the monitor database can reach tens or possibly hundreds of gigabytes.
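
As a rough illustration of the OSD sizing relationship described above, the following Python sketch estimates how many OSDs a cluster needs. The function name, the 75% target fill ratio, and the replication factor of 3 are illustrative assumptions for this sketch, not values taken from Ceph itself.

    import math

    def estimate_osd_count(usable_tb, device_tb, replicas=3, target_fill=0.75):
        """Estimate the number of OSDs needed to hold `usable_tb` of data.

        usable_tb   -- amount of user data to store (before redundancy)
        device_tb   -- capacity of each OSD's backing device
        replicas    -- replication factor; for erasure coding, pass the
                       overhead ratio (k + m) / k instead
        target_fill -- fraction of each device to use, leaving headroom
                       for rebalancing and recovery
        """
        raw_tb = usable_tb * replicas          # redundancy multiplies raw capacity
        per_osd_tb = device_tb * target_fill   # usable share of each device
        return math.ceil(raw_tb / per_osd_tb)

    # Example: 100 TB of data on 8 TB devices with 3x replication -> 50 OSDs
    print(estimate_osd_count(100, 8))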

OSD Backends

There are two ways that OSDs can manage the data they store. Starting with the Luminous 12.2.z release, the new default (and recommended) backend is BlueStore. Prior to Luminous, the default (and only option) was FileStore.

BlueStore

BlueStore is a special-purpose storage backend designed specifically for managing data on disk for Ceph OSD workloads. It is motivated by experience supporting and managing OSDs using FileStore over the last ten years. Key BlueStore features include:

  • Direct management of storage devices. BlueStore consumes raw block devices or partitions. This avoids any intervening layers of abstraction (such as local file systems like XFS) that may limit performance or add complexity.

  • Metadata management with RocksDB. We embed RocksDB’s key/value database in order to manage internal metadata, such as the mapping from object names to block locations on disk.

  • Full data and metadata checksumming. By default all data and metadata written to BlueStore is protected by one or more checksums. No data or metadata will be read from disk or returned to the user without being verified; a conceptual sketch of this verify-on-read behavior follows this list.

  • Inline compression. Data may optionally be compressed before being written to disk.

  • Multi-device metadata tiering. BlueStore allows its internal journal (write-ahead log) to be written to a separate, high-speed device (like an SSD, NVMe, or NVDIMM) to increase performance. If a significant amount of faster storage is available, internal metadata can also be stored on the faster device.

  • Efficient copy-on-write. RBD and CephFS snapshots rely on a copy-on-write clone mechanism that is implemented efficiently in BlueStore. This results in efficient IO both for regular snapshots and for erasure coded pools (which rely on cloning to implement efficient two-phase commits).
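
To make the verify-on-read guarantee above concrete, here is a minimal Python sketch of a toy key/value store that checksums data at write time and refuses to return it unless the checksum verifies on read. The class name and the use of zlib.crc32 are illustrative assumptions; BlueStore itself defaults to crc32c and keeps its checksums in RocksDB metadata, not in a Python dictionary.

    import zlib

    class ChecksummedStore:
        """Toy store mimicking verify-on-read; not BlueStore's on-disk format."""

        def __init__(self):
            self._blocks = {}  # object name -> (payload bytes, stored checksum)

        def write(self, name, payload):
            # Compute a checksum at write time and persist it alongside the data.
            self._blocks[name] = (payload, zlib.crc32(payload))

        def read(self, name):
            payload, expected = self._blocks[name]
            # Recompute and verify before returning anything to the caller.
            if zlib.crc32(payload) != expected:
                raise IOError(f"checksum mismatch on object {name!r}")
            return payload

If the stored payload were corrupted between write and read, the read would raise an error instead of silently handing bad data back to the caller, which is the property the checksumming bullet describes.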

For more information, see BlueStore Config Reference and BlueStore Migration.

FileStore

FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.

FileStore is well-tested and widely used in production but suffers from many performance deficiencies due to its overall design and reliance on a traditional file system for storing object data.

Although FileStore is generally capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we only recommend that XFS be used. Both btrfs and ext4 have known bugs and deficiencies and their use may lead to data loss. By default all Ceph provisioning tools will use XFS.

For more information, see Filestore Config Reference.