Erasure code

A Ceph pool is associated with a type that determines how it sustains the loss of an OSD (i.e. a disk, since most of the time there is one OSD per disk). The default choice when creating a pool is replicated, meaning every object is copied on multiple disks. The Erasure Code pool type can be used instead to save space.

Creating a sample erasure coded pool

The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts:

    $ ceph osd pool create ecpool erasure
    pool 'ecpool' created
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

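Note

The command above omits the placement group count. When a number follows the pool name, as in ceph osd pool create ecpool 12 12 erasure, the 12 stands for the number of placement groups.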

Erasure code profiles

The default erasure code profile sustains the loss of two OSDs. It is equivalent to a replicated pool of size three but requires 2TB instead of 3TB to store 1TB of data. The default profile can be displayed with:

    $ ceph osd erasure-code-profile get default
    k=2
    m=2
    plugin=jerasure
    crush-failure-domain=host
    technique=reed_sol_van
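
The space figures quoted for the default profile follow from simple arithmetic: an erasure coded pool stores K+M chunks for every K chunks of data, while a replicated pool stores one full copy per replica. The sketch below only illustrates that arithmetic. The function names ec_raw_tb, replicated_raw_tb and ec_overhead_pct are hypothetical helpers, not part of the ceph CLI, and the calculation assumes no padding or metadata overhead.

```shell
#!/bin/sh
# Hypothetical helpers (not ceph commands): raw capacity needed to store a
# given amount of data, ignoring padding and metadata. Integer arithmetic
# only, so results are truncated to whole units.

# ec_raw_tb DATA_TB K M -> raw TB used by an erasure coded pool: data * (K+M) / K
ec_raw_tb() {
    echo $(( $1 * ($2 + $3) / $2 ))
}

# replicated_raw_tb DATA_TB SIZE -> raw TB used by a replicated pool: data * size
replicated_raw_tb() {
    echo $(( $1 * $2 ))
}

# ec_overhead_pct K M -> extra space as a rounded percentage of the data: M / K
ec_overhead_pct() {
    echo $(( ($2 * 100 + $1 / 2) / $1 ))
}

ec_raw_tb 1 2 2          # default profile, k=2 m=2
replicated_raw_tb 1 3    # replicated pool of size three
ec_overhead_pct 2 2      # overhead of the default profile
```

Run as a script, this prints 2, 3 and 100: the default profile needs 2TB of raw space per 1TB of data (100% overhead), against 3TB for three replicas.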

Choosing the right profile is important because it cannot be modified after the pool is created: a new pool with a different profile needs to be created and all objects from the previous pool moved to the new one.

The most important parameters of the profile are K, M and crush-failure-domain because they define the storage overhead and the data durability. For instance, if the desired architecture must sustain the loss of two racks with a storage overhead of 67%, the following profile can be defined:

    $ ceph osd erasure-code-profile set myprofile \
        k=3 \
        m=2 \
        crush-failure-domain=rack
    $ ceph osd pool create ecpool erasure myprofile
    $ echo ABCDEFGHI | rados --pool ecpool put NYAN -
    $ rados --pool ecpool get NYAN -
    ABCDEFGHI

The NYAN object will be divided in three (K=3) and two additional chunks will be created (M=2). The value of M defines how many OSDs can be lost simultaneously without losing any data. The crush-failure-domain=rack setting will create a CRUSH rule that ensures no two chunks are stored in the same rack.
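
The division into data chunks can be pictured with a small sketch. It is an illustration only: split_chunks is a hypothetical helper, it works on a plain string instead of a RADOS object, and it does not compute the M coding chunks, which the configured plugin derives from the data chunks. The trailing newline that echo adds to the stored object is also left out for clarity.

```shell
#!/bin/sh
# Illustration only: split a payload into K equally sized data chunks.
# split_chunks is a hypothetical helper, not a ceph or rados command.

# split_chunks PAYLOAD K -> prints one data chunk per line
split_chunks() {
    payload=$1
    k=$2
    len=${#payload}
    # Round the chunk size up so that K chunks always cover the payload.
    size=$(( (len + k - 1) / k ))
    i=0
    while [ "$i" -lt "$k" ]; do
        start=$(( i * size + 1 ))
        end=$(( start + size - 1 ))
        printf '%s\n' "$payload" | cut -c "${start}-${end}"
        i=$(( i + 1 ))
    done
}

split_chunks ABCDEFGHI 3
```

With K=3 the payload ABCDEFGHI is printed as the three data chunks ABC, DEF and GHI, one per line.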

Figure 1: Erasure code

More information can be found in the erasure code profiles documentation.

Erasure Coding with Overwrites

By default, erasure coded pools only work with uses like RGW that perform full object writes and appends.

Since Luminous, partial writes for an erasure coded pool may be enabled with a per-pool setting. This lets RBD and CephFS store their data in an erasure coded pool:

    ceph osd pool set ec_pool allow_ec_overwrites true

This can only be enabled on a pool residing on bluestore OSDs, since bluestore’s checksumming is used to detect bitrot or other corruption during deep-scrub. In addition to being unsafe, using filestore with ec overwrites yields low performance compared to bluestore.

Erasure coded pools do not support omap, so to use them with RBD and CephFS you must instruct them to store their data in an ec pool, and their metadata in a replicated pool. For RBD, this means using the erasure coded pool as the --data-pool during image creation:

    rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
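
Putting the steps of this section together, the sequence for an RBD image backed by an erasure coded pool might look like the sketch below. The pool and image names come from the commands above. The pool creation and rbd pool init steps are assumptions about setup this section does not show, and run is a hypothetical wrapper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: order of operations for an RBD image whose data lives in an
# erasure coded pool. run is a hypothetical wrapper, not a ceph command; by
# default it prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_rbd_on_ec() {
    # Assumed setup: create the erasure coded data pool and allow partial writes.
    run ceph osd pool create ec_pool erasure
    run ceph osd pool set ec_pool allow_ec_overwrites true
    # Assumed setup: create and initialise the replicated pool holding metadata.
    run ceph osd pool create replicated_pool
    run rbd pool init replicated_pool
    # Image metadata goes to replicated_pool, image data to ec_pool.
    run rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
}

setup_rbd_on_ec
```

Leaving DRY_RUN unset lists the five commands without touching a cluster; setting DRY_RUN=0 would execute them, which should only be done after checking them against the target release.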

For CephFS, an erasure coded pool can be set as the default data pool during file system creation or via file layouts.

Erasure coded pool and cache tiering

Erasure coded pools require more resources than replicated pools and lack some functionalities such as omap. To overcome these limitations, one can set up a cache tier before the erasure coded pool.

For instance, if the pool hot-storage is made of fast storage:

    $ ceph osd tier add ecpool hot-storage
    $ ceph osd tier cache-mode hot-storage writeback
    $ ceph osd tier set-overlay ecpool hot-storage

will place the hot-storage pool as a tier of ecpool in writeback mode, so that every write and read to ecpool actually uses hot-storage and benefits from its flexibility and speed.

More information can be found in the cache tiering documentation.

Erasure coded pool recovery

If an erasure coded pool loses some shards, it must recover them from the others. This generally involves reading from the remaining shards, reconstructing the data, and writing it to the new peer. In Octopus, erasure coded pools can recover as long as there are at least K shards available. (With fewer than K shards, you have actually lost data!)

Prior to Octopus, erasure coded pools required at least min_size shards to be available, even if min_size is greater than K. (We generally recommend min_size be K+2 or more to prevent loss of writes and data.) This conservative decision was made out of an abundance of caution when designing the new pool mode, but it also meant that pools with lost OSDs but no data loss were unable to recover and go active without manual intervention to change the min_size.
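
The two recovery rules can be compared with a small sketch. It is an illustration only: can_recover is a hypothetical helper, not a ceph command, and the example values (k=3, m=2, min_size=4) are assumptions chosen to show a case where the rules differ.

```shell
#!/bin/sh
# Illustration only: can_recover is a hypothetical helper, not a ceph command.
# It encodes the two rules described above.

# can_recover RULE AVAILABLE K MIN_SIZE
#   RULE       "octopus" for the Octopus rule, anything else for the earlier one
#   AVAILABLE  number of shards still readable
# Returns 0 (true) when recovery can proceed without manual intervention.
can_recover() {
    rule=$1 available=$2 k=$3 min_size=$4
    if [ "$rule" = "octopus" ]; then
        # Octopus: K shards are enough to reconstruct the data.
        [ "$available" -ge "$k" ]
    else
        # Earlier releases: min_size shards were also required.
        [ "$available" -ge "$k" ] && [ "$available" -ge "$min_size" ]
    fi
}

# Assumed example: k=3, m=2, min_size=4, two OSDs lost, so 3 shards remain.
if can_recover octopus 3 3 4; then echo "octopus: recovers"; fi
if ! can_recover pre-octopus 3 3 4; then echo "pre-octopus: needs min_size change"; fi
```

In this example no data is lost, since three shards remain and K is three, yet only the Octopus rule lets the pool recover on its own.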

Glossary

  • chunk: when the encoding function is called, it returns chunks of the same size: data chunks, which can be concatenated to reconstruct the original object, and coding chunks, which can be used to rebuild a lost chunk.

  • K: the number of data chunks, i.e. the number of chunks in which the original object is divided. For instance, if K = 2 a 10KB object will be divided into K objects of 5KB each.

  • M: the number of coding chunks, i.e. the number of additional chunks computed by the encoding functions. If there are 2 coding chunks, it means 2 OSDs can be out without losing data.
