BlueStore Migration

Each OSD can run either BlueStore or FileStore, and a single Ceph cluster can contain a mix of both. Users who have previously deployed FileStore are likely to want to transition to BlueStore in order to take advantage of the improved performance and robustness. There are several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however: BlueStore and FileStore are simply too different for that to be practical. “Conversion” will rely either on the cluster’s normal replication and healing support or tools and strategies that copy OSD content from an old (FileStore) device to a new (BlueStore) one.

Deploy new OSDs with BlueStore

Any new OSDs (e.g., when the cluster is expanded) can be deployed using BlueStore. This is the default behavior so no specific change is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive can use BlueStore.
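
For example (a minimal sketch; <new-device> and <new-osd-id> are placeholders), a new OSD can be provisioned and then checked to confirm that it is using BlueStore:

  1. # BlueStore is the default objectstore, so no extra flag is needed.
  2. ceph-volume lvm create --data /dev/<new-device>
  3. # The new OSD should report bluestore:
  4. ceph osd metadata <new-osd-id> | grep osd_objectstore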

Convert existing OSDs

Mark out and replace

The simplest approach is to mark out each device in turn, wait for the data to replicate across the cluster, reprovision the OSD, and mark it back in again. It is simple and easy to automate. However, it requires more data migration than should be necessary, so it is not optimal.

  • Identify a FileStore OSD to replace:
  1. ID=<osd-id-number>
  2. DEVICE=<disk-device>

You can tell whether a given OSD is FileStore or BlueStore with:

  1. ceph osd metadata $ID | grep osd_objectstore

You can get a current count of filestore vs bluestore with:

  1. ceph osd count-metadata osd_objectstore
  • Mark the filestore OSD out:
  1. ceph osd out $ID
  • Wait for the data to migrate off the OSD in question:
  1. while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
  • Stop the OSD:
  1. systemctl kill ceph-osd@$ID
  • Make note of which device this OSD is using:
  1. mount | grep /var/lib/ceph/osd/ceph-$ID
  • Unmount the OSD:
  1. umount /var/lib/ceph/osd/ceph-$ID
  • Destroy the OSD data. Be EXTREMELY CAREFUL as this will destroy the contents of the device; be certain the data on the device is not needed (i.e., that the cluster is healthy) before proceeding.
  1. ceph-volume lvm zap $DEVICE
  • Tell the cluster the OSD has been destroyed (and a new OSD can be reprovisioned with the same ID):
  1. ceph osd destroy $ID --yes-i-really-mean-it
  • Reprovision a BlueStore OSD in its place with the same OSD ID. This requires you to identify which device to wipe based on what you saw mounted above. BE CAREFUL!
  1. ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
  • Repeat.

You can allow the refilling of the replacement OSD to happen concurrently with the draining of the next OSD, or follow the same procedure for multiple OSDs in parallel, as long as you ensure the cluster is fully clean (all data has all replicas) before destroying any OSDs. Failure to do so will reduce the redundancy of your data and increase the risk of (or potentially even cause) data loss.
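
As a rough sketch of the batched approach (the OSD IDs are placeholders, and the per-OSD steps inside the loop are elided), several FileStore OSDs can be drained concurrently and then converted one at a time once each is safe to destroy:

  1. # Mark a batch of FileStore OSDs out so they drain concurrently.
  2. for ID in <id-1> <id-2> <id-3> ; do ceph osd out $ID ; done
  3. # Convert each one only after it reports safe-to-destroy.
  4. for ID in <id-1> <id-2> <id-3> ; do
  5. while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
  6. # ... stop, unmount, zap, destroy, and reprovision the OSD as in the steps above ...
  7. done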

Advantages:

  • Simple.

  • Can be done on a device-by-device basis.

  • No spare devices or hosts are required.

Disadvantages:

  • Data is copied over the network twice: once to some other OSD in the cluster (to maintain the desired number of replicas), and then again back to the reprovisioned BlueStore OSD.

Whole host replacement

If you have a spare host in the cluster, or have sufficient free space to evacuate an entire host in order to use it as a spare, then the conversion can be done on a host-by-host basis with each stored copy of the data migrating only once.

First, you need an empty host that has no data. There are two ways to do this: either start with a new, empty host that isn’t yet part of the cluster, or offload data from an existing host that is already in the cluster.

Use a new, empty host

Ideally the host should have roughly the same capacity as other hosts you will be converting (although it doesn’t strictly matter).

  1. NEWHOST=<empty-host-name>

Add the host to the CRUSH hierarchy, but do not attach it to the root:

  1. ceph osd crush add-bucket $NEWHOST host

Make sure the ceph packages are installed.
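
As a quick sanity check (a sketch, reusing $NEWHOST from above), the new bucket should now appear in the CRUSH tree with a weight of 0 and no parent:

  1. # The new host bucket should be listed but not nested under root default.
  2. ceph osd tree | grep $NEWHOST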

Use an existing host

If you would like to use an existing host that is already part of the cluster, and there is sufficient free space on that host so that all of its data can be migrated off, then you can instead do:

  1. OLDHOST=<existing-cluster-host-to-offload>
  2. ceph osd crush unlink $OLDHOST default

where “default” is the immediate ancestor in the CRUSH map. (For smaller clusters with unmodified configurations this will normally be “default”, but it might also be a rack name.) You should now see the host at the top of the OSD tree output with no parent:

  1. $ bin/ceph osd tree
  2. ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
  3. -5 0 host oldhost
  4. 10 ssd 1.00000 osd.10 up 1.00000 1.00000
  5. 11 ssd 1.00000 osd.11 up 1.00000 1.00000
  6. 12 ssd 1.00000 osd.12 up 1.00000 1.00000
  7. -1 3.00000 root default
  8. -2 3.00000 host foo
  9. 0 ssd 1.00000 osd.0 up 1.00000 1.00000
  10. 1 ssd 1.00000 osd.1 up 1.00000 1.00000
  11. 2 ssd 1.00000 osd.2 up 1.00000 1.00000
  12. ...

If everything looks good, jump directly to the “Wait for data migration to complete” step below and proceed from there to clean up the old OSDs.

Migration process

If you’re using a new host, start with the first step below. For an existing host, jump to the “Wait for data migration to complete” step below.

  • Provision new BlueStore OSDs for all devices:
  1. ceph-volume lvm create --bluestore --data /dev/$DEVICE
  • Verify OSDs join the cluster with:
  1. ceph osd tree

You should see the new host $NEWHOST with all of the OSDs beneath it, but the host should not be nested beneath any other node in the hierarchy (like root default). For example, if newhost is the empty host, you might see something like:

  1. $ bin/ceph osd tree
  2. ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
  3. -5 0 host newhost
  4. 10 ssd 1.00000 osd.10 up 1.00000 1.00000
  5. 11 ssd 1.00000 osd.11 up 1.00000 1.00000
  6. 12 ssd 1.00000 osd.12 up 1.00000 1.00000
  7. -1 3.00000 root default
  8. -2 3.00000 host oldhost1
  9. 0 ssd 1.00000 osd.0 up 1.00000 1.00000
  10. 1 ssd 1.00000 osd.1 up 1.00000 1.00000
  11. 2 ssd 1.00000 osd.2 up 1.00000 1.00000
  12. ...
  • Identify the first target host to convert:
  1. OLDHOST=<existing-cluster-host-to-convert>
  • Swap the new host into the old host’s position in the cluster:
  1. ceph osd crush swap-bucket $NEWHOST $OLDHOST

At this point all data on $OLDHOST will start migrating to OSDs on $NEWHOST. If there is a difference in the total capacity of the old and new hosts you may also see some data migrate to or from other nodes in the cluster, but as long as the hosts are similarly sized this will be a relatively small amount of data. (A short sketch for monitoring this migration appears after the last step below.)

  • Wait for data migration to complete:
  1. while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done
  • Stop all old OSDs on the now-empty $OLDHOST:
  1. ssh $OLDHOST
  2. systemctl kill ceph-osd.target
  3. umount /var/lib/ceph/osd/ceph-*
  • Destroy and purge the old OSDs:
  1. for osd in `ceph osd ls-tree $OLDHOST`; do
  2. ceph osd purge $osd --yes-i-really-mean-it
  3. done
  • Wipe the old OSD devices. This requires you to identify which devices are to be wiped manually (BE CAREFUL!). For each device:
  1. ceph-volume lvm zap $DEVICE
  • Use the now-empty host as the new host, and repeat:
  1. NEWHOST=$OLDHOST
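
As referenced above, a minimal sketch for keeping an eye on the data migration triggered by swap-bucket; both commands are read-only status checks and can be repeated as needed:

  1. # Overall recovery/backfill progress.
  2. ceph -s
  3. # Per-OSD utilization; the OSDs on the old host should drain toward empty.
  4. ceph osd df tree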

Advantages:

  • Data is copied over the network only once.

  • Converts an entire host’s OSDs at once.

  • Can be parallelized to convert multiple hosts at a time.

  • No spare devices are required on each host.

Disadvantages:

  • A spare host is required.

  • An entire host’s worth of OSDs will be migrating data at a time. This is likely to impact overall cluster performance.

  • All migrated data still makes one full hop over the network.

Per-OSD device copy

A single logical OSD can be converted by using the copy function of ceph-objectstore-tool. This requires that the host have a free device (or devices) to provision a new, empty BlueStore OSD. For example, if each host in your cluster has 12 OSDs, then you’d need a 13th available device so that each OSD can be converted in turn before the old device is reclaimed to convert the next OSD.

Caveats:

  • This strategy requires that a blank BlueStore OSD be prepared without allocating a new OSD ID, something that the ceph-volume tool doesn’t support. More importantly, the setup of dmcrypt is closely tied to the OSD identity, which means that this approach does not work with encrypted OSDs.

  • The device must be manually partitioned.

  • Tooling not implemented!

  • Not documented!

Advantages:

  • Little or no data migrates over the network during the conversion.

Disadvantages:

  • Tooling not fully implemented.

  • Process not documented.

  • Each host must have a spare or empty device.

  • The OSD is offline during the conversion, which means new writes will be written to only a subset of the OSDs. This increases the risk of data loss due to a subsequent failure. (However, if there is a failure before conversion is complete, the original FileStore OSD can be started to provide access to its original data.)