Using the pg-upmap

Starting in Luminous v12.2.z there is a new pg-upmap exception table in the OSDMap that allows the cluster to explicitly map specific PGs to specific OSDs. This allows the cluster to fine-tune the data distribution to, in most cases, perfectly distribute PGs across OSDs.

The key caveat to this new mechanism is that it requires that all clients understand the new pg-upmap structure in the OSDMap.

Enabling

To allow use of the feature, you must tell the cluster that it only needs to support luminous (and newer) clients with:

    ceph osd set-require-min-compat-client luminous

This command will fail if any pre-luminous clients or daemons are connected to the monitors. You can see what client versions are in use with:

    ceph features
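The check-then-enable sequence above can be wrapped in a small guard. This is a hypothetical sketch, not part of Ceph: the `enable_upmap` helper name is invented, and matching on a `"release": "jewel"` line in the `ceph features` output is an assumption (a real check would also have to cover releases older than jewel).

```shell
# Hypothetical helper: only raise the requirement when no connected
# client reports a pre-luminous release ("jewel" used as an example;
# older releases would need the same treatment).
enable_upmap() {
    if ceph features | grep -q '"release": "jewel"'; then
        echo "pre-luminous clients still connected; leaving requirement unchanged" >&2
        return 1
    fi
    ceph osd set-require-min-compat-client luminous
}
```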

A word of caution

This is a new feature and not very user friendly. At the time of this writing we are working on a new balancer module for ceph-mgr that will eventually do all of this automatically.

Until then, you will have to manage upmap entries manually, as described in the next section.

Offline optimization

Upmap entries are updated with an offline optimizer built into osdmaptool.

  • Grab the latest copy of your osdmap:

    ceph osd getmap -o om

  • Run the optimizer:

    osdmaptool om --upmap out.txt [--upmap-pool <pool>] [--upmap-max <max-count>] [--upmap-deviation <max-deviation>]

It is highly recommended that optimization be done for each pool individually, or for sets of similarly-utilized pools. You can specify the --upmap-pool option multiple times. "Similar pools" means pools that are mapped to the same devices and store the same kind of data (e.g., RBD image pools, yes; RGW index pool and RGW data pool, no).

The max-count value is the maximum number of upmap entries to identify in the run. The default is 100, but you may want to make this a smaller number so that the tool completes more quickly (but does less work). If it cannot find any additional changes to make, it will stop early (i.e., when the pool distribution is perfect).

The max-deviation value defaults to .01 (i.e., 1%). If an OSD's utilization varies from the average by less than this amount, it will be considered perfect.
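As a concrete illustration of the deviation test: with an average utilization of 50% and a max-deviation of the default .01, an OSD at 50.7% is within the target. The numbers below are made up, and this mirrors rather than reproduces osdmaptool's internal logic:

```shell
# Made-up utilization figures; osdmaptool computes these from the OSDMap.
avg_util=0.50      # average OSD utilization
osd_util=0.507     # one OSD's utilization
max_dev=0.01       # the default --upmap-deviation (1%)

# An OSD is "perfect" when |osd_util - avg_util| <= max_dev.
verdict=$(awk -v a="$avg_util" -v u="$osd_util" -v d="$max_dev" 'BEGIN {
    diff = u - a; if (diff < 0) diff = -diff;
    print (diff <= d) ? "perfect" : "needs balancing"
}')
echo "$verdict"    # prints "perfect": |0.507 - 0.50| = 0.007 <= 0.01
```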

  • The proposed changes are written to the output file out.txt in the example above. These are normal ceph CLI commands (typically ceph osd pg-upmap-items) that can be run to apply the changes to the cluster. This can be done with:

    source out.txt

The above steps can be repeated as many times as necessary to achieve a perfect distribution of PGs for each set of pools.
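The repeat-until-perfect loop can be sketched as a small script. This is an illustrative sketch only: the `optimize_pool` helper name and the `--upmap-max 50` choice are assumptions, it requires a reachable cluster with admin credentials, and pool names below are examples.

```shell
# Hypothetical helper: re-run the optimizer for one pool until
# osdmaptool stops proposing upmap entries.
optimize_pool() {
    pool="$1"
    while true; do
        ceph osd getmap -o om                  # fresh copy of the osdmap
        osdmaptool om --upmap out.txt \
            --upmap-pool "$pool" --upmap-max 50
        # osdmaptool writes one ceph CLI command per proposed change;
        # an empty file means no further improvement was found.
        [ -s out.txt ] || break
        source out.txt                         # apply the proposals
    done
}
```

Run it once per pool (or once per set of similar pools, adding more --upmap-pool arguments), e.g. `optimize_pool rbd`.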

You can see some (gory) details about what the tool is doing by passing --debug-osd 10 to osdmaptool.