PG (Placement Group) notes

Miscellaneous copy-pastes from emails. When this gets cleaned up, it should move out of /dev.

Overview

PG = “placement group”. When placing data in the cluster, objects are mapped into PGs, and those PGs are mapped onto OSDs. We use this indirection so that we can group objects, which reduces the amount of per-object metadata we need to keep track of and the number of processes we need to run (it would be prohibitively expensive to track, e.g., the placement history on a per-object basis). Increasing the number of PGs can reduce the variance in per-OSD load across your cluster, but each PG requires a bit more CPU and memory on the OSDs that are storing it. We try to ballpark it at 100 PGs/OSD, although it can vary widely without ill effects depending on your cluster. You hit a bug in how we calculate the initial PG number from a cluster description.
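
To make the 100 PGs/OSD ballpark concrete, here is a rough, hypothetical sketch of how one might turn it into a starting pg_num. Only the 100 PGs/OSD figure comes from the text above; dividing by the replica count and rounding to a power of two are common conventions rather than anything this document prescribes, and the function name and sample numbers are made up.

  # Hypothetical helper illustrating the 100 PGs/OSD ballpark; dividing by
  # the replica count and rounding to a power of two are common conventions,
  # not rules from this document.
  def ballpark_pg_num(num_osds, pool_size=3, target_pgs_per_osd=100):
      raw = num_osds * target_pgs_per_osd / pool_size
      power = 1
      while power * 2 <= raw:
          power *= 2
      # round to whichever power of two is closer
      return power if raw - power < power * 2 - raw else power * 2

  print(ballpark_pg_num(10))   # 10 OSDs, 3x replication -> 256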

There are a couple of different categories of PGs; the 6 that exist (in the original emailer’s ceph -s output) are “local” PGs which are tied to a specific OSD. However, those aren’t actually used in a standard Ceph configuration.

Mapping algorithm (simplified)

> How does the Object->PG mapping look like, do you map more than one object on one PG, or do you sometimes map an object to more than one PG? How about the mapping of PGs to OSDs, does one PG belong to exactly one OSD?
>
> Does one PG represent a fixed amount of storage space?

Many objects map to one PG.

Each object maps to exactly one PG.

One PG maps to a single list of OSDs, where the first one in the list is the primary and the rest are replicas.

Many PGs can map to one OSD.

A PG represents nothing but a grouping of objects; you configure the number of PGs you want (the number of OSDs * 100 is a good starting point), and all of your stored objects are pseudo-randomly and evenly distributed to the PGs. So a PG explicitly does NOT represent a fixed amount of storage; it represents 1/pg_num’th of the storage you happen to have on your OSDs.

Ignoring the finer points of CRUSH and custom placement, it goes something like this in pseudocode:

  locator = object_name
  obj_hash = hash(locator)
  pg = obj_hash % num_pg
  osds_for_pg = crush(pg)  # returns a list of OSDs
  primary = osds_for_pg[0]
  replicas = osds_for_pg[1:]

If you want to understand the crush() part in the above, imagine a perfectly spherical datacenter in a vacuum ;) that is, if all OSDs have weight 1.0, and there is no topology to the data center (all OSDs are on the top level), and you use defaults, etc, it simplifies to consistent hashing; you can think of it as:

  def crush(pg):
      all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
      result = []
      # size is the number of copies; primary+replicas
      r = 0
      while len(result) < size:
          # hash the PG together with an attempt number; reusing the same
          # hash(pg) value every iteration would loop forever on a repeat pick
          chosen = all_osds[hash((pg, r)) % len(all_osds)]
          r += 1
          if chosen in result:
              # OSD can be picked only once
              continue
          result.append(chosen)
      return result
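
Putting the two snippets together, here is a tiny self-contained sketch of the simplified mapping. The object names, OSD list, num_pg and size values are made up for illustration, and Python's built-in hash() (which is randomized per process for strings) stands in for Ceph's real hash and CRUSH functions, so the placements printed will not match a real cluster.

  # Toy end-to-end demo of the simplified mapping above; all values are
  # arbitrary examples and Python's hash() stands in for Ceph's hashing,
  # so the output is illustrative only (and varies between runs).
  num_pg = 8
  size = 2                                  # primary + one replica
  all_osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3']

  def crush(pg):
      result = []
      r = 0
      while len(result) < size:
          chosen = all_osds[hash((pg, r)) % len(all_osds)]
          r += 1
          if chosen not in result:          # an OSD can be picked only once
              result.append(chosen)
      return result

  for object_name in ['foo', 'bar', 'baz']:
      pg = hash(object_name) % num_pg
      osds_for_pg = crush(pg)
      print(object_name, '-> pg', pg, '->', osds_for_pg[0], '(primary),', osds_for_pg[1:], '(replicas)')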

User-visible PG States

Todo

diagram of states and how they can overlap

  • creating: the PG is still being created
  • active: requests to the PG will be processed
  • clean: all objects in the PG are replicated the correct number of times
  • down: a replica with necessary data is down, so the PG is offline
  • recovery_unfound: recovery could not finish because object(s) are unfound
  • backfill_unfound: backfill could not finish because object(s) are unfound
  • premerge: the PG is in a quiesced-IO state due to an impending PG merge. That happens when pg_num_pending < pg_num, and applies to the PGs with pg_num_pending <= ps < pg_num as well as the corresponding peer PGs that they are merging with.
  • scrubbing: the PG is being checked for inconsistencies
  • degraded: some objects in the PG are not replicated enough times yet
  • inconsistent: replicas of the PG are not consistent (e.g. objects are the wrong size, objects are missing from one replica after recovery finished, etc.)
  • peering: the PG is undergoing the Peering process
  • repair: the PG is being checked and any inconsistencies found will be repaired (if possible)
  • recovering: objects are being migrated/synchronized with replicas
  • backfill_wait: the PG is waiting in line to start backfill
  • incomplete: a PG is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information.
  • stale: the PG is in an unknown state; the monitors have not received an update for it since the PG mapping changed
  • remapped: the PG is temporarily mapped to a different set of OSDs from what CRUSH specified
  • deep: in conjunction with scrubbing, indicates that the scrub is a deep scrub (see the example after this list)
  • backfilling: a special case of recovery, in which the entire contents of the PG are scanned and synchronized, instead of inferring what needs to be transferred from the PG logs of recent operations
  • backfill_toofull: backfill reservation rejected, OSD too full
  • recovery_wait: the PG is waiting for the local/remote recovery reservations
  • undersized: the PG can’t select enough OSDs given its size
  • activating: the PG is peered but not yet active
  • peered: the PG peered but can’t go active
  • snaptrim: the PG is trimming snaps
  • snaptrim_wait: the PG is queued to trim snaps
  • recovery_toofull: recovery reservation rejected, OSD too full
  • snaptrim_error: the PG could not complete snap trimming due to errors
  • forced_recovery: the PG has been marked for highest priority recovery
  • forced_backfill: the PG has been marked for highest priority backfill
  • failed_repair: an attempt to repair the PG has failed; manual intervention is required
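
These states are not mutually exclusive: what a PG actually reports is a "+"-joined combination of them, such as active+clean or active+clean+scrubbing+deep. A trivial sketch of pulling such a string apart (the state string below is just an example):

  # A PG's reported state is a '+'-joined combination of the flags above;
  # the example string is made up for illustration.
  state = 'active+clean+scrubbing+deep'
  flags = set(state.split('+'))

  if {'scrubbing', 'deep'} <= flags:
      print('deep scrub in progress')
  if 'degraded' in flags:
      print('some objects are not yet replicated enough times')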

Omap statistics

Omap statistics are gathered during deep scrub and displayed in the output of the following commands:

  ceph pg dump
  ceph pg dump all
  ceph pg dump summary
  ceph pg dump pgs
  ceph pg dump pools
  ceph pg ls

As these statistics are not updated continuously, they may be quite inaccurate in an environment where deep scrubs are run infrequently and/or there is a lot of omap activity. As such, they should not be relied on for exact accuracy but rather used as a guide. Running a deep scrub and checking these statistics immediately afterwards should give a good indication of current omap usage.
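
As a rough illustration of “used as a guide”, the sketch below totals the per-PG omap figures from ceph pg dump. The JSON layout and field names (a pg_stats list with num_omap_bytes and num_omap_keys under each PG's stat_sum) are assumptions that may need adjusting for your release, and as noted above the values are only as fresh as each PG's last deep scrub.

  # Hypothetical sketch: total the omap usage reported by ceph pg dump.
  # Field names and JSON layout are assumptions; values are only as current
  # as the most recent deep scrub of each PG.
  import json, subprocess

  out = subprocess.check_output(['ceph', 'pg', 'dump', 'pgs', '--format', 'json'])
  data = json.loads(out)
  pgs = data.get('pg_stats', data) if isinstance(data, dict) else data

  total_bytes = sum(p['stat_sum'].get('num_omap_bytes', 0) for p in pgs)
  total_keys = sum(p['stat_sum'].get('num_omap_keys', 0) for p in pgs)
  print('omap bytes:', total_bytes, '  omap keys:', total_keys)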