Troubleshooting PGs

Placement Groups Never Get Clean

When you create a cluster and it remains in an active, active+remapped, or active+degraded state and never reaches an active+clean state, you likely have a problem with your configuration.

You may need to review settings in the Pool, PG and CRUSH Config Reference and make appropriate adjustments.

As a general rule, you should run your cluster with more than one OSD and a pool size greater than 1 object replica.

One Node Cluster

Ceph no longer provides documentation for operating on a single node, because you would never deploy a system designed for distributed computing on a single node. Additionally, mounting client kernel modules on a single node containing a Ceph daemon may cause a deadlock due to issues with the Linux kernel itself (unless you use VMs for the clients). You can experiment with Ceph in a 1-node configuration, in spite of the limitations described herein.

If you are trying to create a cluster on a single node, you must change the default of the osd crush chooseleaf type setting from 1 (meaning host or node) to 0 (meaning osd) in your Ceph configuration file before you create your monitors and OSDs. This tells Ceph that an OSD can peer with another OSD on the same host. If you are trying to set up a 1-node cluster and osd crush chooseleaf type is greater than 0, Ceph will try to peer the PGs of one OSD with the PGs of another OSD on another node, chassis, rack, row, or even datacenter depending on the setting.

Tip

DO NOT mount kernel clients directly on the same node as your Ceph Storage Cluster, because kernel conflicts can arise. However, you can mount kernel clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must create directories for the data manually first. For example:

    ceph-deploy osd create --data {disk} {host}

Fewer OSDs than Replicas

If you have brought up two OSDs to an up and in state, but you still don’t see active + clean placement groups, you may have an osd pool default size set to greater than 2.

There are a few ways to address this situation. If you want to operate your cluster in an active + degraded state with two replicas, you can set the osd pool default min size to 2 so that you can write objects in an active + degraded state. You may also set the osd pool default size setting to 2 so that you only have two stored replicas (the original and one replica), in which case the cluster should achieve an active + clean state.
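The interaction between the two settings can be pictured with a small Python sketch. It is a deliberately simplified model, not Ceph's actual state machine: it assumes every replica must live on a distinct OSD, and the function name `pg_state` is made up for illustration.

```python
def pg_state(size, min_size, osds_available):
    """Simplified model of the state a PG can reach (illustrative only).

    size           -- replicas the pool wants (osd pool default size)
    min_size       -- replicas required before writes are accepted
    osds_available -- OSDs that are up and in; one replica per OSD is assumed
    """
    replicas = min(size, osds_available)
    if replicas >= size:
        return "active+clean"
    if replicas >= min_size:
        return "active+degraded"
    return "inactive"


# Two OSDs with a pool size of 3: writable but degraded when min size is 2.
print(pg_state(size=3, min_size=2, osds_available=2))
# Two OSDs with a pool size of 2: the cluster can become clean.
print(pg_state(size=2, min_size=2, osds_available=2))
```

In this model, lowering the size to the number of available OSDs is the only change that reaches active + clean; lowering the min size only restores the ability to write.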

Note

You can make the changes at runtime. If you make the changes in your Ceph configuration file, you may need to restart your cluster.

Pool Size = 1

If you have the osd pool default size set to 1, you will only have one copy of the object. OSDs rely on other OSDs to tell them which objects they should have. If a first OSD has a copy of an object and there is no second copy, then no second OSD can tell the first OSD that it should have that copy. For each placement group mapped to the first OSD (see ceph pg dump), you can force the first OSD to notice the placement groups it needs by running:

    ceph osd force-create-pg <pgid>

CRUSH Map Errors

Another possible cause of placement groups remaining unclean is errors in your CRUSH map.

Stuck Placement Groups

It is normal for placement groups to enter states like “degraded” or “peering” following a failure. Normally these states indicate the normal progression through the failure recovery process. However, if a placement group stays in one of these states for a long time, this may be an indication of a larger problem. For this reason, the monitor will warn when placement groups get “stuck” in a non-optimal state. Specifically, we check for:

  • inactive - The placement group has not been active for too long (i.e., it hasn’t been able to service read/write requests).

  • unclean - The placement group has not been clean for too long (i.e., it hasn’t been able to completely recover from a previous failure).

  • stale - The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.

You can explicitly list stuck placement groups with one of:

    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

For stuck stale placement groups, it is normally a matter of getting the right ceph-osd daemons running again. For stuck inactive placement groups, it is usually a peering problem (see Placement Group Down - Peering Failure). For stuck unclean placement groups, there is usually something preventing recovery from completing, like unfound objects (see Unfound Objects).

Placement Group Down - Peering Failure

In certain cases, the ceph-osd Peering process can run into problems, preventing a PG from becoming active and usable. For example, ceph health might report:

    ceph health detail
    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
    ...
    pg 0.5 is down+peering
    pg 1.4 is down+peering
    ...
    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

We can query the cluster to determine exactly why the PG is marked down with:

    ceph pg 0.5 query

    { "state": "down+peering",
      ...
      "recovery_state": [
           { "name": "Started\/Primary\/Peering\/GetInfo",
             "enter_time": "2012-03-06 14:40:16.169679",
             "requested_info_from": []},
           { "name": "Started\/Primary\/Peering",
             "enter_time": "2012-03-06 14:40:16.169659",
             "probing_osds": [
                   0,
                   1],
             "blocked": "peering is blocked due to down osds",
             "down_osds_we_would_probe": [
                   1],
             "peering_blocked_by": [
                   { "osd": 1,
                     "current_lost_at": 0,
                     "comment": "starting or marking this osd lost may let us proceed"}]},
           { "name": "Started",
             "enter_time": "2012-03-06 14:40:16.169513"}
       ]
    }

The recovery_state section tells us that peering is blocked due to down ceph-osd daemons, specifically osd.1. In this case, we can start that ceph-osd and things will recover.
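When many PGs are down, reading each query by hand gets tedious. As a minimal sketch, assuming the query output has been captured as complete JSON with the field names shown in the sample output, the blocking OSDs can be collected in Python. The helper name `blocking_osds` and the trimmed sample are illustrative and not part of Ceph.

```python
import json

# Trimmed sample modeled on the `ceph pg 0.5 query` output in this section.
SAMPLE_QUERY = """
{ "state": "down+peering",
  "recovery_state": [
    { "name": "Started/Primary/Peering/GetInfo",
      "enter_time": "2012-03-06 14:40:16.169679",
      "requested_info_from": []},
    { "name": "Started/Primary/Peering",
      "enter_time": "2012-03-06 14:40:16.169659",
      "probing_osds": [0, 1],
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [1],
      "peering_blocked_by": [
        { "osd": 1,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"}]},
    { "name": "Started",
      "enter_time": "2012-03-06 14:40:16.169513"}
  ]
}
"""


def blocking_osds(query):
    """Return the sorted ids of OSDs listed under peering_blocked_by."""
    osds = set()
    for entry in query.get("recovery_state", []):
        for blocker in entry.get("peering_blocked_by", []):
            osds.add(blocker["osd"])
    return sorted(osds)


print(blocking_osds(json.loads(SAMPLE_QUERY)))
```

For the sample, the only blocker reported is OSD 1, which matches the reading of the output given in the text.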

Alternatively, if there is a catastrophic failure of osd.1 (e.g., disk failure), we can tell the cluster that it is lost and to cope as best it can.

Important

This is dangerous in that the cluster cannot guarantee that the other copies of the data are consistent and up to date.

To instruct Ceph to continue anyway:

    ceph osd lost 1

Recovery will proceed.

Unfound Objects

Under certain combinations of failures, Ceph may complain about unfound objects:

    ceph health detail
    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
    pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of existing objects) exist, but it hasn’t found copies of them. One example of how this might come about for a PG whose data is on ceph-osds 1 and 2:

  • 1 goes down

  • 2 handles some writes, alone

  • 1 comes up

  • 1 and 2 repeer, and the objects missing on 1 are queued for recovery.

  • Before the new objects are copied, 2 goes down.

Now 1 knows that these objects exist, but there is no live ceph-osd that has a copy. In this case, IO to those objects will block, and the cluster will hope that the failed node comes back soon; this is assumed to be preferable to returning an IO error to the user.

First, you can identify which objects are unfound with:

    ceph pg 2.4 list_unfound [starting offset, in json]

    { "offset": { "oid": "",
           "key": "",
           "snapid": 0,
           "hash": 0,
           "max": 0},
      "num_missing": 0,
      "num_unfound": 0,
      "objects": [
         { "oid": "object 1",
           "key": "",
           "hash": 0,
           "max": 0 },
         ...
      ],
      "more": 0}

If there are too many objects to list in a single result, the more field will be true and you can query for more. (Eventually the command line tool will hide this from you, but not yet.)

Second, you can identify which OSDs have been probed or might contain data:

    ceph pg 2.4 query

    "recovery_state": [
         { "name": "Started\/Primary\/Active",
           "enter_time": "2012-03-06 15:15:46.713212",
           "might_have_unfound": [
                 { "osd": 1,
                   "status": "osd is down"}]},

In this case, for example, the cluster knows that osd.1 might have data, but it is down. The full range of possible states includes:

  • already probed

  • querying

  • OSD is down

  • not queried (yet)

Sometimes it simply takes some time for the cluster to query possible locations.
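To see at a glance which candidate locations are still outstanding, the might_have_unfound entries can be filtered in Python. This is only a sketch: it assumes the query output is available as complete JSON, it treats the status string "already probed" from the list above as the only finished state, and the helper name `pending_locations` is hypothetical.

```python
import json

# Trimmed sample modeled on the `ceph pg 2.4 query` fragment in this section,
# wrapped in braces so that it is valid JSON.
SAMPLE_QUERY = """
{ "recovery_state": [
    { "name": "Started/Primary/Active",
      "enter_time": "2012-03-06 15:15:46.713212",
      "might_have_unfound": [
        { "osd": 1,
          "status": "osd is down"}]}]}
"""


def pending_locations(query):
    """Return (osd, status) pairs for locations that are not fully probed."""
    pending = []
    for entry in query.get("recovery_state", []):
        for location in entry.get("might_have_unfound", []):
            if location["status"] != "already probed":
                pending.append((location["osd"], location["status"]))
    return pending


print(pending_locations(json.loads(SAMPLE_QUERY)))
```

An empty result would suggest that every listed location has been probed, which is the point at which giving up on the objects becomes a reasonable question.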

It is possible that there are other locations where the object can exist that are not listed. For example, if a ceph-osd is stopped and taken out of the cluster, the cluster fully recovers, and due to some future set of failures ends up with an unfound object, it won’t consider the long-departed ceph-osd as a potential location. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may have to give up on the lost objects. This, again, is possible given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves are recovered. To mark the “unfound” objects as “lost”:

    ceph pg 2.5 mark_unfound_lost revert|delete

The final argument specifies how the cluster should deal with lost objects.

The “delete” option will forget about them entirely.

The “revert” option (not available for erasure coded pools) will either roll back to a previous version of the object or (if it was a new object) forget about it entirely. Use this with caution, as it may confuse applications that expected the object to exist.

Homeless Placement Groups

It is possible for all OSDs that had copies of a given placement group to fail. If that’s the case, that subset of the object store is unavailable, and the monitor will receive no status updates for those placement groups. To detect this situation, the monitor marks any placement group whose primary OSD has failed as stale. For example:

    ceph health
    HEALTH_WARN 24 pgs stale; 3/300 in osds are down

You can identify which placement groups are stale, and what the last OSDs to store them were, with:

    ceph health detail
    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
    ...
    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
    ...
    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

If we want to get placement group 2.5 back online, for example, this tells us that it was last managed by osd.0 and osd.2. Restarting those ceph-osd daemons will allow the cluster to recover that placement group (and, presumably, many others).
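With many stale placement groups, it can help to extract the "last acting" sets mechanically. The following Python sketch parses text in the shape of the ceph health detail output shown above; the regular expression is an assumption based solely on that sample line, and the helper name `stale_pgs` is made up for illustration.

```python
import re

# Sample lines copied from the `ceph health detail` output in this section.
SAMPLE_HEALTH = """\
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
"""

# Matches e.g. "pg 2.5 is stuck stale+active+remapped, last acting [2,0]".
STALE_RE = re.compile(
    r"^pg (\S+) is stuck (\S*stale\S*), last acting \[([\d,]*)\]"
)


def stale_pgs(text):
    """Map each stuck stale PG id to the OSD ids that last served it."""
    result = {}
    for line in text.splitlines():
        match = STALE_RE.match(line.strip())
        if match:
            acting = [int(osd) for osd in match.group(3).split(",") if osd]
            result[match.group(1)] = acting
    return result


print(stale_pgs(SAMPLE_HEALTH))
```

Collecting the union of the returned OSD ids gives a candidate list of daemons to restart first.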

Only a Few OSDs Receive Data

If you have many nodes in your cluster and only a few of them receive data, check the number of placement groups in your pool. Since placement groups get mapped to OSDs, a small number of placement groups will not distribute across your cluster. Try creating a pool with a placement group count that is a multiple of the number of OSDs. See Placement Groups for details. The default placement group count for pools is not useful, but you can change it.

Can’t Write Data

If your cluster is up, but some OSDs are down and you cannot write data, check to ensure that you have the minimum number of OSDs running for the placement group. If you don’t have the minimum number of OSDs running, Ceph will not allow you to write data because there is no guarantee that Ceph can replicate your data. See osd pool default min size in the Pool, PG and CRUSH Config Reference for details.

PGs Inconsistent

If you receive an active + clean + inconsistent state, this may happen due to an error during scrubbing. As always, we can identify the inconsistent placement group(s) with:

    $ ceph health detail
    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
    2 scrub errors

Or if you prefer inspecting the output in a programmatic way:

    $ rados list-inconsistent-pg rbd
    ["0.6"]

There is only one consistent state, but in the worst case, we could have different inconsistencies in multiple perspectives found in more than one object. If an object named foo in PG 0.6 is truncated, we will have:

    $ rados list-inconsistent-obj 0.6 --format=json-pretty

    {
        "epoch": 14,
        "inconsistents": [
            {
                "object": {
                    "name": "foo",
                    "nspace": "",
                    "locator": "",
                    "snap": "head",
                    "version": 1
                },
                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "union_shard_errors": [
                    "data_digest_mismatch_info",
                    "size_mismatch_info"
                ],
                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 1,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_info",
                            "size_mismatch_info"
                        ],
                        "size": 0,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xffffffff"
                    }
                ]
            }
        ]
    }

In this case, we can learn from the output:

  • The only inconsistent object is named foo, and it is its head that has inconsistencies.

  • The inconsistencies fall into two categories:

    • errors: these errors indicate inconsistencies between shards without a determination of which shard(s) are bad. Check for the errors in the shards array, if available, to pinpoint the problem.

      • data_digest_mismatch: the digest of the replica read from OSD.2 is different from the ones of OSD.0 and OSD.1

      • size_mismatch: the size of the replica read from OSD.2 is 0, while the size reported by OSD.0 and OSD.1 is 968.

    • union_shard_errors: the union of all shard specific errors in shards array. The errors are set for the given shard that has the problem. They include errors like read_error. The errors ending in oi indicate a comparison with selected_object_info. Look at the shards array to determine which shard has which error(s).

      • data_digest_mismatch_info: the digest stored in the object-info is not 0xffffffff, which is calculated from the shard read from OSD.2

      • size_mismatch_info: the size stored in the object-info is different from the one read from OSD.2. The latter is 0.
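The same reading of the report can be automated. As a minimal sketch, assuming the JSON layout shown in the rados list-inconsistent-obj output above, the following Python lists every shard that carries its own errors. The helper name `bad_shards` is illustrative, and the sample is trimmed to the fields the sketch reads.

```python
import json

# Trimmed sample modeled on the list-inconsistent-obj output in this section.
SAMPLE_REPORT = """
{ "epoch": 14,
  "inconsistents": [
    { "object": {"name": "foo", "nspace": "", "locator": "",
                 "snap": "head", "version": 1},
      "errors": ["data_digest_mismatch", "size_mismatch"],
      "union_shard_errors": ["data_digest_mismatch_info",
                             "size_mismatch_info"],
      "shards": [
        {"osd": 0, "errors": [], "size": 968},
        {"osd": 1, "errors": [], "size": 968},
        {"osd": 2,
         "errors": ["data_digest_mismatch_info", "size_mismatch_info"],
         "size": 0}
      ]}]}
"""


def bad_shards(report):
    """Return (object name, osd id, errors) for each shard with errors."""
    found = []
    for item in report.get("inconsistents", []):
        name = item["object"]["name"]
        for shard in item.get("shards", []):
            if shard.get("errors"):
                found.append((name, shard["osd"], shard["errors"]))
    return found


for name, osd, errors in bad_shards(json.loads(SAMPLE_REPORT)):
    print(f"{name}: osd.{osd} reports {', '.join(errors)}")
```

For the sample, only the shard on OSD 2 is reported, which agrees with the manual interpretation above.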

You can repair the inconsistent placement group by executing:

    ceph pg repair {placement-group-ID}

This overwrites the bad copies with the authoritative ones. In most cases, Ceph is able to choose authoritative copies from all available replicas using some predefined criteria. But this does not always work. For example, the stored data digest could be missing, and the calculated digest will be ignored when choosing the authoritative copies. So, please use the above command with caution.

If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You might want to check the disk used by that OSD.

If you receive active + clean + inconsistent states periodically due to clock skew, you may consider configuring your NTP daemons on your monitor hosts to act as peers. See The Network Time Protocol and Ceph Clock Settings for additional details.

Erasure Coded PGs are not active+clean

When CRUSH fails to find enough OSDs to map to a PG, it will show as a 2147483647 which is ITEM_NONE or no OSD found. For instance:

    [2,1,6,0,5,8,2147483647,7,4]
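A quick way to spot such holes in a mapping is to look for the placeholder value. The Python sketch below does only that; the helper name `unmapped_slots` is made up for illustration, and the constant simply restates the value given in the text.

```python
# 2147483647 (0x7fffffff) is the value the text above calls ITEM_NONE.
ITEM_NONE = 2147483647


def unmapped_slots(mapping):
    """Return the positions in a PG mapping for which no OSD was found."""
    return [index for index, osd in enumerate(mapping) if osd == ITEM_NONE]


print(unmapped_slots([2, 1, 6, 0, 5, 8, 2147483647, 7, 4]))
```

For the mapping shown above, the seventh position (index 6) is the one CRUSH could not fill.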

Not enough OSDs

If the Ceph cluster only has 8 OSDs and the erasure coded pool needs 9, that is what it will show. You can either create another erasure coded pool that requires fewer OSDs:

    ceph osd erasure-code-profile set myprofile k=5 m=3
    ceph osd pool create erasurepool erasure myprofile

or add new OSDs and the PG will automatically use them.

CRUSH constraints cannot be satisfied

If the cluster has enough OSDs, it is possible that the CRUSH rule imposes constraints that cannot be satisfied. If there are 10 OSDs on two hosts and the CRUSH rule requires that no two OSDs from the same host are used in the same PG, the mapping may fail because only two OSDs will be found. You can check the constraint by displaying (“dumping”) the rule:

    $ ceph osd crush rule ls
    [
        "replicated_rule",
        "erasurepool"]
    $ ceph osd crush rule dump erasurepool
    { "rule_id": 1,
      "rule_name": "erasurepool",
      "ruleset": 1,
      "type": 3,
      "min_size": 3,
      "max_size": 20,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}

You can resolve the problem by creating a new pool in which PGs are allowed to have OSDs residing on the same host with:

    ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
    ceph osd pool create erasurepool erasure myprofile
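The effect of the failure domain can be checked on paper before touching the cluster. The Python sketch below is a counting argument only and does not run CRUSH: it assumes one chunk per distinct failure domain, reuses k=5 and m=3 from the earlier profile example for the chunk count, and uses hypothetical host names and a hypothetical helper `can_place`.

```python
def can_place(osd_to_host, chunks, failure_domain):
    """Check whether enough distinct failure domains exist for all chunks.

    osd_to_host    -- dict mapping OSD id to the name of its host
    chunks         -- OSDs needed per PG (k + m for an erasure coded pool)
    failure_domain -- "host" or "osd"
    """
    if failure_domain == "osd":
        available = len(osd_to_host)
    elif failure_domain == "host":
        available = len(set(osd_to_host.values()))
    else:
        raise ValueError(f"unsupported failure domain: {failure_domain}")
    return available >= chunks


# 10 OSDs spread over two hosts, as in the example above.
osd_to_host = {osd: ("host-a" if osd < 5 else "host-b") for osd in range(10)}
k, m = 5, 3

print(can_place(osd_to_host, k + m, "host"))  # only two hosts are available
print(can_place(osd_to_host, k + m, "osd"))   # ten OSDs are available
```

With the host failure domain only two candidates exist for eight chunks, while the osd failure domain offers ten, which is why switching the profile resolves the mapping failure.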

CRUSH gives up too soon

If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster with a total of 9 OSDs and an erasure coded pool that requires 9 OSDs per PG), it is possible that CRUSH gives up before finding a mapping. It can be resolved by:

  • lowering the erasure coded pool requirements to use fewer OSDs per PG (that requires the creation of another pool as erasure code profiles cannot be dynamically modified).

  • adding more OSDs to the cluster (that does not require the erasure coded pool to be modified, it will become clean automatically)

  • using a handmade CRUSH rule that tries more times to find a good mapping. This can be done by setting set_choose_tries to a value greater than the default.

You should first verify the problem with crushtool after extracting the crushmap from the cluster so your experiments do not modify the Ceph cluster and only work on local files:

    $ ceph osd crush rule dump erasurepool
    { "rule_name": "erasurepool",
      "ruleset": 1,
      "type": 3,
      "min_size": 3,
      "max_size": 20,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}
    $ ceph osd getcrushmap > crush.map
    got crush map from osdmap epoch 13
    $ crushtool -i crush.map --test --show-bad-mappings \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))
    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Where --num-rep is the number of OSDs the erasure code CRUSH rule needs, and --rule is the value of the ruleset field displayed by ceph osd crush rule dump. The test will try mapping one million values (i.e. the range defined by [--min-x,--max-x]) and must display at least one bad mapping. If it outputs nothing, it means all mappings are successful and you can stop right there: the problem is elsewhere.
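Because the test can print many lines, a short script helps to summarize them. The following Python sketch parses lines in the shape of the bad mapping output shown above; the regular expression is inferred from those three sample lines only, and the helper name `parse_bad_mappings` is hypothetical.

```python
import re

ITEM_NONE = 2147483647  # placeholder printed when no OSD was found

# Sample lines copied from the crushtool output in this section.
SAMPLE_OUTPUT = """\
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
"""

BAD_RE = re.compile(
    r"^bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]+)\]$"
)


def parse_bad_mappings(output):
    """Return one dict per bad mapping: input x and number of empty slots."""
    bad = []
    for line in output.splitlines():
        match = BAD_RE.match(line.strip())
        if match:
            result = [int(value) for value in match.group(4).split(",")]
            bad.append({"x": int(match.group(2)),
                        "missing": result.count(ITEM_NONE)})
    return bad


mappings = parse_bad_mappings(SAMPLE_OUTPUT)
print(len(mappings), "bad mappings")
print([entry["x"] for entry in mappings])
```

An empty list corresponds to the case described above in which every mapping succeeded and the problem lies elsewhere.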

The CRUSH rule can be edited by decompiling the crush map:

    $ crushtool --decompile crush.map > crush.txt

and adding the following line to the rule:

    step set_choose_tries 100

The relevant part of the crush.txt file should look something like:

    rule erasurepool {
            ruleset 1
            type erasure
            min_size 3
            max_size 20
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default
            step chooseleaf indep 0 type host
            step emit
    }

It can then be compiled and tested again:

    $ crushtool --compile crush.txt -o better-crush.map

When all mappings succeed, a histogram of the number of tries that were necessary to find all of them can be displayed with the --show-choose-tries option of crushtool:

    $ crushtool -i better-crush.map --test --show-bad-mappings \
       --show-choose-tries \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))
    ...
    11: 42
    12: 44
    13: 54
    14: 45
    15: 35
    16: 34
    17: 30
    18: 25
    19: 19
    20: 22
    21: 20
    22: 17
    23: 13
    24: 16
    25: 13
    26: 11
    27: 11
    28: 13
    29: 11
    30: 10
    31: 6
    32: 5
    33: 10
    34: 3
    35: 7
    36: 5
    37: 2
    38: 5
    39: 5
    40: 2
    41: 5
    42: 4
    43: 1
    44: 2
    45: 2
    46: 3
    47: 1
    48: 0
    ...
    102: 0
    103: 1
    104: 0
    ...

It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of set_choose_tries that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped).
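Reading the highest non-empty bucket off a long histogram is easy to get wrong by eye. As a minimal sketch, assuming the "tries: count" line format shown above, the following Python picks that value; the helper name `min_choose_tries` is made up, and the sample keeps only a few of the lines from the output.

```python
# A few lines taken from the --show-choose-tries histogram in this section.
SAMPLE_HISTOGRAM = """\
11: 42
12: 44
47: 1
48: 0
...
102: 0
103: 1
104: 0
...
"""


def min_choose_tries(histogram_text):
    """Return the highest number of tries that mapped at least one PG."""
    highest = 0
    for line in histogram_text.splitlines():
        tries, separator, count = line.partition(":")
        if not separator:
            continue  # skip elided lines such as "..."
        try:
            tries, count = int(tries), int(count)
        except ValueError:
            continue  # skip anything that is not a "tries: count" pair
        if count > 0:
            highest = max(highest, tries)
    return highest


print(min_choose_tries(SAMPLE_HISTOGRAM))
```

For the sample this yields 103, the same value the text derives from the full output.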