Troubleshooting Monitors

When a cluster encounters monitor-related trouble, there’s a tendency to panic, and sometimes with good reason. Keep in mind that losing a monitor, or even several of them, doesn’t necessarily mean that your cluster is down, as long as a majority of the monitors is up, running, and has formed a quorum. Regardless of how bad the situation is, the first thing you should do is calm down, take a breath, and try answering our initial troubleshooting script.

Initial Troubleshooting

Are the monitors running?

First of all, we need to make sure the monitors are running. You would be amazed by how often people forget to run the monitors, or to restart them after an upgrade. There’s no shame in that, but let’s try not to lose a couple of hours chasing an issue that is not there.

Are you able to connect to the monitor’s servers?

It doesn’t happen often, but sometimes people do have iptables rules that block access to monitor servers or monitor ports. These are usually leftovers from monitor stress-testing that were forgotten at some point. Try ssh’ing into the server and, if that succeeds, try connecting to the monitor’s port using your tool of choice (telnet, nc, ...).
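
The port check can be scripted as well. The following is a minimal sketch, assuming bash and the coreutils timeout command are available; mon_port_open is a hypothetical helper name and mon-host-a is a placeholder hostname, neither comes from Ceph itself.

```shell
# Return 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and on coreutils' timeout.
# PORT defaults to 6789, the default monitor port.
mon_port_open() {
  local host=$1 port=${2:-6789}
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example (mon-host-a is a placeholder):
# mon_port_open mon-host-a 6789 && echo reachable || echo "blocked or down"
```

A failure here does not say whether the monitor is stopped or a firewall is in the way; it only tells you the port cannot be reached from where you ran the check.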

Does ceph -s run and obtain a reply from the cluster?

If the answer is yes then your cluster is up and running. One thing you can take for granted is that the monitors will only answer a status request if there is a formed quorum.
If ceph -s blocks, however, without obtaining a reply from the cluster, or shows a lot of fault messages, then it is likely that your monitors are either down completely or that only a portion is up, a portion that is not enough to form a quorum (keep in mind that a quorum is formed by a majority of monitors).
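
Because ceph -s can block indefinitely when there is no quorum, it may help to wrap it in a time limit. This is a rough sketch, assuming coreutils timeout is installed; cluster_replies and the CEPH_BIN override are hypothetical names introduced here so the function can be tried without a cluster.

```shell
# Return 0 if "ceph -s" answers within LIMIT seconds, non-zero otherwise.
# CEPH_BIN is a hypothetical override; it defaults to the real ceph tool.
cluster_replies() {
  local limit=${1:-10}
  timeout "$limit" "${CEPH_BIN:-ceph}" -s >/dev/null 2>&1
}

# Example:
# cluster_replies 10 && echo "quorum is answering" || echo "no reply, keep troubleshooting"
```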

What if ceph -s doesn’t finish?

If you haven’t gone through all the steps so far, please go back and do so.
For those running on Emperor 0.72-rc1 and later, you will be able to contact each monitor individually and ask for its status, regardless of whether a quorum has been formed. This can be achieved using ceph ping mon.ID, ID being the monitor’s identifier. You should perform this for each monitor in the cluster. In section Understanding mon_status we will explain how to interpret the output of this command.
For the rest of you who don’t tread on the bleeding edge, you will need to ssh into the server and use the monitor’s admin socket. Please jump to Using the monitor’s admin socket.
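
Pinging every monitor by hand gets tedious on larger clusters. A small loop along the following lines may help; it is only a sketch, where ping_mons and CEPH_BIN are hypothetical names and the monitor IDs are whatever your cluster uses.

```shell
# Ask each listed monitor for its status, one at a time, with a 10 s limit.
# Prints "mon.ID: ok" or "mon.ID: no reply" for every ID given.
# CEPH_BIN is a hypothetical override; it defaults to the real ceph tool.
ping_mons() {
  local id
  for id in "$@"; do
    if timeout 10 "${CEPH_BIN:-ceph}" ping "mon.$id" >/dev/null 2>&1; then
      echo "mon.$id: ok"
    else
      echo "mon.$id: no reply"
    fi
  done
}

# Example, for a cluster with monitors a, b and c:
# ping_mons a b c
```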

For other specific issues, keep on reading.

Using the monitor’s admin socket

The admin socket allows you to interact with a given daemon directly using a Unix socket file. This file can be found in your monitor’s run directory. By default, the admin socket will be kept in /var/run/ceph/ceph-mon.ID.asok but this can vary if you defined it otherwise. If you don’t find it there, please check your ceph.conf for an alternative path or run:

ceph-conf --name mon.ID --show-config-value admin_socket

Please bear in mind that the admin socket will only be available while the monitor is running. When the monitor is properly shut down, the admin socket will be removed. If, however, the monitor is not running and the admin socket still persists, it is likely that the monitor was improperly shut down. Regardless, if the monitor is not running, you will not be able to use the admin socket, with ceph likely returning Error 111: Connection Refused.
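
A quick way to see whether the socket file is there at all is sketched below. mon_asok is a hypothetical helper; it assumes the default naming scheme described above and lets you pass a different run directory if yours has been changed.

```shell
# Print the expected admin socket path for monitor ID and return 0 only if
# a Unix socket file exists at that path.
# RUN_DIR defaults to /var/run/ceph, the default location named above.
mon_asok() {
  local id=$1 run_dir=${2:-/var/run/ceph}
  local path="$run_dir/ceph-mon.$id.asok"
  echo "$path"
  [ -S "$path" ]   # true only for an existing socket file
}

# Example:
# mon_asok a || echo "no admin socket: monitor stopped, or a non-default path is in use"
```

Remember that an existing socket file does not prove the monitor is alive: a stale file can be left behind by an improper shutdown.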

Accessing the admin socket is as simple as telling the ceph tool to use the asok file. In pre-Dumpling Ceph, this can be achieved by:

ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok <command>

while in Dumpling and beyond you can use the alternate (and recommended) format:

ceph daemon mon.<id> <command>

Using help as the command to the ceph tool will show you the supported commands available through the admin socket. Please take a look at config get, config show, mon_status and quorum_status, as those can be enlightening when troubleshooting a monitor.

Understanding mon_status

mon_status can be obtained through the ceph tool when you have a formed quorum, or via the admin socket if you don’t. This command will output a multitude of information about the monitor, including the same output you would get with quorum_status.

Take the following example of mon_status:

{ "name": "c",
  "rank": 2,
  "state": "peon",
  "election_epoch": 38,
  "quorum": [
        1,
        2],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 3,
      "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
      "modified": "2013-10-30 04:12:01.945629",
      "created": "2013-10-29 14:14:41.914786",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "127.0.0.1:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "127.0.0.1:6790\/0"},
            { "rank": 2,
              "name": "c",
              "addr": "127.0.0.1:6795\/0"}]}}

A couple of things are obvious: we have three monitors in the monmap (a, b and c), the quorum is formed by only two monitors, and c is in the quorum as a peon.

Which monitor is out of the quorum?

The answer would be a.

Why?

Take a look at the quorum set. We have two monitors in this set: 1 and 2. These are not monitor names. These are monitor ranks, as established in the current monmap. We are missing the monitor with rank 0, and according to the monmap that would be mon.a.
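
The same comparison can be expressed as a few lines of shell. This is a sketch only: ranks_outside_quorum is a hypothetical helper, and the two rank lists are assumed to have been copied by hand from the mon_status output.

```shell
# Print the ranks that appear in the monmap but not in the quorum.
# Both arguments are space-separated rank lists taken from mon_status.
ranks_outside_quorum() {
  local monmap_ranks=$1 quorum_ranks=$2 r
  for r in $monmap_ranks; do
    case " $quorum_ranks " in
      *" $r "*) ;;          # rank is part of the quorum
      *) echo "$r" ;;       # rank is missing from the quorum
    esac
  done
}

# For the example above: monmap ranks 0 1 2, quorum ranks 1 2.
# ranks_outside_quorum "0 1 2" "1 2"
```

The rank printed for the example is 0, which the monmap maps back to mon.a.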

By the way, how are ranks established?

Ranks are (re)calculated whenever you add or remove monitors and follow a simple rule: the lower the IP:PORT combination, the lower the rank is. In this case, considering that 127.0.0.1:6789 is lower than all the remaining IP:PORT combinations, mon.a has rank 0.
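
To illustrate the ordering seen in the example, where the lowest address (127.0.0.1:6789) holds rank 0, here is a rough sketch that ranks monitors by ascending IPv4 address and port. mon_ranks is a hypothetical helper, it handles IPv4 only, and Ceph’s internal address comparison may differ in its details.

```shell
# Read lines of "NAME IP:PORT" on stdin and print "RANK NAME IP:PORT",
# giving rank 0 to the lowest IPv4 address and port combination.
mon_ranks() {
  awk '{
    split($2, hp, ":")          # hp[1] = IP, hp[2] = port
    split(hp[1], o, ".")        # o[1..4] = the four octets
    # Build a fixed-width key so a plain text sort orders numerically.
    printf "%03d%03d%03d%03d%05d %s %s\n", o[1], o[2], o[3], o[4], hp[2], $1, $2
  }' | LC_ALL=C sort | awk '{ print NR - 1, $2, $3 }'
}

# Example, using the monitors from the monmap above:
# printf "c 127.0.0.1:6795\na 127.0.0.1:6789\nb 127.0.0.1:6790\n" | mon_ranks
```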

Most Common Monitor Issues

Have Quorum but at least one Monitor is down

When this happens, depending on the version of Ceph you are running, you should be seeing something similar to:

$ ceph health detail
[snip]
mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

How to troubleshoot this?

First, make sure mon.a is running.
Second, make sure you are able to connect to mon.a’s server from the other monitors’ servers. Check the ports as well. Check iptables on all your monitor nodes and make sure you are not dropping/rejecting connections.
If this initial troubleshooting doesn’t solve your problems, then it’s time to go deeper.
First, check the problematic monitor’s mon_status via the admin socket as explained in Using the monitor’s admin socket and Understanding mon_status.
Considering the monitor is out of the quorum, its state should be one of probing, electing or synchronizing. If it happens to be either leader or peon, then the monitor believes it is in the quorum, while the rest of the cluster is sure it is not; or maybe it got into the quorum while we were troubleshooting the monitor, so check your ceph -s again just to make sure. Proceed if the monitor is not yet in the quorum.

What if the state is probing?

This means the monitor is still looking for the other monitors. Every time you start a monitor, the monitor will stay in this state for some time while trying to find the rest of the monitors specified in the monmap. The time a monitor will spend in this state can vary. For instance, on a single-monitor cluster, the monitor will pass through the probing state almost instantaneously, since there are no other monitors around. On a multi-monitor cluster, the monitors will stay in this state until they find enough monitors to form a quorum. This means that if you have 2 out of 3 monitors down, the one remaining monitor will stay in this state indefinitely until you bring one of the other monitors up.
If you have a quorum, however, the monitor should be able to find the remaining monitors pretty fast, as long as they can be reached. If your monitor is stuck probing and you have gone through all the communication troubleshooting, then there is a fair chance that the monitor is trying to reach the other monitors on a wrong address. mon_status outputs the monmap known to the monitor: check whether the other monitors’ locations match reality. If they don’t, jump to Recovering a Monitor’s Broken monmap. If they do, then it may be related to severe clock skews amongst the monitor nodes and you should refer to Clock Skews first. If that doesn’t solve your problem, then it is time to prepare some logs and reach out to the community (please refer to Preparing your logs on how to best prepare your logs).

What if state is electing?

This means the monitor is in the middle of an election. These should be fast to complete, but at times the monitors can get stuck electing. This is usually a sign of a clock skew among the monitor nodes; jump to Clock Skews for more information on that. If all your clocks are properly synchronized, it is best if you prepare some logs and reach out to the community. This is not a state that is likely to persist and, aside from (really) old bugs, there is no obvious reason other than clock skews for this to happen.

What if state is synchronizing?

This means the monitor is synchronizing with the rest of the cluster in order to join the quorum. The smaller your monitor store is, the faster the synchronization process will be, so if you have a big store it may take a while. Don’t worry, it should be finished soon enough.
However, if you notice that the monitor jumps from synchronizing to electing and then back to synchronizing, then you do have a problem: the cluster state is advancing (i.e., generating new maps) way too fast for the synchronization process to keep up. This used to be a thing in early Cuttlefish, but since then the synchronization process has been substantially refactored and enhanced to avoid just this sort of behavior. If this happens in later versions let us know. And bring some logs (see Preparing your logs).

What if state is leader or peon?

This should not happen. There is a chance this might happen however, and it has a lot to do with clock skews (see Clock Skews). If you are not suffering from clock skews, then please prepare your logs (see Preparing your logs) and reach out to us.
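
The state-by-state guidance above can be condensed into a small lookup. This is merely a summary sketch of the text in this section; mon_state_next_step is a hypothetical helper and the state is assumed to have been read from the "state" field of mon_status.

```shell
# Print the suggested next step for a monitor that is out of the quorum,
# given the "state" value reported by its mon_status.
mon_state_next_step() {
  case "$1" in
    probing)       echo "check connectivity and the monmap addresses, then clock skews" ;;
    electing)      echo "check clock skews, then collect logs" ;;
    synchronizing) echo "wait; if it flips between synchronizing and electing, collect logs" ;;
    leader|peon)   echo "re-run ceph -s; if still out of quorum, check clock skews and collect logs" ;;
    *)             echo "unknown state: $1" >&2; return 1 ;;
  esac
}

# Example:
# mon_state_next_step probing
```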

Recovering a Monitor’s Broken monmap

This is what a monmap usually looks like, depending on the number of monitors:

epoch 3
fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
last_changed 2013-10-30 04:12:01.945629
created 2013-10-29 14:14:41.914786
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6795/0 mon.c

This may not be what you have however. For instance, in some versions of early Cuttlefish there was a bug that could cause your monmap to be nullified, that is, completely filled with zeros. This means that not even monmaptool would be able to read it, because it would find it hard to make sense of only-zeros. Some other times, you may end up with a monitor with a severely outdated monmap, which is thus unable to find the remaining monitors (e.g., say mon.c is down; you add a new monitor mon.d, then remove mon.a, then add a new monitor mon.e and remove mon.b; you will end up with a totally different monmap from the one mon.c knows).

In this sort of situation, you have two possible solutions:

Scrap the monitor and create a new one

You should only take this route if you are positive that you won’t lose the information kept by that monitor; that you have other monitors and that they are running just fine so that your new monitor is able to synchronize from the remaining monitors. Keep in mind that destroying a monitor, if there are no other copies of its contents, may lead to loss of data.

Inject a monmap into the monitor

Usually the safest path. You should grab the monmap from the remaining monitors and inject it into the monitor with the corrupted/lost monmap.
These are the basic steps:
Is there a formed quorum? If so, grab the monmap from the quorum:
$ ceph mon getmap -o /tmp/monmap
No quorum? Grab the monmap directly from another monitor (this assumes the monitor you are grabbing the monmap from has id ID-FOO and has been stopped):
$ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap
Stop the monitor you are going to inject the monmap into.
Inject the monmap:
$ ceph-mon -i ID --inject-monmap /tmp/monmap
Start the monitor.
Please keep in mind that the ability to inject monmaps is a powerful feature that can cause havoc with your monitors if misused, as it will overwrite the latest, existing monmap kept by the monitor.
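
Since a wrong injection can do damage, it may be worth rehearsing the stop, inject and start sequence before running it for real. The sketch below prints the commands instead of executing them when DRY_RUN is set. The names run, inject_monmap and DRY_RUN are hypothetical, and the systemd unit name ceph-mon@ID is an assumption about how your monitors are managed; substitute whatever your init system uses.

```shell
# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Stop monitor ID, inject the monmap stored at MAP, then start it again.
# Assumption: monitors are managed by systemd units named ceph-mon@ID.
# The chain stops at the first command that fails.
inject_monmap() {
  local id=$1 map=$2
  run systemctl stop "ceph-mon@$id" &&
  run ceph-mon -i "$id" --inject-monmap "$map" &&
  run systemctl start "ceph-mon@$id"
}

# Rehearse first, then run for real once the printed commands look right:
# DRY_RUN=1 inject_monmap c /tmp/monmap
# inject_monmap c /tmp/monmap
```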

Clock Skews

Monitors can be severely affected by significant clock skews across the monitor nodes. This usually translates into weird behavior with no obvious cause. To avoid such issues, you should run a clock synchronization tool on your monitor nodes.

What’s the maximum tolerated clock skew?

By default the monitors will allow clocks to drift up to 0.05 seconds.

Can I increase the maximum tolerated clock skew?

This value is configurable via the mon-clock-drift-allowed option, and although you CAN change it, that doesn’t mean you SHOULD. The clock skew mechanism is in place because clock-skewed monitors may not behave properly. We, as developers and QA aficionados, are comfortable with the current default value, as it will alert the user before the monitors get out of hand. Changing this value without testing it first may cause unforeseen effects on the stability of the monitors and overall cluster health, although there is no risk of data loss.

How do I know there’s a clock skew?

The monitors will warn you in the form of a HEALTH_WARN. ceph health detail should show something in the form of:
mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
That means that mon.c has been flagged as suffering from a clock skew.
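
If several monitors are flagged, a filter can pull out just the names and skew values. The sketch below assumes the line format shown above; other Ceph versions may word the message differently, in which case the pattern would need adjusting. skewed_mons is a hypothetical helper.

```shell
# Read "ceph health detail" output on stdin and print "MON SKEW MAX"
# for each line that reports a clock skew.
skewed_mons() {
  awk '/clock skew/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "skew") skew = $(i + 1)   # value right after "skew"
      if ($i == "max")  max  = $(i + 1)   # value right after "max"
    }
    print $1, skew, max
  }'
}

# Example:
# ceph health detail | skewed_mons
```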

What should I do if there’s a clock skew?

Synchronize your clocks. Running an NTP client may help. If you are already using one and you hit this sort of issue, check whether you are using an NTP server remote to your network and consider hosting your own NTP server on your network. This last option tends to reduce the number of issues with monitor clock skews.

Client Can’t Connect or Mount

Check your IP tables. Some OS install utilities add a REJECT rule to iptables. The rule rejects all clients trying to connect to the host except for ssh. If your monitor host’s IP tables have such a REJECT rule in place, clients connecting from a separate node will fail to mount with a timeout error. You need to address iptables rules that reject clients trying to connect to Ceph daemons. For example, you would need to address rules that look like this appropriately:

REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to IP tables on your Ceph hosts to ensure that clients can access the ports associated with your Ceph monitors (i.e., port 6789 by default) and Ceph OSDs (i.e., 6800 through 7300 by default). For example:

iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT
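
To spot a catch-all REJECT rule like the one shown above, you can filter the rule listing. This is a sketch under the assumption that the listing comes from iptables -L without -n, so that addresses are printed as "anywhere"; with -n they appear as 0.0.0.0/0 and the pattern would need changing. has_catchall_reject is a hypothetical helper.

```shell
# Read "iptables -L" output on stdin; succeed if it contains a REJECT rule
# for all protocols with source and destination "anywhere".
has_catchall_reject() {
  grep -Eq '^REJECT[[:space:]]+all[[:space:]]+--[[:space:]]+anywhere[[:space:]]+anywhere'
}

# Example (needs root):
# iptables -L INPUT | has_catchall_reject && echo "catch-all REJECT rule present"
```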

Monitor Store Failures

Symptoms of store corruption

The Ceph monitor stores the cluster map in a key/value store such as LevelDB. If a monitor fails due to key/value store corruption, the following error messages might be found in the monitor log:

Corruption: error in middle of record

or:

Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb
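
A quick scan of a monitor log for these messages might look like the sketch below. store_corruption_logged is a hypothetical helper and the log path in the example is a placeholder; see Preparing your logs for where your logs actually live.

```shell
# Succeed if LOGFILE contains a key/value store corruption message
# of the form shown above ("Corruption: ...").
store_corruption_logged() {
  grep -q 'Corruption: ' "$1"
}

# Example (mon.foo is a placeholder monitor name):
# store_corruption_logged /var/log/ceph/ceph-mon.foo.log && echo "store looks corrupted"
```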

Recovery using healthy monitor(s)

If there are any survivors, we can always replace the corrupted one with a new one. After booting up, the new joiner will sync up with a healthy peer, and once it is fully sync’ed, it will be able to serve the clients.

Recovery using OSDs

But what if all monitors fail at the same time? Since users are encouraged to deploy at least three (and preferably five) monitors in a Ceph cluster, simultaneous failure is rare. But unplanned power-downs in a data center with improperly configured disk/fs settings could cause the underlying file system to fail, and hence kill all the monitors. In this case, we can recover the monitor store with the information stored in OSDs:

ms=/root/mon-store
mkdir $ms
 
# collect the cluster map from stopped OSDs
for host in $hosts; do
  rsync -avz $ms/. user@$host:$ms.remote
  rm -rf $ms
  ssh user@$host <<EOF
    for osd in /var/lib/ceph/osd/ceph-*; do
      ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
    done
EOF
  rsync -avz user@$host:$ms.remote/. $ms
done
 
# rebuild the monitor store from the collected map, if the cluster does not
# use cephx authentication, we can skip the following steps to update the
# keyring with the caps, and there is no need to pass the "--keyring" option.
# i.e. just use "ceph-monstore-tool $ms rebuild" instead
ceph-authtool /path/to/admin.keyring -n mon. \
  --cap mon 'allow *'
ceph-authtool /path/to/admin.keyring -n client.admin \
  --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring
 
# make a backup of the corrupted store.db just in case!  repeat for
# all monitors.
mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted
 
# move the rebuilt store.db into place.  repeat for all monitors.
mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

The steps above:

collect the map from all OSD hosts,
fill the entities in the keyring file with appropriate caps,
rebuild the store, and
replace the corrupted store on mon.foo with the recovered copy.

Known limitations

The following information is not recoverable using the steps above:

some added keyrings: all the OSD keyrings added using the ceph auth add command are recovered from the OSD’s copy, and the client.admin keyring is imported using ceph-monstore-tool. But the MDS keyrings and other keyrings are missing from the recovered monitor store. You might need to re-add them manually.
creating pools: If any RADOS pools were in the process of being created, that state is lost. The recovery tool assumes that all pools have been created. If there are PGs that are stuck in the ‘unknown’ state after the recovery for a partially created pool, you can force creation of the empty PG with the ceph osd force-create-pg command. Note that this will create an empty PG, so only do this if you know the pool is empty.
MDS Maps: the MDS maps are lost.

Everything Failed! Now What?

Reaching out for help

You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net) and on ceph-devel@vger.kernel.org and ceph-users@lists.ceph.com. Make sure you have grabbed your logs and have them ready if someone asks: the faster the interaction and the lower the latency in response, the better the chances that everyone’s time is well spent.

Preparing your logs

Monitor logs are, by default, kept in /var/log/ceph/ceph-mon.FOO.log*. We may want them. However, your logs may not have the necessary information. If you don’t find your monitor logs at their default location, you can check where they should be by running:

ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is subject to the debug levels being enforced by your configuration files. If you have not enforced a specific debug level then Ceph is using the default levels and your logs may not contain important information to track down your issue. A first step in getting relevant information into your logs will be to raise debug levels. In this case we will be interested in the information from the monitor. Similarly to what happens on other components, different parts of the monitor will output their debug information on different subsystems.

You will have to raise the debug levels of those subsystems more closely related to your issue. This may not be an easy task for someone unfamiliar with troubleshooting Ceph. For most situations, setting the following options on your monitors will be enough to pinpoint a potential source of the issue:

debug mon = 10
debug ms = 1

If we find that these debug levels are not enough, there’s a chance we may ask you to raise them or even define other debug subsystems to obtain information from. But at least we will have started off with some useful information, instead of a massively empty log without much to go on.

Do I need to restart a monitor to adjust debug levels?

No. You may do it in one of two ways:

You have quorum

Either inject the debug option into the monitor you want to debug:
ceph tell mon.FOO config set debug_mon 10/10
or into all monitors at once:
ceph tell mon.* config set debug_mon 10/10

No quorum

Use the monitor’s admin socket and directly adjust the configuration options:
ceph daemon mon.FOO config set debug_mon 10/10

Going back to default values is as easy as rerunning the above commands using the debug level 1/10 instead. You can check your current values using the admin socket and the following commands:

ceph daemon mon.FOO config show

or:

ceph daemon mon.FOO config get 'OPTION_NAME'

Reproduced the problem with appropriate debug levels. Now what?

Ideally you would send us only the relevant portions of your logs. We realise that figuring out the corresponding portion may not be the easiest of tasks. Therefore, we won’t hold it against you if you provide the full log, but common sense should be employed. If your log has hundreds of thousands of lines, it may get tricky to go through the whole thing, especially if we are not aware of the point at which your issue, whatever it is, happened. For instance, when reproducing, remember to write down the current time and date and to extract the relevant portions of your logs based on that.
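
Once you have noted the start and end times of your reproduction, cutting the matching slice out of a log can be done with a short filter. The sketch below assumes every log line begins with a timestamp of the form "YYYY-MM-DD HH:MM:SS", as in the monmap dates shown earlier; logs that use a different timestamp layout would need a different key. log_window is a hypothetical helper.

```shell
# Print the lines on stdin whose leading timestamp lies between START and
# END (inclusive). Timestamps are compared as plain strings, which works
# because "YYYY-MM-DD HH:MM:SS" sorts in chronological order.
log_window() {
  awk -v s="$1" -v e="$2" '{
    ts = substr($0, 1, 19)            # "YYYY-MM-DD HH:MM:SS"
    if (ts >= s && ts <= e) print
  }'
}

# Example (FOO is the monitor ID, as elsewhere in this section):
# log_window "2013-10-30 04:12:00" "2013-10-30 04:13:00" \
#   < /var/log/ceph/ceph-mon.FOO.log > /tmp/mon-FOO-excerpt.log
```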

Finally, you should reach out to us on the mailing lists, on IRC or file a new issue on the tracker.