Troubleshooting OSDs

Before troubleshooting your OSDs, check your monitors and network first. If you execute ceph health or ceph -s on the command line and Ceph returns a health status, it means that the monitors have a quorum. If you don’t have a monitor quorum or if there are errors with the monitor status, address the monitor issues first. Check your networks to ensure they are running properly, because networks may have a significant impact on OSD operation and performance.

Obtaining Data About OSDs

A good first step in troubleshooting your OSDs is to obtain information in addition to the information you collected while monitoring your OSDs (e.g., ceph osd tree).

Ceph Logs

If you haven’t changed the default path, you can find Ceph log files at /var/log/ceph:

ls /var/log/ceph

If you don’t get enough log detail, you can change your logging level. See Logging and Debugging for details to ensure that Ceph performs adequately under high logging volume.

Admin Socket

Use the admin socket tool to retrieve runtime information. For details, list the sockets for your Ceph processes:

ls /var/run/ceph

Then, execute the following, replacing {daemon-name} with an actual daemon (e.g., osd.0):

ceph daemon {daemon-name} help

Alternatively, you can specify a {socket-file} (e.g., something in /var/run/ceph):

ceph daemon {socket-file} help

The admin socket, among other things, allows you to:

List your configuration at runtime
Dump historic operations
Dump the operation priority queue state
Dump operations in flight
Dump perfcounters
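
As a rough sketch, the capabilities listed above map onto admin socket commands along the following lines. The helper name osd_admin_cmds is hypothetical, and the set of available commands varies between Ceph releases, so treat the output of the help command for your daemon as the authoritative list. The helper only prints the commands; it does not run them.

```shell
# Print (without running) admin socket commands for one daemon.
# osd_admin_cmds is a hypothetical helper; command names may differ
# between Ceph releases, so verify them with "ceph daemon <name> help".
osd_admin_cmds() {
    daemon="${1:-osd.0}"
    for sub in "config show" "dump_historic_ops" "dump_op_pq_state" \
               "dump_ops_in_flight" "perf dump"; do
        echo "ceph daemon $daemon $sub"
    done
}

osd_admin_cmds osd.0
```

Printing rather than executing keeps the sketch safe to try on a host that has no running OSD.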

Display Freespace

Filesystem issues may arise. To display your file system’s free space, execute df.

df -h

Execute df --help for additional usage.
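
To spot nearly full file systems quickly, the df output can be filtered by usage. The sketch below assumes the POSIX df -P column layout (capacity in column 5, mount point in column 6) and mount points without spaces. The helper name full_mounts and the default threshold of 85 are illustrative choices, not Ceph settings.

```shell
# Read "df -P" output on stdin and print mount points at or above a
# usage threshold (percent). full_mounts is a hypothetical helper name;
# mount points containing spaces are not handled.
full_mounts() {
    threshold="${1:-85}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5; sub(/%/, "", use)      # "91%" -> "91"
        if (use + 0 >= t) print $6, $5
    }'
}

# Usage: df -P | full_mounts 85
```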

I/O Statistics

Use iostat to identify I/O-related issues.

iostat -x

Diagnostic Messages

To retrieve diagnostic messages, use dmesg with less, more, grep or tail. For example:

dmesg | grep scsi

Stopping w/out Rebalancing

Periodically, you may need to perform maintenance on a subset of your cluster, or resolve a problem that affects a failure domain (e.g., a rack). If you do not want CRUSH to automatically rebalance the cluster as you stop OSDs for maintenance, set the cluster to noout first:

ceph osd set noout

Once the cluster is set to noout, you can begin stopping the OSDs within the failure domain that requires maintenance work.

stop ceph-osd id={num}

Note

Placement groups within the OSDs you stop will become degraded while you are addressing issues within the failure domain.

Once you have completed your maintenance, restart the OSDs.

start ceph-osd id={num}

Finally, you must unset the cluster from noout.

ceph osd unset noout
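
The whole sequence can be reviewed before it is run. The following sketch prints the steps for a given set of OSD ids without executing anything. The helper name maintenance_plan is hypothetical, and the stop and start lines reuse the Upstart syntax shown above, which differs on hosts that use another init system.

```shell
# Print the maintenance sequence for a set of OSD ids without running it.
# maintenance_plan is a hypothetical helper; the stop/start lines reuse the
# Upstart syntax shown above and will differ on other init systems.
maintenance_plan() {
    echo "ceph osd set noout"
    for id in "$@"; do echo "stop ceph-osd id=$id"; done
    echo "# ... perform maintenance ..."
    for id in "$@"; do echo "start ceph-osd id=$id"; done
    echo "ceph osd unset noout"
}

maintenance_plan 1 2
```

Keeping the unset step in the same plan makes it harder to forget, since a cluster left in noout will not mark failed OSDs out.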

OSD Not Running

Under normal circumstances, simply restarting the ceph-osd daemon will allow it to rejoin the cluster and recover.

An OSD Won’t Start

If you start your cluster and an OSD won’t start, check the following:

Configuration File: If you were not able to get OSDs running from a new installation, check your configuration file to ensure it uses the expected setting names (e.g., host not hostname).
Check Paths: Check the paths in your configuration, and the actual paths themselves for data and journals. If you separate the OSD data from the journal data and there are errors in your configuration file or in the actual mounts, you may have trouble starting OSDs. If you want to store the journal on a block device, you should partition your journal disk and assign one partition per OSD.
Check Max Threadcount: If you have a node with a lot of OSDs, you may be hitting the default maximum number of threads (e.g., usually 32k), especially during recovery. You can use sysctl to raise the limit to the maximum number of threads allowed (i.e., 4194303) and see whether that helps. For example:

sysctl -w kernel.pid_max=4194303

If increasing the maximum thread count resolves the issue, you can make it permanent by including a kernel.pid_max setting in the /etc/sysctl.conf file. For example:

kernel.pid_max = 4194303

Kernel Version: Identify the kernel version and distribution you are using. Ceph uses some third party tools by default, which may be buggy or may conflict with certain distributions and/or kernel versions (e.g., Google perftools). Check the OS recommendations to ensure you have addressed any issues related to your kernel.
Segmentation Fault: If there is a segmentation fault, turn your logging up (if it is not already), and try again. If it segfaults again, contact the ceph-devel email list and provide your Ceph configuration file, your monitor output and the contents of your log file(s).
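
For the thread-count check in the list above, it can help to see how close a host is to kernel.pid_max before raising it. The sketch below is only an approximation: the helper name thread_usage_pct is hypothetical, and the suggested ps invocation assumes a Linux host with procps and counts one header line as a thread.

```shell
# Report how close a host is to kernel.pid_max.
# Arguments: current thread count, pid_max. thread_usage_pct is a
# hypothetical helper; the result is an integer percent, rounded down.
thread_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Usage on a Linux host (assumes procps "ps" and /proc are available):
#   thread_usage_pct "$(ps -eLf | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```

A value that climbs toward 100 during recovery suggests the thread limit is worth raising.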

An OSD Failed

When a ceph-osd process dies, the monitor will learn about the failure from surviving ceph-osd daemons and report it via the ceph health command:

ceph health
HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are ceph-osd processes that are marked in and down. You can identify which ceph-osds are down with:

ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
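
When many OSDs are down, it may be convenient to extract just their names. This sketch assumes the "osd.N is down since ..." line format shown above, which may differ between Ceph releases. The helper name down_osds is hypothetical.

```shell
# Read "ceph health detail" output on stdin and print the names of OSDs
# reported down. Assumes the line format shown above, which may vary
# between Ceph releases. down_osds is a hypothetical helper name.
down_osds() {
    awk '$2 == "is" && $3 == "down" { print $1 }'
}

# Usage: ceph health detail | down_osds
```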

If there is a disk failure or other fault preventing ceph-osd from functioning or restarting, an error message should be present in its log file in /var/log/ceph.

If the daemon stopped because of a heartbeat failure, the underlying kernel file system may be unresponsive. Check dmesg output for disk or other kernel errors.

If the problem is a software error (failed assertion or other unexpected error), it should be reported to the ceph-devel email list.

No Free Drive Space

Ceph prevents you from writing to a full OSD so that you don’t lose data. In an operational cluster, you should receive a warning when your cluster is getting near its full ratio. The mon osd full ratio defaults to 0.95, or 95% of capacity before it stops clients from writing data. The mon osd backfillfull ratio defaults to 0.90, or 90% of capacity when it blocks backfills from starting. The OSD nearfull ratio defaults to 0.85, or 85% of capacity when it generates a health warning.

You can change the nearfull ratio using:

ceph osd set-nearfull-ratio <float[0.0-1.0]>

Full cluster issues usually arise when testing how Ceph handles an OSD failure on a small cluster. When one node has a high percentage of the cluster’s data, the cluster can easily exceed its nearfull and full ratio immediately. If you are testing how Ceph reacts to OSD failures on a small cluster, you should leave ample free disk space and consider temporarily lowering the OSD full ratio, OSD backfillfull ratio and OSD nearfull ratio using these commands:

ceph osd set-nearfull-ratio <float[0.0-1.0]>
ceph osd set-full-ratio <float[0.0-1.0]>
ceph osd set-backfillfull-ratio <float[0.0-1.0]>

Full ceph-osds will be reported by ceph health:

ceph health
HEALTH_WARN 1 nearfull osd(s)

Or:

ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
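
The sample output above is consistent with the default ratios (nearfull 0.85, backfillfull 0.90, full 0.95). As an illustration, the sketch below classifies a whole-number utilization percentage against those thresholds. The helper name fullness_state is hypothetical, and treating a value equal to a threshold as having reached it is an assumption rather than a statement about how Ceph compares.

```shell
# Classify an OSD utilization percentage against the default ratios
# quoted above (nearfull 85, backfillfull 90, full 95). The thresholds
# are parameters because clusters may override the defaults.
# fullness_state is a hypothetical helper; it accepts integers only, and
# the inclusive (>=) comparison is an assumption.
fullness_state() {
    pct="$1"; near="${2:-85}"; backfill="${3:-90}"; full="${4:-95}"
    if   [ "$pct" -ge "$full" ];     then echo full
    elif [ "$pct" -ge "$backfill" ]; then echo backfillfull
    elif [ "$pct" -ge "$near" ];     then echo nearfull
    else                                  echo ok
    fi
}

for pct in 97 91 87 50; do
    echo "$pct% -> $(fullness_state "$pct")"
done
```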

The best way to deal with a full cluster is to add new ceph-osds, allowing the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may free space by deleting some placement group directories in the full OSD.

Important

If you choose to delete a placement group directory on a full OSD, DO NOT delete the same placement group directory on another full OSD, or YOU MAY LOSE DATA. You MUST maintain at least one copy of your data on at least one OSD.

See Monitor Config Reference for additional details.

OSDs are Slow/Unresponsive

A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you have eliminated other troubleshooting possibilities before delving into OSD performance issues. For example, ensure that your network(s) is working properly and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

Tip

Newer versions of Ceph provide better recovery handling by preventing recovering OSDs from consuming so many system resources that up and in OSDs become unavailable or slow.

Networking Issues

Ceph is a distributed storage system, so it depends upon networks to peer with OSDs, replicate objects, recover from faults and check heartbeats. Networking issues can cause OSD latency and flapping OSDs. See Flapping OSDs for details.

Ensure that Ceph processes and Ceph-dependent processes are connected and/or listening.

netstat -a | grep ceph
netstat -l | grep ceph
sudo netstat -p | grep ceph

Check network statistics.

netstat -s

Drive Configuration

A storage drive should only support one OSD. Sequential read and sequential write throughput can bottleneck if other processes share the drive, including journals, operating systems, monitors, other OSDs and non-Ceph processes.

Ceph acknowledges writes after journaling, so fast SSDs are an attractive option to accelerate the response time, particularly when using the XFS or ext4 file systems. By contrast, the btrfs file system can write and journal simultaneously. (Note, however, that we recommend against using btrfs for production deployments.)

Note

Partitioning a drive does not change its total throughput or sequential read/write limits. Running a journal in a separate partition may help, but you should prefer a separate physical drive.

Bad Sectors / Fragmented Disk

Check your disks for bad sectors and fragmentation. Either can cause total throughput to drop substantially.

Co-resident Monitors/OSDs

Monitors are generally light-weight processes, but they do lots of fsync(), which can interfere with other workloads, particularly if monitors run on the same drive as your OSDs. Additionally, if you run monitors on the same host as the OSDs, you may incur performance issues related to:

Running an older kernel (pre-3.0)
Running a kernel with no syncfs(2) syscall.

In these cases, multiple OSDs running on the same host can drag each other down by doing lots of commits. That often leads to bursty writes.

Co-resident Processes

Spinning up co-resident processes such as a cloud-based solution, virtual machines and other applications that write data to Ceph while operating on the same hardware as OSDs can introduce significant OSD latency. Generally, we recommend optimizing a host for use with Ceph and using other hosts for other processes. The practice of separating Ceph operations from other applications may help improve performance and may streamline troubleshooting and maintenance.

Logging Levels

If you turned logging levels up to track an issue and then forgot to turn logging levels back down, the OSD may be putting a lot of logs onto the disk. If you intend to keep logging levels high, you may consider mounting a drive to the default path for logging (i.e., /var/log/ceph/$cluster-$name.log).

Recovery Throttling

Depending upon your configuration, Ceph may reduce recovery rates to maintain performance or it may increase recovery rates to the point that recovery impacts OSD performance. Check to see if the OSD is recovering.

Kernel Version

Check the kernel version you are running. Older kernels may not receive new backports that Ceph depends upon for better performance.

Kernel Issues with SyncFS

Try running one OSD per host to see if performance improves. Old kernels might not have a recent enough version of glibc to support syncfs(2).

Filesystem Issues

Currently, we recommend deploying clusters with XFS.

We recommend against using btrfs or ext4. The btrfs file system has many attractive features, but bugs in the file system may lead to performance issues and spurious ENOSPC errors. We do not recommend ext4 because xattr size limitations break our support for long object names (needed for RGW).

For more information, see Filesystem Recommendations.

Insufficient RAM

We recommend 1GB of RAM per OSD daemon. You may notice that during normal operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). Unused RAM makes it tempting to use the excess RAM for co-resident applications, VMs and so forth. However, when OSDs go into recovery mode, their memory utilization spikes. If there is no RAM available, the OSD performance will slow considerably.
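
As a quick sanity check of the 1GB-per-OSD guideline, total memory can be compared with the number of OSD daemons on a host. The sketch below reads /proc/meminfo-style input (Linux only). The helper name ram_ok_for_osds is hypothetical, and the check leaves no headroom for the operating system or other processes, so passing it is a minimum rather than a guarantee.

```shell
# Succeed if MemTotal (from /proc/meminfo-style input on stdin) covers
# 1 GB per OSD daemon. ram_ok_for_osds is a hypothetical helper; it does
# not reserve any memory for the OS or co-resident processes.
ram_ok_for_osds() {
    osds="$1"
    awk -v n="$osds" '$1 == "MemTotal:" { kb = $2 }
        END { exit (kb >= n * 1048576 ? 0 : 1) }'   # 1 GB = 1048576 kB
}

# Usage: cat /proc/meminfo | ram_ok_for_osds 12 || echo "less than 1GB per OSD"
```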

Old Requests or Slow Requests

If a ceph-osd daemon is slow to respond to a request, it will generate log messages complaining about requests that are taking too long. The warning threshold defaults to 30 seconds, and is configurable via the osd op complaint time option. When this happens, the cluster log will receive messages.

Legacy versions of Ceph complain about old requests:

osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

New versions of Ceph complain about slow requests:

{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
{date} {osd.num}  [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
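
To get a feel for how long requests are being blocked, the ages can be pulled out of the log. This sketch assumes the "slow request N seconds old" wording shown above. The helper name slow_request_ages is hypothetical, and the log path in the usage comment is an assumption to adjust for your cluster.

```shell
# Read cluster log lines on stdin and print the age (seconds) of each
# reported slow request. Assumes the "slow request N seconds old" wording
# shown above. slow_request_ages is a hypothetical helper name.
slow_request_ages() {
    sed -n 's/.*slow request \([0-9.][0-9.]*\) seconds old.*/\1/p'
}

# Usage (log path is an assumption; adjust to your cluster):
#   grep -h 'slow request' /var/log/ceph/ceph.log | slow_request_ages | sort -n | tail -5
```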

Possible causes include:

A bad drive (check dmesg output)
A bug in the kernel file system (check dmesg output)
An overloaded cluster (check system load, iostat, etc.)
A bug in the ceph-osd daemon.

Possible solutions:

Remove VMs from Ceph hosts
Upgrade kernel
Upgrade Ceph
Restart OSDs

Debugging Slow Requests

If you run ceph daemon osd.<id> dump_historic_ops or ceph daemon osd.<id> dump_ops_in_flight, you will see a set of operations and a list of events each operation went through. These are briefly described below.

Events from the Messenger layer:

header_read: When the messenger first started reading the message off the wire.
throttled: When the messenger tried to acquire memory throttle space to read the message into memory.
all_read: When the messenger finished reading the message off the wire.
dispatched: When the messenger gave the message to the OSD.
initiated: This is identical to header_read. The existence of both is a historical oddity.

Events from the OSD as it prepares operations:

queued_for_pg: The op has been put into the queue for processing by its PG.
reached_pg: The PG has started doing the op.
waiting for *: The op is waiting for some other work to complete before it can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to finish peering; all as specified in the message).
started: The op has been accepted as something the OSD should do and is now being performed.
waiting for subops from: The op has been sent to replica OSDs.

Events from the FileStore:

commit_queued_for_journal_write: The op has been given to the FileStore.
write_thread_in_journal_buffer: The op is in the journal’s buffer and waiting to be persisted (as the next disk write).
journaled_completion_queued: The op was journaled to disk and its callback queued for invocation.

Events from the OSD after data has been given to local disk:

op_commit: The op has been committed (i.e. written to journal) by the primary OSD.
op_applied: The op has been write()’en to the backing FS (i.e. applied in memory but not flushed out to disk) on the primary.
sub_op_applied: op_applied, but for a replica’s “subop”.
sub_op_committed: op_commit, but for a replica’s subop (only for EC pools).
sub_op_commit_rec/sub_op_apply_rec from <X>: The primary marks this when it hears about the above, but for a particular replica (i.e. <X>).
commit_sent: We sent a reply back to the client (or primary OSD, for sub ops).

Many of these events are seemingly redundant, but cross important boundaries in the internal code (such as passing data across locks into new threads).

Flapping OSDs

We recommend using both a public (front-end) network and a cluster (back-end) network so that you can better meet the capacity requirements of object replication. Another advantage is that you can run a cluster network such that it is not connected to the internet, thereby preventing some denial of service attacks. When OSDs peer and check heartbeats, they use the cluster (back-end) network when it’s available. See Monitor/OSD Interaction for details.

However, if the cluster (back-end) network fails or develops significant latency while the public (front-end) network operates optimally, OSDs currently do not handle this situation well. In that case, OSDs mark each other down on the monitor, while marking themselves up. We call this scenario ‘flapping’.

If something is causing OSDs to ‘flap’ (repeatedly getting marked down and then up again), you can force the monitors to stop the flapping with:

ceph osd set noup      # prevent OSDs from getting marked up
ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap structure:

ceph osd dump | grep flags
flags no-up,no-down

You can clear the flags with:

ceph osd unset noup
ceph osd unset nodown
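
Because these flags are easy to leave set by accident, it can be worth confirming that they are gone once the flapping is resolved. The sketch below inspects the flags line from the osdmap dump. It strips hyphens so that both the "no-up" spelling shown above and a "noup" spelling match, which is an assumption about how releases may differ. The helper name has_osd_flag is hypothetical.

```shell
# Read "ceph osd dump" output on stdin and succeed if a flag is set.
# Accepts both spellings ("noup" and the "no-up" form shown above) by
# stripping hyphens; has_osd_flag is a hypothetical helper name.
has_osd_flag() {
    want="$1"
    awk -v want="$want" '$1 == "flags" {
        gsub(/-/, "", $2)
        n = split($2, f, ",")
        for (i = 1; i <= n; i++) if (f[i] == want) found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# Usage: ceph osd dump | has_osd_flag noup && echo "noup is still set"
```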

Two other flags are supported, noin and noout, which respectively prevent booting OSDs from being marked in (allocated data) and protect OSDs from eventually being marked out (regardless of what the current value for mon osd down out interval is).

Note

noup, noout, and nodown are temporary in the sense that once the flags are cleared, the action they were blocking should occur shortly after. The noin flag, on the other hand, prevents OSDs from being marked in on boot, and any daemons that started while the flag was set will remain that way.