Prometheus Module

Provides a Prometheus exporter to pass on Ceph performance counters from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport messages from all MgrClient processes (mons and OSDs, for instance) with performance counter schema data and actual counter data, and keeps a circular buffer of the last N samples. This module creates an HTTP endpoint (like all Prometheus exporters) and retrieves the latest sample of every counter when polled (or “scraped” in Prometheus terminology). The HTTP path and query parameters are ignored; all extant counters for all reporting entities are returned in text exposition format. (See the Prometheus documentation.)

Enabling prometheus output

The prometheus module is enabled with:

  ceph mgr module enable prometheus

Configuration

By default the module will accept HTTP requests on port 9283 on all IPv4 and IPv6 addresses on the host. The port and listen address are both configurable with ceph config-key set, with keys mgr/prometheus/server_addr and mgr/prometheus/server_port. This port is registered with Prometheus’s registry.
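
For example, to bind the exporter to all addresses on the default port, the keys above can be set like this (the values are illustrative and should be adjusted for your environment):

  ceph config-key set mgr/prometheus/server_addr 0.0.0.0
  ceph config-key set mgr/prometheus/server_port 9283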

RBD IO statistics

The module can optionally collect RBD per-image IO statistics by enabling dynamic OSD performance counters. The statistics are gathered for all images in the pools that are specified in the mgr/prometheus/rbd_stats_pools configuration parameter. The parameter is a comma or space separated list of pool[/namespace] entries. If the namespace is not specified the statistics are collected for all namespaces in the pool.
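
For example, to collect statistics for all images in one pool and for a single namespace in another (the pool and namespace names here are illustrative), the parameter can be set with the same config-key mechanism used above; the exact command may vary by release:

  ceph config-key set mgr/prometheus/rbd_stats_pools "pool1,pool2/ns1"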

The module builds the list of all available images by scanning the specified pools and namespaces and refreshes it periodically. The period is configurable via the mgr/prometheus/rbd_stats_pools_refresh_interval parameter (in sec) and is 300 sec (5 minutes) by default. The module will force a refresh earlier if it detects statistics from a previously unknown RBD image.
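
For instance, to lengthen the refresh period to 600 seconds (the value is illustrative):

  ceph config-key set mgr/prometheus/rbd_stats_pools_refresh_interval 600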

Statistic names and labels

The names of the stats are exactly as Ceph names them, with illegal characters ., - and :: translated to _, and ceph_ prefixed to all names.
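
For example, a daemon counter named bluestore.kv_flush_lat would be exported as ceph_bluestore_kv_flush_lat (the counter name here is only illustrative).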

All daemon statistics have a ceph_daemon label such as “osd.123” that identifies the type and ID of the daemon they come from. Some statistics can come from different types of daemon, so when querying e.g. an OSD’s RocksDB stats, you would probably want to filter on ceph_daemon starting with “osd” to avoid mixing in the monitor rocksdb stats.
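
A sketch of such a filter in PromQL, using a regular-expression matcher on the ceph_daemon label and assuming the rocksdb counters are exported with a ceph_rocksdb_ prefix per the naming rules above:

  {__name__=~"ceph_rocksdb_.*", ceph_daemon=~"osd.*"}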

The cluster statistics (i.e. those global to the Ceph cluster) have labels appropriate to what they report on. For example, metrics relating to pools have a pool_id label.

The long running averages that represent the histograms from core Ceph are represented by a pair of <name>_sum and <name>_count metrics. This is similar to how histograms are represented in Prometheus and they can also be treated similarly.
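
For example, the average over a recent window can be recovered in the usual Prometheus fashion (substitute a real exported metric for the <name> placeholder):

  rate(<name>_sum[5m]) / rate(<name>_count[5m])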

Pool and OSD metadata series

Special series are output to enable displaying and querying on certain metadata fields.

Pools have a ceph_pool_metadata field like this:

  ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0

OSDs have a ceph_osd_metadata field like this:

  ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
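
These metadata series can be joined onto other series to pull metadata fields in as labels, for example attaching the pool name to a pool metric (the left-hand metric name is illustrative):

  ceph_pool_objects * on (pool_id) group_left(name) ceph_pool_metadata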

Correlating drive statistics with node_exporter

The prometheus output from Ceph is designed to be used in conjunction with the generic host monitoring from the Prometheus node_exporter.

To enable correlation of Ceph OSD statistics with node_exporter’s drive statistics, special series are output like this:

  ceph_disk_occupation{ceph_daemon="osd.0",device="sdd", exported_instance="myhost"}

To use this to get disk statistics by OSD ID, use either the and operator or the * operator in your prometheus query. All metadata metrics (like ceph_disk_occupation) have the value 1 so they act neutral with *. Using * allows the use of the group_left and group_right grouping modifiers, so that the resulting metric has additional labels from one side of the query.
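
A sketch of the * form, which copies the ceph_daemon label onto the node_exporter series (this assumes the instance labels already match; see the caveat below):

  rate(node_disk_bytes_written[30s]) * on (device, instance) group_left(ceph_daemon) ceph_disk_occupation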

See the prometheus documentation for more information about constructing queries.

The goal is to run a query like

  rate(node_disk_bytes_written[30s]) and on (device,instance) ceph_disk_occupation{ceph_daemon="osd.0"}

Out of the box the above query will not return any metrics since the instance labels of both metrics don’t match. The instance label of ceph_disk_occupation will be the currently active MGR node.

The following two sections outline two approaches to remedy this.

Use label_replace

The label_replace function (see the label_replace documentation) can add a label to, or alter a label of, a metric within a query.

To correlate an OSD and its disks write rate, the following query can be used:

  label_replace(rate(node_disk_bytes_written[30s]), "exported_instance", "$1", "instance", "(.*):.*") and on (device,exported_instance) ceph_disk_occupation{ceph_daemon="osd.0"}
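
Here the regular expression "(.*):.*" copies node_exporter’s instance label, minus the port, into a new exported_instance label on the node_disk_bytes_written series, which can then be matched against the exported_instance label of ceph_disk_occupation.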

Configuring Prometheus server

honor_labels

To enable Ceph to output properly-labeled data relating to any host, use the honor_labels setting when adding the ceph-mgr endpoints to your prometheus configuration.

This allows Ceph to export the proper instance label without prometheus overwriting it. Without this setting, Prometheus applies an instance label that includes the hostname and port of the endpoint that the series came from. Because Ceph clusters have multiple manager daemons, this results in an instance label that changes spuriously when the active manager daemon changes.

If this is undesirable, a custom instance label can be set in the Prometheus target configuration: you might wish to set it to the hostname of your first mgr daemon, or something completely arbitrary like “ceph_cluster”.
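
Following the file_sd_configs layout shown in the example configuration below, this might look like the following (the instance value is arbitrary, as noted):

  [
      {
          "targets": [ "senta04.mydomain.com:9283" ],
          "labels": { "instance": "ceph_cluster" }
      }
  ]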

node_exporter hostname labels

Set your instance labels to match what appears in Ceph’s OSD metadata in the instance field. This is generally the short hostname of the node.

This is only necessary if you want to correlate Ceph stats with host stats, but you may find it useful to do so in all cases, in case you want to do the correlation in the future.

Example configuration

This example shows a single node configuration running ceph-mgr and node_exporter on a server called senta04. Note that this requires adding the appropriate instance label to every node_exporter target individually.

This is just an example: there are other ways to configure prometheus scrape targets and label rewrite rules.

prometheus.yml

  global:
    scrape_interval: 15s
    evaluation_interval: 15s

  scrape_configs:
    - job_name: 'node'
      file_sd_configs:
        - files:
          - node_targets.yml
    - job_name: 'ceph'
      honor_labels: true
      file_sd_configs:
        - files:
          - ceph_targets.yml

ceph_targets.yml

  [
      {
          "targets": [ "senta04.mydomain.com:9283" ],
          "labels": {}
      }
  ]

node_targets.yml

  [
      {
          "targets": [ "senta04.mydomain.com:9100" ],
          "labels": {
              "instance": "senta04"
          }
      }
  ]

Notes

Counters and gauges are exported; currently histograms and long-running averages are not. It’s possible that Ceph’s 2-D histograms could be reduced to two separate 1-D histograms, and that long-running averages could be exported as Prometheus’ Summary type.

Timestamps, as with many Prometheus exporters, are established by the server’s scrape time (Prometheus expects that it is polling the actual counter process synchronously). It is possible to supply a timestamp along with the stat report, but the Prometheus team strongly advises against this. This means that timestamps will be delayed by an unpredictable amount; it’s not clear if this will be problematic, but it’s worth knowing about.