Diskprediction Module

The diskprediction module supports two modes: cloud mode and local mode. In cloud mode, disk and Ceph operating status information is collected from the Ceph cluster and sent over the Internet to a cloud-based DiskPrediction server. The DiskPrediction server analyzes the data and provides analytics and prediction results on performance and disk health for Ceph clusters.

Local mode doesn’t require any external server for data analysis or for producing results. In local mode, the diskprediction module uses an internal predictor module for the disk prediction service and returns the prediction result to the Ceph system.

Local predictor: 70% accuracy

Cloud predictor (free): 95% accuracy

Enabling

Run the following commands to enable the diskprediction modules in the Ceph environment:

  ceph mgr module enable diskprediction_cloud
  ceph mgr module enable diskprediction_local

Select the prediction mode:

  ceph config set global device_failure_prediction_mode local

or:

  ceph config set global device_failure_prediction_mode cloud

To disable prediction:

  ceph config set global device_failure_prediction_mode none
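
The mode selection above can also be scripted. The following is a minimal sketch, assuming a working `ceph` CLI with admin credentials on the host; the helper names `prediction_mode_command` and `set_prediction_mode` are hypothetical and not part of Ceph. It only wraps the `ceph config set` command shown above and rejects values other than `none`, `local`, and `cloud`.

```python
import subprocess

# The three values used with device_failure_prediction_mode in this document.
VALID_MODES = ("none", "local", "cloud")


def prediction_mode_command(mode):
    """Return the argument list that selects the given prediction mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown prediction mode: {mode!r}")
    return ["ceph", "config", "set", "global",
            "device_failure_prediction_mode", mode]


def set_prediction_mode(mode):
    """Run the command. Assumes the ceph CLI is installed and authorized."""
    subprocess.run(prediction_mode_command(mode), check=True)


# Building the command does not touch the cluster, so it is safe to inspect.
print(" ".join(prediction_mode_command("local")))
```

Separating command construction from execution keeps the validation testable on a machine without a Ceph cluster.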

Connection settings

The connection settings are used to connect Ceph to the DiskPrediction server.

Local Mode

The diskprediction module uses the Ceph device health check to collect disk health metrics, uses the internal predictor module to produce a disk failure prediction, and returns the result to Ceph. No connection settings are therefore required in local mode. The local predictor module requires at least six datasets of device health metrics to make a prediction.

Run the following command to predict device life expectancy with the local predictor:

  ceph device predict-life-expectancy <device id>

Cloud Mode

User registration is required in cloud mode. Users must sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information for the connection settings.

Certificate file path: After user registration is confirmed, the system will send a confirmation email including a certificate file download link. Download the certificate file and save it to the Ceph system. Run the following command to verify the file. Without certificate file verification, the connection settings cannot be completed.

DiskPrediction server: The DiskPrediction server name. It could be an IP address if required.

Connection account: The account name used to set up the connection between Ceph and the DiskPrediction server.

Connection password: The password used to set up the connection between Ceph and the DiskPrediction server.

Run the following command to complete the connection setup:

  ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate file path>

Use the following command to display the connection settings:

  ceph device show-prediction-config
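
As an illustration of the connection setup above, here is a small sketch that assembles the `set-cloud-prediction-config` command and checks that the certificate file exists before anything is run. The function name `cloud_config_command` and all argument values in the usage line are hypothetical placeholders; the existence check is a convenience added for this sketch and is not the certificate verification step the text refers to.

```python
from pathlib import Path


def cloud_config_command(server, account, password, cert_path):
    """Build the argument list for the cloud connection setup command.

    Raises FileNotFoundError if the certificate file is missing, so a
    wrong path is caught before the ceph CLI is invoked.
    """
    cert = Path(cert_path)
    if not cert.is_file():
        raise FileNotFoundError(f"certificate file not found: {cert}")
    return ["ceph", "device", "set-cloud-prediction-config",
            server, account, password, str(cert)]


# Usage (placeholder values; pass the result to subprocess.run to execute):
# cmd = cloud_config_command("<diskprediction_server>", "<connection_account>",
#                            "<connection_password>", "/path/to/certificate")
```

Note that a password passed as a command-line argument can be visible to other users in the process list while the command runs.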

Additional optional configuration settings are the following:

  • diskprediction_upload_metrics_interval

    How often Ceph performance metrics are sent to the DiskPrediction server. Default is 10 minutes.

  • diskprediction_upload_smart_interval

    How often Ceph physical device information is sent to the DiskPrediction server. Default is 12 hours.

  • diskprediction_retrieve_prediction_interval

    How often Ceph retrieves physical device prediction data from the DiskPrediction server. Default is 12 hours.

Diskprediction Data

The diskprediction module actively sends/retrieves the following data to/from the DiskPrediction server.

Metrics Data

  • Ceph cluster status

key                   Description
cluster_health        Ceph health check status
num_mon               Number of monitor nodes
num_mon_quorum        Number of monitors in quorum
num_osd               Total number of OSDs
num_osd_up            Number of OSDs that are up
num_osd_in            Number of OSDs that are in the cluster
osd_epoch             Current epoch of the OSD map
osd_bytes             Total capacity of the cluster in bytes
osd_bytes_used        Number of used bytes on the cluster
osd_bytes_avail       Number of available bytes on the cluster
num_pool              Number of pools
num_pg                Total number of placement groups
num_pg_active_clean   Number of placement groups in active+clean state
num_pg_active         Number of placement groups in active state
num_pg_peering        Number of placement groups in peering state
num_object            Total number of objects on the cluster
num_object_degraded   Number of degraded (missing replicas) objects
num_object_misplaced  Number of misplaced (wrong location in the cluster) objects
num_object_unfound    Number of unfound objects
num_bytes             Total number of bytes of all objects
num_mds_up            Number of MDSs that are up
num_mds_in            Number of MDSs that are in the cluster
num_mds_failed        Number of failed MDSs
mds_epoch             Current epoch of the MDS map

  • Ceph mon/osd performance counts

Mon:

key            Description
num_sessions   Current number of opened monitor sessions
session_add    Number of created monitor sessions
session_rm     Number of remove_session calls in monitor
session_trim   Number of trimmed monitor sessions
num_elections  Number of elections monitor took part in
election_call  Number of elections started by monitor
election_win   Number of elections won by monitor
election_lose  Number of elections lost by monitor

Osd:

key                    Description
op_wip                 Replication operations currently being processed (primary)
op_in_bytes            Client operations total write size
op_r                   Client read operations
op_out_bytes           Client operations total read size
op_w                   Client write operations
op_latency             Latency of client operations (including queue time)
op_process_latency     Latency of client operations (excluding queue time)
op_r_latency           Latency of read operation (including queue time)
op_r_process_latency   Latency of read operation (excluding queue time)
op_w_in_bytes          Client data written
op_w_latency           Latency of write operation (including queue time)
op_w_process_latency   Latency of write operation (excluding queue time)
op_rw                  Client read-modify-write operations
op_rw_in_bytes         Client read-modify-write operations write in
op_rw_out_bytes        Client read-modify-write operations read out
op_rw_latency          Latency of read-modify-write operation (including queue time)
op_rw_process_latency  Latency of read-modify-write operation (excluding queue time)

  • Ceph pool statistics

key         Description
bytes_used  Per pool bytes used
max_avail   Max available number of bytes in the pool
objects     Number of objects in the pool
wr_bytes    Number of bytes written in the pool
dirty       Number of bytes dirty in the pool
rd_bytes    Number of bytes read in the pool
stored_raw  Bytes used in pool including copies made

  • Ceph physical device metadata

key             Description
disk_domain_id  Physical device identification ID
disk_name       Device attachment name
disk_wwn        Device WWN
model           Device model name
serial_number   Device serial number
size            Device size
vendor          Device vendor name

  • Correlation information for each Ceph object

  • The module agent information

  • The module agent cluster information

  • The module agent host information
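
To show how the cluster status keys listed under Metrics Data relate to one another, the sketch below derives a few summary figures from a dictionary that uses those key names. The function name `cluster_summary` and the sample values are invented for illustration; how the module itself structures or transmits this data is not described here, so treat the dictionary shape as an assumption.

```python
def cluster_summary(status):
    """Derive simple ratios from cluster status keys (assumed dict layout)."""
    total_bytes = status["osd_bytes"]
    total_pgs = status["num_pg"]
    return {
        # Fraction of raw cluster capacity that is in use.
        "capacity_used_ratio":
            status["osd_bytes_used"] / total_bytes if total_bytes else 0.0,
        # Fraction of placement groups in the active+clean state.
        "pg_active_clean_ratio":
            status["num_pg_active_clean"] / total_pgs if total_pgs else 0.0,
        # OSDs that exist but are not up.
        "osds_down": status["num_osd"] - status["num_osd_up"],
    }


# Invented sample values, not output from a real cluster.
sample_status = {
    "osd_bytes": 1000,
    "osd_bytes_used": 250,
    "num_pg": 128,
    "num_pg_active_clean": 96,
    "num_osd": 6,
    "num_osd_up": 5,
}
print(cluster_summary(sample_status))
```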

SMART Data

  • Ceph physical device SMART data (provided by Ceph devicehealth module)

Prediction Data

  • Ceph physical device prediction data

Receiving predicted health status from a Ceph OSD disk drive

You can retrieve the predicted health status of a Ceph OSD disk drive with the following command:

  ceph device get-predicted-status <device id>

The get-predicted-status command returns:

  {
    "near_failure": "Good",
    "disk_wwn": "5000011111111111",
    "serial_number": "111111111",
    "predicted": "2018-05-30 18:33:12",
    "attachment": "sdb"
  }

Attribute      Description
near_failure   The disk failure prediction state: Good/Warning/Bad/Unknown
disk_wwn       Disk WWN number
serial_number  Disk serial number
predicted      Predicted date
attachment     Device name on the local system

The near_failure attribute maps the disk failure prediction state to disk life expectancy, as shown in the following table.

near_failure  Life expectancy (weeks)
Good          > 6 weeks
Warning       2 weeks ~ 6 weeks
Bad           < 2 weeks
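
The mapping in the table above can be applied to the command output programmatically. The following sketch parses JSON in the shape shown for get-predicted-status and attaches the corresponding life-expectancy range. The function name `summarize_prediction` and the `needs_attention` flag are hypothetical additions for this example; states outside the table, such as Unknown, are reported with an "unknown" life expectancy.

```python
import json

# Life-expectancy ranges copied from the near_failure table.
LIFE_EXPECTANCY = {
    "Good": "> 6 weeks",
    "Warning": "2 weeks ~ 6 weeks",
    "Bad": "< 2 weeks",
}


def summarize_prediction(raw_json):
    """Parse get-predicted-status output and add the life-expectancy range."""
    status = json.loads(raw_json)
    state = status.get("near_failure", "Unknown")
    return {
        "device": status.get("attachment"),
        "state": state,
        "life_expectancy": LIFE_EXPECTANCY.get(state, "unknown"),
        # Hypothetical convenience flag: anything below Good needs a look.
        "needs_attention": state in ("Warning", "Bad"),
    }


# The sample output shown earlier in this section.
sample_output = """
{
  "near_failure": "Good",
  "disk_wwn": "5000011111111111",
  "serial_number": "111111111",
  "predicted": "2018-05-30 18:33:12",
  "attachment": "sdb"
}
"""
print(summarize_prediction(sample_output))
```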

Debugging

The diskprediction module logs at the Ceph manager logging level. To debug the module, use the following setting:

  [mgr]
  debug mgr = 20

With logging set to debug for the manager, the module prints log messages with the prefix mgr[diskprediction] for easy filtering.
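
Filtering on that prefix can be done with any text tool. As one possible approach, the sketch below keeps only lines that contain mgr[diskprediction]. The function name `diskprediction_lines` is hypothetical, and the sample log lines are invented, so real manager log lines will look different.

```python
PREFIX = "mgr[diskprediction]"


def diskprediction_lines(lines):
    """Yield only the log lines that carry the diskprediction prefix."""
    for line in lines:
        if PREFIX in line:
            yield line.rstrip("\n")


# Invented sample lines for illustration only.
sample_log = [
    "2018-05-30 18:33:12 mgr[diskprediction] hypothetical message one\n",
    "2018-05-30 18:33:13 mgr[balancer] hypothetical unrelated message\n",
    "2018-05-30 18:33:14 mgr[diskprediction] hypothetical message two\n",
]

for line in diskprediction_lines(sample_log):
    print(line)
```

Because the function accepts any iterable of strings, an open manager log file can be passed to it directly.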