Showing data collected by remote health monitoring

As an administrator, you can review the metrics collected by Telemetry and the Insights Operator.

Showing data collected by Telemetry

You can view the cluster and component time series data captured by Telemetry.

Prerequisites

  • You have installed the OKD CLI (oc).

  • You have access to the cluster as a user with the cluster-admin role or the cluster-monitoring-view role.

Procedure

  1. Log in to a cluster.

  2. Run the following command, which queries a cluster’s Prometheus service and returns the full set of time series data captured by Telemetry:

    $ curl -G -k -H "Authorization: Bearer $(oc whoami -t)" \
    https://$(oc get route prometheus-k8s-federate -n \
    openshift-monitoring -o jsonpath="{.spec.host}")/federate \
    --data-urlencode 'match[]={__name__=~"cluster:usage:.*"}' \
    --data-urlencode 'match[]={__name__="count:up0"}' \
    --data-urlencode 'match[]={__name__="count:up1"}' \
    --data-urlencode 'match[]={__name__="cluster_version"}' \
    --data-urlencode 'match[]={__name__="cluster_version_available_updates"}' \
    --data-urlencode 'match[]={__name__="cluster_version_capability"}' \
    --data-urlencode 'match[]={__name__="cluster_operator_up"}' \
    --data-urlencode 'match[]={__name__="cluster_operator_conditions"}' \
    --data-urlencode 'match[]={__name__="cluster_version_payload"}' \
    --data-urlencode 'match[]={__name__="cluster_installer"}' \
    --data-urlencode 'match[]={__name__="cluster_infrastructure_provider"}' \
    --data-urlencode 'match[]={__name__="cluster_feature_set"}' \
    --data-urlencode 'match[]={__name__="instance:etcd_object_counts:sum"}' \
    --data-urlencode 'match[]={__name__="ALERTS",alertstate="firing"}' \
    --data-urlencode 'match[]={__name__="code:apiserver_request_total:rate:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:capacity_cpu_cores:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:capacity_memory_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:cpu_usage_cores:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:memory_usage_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="openshift:cpu_usage_cores:sum"}' \
    --data-urlencode 'match[]={__name__="openshift:memory_usage_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="workload:cpu_usage_cores:sum"}' \
    --data-urlencode 'match[]={__name__="workload:memory_usage_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:virt_platform_nodes:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:node_instance_type_count:sum"}' \
    --data-urlencode 'match[]={__name__="cnv:vmi_status_running:count"}' \
    --data-urlencode 'match[]={__name__="cluster:vmi_request_cpu_cores:sum"}' \
    --data-urlencode 'match[]={__name__="node_role_os_version_machine:cpu_capacity_cores:sum"}' \
    --data-urlencode 'match[]={__name__="node_role_os_version_machine:cpu_capacity_sockets:sum"}' \
    --data-urlencode 'match[]={__name__="subscription_sync_total"}' \
    --data-urlencode 'match[]={__name__="olm_resolution_duration_seconds"}' \
    --data-urlencode 'match[]={__name__="csv_succeeded"}' \
    --data-urlencode 'match[]={__name__="csv_abnormal"}' \
    --data-urlencode 'match[]={__name__="cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:kubelet_volume_stats_used_bytes:provisioner:sum"}' \
    --data-urlencode 'match[]={__name__="ceph_cluster_total_bytes"}' \
    --data-urlencode 'match[]={__name__="ceph_cluster_total_used_raw_bytes"}' \
    --data-urlencode 'match[]={__name__="ceph_health_status"}' \
    --data-urlencode 'match[]={__name__="odf_system_raw_capacity_total_bytes"}' \
    --data-urlencode 'match[]={__name__="odf_system_raw_capacity_used_bytes"}' \
    --data-urlencode 'match[]={__name__="odf_system_health_status"}' \
    --data-urlencode 'match[]={__name__="job:ceph_osd_metadata:count"}' \
    --data-urlencode 'match[]={__name__="job:kube_pv:count"}' \
    --data-urlencode 'match[]={__name__="job:odf_system_pvs:count"}' \
    --data-urlencode 'match[]={__name__="job:ceph_pools_iops:total"}' \
    --data-urlencode 'match[]={__name__="job:ceph_pools_iops_bytes:total"}' \
    --data-urlencode 'match[]={__name__="job:ceph_versions_running:count"}' \
    --data-urlencode 'match[]={__name__="job:noobaa_total_unhealthy_buckets:sum"}' \
    --data-urlencode 'match[]={__name__="job:noobaa_bucket_count:sum"}' \
    --data-urlencode 'match[]={__name__="job:noobaa_total_object_count:sum"}' \
    --data-urlencode 'match[]={__name__="odf_system_bucket_count", system_type="OCS", system_vendor="Red Hat"}' \
    --data-urlencode 'match[]={__name__="odf_system_objects_total", system_type="OCS", system_vendor="Red Hat"}' \
    --data-urlencode 'match[]={__name__="noobaa_accounts_num"}' \
    --data-urlencode 'match[]={__name__="noobaa_total_usage"}' \
    --data-urlencode 'match[]={__name__="console_url"}' \
    --data-urlencode 'match[]={__name__="cluster:ovnkube_master_egress_routing_via_host:max"}' \
    --data-urlencode 'match[]={__name__="cluster:network_attachment_definition_instances:max"}' \
    --data-urlencode 'match[]={__name__="cluster:network_attachment_definition_enabled_instance_up:max"}' \
    --data-urlencode 'match[]={__name__="cluster:ingress_controller_aws_nlb_active:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:route_metrics_controller_routes_per_shard:min"}' \
    --data-urlencode 'match[]={__name__="cluster:route_metrics_controller_routes_per_shard:max"}' \
    --data-urlencode 'match[]={__name__="cluster:route_metrics_controller_routes_per_shard:avg"}' \
    --data-urlencode 'match[]={__name__="cluster:route_metrics_controller_routes_per_shard:median"}' \
    --data-urlencode 'match[]={__name__="cluster:openshift_route_info:tls_termination:sum"}' \
    --data-urlencode 'match[]={__name__="insightsclient_request_send_total"}' \
    --data-urlencode 'match[]={__name__="cam_app_workload_migrations"}' \
    --data-urlencode 'match[]={__name__="cluster:apiserver_current_inflight_requests:sum:max_over_time:2m"}' \
    --data-urlencode 'match[]={__name__="cluster:alertmanager_integrations:max"}' \
    --data-urlencode 'match[]={__name__="cluster:telemetry_selected_series:count"}' \
    --data-urlencode 'match[]={__name__="openshift:prometheus_tsdb_head_series:sum"}' \
    --data-urlencode 'match[]={__name__="openshift:prometheus_tsdb_head_samples_appended_total:sum"}' \
    --data-urlencode 'match[]={__name__="monitoring:container_memory_working_set_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="namespace_job:scrape_series_added:topk3_sum1h"}' \
    --data-urlencode 'match[]={__name__="namespace_job:scrape_samples_post_metric_relabeling:topk3"}' \
    --data-urlencode 'match[]={__name__="monitoring:haproxy_server_http_responses_total:sum"}' \
    --data-urlencode 'match[]={__name__="rhmi_status"}' \
    --data-urlencode 'match[]={__name__="status:upgrading:version:rhoam_state:max"}' \
    --data-urlencode 'match[]={__name__="state:rhoam_critical_alerts:max"}' \
    --data-urlencode 'match[]={__name__="state:rhoam_warning_alerts:max"}' \
    --data-urlencode 'match[]={__name__="rhoam_7d_slo_percentile:max"}' \
    --data-urlencode 'match[]={__name__="rhoam_7d_slo_remaining_error_budget:max"}' \
    --data-urlencode 'match[]={__name__="cluster_legacy_scheduler_policy"}' \
    --data-urlencode 'match[]={__name__="cluster_master_schedulable"}' \
    --data-urlencode 'match[]={__name__="che_workspace_status"}' \
    --data-urlencode 'match[]={__name__="che_workspace_started_total"}' \
    --data-urlencode 'match[]={__name__="che_workspace_failure_total"}' \
    --data-urlencode 'match[]={__name__="che_workspace_start_time_seconds_sum"}' \
    --data-urlencode 'match[]={__name__="che_workspace_start_time_seconds_count"}' \
    --data-urlencode 'match[]={__name__="cco_credentials_mode"}' \
    --data-urlencode 'match[]={__name__="cluster:kube_persistentvolume_plugin_type_counts:sum"}' \
    --data-urlencode 'match[]={__name__="visual_web_terminal_sessions_total"}' \
    --data-urlencode 'match[]={__name__="acm_managed_cluster_info"}' \
    --data-urlencode 'match[]={__name__="cluster:vsphere_vcenter_info:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:vsphere_esxi_version_total:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:vsphere_node_hw_version_total:sum"}' \
    --data-urlencode 'match[]={__name__="openshift:build_by_strategy:sum"}' \
    --data-urlencode 'match[]={__name__="rhods_aggregate_availability"}' \
    --data-urlencode 'match[]={__name__="rhods_total_users"}' \
    --data-urlencode 'match[]={__name__="instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile",quantile="0.99"}' \
    --data-urlencode 'match[]={__name__="instance:etcd_mvcc_db_total_size_in_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile",quantile="0.99"}' \
    --data-urlencode 'match[]={__name__="instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum"}' \
    --data-urlencode 'match[]={__name__="instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile",quantile="0.99"}' \
    --data-urlencode 'match[]={__name__="jaeger_operator_instances_storage_types"}' \
    --data-urlencode 'match[]={__name__="jaeger_operator_instances_strategies"}' \
    --data-urlencode 'match[]={__name__="jaeger_operator_instances_agent_strategies"}' \
    --data-urlencode 'match[]={__name__="appsvcs:cores_by_product:sum"}' \
    --data-urlencode 'match[]={__name__="nto_custom_profiles:count"}' \
    --data-urlencode 'match[]={__name__="openshift_csi_share_configmap"}' \
    --data-urlencode 'match[]={__name__="openshift_csi_share_secret"}' \
    --data-urlencode 'match[]={__name__="openshift_csi_share_mount_failures_total"}' \
    --data-urlencode 'match[]={__name__="openshift_csi_share_mount_requests_total"}' \
    --data-urlencode 'match[]={__name__="cluster:velero_backup_total:max"}' \
    --data-urlencode 'match[]={__name__="cluster:velero_restore_total:max"}' \
    --data-urlencode 'match[]={__name__="eo_es_storage_info"}' \
    --data-urlencode 'match[]={__name__="eo_es_redundancy_policy_info"}' \
    --data-urlencode 'match[]={__name__="eo_es_defined_delete_namespaces_total"}' \
    --data-urlencode 'match[]={__name__="eo_es_misconfigured_memory_resources_info"}' \
    --data-urlencode 'match[]={__name__="cluster:eo_es_data_nodes_total:max"}' \
    --data-urlencode 'match[]={__name__="cluster:eo_es_documents_created_total:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:eo_es_documents_deleted_total:sum"}' \
    --data-urlencode 'match[]={__name__="pod:eo_es_shards_total:max"}' \
    --data-urlencode 'match[]={__name__="eo_es_cluster_management_state_info"}' \
    --data-urlencode 'match[]={__name__="imageregistry:imagestreamtags_count:sum"}' \
    --data-urlencode 'match[]={__name__="imageregistry:operations_count:sum"}' \
    --data-urlencode 'match[]={__name__="log_logging_info"}' \
    --data-urlencode 'match[]={__name__="log_collector_error_count_total"}' \
    --data-urlencode 'match[]={__name__="log_forwarder_pipeline_info"}' \
    --data-urlencode 'match[]={__name__="log_forwarder_input_info"}' \
    --data-urlencode 'match[]={__name__="log_forwarder_output_info"}' \
    --data-urlencode 'match[]={__name__="cluster:log_collected_bytes_total:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:log_logged_bytes_total:sum"}' \
    --data-urlencode 'match[]={__name__="cluster:kata_monitor_running_shim_count:sum"}' \
    --data-urlencode 'match[]={__name__="platform:hypershift_hostedclusters:max"}' \
    --data-urlencode 'match[]={__name__="platform:hypershift_nodepools:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_unhealthy_bucket_claims:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_buckets_claims:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_unhealthy_namespace_resources:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_namespace_resources:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_unhealthy_namespace_buckets:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_namespace_buckets:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_accounts:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_usage:max"}' \
    --data-urlencode 'match[]={__name__="namespace:noobaa_system_health_status:max"}' \
    --data-urlencode 'match[]={__name__="ocs_advanced_feature_usage"}' \
    --data-urlencode 'match[]={__name__="os_image_url_override:sum"}'
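
The federate endpoint returns a long plain-text listing, so a per-metric summary can make it easier to scan. The following is a minimal sketch, assuming the response uses the Prometheus text exposition format (one sample per line, comment lines starting with #). The function name count_series_by_metric and the file name telemetry.txt are hypothetical and are not part of the product.

```shell
# Hypothetical helper: count time series per metric name.
# Reads Prometheus text exposition format on standard input.
count_series_by_metric() {
  awk '
    # Skip comment lines (for example "# TYPE ...") and blank lines.
    /^#/ || /^[[:space:]]*$/ { next }
    {
      name = $0
      # Drop everything from the first "{" or space onward,
      # which leaves only the metric name.
      sub(/[{ ].*$/, "", name)
      counts[name]++
    }
    END { for (n in counts) printf "%d %s\n", counts[n], n }
  ' | sort -k1,1nr -k2,2
}
```

For example, if you redirected the output of the curl command to a file named telemetry.txt (an assumed name), you could run count_series_by_metric < telemetry.txt to print one line per metric, with the metrics that have the most series listed first.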

Showing data collected by the Insights Operator

You can review the data that is collected by the Insights Operator.

Prerequisites

  • You have access to the cluster as a user with the cluster-admin role.

Procedure

  1. Find the name of the currently running pod for the Insights Operator:

    $ INSIGHTS_OPERATOR_POD=$(oc get pods --namespace=openshift-insights -o custom-columns=:metadata.name --no-headers --field-selector=status.phase=Running)

  2. Copy the recent data archives collected by the Insights Operator:

    $ oc cp openshift-insights/$INSIGHTS_OPERATOR_POD:/var/lib/insights-operator ./insights-data

The recent Insights Operator archives are now available in the insights-data directory.
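
To inspect what was copied, you might start with the most recent archive. The following is a minimal sketch, assuming the archives are gzip-compressed tar files with a .tar.gz suffix and that their modification times reflect the order in which they were collected. Both points are assumptions, and the function name latest_archive is hypothetical.

```shell
# Hypothetical helper: print the most recently modified .tar.gz file
# in a directory (defaults to ./insights-data).
latest_archive() {
  dir=${1:-./insights-data}
  newest=""
  for f in "$dir"/*.tar.gz; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    # Keep the file with the newest modification time seen so far.
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  # Print nothing and return non-zero if no archive was found.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

You could then list the contents of that archive without extracting it by running tar -tzf "$(latest_archive)".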