Troubleshooting and debugging metering

Metering is a deprecated feature. Deprecated functionality is still included in OKD and continues to be supported; however, it will be removed in a future release of this product and is not recommended for new deployments.

For the most recent list of major functionality that has been deprecated or removed within OKD, refer to the Deprecated and removed features section of the OKD release notes.

Use the following sections to help troubleshoot and debug specific issues with metering.

Troubleshooting metering

A common issue with metering is pods failing to start. Pods might fail to start due to lack of resources or if they have a dependency on a resource that does not exist, such as a StorageClass or Secret resource.
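
A quick first check is to list the metering pods and describe any pod that is not running; the events at the end of the oc describe output usually point to the missing resource or the scheduling problem. The pod name below is a placeholder:

  $ oc -n openshift-metering get pods
  $ oc -n openshift-metering describe pod <pod-name>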

Not enough compute resources

A common issue when installing or running metering is a lack of compute resources. As the cluster grows and more reports are created, the Reporting Operator pod requires more memory. If memory usage reaches the pod limit, the cluster considers the pod out of memory (OOM) and terminates it with an OOMKilled status. Ensure that metering is allocated the minimum resource requirements described in the installation prerequisites.

The Metering Operator does not autoscale the Reporting Operator based on the load in the cluster. Therefore, CPU usage for the Reporting Operator pod does not increase as the cluster grows.

To determine if the issue is with resources or scheduling, follow the troubleshooting instructions included in the Kubernetes document Managing Compute Resources for Containers.
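
If cluster metrics are available, you can also get a quick view of the current CPU and memory consumption of the metering pods before inspecting individual resources:

  $ oc adm top pods -n openshift-metering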

To troubleshoot issues due to a lack of compute resources, check the following within the openshift-metering namespace.

Prerequisites

  • You are in the openshift-metering namespace. If you are not, change to the openshift-metering namespace by running:

    $ oc project openshift-metering

Procedure

  1. Check for metering Report resources that fail to complete and show the status of ReportingPeriodUnmetDependencies:

    $ oc get reports

    Example output

    NAME                                  QUERY                          SCHEDULE   RUNNING                            FAILED   LAST REPORT TIME       AGE
    namespace-cpu-utilization-adhoc-10    namespace-cpu-utilization                 Finished                                    2020-10-31T00:00:00Z   2m38s
    namespace-cpu-utilization-adhoc-11    namespace-cpu-utilization                 ReportingPeriodUnmetDependencies                                   2m23s
    namespace-memory-utilization-202010   namespace-memory-utilization              ReportingPeriodUnmetDependencies                                   26s
    namespace-memory-utilization-202011   namespace-memory-utilization              ReportingPeriodUnmetDependencies                                   14s
  2. Check the ReportDataSource resources where the NEWEST METRIC is less than the report end date:

    $ oc get reportdatasource

    Example output

    NAME                            EARLIEST METRIC         NEWEST METRIC           IMPORT START            IMPORT END              LAST IMPORT TIME        AGE
    ...
    node-allocatable-cpu-cores      2020-04-23T09:14:00Z    2020-08-31T10:07:00Z    2020-04-23T09:14:00Z    2020-10-15T17:13:00Z    2020-12-09T12:45:10Z    230d
    node-allocatable-memory-bytes   2020-04-23T09:14:00Z    2020-08-30T05:19:00Z    2020-04-23T09:14:00Z    2020-10-14T08:01:00Z    2020-12-09T12:45:12Z    230d
    ...
    pod-usage-memory-bytes          2020-04-23T09:14:00Z    2020-08-24T20:25:00Z    2020-04-23T09:14:00Z    2020-10-09T23:31:00Z    2020-12-09T12:45:12Z    230d
  3. Check the health of the reporting-operator Pod resource for a high number of pod restarts:

    $ oc get pods -l app=reporting-operator

    Example output

    NAME                                  READY   STATUS    RESTARTS   AGE
    reporting-operator-84f7c9b7b6-fr697   2/2     Running   542        8d (1)
    (1) The Reporting Operator pod is restarting at a high rate.
  4. Check the reporting-operator Pod resource for an OOMKilled termination:

    $ oc describe pod/reporting-operator-84f7c9b7b6-fr697

    Example output

    Name:       reporting-operator-84f7c9b7b6-fr697
    Namespace:  openshift-metering
    Priority:   0
    Node:       ip-10-xx-xx-xx.ap-southeast-1.compute.internal/10.xx.xx.xx
    ...
      Ports:          8080/TCP, 6060/TCP, 8082/TCP
      Host Ports:     0/TCP, 0/TCP, 0/TCP
      State:          Running
        Started:      Thu, 03 Dec 2020 20:59:45 +1000
      Last State:     Terminated
        Reason:       OOMKilled (1)
        Exit Code:    137
        Started:      Thu, 03 Dec 2020 20:38:05 +1000
        Finished:     Thu, 03 Dec 2020 20:59:43 +1000
    (1) The Reporting Operator pod was terminated due to an OOM kill.

Increasing the reporting-operator pod memory limit

If you are experiencing an increase in pod restarts and OOM kill events, check the current memory limit set for the Reporting Operator pod. Increasing the memory limit allows the Reporting Operator pod to update the report data sources. If necessary, increase the memory limit in your MeteringConfig resource by 25% to 50%.

Procedure

  1. Check the current memory limits of the reporting-operator Pod resource:

    $ oc describe pod reporting-operator-67d6f57c56-79mrt

    Example output

    Name:       reporting-operator-67d6f57c56-79mrt
    Namespace:  openshift-metering
    Priority:   0
    ...
      Ports:          8080/TCP, 6060/TCP, 8082/TCP
      Host Ports:     0/TCP, 0/TCP, 0/TCP
      State:          Running
        Started:      Tue, 08 Dec 2020 14:26:21 +1000
      Ready:          True
      Restart Count:  0
      Limits:
        cpu:     1
        memory:  500Mi (1)
      Requests:
        cpu:     500m
        memory:  250Mi
      Environment:
    ...
    (1) The current memory limit for the Reporting Operator pod.
  2. Edit the MeteringConfig resource to update the memory limit:

    $ oc edit meteringconfig/operator-metering

    Example MeteringConfig resource

    kind: MeteringConfig
    metadata:
      name: operator-metering
      namespace: openshift-metering
    spec:
      reporting-operator:
        spec:
          resources: (1)
            limits:
              cpu: 1
              memory: 750Mi
            requests:
              cpu: 500m
              memory: 500Mi
    ...
    (1) Add or increase memory limits within the resources field of the MeteringConfig resource.

    If there continue to be numerous OOM killed events after memory limits are increased, this might indicate that a different issue is causing the reports to be in a pending state.
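
After you save the MeteringConfig resource, the Metering Operator reconciles the change and the Reporting Operator pod is redeployed with the new limits. One way to confirm that the new limits are in effect and that the restart count is no longer climbing:

  $ oc -n openshift-metering get pods -l app=reporting-operator
  $ oc -n openshift-metering describe pod -l app=reporting-operator | grep -A 2 Limits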

StorageClass resource not configured

Metering requires that a default StorageClass resource be configured for dynamic provisioning.

See the documentation on configuring metering for information on how to check if there are any StorageClass resources configured for the cluster, how to set the default, and how to configure metering to use a storage class other than the default.
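
To quickly check which StorageClass resources exist in the cluster, and whether one of them is marked as the default, list them; the default class is flagged with (default) after its name:

  $ oc get storageclass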

Secret not configured correctly

A common issue with metering is providing the incorrect secret when configuring your persistent storage. Be sure to review the example configuration files and create your secret according to the guidelines for your storage provider.
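
To verify that the secret exists in the openshift-metering namespace and contains the keys that your storage provider expects, you can describe it; this shows the key names and sizes without printing the values. The secret name below is a placeholder:

  $ oc -n openshift-metering describe secret <your-storage-secret>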

Debugging metering

Debugging metering is much easier when you interact directly with the various components. The sections below detail how you can connect to and query Presto and Hive, as well as how to view the web interfaces of the Hive and HDFS components.

All of the commands in this section assume you have installed metering through OperatorHub in the openshift-metering namespace.

Get reporting operator logs

Use the command below to follow the logs of the reporting-operator:

  $ oc -n openshift-metering logs -f "$(oc -n openshift-metering get pods -l app=reporting-operator -o name | cut -c 5-)" -c reporting-operator

Query Presto using presto-cli

The following command opens an interactive presto-cli session where you can query Presto. This session runs in the same container as Presto and launches an additional Java instance, which can push the pod up against its memory limit. If this occurs, you should increase the memory request and limits of the Presto pod.

By default, Presto is configured to communicate using TLS. You must use the following command to run Presto queries:

  $ oc -n openshift-metering exec -it "$(oc -n openshift-metering get pods -l app=presto,presto=coordinator -o name | cut -d/ -f2)" \
    -- /usr/local/bin/presto-cli --server https://presto:8080 --catalog hive --schema default --user root --keystore-path /opt/presto/tls/keystore.pem

Once you run this command, a prompt appears where you can run queries. Use the show tables from metering; query to view the list of tables:

  presto:default> show tables from metering;

Example output

  Table
  datasource_your_namespace_cluster_cpu_capacity_raw
  datasource_your_namespace_cluster_cpu_usage_raw
  datasource_your_namespace_cluster_memory_capacity_raw
  datasource_your_namespace_cluster_memory_usage_raw
  datasource_your_namespace_node_allocatable_cpu_cores
  datasource_your_namespace_node_allocatable_memory_bytes
  datasource_your_namespace_node_capacity_cpu_cores
  datasource_your_namespace_node_capacity_memory_bytes
  datasource_your_namespace_node_cpu_allocatable_raw
  datasource_your_namespace_node_cpu_capacity_raw
  datasource_your_namespace_node_memory_allocatable_raw
  datasource_your_namespace_node_memory_capacity_raw
  datasource_your_namespace_persistentvolumeclaim_capacity_bytes
  datasource_your_namespace_persistentvolumeclaim_capacity_raw
  datasource_your_namespace_persistentvolumeclaim_phase
  datasource_your_namespace_persistentvolumeclaim_phase_raw
  datasource_your_namespace_persistentvolumeclaim_request_bytes
  datasource_your_namespace_persistentvolumeclaim_request_raw
  datasource_your_namespace_persistentvolumeclaim_usage_bytes
  datasource_your_namespace_persistentvolumeclaim_usage_raw
  datasource_your_namespace_persistentvolumeclaim_usage_with_phase_raw
  datasource_your_namespace_pod_cpu_request_raw
  datasource_your_namespace_pod_cpu_usage_raw
  datasource_your_namespace_pod_limit_cpu_cores
  datasource_your_namespace_pod_limit_memory_bytes
  datasource_your_namespace_pod_memory_request_raw
  datasource_your_namespace_pod_memory_usage_raw
  datasource_your_namespace_pod_persistentvolumeclaim_request_info
  datasource_your_namespace_pod_request_cpu_cores
  datasource_your_namespace_pod_request_memory_bytes
  datasource_your_namespace_pod_usage_cpu_cores
  datasource_your_namespace_pod_usage_memory_bytes
  (32 rows)

  Query 20190503_175727_00107_3venm, FINISHED, 1 node
  Splits: 19 total, 19 done (100.00%)
  0:02 [32 rows, 2.23KB] [19 rows/s, 1.37KB/s]

  presto:default>
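
From the same prompt you can run ad-hoc queries against any of the tables listed above. For example, a quick sanity check that a data source table contains rows; the table name below follows the datasource_your_namespace_... pattern from the example output, so substitute your own namespace:

  presto:default> select count(*) from metering.datasource_your_namespace_pod_usage_memory_bytes;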

Query Hive using beeline

The following opens an interactive beeline session where you can query Hive. This session runs in the same container as Hive and launches an additional Java instance, which can push the pod up against its memory limit. If this occurs, you should increase the memory request and limits of the Hive pod.

  $ oc -n openshift-metering exec -it $(oc -n openshift-metering get pods -l app=hive,hive=server -o name | cut -d/ -f2) \
    -c hiveserver2 -- beeline -u 'jdbc:hive2://127.0.0.1:10000/default;auth=noSasl'

Once you run this command, a prompt appears where you can run queries. Use the show tables from metering; query to view the list of tables:

  0: jdbc:hive2://127.0.0.1:10000/default> show tables from metering;

Example output

  +----------------------------------------------------+
  | tab_name |
  +----------------------------------------------------+
  | datasource_your_namespace_cluster_cpu_capacity_raw |
  | datasource_your_namespace_cluster_cpu_usage_raw |
  | datasource_your_namespace_cluster_memory_capacity_raw |
  | datasource_your_namespace_cluster_memory_usage_raw |
  | datasource_your_namespace_node_allocatable_cpu_cores |
  | datasource_your_namespace_node_allocatable_memory_bytes |
  | datasource_your_namespace_node_capacity_cpu_cores |
  | datasource_your_namespace_node_capacity_memory_bytes |
  | datasource_your_namespace_node_cpu_allocatable_raw |
  | datasource_your_namespace_node_cpu_capacity_raw |
  | datasource_your_namespace_node_memory_allocatable_raw |
  | datasource_your_namespace_node_memory_capacity_raw |
  | datasource_your_namespace_persistentvolumeclaim_capacity_bytes |
  | datasource_your_namespace_persistentvolumeclaim_capacity_raw |
  | datasource_your_namespace_persistentvolumeclaim_phase |
  | datasource_your_namespace_persistentvolumeclaim_phase_raw |
  | datasource_your_namespace_persistentvolumeclaim_request_bytes |
  | datasource_your_namespace_persistentvolumeclaim_request_raw |
  | datasource_your_namespace_persistentvolumeclaim_usage_bytes |
  | datasource_your_namespace_persistentvolumeclaim_usage_raw |
  | datasource_your_namespace_persistentvolumeclaim_usage_with_phase_raw |
  | datasource_your_namespace_pod_cpu_request_raw |
  | datasource_your_namespace_pod_cpu_usage_raw |
  | datasource_your_namespace_pod_limit_cpu_cores |
  | datasource_your_namespace_pod_limit_memory_bytes |
  | datasource_your_namespace_pod_memory_request_raw |
  | datasource_your_namespace_pod_memory_usage_raw |
  | datasource_your_namespace_pod_persistentvolumeclaim_request_info |
  | datasource_your_namespace_pod_request_cpu_cores |
  | datasource_your_namespace_pod_request_memory_bytes |
  | datasource_your_namespace_pod_usage_cpu_cores |
  | datasource_your_namespace_pod_usage_memory_bytes |
  +----------------------------------------------------+
  32 rows selected (13.101 seconds)

  0: jdbc:hive2://127.0.0.1:10000/default>

Port-forward to the Hive web UI

Run the following command to port-forward to the Hive web UI:

  $ oc -n openshift-metering port-forward hive-server-0 10002

You can now open http://127.0.0.1:10002 in your browser window to view the Hive web interface.

Port-forward to HDFS

Run the following command to port-forward to the HDFS namenode:

  $ oc -n openshift-metering port-forward hdfs-namenode-0 9870

You can now open http://127.0.0.1:9870 in your browser window to view the HDFS web interface.

Run the following command to port-forward to the first HDFS datanode:

  $ oc -n openshift-metering port-forward hdfs-datanode-0 9864 (1)
(1) To check other datanodes, replace hdfs-datanode-0 with the pod you want to view information on.
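
If you are not sure which datanode pods are running, one way to list them, assuming the hdfs-datanode naming shown above, is:

  $ oc -n openshift-metering get pods -o name | grep hdfs-datanode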

Metering Ansible Operator

Metering uses the Ansible Operator to watch and reconcile resources in a cluster environment. When debugging a failed metering installation, it can be helpful to view the Ansible logs or status of your MeteringConfig custom resource.

Accessing Ansible logs

In the default installation, the Metering Operator is deployed as a pod. In this case, you can check the logs of the Ansible container within this pod:

  $ oc -n openshift-metering logs $(oc -n openshift-metering get pods -l app=metering-operator -o name | cut -d/ -f2) -c ansible

Alternatively, you can view the logs of the Operator container (replace -c ansible with -c operator) for condensed output.
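
For example, to view the condensed Operator container logs:

  $ oc -n openshift-metering logs $(oc -n openshift-metering get pods -l app=metering-operator -o name | cut -d/ -f2) -c operator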

Checking the MeteringConfig Status

It can be helpful to view the .status field of your MeteringConfig custom resource to debug any recent failures. The following command shows status messages with type Invalid:

  $ oc -n openshift-metering get meteringconfig operator-metering -o=jsonpath='{.status.conditions[?(@.type=="Invalid")].message}'
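
To review every reported condition, not only those with type Invalid, you can print the full conditions list with a variation of the same command:

  $ oc -n openshift-metering get meteringconfig operator-metering -o=jsonpath='{.status.conditions}'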

Checking MeteringConfig Events

Check events that the Metering Operator is generating. This can be helpful during installation or upgrade to debug any resource failures. Sort events by the last timestamp:

  $ oc -n openshift-metering get events --field-selector involvedObject.kind=MeteringConfig --sort-by='.lastTimestamp'

Example output showing the latest changes to the MeteringConfig resource

  LAST SEEN   TYPE     REASON        OBJECT                             MESSAGE
  4m40s       Normal   Validating    meteringconfig/operator-metering   Validating the user-provided configuration
  4m30s       Normal   Started       meteringconfig/operator-metering   Configuring storage for the metering-ansible-operator
  4m26s       Normal   Started       meteringconfig/operator-metering   Configuring TLS for the metering-ansible-operator
  3m58s       Normal   Started       meteringconfig/operator-metering   Configuring reporting for the metering-ansible-operator
  3m53s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling metering resources
  3m47s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling monitoring resources
  3m41s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling HDFS resources
  3m23s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling Hive resources
  2m59s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling Presto resources
  2m35s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling reporting-operator resources
  2m14s       Normal   Reconciling   meteringconfig/operator-metering   Reconciling reporting resources