The NVIDIA GPU administration dashboard

Introduction

The OpenShift Console NVIDIA GPU plugin is a dedicated administration dashboard for NVIDIA GPU usage visualization in the OpenShift Container Platform (OCP) Console. The visualizations in the administration dashboard provide guidance on how to best optimize GPU resources in clusters, such as when a GPU is under- or over-utilized.

The OpenShift Console NVIDIA GPU plugin works as a remote bundle for the OCP console. To run the plugin, an instance of the OCP console must be running.

Installing the NVIDIA GPU administration dashboard

Install the NVIDIA GPU plugin by using Helm on the OpenShift Container Platform (OCP) Console to add GPU capabilities.

Prerequisites

  • Red Hat OpenShift 4.11+

  • NVIDIA GPU operator

  • Helm

Procedure

Use the following procedure to install the OpenShift Console NVIDIA GPU plugin.

  1. Add the Helm repository:

    $ helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
    $ helm repo update
  2. Install the Helm chart in the default NVIDIA GPU operator namespace:

    $ helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu

    Example output

    NAME: console-plugin-nvidia-gpu
    LAST DEPLOYED: Tue Aug 23 15:37:35 2022
    NAMESPACE: nvidia-gpu-operator
    STATUS: deployed
    REVISION: 1
    NOTES:
    View the Console Plugin NVIDIA GPU deployed resources by running the following command:
    $ oc -n {{ .Release.Namespace }} get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
    Enable the plugin by running the following command:
    # Check if a plugins field is specified
    $ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
    # if not, then run the following command to enable the plugin
    $ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge
    # if yes, then run the following command to enable the plugin
    $ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json
    # add the required DCGM Exporter metrics ConfigMap to the existing NVIDIA operator ClusterPolicy CR:
    $ oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge

    The dashboard relies mostly on Prometheus metrics exposed by the NVIDIA DCGM Exporter, but the metrics exposed by default are not enough for the dashboard to render the required gauges. Therefore, the DCGM Exporter is configured to expose a custom set of metrics, as shown in the following ConfigMap.

    apiVersion: v1
    data:
      dcgm-metrics.csv: |
        DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
        DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
        DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
        DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
        DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
        DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
        DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
        DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
        DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
        DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
        DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
    kind: ConfigMap
    metadata:
      annotations:
        meta.helm.sh/release-name: console-plugin-nvidia-gpu
        meta.helm.sh/release-namespace: nvidia-gpu-operator
      creationTimestamp: "2022-10-26T19:46:41Z"
      labels:
        app.kubernetes.io/component: console-plugin-nvidia-gpu
        app.kubernetes.io/instance: console-plugin-nvidia-gpu
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: console-plugin-nvidia-gpu
        app.kubernetes.io/part-of: console-plugin-nvidia-gpu
        app.kubernetes.io/version: latest
        helm.sh/chart: console-plugin-nvidia-gpu-0.2.3
      name: console-plugin-nvidia-gpu
      namespace: nvidia-gpu-operator
      resourceVersion: "19096623"
      uid: 96cdf700-dd27-437b-897d-5cbb1c255068

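    Two things are required for these metrics to be exposed: the ConfigMap must be installed, and the NVIDIA GPU Operator ClusterPolicy CR must reference that ConfigMap in its DCGM Exporter configuration. The new version of the Console Plugin NVIDIA GPU Helm chart installs the ConfigMap, but you must edit the ClusterPolicy CR yourself, for example by running the oc patch clusterpolicies.nvidia.com command listed at the end of the Helm notes.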

  3. View the deployed resources:

    $ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu

    Example output

    NAME                                             READY   STATUS    RESTARTS   AGE
    pod/console-plugin-nvidia-gpu-7dc9cfb5df-ztksx   1/1     Running   0          2m6s

    NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
    service/console-plugin-nvidia-gpu   ClusterIP   172.30.240.138   <none>        9443/TCP   2m6s

    NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/console-plugin-nvidia-gpu   1/1     1            1           2m6s

    NAME                                                   DESIRED   CURRENT   READY   AGE
    replicaset.apps/console-plugin-nvidia-gpu-7dc9cfb5df   1         1         1       2m6s
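
The Helm notes in step 2 describe a manual decision: check whether the console operator already has a plugins field, then use either a merge patch or a JSON patch. The following is a minimal sketch of that decision as a shell script, not part of the Helm chart. The function names plugin_patch_type and enable_plugin are hypothetical, and the sketch assumes that oc prints the plugins list as a JSON array (for example ["a","b"]), which may differ on older clients.

```shell
# Sketch only: automate the "check, then patch" steps from the Helm notes.
# plugin_patch_type and enable_plugin are hypothetical helper names.

PLUGIN="console-plugin-nvidia-gpu"

# Takes the output of
#   oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
# and prints which patch type applies: merge, json, or none.
plugin_patch_type() {
  case "$1" in
    ''|'[]')          echo merge ;;  # no plugins listed yet: create the list
    *"\"$PLUGIN\""*)  echo none ;;   # plugin already listed: nothing to do
    *)                echo json ;;   # other plugins listed: append to the list
  esac
}

# Runs the matching oc patch command. Requires a logged-in oc session.
enable_plugin() {
  current="$(oc get consoles.operator.openshift.io cluster --output=jsonpath='{.spec.plugins}')"
  case "$(plugin_patch_type "$current")" in
    merge)
      oc patch consoles.operator.openshift.io cluster --type=merge \
        --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' ;;
    json)
      oc patch consoles.operator.openshift.io cluster --type=json \
        --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' ;;
    none)
      echo "plugin already enabled" ;;
  esac
}
```

Sourcing the script only defines the functions; nothing is changed on the cluster until enable_plugin is called. The "none" branch is an addition to the Helm notes, included so that running the sketch twice does not add the plugin to the list a second time.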

Using the NVIDIA GPU administration dashboard

After deploying the OpenShift Console NVIDIA GPU plugin, log in to the OpenShift Container Platform web console using your login credentials to access the Administrator perspective.

Refresh the console for the change to take effect. The GPUs tab then appears under Compute.

Viewing the cluster GPU overview

You can view the status of your cluster GPUs in the Overview page by selecting Overview in the Home section.

The Overview page provides information about the cluster GPUs, including:

  • Details about the GPU providers

  • Status of the GPUs

  • Cluster utilization of the GPUs

Viewing the GPUs dashboard

You can view the NVIDIA GPU administration dashboard by selecting GPUs in the Compute section of the OpenShift Console.

Charts on the GPUs dashboard include:

  • GPU utilization: Shows the ratio of time the graphics engine is active and is based on the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.

  • Memory utilization: Shows the memory being used by the GPU and is based on the DCGM_FI_DEV_MEM_COPY_UTIL metric.

  • Encoder utilization: Shows the video encoder rate of utilization and is based on the DCGM_FI_DEV_ENC_UTIL metric.

  • Decoder utilization: Shows the video decoder rate of utilization and is based on the DCGM_FI_DEV_DEC_UTIL metric.

  • Power consumption: Shows the average power usage of the GPU in Watts and is based on the DCGM_FI_DEV_POWER_USAGE metric.

  • GPU temperature: Shows the current GPU temperature and is based on the DCGM_FI_DEV_GPU_TEMP metric. The maximum is set to 110, which is an empirical number, as the actual number is not exposed via a metric.

  • GPU clock speed: Shows the average clock speed utilized by the GPU and is based on the DCGM_FI_DEV_SM_CLOCK metric.

  • Memory clock speed: Shows the average clock speed utilized by memory and is based on the DCGM_FI_DEV_MEM_CLOCK metric.
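
Each chart above depends on one DCGM metric, so a chart with no data often means that the metric is missing from the DCGM Exporter configuration. The following is a minimal sketch of a check for that case. The function name missing_metrics is hypothetical, and the sketch assumes the CSV layout shown in the ConfigMap in the installation procedure, with the metric name as the first field on each line.

```shell
# Sketch only: report chart metrics that a dcgm-metrics.csv does not list.
# missing_metrics is a hypothetical helper name.

# Metric names taken from the chart descriptions above.
REQUIRED_METRICS="DCGM_FI_PROF_GR_ENGINE_ACTIVE DCGM_FI_DEV_MEM_COPY_UTIL DCGM_FI_DEV_ENC_UTIL DCGM_FI_DEV_DEC_UTIL DCGM_FI_DEV_POWER_USAGE DCGM_FI_DEV_GPU_TEMP DCGM_FI_DEV_SM_CLOCK DCGM_FI_DEV_MEM_CLOCK"

# Reads CSV text on stdin and prints one line per required metric
# that does not appear as the first field of any CSV line.
missing_metrics() {
  csv="$(cat)"
  for m in $REQUIRED_METRICS; do
    printf '%s\n' "$csv" | grep -q "^[[:space:]]*$m[[:space:]]*," || echo "$m"
  done
}
```

One possible way to use it, assuming the ConfigMap name and namespace from the installation procedure:

```shell
oc -n nvidia-gpu-operator get configmap console-plugin-nvidia-gpu \
  -o jsonpath='{.data.dcgm-metrics\.csv}' | missing_metrics
```

Empty output means that every chart metric is listed in the ConfigMap. It does not confirm that the ClusterPolicy CR references the ConfigMap.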

Viewing the GPU Metrics

You can view the metrics for a GPU by selecting the metric at the bottom of that GPU, which opens the Metrics page.

On the Metrics page, you can:

  • Specify a refresh rate for the metrics

  • Add, run, disable, and delete queries

  • Insert metrics

  • Reset the zoom view
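
Queries on the Metrics page are written in PromQL. As a rough illustration of what can be added there, the following sketch builds simple aggregation queries over the DCGM metrics used by the dashboard. The helper name gpu_query is hypothetical, and the resulting expressions aggregate over every GPU that reports the metric, which is an assumption about what you want to see rather than what the dashboard itself runs.

```shell
# Sketch only: build a PromQL aggregation over a DCGM metric.
# gpu_query is a hypothetical helper name.

# Usage: gpu_query AGGREGATION METRIC
# Prints a PromQL expression, or fails for an unsupported aggregation.
gpu_query() {
  case "$1" in
    avg|max|min|sum) printf '%s(%s)\n' "$1" "$2" ;;
    *) echo "unsupported aggregation: $1" >&2; return 1 ;;
  esac
}

# Example expressions that can be pasted into the query field:
gpu_query avg DCGM_FI_PROF_GR_ENGINE_ACTIVE
gpu_query max DCGM_FI_DEV_GPU_TEMP
```

The two example calls print avg(DCGM_FI_PROF_GR_ENGINE_ACTIVE) and max(DCGM_FI_DEV_GPU_TEMP).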