Setting up Prometheus and Grafana to monitor Longhorn

This document is a quick guide to setting up the monitor for Longhorn.

Longhorn natively exposes metrics in Prometheus text format on a REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.

You can use any collecting tools such as Prometheus, Graphite, Telegraf to scrape these metrics then visualize the collected data by tools such as Grafana.

See Longhorn Metrics for Monitoring for available metrics.

High-level Overview

The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data.

  • Prometheus server which scrapes and stores time-series data from Longhorn metrics endpoints. The Prometheus is also responsible for generating alerts based on configured rules and collected data. Prometheus servers then send alerts to an Alertmanager.
  • AlertManager then manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
  • Grafana which queries Prometheus server for data and draws a dashboard for visualization.

The below picture describes the detailed architecture of the monitoring system.

images

There are 2 unmentioned components in the above picture:

  • Longhorn Backend service is a service pointing to the set of Longhorn manager pods. Longhorn’s metrics are exposed in Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
  • Prometheus operator makes running Prometheus on top of Kubernetes very easy. The operator watches 3 custom resources: ServiceMonitor, Prometheus ,and AlertManager. When you create those custom resources, Prometheus Operator deploys and manages the Prometheus server, AlertManager with the user-specified configurations.

Installation

This document uses the default namespace for the monitoring system. To install on a different namespace, change the field namespace: <OTHER_NAMESPACE> in manifests.

Install Prometheus Operator

Follow instructions in Prometheus Operator - Quickstart.

NOTE: You may need to choose a release that is compatible with the Kubernetes version of the cluster.

Install Longhorn ServiceMonitor

  1. Create a ServiceMonitor for Longhorn manager.

    1. apiVersion: monitoring.coreos.com/v1
    2. kind: ServiceMonitor
    3. metadata:
    4. name: longhorn-prometheus-servicemonitor
    5. namespace: default
    6. labels:
    7. name: longhorn-prometheus-servicemonitor
    8. spec:
    9. selector:
    10. matchLabels:
    11. app: longhorn-manager
    12. namespaceSelector:
    13. matchNames:
    14. - longhorn-system
    15. endpoints:
    16. - port: manager

    Longhorn ServiceMonitor has a label selector app: longhorn-manager for selecting Longhorn backend service.

    Longhorn ServiceMonitor is included in the Prometheus custom resource so that the Prometheus server can discover all Longhorn manager pods and their endpoints.

Install and configure Prometheus AlertManager

  1. Create a highly available Alertmanager deployment with 3 instances.

    1. apiVersion: monitoring.coreos.com/v1
    2. kind: Alertmanager
    3. metadata:
    4. name: longhorn
    5. namespace: default
    6. spec:
    7. replicas: 3
  2. The Alertmanager instances will not start unless a valid configuration is given. See Prometheus - Configuration for more explanation.

    1. global:
    2. resolve_timeout: 5m
    3. route:
    4. group_by: [alertname]
    5. receiver: email_and_slack
    6. receivers:
    7. - name: email_and_slack
    8. email_configs:
    9. - to: <the email address to send notifications to>
    10. from: <the sender address>
    11. smarthost: <the SMTP host through which emails are sent>
    12. # SMTP authentication information.
    13. auth_username: <the username>
    14. auth_identity: <the identity>
    15. auth_password: <the password>
    16. headers:
    17. subject: 'Longhorn-Alert'
    18. text: |-
    19. {{ range .Alerts }}
    20. *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
    21. *Description:* {{ .Annotations.description }}
    22. *Details:*
    23. {{ range .Labels.SortedPairs }} *{{ .Name }}:* `{{ .Value }}`
    24. {{ end }}
    25. {{ end }}
    26. slack_configs:
    27. - api_url: <the Slack webhook URL>
    28. channel: <the channel or user to send notifications to>
    29. text: |-
    30. {{ range .Alerts }}
    31. *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
    32. *Description:* {{ .Annotations.description }}
    33. *Details:*
    34. {{ range .Labels.SortedPairs }} *{{ .Name }}:* `{{ .Value }}`
    35. {{ end }}
    36. {{ end }}

    Save the above Alertmanager config in a file called alertmanager.yaml and create a secret from it using kubectl.

    Alertmanager instances require the secret resource naming to follow the format alertmanager-<ALERTMANAGER_NAME>. In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn

    1. $ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n default
  3. To be able to view the web UI of the Alertmanager, expose it through a Service. A simple way to do this is to use a Service of type NodePort.

    1. apiVersion: v1
    2. kind: Service
    3. metadata:
    4. name: alertmanager-longhorn
    5. namespace: default
    6. spec:
    7. type: NodePort
    8. ports:
    9. - name: web
    10. nodePort: 30903
    11. port: 9093
    12. protocol: TCP
    13. targetPort: web
    14. selector:
    15. alertmanager: longhorn

    After creating the above service, you can access the web UI of Alertmanager via a Node’s IP and the port 30903.

    Use the above NodePort service for quick verification only because it doesn’t communicate over the TLS connection. You may want to change the service type to ClusterIP and set up an Ingress-controller to expose the web UI of Alertmanager over a TLS connection.

Install and configure Prometheus server

  1. Create PrometheusRule custom resource to define alert conditions. See more examples about Longhorn alert rules at Longhorn Alert Rule Examples.

    1. apiVersion: monitoring.coreos.com/v1
    2. kind: PrometheusRule
    3. metadata:
    4. labels:
    5. prometheus: longhorn
    6. role: alert-rules
    7. name: prometheus-longhorn-rules
    8. namespace: default
    9. spec:
    10. groups:
    11. - name: longhorn.rules
    12. rules:
    13. - alert: LonghornVolumeUsageCritical
    14. annotations:
    15. description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for
    16. more than 5 minutes.
    17. summary: Longhorn volume capacity is over 90% used.
    18. expr: 100 * (longhorn_volume_usage_bytes / longhorn_volume_capacity_bytes) > 90
    19. for: 5m
    20. labels:
    21. issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
    22. severity: critical

    See Prometheus - Alerting rules for more information.

  2. If RBAC authorization is activated, Create a ClusterRole and ClusterRoleBinding for the Prometheus Pods.

    1. apiVersion: v1
    2. kind: ServiceAccount
    3. metadata:
    4. name: prometheus
    5. namespace: default
    1. apiVersion: rbac.authorization.k8s.io/v1
    2. kind: ClusterRole
    3. metadata:
    4. name: prometheus
    5. namespace: default
    6. rules:
    7. - apiGroups: [""]
    8. resources:
    9. - nodes
    10. - services
    11. - endpoints
    12. - pods
    13. verbs: ["get", "list", "watch"]
    14. - apiGroups: [""]
    15. resources:
    16. - configmaps
    17. verbs: ["get"]
    18. - nonResourceURLs: ["/metrics"]
    19. verbs: ["get"]
    1. apiVersion: rbac.authorization.k8s.io/v1
    2. kind: ClusterRoleBinding
    3. metadata:
    4. name: prometheus
    5. roleRef:
    6. apiGroup: rbac.authorization.k8s.io
    7. kind: ClusterRole
    8. name: prometheus
    9. subjects:
    10. - kind: ServiceAccount
    11. name: prometheus
    12. namespace: default
  3. Create a Prometheus custom resource. Notice that we select the Longhorn service monitor and Longhorn rules in the spec.

    1. apiVersion: monitoring.coreos.com/v1
    2. kind: Prometheus
    3. metadata:
    4. name: longhorn
    5. namespace: default
    6. spec:
    7. replicas: 2
    8. serviceAccountName: prometheus
    9. alerting:
    10. alertmanagers:
    11. - namespace: default
    12. name: alertmanager-longhorn
    13. port: web
    14. serviceMonitorSelector:
    15. matchLabels:
    16. name: longhorn-prometheus-servicemonitor
    17. ruleSelector:
    18. matchLabels:
    19. prometheus: longhorn
    20. role: alert-rules
  4. To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of type NodePort.

    1. apiVersion: v1
    2. kind: Service
    3. metadata:
    4. name: prometheus-longhorn
    5. namespace: default
    6. spec:
    7. type: NodePort
    8. ports:
    9. - name: web
    10. nodePort: 30904
    11. port: 9090
    12. protocol: TCP
    13. targetPort: web
    14. selector:
    15. prometheus: longhorn

    After creating the above service, you can access the web UI of the Prometheus server via a Node’s IP and the port 30904.

    At this point, you should be able to see all Longhorn manager targets as well as Longhorn rules in the targets and rules section of the Prometheus server UI.

    Use the above NodePort service for quick verification only because it doesn’t communicate over the TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection.

Setup Grafana

  1. Create Grafana datasource ConfigMap.

    1. apiVersion: v1
    2. kind: ConfigMap
    3. metadata:
    4. name: grafana-datasources
    5. namespace: default
    6. data:
    7. prometheus.yaml: |-
    8. {
    9. "apiVersion": 1,
    10. "datasources": [
    11. {
    12. "access":"proxy",
    13. "editable": true,
    14. "name": "prometheus-longhorn",
    15. "orgId": 1,
    16. "type": "prometheus",
    17. "url": "http://prometheus-longhorn.default.svc:9090",
    18. "version": 1
    19. }
    20. ]
    21. }

    NOTE: change field url if you are installing the monitoring stack in a different namespace. http://prometheus-longhorn.<NAMESPACE>.svc:9090"

  2. Create Grafana Deployment.

    1. apiVersion: apps/v1
    2. kind: Deployment
    3. metadata:
    4. name: grafana
    5. namespace: default
    6. labels:
    7. app: grafana
    8. spec:
    9. replicas: 1
    10. selector:
    11. matchLabels:
    12. app: grafana
    13. template:
    14. metadata:
    15. name: grafana
    16. labels:
    17. app: grafana
    18. spec:
    19. containers:
    20. - name: grafana
    21. image: grafana/grafana:7.1.5
    22. ports:
    23. - name: grafana
    24. containerPort: 3000
    25. resources:
    26. limits:
    27. memory: "500Mi"
    28. cpu: "300m"
    29. requests:
    30. memory: "500Mi"
    31. cpu: "200m"
    32. volumeMounts:
    33. - mountPath: /var/lib/grafana
    34. name: grafana-storage
    35. - mountPath: /etc/grafana/provisioning/datasources
    36. name: grafana-datasources
    37. readOnly: false
    38. volumes:
    39. - name: grafana-storage
    40. emptyDir: {}
    41. - name: grafana-datasources
    42. configMap:
    43. defaultMode: 420
    44. name: grafana-datasources
  3. Create Grafana Service.

    1. apiVersion: v1
    2. kind: Service
    3. metadata:
    4. name: grafana
    5. namespace: default
    6. spec:
    7. selector:
    8. app: grafana
    9. type: ClusterIP
    10. ports:
    11. - port: 3000
    12. targetPort: 3000
  4. Expose Grafana on NodePort 32000.

    1. kubectl -n default patch svc grafana --type='json' -p '[{"op":"replace","path":"/spec/type","value":"NodePort"},{"op":"replace","path":"/spec/ports/0/nodePort","value":32000}]'

    Use the above NodePort service for quick verification only because it doesn’t communicate over the TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.

  5. Access the Grafana dashboard using any node IP on port 32000.

    1. # Default Credential
    2. User: admin
    3. Pass: admin
  6. Setup Longhorn dashboard.

    Once inside Grafana, import the prebuilt Longhorn example dashboard.

    See Grafana Lab - Export and import for instructions on how to import a Grafana dashboard.

    You should see the following dashboard at successful setup: images