Longhorn Alert Rule Examples
We provide several example Longhorn alert rules below for your reference. See the list of all available Longhorn metrics when building your own alert rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Fault for more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Fault
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Fault.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: (avg(longhorn_node_count_total) or on() vector(0)) - (count(longhorn_node_status{condition="ready"} == 1) or on() vector(0)) > 0
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has a CPU usage / CPU request ratio of {{$value}}% for more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has a CPU usage / CPU request ratio over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has a CPU usage / CPU capacity ratio of {{$value}}% for more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning
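One way to check a rule before deploying it is a promtool unit test. The sketch below exercises only the LonghornVolumeStatusWarning rule. It assumes the content under spec.groups has been copied into a plain Prometheus rule file that starts with a top-level groups: key, since promtool reads rule files rather than PrometheusRule resources. The file name longhorn-rules.yaml and the label values vol-1 and node-1 are hypothetical.

```yaml
# longhorn-rules-test.yaml (hypothetical file name)
rule_files:
  - longhorn-rules.yaml        # assumed: plain rule file starting with "groups:"

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate a volume that stays Degraded (robustness == 2) for 10 minutes.
      - series: 'longhorn_volume_robustness{volume="vol-1", node="node-1"}'
        values: '2+0x10'
    alert_rule_test:
      # The rule uses "for: 5m", so it should be firing by the 6-minute mark.
      - eval_time: 6m
        alertname: LonghornVolumeStatusWarning
        exp_alerts:
          - exp_labels:
              severity: warning
              issue: Longhorn volume vol-1 is Degraded.
              volume: vol-1
              node: node-1
            exp_annotations:
              description: Longhorn volume vol-1 on node-1 is Degraded for more than 5 minutes.
              summary: Longhorn volume vol-1 is Degraded
```

Under these assumptions, the test can be run with: promtool test rules longhorn-rules-test.yaml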
See the Prometheus documentation for more about how to define alert rules.
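As a sketch of building your own rule, the entry below adds a second, stricter tier on top of the disk storage warning. It reuses only metrics that appear in the examples above. The alert name, the 90% threshold, and the critical severity are illustrative assumptions, not Longhorn recommendations. The entry would be appended to the rules list of the longhorn.rules group.

```yaml
    # Hypothetical rule: the name, threshold, and severity are illustrative choices.
    - alert: LonghornDiskStorageCritical
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for more than 5 minutes.
        summary: The used storage of disk is over 90% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is critically high.
        severity: critical
```

Pairing a warning tier with a critical tier on the same expression lets an alert router treat the two severities differently, for example by paging only on critical.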