Longhorn Alert Rule Examples

    Below are several example Longhorn alert rules for reference. See here for a list of all available Longhorn metrics, which you can use to build your own alert rules.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: longhorn
        role: alert-rules
      name: prometheus-longhorn-rules
      namespace: monitoring
    spec:
      groups:
      - name: longhorn.rules
        rules:
        - alert: LonghornVolumeActualSpaceUsedWarning
          annotations:
            description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for
              more than 5 minutes.
            summary: The actual used space of Longhorn volume is over 90% of the capacity.
          expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
          for: 5m
          labels:
            issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
            severity: warning
        - alert: LonghornVolumeStatusCritical
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Fault for
              more than 5 minutes.
            summary: Longhorn volume {{$labels.volume}} is Fault
          # A robustness value of 3 means the volume is faulted.
          expr: longhorn_volume_robustness == 3
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} is Fault.
            severity: critical
        - alert: LonghornVolumeStatusWarning
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for
              more than 5 minutes.
            summary: Longhorn volume {{$labels.volume}} is Degraded
          # A robustness value of 2 means the volume is degraded.
          expr: longhorn_volume_robustness == 2
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} is Degraded.
            severity: warning
        - alert: LonghornNodeStorageWarning
          annotations:
            description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for
              more than 5 minutes.
            summary: The used storage of node is over 70% of the capacity.
          expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
          for: 5m
          labels:
            issue: The used storage of node {{$labels.node}} is high.
            severity: warning
        - alert: LonghornDiskStorageWarning
          annotations:
            description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
              more than 5 minutes.
            summary: The used storage of disk is over 70% of the capacity.
          expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
          for: 5m
          labels:
            issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
            severity: warning
        - alert: LonghornNodeDown
          annotations:
            description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
            summary: Longhorn nodes are offline
          # Total node count minus the number of ready nodes; each side falls back to 0 when it returns no data.
          expr: (avg(longhorn_node_count_total) or on() vector(0)) - (count(longhorn_node_status{condition="ready"} == 1) or on() vector(0)) > 0
          for: 5m
          labels:
            issue: There are {{$value}} Longhorn nodes offline.
            severity: critical
        - alert: LonghornInstanceManagerCPUUsageWarning
          annotations:
            description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has a CPU usage to CPU request ratio of {{$value}}% for
              more than 5 minutes.
            summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has a CPU usage to CPU request ratio over 300%.
          expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
          for: 5m
          labels:
            issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
            severity: warning
        - alert: LonghornNodeCPUUsageWarning
          annotations:
            description: Longhorn node {{$labels.node}} has a CPU usage to CPU capacity ratio of {{$value}}% for
              more than 5 minutes.
            summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
          expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
          for: 5m
          labels:
            issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
            severity: warning
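
The `prometheus: longhorn` and `role: alert-rules` labels in the manifest's metadata only take effect if a Prometheus resource selects them. The fragment below is a minimal sketch of such a selector, assuming the Prometheus Operator is in use. The resource name `longhorn-prometheus` is hypothetical, and a real Prometheus resource would likely need more fields (service account, alerting configuration, and so on) that are omitted here.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: longhorn-prometheus   # hypothetical name, not from the example above
  namespace: monitoring       # same namespace as the PrometheusRule
spec:
  # Load every PrometheusRule whose labels match the ones set on the
  # example rule's metadata.
  ruleSelector:
    matchLabels:
      prometheus: longhorn
      role: alert-rules
```

Because this sketch leaves `ruleNamespaceSelector` unset, only PrometheusRule objects in the same namespace as the Prometheus resource should be discovered, which is why both use `monitoring`.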

    For more information on how to define alert rules, see here.
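
As an illustration of building a custom rule from the same metrics, the entry below sketches a stricter tier of the disk usage alert. It reuses `longhorn_disk_usage_bytes` and `longhorn_disk_capacity_bytes` from the examples. The alert name `LonghornDiskStorageCritical`, the 90% threshold, and the `critical` severity are assumptions chosen for illustration, not recommended values.

```yaml
# Hypothetical extra entry for the `rules:` list of the longhorn.rules group.
- alert: LonghornDiskStorageCritical   # hypothetical alert name
  annotations:
    description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
      more than 5 minutes.
    summary: The used storage of disk is over 90% of the capacity.
  # Same ratio as LonghornDiskStorageWarning, with an assumed 90% threshold.
  expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 90
  for: 5m
  labels:
    issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is critically high.
    severity: critical
```

A disk above 90% also satisfies the 70% warning expression, so both alerts would fire together unless the Alertmanager configuration inhibits the lower severity.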