Alerting rules

Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service. Whenever the alert expression results in one or more vector elements at a given point in time, the alert counts as active for these elements' label sets.

Defining alerting rules

Alerting rules are configured in Prometheus in the same way as recording rules.

An example rules file with an alert would be:

  groups:
  - name: example
    rules:
    - alert: HighRequestLatency
      expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
      for: 10m
      labels:
        severity: page
      annotations:
        summary: High request latency

The optional for clause causes Prometheus to wait for a certain duration between first encountering a new expression output vector element and counting an alert as firing for this element. In this case, Prometheus will check that the alert continues to be active during each evaluation for 10 minutes before firing the alert. Elements that are active, but not firing yet, are in the pending state.

The labels clause allows specifying a set of additional labels to be attached to the alert. Any existing conflicting labels will be overwritten. The label values can be templated.
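
For example, a label value can be derived from the alert's data using the same template syntax as annotations. A minimal sketch; the threshold and the resulting severity values here are purely illustrative:

  labels:
    # Hypothetical templated label: derive severity from the alert's value
    severity: "{{ if gt $value 1.0 }}critical{{ else }}warning{{ end }}"

Since labels form part of an alert's identity, templated label values are best kept to coarse buckets; values that change between evaluations would produce a churn of distinct alerts.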

The annotations clause specifies a set of informational labels that can be used to store longer additional information such as alert descriptions or runbook links. The annotation values can be templated.
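
Rule files themselves are not picked up automatically; as with recording rules, they have to be listed under rule_files in the main Prometheus configuration. A minimal sketch, assuming the group above is saved as alert_rules.yml (the file name is arbitrary):

  global:
    # How frequently rules are evaluated (1m is the default):
    evaluation_interval: 1m

  rule_files:
    - "alert_rules.yml"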

Templating

Label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance. The configured external labels can be accessed via the $externalLabels variable. The $value variable holds the evaluated value of an alert instance.

  # To insert a firing element's label values:
  {{ $labels.<labelname> }}
  # To insert the numeric expression value of the firing element:
  {{ $value }}

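The $externalLabels variable works the same way and can be useful for identifying which Prometheus server an alert came from. A small sketch, assuming an external label named region has been configured:

  # To insert the value of a configured external label (here: region):
  {{ $externalLabels.region }}
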
Examples:

  groups:
  - name: example
    rules:

    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
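
Rule files can be syntax-checked without starting a Prometheus server by using the promtool utility that ships with Prometheus; the file path below is just a placeholder:

  promtool check rules /path/to/alert_rules.yml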

Inspecting alerts during runtime

To manually inspect which alerts are active (pending or firing), navigate to the "Alerts" tab of your Prometheus instance. This will show you the exact label sets for which each defined alert is currently active.

For pending and firing alerts, Prometheus also stores synthetic time series of the form ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}. The sample value is set to 1 as long as the alert is in the indicated active (pending or firing) state, and the series is marked stale when this is no longer the case.
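
These synthetic series can be queried like any other metric, which can be handy for ad-hoc inspection; a small example query (the aggregation chosen here is just illustrative):

  # Number of currently firing alerts per alert name:
  count by (alertname) (ALERTS{alertstate="firing"})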

Sending alert notifications

Prometheus's alerting rules are good at figuring out what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing and alert dependencies on top of the simple alert definitions. In Prometheus's ecosystem, the Alertmanager takes on this role. Thus, Prometheus may be configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications. Prometheus can be configured to automatically discover available Alertmanager instances through its service discovery integrations.
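
A minimal sketch of the corresponding prometheus.yml section, assuming a single statically configured Alertmanager reachable at alertmanager.example.org:9093 (a hypothetical endpoint); service discovery mechanisms can be used in place of static_configs:

  alerting:
    alertmanagers:
      - static_configs:
          # Hypothetical Alertmanager endpoint:
          - targets:
              - "alertmanager.example.org:9093"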