Recording rules

A consistent naming scheme for recording rules makes it easier to interpret the meaning of a rule at a glance. It also avoids mistakes by making incorrect or meaningless calculations stand out.

This page documents how to aggregate correctly and suggests a naming convention.

Naming and aggregation

Recording rules should be of the general form level:metric:operations. level represents the aggregation level and labels of the rule output. metric is the metric name and should be unchanged other than stripping _total off counters when using rate() or irate(). operations is a list of operations that were applied to the metric, newest operation first.

Keeping the metric name unchanged makes it easy to know what a metric is and easy to find in the codebase.
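As a minimal sketch of how the three parts of a name map onto a rule, consider the fragment below. The counter api_errors_total is a hypothetical metric, assumed here to carry only job and instance labels; it is not taken from any real exporter.

```yaml
# Hypothetical counter: api_errors_total{job, instance}
#
#   level      = instance     (the label that still distinguishes output series)
#   metric     = api_errors   (api_errors_total with _total stripped, because rate() is applied)
#   operations = rate5m       (rate() over a 5 minute window)
- record: instance:api_errors:rate5m
  expr: rate(api_errors_total{job="myjob"}[5m])
```

Searching the codebase for api_errors still leads to the instrumented counter, which is the point of leaving the metric name otherwise untouched.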

To keep the operations list clean, omit _sum (that is, sum()) when there are other operations. Associative operations can be merged (for example, min_min is the same as min).
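The merging of associative operations can be sketched as follows. The gauge free_bytes and its device label are hypothetical and exist only for illustration.

```yaml
# Hypothetical gauge: free_bytes{job, instance, device}

# First aggregation: minimum across devices on each instance.
- record: instance:free_bytes:min
  expr: min without (device)(free_bytes{job="myjob"})

# Second aggregation: the minimum of a minimum is still a minimum,
# so the operation stays "min" rather than becoming "min_min".
- record: job:free_bytes:min
  expr: min without (instance)(instance:free_bytes:min{job="myjob"})
```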

If there is no obvious operation to use, use sum. When taking a ratio by doing division, separate the metrics using _per_ and call the operation ratio.

When aggregating up ratios, aggregate up the numerator and denominator separately and then divide. Do not take the average of a ratio or average of an average as that is not statistically valid.
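A small worked case shows why averaging ratios goes wrong. The numbers are invented for illustration, and the commented-out rule is a sketch of the pattern to avoid, not a rule from any real configuration.

```yaml
# Made-up per-second rates for two instances serving the same path:
#   instance a:  1 failure/s  out of  10 requests/s  -> ratio 0.10
#   instance b: 50 failures/s out of 100 requests/s  -> ratio 0.50
#
# Average of the ratios:      (0.10 + 0.50) / 2    = 0.30
# Ratio of the summed rates:  (1 + 50) / (10 + 100) = 51 / 110, about 0.46
#
# The average treats the lightly loaded instance as if it carried half the
# traffic, so it understates the failure ratio actually seen by clients.

# Avoid: averaging ratios that have already been divided.
# - record: path:request_failures_per_requests:ratio_rate5m
#   expr: avg without (instance)(instance_path:request_failures_per_requests:ratio_rate5m{job="myjob"})
```

The form to prefer sums the numerator and the denominator separately before dividing, as in the failure ratio rules in the Examples section.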

When aggregating up the _count and _sum of a Summary and dividing to calculate average observation size, treating it as a ratio would be unwieldy. Instead keep the metric name without the _count or _sum suffix and replace the rate in the operation with mean. This represents the average observation size over that time period.

Always specify a without clause with the labels you are aggregating away. This preserves all the other labels, such as job, which avoids conflicts and gives you more useful metrics and alerts.
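The difference between the two aggregation modifiers can be sketched with one rule written both ways. The second form is shown commented out as the pattern to avoid.

```yaml
# Preferred: name only the label being aggregated away. job, path and any
# other labels on the input survive on the output series.
- record: path:requests:rate5m
  expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})

# Avoid: "by" keeps only the labels listed, so job and every other label
# not named here is dropped from the output.
# - record: path:requests:rate5m
#   expr: sum by (path)(instance_path:requests:rate5m{job="myjob"})
```

With the by form, the same rule applied to a second job would write series with identical label sets, which is the kind of conflict the without clause avoids.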

Examples

Note the indentation style with outdented operators on their own line between two vectors. To make this style possible in YAML, block scalars with an indentation indicator (e.g. |2) are used.

Aggregating up requests per second for a metric that has a path label:

    - record: instance_path:requests:rate5m
      expr: rate(requests_total{job="myjob"}[5m])

    - record: path:requests:rate5m
      expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})
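The rule entries on this page are fragments. As a rough sketch of where they sit, the same two rules are shown below inside a complete rules file; the group name myjob_requests is a made-up placeholder.

```yaml
# Sketch of a complete rules file. The group name is hypothetical.
groups:
  - name: myjob_requests
    rules:
      # Per-instance, per-path request rate.
      - record: instance_path:requests:rate5m
        expr: rate(requests_total{job="myjob"}[5m])

      # Aggregated over instances. Listed after the rule it reads from.
      - record: path:requests:rate5m
        expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})
```

Rules within a group are evaluated in order, so placing a rule after the one it depends on is a reasonable default. A file like this can be syntax-checked with promtool check rules before it is loaded.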

Calculating a request failure ratio and aggregating up to the job-level failure ratio:

    - record: instance_path:request_failures:rate5m
      expr: rate(request_failures_total{job="myjob"}[5m])

    - record: instance_path:request_failures_per_requests:ratio_rate5m
      expr: |2
          instance_path:request_failures:rate5m{job="myjob"}
        /
          instance_path:requests:rate5m{job="myjob"}

    # Aggregate up numerator and denominator, then divide to get path-level ratio.
    - record: path:request_failures_per_requests:ratio_rate5m
      expr: |2
          sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
        /
          sum without (instance)(instance_path:requests:rate5m{job="myjob"})

    # No labels left from instrumentation or distinguishing instances,
    # so we use 'job' as the level.
    - record: job:request_failures_per_requests:ratio_rate5m
      expr: |2
          sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
        /
          sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})

Calculating average latency over a time period from a Summary:

    - record: instance_path:request_latency_seconds_count:rate5m
      expr: rate(request_latency_seconds_count{job="myjob"}[5m])

    - record: instance_path:request_latency_seconds_sum:rate5m
      expr: rate(request_latency_seconds_sum{job="myjob"}[5m])

    - record: instance_path:request_latency_seconds:mean5m
      expr: |2
          instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
        /
          instance_path:request_latency_seconds_count:rate5m{job="myjob"}

    # Aggregate up numerator and denominator, then divide.
    - record: path:request_latency_seconds:mean5m
      expr: |2
          sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
        /
          sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})

Calculating the average query rate across instances and paths is done using the avg() function:

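    # The input is the instance_path-level rule recorded above; removing
    # instance and path leaves 'job' as the level.
    - record: job:request_latency_seconds_count:avg_rate5m
      expr: avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})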

Notice that when aggregating, the labels in the without clause are removed from the level of the output metric name compared to the input metric names. When there is no aggregation, the levels always match. If this is not the case, a mistake has likely been made in the rules.