Custom logging alerts

In logging 5.7 and later versions, users can configure the LokiStack deployment to produce customized alerts and recorded metrics. If you want to use customized alerting and recording rules, you must enable the LokiStack ruler component.

LokiStack log-based alerts and recorded metrics are triggered by providing LogQL expressions to the ruler component. The Loki Operator manages a ruler that is optimized for the selected LokiStack size, which can be 1x.extra-small, 1x.small, or 1x.medium.

The 1x.extra-small size is not supported. It is for demonstration purposes only.

To provide these expressions, you must create an AlertingRule custom resource (CR) containing Prometheus-compatible alerting rules, or a RecordingRule CR containing Prometheus-compatible recording rules.
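
A RecordingRule CR follows the same general layout as the AlertingRule examples shown later in this section, except that each rule defines a record name instead of an alert. The following is a minimal sketch; the metric name, namespace, and label value are placeholders:

    Example RecordingRule CR

    apiVersion: loki.grafana.com/v1
    kind: RecordingRule
    metadata:
      name: app-user-workload-records
      namespace: app-ns
      labels:
        openshift.io/<label_name>: "true"
    spec:
      tenantID: "application"
      groups:
        - name: AppUserWorkloadErrorRate
          rules:
            - record: myapp:error_lines:rate1m
              expr: |
                sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job)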

Administrators can configure log-based alerts or recorded metrics for application, audit, or infrastructure tenants. Users without administrator permissions can configure log-based alerts or recorded metrics only for the application tenant, and only for the applications that they have access to.

Application, audit, and infrastructure alerts are sent by default to the OKD monitoring stack Alertmanager in the openshift-monitoring namespace, unless you have disabled the local Alertmanager instance. If the Alertmanager that is used to monitor user-defined projects in the openshift-user-workload-monitoring namespace is enabled, application alerts are sent to the Alertmanager in this namespace by default.
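
Routing application alerts to the Alertmanager for user-defined projects requires that Alertmanager instance to be enabled. The following is a minimal sketch of the monitoring configuration that enables it, assuming user workload monitoring is already enabled on the cluster:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        alertmanager:
          enabled: true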

Configuring the ruler

When the LokiStack ruler component is enabled, users can define a group of LogQL expressions that trigger logging alerts or recorded metrics.

Administrators can enable the ruler by modifying the LokiStack custom resource (CR).

Procedure

  • Enable the ruler by ensuring that the LokiStack CR contains the following spec configuration:

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: <name>
      namespace: <namespace>
    spec:
    # ...
      rules:
        enabled: true (1)
        selector:
          matchLabels:
            openshift.io/<label_name>: "true" (2)
        namespaceSelector:
          matchLabels:
            openshift.io/<label_name>: "true" (3)
    (1) Enable Loki alerting and recording rules in your cluster.
    (2) Add a custom label. The ruler loads only the AlertingRule and RecordingRule CRs that have this label.
    (3) Add a custom label. The ruler loads rules only from namespaces that have this label, so add it to the namespaces where you want to enable the use of logging alerts and metrics.
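
    For example, assuming a namespace named app-ns (a placeholder), you can add the selector label to it with a command similar to the following so that the ruler loads rules from that namespace:

    $ oc label namespace app-ns openshift.io/<label_name>="true"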

Authorizing Loki rules RBAC permissions

Administrators can allow users to create and manage their own alerting rules by creating a ClusterRole object and binding this role to usernames. The ClusterRole object defines the necessary role-based access control (RBAC) permissions for users.

Prerequisites

  • The Cluster Logging Operator is installed in the openshift-logging namespace.

  • You have administrator permissions.

Procedure

  1. Create a cluster role that defines the necessary RBAC permissions.
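
     For example, a cluster role similar to the following sketch lets users manage AlertingRule and RecordingRule CRs. The role name alertingrules-editor is a placeholder, and you can remove verbs to restrict access further:

    Example cluster role

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: alertingrules-editor
    rules:
      - apiGroups:
          - loki.grafana.com
        resources:
          - alertingrules
          - recordingrules
        verbs:
          - create
          - get
          - list
          - watch
          - update
          - patch
          - delete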

  2. Bind the appropriate cluster roles to the username:

    Example binding command

    $ oc adm policy add-role-to-user <cluster_role_name> -n <namespace> <username>
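
    For example, to allow the hypothetical user user1 to manage rules in the app-ns namespace by using the cluster role sketched in the previous step:

    $ oc adm policy add-role-to-user alertingrules-editor -n app-ns user1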

Creating a log-based alerting rule with Loki

The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:

  • If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule.

  • If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.

  • If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.

  • If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.

  • If none of the above applies, the alerting rule is considered valid.

Tenant type      Valid namespaces for AlertingRule CRs

application      <your_application_namespace>
audit            openshift-logging
infrastructure   openshift-*, kube-*, default

Prerequisites

  • Logging subsystem for Red Hat OpenShift Operator 5.7 and later

  • OKD 4.13 and later

Procedure

  1. Create an AlertingRule custom resource (CR):

    Example infrastructure AlertingRule CR

    apiVersion: loki.grafana.com/v1
    kind: AlertingRule
    metadata:
      name: loki-operator-alerts
      namespace: openshift-operators-redhat (1)
      labels: (2)
        openshift.io/<label_name>: "true"
    spec:
      tenantID: "infrastructure" (3)
      groups:
        - name: LokiOperatorHighReconciliationError
          rules:
            - alert: HighPercentageError
              expr: | (4)
                sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
                  /
                sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
                  > 0.01
              for: 10s
              labels:
                severity: critical (5)
              annotations:
                summary: High Loki Operator Reconciliation Errors (6)
                description: High Loki Operator Reconciliation Errors (7)
    (1) The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    (2) The labels block must match the LokiStack spec.rules.selector definition.
    (3) AlertingRule CRs for infrastructure tenants are only supported in the openshift-*, kube-*, or default namespaces.
    (4) The value for kubernetes_namespace_name: must match the value for metadata.namespace.
    (5) The value of this mandatory field must be critical, warning, or info.
    (6) This field is mandatory.
    (7) This field is mandatory.

    Example application AlertingRule CR

    apiVersion: loki.grafana.com/v1
    kind: AlertingRule
    metadata:
      name: app-user-workload
      namespace: app-ns (1)
      labels: (2)
        openshift.io/<label_name>: "true"
    spec:
      tenantID: "application"
      groups:
        - name: AppUserWorkloadHighError
          rules:
            - alert: <alert_name>
              expr: | (3)
                sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
              for: 10s
              labels:
                severity: critical (4)
              annotations:
                summary: (5)
                description: (6)
    (1) The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    (2) The labels block must match the LokiStack spec.rules.selector definition.
    (3) The value for kubernetes_namespace_name: must match the value for metadata.namespace.
    (4) The value of this mandatory field must be critical, warning, or info.
    (5) The value of this mandatory field is a summary of the rule.
    (6) The value of this mandatory field is a detailed description of the rule.
  2. Apply the AlertingRule CR:

    $ oc apply -f <filename>.yaml
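
    Optionally, verify that the CR was created by listing AlertingRule objects in the target namespace, for example:

    $ oc get alertingrules.loki.grafana.com -n <namespace>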

Additional resources