Configuring your logging deployment

You can configure your logging subsystem deployment with Custom Resource (CR) YAML files implemented by each Operator.

Red Hat OpenShift Logging Operator:

  • ClusterLogging (CL) - Deploys the collector and forwarder, which are currently both implemented by a daemonset running on each node.

  • ClusterLogForwarder (CLF) - Generates collector configuration to forward logs per user configuration.

Loki Operator:

  • LokiStack - Controls the Loki cluster as log store and the web proxy with OpenShift Container Platform authentication integration to enforce multi-tenancy.

OpenShift Elasticsearch Operator:

These CRs are generated and managed by the ClusterLogging Operator; manual changes are overwritten by the Operator.

  • Elasticsearch - Configure and deploy an Elasticsearch instance as the default log store.

  • Kibana - Configure and deploy a Kibana instance to search, query, and view logs.

The supported way of configuring the logging subsystem for Red Hat OpenShift is by using the options described in this documentation. Do not use other configurations, as they are unsupported. Configuration paradigms might change across OpenShift Container Platform releases, and such cases can only be handled gracefully if all configuration possibilities are controlled. If you use configurations other than those described in this documentation, your changes will be overwritten, because the Operators reconcile any differences: by default and by design, the Operators revert everything to the defined state.

If you must perform configurations not described in the OpenShift Container Platform documentation, you must set your Red Hat OpenShift Logging Operator to Unmanaged. An unmanaged OpenShift Logging environment is not supported and does not receive updates until you return OpenShift Logging to Managed.
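As a minimal sketch, a ClusterLogging CR placed in the unmanaged state might look like the following (the instance name and namespace follow the defaults used elsewhere in this documentation):

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Unmanaged
```

Setting managementState back to Managed returns the deployment to Operator control and re-enables updates.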

Enabling stream-based retention with Loki

With Logging version 5.6 and higher, you can configure retention policies based on log streams. Rules for these may be set globally, per tenant, or both. If you configure both, tenant rules apply before global rules.

  1. To enable stream-based retention, create or edit the LokiStack custom resource (CR):

     oc create -f <file-name>.yaml

     Refer to the examples below to configure your LokiStack CR.

Example global stream-based retention

  apiVersion: loki.grafana.com/v1
  kind: LokiStack
  metadata:
    name: logging-loki
    namespace: openshift-logging
  spec:
    limits:
      global: (1)
        retention: (2)
          days: 20
          streams:
          - days: 4
            priority: 1
            selector: '{kubernetes_namespace_name=~"test.+"}' (3)
          - days: 1
            priority: 1
            selector: '{log_type="infrastructure"}'
    managementState: Managed
    replicationFactor: 1
    size: 1x.small
    storage:
      schemas:
      - effectiveDate: "2020-10-11"
        version: v11
      secret:
        name: logging-loki-s3
        type: aws
    storageClassName: standard
    tenants:
      mode: openshift-logging
(1) Sets the retention policy for all log streams. Note: This field does not impact the retention period for stored logs in object storage.
(2) Retention is enabled in the cluster when this block is added to the CR.
(3) Contains the LogQL query used to define the log stream.

Example per-tenant stream-based retention

  apiVersion: loki.grafana.com/v1
  kind: LokiStack
  metadata:
    name: logging-loki
    namespace: openshift-logging
  spec:
    limits:
      global:
        retention:
          days: 20
      tenants: (1)
        application:
          retention:
            days: 1
            streams:
            - days: 4
              selector: '{kubernetes_namespace_name=~"test.+"}' (2)
        infrastructure:
          retention:
            days: 5
            streams:
            - days: 1
              selector: '{kubernetes_namespace_name=~"openshift-cluster.+"}'
    managementState: Managed
    replicationFactor: 1
    size: 1x.small
    storage:
      schemas:
      - effectiveDate: "2020-10-11"
        version: v11
      secret:
        name: logging-loki-s3
        type: aws
    storageClassName: standard
    tenants:
      mode: openshift-logging
(1) Sets the retention policy by tenant. Valid tenant types are application, audit, and infrastructure.
(2) Contains the LogQL query used to define the log stream.
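The selector callouts above are LogQL stream selectors. As a rough illustration of how such a selector matches a stream's labels, here is a minimal Python sketch; it supports only the = and =~ matchers used in these examples, not the full LogQL grammar:

```python
import re

def selector_matches(selector: str, labels: dict) -> bool:
    """Rough illustration of LogQL stream selector matching.

    Supports only the = and =~ matchers used in the retention examples,
    not the full LogQL grammar (no !=, !~, quoting edge cases, or label
    values that contain commas).
    """
    body = selector.strip().lstrip("{").rstrip("}")
    for matcher in body.split(","):
        if "=~" in matcher:
            name, _, pattern = matcher.partition("=~")
            value = labels.get(name.strip(), "")
            # =~ is a regular-expression match over the label value
            if not re.fullmatch(pattern.strip().strip('"'), value):
                return False
        else:
            name, _, expected = matcher.partition("=")
            # = is an exact string match
            if labels.get(name.strip()) != expected.strip().strip('"'):
                return False
    return True

# Streams in namespaces matching "test.+" fall under the matching rule:
print(selector_matches('{kubernetes_namespace_name=~"test.+"}',
                       {"kubernetes_namespace_name": "test-app"}))  # True
print(selector_matches('{log_type="infrastructure"}',
                       {"log_type": "audit"}))  # False
```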
  2. Apply your configuration:

     oc apply -f <file-name>.yaml

Stream-based retention does not manage the retention of stored logs. Global retention for stored logs, up to a supported maximum of 30 days, is configured in your object storage.
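For example, if your object store is AWS S3, a bucket lifecycle configuration along the following lines expires stored objects after 30 days. The rule ID is illustrative, and the policy is applied outside the logging subsystem, for example with aws s3api put-bucket-lifecycle-configuration:

```json
{
  "Rules": [
    {
      "ID": "expire-loki-logs-after-30-days",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }
  ]
}
```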

Enabling multi-line exception detection

Enables multi-line error detection of container logs.

Enabling this feature could have performance implications and may require additional computing resources or alternate logging solutions.

Log parsers often incorrectly identify separate lines of the same exception as separate exceptions. This leads to extra log entries and an incomplete or inaccurate view of the traced information.

Example java exception

  java.lang.NullPointerException: Cannot invoke "String.toString()" because "<param1>" is null
      at testjava.Main.handle(Main.java:47)
      at testjava.Main.printMe(Main.java:19)
      at testjava.Main.main(Main.java:10)
  • To enable logging to detect multi-line exceptions and reassemble them into a single log entry, ensure that the ClusterLogForwarder Custom Resource (CR) contains a detectMultilineErrors field, with a value of true.

Example ClusterLogForwarder CR

  apiVersion: logging.openshift.io/v1
  kind: ClusterLogForwarder
  metadata:
    name: instance
    namespace: openshift-logging
  spec:
    pipelines:
      - name: my-app-logs
        inputRefs:
          - application
        outputRefs:
          - default
        detectMultilineErrors: true

Details

When log messages appear as a consecutive sequence forming an exception stack trace, they are combined into a single, unified log record. The first log message’s content is replaced with the concatenated content of all the message fields in the sequence.
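As a rough sketch of that reassembly behavior, using a simplified heuristic for Java continuation lines (the real collectors use per-language detection rules):

```python
import re

# Heuristic: Java stack-trace continuation lines start with "at ...",
# "Caused by:", or "... N more". This is only an illustration of the
# reassembly behavior described above, not the collectors' actual logic.
CONTINUATION = re.compile(r"^\s+at |^Caused by:|^\s+\.\.\. \d+ more")

def reassemble(lines):
    """Combine consecutive continuation lines into the record that
    started the sequence, so one exception becomes one log record."""
    records = []
    for line in lines:
        if records and CONTINUATION.match(line):
            records[-1] += "\n" + line
        else:
            records.append(line)
    return records

trace = [
    'java.lang.NullPointerException: Cannot invoke "String.toString()" because "<param1>" is null',
    "    at testjava.Main.handle(Main.java:47)",
    "    at testjava.Main.printMe(Main.java:19)",
    "    at testjava.Main.main(Main.java:10)",
    "next unrelated message",
]
combined = reassemble(trace)
print(len(combined))  # 2: one unified exception record plus the unrelated message
```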

Table 1. Supported languages per collector:
Language    Fluentd    Vector

Java

JS

Ruby

Python

Golang

PHP

Dart

Troubleshooting

When enabled, the collector configuration includes a new section with type: detect_exceptions.

Example vector configuration section

  [transforms.detect_exceptions_app-logs]
  type = "detect_exceptions"
  inputs = ["application"]
  languages = ["All"]
  group_by = ["kubernetes.namespace_name","kubernetes.pod_name","kubernetes.container_name"]
  expire_after_ms = 2000
  multiline_flush_interval_ms = 1000

Example fluentd config section

  <label @MULTILINE_APP_LOGS>
    <match kubernetes.**>
      @type detect_exceptions
      remove_tag_prefix 'kubernetes'
      message message
      force_line_breaks true
      multiline_flush_interval .2
    </match>
  </label>

Enabling log based alerts with Loki

Loki alerting rules use LogQL and follow Prometheus formatting. You can set log based alerts by creating an AlertingRule custom resource (CR). AlertingRule CRs may be created for application, audit, or infrastructure tenants.

Tenant type       Valid namespaces

application

audit             openshift-logging

infrastructure    openshift-*, kube-*, default

Application, Audit, and Infrastructure alerts are sent to the Cluster Monitoring Operator (CMO) Alertmanager in the openshift-monitoring namespace by default unless you have disabled the local Alertmanager instance.

By default, application alerts are not sent to the Alertmanager in the openshift-user-workload-monitoring namespace unless you have enabled a separate Alertmanager instance for user-defined projects.
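As a sketch of enabling that separate Alertmanager instance, assuming monitoring for user-defined projects is already enabled (see the cluster monitoring documentation for the authoritative settings):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```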

The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:

  • If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule.

  • If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.

  • If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.

  • If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.

  • If none of the above applies, the AlertingRule is considered a valid alerting rule.
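The validation conditions above can be sketched as follows, with plain dicts standing in for the parsed CR. The real checks run in the Operator's admission webhook; here durations are checked against simple Prometheus-style units (compound durations such as "1m30s" are omitted), and the LogQL expr check is reduced to non-emptiness:

```python
import re

# Simple single-unit Prometheus-style duration, e.g. "10s", "1m", "2h".
DURATION = re.compile(r"^\d+(ms|s|m|h|d|w|y)$")

def validate(spec: dict) -> list:
    """Return a list of validation errors; an empty list means the
    AlertingRule is considered valid."""
    errors = []
    seen_groups = set()
    for group in spec.get("groups", []):
        name = group.get("name")
        if name in seen_groups:
            errors.append(f"duplicate group name: {name}")
        seen_groups.add(name)
        interval = group.get("interval")
        if interval is not None and not DURATION.match(interval):
            errors.append(f"invalid interval period: {interval}")
        for rule in group.get("rules", []):
            if "for" in rule and not DURATION.match(rule["for"]):
                errors.append(f"invalid for period: {rule['for']}")
            if not rule.get("expr"):
                # The real webhook parses the LogQL expression; this
                # sketch only rejects a missing or empty expr.
                errors.append("invalid LogQL expr: empty")
    return errors

spec = {"groups": [
    {"name": "g1", "interval": "1m",
     "rules": [{"alert": "A", "expr": 'sum(rate({log_type="audit"}[1m])) > 0', "for": "10s"}]},
    {"name": "g1", "rules": [{"alert": "B", "expr": "", "for": "soon"}]},
]}
print(validate(spec))
```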

Prerequisites

  • Logging subsystem for Red Hat OpenShift Operator 5.7 and later

  • OKD 4.13 and later

Procedure

  1. Create an AlertingRule CR:

     oc create -f <file-name>.yaml
  2. Populate your AlertingRule CR using the appropriate example below:

    Example infrastructure AlertingRule CR

    apiVersion: loki.grafana.com/v1
    kind: AlertingRule
    metadata:
      name: loki-operator-alerts
      namespace: openshift-operators-redhat (1)
      labels: (2)
        openshift.io/cluster-monitoring: "true"
    spec:
      tenantID: "infrastructure" (3)
      groups:
        - name: LokiOperatorHighReconciliationError
          rules:
            - alert: HighPercentageError
              expr: | (4)
                sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
                  /
                sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
                  > 0.01
              for: 10s
              labels:
                severity: critical (5)
              annotations:
                summary: High Loki Operator Reconciliation Errors (6)
                description: High Loki Operator Reconciliation Errors (7)
    (1) The namespace where this AlertingRule is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    (2) The labels block must match the LokiStack spec.rules.selector definition.
    (3) AlertingRules for infrastructure tenants are only supported in the openshift-*, kube-*, or default namespaces.
    (4) The value for kubernetes_namespace_name: must match the value for metadata.namespace.
    (5) Mandatory field. Must be critical, warning, or info.
    (6) Mandatory field.
    (7) Mandatory field.

    Example application AlertingRule CR

    apiVersion: loki.grafana.com/v1
    kind: AlertingRule
    metadata:
      name: app-user-workload
      namespace: app-ns (1)
      labels: (2)
        openshift.io/cluster-monitoring: "true"
    spec:
      tenantID: "application"
      groups:
        - name: AppUserWorkloadHighError
          rules:
            - alert:
              expr: | (3)
                sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
              for: 10s
              labels:
                severity: critical (4)
              annotations:
                summary: (5)
                description: (6)
    (1) The namespace where this AlertingRule is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    (2) The labels block must match the LokiStack spec.rules.selector definition.
    (3) The value for kubernetes_namespace_name: must match the value for metadata.namespace.
    (4) Mandatory field. Must be critical, warning, or info.
    (5) Mandatory field. Summary of the rule.
    (6) Mandatory field. Detailed description of the rule.
  3. Apply the CR:

     oc apply -f <file-name>.yaml