Logging using LokiStack

In the logging subsystem documentation, LokiStack refers to the logging subsystem's supported combination of Loki and a web proxy with OKD authentication integration. LokiStack's proxy uses OKD authentication to enforce multi-tenancy. Loki refers to the log store itself, either as the individual component or as an external store.

Loki is a horizontally scalable, highly available, multi-tenant log aggregation system currently offered as an alternative to Elasticsearch as a log store for the logging subsystem. Elasticsearch indexes incoming log records completely during ingestion. Loki only indexes a few fixed labels during ingestion and defers more complex parsing until after the logs have been stored. This means Loki can collect logs more quickly. You can query Loki by using the LogQL log query language.
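For example, the following LogQL queries are illustrative sketches. The label names (log_type, kubernetes_namespace_name) match those used in the retention examples in this document, but confirm which labels are available in your deployment:

```text
{log_type="application"}
{log_type="infrastructure"} |= "error"
{kubernetes_namespace_name=~"test.+"}
```

The first query selects a stream by label, the second adds a line filter that keeps only entries containing "error", and the third matches namespaces by regular expression.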

Deployment Sizing

Sizing for Loki follows the format of <N>x.<size>, where the value <N> is the number of instances and <size> specifies performance capabilities.

1x.extra-small is for demo purposes only, and is not supported.

Table 1. Loki Sizing

|                          | 1x.extra-small | 1x.small           | 1x.medium          |
|--------------------------|----------------|--------------------|--------------------|
| Data transfer            | Demo use only. | 500GB/day          | 2TB/day            |
| Queries per second (QPS) | Demo use only. | 25-50 QPS at 200ms | 25-75 QPS at 200ms |
| Replication factor       | None           | 2                  | 3                  |
| Total CPU requests       | 5 vCPUs        | 36 vCPUs           | 54 vCPUs           |
| Total memory requests    | 7.5Gi          | 63Gi               | 139Gi              |
| Total disk requests      | 150Gi          | 300Gi              | 450Gi              |
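As a rough, illustrative cross-check (assuming decimal units and a perfectly even load, which real clusters do not have), the daily data-transfer figures translate to sustained average ingestion rates:

```shell
# Convert the table's daily transfer figures to average MB/s.
# 86400 = seconds per day; integer division, so results are approximate.
echo "1x.small:  ~$((500 * 1000 / 86400)) MB/s average"    # 500 GB/day
echo "1x.medium: ~$((2000 * 1000 / 86400)) MB/s average"   # 2 TB/day
```

Peak rates can be far higher than the daily average, so treat these numbers as a floor when estimating capacity.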

Supported API Custom Resource Definitions

LokiStack development is ongoing; not all APIs are currently supported.

| CustomResourceDefinition (CRD) | ApiVersion                             | Support state      |
|--------------------------------|----------------------------------------|--------------------|
| LokiStack                      | lokistack.loki.grafana.com/v1          | Supported in 5.5   |
| RulerConfig                    | rulerconfig.loki.grafana.com/v1beta1   | Technology Preview |
| AlertingRule                   | alertingrule.loki.grafana.com/v1beta1  | Technology Preview |
| RecordingRule                  | recordingrule.loki.grafana.com/v1beta1 | Technology Preview |

Usage of the RulerConfig, AlertingRule, and RecordingRule custom resource definitions (CRDs) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Deploying the LokiStack

You can use the OKD web console to deploy the LokiStack.

Prerequisites

  • Logging subsystem for Red Hat OpenShift Operator 5.5 and later

  • Supported Log Store (AWS S3, Google Cloud Storage, Azure, Swift, Minio, OpenShift Data Foundation)

Procedure

  1. Install the Loki Operator:

    1. In the OKD web console, click Operators → OperatorHub.

    2. Choose Loki Operator from the list of available Operators, and click Install.

    3. Under Installation Mode, select All namespaces on the cluster.

    4. Under Installed Namespace, select openshift-operators-redhat.

      You must specify the openshift-operators-redhat namespace. The openshift-operators namespace might contain Community Operators, which are untrusted and might publish a metric with the same name as an OKD metric, which would cause conflicts.

    5. Select Enable operator recommended cluster monitoring on this namespace.

      This option sets the openshift.io/cluster-monitoring: "true" label in the Namespace object. You must select this option to ensure that cluster monitoring scrapes the openshift-operators-redhat namespace.

    6. Select an Approval Strategy.

      • The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.

      • The Manual strategy requires a user with appropriate credentials to approve the Operator update.

    7. Click Install.

    8. Verify that you installed the Loki Operator. Visit the Operators → Installed Operators page and look for Loki Operator.

    9. Ensure that Loki Operator is listed with Status as Succeeded in all the projects.

  2. Create a Secret YAML file that uses the access_key_id and access_key_secret fields to specify your AWS credentials and the bucketnames, endpoint, and region fields to define the object storage location. For example:

    apiVersion: v1
    kind: Secret
    metadata:
      name: logging-loki-s3
      namespace: openshift-logging
    stringData:
      access_key_id: AKIAIOSFODNN7EXAMPLE
      access_key_secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      bucketnames: s3-bucket-name
      endpoint: https://s3.eu-central-1.amazonaws.com
      region: eu-central-1
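As a sketch, you can generate this Secret manifest from shell variables instead of hand-editing the YAML. The credential values below are the placeholder examples from this procedure, not real keys, and the final oc apply step assumes a reachable cluster:

```shell
# Generate the object storage Secret manifest from variables.
# BUCKET and REGION are placeholder values; substitute your own.
BUCKET="s3-bucket-name"
REGION="eu-central-1"
cat > logging-loki-s3.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: logging-loki-s3
  namespace: openshift-logging
stringData:
  access_key_id: AKIAIOSFODNN7EXAMPLE
  access_key_secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  bucketnames: ${BUCKET}
  endpoint: https://s3.${REGION}.amazonaws.com
  region: ${REGION}
EOF
# Then apply it: oc apply -f logging-loki-s3.yaml
```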
  3. Create the LokiStack custom resource (CR):

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      size: 1x.small
      storage:
        schemas:
        - version: v12
          effectiveDate: "2022-06-01"
        secret:
          name: logging-loki-s3
          type: s3
      storageClassName: gp3-csi (1)
      tenants:
        mode: openshift-logging

    (1) Or gp2-csi.
  4. Apply the LokiStack CR:

    $ oc apply -f logging-loki.yaml
  5. Create a ClusterLogging custom resource (CR):

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      managementState: Managed
      logStore:
        type: lokistack
        lokistack:
          name: logging-loki
      collection:
        type: vector
  6. Apply the ClusterLogging CR:

    $ oc apply -f cr-lokistack.yaml
  7. Enable the Red Hat OpenShift Logging Console Plugin:

    1. In the OKD web console, click Operators → Installed Operators.

    2. Select the Red Hat OpenShift Logging Operator.

    3. Under Console plugin, click Disabled.

    4. Select Enable and then Save. This change restarts the openshift-console pods.

    5. After the pods restart, you will receive a notification that a web console update is available, prompting you to refresh.

    6. After refreshing the web console, click Observe from the left main menu. A new option for Logs is available.

Enabling stream-based retention with Loki

With Logging version 5.6 and higher, you can configure retention policies based on log streams. Rules for these may be set globally, per tenant, or both. If you configure both, tenant rules apply before global rules.

  1. To enable stream-based retention, create a LokiStack custom resource (CR):

    Example global stream-based retention

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      limits:
        global: (1)
          retention: (2)
            days: 20
            streams:
            - days: 4
              priority: 1
              selector: '{kubernetes_namespace_name=~"test.+"}' (3)
            - days: 1
              priority: 1
              selector: '{log_type="infrastructure"}'
      managementState: Managed
      replicationFactor: 1
      size: 1x.small
      storage:
        schemas:
        - effectiveDate: "2020-10-11"
          version: v11
        secret:
          name: logging-loki-s3
          type: aws
      storageClassName: standard
      tenants:
        mode: openshift-logging

    (1) Sets the retention policy for all log streams. Note: This field does not impact the retention period for stored logs in object storage.
    (2) Retention is enabled in the cluster when this block is added to the CR.
    (3) Contains the LogQL query used to define the log stream.

    Example per-tenant stream-based retention

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      limits:
        global:
          retention:
            days: 20
        tenants: (1)
          application:
            retention:
              days: 1
              streams:
              - days: 4
                selector: '{kubernetes_namespace_name=~"test.+"}' (2)
          infrastructure:
            retention:
              days: 5
              streams:
              - days: 1
                selector: '{kubernetes_namespace_name=~"openshift-cluster.+"}'
      managementState: Managed
      replicationFactor: 1
      size: 1x.small
      storage:
        schemas:
        - effectiveDate: "2020-10-11"
          version: v11
        secret:
          name: logging-loki-s3
          type: aws
      storageClassName: standard
      tenants:
        mode: openshift-logging

    (1) Sets the retention policy by tenant. Valid tenant types are application, audit, and infrastructure.
    (2) Contains the LogQL query used to define the log stream.
  2. Apply the LokiStack CR:

    $ oc apply -f <filename>.yaml

Stream-based retention does not manage the retention of stored logs. Global retention periods for stored logs, up to a supported maximum of 30 days, are configured with your object storage.

Forwarding logs to LokiStack

To configure log forwarding to the LokiStack gateway, you must create a ClusterLogging custom resource (CR).

Prerequisites

  • The Logging subsystem for Red Hat OpenShift version 5.5 or newer is installed on your cluster.

  • The Loki Operator is installed on your cluster.

Procedure

  • Create a ClusterLogging custom resource (CR):

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      managementState: Managed
      logStore:
        type: lokistack
        lokistack:
          name: logging-loki
      collection:
        type: vector

Troubleshooting Loki rate limit errors

If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.

These errors can occur during normal operation. For example, when adding the logging subsystem to a cluster that already has some logs, rate limit errors might occur while the logging subsystem tries to ingest all of the existing log entries. In this case, if the rate of addition of new logs is less than the total rate limit, the historical data is eventually ingested, and the rate limit errors are resolved without requiring user intervention.
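This recovery behavior can be illustrated with rough arithmetic; all of the figures below are hypothetical:

```shell
# Hypothetical scenario: a 10 GiB backlog of historical logs, a 4 MiB/s
# Loki ingestion limit, and 1 MiB/s of newly produced logs.
BACKLOG_MIB=$((10 * 1024))
LIMIT_MIB_S=4
NEW_MIB_S=1
# The backlog drains at the spare capacity (limit minus new-log rate),
# so the 429 errors stop after roughly this long:
SPARE_MIB_S=$((LIMIT_MIB_S - NEW_MIB_S))
DRAIN_MIN=$((BACKLOG_MIB / SPARE_MIB_S / 60))
echo "backlog drains in ~${DRAIN_MIN} minutes"   # ~56 minutes
```

If new logs arrive faster than the limit, the spare capacity is zero or negative and the backlog never drains; that is the case where you must raise the limits as described below.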

In cases where the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).

The LokiStack CR is not available on Grafana-hosted Loki. This topic does not apply to Grafana-hosted Loki servers.

Conditions

  • The Log Forwarder API is configured to forward logs to Loki.

  • Your system sends a block of messages that is larger than 2 MB to Loki. For example:

    "values":[["1630410392689800468","{\"kind\":\"Event\",\"apiVersion\":\
    .......
    ......
    ......
    ......
    \"received_at\":\"2021-08-31T11:46:32.800278+00:00\",\"version\":\"1.7.4 1.6.0\"}},\"@timestamp\":\"2021-08-31T11:46:32.799692+00:00\",\"viaq_index_name\":\"audit-write\",\"viaq_msg_id\":\"MzFjYjJkZjItNjY0MC00YWU4LWIwMTEtNGNmM2E5ZmViMGU4\",\"log_type\":\"audit\"}"]]}]}
  • After you enter oc logs -n openshift-logging -l component=collector, the collector logs in your cluster show a line containing one of the following error messages:

    429 Too Many Requests Ingestion rate limit exceeded

    Example Vector error message

    2023-08-25T16:08:49.301780Z WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true

    Example Fluentd error message

    2023-08-30 14:52:15 +0000 [warn]: [default_loki_infra] failed to flush the buffer. retry_times=2 next_retry_time=2023-08-30 14:52:19 +0000 chunk="604251225bf5378ed1567231a1c03b8b" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded for user infrastructure (limit: 4194304 bytes/sec) while attempting to ingest '4082' lines totaling '7820025' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"

    The error is also visible on the receiving end. For example, in the LokiStack ingester pod:

    Example Loki ingester error message

    level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream

Procedure

  • Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR:

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      limits:
        global:
          ingestion:
            ingestionBurstSize: 16 (1)
            ingestionRate: 8 (2)
    # ...

    (1) The ingestionBurstSize field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum logs size expected in a single push request. Single requests that are larger than the ingestionBurstSize value are not permitted.
    (2) The ingestionRate field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and errors are resolved without user intervention.
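As a cross-check on units: these fields are expressed in MB, while Loki enforces the limits in bytes per second. The default 4 MB rate corresponds to the 4194304 bytes/sec figure in the Fluentd error message earlier in this topic (the arithmetic below assumes 1 MB = 1,048,576 bytes, which is consistent with that error message):

```shell
# Convert the MB-denominated LokiStack limits to the byte values Loki enforces.
MB=$((1024 * 1024))
echo "default rate limit:     $((4 * MB)) bytes/sec"       # 4194304, as in the 429 error
echo "new ingestionRate:      $((8 * MB)) bytes/sec"
echo "new ingestionBurstSize: $((16 * MB)) bytes per push"
```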

Additional Resources