Understanding cluster logging alerts

All of the logging collector alerts are listed on the Alerting UI of the OKD web console.

Viewing logging collector alerts

Alerts are shown in the OKD web console, on the Alerts tab of the Alerting UI. Alerts are in one of the following states:

  • Firing. The alert condition is true for the duration of the timeout. Click the Options menu at the end of the firing alert to view more information or silence the alert.

  • Pending. The alert condition is currently true, but the timeout has not been reached.

  • Not Firing. The alert is not currently triggered.
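The three states above can be read as a small decision rule. The following is a minimal sketch of that rule, not the implementation used by OKD or Prometheus; the function name `alert_state` and its parameters are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def alert_state(
    condition_true: bool,
    active_since: Optional[datetime],
    now: datetime,
    timeout: timedelta,
) -> str:
    """Map an alert condition and its duration to one of the three states.

    condition_true: whether the alert condition currently holds.
    active_since:   when the condition first became true (None if it is not true).
    timeout:        how long the condition must hold before the alert fires.
    """
    if not condition_true or active_since is None:
        return "Not Firing"
    if now - active_since >= timeout:
        # The condition has held for the whole timeout.
        return "Firing"
    # The condition is true, but the timeout has not been reached yet.
    return "Pending"
```

For example, with a 10 minute timeout, a condition that became true 3 minutes ago yields Pending, and the same condition yields Firing once 10 minutes have passed.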

Procedure

To view cluster logging and other OKD alerts:

  1. In the OKD console, click Monitoring → Alerting.

  2. Click the Alerts tab. The alerts are listed, based on the filters selected.
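If you prefer to inspect alerts outside the console, the Prometheus HTTP API exposes active alerts at /api/v1/alerts. The sketch below only parses a response of that shape and groups alert names by state. The sample payload is made up for illustration, and how you reach Prometheus in your cluster (route, token) is not covered here.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like a Prometheus /api/v1/alerts response.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "FluentdNodeDown", "severity": "critical"},
       "state": "firing"},
      {"labels": {"alertname": "FluentdQueueLengthBurst", "severity": "warning"},
       "state": "pending"}
    ]
  }
}
"""


def group_alerts_by_state(payload: dict) -> dict:
    """Return a mapping of alert state to the list of alert names in that state."""
    grouped = defaultdict(list)
    for alert in payload.get("data", {}).get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "<unnamed>")
        grouped[alert.get("state", "unknown")].append(name)
    return dict(grouped)


grouped = group_alerts_by_state(json.loads(SAMPLE_RESPONSE))
for state, names in grouped.items():
    print(state, names)
```

Alerts that are not currently triggered do not appear in this response, so only the firing and pending states show up in the grouping.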

About logging collector alerts

The following alerts are generated by the logging collector. You can view these alerts in the OKD web console, on the Alerts page of the Alerting UI.

Table 1. Fluentd Prometheus alerts
Alert | Message | Description | Severity

FluentDHighErrorRate | <value> of records have resulted in an error by fluentd <instance>. | The number of Fluentd output errors is high, by default more than 10 in the previous 15 minutes. | Warning

FluentdNodeDown | Prometheus could not scrape fluentd <instance> for more than 10m. | Fluentd is reporting that Prometheus could not scrape a specific Fluentd instance. | Critical

FluentdQueueLengthBurst | In the last minute, fluentd <instance> buffer queue length increased more than 32. Current value is <value>. | Fluentd is reporting that it cannot keep up with the data being indexed. | Warning

FluentdQueueLengthIncreasing | In the last 12h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>. | Fluentd is reporting that the queue size is increasing. | Critical

FluentDVeryHighErrorRate | <value> of records have resulted in an error by fluentd <instance>. | The number of Fluentd output errors is very high, by default more than 25 in the previous 15 minutes. | Critical
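The two error-rate alerts differ only in their default thresholds (10 and 25 over the previous 15 minutes). The sketch below shows how a measured value maps to those alerts. It is an illustration under stated assumptions, not the actual alerting rule: the function name is hypothetical, the table does not state the unit of the value, and the sketch reports only the most severe matching alert.

```python
from typing import Optional, Tuple


def fluentd_error_alert(
    error_value: float,
    high: float = 10,       # default threshold for FluentDHighErrorRate
    very_high: float = 25,  # default threshold for FluentDVeryHighErrorRate
) -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for a 15-minute error value, or None.

    Both thresholds are exclusive ("more than"), as in the table.
    """
    if error_value > very_high:
        return ("FluentDVeryHighErrorRate", "Critical")
    if error_value > high:
        return ("FluentDHighErrorRate", "Warning")
    return None


for value in (5, 12, 30):
    print(value, fluentd_error_alert(value))
```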

About Elasticsearch alerting rules

You can view these alerting rules in Prometheus.

Alert | Description | Severity

ElasticsearchClusterNotHealthy | The cluster health status has been RED for at least 2 minutes. The cluster does not accept writes, shards may be missing, or the master node hasn’t been elected yet. | Critical

ElasticsearchClusterNotHealthy | The cluster health status has been YELLOW for at least 20 minutes. Some shard replicas are not allocated. | Warning

ElasticsearchDiskSpaceRunningLow | The cluster is expected to be out of disk space within the next 6 hours. | Critical

ElasticsearchHighFileDescriptorUsage | The cluster is predicted to be out of file descriptors within the next hour. | Warning

ElasticsearchJVMHeapUseHigh | The JVM Heap usage on the specified node is high. | Alert

ElasticsearchNodeDiskWatermarkReached | The specified node has hit the low watermark due to low free disk space. Shards cannot be allocated to this node anymore. You should consider adding more disk space to the node. | Info

ElasticsearchNodeDiskWatermarkReached | The specified node has hit the high watermark due to low free disk space. Some shards will be re-allocated to different nodes if possible. Make sure more disk space is added to the node or drop old indices allocated to this node. | Warning

ElasticsearchNodeDiskWatermarkReached | The specified node has hit the flood watermark due to low free disk space. Every index that has a shard allocated on this node is enforced a read-only block. The index block must be manually released when the disk use falls below the high watermark. | Critical

ElasticsearchJVMHeapUseHigh | The JVM Heap usage on the specified node is too high. | Alert

ElasticsearchWriteRequestsRejectionJumps | Elasticsearch is experiencing an increase in write rejections on the specified node. This node might not be keeping up with the indexing speed. | Warning

AggregatedLoggingSystemCPUHigh | The CPU used by the system on the specified node is too high. | Alert

ElasticsearchProcessCPUHigh | The CPU used by Elasticsearch on the specified node is too high. | Alert
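The three ElasticsearchNodeDiskWatermarkReached entries correspond to the low, high, and flood watermarks, with increasing severity. The sketch below maps a node's disk usage to the matching watermark. It assumes the upstream Elasticsearch default watermarks of 85%, 90%, and 95% disk used; your cluster may be configured with different values, and the function name is hypothetical.

```python
from typing import Optional, Tuple


def watermark_severity(
    disk_used_percent: float,
    low: float = 85.0,    # assumed Elasticsearch default low watermark
    high: float = 90.0,   # assumed Elasticsearch default high watermark
    flood: float = 95.0,  # assumed Elasticsearch default flood-stage watermark
) -> Optional[Tuple[str, str]]:
    """Return (watermark, severity) for a node's disk usage, or None if below all.

    Treating a value equal to a threshold as having reached it is a choice
    made for this sketch.
    """
    if disk_used_percent >= flood:
        # Indices with shards on this node get a read-only block.
        return ("flood", "Critical")
    if disk_used_percent >= high:
        # Some shards are re-allocated to other nodes if possible.
        return ("high", "Warning")
    if disk_used_percent >= low:
        # No new shards are allocated to this node.
        return ("low", "Info")
    return None


for used in (50.0, 86.0, 92.0, 97.0):
    print(used, watermark_severity(used))
```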