Troubleshooting for Critical Alerts

Elasticsearch Cluster Health is Red

At least one primary shard and its replicas are not allocated to a node.

Troubleshooting

  1. Check the Elasticsearch cluster health and verify that the cluster status is red.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health
  2. List the nodes that have joined the cluster.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/nodes?v
  3. List the Elasticsearch pods and compare them with the nodes in the command output from the previous step.

    1. oc -n openshift-logging get pods -l component=elasticsearch
  4. If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.

    1. Confirm that Elasticsearch has an elected master node.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/master?v
    2. Review the pod logs of the elected master node for issues.

      1. oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
    3. Review the logs of nodes that have not joined the cluster for issues.

      1. oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
  5. If all the nodes have joined the cluster, check if the cluster is in the process of recovering.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/recovery?active_only=true

    If there is no command output, the recovery process might be delayed or stalled by pending tasks.

  6. Check if there are pending tasks.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health | grep number_of_pending_tasks
  7. If there are pending tasks, monitor their status.

    If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors.

    Otherwise, if the status of the pending tasks does not change, this indicates that the recovery has stalled.

  8. If the recovery appears to have stalled, check if cluster.routing.allocation.enable is set to none.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty
  9. If cluster.routing.allocation.enable is set to none, set it to all.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
  10. Check which indices are still red.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
  11. If any indices are still red, try to clear them by performing the following steps.

    1. Clear the cache.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
    2. Increase the max allocation retries.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.allocation.max_retries":10}'
    3. Delete all the scroll items.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_search/scroll/_all -X DELETE
    4. Increase the timeout.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
  12. If the preceding steps do not clear the red indices, delete the indices individually.

    1. Identify the red index name.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
    2. Delete the red index.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_red_index_name> -X DELETE
  13. If there are no red indices and the cluster status is still red, check for a continuous heavy processing load on a data node.

    1. Check if the Elasticsearch JVM Heap usage is high.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats?pretty

      In the command output, review the node_name.jvm.mem.heap_used_percent field to determine the JVM Heap usage.

    2. Check for high CPU utilization.
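
The monitoring in steps 6 and 7 can be sketched as two small shell helpers. This is a minimal sketch, not part of the product: the function names json_number_field and pending_trend are hypothetical, it assumes the health output is pretty-printed JSON with one "key" : value pair per line, and it treats a change in the pending task count as a coarse sign of progress.

```shell
# Hypothetical helper: print the first numeric value of the field named
# in $1 from pretty-printed JSON on stdin (one "key" : value per line).
json_number_field() {
  sed -n "s/^[[:space:]]*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Hypothetical helper: compare two successive samples of
# number_of_pending_tasks. Prints "idle" when nothing is pending,
# "progressing" when the count changed, "stalled" when it did not.
pending_trend() {
  previous="$1"
  current="$2"
  if [ "$current" -eq 0 ]; then
    echo "idle"
  elif [ "$current" -ne "$previous" ]; then
    echo "progressing"
  else
    echo "stalled"
  fi
}

# Example use against a live cluster (POD holds an Elasticsearch pod name):
#   first=$(oc exec -n openshift-logging -c elasticsearch "$POD" -- health | json_number_field number_of_pending_tasks)
#   sleep 60
#   second=$(oc exec -n openshift-logging -c elasticsearch "$POD" -- health | json_number_field number_of_pending_tasks)
#   pending_trend "$first" "$second"
```

A result of "stalled" over several samples suggests moving on to the cluster.routing.allocation.enable check in step 8.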

Elasticsearch Cluster Health is Yellow

Replica shards for at least one primary shard are not allocated to nodes.

Troubleshooting

  1. Increase the node count by adjusting nodeCount in the ClusterLogging CR.
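
One way to adjust nodeCount without opening an editor is a JSON merge patch. The following is a sketch under assumptions: the helper name nodecount_patch is hypothetical, the CR name instance is assumed, and the field path is taken to sit next to redundancyPolicy under spec.logStore.elasticsearch.

```shell
# Hypothetical helper: build a JSON merge patch that sets the
# Elasticsearch node count to the number given in $1.
nodecount_patch() {
  printf '{"spec":{"logStore":{"elasticsearch":{"nodeCount":%d}}}}\n' "$1"
}

# Example use (the CR name "instance" is an assumption):
#   oc -n openshift-logging patch clusterlogging/instance --type merge -p "$(nodecount_patch 4)"
```

Printing the patch first makes it easy to review the change before applying it.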

Elasticsearch Node Disk Low Watermark Reached

Elasticsearch does not allocate shards to nodes that reach the low watermark.

Troubleshooting

  1. Identify the node on which Elasticsearch is deployed.

    1. oc -n openshift-logging get po -o wide
  2. Check if there are unassigned shards.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep unassigned_shards
  3. If there are unassigned shards, check the disk space on each node.

    1. for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
  4. In the df command output, check the Use% column to determine the used disk percentage on each node.

    If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.

  5. Try to increase the disk space on all nodes.

  6. If increasing the disk space is not possible, try adding a new data node to the cluster.

  7. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      1. oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      If you are using a ClusterLogging CR, enter:
      1. oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.

  8. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
    2. Identify an old index that can be deleted.

    3. Delete the index.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
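
The disk check in steps 3 and 4 can be turned into a quick classification against the watermark thresholds named in this document (85%, 90%, and 95%). This is a minimal sketch: the function names df_use_percent and watermark_level are hypothetical, and the thresholds are the defaults quoted here, which a cluster might override.

```shell
# Hypothetical helper: print the Use% column (without the % sign) for
# each filesystem row of df output read from stdin.
df_use_percent() {
  awk 'NR > 1 { sub(/%$/, "", $5); print $5 }'
}

# Hypothetical helper: map a used-disk percentage to the watermark it
# exceeds, using the thresholds quoted in this document.
watermark_level() {
  if [ "$1" -gt 95 ]; then
    echo "flood"
  elif [ "$1" -gt 90 ]; then
    echo "high"
  elif [ "$1" -gt 85 ]; then
    echo "low"
  else
    echo "ok"
  fi
}

# Example use (POD holds an Elasticsearch pod name; -P keeps each
# filesystem on a single line):
#   pct=$(oc -n openshift-logging exec -c elasticsearch "$POD" -- df -P /elasticsearch/persistent | df_use_percent)
#   watermark_level "$pct"
```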

Elasticsearch Node Disk High Watermark Reached

Elasticsearch attempts to relocate shards away from a node that has reached the high watermark.

Troubleshooting

  1. Identify the node on which Elasticsearch is deployed.

    1. oc -n openshift-logging get po -o wide
  2. Check the disk space on each node.

    1. for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
  3. Check if the cluster is rebalancing.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep relocating_shards

    If the command output shows relocating shards, the high watermark has been exceeded. The default value of the high watermark is 90%.

    The shards relocate to a node with low disk usage that has not crossed any watermark threshold limits.

  4. To allow shards to be allocated to a particular node, free up disk space on that node.

  5. Try to increase the disk space on all nodes.

  6. If increasing the disk space is not possible, try adding a new data node to the cluster.

  7. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      1. oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      If you are using a ClusterLogging CR, enter:
      1. oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.

  8. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
    2. Identify an old index that can be deleted.

    3. Delete the index.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
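
The redundancy check in step 7 compares the current policy against SingleRedundancy. A sketch of that comparison follows. The function names are hypothetical, the ordering (ZeroRedundancy lowest, FullRedundancy highest) reflects the usual meaning of the four policy values and should be verified for your version, and the CR name instance in the usage comment is an assumption.

```shell
# Hypothetical helper: rank a redundancy policy from lowest (0) to
# highest (3). Unknown values are reported on stderr.
policy_rank() {
  case "$1" in
    ZeroRedundancy) echo 0 ;;
    SingleRedundancy) echo 1 ;;
    MultipleRedundancy) echo 2 ;;
    FullRedundancy) echo 3 ;;
    *) echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeed when the policy in $1 is higher than
# SingleRedundancy, meaning it can be lowered to SingleRedundancy.
should_lower_to_single() {
  rank=$(policy_rank "$1") || return 2
  [ "$rank" -gt 1 ]
}

# Example use (the CR name "instance" is an assumption):
#   current=$(oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}')
#   if should_lower_to_single "$current"; then
#     oc -n openshift-logging patch clusterlogging/instance --type merge -p '{"spec":{"logStore":{"elasticsearch":{"redundancyPolicy":"SingleRedundancy"}}}}'
#   fi
```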

Elasticsearch Node Disk Flood Watermark Reached

Elasticsearch enforces a read-only index block on every index that meets both of these conditions:

  • One or more shards are allocated to the node.

  • One or more disks exceed the flood stage.

Troubleshooting

  1. Check the disk space of the Elasticsearch node.

    1. for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done

    In the df command output, check the Use% column to determine the used disk percentage on each node.

  2. If the used disk percentage is above 95%, the node has crossed the flood watermark, and writing is blocked for shards allocated on this node.

  3. Try to increase the disk space on all nodes.

  4. If increasing the disk space is not possible, try adding a new data node to the cluster.

  5. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      1. oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      If you are using a ClusterLogging CR, enter:
      1. oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.

  6. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
    2. Identify an old index that can be deleted.

    3. Delete the index.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
  7. Continue freeing up and monitoring the disk space until the used disk space drops below 90%. Then, unblock writes to the indices on this node.

    1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings?pretty -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
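
Step 7 depends on confirming that usage has dropped below 90% before the block is removed. A minimal sketch of that gate follows. The function name all_below is hypothetical, and the percentages are assumed to come from the Use% column of the df output for each Elasticsearch pod.

```shell
# Hypothetical helper: succeed only if at least one percentage is given
# and every one of them is below the threshold passed as the first
# argument.
all_below() {
  threshold="$1"
  shift
  [ "$#" -gt 0 ] || return 1
  for pct in "$@"; do
    [ "$pct" -lt "$threshold" ] || return 1
  done
  return 0
}

# Example use (72, 88, and 64 stand in for per-pod Use% values):
#   if all_below 90 72 88 64; then
#     oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings?pretty -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
#   fi
```

Requiring at least one value avoids removing the block when the disk check returned nothing.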

Elasticsearch JVM Heap Use is High

The JVM heap memory used by the Elasticsearch node is above 75%.

Troubleshooting

Consider increasing the heap size.
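
Before changing the heap size, it can help to confirm which nodes are above the 75% threshold. The sketch below filters the heap_used_percent values out of the _nodes/stats?pretty output used elsewhere in this document. The function names are hypothetical, and the parsing assumes pretty-printed JSON with one "key" : value pair per line.

```shell
# Hypothetical helper: print every heap_used_percent value found in
# pretty-printed JSON on stdin, one per line.
heap_percents() {
  sed -n 's/^[[:space:]]*"heap_used_percent"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed if any number on stdin exceeds the limit
# given in $1 (75 when omitted).
any_above() {
  limit="${1:-75}"
  awk -v limit="$limit" '$1 + 0 > limit + 0 { found = 1 } END { exit found ? 0 : 1 }'
}

# Example use:
#   oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats?pretty | heap_percents | any_above 75 && echo "heap usage is high on at least one node"
```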

Aggregated Logging System CPU is High

System CPU usage on the node is high.

Troubleshooting

Check the CPU of the cluster node. Consider allocating more CPU resources to the node.

Elasticsearch Process CPU is High

Elasticsearch process CPU usage on the node is high.

Troubleshooting

Check the CPU of the cluster node. Consider allocating more CPU resources to the node.

Elasticsearch Disk Space is Running Low

The Elasticsearch Cluster is predicted to be out of disk space within the next 6 hours based on current disk usage.

Troubleshooting

  1. Get the disk space of the Elasticsearch node.

    1. for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
  2. In the df command output, check the Use% column to determine the used disk percentage on each node.

  3. Try to increase the disk space on all nodes.

  4. If increasing the disk space is not possible, try adding a new data node to the cluster.

  5. If adding a new data node is problematic, decrease the total cluster redundancy policy.

    1. Check the current redundancyPolicy.

      1. oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
      If you are using a ClusterLogging CR, enter:
      1. oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
    2. If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.

  6. If the preceding steps do not fix the issue, delete the old indices.

    1. Check the status of all indices on Elasticsearch.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
    2. Identify an old index that can be deleted.

    3. Delete the index.

      1. oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
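
When looking for an old index in step 6, sorting by creation time can narrow the candidates. The sketch below picks the oldest entry from a list of index names and creation timestamps. The function name oldest_index is hypothetical, and the usage comment assumes that es_util passes the h and s query parameters of the Elasticsearch _cat/indices API through unchanged. The oldest index is only a candidate; confirm that its data is no longer needed before deleting it.

```shell
# Hypothetical helper: read "index_name creation_epoch_millis" pairs on
# stdin and print the name of the index with the smallest timestamp.
oldest_index() {
  sort -k2,2n | head -n 1 | awk '{ print $1 }'
}

# Example use (quoting keeps the shell from interpreting ? and &):
#   oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query="_cat/indices?h=index,creation.date" | oldest_index
```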

Elasticsearch FileDescriptor Usage is High

Based on current usage trends, the predicted number of file descriptors on the node is insufficient.

Troubleshooting

Check and, if needed, configure the value of max_file_descriptors for each node, as described in the Elasticsearch File descriptors topic.
