Deploying node health checks by using the Node Health Check Operator

Use the Node Health Check Operator to deploy the NodeHealthCheck controller. The controller identifies unhealthy nodes and uses the Poison Pill Operator to remediate them.

Additional resources

Remediating nodes with the Poison Pill Operator

Node Health Check Operator is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.

About the Node Health Check Operator

The Node Health Check Operator deploys the NodeHealthCheck controller, which in turn creates the NodeHealthCheck custom resource (CR). The Node Health Check Operator also installs the Poison Pill Operator as a default remediation provider.

The Operator uses the controller to detect the health of the nodes in the cluster. The NodeHealthCheck CR defines a set of criteria and thresholds that determine a node’s health.

When the controller detects an unhealthy node, it creates a remediation CR that triggers the remediation provider. For example, the controller creates the PoisonPillRemediation CR, which triggers the Poison Pill Operator to remediate the unhealthy node.

The NodeHealthCheck CR resembles the following YAML file:

  apiVersion: remediation.medik8s.io/v1alpha1
  kind: NodeHealthCheck
  metadata:
    name: nodehealthcheck-sample
    namespace: openshift-operators
  spec:
    minHealthy: 51% # (1)
    pauseRequests: # (2)
      - <pause-test-cluster>
    remediationTemplate: # (3)
      apiVersion: poison-pill.medik8s.io/v1alpha1
      name: group-x
      namespace: openshift-operators
      kind: PoisonPillRemediationTemplate
    selector: # (4)
      matchExpressions:
        - key: node-role.kubernetes.io/worker
          operator: Exists
    unhealthyConditions: # (5)
      - type: Ready
        status: "False"
        duration: 300s # (6)
      - type: Ready
        status: Unknown
        duration: 300s # (6)
(1) Specifies the minimum percentage of healthy nodes that the targeted pool must have for remediation to proceed. If the number of healthy nodes equals or exceeds the limit set by minHealthy, remediation occurs. The default value is 51%.
(2) Prevents any new remediation from starting, while allowing ongoing remediations to continue. The default value is empty. However, you can enter an array of strings that identify the cause of pausing the remediation, for example, pause-test-cluster.

During the upgrade process, nodes in the cluster might become temporarily unavailable and get identified as unhealthy. In the case of worker nodes, when the Operator detects that the cluster is upgrading, it stops remediating new unhealthy nodes to prevent such nodes from rebooting.

(3) Specifies a remediation template from the remediation provider. For example, from the Poison Pill Operator.
(4) Specifies a selector that matches labels or expressions that you want to check. The default value is empty, which selects all nodes.
(5) Specifies a list of the conditions that determine whether a node is considered unhealthy.
(6) Specifies the timeout duration for a node condition. If a condition is met for the duration of the timeout, the node will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy node.
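
To make the unhealthyConditions entries in the sample CR more concrete, the following shell sketch mirrors their logic: a node counts as unhealthy when its Ready condition has been False or Unknown for at least the configured duration. The function name node_is_unhealthy and the TIMEOUT_SECONDS variable are hypothetical helpers for illustration only, and treating "exactly 300 seconds" as already expired is an assumption. This is not the controller's implementation.

```shell
#!/bin/sh
# Illustration only: mirrors the two sample unhealthyConditions entries
# (Ready=False for 300s, Ready=Unknown for 300s).
# TIMEOUT_SECONDS and node_is_unhealthy are hypothetical names.
TIMEOUT_SECONDS=300

# node_is_unhealthy READY_STATUS SECONDS_IN_STATUS
# Exit status 0 if the node would be considered unhealthy, 1 otherwise.
node_is_unhealthy() {
  case "$1" in
    # Assumption: a condition that has lasted exactly the timeout counts as expired.
    False|Unknown) [ "$2" -ge "$TIMEOUT_SECONDS" ] ;;
    *) return 1 ;;
  esac
}

# Example: Ready has been "Unknown" for 6 minutes.
if node_is_unhealthy Unknown 360; then
  echo "node would be remediated"
fi
```

A node whose Ready status is True, or whose Ready status turned False only 30 seconds ago, does not match and is left alone.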

Understanding the Node Health Check Operator workflow

When a node is identified as unhealthy, the Operator checks how many other nodes are unhealthy. If the number of healthy nodes equals or exceeds the amount that is specified in the minHealthy field of the NodeHealthCheck CR, the controller creates a remediation CR from the details that the remediation provider supplies in the external remediation template. After remediation, the node’s health status is updated accordingly.

When the node turns healthy, the controller deletes the remediation CR that it created and updates the node’s health status.
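
The minHealthy check in this workflow can be sketched as simple arithmetic. The functions required_healthy and remediation_allowed below are hypothetical names used only for illustration. The sketch assumes that a percentage threshold is rounded up to a whole number of nodes, which this document does not state, so treat the rounding as an assumption rather than documented behavior.

```shell
#!/bin/sh
# Illustration only: hypothetical helpers that model the minHealthy gate.

# required_healthy TOTAL_NODES MIN_HEALTHY_PERCENT
# Prints the minimum number of healthy nodes needed for remediation.
required_healthy() {
  total=$1
  percent=$2
  # Assumption: round up, that is, ceil(total * percent / 100).
  echo $(( (total * percent + 99) / 100 ))
}

# remediation_allowed TOTAL_NODES HEALTHY_NODES MIN_HEALTHY_PERCENT
# Exit status 0 if the healthy count equals or exceeds the threshold.
remediation_allowed() {
  [ "$2" -ge "$(required_healthy "$1" "$3")" ]
}

# Example: 10 worker nodes, 6 of them healthy, minHealthy of 51%.
if remediation_allowed 10 6 51; then
  echo "remediation may start"
else
  echo "too few healthy nodes, remediation is skipped"
fi
```

Under these assumptions, a pool of 10 nodes with minHealthy set to 51% needs 6 healthy nodes. With 4 unhealthy nodes remediation can proceed, and with 5 unhealthy nodes it cannot.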

Installing the Node Health Check Operator by using the web console

You can use the OKD web console to install the Node Health Check Operator.

Prerequisites

  • Log in as a user with cluster-admin privileges.

Procedure

  1. In the OKD web console, navigate to Operators → OperatorHub.

  2. Search for the Node Health Check Operator, then click Install.

  3. Keep the default selection of Installation mode and namespace to ensure that the Operator is installed in the openshift-operators namespace.

  4. Click Install.

Verification

To confirm that the installation is successful:

  1. Navigate to the Operators → Installed Operators page.

  2. Check that the Operator is installed in the openshift-operators namespace and that its status is Succeeded.

If the Operator is not installed successfully:

  1. Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.

  2. Navigate to the Workloads → Pods page and check the logs of any pods in the openshift-operators project that are reporting issues.

Installing the Node Health Check Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Node Health Check Operator.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a Namespace custom resource (CR) for the Node Health Check Operator:

    1. Define the Namespace CR and save the YAML file, for example, node-health-check-namespace.yaml:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-operators

    2. To create the Namespace CR, run the following command:

      $ oc create -f node-health-check-namespace.yaml

  2. Create an OperatorGroup CR:

    1. Define the OperatorGroup CR and save the YAML file, for example, node-health-check-operator-group.yaml:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: node-health-check-operator
        namespace: openshift-operators
      spec:
        targetNamespaces:
        - openshift-operators

    2. To create the OperatorGroup CR, run the following command:

      $ oc create -f node-health-check-operator-group.yaml

  3. Create a Subscription CR:

    1. Define the Subscription CR and save the YAML file, for example, node-health-check-subscription.yaml:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: node-health-check-operator
        namespace: openshift-operators
      spec:
        channel: alpha
        name: node-healthcheck-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
        package: node-health-check-operator

    2. To create the Subscription CR, run the following command:

      $ oc create -f node-health-check-subscription.yaml

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource:

    $ oc get csv -n openshift-operators

    Example output

    NAME                                DISPLAY                      VERSION   REPLACES   PHASE
    node-health-check-operator.v0.1.1   Node Health Check Operator   0.1.1                Succeeded

  2. Verify that the Node Health Check Operator is up and running:

    $ oc get deploy -n openshift-operators

    Example output

    NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
    node-health-check-operator-controller-manager   1/1     1            1           10d
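
If you want to script the CSV check instead of reading the output by eye, a small filter over the `oc get csv` output is one option. The function csv_succeeded below is a hypothetical helper, not part of oc. It assumes the tabular output format shown in the example output, with the CSV name in the first column and the phase in the last column.

```shell
#!/bin/sh
# Illustration only: csv_succeeded is a hypothetical helper.
# It reads `oc get csv` output on standard input and exits with status 0
# if a CSV whose name starts with the given prefix is in the Succeeded phase.
csv_succeeded() {
  awk -v prefix="$1" '
    # Skip the header row; compare the name prefix and the last column (PHASE).
    NR > 1 && index($1, prefix) == 1 && $NF == "Succeeded" { found = 1 }
    END { exit !found }
  '
}

# Usage against a live cluster (requires oc and a logged-in session):
#   oc get csv -n openshift-operators | csv_succeeded node-health-check-operator \
#     && echo "Operator installed"
```

Because the function only parses text, you can try it against saved command output before you run it against a cluster.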