Remediating nodes with Node Health Checks
You can use the Node Health Check Operator to identify unhealthy nodes. The Operator uses the Self Node Remediation Operator to remediate the unhealthy nodes.
Additional resources
Remediating nodes with the Self Node Remediation Operator
About the Node Health Check Operator
The Node Health Check Operator detects the health of the nodes in a cluster. The NodeHealthCheck controller creates the NodeHealthCheck custom resource (CR), which defines a set of criteria and thresholds to determine the health of a node.
The Node Health Check Operator also installs the Self Node Remediation Operator as a default remediation provider.
When the Node Health Check Operator detects an unhealthy node, it creates a remediation CR that triggers the remediation provider. For example, the controller creates the SelfNodeRemediation CR, which triggers the Self Node Remediation Operator to remediate the unhealthy node.
The NodeHealthCheck CR resembles the following YAML file:
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  minHealthy: 51% (1)
  pauseRequests: (2)
    - <pause-test-cluster>
  remediationTemplate: (3)
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    name: self-node-remediation-resource-deletion-template
    namespace: openshift-operators
    kind: SelfNodeRemediationTemplate
  selector: (4)
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions: (5)
    - type: Ready
      status: "False"
      duration: 300s (6)
    - type: Ready
      status: Unknown
      duration: 300s (6)
(1) Specifies the number of healthy nodes (as a percentage or an absolute number) required for a remediation provider to concurrently remediate nodes in the targeted pool. If the number of healthy nodes equals or exceeds the limit set by minHealthy, remediation occurs. The default value is 51%.
(2) Prevents any new remediation from starting, while allowing any ongoing remediations to persist. The default value is empty. However, you can enter an array of strings that identify the cause of pausing the remediation. For example, pause-test-cluster.
(3) Specifies a remediation template from the remediation provider. For example, from the Self Node Remediation Operator.
(4) Specifies a selector that matches labels or expressions that you want to check. The default value is empty, which selects all nodes.
(5) Specifies a list of the conditions that determine whether a node is considered unhealthy.
(6) Specifies the timeout duration for a node condition. If a condition is met for the duration of the timeout, the node is remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy node.
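The minHealthy rule described above can be illustrated with a little arithmetic. The following is only a sketch: the remediation_allowed function is a hypothetical helper, not part of the Operator, and it assumes that a percentage threshold is rounded up to a whole number of nodes, which this document does not state.

```shell
# Hypothetical helper, not part of the Node Health Check Operator.
# Arguments: total nodes in the pool, healthy nodes, minHealthy as a percentage.
# Assumption: the percentage is rounded up to a whole node count.
remediation_allowed() {
  total=$1
  healthy=$2
  min_percent=$3
  # Integer ceiling of total * min_percent / 100
  threshold=$(( (total * min_percent + 99) / 100 ))
  if [ "$healthy" -ge "$threshold" ]; then
    echo "allowed"
  else
    echo "blocked"
  fi
}

remediation_allowed 10 6 51   # prints: allowed (threshold is 6 nodes)
remediation_allowed 10 5 51   # prints: blocked
```

Under these assumptions, a pool of 10 nodes with minHealthy set to 51% needs at least 6 healthy nodes before remediation proceeds.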
Understanding the Node Health Check Operator workflow
When a node is identified as unhealthy, the Node Health Check Operator checks how many other nodes are unhealthy. If the number of healthy nodes exceeds the amount that is specified in the minHealthy field of the NodeHealthCheck CR, the controller creates a remediation CR from the details that are provided in the external remediation template by the remediation provider. After remediation, the kubelet updates the node’s health status.
When the node becomes healthy again, the controller deletes the remediation CR that it created from the external remediation template.
About how node health checks prevent conflicts with machine health checks
When both node health checks and machine health checks are deployed, the node health check avoids conflict with the machine health check.
OKD deploys |
The following list summarizes the system behavior when node health checks and machine health checks are deployed:
If only the default machine health check exists, the node health check continues to identify unhealthy nodes. However, the node health check ignores unhealthy nodes in a Terminating state. The default machine health check handles the unhealthy nodes with a Terminating state.
Example log message
INFO MHCChecker ignoring unhealthy Node, it is terminating and will be handled by MHC {"NodeName": "node-1.example.com"}
If the default machine health check is modified (for example, the unhealthyConditions is Ready), or if additional machine health checks are created, the node health check is disabled.
Example log message
INFO controllers.NodeHealthCheck disabling NHC in order to avoid conflict with custom MHCs configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}
When only the default machine health check exists again, the node health check is re-enabled.
Example log message
INFO controllers.NodeHealthCheck re-enabling NHC, no conflicting MHC configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}
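To find out when a node health check was disabled or re-enabled, you can search saved controller logs for the messages shown above. The following sketch assumes that you have already saved the logs to a file; the file name nhc-sample.log is hypothetical, and the file is filled here with the example messages from this section so that the commands can be tried without a cluster.

```shell
# Stand-in for saved controller logs, built from the example messages above.
# The file name nhc-sample.log is a hypothetical choice.
cat > nhc-sample.log <<'EOF'
INFO MHCChecker ignoring unhealthy Node, it is terminating and will be handled by MHC {"NodeName": "node-1.example.com"}
INFO controllers.NodeHealthCheck disabling NHC in order to avoid conflict with custom MHCs configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}
INFO controllers.NodeHealthCheck re-enabling NHC, no conflicting MHC configured in the cluster {"NodeHealthCheck": "/nhc-worker-default"}
EOF

# Keep only the lines that record the node health check being disabled or re-enabled.
grep -E 'disabling NHC|re-enabling NHC' nhc-sample.log
```

With the sample file, the command prints the second and third lines and skips the message about the terminating node.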
Control plane fencing
In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.
Do not use the same NodeHealthCheck CR for worker nodes and control plane nodes. Grouping worker nodes and control plane nodes together can result in incorrect evaluation of the minimum healthy node count, and cause unexpected or missing remediations. This is because of the way the Node Health Check Operator handles control plane nodes. You should group the control plane nodes in their own group and the worker nodes in their own group. If required, you can also create multiple groups of worker nodes.
Considerations for remediation strategies:
Avoid Node Health Check configurations that involve multiple configurations overlapping the same nodes because they can result in unexpected behavior. This suggestion applies to both worker and control plane nodes.
The Node Health Check Operator implements a hardcoded limitation of remediating a maximum of one control plane node at a time. Multiple control plane nodes should not be remediated at the same time.
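One way to keep worker nodes and control plane nodes in separate groups is to define one NodeHealthCheck CR per group. The following sketch writes two such CRs to a file. The file name and the CR names are hypothetical, and the template name, namespace, thresholds, and durations are copied from the sample CR earlier in this document, so adjust them to match your cluster.

```shell
# Write two NodeHealthCheck CRs, one per node group, to a single manifest file.
# File name and metadata.name values are hypothetical examples.
cat > nhc-split-pools.yaml <<'EOF'
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-workers-example
spec:
  minHealthy: 51%
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-resource-deletion-template
    namespace: openshift-operators
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
---
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-control-plane-example
spec:
  minHealthy: 51%
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-resource-deletion-template
    namespace: openshift-operators
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
EOF
```

You can then create both CRs with oc create -f nhc-split-pools.yaml. Each selector matches only one node role, so the two configurations do not overlap on nodes that carry a single role label.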
Installing the Node Health Check Operator by using the web console
You can use the OKD web console to install the Node Health Check Operator.
Prerequisites
- Log in as a user with cluster-admin privileges.
Procedure
In the OKD web console, navigate to Operators → OperatorHub.
Search for the Node Health Check Operator, then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator will be installed to the openshift-operators namespace.
Ensure that the Console plug-in is set to Enable.
Click Install.
Verification
To confirm that the installation is successful:
Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the openshift-operators namespace and that its status is Succeeded.
If the Operator is not installed successfully:
Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the logs in any pods in the openshift-operators project that are reporting issues.
Installing the Node Health Check Operator by using the CLI
You can use the OpenShift CLI (oc) to install the Node Health Check Operator.
To install the Operator in your own namespace, follow the steps in the procedure.
To install the Operator in the openshift-operators namespace, skip to step 3 of the procedure because the steps to create a new Namespace custom resource (CR) and an OperatorGroup CR are not required.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
1. Create a Namespace custom resource (CR) for the Node Health Check Operator:
   a. Define the Namespace CR and save the YAML file, for example, node-health-check-namespace.yaml:
      apiVersion: v1
      kind: Namespace
      metadata:
        name: node-health-check
   b. To create the Namespace CR, run the following command:
      $ oc create -f node-health-check-namespace.yaml
2. Create an OperatorGroup CR:
   a. Define the OperatorGroup CR and save the YAML file, for example, node-health-check-operator-group.yaml:
      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: node-health-check-operator
        namespace: node-health-check
   b. To create the OperatorGroup CR, run the following command:
      $ oc create -f node-health-check-operator-group.yaml
3. Create a Subscription CR:
   a. Define the Subscription CR and save the YAML file, for example, node-health-check-subscription.yaml:
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: node-health-check-operator
        namespace: node-health-check (1)
      spec:
        channel: stable (2)
        installPlanApproval: Manual (3)
        name: node-healthcheck-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
        package: node-healthcheck-operator
      (1) Specify the Namespace where you want to install the Node Health Check Operator. To install the Node Health Check Operator in the openshift-operators namespace, specify openshift-operators in the Subscription CR.
      (2) Specify the channel name for your subscription. To upgrade to the latest version of the Node Health Check Operator, you must manually change the channel name for your subscription from candidate to stable.
      (3) Set the approval strategy to Manual in case your specified version is superseded by a later version in the catalog. This plan prevents an automatic upgrade to a later version and requires manual approval before the starting CSV can complete the installation.
   b. To create the Subscription CR, run the following command:
      $ oc create -f node-health-check-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n openshift-operators
Example output
NAME                               DISPLAY                      VERSION   REPLACES   PHASE
node-healthcheck-operator.v0.2.0   Node Health Check Operator   0.2.0                Succeeded
Verify that the Node Health Check Operator is up and running:
$ oc get deploy -n openshift-operators
Example output
NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
node-health-check-operator-controller-manager   1/1     1            1           10d
Creating a node health check
Using the web console, you can create a node health check to identify unhealthy nodes and specify the remediation type and strategy to fix them.
Procedure
From the Administrator perspective of the OKD web console, click Compute → NodeHealthChecks → CreateNodeHealthCheck.
Specify whether to configure the node health check using the Form view or the YAML view.
Enter a Name for the node health check. The name must consist of lowercase alphanumeric characters, '-', or '.', and must start and end with an alphanumeric character.
Specify the Remediator type: either Self node remediation or Other. The Self node remediation option is part of the Self Node Remediation Operator that is installed with the Node Health Check Operator. If you select Other, you must enter an API version, Kind, Name, and Namespace, which together point to the remediation template resource of a remediator.
Make a Nodes selection by specifying the labels of the nodes you want to remediate. The selection matches labels that you want to check. If more than one label is specified, the nodes must contain each label. The default value is empty, which selects both worker and control-plane nodes.
When creating a node health check with the Self Node Remediation Operator, you must select either node-role.kubernetes.io/worker or node-role.kubernetes.io/control-plane as the value.
Specify the minimum number of healthy nodes, using either a percentage or a number, required for a NodeHealthCheck to remediate nodes in the targeted pool. If the number of healthy nodes equals or exceeds the limit set by Min healthy, remediation occurs. The default value is 51%.
Specify a list of Unhealthy conditions. A node that meets a listed condition is considered unhealthy and requires remediation. You can specify the Type, Status, and Duration. You can also create your own custom type.
Click Create to create the node health check.
Verification
- Navigate to the Compute → NodeHealthCheck page and verify that the corresponding node health check is listed and that its status is displayed. After they are created, node health checks can be paused, modified, and deleted.
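Pausing can also be expressed in the CR itself through the pauseRequests field described earlier. The following sketch writes a NodeHealthCheck CR that has one pause request set. The file name and CR name are hypothetical, and the remaining values are taken from the sample CR in this document rather than from a tested configuration.

```shell
# Write a NodeHealthCheck CR with a pause request set.
# File name and metadata.name are hypothetical examples.
cat > nhc-paused-example.yaml <<'EOF'
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-paused-example
spec:
  minHealthy: 51%
  # Any entry here prevents new remediations from starting;
  # remediations that are already in progress continue.
  pauseRequests:
    - pause-test-cluster
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-resource-deletion-template
    namespace: openshift-operators
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
EOF
```

You can create the CR with oc create -f nhc-paused-example.yaml. Emptying the pauseRequests list should allow new remediations to start again, going by the field description above.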
Gathering data about the Node Health Check Operator
To collect debugging information about the Node Health Check Operator, use the must-gather tool. For information about the must-gather image for the Node Health Check Operator, see Gathering data about specific features.