Using Self Node Remediation

You can use the Self Node Remediation Operator to automatically reboot unhealthy nodes. This remediation strategy minimizes downtime for stateful applications and ReadWriteOnce (RWO) volumes, and restores compute capacity if transient failures occur.

About the Self Node Remediation Operator

The Self Node Remediation Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the MachineHealthCheck or NodeHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the MachineHealthCheck or the NodeHealthCheck resource creates the SelfNodeRemediation custom resource (CR), which triggers the Self Node Remediation Operator.

The SelfNodeRemediation CR resembles the following YAML file:

  apiVersion: self-node-remediation.medik8s.io/v1alpha1
  kind: SelfNodeRemediation
  metadata:
    name: selfnoderemediation-sample
    namespace: openshift-operators
  spec:
  status:
    lastError: <last_error_message> (1)

(1) Displays the last error that occurred during remediation. When remediation succeeds or if no errors occur, the field is left empty.

The Self Node Remediation Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.

About watchdog devices

Watchdog devices can be any of the following:

  • Independently powered hardware devices

  • Hardware devices that share power with the hosts they control

  • Virtual devices implemented in software, or softdog

Hardware watchdog and softdog devices have electronic or software timers, respectively. These watchdog devices are used to ensure that the machine enters a safe state when an error condition is detected. The cluster is required to repeatedly reset the watchdog timer to prove that it is in a healthy state. This timer might elapse due to fault conditions, such as deadlocks, CPU starvation, and loss of network or disk access. If the timer expires, the watchdog device assumes that a fault has occurred and the device triggers a forced reset of the node.

Hardware watchdog devices are more reliable than softdog devices.
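
The timer behavior described above can be sketched as a small simulation. This is only an illustration, not how a watchdog device or the Operator is implemented: the function name watchdog_state and its arguments (times in seconds) are hypothetical, and no real watchdog device is touched.

```shell
# Simulated watchdog check (hypothetical helper, no device access).
# Arguments, all in seconds: time of the last reset, current time, timeout.
watchdog_state() {
  last_reset=$1
  now=$2
  timeout=$3
  if [ $((now - last_reset)) -ge "$timeout" ]; then
    # The timer was not reset in time, so a fault is assumed.
    echo "expired: forced reset"
  else
    echo "healthy"
  fi
}

watchdog_state 100 130 60   # 30 seconds since the last reset: prints "healthy"
watchdog_state 100 170 60   # 70 seconds since the last reset: prints "expired: forced reset"
```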

Understanding Self Node Remediation Operator behavior with watchdog devices

The Self Node Remediation Operator determines the remediation strategy based on the watchdog devices that are present.

If a hardware watchdog device is configured and available, the Operator uses it for remediation. If a hardware watchdog device is not configured, the Operator enables and uses a softdog device for remediation.

If neither type of watchdog device is supported, either by the system or by the configuration, the Operator remediates nodes by using software reboot.
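
The order of preference described above can be summarized in a short sketch. The function name select_reboot_method, its yes/no arguments, and the output labels are hypothetical and exist only for illustration; the Operator performs this detection internally.

```shell
# Pick the reboot mechanism in the documented order of preference.
# $1: "yes" if a hardware watchdog device is configured and available
# $2: "yes" if a softdog device is supported
select_reboot_method() {
  if [ "$1" = "yes" ]; then
    echo "hardware-watchdog"
  elif [ "$2" = "yes" ]; then
    echo "softdog"
  else
    # Neither type of watchdog device is supported.
    echo "software-reboot"
  fi
}

select_reboot_method no yes   # prints "softdog"
```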

Additional resources

Configuring a watchdog

Control plane fencing

In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.

Self Node Remediation occurs in two primary scenarios.

  • API Server Connectivity

    • In this scenario, the control plane node to be remediated is not isolated. It can be directly connected to the API Server, or it can be indirectly connected to the API Server through worker nodes or control plane nodes that are directly connected to the API Server.

    • When there is API Server Connectivity, the control plane node is remediated only if the Node Health Check Operator has created a SelfNodeRemediation custom resource (CR) for the node.

  • No API Server Connectivity

    • In this scenario, the control plane node to be remediated is isolated from the API Server. The node cannot connect directly or indirectly to the API Server.

    • When there is no API Server Connectivity, the control plane node is remediated as outlined in these steps:

      • Check the status of the control plane node with the majority of the peer worker nodes. If the majority of the peer worker nodes cannot be reached, the node is analyzed further.

        • Self-diagnose the status of the control plane node.

          • If self-diagnostics pass, no action is taken.

          • If self-diagnostics fail, the node is fenced and remediated.

          • The self-diagnostics currently supported are checking the kubelet service status and checking endpoint availability by using opt-in configuration.

      • If the node did not manage to communicate with most of its worker peers, check the connectivity of the control plane node with other control plane nodes. If the node can communicate with any other control plane peer, no action is taken. Otherwise, the node is fenced and remediated.
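
The two scenarios above can be sketched as a decision function. This is one possible reading of the steps, not the Operator's implementation. The function name control_plane_action and its yes/no arguments are hypothetical. Two points are assumptions: that a node is fenced when either the self-diagnostics fail or no other control plane peer can be reached, and that the case where most worker peers are reachable is left undecided because the steps above do not describe it.

```shell
# control_plane_action API SNR_CR WORKER_MAJORITY DIAG_PASSED CP_PEER
# Every argument is "yes" or "no" (hypothetical inputs).
# Prints "remediate", "no-action", or "not-covered".
control_plane_action() {
  api=$1; cr=$2; workers=$3; diag=$4; cp_peer=$5

  if [ "$api" = "yes" ]; then
    # With API server connectivity, remediate only if a
    # SelfNodeRemediation CR exists for the node.
    if [ "$cr" = "yes" ]; then echo "remediate"; else echo "no-action"; fi
    return 0
  fi

  if [ "$workers" = "yes" ]; then
    # Not described in the steps above; deliberately left undecided.
    echo "not-covered"
    return 0
  fi

  # Isolated from the API server and from most worker peers.
  if [ "$diag" = "no" ]; then
    echo "remediate"        # self-diagnostics failed
  elif [ "$cp_peer" = "yes" ]; then
    echo "no-action"        # another control plane peer is reachable
  else
    echo "remediate"        # no control plane peer is reachable
  fi
}

control_plane_action no no no yes yes   # prints "no-action"
```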

Installing the Self Node Remediation Operator by using the web console

You can use the OKD web console to install the Self Node Remediation Operator.

The Node Health Check Operator also installs the Self Node Remediation Operator as a default remediation provider.

Prerequisites

  • Log in as a user with cluster-admin privileges.

Procedure

  1. In the OKD web console, navigate to Operators → OperatorHub.

  2. Search for the Self Node Remediation Operator from the list of available Operators, and then click Install.

  3. Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-operators namespace.

  4. Click Install.

Verification

To confirm that the installation is successful:

  1. Navigate to the Operators → Installed Operators page.

  2. Check that the Operator is installed in the openshift-operators namespace and its status is Succeeded.

If the Operator is not installed successfully:

  1. Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.

  2. Navigate to the Workloads → Pods page and check the logs in any pods in the self-node-remediation-controller-manager project that are reporting issues.

Installing the Self Node Remediation Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Self Node Remediation Operator.

You can install the Self Node Remediation Operator in your own namespace or in the openshift-operators namespace.

To install the Operator in your own namespace, follow the steps in the procedure.

To install the Operator in the openshift-operators namespace, skip to step 3 of the procedure because the steps to create a new Namespace custom resource (CR) and an OperatorGroup CR are not required.

Prerequisites

  • Install the OpenShift CLI (oc).

  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a Namespace custom resource (CR) for the Self Node Remediation Operator:

    1. Define the Namespace CR and save the YAML file, for example, self-node-remediation-namespace.yaml:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: self-node-remediation

    2. To create the Namespace CR, run the following command:

      $ oc create -f self-node-remediation-namespace.yaml
  2. Create an OperatorGroup CR:

    1. Define the OperatorGroup CR and save the YAML file, for example, self-node-remediation-operator-group.yaml:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: self-node-remediation-operator
        namespace: self-node-remediation

    2. To create the OperatorGroup CR, run the following command:

      $ oc create -f self-node-remediation-operator-group.yaml
  3. Create a Subscription CR:

    1. Define the Subscription CR and save the YAML file, for example, self-node-remediation-subscription.yaml:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: self-node-remediation-operator
        namespace: self-node-remediation (1)
      spec:
        channel: stable
        installPlanApproval: Manual (2)
        name: self-node-remediation-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
        package: self-node-remediation

      (1) Specify the Namespace where you want to install the Self Node Remediation Operator. To install the Self Node Remediation Operator in the openshift-operators namespace, specify openshift-operators in the Subscription CR.
      (2) Set the approval strategy to Manual in case your specified version is superseded by a later version in the catalog. This plan prevents an automatic upgrade to a later version and requires manual approval before the starting CSV can complete the installation.

    2. To create the Subscription CR, run the following command:

      $ oc create -f self-node-remediation-subscription.yaml
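
Because the approval strategy is set to Manual, the install plan has to be approved before the CSV can complete the installation. The following is a minimal sketch of one way to do that from the CLI. The function name approve_installplans is hypothetical, and the sketch assumes that pending install plans are the ones whose spec.approved field is false.

```shell
# approve_installplans NAMESPACE  (hypothetical helper)
# Approves every install plan in NAMESPACE that is still waiting for approval.
approve_installplans() {
  ns=$1
  # Collect the names of install plans whose spec.approved field is false.
  pending=$(oc get installplan -n "$ns" \
    -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}')
  for plan in $pending; do
    # Setting spec.approved to true lets the installation proceed.
    oc patch installplan "$plan" -n "$ns" --type merge \
      -p '{"spec":{"approved":true}}'
  done
}
```

For the namespace used in this procedure, the call would be approve_installplans self-node-remediation. Review the pending install plans before approving them if the catalog might offer a version that you do not want.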

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource:

    $ oc get csv -n self-node-remediation

    Example output

    NAME                            DISPLAY                          VERSION   REPLACES   PHASE
    self-node-remediation.v.0.4.0   Self Node Remediation Operator   v.0.4.0              Succeeded
  2. Verify that the Self Node Remediation Operator is up and running:

    $ oc get deploy -n self-node-remediation

    Example output

    NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
    self-node-remediation-controller-manager   1/1     1            1           28h
  3. Verify that the Self Node Remediation Operator created the SelfNodeRemediationConfig CR:

    $ oc get selfnoderemediationconfig -n self-node-remediation

    Example output

    NAME                           AGE
    self-node-remediation-config   28h
  4. Verify that each self node remediation pod is scheduled and running on each worker node:

    $ oc get daemonset -n self-node-remediation

    Example output

    NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    self-node-remediation-ds   3         3         3       3            3           <none>          28h

    This command is unsupported for the control plane nodes.

Configuring the Self Node Remediation Operator

The Self Node Remediation Operator creates the SelfNodeRemediationConfig CR and the SelfNodeRemediationTemplate Custom Resource Definition (CRD).

Understanding the Self Node Remediation Operator configuration

The Self Node Remediation Operator creates the SelfNodeRemediationConfig CR with the name self-node-remediation-config. The CR is created in the namespace of the Self Node Remediation Operator.

A change in the SelfNodeRemediationConfig CR re-creates the Self Node Remediation daemon set.

The SelfNodeRemediationConfig CR resembles the following YAML file:

  apiVersion: self-node-remediation.medik8s.io/v1alpha1
  kind: SelfNodeRemediationConfig
  metadata:
    name: self-node-remediation-config
    namespace: openshift-operators
  spec:
    safeTimeToAssumeNodeRebootedSeconds: 180 (1)
    watchdogFilePath: /dev/watchdog (2)
    isSoftwareRebootEnabled: true (3)
    apiServerTimeout: 15s (4)
    apiCheckInterval: 5s (5)
    maxApiErrorThreshold: 3 (6)
    peerApiServerTimeout: 5s (7)
    peerDialTimeout: 5s (8)
    peerRequestTimeout: 5s (9)
    peerUpdateInterval: 15m (10)

(1) Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
(2) Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path.

    If a watchdog device is unavailable, the SelfNodeRemediationConfig CR uses a software reboot.

(3) Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of isSoftwareRebootEnabled is set to true. To disable the software reboot, set the parameter value to false.
(4) Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be greater than or equal to 10 milliseconds.
(5) Specify the frequency to check connectivity with each API server. The interval must be greater than or equal to 1 second.
(6) Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be greater than or equal to 1.
(7) Specify the duration of the timeout for the peer to connect to the API server. The timeout duration must be greater than or equal to 10 milliseconds.
(8) Specify the duration of the timeout for establishing a connection with the peer. The timeout duration must be greater than or equal to 10 milliseconds.
(9) Specify the duration of the timeout to get a response from the peer. The timeout duration must be greater than or equal to 10 milliseconds.
(10) Specify the frequency to update peer information, such as IP address. The interval must be greater than or equal to 10 seconds.
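
Before editing the CR, it can help to check candidate values against the minimums listed above. The following sketch does that locally, without contacting the cluster. The helper names duration_to_ms and meets_minimum are hypothetical, and the parser is a simplification that accepts only a whole number followed by a single unit (ms, s, m, or h), not compound values such as 1m30s.

```shell
# duration_to_ms VALUE  (hypothetical helper)
# Converts a simple duration such as 10ms, 5s, 15m, or 1h to milliseconds.
duration_to_ms() {
  case $1 in
    *ms) echo $(( ${1%ms} )) ;;            # must be matched before *s
    *s)  echo $(( ${1%s} * 1000 )) ;;
    *m)  echo $(( ${1%m} * 60000 )) ;;
    *h)  echo $(( ${1%h} * 3600000 )) ;;
    *)   return 1 ;;                       # unit not recognized
  esac
}

# meets_minimum VALUE MINIMUM  (hypothetical helper)
# Succeeds when VALUE is greater than or equal to MINIMUM.
meets_minimum() {
  value_ms=$(duration_to_ms "$1") || return 2
  min_ms=$(duration_to_ms "$2") || return 2
  [ "$value_ms" -ge "$min_ms" ]
}

# Sample values from the CR above, checked against the documented minimums.
meets_minimum 15s 10ms && echo "apiServerTimeout ok"
meets_minimum 5s 1s && echo "apiCheckInterval ok"
meets_minimum 15m 10s && echo "peerUpdateInterval ok"
```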

You can edit the self-node-remediation-config CR that is created by the Self Node Remediation Operator. However, when you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs:

  controllers.SelfNodeRemediationConfig
  ignoring selfnoderemediationconfig CRs that are not named self-node-remediation-config
  or not in the namespace of the operator:
  openshift-operators {"selfnoderemediationconfig":
  "openshift-operators/selfnoderemediationconfig-copy"}

Understanding the Self Node Remediation Template configuration

The Self Node Remediation Operator also creates the SelfNodeRemediationTemplate Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available:

ResourceDeletion

This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. ResourceDeletion is the default remediation strategy.

NodeDeletion

This remediation strategy is deprecated and will be removed in a future release. In the current release, the ResourceDeletion strategy is used even if the NodeDeletion strategy is selected.

The Self Node Remediation Operator creates a SelfNodeRemediationTemplate CR named self-node-remediation-resource-deletion-template, which the ResourceDeletion remediation strategy uses.

The SelfNodeRemediationTemplate CR resembles the following YAML file:

  apiVersion: self-node-remediation.medik8s.io/v1alpha1
  kind: SelfNodeRemediationTemplate
  metadata:
    creationTimestamp: "2022-03-02T08:02:40Z"
    name: self-node-remediation-<remediation_object>-deletion-template (1)
    namespace: openshift-operators
  spec:
    template:
      spec:
        remediationStrategy: <remediation_strategy> (2)

(1) Specifies the type of remediation template based on the remediation strategy. Replace <remediation_object> with either resource or node; for example, self-node-remediation-resource-deletion-template.
(2) Specifies the remediation strategy. The remediation strategy is ResourceDeletion.

Troubleshooting the Self Node Remediation Operator

General troubleshooting

Issue

You want to troubleshoot issues with the Self Node Remediation Operator.

Resolution

Check the Operator logs.
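
One way to view the Operator logs from the CLI is sketched below. The deployment name matches the one shown in the installation verification output. The function name snr_operator_logs and the default of 100 lines are hypothetical choices for this example.

```shell
# snr_operator_logs NAMESPACE [LINES]  (hypothetical helper)
# Prints the most recent log lines from all containers of the
# Operator's controller manager deployment.
snr_operator_logs() {
  oc logs deployment/self-node-remediation-controller-manager \
    -n "$1" --all-containers --tail="${2:-100}"
}
```

For example, snr_operator_logs openshift-operators 200 prints the last 200 lines when the Operator is installed in the openshift-operators namespace.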

Checking the daemon set

Issue

The Self Node Remediation Operator is installed but the daemon set is not available.

Resolution

Check the Operator logs for errors or warnings.

Unsuccessful remediation

Issue

An unhealthy node was not remediated.

Resolution

Verify that the SelfNodeRemediation CR was created by running the following command:

  $ oc get snr -A

If the MachineHealthCheck controller did not create the SelfNodeRemediation CR when the node turned unhealthy, check the logs of the MachineHealthCheck controller. Additionally, ensure that the MachineHealthCheck CR includes the required specification to use the remediation template.

If the SelfNodeRemediation CR was created, ensure that its name matches the unhealthy node or the machine object.
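
The name check described above can be scripted. The following is a minimal sketch: the function name snr_exists_for_node is hypothetical, and it only compares CR names across all namespaces against the name that you pass in, which can be a node name or a machine name.

```shell
# snr_exists_for_node NAME  (hypothetical helper)
# Succeeds when a SelfNodeRemediation CR named exactly NAME exists
# in any namespace.
snr_exists_for_node() {
  oc get snr -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
    | grep -qxF -- "$1"
}
```

A call such as snr_exists_for_node <node_name> && echo "CR found" reports whether a matching CR exists. Replace <node_name> with the name of the unhealthy node.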

Daemon set and other Self Node Remediation Operator resources exist even after uninstalling the Operator

Issue

The Self Node Remediation Operator resources, such as the daemon set, configuration CR, and the remediation template CR, exist even after uninstalling the Operator.

Resolution

To remove the Self Node Remediation Operator resources, delete the resources by running the following commands for each resource type:

  $ oc delete ds <self-node-remediation-ds> -n <namespace>
  $ oc delete snrc <self-node-remediation-config> -n <namespace>
  $ oc delete snrt <self-node-remediation-template> -n <namespace>

Gathering data about the Self Node Remediation Operator

To collect debugging information about the Self Node Remediation Operator, use the must-gather tool. For information about the must-gather image for the Self Node Remediation Operator, see Gathering data about specific features.
