Troubleshooting operating system issues

OKD runs on FCOS. You can follow these procedures to troubleshoot problems related to the operating system.

Investigating kernel crashes

The kdump service, included in the kexec-tools package, provides a crash-dumping mechanism. You can use this service to save the contents of a system’s memory for later analysis.

The x86_64 architecture supports kdump in General Availability (GA) status, whereas other architectures support kdump in Technology Preview (TP) status.

The following table provides details about the support level of kdump for different architectures.

Table 1. Kdump support in FCOS
ArchitectureSupport level

x86_64

GA

aarch64

TP

s390x

TP

ppc64le

TP

Kdump support, for the preceding three architectures in the table, is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Enabling kdump

FCOS ships with the kexec-tools package, but manual configuration is required to enable the kdump service.

Procedure

Perform the following steps to enable kdump on FCOS.

  1. To reserve memory for the crash kernel during the first kernel booting, provide kernel arguments by entering the following command:

    1. # rpm-ostree kargs --append='crashkernel=256M'

    For the ppc64le platform, the recommended value for crashkernel is crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G.

  2. Optional: To write the crash dump over the network or to some other location, rather than to the default local /var/crash location, edit the /etc/kdump.conf configuration file.

    If your node uses LUKS-encrypted devices, you must use network dumps as kdump does not support saving crash dumps to LUKS-encrypted devices.

    For details on configuring the kdump service, see the comments in /etc/sysconfig/kdump, /etc/kdump.conf, and the kdump.conf manual page.

  3. Enable the kdump systemd service.

    1. # systemctl enable kdump.service
  4. Reboot your system.

    1. # systemctl reboot
  5. Ensure that kdump has loaded a crash kernel by checking that the kdump.service systemd service has started and exited successfully and that the command, cat /sys/kernel/kexec_crash_loaded, prints the value 1.

Enabling kdump on day-1

The kdump service is intended to be enabled per node to debug kernel problems. Because there are costs to having kdump enabled, and these costs accumulate with each additional kdump-enabled node, it is recommended that the kdump service only be enabled on each node as needed. Potential costs of enabling the kdump service on each node include:

  • Less available RAM due to memory being reserved for the crash kernel.

  • Node unavailability while the kernel is dumping the core.

  • Additional storage space being used to store the crash dumps.

If you are aware of the downsides and trade-offs of having the kdump service enabled, it is possible to enable kdump in a cluster-wide fashion. Although machine-specific machine configs are not yet supported, you can use a systemd unit in a MachineConfig object as a day-1 customization and have kdump enabled on all nodes in the cluster. You can create a MachineConfig object and inject that object into the set of manifest files used by Ignition during cluster setup.

See “Customizing nodes” in the Installing → Installation configuration section for more information and examples on how to use Ignition configs.

Procedure

Create a MachineConfig object for cluster-wide configuration:

  1. Create a Butane config file, 99-worker-kdump.bu, that configures and enables kdump:

    1. variant: openshift
    2. version: 4.0
    3. metadata:
    4. name: 99-worker-kdump (1)
    5. labels:
    6. machineconfiguration.openshift.io/role: worker (1)
    7. openshift:
    8. kernel_arguments: (2)
    9. - crashkernel=256M
    10. storage:
    11. files:
    12. - path: /etc/kdump.conf (3)
    13. mode: 0644
    14. overwrite: true
    15. contents:
    16. inline: |
    17. path /var/crash
    18. core_collector makedumpfile -l --message-level 7 -d 31
    19. - path: /etc/sysconfig/kdump (4)
    20. mode: 0644
    21. overwrite: true
    22. contents:
    23. inline: |
    24. KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
    25. KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable" (5)
    26. KEXEC_ARGS="-s"
    27. KDUMP_IMG="vmlinuz"
    28. systemd:
    29. units:
    30. - name: kdump.service
    31. enabled: true
    1Replace worker with master in both locations when creating a MachineConfig object for control plane nodes.
    2Provide kernel arguments to reserve memory for the crash kernel. You can add other kernel arguments if necessary. For the ppc64le platform, the recommended value for crashkernel is crashkernel=2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G.
    3If you want to change the contents of /etc/kdump.conf from the default, include this section and modify the inline subsection accordingly.
    4If you want to change the contents of /etc/sysconfig/kdump from the default, include this section and modify the inline subsection accordingly.
    5For the ppc64le platform, replace nr_cpus=1 with maxcpus=1, which is not supported on this platform.
  2. Use Butane to generate a machine config YAML file, 99-worker-kdump.yaml, containing the configuration to be delivered to the nodes:

    1. $ butane 99-worker-kdump.bu -o 99-worker-kdump.yaml
  3. Put the YAML file into the <installation_directory>/manifests/ directory during cluster setup. You can also create this MachineConfig object after cluster setup with the YAML file:

    1. $ oc create -f 99-worker-kdump.yaml

Testing the kdump configuration

See the Capturing the Dump section in the Fedora documentation for kdump.

Analyzing a core dump

See the Dump Analysis section in the Fedora documentation for kdump.

It is recommended to perform vmcore analysis on a separate Fedora system.

Additional resources

Debugging Ignition failures

If a machine cannot be provisioned, Ignition fails and FCOS will boot into the emergency shell. Use the following procedure to get debugging information.

Procedure

  1. Run the following command to show which service units failed:

    1. $ systemctl --failed
  2. Optional: Run the following command on an individual service unit to find out more information:

    1. $ journalctl -u <unit>.service