Troubleshooting

You can debug Velero custom resources (CRs) by using the OpenShift CLI tool or the Velero CLI tool. The Velero CLI tool provides more detailed logs and information.

You can check installation issues, backup and restore CR issues, and Restic issues.

You can collect logs, CR information, and Prometheus metric data by using the must-gather tool.

Debugging Velero resources with the OpenShift CLI tool

You can debug a failed backup or restore by checking Velero custom resources (CRs) and the Velero pod log with the OpenShift CLI tool.

Velero CRs

Use the oc describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:

  $ oc describe <velero_cr> <cr_name>

Velero pod logs

Use the oc logs command to retrieve the Velero pod logs:

  $ oc logs pod/<velero>

Velero pod debug logs

Use the oc edit command to set the Velero pod log level to debug:

  1. Edit the Velero deployment:

     $ oc edit deployment/velero -n openshift-adp

  2. Add --log-level and debug to the spec.template.spec.containers.velero.args array:

     apiVersion: apps/v1
     kind: Deployment
     ...
     spec:
       template:
         spec:
           containers:
           - name: velero
             image: velero/velero:latest
             command:
             - /velero
             args:
             - server
             - --log-level
             - debug
     ...

Debugging Velero resources with the Velero CLI tool

You can debug Backup and Restore custom resources (CRs) and retrieve logs with the Velero CLI tool.

The Velero CLI tool provides more detailed information than the OpenShift CLI tool.

Syntax

Use the oc exec command to run a Velero CLI command:

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
    -- ./velero <backup_restore_cr> <command> <cr_name>

Example

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
    -- ./velero backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql

You can specify velero-<pod> -n openshift-adp in place of $(oc get pods -n openshift-adp -o name | grep velero).

Example

  $ oc exec velero-<pod> -n openshift-adp -- ./velero backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql

Help option

Use the velero --help option to list all Velero CLI commands:

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) -- ./velero --help

Describe command

Use the velero describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
    -- ./velero <backup_restore_cr> describe <cr_name>

Example

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
    -- ./velero backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql

Logs command

Use the velero logs command to retrieve the logs of a Backup or Restore CR:

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
    -- ./velero <backup_restore_cr> logs <cr_name>

Example

  $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
    -- ./velero restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf

Installation issues

You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.

Backup storage contains invalid directories

The Velero pod log displays the error message, Backup storage contains invalid top-level directories.

Cause

The object storage contains top-level directories that are not Velero directories.

Solution

If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the spec.backupLocations.velero.objectStorage.prefix parameter in the DataProtectionApplication manifest.
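
The following DataProtectionApplication excerpt is a minimal sketch of that setting; the <bucket_name> placeholder and the velero prefix value are illustrative, not required names:

Example prefix setting in the DataProtectionApplication manifest

  spec:
    backupLocations:
    - velero:
        objectStorage:
          bucket: <bucket_name>
          prefix: velero (1)

  (1) Velero stores its backup data under this prefix instead of at the top level of the bucket.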

Incorrect AWS credentials

The oadp-aws-registry pod log displays the error message, InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.

The Velero pod log displays the error message, NoCredentialProviders: no valid providers in chain.

Cause

The credentials-velero file used to create the Secret object is incorrectly formatted.

Solution

Ensure that the credentials-velero file is correctly formatted, as in the following example:

Example credentials-velero file

  [default] (1)
  aws_access_key_id=AKIAIOSFODNN7EXAMPLE (2)
  aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

  (1) AWS default profile.
  (2) Do not enclose the values with quotation marks (", ').

Backup and Restore CR issues

You might encounter these common issues with Backup and Restore custom resources (CRs).

Backup CR cannot retrieve volume

The Backup CR displays the error message, InvalidVolume.NotFound: The volume ‘vol-xxxx’ does not exist.

Cause

The persistent volume (PV) and the snapshot locations are in different regions.

Solution

  1. Edit the value of the spec.snapshotLocations.velero.config.region key in the DataProtectionApplication manifest so that the snapshot location is in the same region as the PV. See the example after this procedure.

  2. Create a new Backup CR.
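
The following DataProtectionApplication excerpt is a minimal sketch of the setting described in step 1; the aws provider and the us-east-1 region are assumed values for illustration:

Example snapshot location in the DataProtectionApplication manifest

  spec:
    snapshotLocations:
    - velero:
        provider: aws
        config:
          region: us-east-1 (1)

  (1) Set the region to the same region as the PV.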

Backup CR status remains in progress

The status of a Backup CR remains in the InProgress phase and does not complete.

Cause

If a backup is interrupted, it cannot be resumed.

Solution

  1. Retrieve the details of the Backup CR:

     $ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
       -- ./velero backup describe <backup>

  2. Delete the Backup CR:

     $ oc delete backup <backup> -n openshift-adp

    You do not need to clean up the backup location because a Backup CR in progress has not uploaded files to object storage.

  3. Create a new Backup CR.

Restic issues

You might encounter these issues when you back up applications with Restic.

Restic permission error for NFS data volumes with root_squash enabled

The Restic pod log displays the error message, controller=pod-volume-backup error="fork/exec /usr/bin/restic: permission denied".

Cause

If your NFS data volumes have root_squash enabled, Restic maps to nfsnobody and does not have permission to create backups.

Solution

You can resolve this issue by creating a supplemental group for Restic and adding the group ID to the DataProtectionApplication manifest:

  1. Create a supplemental group for Restic on the NFS data volume.

  2. Set the setgid bit on the NFS directories so that group ownership is inherited. A command-line sketch of steps 1 and 2 follows this procedure.

  3. Add the spec.configuration.restic.supplementalGroups parameter and the group ID to the DataProtectionApplication manifest, as in the following example:

     spec:
       configuration:
         restic:
           enable: true
           supplementalGroups:
           - <group_id> (1)

     (1) Specify the supplemental group ID.

  4. Wait for the Restic pods to restart so that the changes are applied.
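
The following command-line sketch illustrates steps 1 and 2 on the NFS server; the group ID 6789 and the export path /exports/app-data are assumed values for illustration:

  # Create a supplemental group for Restic (assumed group ID).
  $ sudo groupadd -g 6789 restic-backup

  # Give the group ownership of the NFS export and allow group read/write (assumed path).
  $ sudo chgrp -R 6789 /exports/app-data
  $ sudo chmod -R g+rwX /exports/app-data

  # Set the setgid bit so that new files and directories inherit the group.
  $ sudo chmod g+s /exports/app-data

Use the same group ID, 6789 in this sketch, as the <group_id> value in step 3.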

Restore CR of Restic backup is “PartiallyFailed”, “Failed”, or remains “InProgress”

The Restore CR of a Restic backup completes with a PartiallyFailed or Failed status or it remains InProgress and does not complete.

If the status is PartiallyFailed or Failed, the Velero pod log displays the error message, level=error msg="unable to successfully complete restic restores of pod’s volumes".

If the status is InProgress, the Restore CR logs are unavailable and no errors appear in the Restic pod logs.

Cause

The DeploymentConfig object redeploys the Restore pod, causing the Restore CR to fail.

Solution

  1. Create a Restore CR that excludes the ReplicationController and DeploymentConfig resources:

     $ velero restore create --from-backup=<backup> -n openshift-adp \ (1)
       --include-namespaces <namespace> \ (2)
       --exclude-resources replicationcontroller,deploymentconfig \
       --restore-volumes=true

     (1) Specify the name of the Backup CR.
     (2) Specify the namespaces included in the Backup CR.

  2. Verify that the status of the Restore CR is Completed:

     $ oc get restore -n openshift-adp <restore> -o jsonpath='{.status.phase}'

  3. Create a Restore CR that includes the ReplicationController and DeploymentConfig resources:

     $ velero restore create --from-backup=<backup> -n openshift-adp \
       --include-namespaces <namespace> \
       --include-resources replicationcontroller,deploymentconfig \
       --restore-volumes=true

  4. Verify that the status of the Restore CR is Completed:

     $ oc get restore -n openshift-adp <restore> -o jsonpath='{.status.phase}'

  5. Verify that the backup resources have been restored:

     $ oc get all -n <namespace>

Restic Backup CR cannot be recreated after bucket is emptied

If you create a Restic Backup CR for a namespace, empty the S3 bucket, and then recreate the Backup CR for the same namespace, the recreated Backup CR fails.

The Velero pod log displays the error message, msg="Error checking repository for stale locks".
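
To confirm the error, you can search the Velero pod log. The following check is a minimal sketch that assumes the default openshift-adp namespace:

  $ oc logs deployment/velero -n openshift-adp | grep "stale locks"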

Cause

Velero does not create the Restic repository from the ResticRepository manifest if the Restic directories are deleted on object storage. See Velero issue 4421 for details.

Using the must-gather tool

You can collect logs, metrics, and information about OADP custom resources by using the must-gather tool.

The must-gather data must be attached to all customer cases.

You can run the must-gather tool with the following data collection options:

  • Full must-gather data collection collects Prometheus metrics, pod logs, and Velero CR information for all namespaces where the OADP Operator is installed.

  • Essential must-gather data collection collects pod logs and Velero CR information for a specific duration of time, for example, one hour or 24 hours. Prometheus metrics and duplicate logs are not included.

  • must-gather data collection with a timeout. Data collection can take a long time if there are many failed Backup CRs. You can improve performance by setting a timeout value.

  • Prometheus metrics data dump downloads an archive file containing the metrics data collected by Prometheus.

Prerequisites

  • You must be logged in to the OKD cluster as a user with the cluster-admin role.

  • You must have the OpenShift CLI (oc) installed.

Procedure

  1. Navigate to the directory where you want to store the must-gather data.

  2. Run the oc adm must-gather command for one of the following data collection options:

    • Full must-gather data collection, including Prometheus metrics:

      $ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0

      The data is saved as must-gather/must-gather.tar.gz. You can upload this file to a support case on the Red Hat Customer Portal.

    • Essential must-gather data collection, without Prometheus metrics, for a specific time duration:

      $ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
        -- /usr/bin/gather_<time>_essential (1)

      (1) Specify the time in hours. Allowed values are 1h, 6h, 24h, 72h, or all, for example, gather_1h_essential or gather_all_essential.

    • must-gather data collection with a timeout:

      $ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
        -- /usr/bin/gather_with_timeout <timeout> (1)

      (1) Specify a timeout value in seconds.

    • Prometheus metrics data dump:

      $ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
        -- /usr/bin/gather_metrics_dump

      This operation can take a long time. The data is saved as must-gather/metrics/prom_data.tar.gz.

Viewing metrics data with the Prometheus console

You can view the metrics data with the Prometheus console.

Procedure

  1. Decompress the prom_data.tar.gz file:

     $ tar -xvzf must-gather/metrics/prom_data.tar.gz

  2. Create a local Prometheus instance:

     $ make prometheus-run

    The command outputs the Prometheus URL.

    Output

     Started Prometheus on http://localhost:9090
  3. Launch a web browser and navigate to the URL to view the data by using the Prometheus web console.

  4. After you have viewed the data, delete the Prometheus instance and data:

     $ make prometheus-cleanup