Troubleshooting
You can debug Velero custom resources (CRs) by using the OpenShift CLI tool or the Velero CLI tool. The Velero CLI tool provides more detailed logs and information.
You can check installation issues, backup and restore CR issues, and Restic issues.
You can collect logs, CR information, and Prometheus metric data by using the must-gather tool.
Debugging Velero resources with the OpenShift CLI tool
You can debug a failed backup or restore by checking Velero custom resources (CRs) and the Velero pod log with the OpenShift CLI tool.
Velero CRs
Use the oc describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:
$ oc describe <velero_cr> <cr_name>
Velero pod logs
Use the oc logs command to retrieve the Velero pod logs:
$ oc logs pod/<velero>
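The Velero pod log can be long. As a minimal sketch, the following shell function keeps only the lines logged at error level. It assumes that each log line carries a level=<level> field, as in the level=error message quoted later in this section. The function name velero_errors is hypothetical.

```shell
# Hypothetical helper: print only error-level lines from a log read on stdin.
# Assumes each log line carries a level=<level> field.
velero_errors() {
  grep -F 'level=error' || true   # "|| true": finding no matches is not a failure
}

# Usage (requires cluster access; <velero> is the pod name, as above):
#   oc logs pod/<velero> | velero_errors
```

Filtering on the literal string keeps the helper independent of the rest of the log line format.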
Velero pod debug logs
Use the oc edit command to set the Velero pod logs to debug level:
Edit the Velero deployment:
$ oc edit deployment/velero -n {namespace}
Add --log-level and debug to the spec.template.spec.containers.velero.args array:
apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
      - name: velero
        image: velero/velero:latest
        command:
          - /velero
        args:
          - server
          - --log-level
          - debug
...
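If you prefer not to edit the deployment interactively, the same change can be sketched as a JSON Patch applied with oc patch. This is an illustration, not a documented procedure: it assumes that the velero container is the first entry (index 0) in the containers array and that the deployment is in the openshift-adp namespace unless you pass another one. The function names are hypothetical.

```shell
# Hypothetical helper: print a JSON Patch that appends "--log-level" and "debug"
# to the args array of the first container (index 0 is an assumption).
velero_debug_patch() {
  cat <<'EOF'
[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--log-level"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "debug"}
]
EOF
}

# Hypothetical helper: apply the patch to the Velero deployment.
# $1 is the namespace; openshift-adp is assumed when it is omitted.
set_velero_debug() {
  oc patch deployment/velero -n "${1:-openshift-adp}" --type=json -p "$(velero_debug_patch)"
}
```

Check the container order with oc get deployment/velero -o yaml before applying the patch, because the index in the path must match the velero container.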
Debugging Velero resources with the Velero CLI tool
You can debug Backup and Restore custom resources (CRs) and retrieve logs with the Velero CLI tool.
The Velero CLI tool provides more detailed information than the OpenShift CLI tool.
Syntax
Use the oc exec command to run a Velero CLI command:
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero <backup_restore_cr> <command> <cr_name>
Example
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
You can specify velero-<pod> -n openshift-adp in place of $(oc get pods -n openshift-adp -o name | grep velero).
Example
$ oc exec velero-<pod> -n openshift-adp -- ./velero backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
Help option
Use the velero --help option to list all Velero CLI commands:
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) -- ./velero --help
Describe command
Use the velero describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero <backup_restore_cr> describe <cr_name>
Example
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
Logs command
Use the velero logs command to retrieve the logs of a Backup or Restore CR:
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero <backup_restore_cr> logs <cr_name>
Example
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf
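Because every Velero CLI call in this section repeats the same oc exec prefix, you might wrap it in a shell function. The following is a minimal sketch, assuming the openshift-adp namespace and that the first pod whose name contains velero is the Velero pod. The function name velero_cli is hypothetical.

```shell
# Hypothetical wrapper: run a Velero CLI command inside the Velero pod.
# Assumes the openshift-adp namespace and a pod whose name contains "velero".
velero_cli() {
  pod=$(oc get pods -n openshift-adp -o name | grep velero | head -n 1)
  if [ -z "$pod" ]; then
    echo "no Velero pod found in openshift-adp" >&2
    return 1
  fi
  # Pass all arguments through to the velero binary in the pod.
  oc exec "$pod" -n openshift-adp -- ./velero "$@"
}

# Usage (requires cluster access):
#   velero_cli backup describe <cr_name>
#   velero_cli restore logs <cr_name>
```

The wrapper passes -n openshift-adp to oc exec explicitly, so it does not depend on the current project.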
Installation issues
You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.
Backup storage contains invalid directories
The Velero pod log displays the error message, Backup storage contains invalid top-level directories.
Cause
The object storage contains top-level directories that are not Velero directories.
Solution
If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the spec.backupLocations.velero.objectStorage.prefix parameter in the DataProtectionApplication manifest.
Incorrect AWS credentials
The oadp-aws-registry pod log displays the error message, InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.
The Velero pod log displays the error message, NoCredentialProviders: no valid providers in chain.
Cause
The credentials-velero file used to create the Secret object is incorrectly formatted.
Solution
Ensure that the credentials-velero file is correctly formatted, as in the following example:
Example credentials-velero file
[default] (1)
aws_access_key_id=AKIAIOSFODNN7EXAMPLE (2)
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
1 AWS default profile.
2 Do not enclose the values with quotation marks (", ').
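You can check the two formatting rules above before you create the Secret object. The following is a minimal sketch of such a check, not an official validator: it only verifies that a [default] profile header exists and that no aws_* value starts with a quotation mark. The function name check_credentials_file is hypothetical.

```shell
# Hypothetical check: succeed only if the file has a [default] profile header
# and no aws_* key whose value begins with a single or double quotation mark.
check_credentials_file() {
  file=$1
  if ! grep -q '^\[default\]' "$file"; then
    echo "missing [default] profile" >&2
    return 1
  fi
  if grep -Eq "^[[:space:]]*aws_[a-z_]+[[:space:]]*=[[:space:]]*[\"']" "$file"; then
    echo "values must not be enclosed in quotation marks" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_credentials_file credentials-velero && echo "format looks correct"
```

The check does not contact AWS, so a file that passes can still contain an invalid key.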
Backup and Restore CR issues
You might encounter these common issues with Backup and Restore custom resources (CRs).
Backup CR cannot retrieve volume
The Backup CR displays the error message, InvalidVolume.NotFound: The volume 'vol-xxxx' does not exist.
Cause
The persistent volume (PV) and the snapshot locations are in different regions.
Solution
Edit the value of the spec.snapshotLocations.velero.config.region key in the DataProtectionApplication manifest so that the snapshot location is in the same region as the PV.
Create a new Backup CR.
Backup CR status remains in progress
The status of a Backup CR remains in the InProgress phase and does not complete.
Cause
If a backup is interrupted, it cannot be resumed.
Solution
Retrieve the details of the Backup CR:
$ oc exec $(oc get pods -n openshift-adp -o name | grep velero) \
-- ./velero backup describe <backup>
Delete the Backup CR:
$ oc delete backup <backup> -n openshift-adp
You do not need to clean up the backup location because a Backup CR in progress has not uploaded files to object storage.
Create a new Backup CR.
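To find candidates for this procedure, you can list the Backup CRs that currently report the InProgress phase. The following is a minimal sketch, assuming the openshift-adp namespace. The function name in_progress_backups is hypothetical. A listed backup might simply still be running, so check how long it has been in progress before you delete it.

```shell
# Hypothetical helper: print the names of Backup CRs whose phase is InProgress.
# Assumes the openshift-adp namespace used elsewhere in this section.
in_progress_backups() {
  oc get backups -n openshift-adp --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
    awk '$2 == "InProgress" { print $1 }'
}

# Usage (requires cluster access):
#   in_progress_backups
```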
Restic issues
You might encounter these issues when you back up applications with Restic.
Restic permission error for NFS data volumes with root_squash enabled
The Restic pod log displays the error message, controller=pod-volume-backup error="fork/exec /usr/bin/restic: permission denied".
Cause
If your NFS data volumes have root_squash enabled, Restic maps to nfsnobody and does not have permission to create backups.
Solution
You can resolve this issue by creating a supplemental group for Restic and adding the group ID to the DataProtectionApplication manifest:
Create a supplemental group for Restic on the NFS data volume.
Set the setgid bit on the NFS directories so that group ownership is inherited.
Add the spec.configuration.restic.supplementalGroups parameter and the group ID to the DataProtectionApplication manifest, as in the following example:
spec:
  configuration:
    restic:
      enable: true
      supplementalGroups:
      - <group_id> (1)
1 Specify the supplemental group ID.
Wait for the Restic pods to restart so that the changes are applied.
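The group ownership and setgid steps can be sketched with standard chgrp and chmod commands. This is an illustration under assumptions: it is meant to run where the exported directory is writable (for example, on the NFS server), the group already exists, and the function name prepare_restic_dir is hypothetical. Adjust it to your storage setup.

```shell
# Hypothetical helper for the NFS side of the procedure.
# $1: exported directory, $2: numeric ID of the supplemental group created for Restic.
prepare_restic_dir() {
  dir=$1
  gid=$2
  # Give the supplemental group ownership of the existing tree.
  chgrp -R "$gid" "$dir" || return 1
  # Set the setgid bit on every directory so that new files inherit the group.
  find "$dir" -type d -exec chmod g+s {} +
}

# Usage:
#   prepare_restic_dir /exports/data <group_id>
```

Use the same group ID in the supplementalGroups list of the DataProtectionApplication manifest.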
Restore CR of Restic backup is “PartiallyFailed”, “Failed”, or remains “InProgress”
The Restore CR of a Restic backup completes with a PartiallyFailed or Failed status or it remains InProgress and does not complete.
If the status is PartiallyFailed or Failed, the Velero pod log displays the error message, level=error msg="unable to successfully complete restic restores of pod's volumes".
If the status is InProgress, the Restore CR logs are unavailable and no errors appear in the Restic pod logs.
Cause
The DeploymentConfig object redeploys the Restore pod, causing the Restore CR to fail.
Solution
Create a Restore CR that excludes the ReplicationController and DeploymentConfig resources:
$ velero restore create --from-backup=<backup> -n openshift-adp \ (1)
--include-namespaces <namespace> \ (2)
--exclude-resources replicationcontroller,deploymentconfig \
--restore-volumes=true
1 Specify the name of the Backup CR.
2 Specify the include-namespaces in the Backup CR.
Verify that the status of the Restore CR is Completed:
$ oc get restore -n openshift-adp <restore> -o jsonpath='{.status.phase}'
Create a Restore CR that includes the ReplicationController and DeploymentConfig resources:
$ velero restore create --from-backup=<backup> -n openshift-adp \
--include-namespaces <namespace> \
--include-resources replicationcontroller,deploymentconfig \
--restore-volumes=true
Verify that the status of the Restore CR is Completed:
$ oc get restore -n openshift-adp <restore> -o jsonpath='{.status.phase}'
Verify that the backup resources have been restored:
$ oc get all -n <namespace>
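The two verification steps in this procedure check the Restore CR phase once. If you script the procedure, you might poll the phase until the restore finishes. The following is a minimal sketch that reuses the oc get restore command shown above. It handles only the phases named in this section (Completed, PartiallyFailed, Failed), and the function name, polling interval, and retry count are hypothetical choices.

```shell
# Hypothetical helper: poll a Restore CR until it reaches a final phase.
# $1: Restore CR name, $2: seconds between polls (default 10), $3: maximum polls (default 60).
wait_for_restore() {
  name=$1
  interval=${2:-10}
  tries=${3:-60}
  while [ "$tries" -gt 0 ]; do
    phase=$(oc get restore "$name" -n openshift-adp -o jsonpath='{.status.phase}')
    case $phase in
      Completed) echo "$phase"; return 0 ;;
      PartiallyFailed|Failed) echo "$phase"; return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$interval"
  done
  echo "timed out waiting for restore $name" >&2
  return 1
}

# Usage (requires cluster access):
#   wait_for_restore <restore> && oc get all -n <namespace>
```

The function returns a nonzero status on failure or timeout, so the second Restore CR is created only if you chain it after a successful first restore.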
Restic Backup CR cannot be recreated after bucket is emptied
If you create a Restic Backup CR for a namespace, empty the S3 bucket, and then recreate the Backup CR for the same namespace, the recreated Backup CR fails.
The velero pod log displays the error message, msg="Error checking repository for stale locks".
Cause
Velero does not create the Restic repository from the ResticRepository manifest if the Restic directories are deleted on object storage. See Velero issue 4421 for details.
Using the must-gather tool
You can collect logs, metrics, and information about OADP custom resources by using the must-gather tool.
The must-gather data must be attached to all customer cases.
You can run the must-gather tool with the following data collection options:
Full must-gather data collection collects Prometheus metrics, pod logs, and Velero CR information for all namespaces where the OADP Operator is installed.
Essential must-gather data collection collects pod logs and Velero CR information for a specific duration of time, for example, one hour or 24 hours. Prometheus metrics and duplicate logs are not included.
must-gather data collection with timeout. Data collection can take a long time if there are many failed Backup CRs. You can improve performance by setting a timeout value.
Prometheus metrics data dump downloads an archive file containing the metrics data collected by Prometheus.
Prerequisites
You must be logged in to the OKD cluster as a user with the cluster-admin role.
You must have the OpenShift CLI (oc) installed.
Procedure
Navigate to the directory where you want to store the must-gather data.
Run the oc adm must-gather command for one of the following data collection options:
Full must-gather data collection, including Prometheus metrics:
$ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0
The data is saved as must-gather/must-gather.tar.gz. You can upload this file to a support case on the Red Hat Customer Portal.
Essential must-gather data collection, without Prometheus metrics, for a specific time duration:
$ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
-- /usr/bin/gather_<time>_essential (1)
1 Specify the time in hours. Allowed values are 1h, 6h, 24h, 72h, or all, for example, gather_1h_essential or gather_all_essential.
must-gather data collection with timeout:
$ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
-- /usr/bin/gather_with_timeout <timeout> (1)
1 Specify a timeout value in seconds.
Prometheus metrics data dump:
$ oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
-- /usr/bin/gather_metrics_dump
This operation can take a long time. The data is saved as must-gather/metrics/prom_data.tar.gz.
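The essential data collection script name embeds the time value, so a mistyped value produces a script path that does not exist in the image. As a minimal sketch, the following function builds the path only for the allowed values listed above. The function name essential_gather_script is hypothetical.

```shell
# Hypothetical helper: print the essential-collection script path for a time value.
# Allowed values are the ones listed in this procedure: 1h, 6h, 24h, 72h, or all.
essential_gather_script() {
  case $1 in
    1h|6h|24h|72h|all) echo "/usr/bin/gather_${1}_essential" ;;
    *) echo "unsupported time value: $1" >&2; return 1 ;;
  esac
}

# Usage (requires cluster access):
#   oc adm must-gather --image=registry.access.redhat.com/oadp-operator/oadp-must-gather-rhel8:v1.0 \
#     -- "$(essential_gather_script 24h)"
```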
Viewing metrics data with the Prometheus console
You can view the metrics data with the Prometheus console.
Procedure
Decompress the prom_data.tar.gz file:
$ tar -xvzf must-gather/metrics/prom_data.tar.gz
Create a local Prometheus instance:
$ make prometheus-run
The command outputs the Prometheus URL.
Output
Started Prometheus on http://localhost:9090
Launch a web browser and navigate to the URL to view the data by using the Prometheus web console.
After you have viewed the data, delete the Prometheus instance and data:
$ make prometheus-cleanup