- Topology Aware Lifecycle Manager for cluster updates
- About the Topology Aware Lifecycle Manager configuration
- About managed policies used with Topology Aware Lifecycle Manager
- Installing the Topology Aware Lifecycle Manager by using the web console
- Installing the Topology Aware Lifecycle Manager by using the CLI
- About the ClusterGroupUpgrade CR
- Update policies on managed clusters
- Using the container image pre-cache feature
- Troubleshooting the Topology Aware Lifecycle Manager
Topology Aware Lifecycle Manager for cluster updates
You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of multiple single-node OpenShift clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.
Topology Aware Lifecycle Manager is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
About the Topology Aware Lifecycle Manager configuration
The Topology Aware Lifecycle Manager (TALM) manages the deployment of Red Hat Advanced Cluster Management (RHACM) policies for one or more OKD clusters. Using TALM in a large network of clusters allows the phased rollout of policies to the clusters in limited batches. This helps to minimize possible service disruptions when updating. With TALM, you can control the following actions:
The timing of the update
The number of RHACM-managed clusters
The subset of managed clusters to apply the policies to
The update order of the clusters
The set of policies remediated to the cluster
The order of policies remediated to the cluster
TALM supports the orchestration of the OKD y-stream and z-stream updates, and day-two operations on y-streams and z-streams.
About managed policies used with Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) uses RHACM policies for cluster updates.
TALM can be used to manage the rollout of any policy CR where the remediationAction field is set to inform. Supported use cases include the following:
Manual user creation of policy CRs
Automatically generated policies from the PolicyGenTemplate custom resource definition (CRD)
For policies that update an Operator subscription with manual approval, TALM provides additional functionality that approves the installation of the updated Operator.
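For example, a managed policy might contain an Operator Subscription similar to the following sketch. The names are illustrative; the key point is that installPlanApproval is set to Manual, so TALM approves the resulting install plan on the spoke cluster:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ptp-operator-subscription   # illustrative name
  namespace: openshift-ptp
spec:
  channel: "stable"
  name: ptp-operator
  installPlanApproval: Manual   # TALM approves the InstallPlan that this setting would otherwise leave pending
  source: redhat-operators
  sourceNamespace: openshift-marketplace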
For more information about managed policies, see Policy Overview in the RHACM documentation.
For more information about the PolicyGenTemplate CRD, see the “About the PolicyGenTemplate” section in “Deploying distributed units at scale in a disconnected environment”.
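The following is a minimal sketch of a managed policy in inform mode. The policy name matches the samples used later in this section; the object templates are omitted for brevity:
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: policy1-common-cluster-version-policy
  namespace: default
spec:
  remediationAction: inform   # must be inform for TALM to manage the rollout
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: common-cluster-version-policy-config
        spec:
          remediationAction: inform
          severity: low
          object-templates: []   # ClusterVersion objects omitted for brevity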
Installing the Topology Aware Lifecycle Manager by using the web console
You can use the OKD web console to install the Topology Aware Lifecycle Manager.
Prerequisites
Install the latest version of the RHACM Operator.
Set up a hub cluster with a disconnected registry.
Log in as a user with cluster-admin privileges.
Procedure
In the OKD web console, navigate to Operators → OperatorHub.
Search for the Topology Aware Lifecycle Manager from the list of available Operators, and then click Install.
Keep the default selection of Installation mode [“All namespaces on the cluster (default)”] and Installed Namespace (“openshift-operators”) to ensure that the Operator is installed properly.
Click Install.
Verification
To confirm that the installation is successful:
Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the All Namespaces namespace and its status is Succeeded.
If the Operator is not installed successfully:
Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the logs in any containers in the cluster-group-upgrades-controller-manager pod that are reporting issues.
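Alternatively, you can fetch the same logs from the CLI. This sketch uses the deployment and container names that appear elsewhere in this document:
$ oc logs -n openshift-operators \
  deployment/cluster-group-upgrades-controller-manager -c manager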
Installing the Topology Aware Lifecycle Manager by using the CLI
You can use the OpenShift CLI (oc) to install the Topology Aware Lifecycle Manager (TALM).
Prerequisites
Install the OpenShift CLI (oc).
Install the latest version of the RHACM Operator.
Set up a hub cluster with disconnected registry.
Log in as a user with cluster-admin privileges.
Procedure
Create a Subscription CR:
Define the Subscription CR and save the YAML file, for example, talm-subscription.yaml:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: openshift-topology-aware-lifecycle-manager-subscription
namespace: openshift-operators
spec:
channel: "stable"
name: topology-aware-lifecycle-manager
source: redhat-operators
sourceNamespace: openshift-marketplace
Create the Subscription CR by running the following command:
$ oc create -f talm-subscription.yaml
Verification
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n openshift-operators
Example output
NAME DISPLAY VERSION REPLACES PHASE
topology-aware-lifecycle-manager.4.10.0-202206301927 Topology Aware Lifecycle Manager 4.10.0-202206301927 Succeeded
Verify that TALM is up and running:
$ oc get deploy -n openshift-operators
Example output
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
openshift-operators cluster-group-upgrades-controller-manager 1/1 1 1 14s
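If you script the installation, you can optionally wait until the deployment reports that it is available, for example:
$ oc wait deployment cluster-group-upgrades-controller-manager \
  -n openshift-operators --for=condition=Available --timeout=300s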
About the ClusterGroupUpgrade CR
The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade CR for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade CR:
Clusters in the group
Blocking ClusterGroupUpgrade CRs
Applicable list of managed policies
Number of concurrent updates
Applicable canary updates
Actions to perform before and after the update
Update timing
As TALM works through remediation of the policies to the specified clusters, the ClusterGroupUpgrade
CR can have the following states:
UpgradeNotStarted
UpgradeCannotStart
UpgradeNotCompleted
UpgradeTimedOut
UpgradeCompleted
PrecachingRequired
After TALM completes a cluster update, the cluster does not update again under the control of the same ClusterGroupUpgrade CR.
The UpgradeNotStarted state
The initial state of the ClusterGroupUpgrade CR is UpgradeNotStarted.
TALM builds a remediation plan based on the following fields:
The clusterSelector field specifies the labels of the clusters that you want to update.
The clusters field specifies a list of clusters to update.
The canaries field specifies the clusters for canary updates.
The maxConcurrency field specifies the number of clusters to update in a batch.
You can use the clusters and the clusterSelector fields together to create a combined list of clusters.
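For example, the following spec fragment selects spoke1 explicitly and adds any managed cluster that carries the upgrade=true label. The label is illustrative; the selector format matches the troubleshooting examples later in this document:
spec:
  clusters:          # explicit list of clusters
    - spoke1
  clusterSelector:   # adds any managed cluster with this label
    - upgrade=true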
The remediation plan starts with the clusters listed in the canaries field. Each canary cluster forms a single-cluster batch.
Any failure during the update of a canary cluster stops the update process.
The ClusterGroupUpgrade CR transitions to the UpgradeNotCompleted state after the remediation plan is successfully created and after the enable field is set to true. At this point, TALM starts to update the non-compliant clusters with the specified managed policies.
You can only make changes to the ClusterGroupUpgrade CR while it is in the UpgradeNotStarted state.
Sample ClusterGroupUpgrade CR in the UpgradeNotStarted state
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-upgrade-complete
namespace: default
spec:
clusters: (1)
- spoke1
enable: false
managedPolicies: (2)
- policy1-common-cluster-version-policy
- policy2-common-nto-sub-policy
remediationStrategy: (3)
canaries: (4)
- spoke1
maxConcurrency: 1 (5)
timeout: 240
status: (6)
conditions:
- message: The ClusterGroupUpgrade CR is not enabled
reason: UpgradeNotStarted
status: "False"
type: Ready
copiedPolicies:
- cgu-upgrade-complete-policy1-common-cluster-version-policy
- cgu-upgrade-complete-policy2-common-nto-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-nto-sub-policy
namespace: default
placementBindings:
- cgu-upgrade-complete-policy1-common-cluster-version-policy
- cgu-upgrade-complete-policy2-common-nto-sub-policy
placementRules:
- cgu-upgrade-complete-policy1-common-cluster-version-policy
- cgu-upgrade-complete-policy2-common-nto-sub-policy
remediationPlan:
- - spoke1
1 Defines the list of clusters to update.
2 Lists the user-defined set of policies to remediate.
3 Defines the specifics of the cluster updates.
4 Defines the clusters for canary updates.
5 Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters plus the number of remaining non-canary clusters divided by the maxConcurrency value, rounded up. Clusters that are already compliant with all the managed policies are excluded from the remediation plan.
6 Displays information about the status of the updates.
The UpgradeCannotStart state
In the UpgradeCannotStart state, the update cannot start for one of the following reasons:
Blocking CRs are missing from the system
Blocking CRs have not yet finished
The UpgradeNotCompleted state
In the UpgradeNotCompleted state, TALM enforces the policies following the remediation plan defined in the UpgradeNotStarted state.
Enforcing the policies for subsequent batches starts immediately after all the clusters of the current batch are compliant with all the managed policies. If the batch times out, TALM moves on to the next batch. The timeout value of a batch is the spec.timeout field divided by the number of batches in the remediation plan. For example, with spec.timeout set to 240 and four batches in the remediation plan, each batch times out after 60 minutes.
The managed policies apply in the order that they are listed in the managedPolicies field of the ClusterGroupUpgrade CR.
Sample ClusterGroupUpgrade CR in the UpgradeNotCompleted state
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-upgrade-complete
namespace: default
spec:
clusters:
- spoke1
enable: true (1)
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-nto-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status: (2)
conditions:
- message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
reason: UpgradeNotCompleted
status: "False"
type: Ready
copiedPolicies:
- cgu-upgrade-complete-policy1-common-cluster-version-policy
- cgu-upgrade-complete-policy2-common-nto-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-nto-sub-policy
namespace: default
placementBindings:
- cgu-upgrade-complete-policy1-common-cluster-version-policy
- cgu-upgrade-complete-policy2-common-nto-sub-policy
placementRules:
- cgu-upgrade-complete-policy1-common-cluster-version-policy
- cgu-upgrade-complete-policy2-common-nto-sub-policy
remediationPlan:
- - spoke1
status:
currentBatch: 1
remediationPlanForBatch: (3)
spoke1: 0
1 The update starts when the value of the spec.enable field is true.
2 The status fields change accordingly when the update begins.
3 Lists the clusters in the batch and the index of the policy that is currently being applied to each cluster. The index of the policies starts with 0 and follows the order of the status.managedPoliciesForUpgrade list.
The UpgradeTimedOut state
In the UpgradeTimedOut state, TALM checks every hour if all the policies for the ClusterGroupUpgrade CR are compliant. The checks continue until the ClusterGroupUpgrade CR is deleted or the updates are completed. The periodic checks allow the updates to complete if they get prolonged due to network, CPU, or other issues.
TALM transitions to the UpgradeTimedOut state in two cases:
When the current batch contains canary updates and the cluster in the batch does not comply with all the managed policies within the batch timeout.
When the clusters do not comply with the managed policies within the timeout value specified in the remediationStrategy field.
If the policies are compliant, TALM transitions to the UpgradeCompleted state.
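You can watch these state transitions by polling the status conditions of the CR, for example with the sample CR name used in this section:
$ oc get cgu -n default cgu-upgrade-complete -ojsonpath='{.status.conditions}'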
The UpgradeCompleted state
In the UpgradeCompleted state, the cluster updates are complete.
Sample ClusterGroupUpgrade CR in the UpgradeCompleted state
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-upgrade-complete
namespace: default
spec:
actions:
afterCompletion:
deleteObjects: true (1)
clusters:
- spoke1
enable: true
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-nto-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status: (2)
conditions:
- message: The ClusterGroupUpgrade CR has all clusters compliant with all the managed policies
reason: UpgradeCompleted
status: "True"
type: Ready
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-nto-sub-policy
namespace: default
remediationPlan:
- - spoke1
status:
remediationPlanForBatch:
spoke1: -2 (3)
1 The value of the spec.actions.afterCompletion.deleteObjects field is true by default. After the update is completed, TALM deletes the underlying RHACM objects that were created during the update. This option prevents the RHACM hub from continuously checking for compliance after a successful update.
2 The status fields show that the updates completed successfully.
3 Displays that all the policies are applied to the cluster.
The PrecachingRequired state
In the PrecachingRequired state, the clusters need to have images pre-cached before the update can start. For more information about pre-caching, see the “Using the container image pre-cache feature” section.
Blocking ClusterGroupUpgrade CRs
You can create multiple ClusterGroupUpgrade CRs and control their order of application.
For example, if you create ClusterGroupUpgrade CR C that blocks the start of ClusterGroupUpgrade CR A, then ClusterGroupUpgrade CR A cannot start until the status of ClusterGroupUpgrade CR C becomes UpgradeCompleted.
One ClusterGroupUpgrade CR can have multiple blocking CRs. In this case, all the blocking CRs must complete before the upgrade for the current CR can start.
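To see whether a blocking CR has completed, you can inspect its status conditions. This sketch uses the cgu-c example from the following procedure:
$ oc get cgu -n default cgu-c -ojsonpath='{.status.conditions}'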
Prerequisites
Install the Topology Aware Lifecycle Manager (TALM).
Provision one or more managed clusters.
Log in as a user with cluster-admin privileges.
Create RHACM policies in the hub cluster.
Procedure
Save the content of the ClusterGroupUpgrade CRs in the cgu-a.yaml, cgu-b.yaml, and cgu-c.yaml files.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-a
namespace: default
spec:
blockingCRs: (1)
- name: cgu-c
namespace: default
clusters:
- spoke1
- spoke2
- spoke3
enable: false
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
remediationStrategy:
canaries:
- spoke1
maxConcurrency: 2
timeout: 240
status:
conditions:
- message: The ClusterGroupUpgrade CR is not enabled
reason: UpgradeNotStarted
status: "False"
type: Ready
copiedPolicies:
- cgu-a-policy1-common-cluster-version-policy
- cgu-a-policy2-common-pao-sub-policy
- cgu-a-policy3-common-ptp-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-pao-sub-policy
namespace: default
- name: policy3-common-ptp-sub-policy
namespace: default
placementBindings:
- cgu-a-policy1-common-cluster-version-policy
- cgu-a-policy2-common-pao-sub-policy
- cgu-a-policy3-common-ptp-sub-policy
placementRules:
- cgu-a-policy1-common-cluster-version-policy
- cgu-a-policy2-common-pao-sub-policy
- cgu-a-policy3-common-ptp-sub-policy
remediationPlan:
- - spoke1
- - spoke2
1 Defines the blocking CRs. The cgu-a update cannot start until cgu-c is complete.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-b
namespace: default
spec:
blockingCRs: (1)
- name: cgu-a
namespace: default
clusters:
- spoke4
- spoke5
enable: false
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
- policy4-common-sriov-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status:
conditions:
- message: The ClusterGroupUpgrade CR is not enabled
reason: UpgradeNotStarted
status: "False"
type: Ready
copiedPolicies:
- cgu-b-policy1-common-cluster-version-policy
- cgu-b-policy2-common-pao-sub-policy
- cgu-b-policy3-common-ptp-sub-policy
- cgu-b-policy4-common-sriov-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-pao-sub-policy
namespace: default
- name: policy3-common-ptp-sub-policy
namespace: default
- name: policy4-common-sriov-sub-policy
namespace: default
placementBindings:
- cgu-b-policy1-common-cluster-version-policy
- cgu-b-policy2-common-pao-sub-policy
- cgu-b-policy3-common-ptp-sub-policy
- cgu-b-policy4-common-sriov-sub-policy
placementRules:
- cgu-b-policy1-common-cluster-version-policy
- cgu-b-policy2-common-pao-sub-policy
- cgu-b-policy3-common-ptp-sub-policy
- cgu-b-policy4-common-sriov-sub-policy
remediationPlan:
- - spoke4
- - spoke5
status: {}
1 The cgu-b update cannot start until cgu-a is complete.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-c
namespace: default
spec: (1)
clusters:
- spoke6
enable: false
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
- policy4-common-sriov-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status:
conditions:
- message: The ClusterGroupUpgrade CR is not enabled
reason: UpgradeNotStarted
status: "False"
type: Ready
copiedPolicies:
- cgu-c-policy1-common-cluster-version-policy
- cgu-c-policy4-common-sriov-sub-policy
managedPoliciesCompliantBeforeUpgrade:
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy4-common-sriov-sub-policy
namespace: default
placementBindings:
- cgu-c-policy1-common-cluster-version-policy
- cgu-c-policy4-common-sriov-sub-policy
placementRules:
- cgu-c-policy1-common-cluster-version-policy
- cgu-c-policy4-common-sriov-sub-policy
remediationPlan:
- - spoke6
status: {}
1 The cgu-c update does not have any blocking CRs. TALM starts the cgu-c update when the enable field is set to true.
Create the ClusterGroupUpgrade CRs by running the following command for each relevant CR:
$ oc apply -f <name>.yaml
Start the update process by running the following command for each relevant CR:
$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \
--type merge -p '{"spec":{"enable":true}}'
The following examples show ClusterGroupUpgrade CRs where the enable field is set to true:
Example for cgu-a with blocking CRs
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-a
namespace: default
spec:
blockingCRs:
- name: cgu-c
namespace: default
clusters:
- spoke1
- spoke2
- spoke3
enable: true
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
remediationStrategy:
canaries:
- spoke1
maxConcurrency: 2
timeout: 240
status:
conditions:
- message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet
completed: [cgu-c]' (1)
reason: UpgradeCannotStart
status: "False"
type: Ready
copiedPolicies:
- cgu-a-policy1-common-cluster-version-policy
- cgu-a-policy2-common-pao-sub-policy
- cgu-a-policy3-common-ptp-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-pao-sub-policy
namespace: default
- name: policy3-common-ptp-sub-policy
namespace: default
placementBindings:
- cgu-a-policy1-common-cluster-version-policy
- cgu-a-policy2-common-pao-sub-policy
- cgu-a-policy3-common-ptp-sub-policy
placementRules:
- cgu-a-policy1-common-cluster-version-policy
- cgu-a-policy2-common-pao-sub-policy
- cgu-a-policy3-common-ptp-sub-policy
remediationPlan:
- - spoke1
- - spoke2
status: {}
1 Shows the list of blocking CRs.
Example for cgu-b with blocking CRs
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-b
namespace: default
spec:
blockingCRs:
- name: cgu-a
namespace: default
clusters:
- spoke4
- spoke5
enable: true
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
- policy4-common-sriov-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status:
conditions:
- message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet
completed: [cgu-a]' (1)
reason: UpgradeCannotStart
status: "False"
type: Ready
copiedPolicies:
- cgu-b-policy1-common-cluster-version-policy
- cgu-b-policy2-common-pao-sub-policy
- cgu-b-policy3-common-ptp-sub-policy
- cgu-b-policy4-common-sriov-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-pao-sub-policy
namespace: default
- name: policy3-common-ptp-sub-policy
namespace: default
- name: policy4-common-sriov-sub-policy
namespace: default
placementBindings:
- cgu-b-policy1-common-cluster-version-policy
- cgu-b-policy2-common-pao-sub-policy
- cgu-b-policy3-common-ptp-sub-policy
- cgu-b-policy4-common-sriov-sub-policy
placementRules:
- cgu-b-policy1-common-cluster-version-policy
- cgu-b-policy2-common-pao-sub-policy
- cgu-b-policy3-common-ptp-sub-policy
- cgu-b-policy4-common-sriov-sub-policy
remediationPlan:
- - spoke4
- - spoke5
status: {}
1 Shows the list of blocking CRs.
Example for cgu-c with blocking CRs
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-c
namespace: default
spec:
clusters:
- spoke6
enable: true
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
- policy4-common-sriov-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status:
conditions:
- message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant (1)
reason: UpgradeNotCompleted
status: "False"
type: Ready
copiedPolicies:
- cgu-c-policy1-common-cluster-version-policy
- cgu-c-policy4-common-sriov-sub-policy
managedPoliciesCompliantBeforeUpgrade:
- policy2-common-pao-sub-policy
- policy3-common-ptp-sub-policy
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy4-common-sriov-sub-policy
namespace: default
placementBindings:
- cgu-c-policy1-common-cluster-version-policy
- cgu-c-policy4-common-sriov-sub-policy
placementRules:
- cgu-c-policy1-common-cluster-version-policy
- cgu-c-policy4-common-sriov-sub-policy
remediationPlan:
- - spoke6
status:
currentBatch: 1
remediationPlanForBatch:
spoke6: 0
1 The cgu-c update does not have any blocking CRs.
Update policies on managed clusters
The Topology Aware Lifecycle Manager (TALM) remediates a set of inform policies for the clusters specified in the ClusterGroupUpgrade CR. TALM remediates inform policies by making enforce copies of the managed RHACM policies. Each copied policy has its own corresponding RHACM placement rule and RHACM placement binding.
One by one, TALM adds each cluster from the current batch to the placement rule that corresponds with the applicable managed policy. If a cluster is already compliant with a policy, TALM skips applying that policy on the compliant cluster. TALM then moves on to applying the next policy to the non-compliant cluster. After TALM completes the updates in a batch, all clusters are removed from the placement rules associated with the copied policies. Then, the update of the next batch starts.
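You can list the enforce copies and their placement objects on the hub cluster. This sketch assumes the ClusterGroupUpgrade CR is in the default namespace and relies on the cgu- name prefix shown in the samples:
$ oc get policies,placementbindings,placementrules -n default | grep cgu-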
If a spoke cluster does not report any compliant state to RHACM, the managed policies on the hub cluster can be missing status information that TALM needs. TALM handles these cases in the following ways:
If a policy’s status.compliant field is missing, TALM ignores the policy and adds a log entry. Then, TALM continues looking at the policy’s status.status field.
If a policy’s status.status is missing, TALM produces an error.
If a cluster’s compliance status is missing in the policy’s status.status field, TALM considers that cluster to be non-compliant with that policy.
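You can inspect both fields directly on the hub cluster, for example with one of the sample policy names:
$ oc get policy -n default policy1-common-cluster-version-policy \
  -ojsonpath='{.status.compliant}{"\n"}{.status.status}'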
For more information about RHACM policies, see Policy overview.
Additional resources
For more information about the PolicyGenTemplate CRD, see About the PolicyGenTemplate.
Applying update policies to managed clusters
You can update your managed clusters by applying your policies.
Prerequisites
Install the Topology Aware Lifecycle Manager (TALM).
Provision one or more managed clusters.
Log in as a user with cluster-admin privileges.
Create RHACM policies in the hub cluster.
Procedure
Save the contents of the ClusterGroupUpgrade CR in the cgu-1.yaml file.
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-1
namespace: default
spec:
managedPolicies: (1)
- policy1-common-cluster-version-policy
- policy2-common-nto-sub-policy
- policy3-common-ptp-sub-policy
- policy4-common-sriov-sub-policy
enable: false
clusters: (2)
- spoke1
- spoke2
- spoke5
- spoke6
remediationStrategy:
maxConcurrency: 2 (3)
timeout: 240 (4)
1 The name of the policies to apply.
2 The list of clusters to update.
3 The maxConcurrency field signifies the number of clusters updated at the same time.
4 The update timeout in minutes.
Create the ClusterGroupUpgrade CR by running the following command:
$ oc create -f cgu-1.yaml
Check if the ClusterGroupUpgrade CR was created in the hub cluster by running the following command:
$ oc get cgu --all-namespaces
Example output
NAMESPACE NAME AGE
default cgu-1 8m55s
Check the status of the update by running the following command:
$ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
Example output
{
"computedMaxConcurrency": 2,
"conditions": [
{
"lastTransitionTime": "2022-02-25T15:34:07Z",
"message": "The ClusterGroupUpgrade CR is not enabled", (1)
"reason": "UpgradeNotStarted",
"status": "False",
"type": "Ready"
}
],
"copiedPolicies": [
"cgu-policy1-common-cluster-version-policy",
"cgu-policy2-common-nto-sub-policy",
"cgu-policy3-common-ptp-sub-policy",
"cgu-policy4-common-sriov-sub-policy"
],
"managedPoliciesContent": {
"policy1-common-cluster-version-policy": "null",
"policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]",
"policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]",
"policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]"
},
"managedPoliciesForUpgrade": [
{
"name": "policy1-common-cluster-version-policy",
"namespace": "default"
},
{
"name": "policy2-common-nto-sub-policy",
"namespace": "default"
},
{
"name": "policy3-common-ptp-sub-policy",
"namespace": "default"
},
{
"name": "policy4-common-sriov-sub-policy",
"namespace": "default"
}
],
"managedPoliciesNs": {
"policy1-common-cluster-version-policy": "default",
"policy2-common-nto-sub-policy": "default",
"policy3-common-ptp-sub-policy": "default",
"policy4-common-sriov-sub-policy": "default"
},
"placementBindings": [
"cgu-policy1-common-cluster-version-policy",
"cgu-policy2-common-nto-sub-policy",
"cgu-policy3-common-ptp-sub-policy",
"cgu-policy4-common-sriov-sub-policy"
],
"placementRules": [
"cgu-policy1-common-cluster-version-policy",
"cgu-policy2-common-nto-sub-policy",
"cgu-policy3-common-ptp-sub-policy",
"cgu-policy4-common-sriov-sub-policy"
],
"precaching": {
"spec": {}
},
"remediationPlan": [
[
"spoke1",
"spoke2"
],
[
"spoke5",
"spoke6"
]
],
"status": {}
}
1 The spec.enable field in the ClusterGroupUpgrade CR is set to false.
Check the status of the policies by running the following command:
$ oc get policies -A
Example output
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
default cgu-policy1-common-cluster-version-policy enforce 17m (1)
default cgu-policy2-common-nto-sub-policy enforce 17m
default cgu-policy3-common-ptp-sub-policy enforce 17m
default cgu-policy4-common-sriov-sub-policy enforce 17m
default policy1-common-cluster-version-policy inform NonCompliant 15h
default policy2-common-nto-sub-policy inform NonCompliant 15h
default policy3-common-ptp-sub-policy inform NonCompliant 18m
default policy4-common-sriov-sub-policy inform NonCompliant 18m
1 The spec.remediationAction field of policies currently applied on the clusters is set to enforce. The managed policies in inform mode from the ClusterGroupUpgrade CR remain in inform mode during the update.
Change the value of the spec.enable field to true by running the following command:
$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-1 \
--patch '{"spec":{"enable":true}}' --type=merge
Verification
Check the status of the update again by running the following command:
$ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq
Example output
{
"computedMaxConcurrency": 2,
"conditions": [ (1)
{
"lastTransitionTime": "2022-02-25T15:34:07Z",
"message": "The ClusterGroupUpgrade CR has upgrade policies that are still non compliant",
"reason": "UpgradeNotCompleted",
"status": "False",
"type": "Ready"
}
],
"copiedPolicies": [
"cgu-policy1-common-cluster-version-policy",
"cgu-policy2-common-nto-sub-policy",
"cgu-policy3-common-ptp-sub-policy",
"cgu-policy4-common-sriov-sub-policy"
],
"managedPoliciesContent": {
"policy1-common-cluster-version-policy": "null",
"policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]",
"policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]",
"policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]"
},
"managedPoliciesForUpgrade": [
{
"name": "policy1-common-cluster-version-policy",
"namespace": "default"
},
{
"name": "policy2-common-nto-sub-policy",
"namespace": "default"
},
{
"name": "policy3-common-ptp-sub-policy",
"namespace": "default"
},
{
"name": "policy4-common-sriov-sub-policy",
"namespace": "default"
}
],
"managedPoliciesNs": {
"policy1-common-cluster-version-policy": "default",
"policy2-common-nto-sub-policy": "default",
"policy3-common-ptp-sub-policy": "default",
"policy4-common-sriov-sub-policy": "default"
},
"placementBindings": [
"cgu-policy1-common-cluster-version-policy",
"cgu-policy2-common-nto-sub-policy",
"cgu-policy3-common-ptp-sub-policy",
"cgu-policy4-common-sriov-sub-policy"
],
"placementRules": [
"cgu-policy1-common-cluster-version-policy",
"cgu-policy2-common-nto-sub-policy",
"cgu-policy3-common-ptp-sub-policy",
"cgu-policy4-common-sriov-sub-policy"
],
"precaching": {
"spec": {}
},
"remediationPlan": [
[
"spoke1",
"spoke2"
],
[
"spoke5",
"spoke6"
]
],
"status": {
"currentBatch": 1,
"currentBatchStartedAt": "2022-02-25T15:54:16Z",
"remediationPlanForBatch": {
"spoke1": 0,
"spoke2": 1
},
"startedAt": "2022-02-25T15:54:16Z"
}
}
1 Reflects the update progress of the current batch. Run this command again to receive updated information about the progress.
If the policies include Operator subscriptions, you can check the installation progress directly on the single-node cluster.
Export the KUBECONFIG file of the single-node cluster you want to check the installation progress for by running the following command:
$ export KUBECONFIG=<cluster_kubeconfig_absolute_path>
Check all the subscriptions present on the single-node cluster and look for the one in the policy you are trying to install through the ClusterGroupUpgrade CR by running the following command:
$ oc get subs -A | grep -i <subscription_name>
Example output for cluster-logging policy
NAMESPACE           NAME             PACKAGE           SOURCE             CHANNEL
openshift-logging cluster-logging cluster-logging redhat-operators stable
If one of the managed policies includes a ClusterVersion CR, check the status of platform updates in the current batch by running the following command against the spoke cluster:
$ oc get clusterversion
Example output
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.5 True True 43s Working towards 4.9.7: 71 of 735 done (9% complete)
Check the Operator subscription by running the following command:
$ oc get subs -n <operator-namespace> <operator-subscription> -ojsonpath="{.status}"
Check the install plans present on the single-node cluster that is associated with the desired subscription by running the following command:
$ oc get installplan -n <subscription_namespace>
Example output for cluster-logging Operator
NAMESPACE           NAME            CSV                       APPROVAL   APPROVED
openshift-logging install-6khtw cluster-logging.5.3.3-4 Manual true (1)
1 The install plans have their Approval field set to Manual, and their Approved field changes from false to true after TALM approves the install plan.
Check if the cluster service version for the Operator of the policy that the ClusterGroupUpgrade CR is installing reached the Succeeded phase by running the following command:
$ oc get csv -n <operator_namespace>
Example output for OpenShift Logging Operator
NAME DISPLAY VERSION REPLACES PHASE
cluster-logging.5.4.2 Red Hat OpenShift Logging 5.4.2 Succeeded
Using the container image pre-cache feature
Clusters might have limited bandwidth to access the container image registry, which can cause a timeout before the updates are completed.
The time of the update is not set by TALM. You can apply the ClusterGroupUpgrade CR at the beginning of the update maintenance window.
The container image pre-caching starts when the preCaching field is set to true in the ClusterGroupUpgrade CR. After a successful pre-caching process, you can start remediating policies. The remediation actions start when the enable field is set to true.
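After pre-caching completes, you can enable the CR with the same patch pattern used earlier in this document. The namespace and CR name below come from the pre-caching example in the following section:
$ oc --namespace=ztp-group-du-sno patch clustergroupupgrade.ran.openshift.io/du-upgrade-4918 \
  --type merge --patch '{"spec":{"enable":true}}'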
The pre-caching process can be in the following statuses:
PrecacheNotStarted
This is the initial state that all clusters are automatically assigned to on the first reconciliation pass of the ClusterGroupUpgrade CR.
In this state, TALM deletes any pre-caching namespace and hub view resources of spoke clusters that remain from previous incomplete updates. TALM then creates a new ManagedClusterView resource for the spoke pre-caching namespace to verify its deletion in the PrecachePreparing state.
PrecachePreparing
Cleaning up any remaining resources from previous incomplete updates is in progress.
PrecacheStarting
Pre-caching job prerequisites and the job are created.
PrecacheActive
The job is in the “Active” state.
PrecacheSucceeded
The pre-cache job has succeeded.
PrecacheTimeout
The artifact pre-caching is only partially done.
PrecacheUnrecoverableError
The job ends with a non-zero exit code.
Creating a ClusterGroupUpgrade CR with pre-caching
The pre-cache feature allows the required container images to be present on the spoke cluster before the update starts.
Prerequisites
Install the Topology Aware Lifecycle Manager (TALM).
Provision one or more managed clusters.
Log in as a user with
cluster-admin
privileges.
Procedure
Save the contents of the ClusterGroupUpgrade CR with the preCaching field set to true in the clustergroupupgrades-group-du.yaml file:
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: du-upgrade-4918
namespace: ztp-group-du-sno
spec:
preCaching: true (1)
clusters:
- cnfdb1
- cnfdb2
enable: false
managedPolicies:
- du-upgrade-platform-upgrade
remediationStrategy:
maxConcurrency: 2
timeout: 240
1 The preCaching field is set to true, which enables TALM to pull the container images before starting the update.
When you want to start the update, apply the ClusterGroupUpgrade CR by running the following command:
$ oc apply -f clustergroupupgrades-group-du.yaml
Verification
Check if the ClusterGroupUpgrade CR exists in the hub cluster by running the following command:
$ oc get cgu -A
Example output
NAMESPACE NAME AGE
ztp-group-du-sno du-upgrade-4918 10s (1)
1 The CR is created.
Check the status of the pre-caching task by running the following command:
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
Example output
{
"conditions": [
{
"lastTransitionTime": "2022-01-27T19:07:24Z",
"message": "Precaching is not completed (required)", (1)
"reason": "PrecachingRequired",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2022-01-27T19:07:24Z",
"message": "Precaching is required and not done",
"reason": "PrecachingNotDone",
"status": "False",
"type": "PrecachingDone"
},
{
"lastTransitionTime": "2022-01-27T19:07:34Z",
"message": "Pre-caching spec is valid and consistent",
"reason": "PrecacheSpecIsWellFormed",
"status": "True",
"type": "PrecacheSpecValid"
}
],
"precaching": {
"clusters": [
"cnfdb1" (2)
],
"spec": {
"platformImage": "image.example.io"},
"status": {
"cnfdb1": "Active"}
}
}
1 Displays that the update is in progress.
2 Displays the list of identified clusters.
Check the status of the pre-caching job by running the following command on the spoke cluster:
$ oc get jobs,pods -n openshift-talo-pre-cache
Example output
NAME COMPLETIONS DURATION AGE
job.batch/pre-cache 0/1 3m10s 3m10s
NAME READY STATUS RESTARTS AGE
pod/pre-cache--1-9bmlr 1/1 Running 0 3m10s
Check the status of the ClusterGroupUpgrade CR by running the following command:
$ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -o jsonpath='{.status}'
Example output
"conditions": [
{
"lastTransitionTime": "2022-01-27T19:30:41Z",
"message": "The ClusterGroupUpgrade CR has all clusters compliant with all the managed policies",
"reason": "UpgradeCompleted",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2022-01-27T19:28:57Z",
"message": "Precaching is completed",
"reason": "PrecachingCompleted",
"status": "True",
"type": "PrecachingDone" (1)
}
1 The pre-cache tasks are done.
Troubleshooting the Topology Aware Lifecycle Manager
The Topology Aware Lifecycle Manager (TALM) is an OKD Operator that remediates RHACM policies. When issues occur, use the oc adm must-gather command to gather details and logs and to begin debugging the issues.
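For example, to collect the default data set into a local directory (the destination path is illustrative):
$ oc adm must-gather --dest-dir=/tmp/talm-must-gather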
For more information about related topics, see the following documentation:
Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix
The “Troubleshooting Operator issues” section
General troubleshooting
You can determine the cause of the problem by reviewing the following questions:
Is the configuration that you are applying supported?
Are the RHACM and the OKD versions compatible?
Are the TALM and RHACM versions compatible?
Which of the following components is causing the problem?
To ensure that the ClusterGroupUpgrade configuration is functional, you can do the following:
Create the ClusterGroupUpgrade CR with the spec.enable field set to false.
Wait for the status to be updated and go through the troubleshooting questions.
If everything looks as expected, set the spec.enable field to true in the ClusterGroupUpgrade CR.
After you set the spec.enable field to true, the update procedure starts and you can no longer edit the spec fields of the ClusterGroupUpgrade CR.
Cannot modify the ClusterGroupUpgrade CR
Issue
You cannot edit the ClusterGroupUpgrade CR after enabling the update.
Resolution
Restart the procedure by performing the following steps:
Remove the old ClusterGroupUpgrade CR by running the following command:
$ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>
Check and fix the existing issues with the managed clusters and policies.
Ensure that all the clusters are managed clusters and available.
Ensure that all the policies exist and have the spec.remediationAction field set to inform.
Create a new ClusterGroupUpgrade CR with the correct configurations:
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
Managed policies
Checking managed policies on the system
Issue
You want to check if you have the correct managed policies on the system.
Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'
Example output
["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]
Checking remediationAction mode
Issue
You want to check if the remediationAction field is set to inform in the spec of the managed policies.
Resolution
Run the following command:
$ oc get policies --all-namespaces
Example output
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
default policy1-common-cluster-version-policy inform NonCompliant 5d21h
default policy2-common-nto-sub-policy inform Compliant 5d21h
default policy3-common-ptp-sub-policy inform NonCompliant 5d21h
default policy4-common-sriov-sub-policy inform NonCompliant 5d21h
Checking policy compliance state
Issue
You want to check the compliance state of policies.
Resolution
Run the following command:
$ oc get policies --all-namespaces
Example output
NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE
default policy1-common-cluster-version-policy inform NonCompliant 5d21h
default policy2-common-nto-sub-policy inform Compliant 5d21h
default policy3-common-ptp-sub-policy inform NonCompliant 5d21h
default policy4-common-sriov-sub-policy inform NonCompliant 5d21h
Clusters
Checking if managed clusters are present
Issue
You want to check if the clusters in the ClusterGroupUpgrade CR are managed clusters.
Resolution
Run the following command:
$ oc get managedclusters
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true https://api.hub.example.com:6443 True Unknown 13d
spoke1 true https://api.spoke1.example.com:6443 True True 13d
spoke3 true https://api.spoke3.example.com:6443 True True 27h
Alternatively, check the TALM manager logs:
Get the name of the TALM manager by running the following command:
$ oc get pod -n openshift-operators
Example output
NAME READY STATUS RESTARTS AGE
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp 2/2 Running 0 45m
Check the TALM manager logs by running the following command:
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
Example output
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} (1)
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
1 The error message shows that the cluster is not a managed cluster.
Checking if managed clusters are available
Issue
You want to check if the managed clusters specified in the ClusterGroupUpgrade CR are available.
Resolution
Run the following command:
$ oc get managedclusters
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
local-cluster true https://api.hub.testlab.com:6443 True Unknown 13d
spoke1 true https://api.spoke1.testlab.com:6443 True True 13d (1)
spoke3 true https://api.spoke3.testlab.com:6443 True True 27h (1)
1 The value of the AVAILABLE field is True for the managed clusters.
Checking clusterSelector
Issue
You want to check if the clusterSelector field specified in the ClusterGroupUpgrade CR matches at least one of the managed clusters.
Resolution
Run the following command:
$ oc get managedcluster --selector=upgrade=true (1)
1 The label for the clusters you want to update is upgrade:true.
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
spoke1 true https://api.spoke1.testlab.com:6443 True True 13d
spoke3 true https://api.spoke3.testlab.com:6443 True True 27h
Checking if canary clusters are present
Issue
You want to check if the canary clusters are present in the list of clusters.
Example ClusterGroupUpgrade CR
spec:
clusters:
- spoke1
- spoke3
clusterSelector:
- upgrade2=true
remediationStrategy:
canaries:
- spoke3
maxConcurrency: 2
timeout: 240
Resolution
Run the following commands:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
Example output
["spoke1", "spoke3"]
Check if the canary clusters are present in the list of clusters that match the clusterSelector labels by running the following command:
$ oc get managedcluster --selector=upgrade2=true
Example output
NAME HUB ACCEPTED MANAGED CLUSTER URLS JOINED AVAILABLE AGE
spoke1 true https://api.spoke1.testlab.com:6443 True True 13d
spoke3 true https://api.spoke3.testlab.com:6443 True True 27h
A cluster can be present in spec.clusters and also be matched by the spec.clusterSelector label.
Checking the pre-caching status on spoke clusters
Check the status of pre-caching by running the following command on the spoke cluster:
$ oc get jobs,pods -n openshift-talo-pre-cache
Remediation Strategy
Checking if remediationStrategy is present in the ClusterGroupUpgrade CR
Issue
You want to check if the remediationStrategy is present in the ClusterGroupUpgrade CR.
Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'
Example output
{"maxConcurrency":2, "timeout":240}
Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR
Issue
You want to check if the maxConcurrency is specified in the ClusterGroupUpgrade CR.
Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'
Example output
2
Topology Aware Lifecycle Manager
Checking condition message and status in the ClusterGroupUpgrade CR
Issue
You want to check the value of the status.conditions field in the ClusterGroupUpgrade CR.
Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
Example output
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"The ClusterGroupUpgrade CR has managed policies that are missing:[policyThatDoesntExist]", "reason":"UpgradeCannotStart", "status":"False", "type":"Ready"}
Checking corresponding copied policies
Issue
You want to check if every policy from status.managedPoliciesForUpgrade has a corresponding policy in status.copiedPolicies.
Resolution
Run the following command:
$ oc get cgu lab-upgrade -oyaml
Example output
status:
…
copiedPolicies:
- lab-upgrade-policy3-common-ptp-sub-policy
managedPoliciesForUpgrade:
- name: policy3-common-ptp-sub-policy
namespace: default
Checking if status.remediationPlan was computed
Issue
You want to check if status.remediationPlan is computed.
Resolution
Run the following command:
$ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'
Example output
[["spoke2", "spoke3"]]
Errors in the TALM manager container
Issue
You want to check the logs of the manager container of TALM.
Resolution
Run the following command:
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
Example output
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} (1)
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
1 Displays the error.
Additional resources
For information about troubleshooting, see OpenShift Container Platform Troubleshooting Operator Issues.
For more information about using Topology Aware Lifecycle Manager in the ZTP workflow, see Updating managed policies with Topology Aware Lifecycle Manager.