Upgrade from v1.0.2 to v1.0.3

General information

There is an upgrade button on the Harvester GUI Dashboard page. For details, see Start an upgrade.

For air-gapped environment upgrades, see Prepare an air-gapped upgrade.

Known issues


1. Fail to download the upgrade image

Description

The download of the upgrade image cannot complete, or it fails with an error.


Related issue

Workaround

Delete the current upgrade and start over. See Restart an upgrade.
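
If you prefer the command line, the following is a minimal sketch of the same restart. It assumes the in-progress upgrade is tracked as an Upgrade custom resource (upgrades.harvesterhci.io) in the harvester-system namespace, as described in the Restart an upgrade page, and that <upgrade-name> is replaced with the name reported by the first command:

    # List the Upgrade objects and identify the one that is stuck
    kubectl get upgrades.harvesterhci.io -n harvester-system

    # Delete the stuck upgrade so a new one can be started from the GUI
    kubectl delete upgrades.harvesterhci.io <upgrade-name> -n harvester-system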


2. An upgrade is stuck with a node in the "Pre-drained" state (case 1)

Description

A node might be stuck in the Pre-drained state for a long time (> 30 minutes).

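Besides the Dashboard, the affected node can also be spotted from the command line: while a node is being pre-drained it is cordoned, so it reports SchedulingDisabled (this is standard Kubernetes behavior, not a Harvester-specific check):

    # The node being pre-drained is cordoned and shows SchedulingDisabled
    kubectl get nodes
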

This can be caused by an instance-manager-r-* pod on the node harvester-z7j2g that cannot be drained. To verify this:

  • Check the Rancher server logs:

    kubectl logs deployment/rancher -n cattle-system

    Example output:

    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
  • Verify that the pod longhorn-system/instance-manager-r-10dd59c4 is on the stuck node:

    kubectl get pod instance-manager-r-10dd59c4 -n longhorn-system -o=jsonpath='{.spec.nodeName}'

    Example output:

    harvester-z7j2g
  • Check for degraded volumes:

    kubectl get volumes -n longhorn-system

    Example output:

    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE              AGE
    pvc-08c34593-8225-4be6-9899-10a978df6ea1   attached   healthy      True        10485760      harvester-279l2   3d13h
    pvc-526600f5-bde2-4244-bb8e-7910385cbaeb   attached   healthy      True        21474836480   harvester-x9jqw   3d1h
    pvc-7b3fc2c3-30eb-48b8-8a98-11913f8314c2   attached   healthy      True        10737418240   harvester-x9jqw   3d
    pvc-8065ed6c-a077-472c-920e-5fe9eacff96e   attached   healthy      True        21474836480   harvester-x9jqw   3d
    pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599   attached   degraded     True        10737418240   harvester-x9jqw   2d23h
    pvc-9a6539b8-44e5-430e-9b24-ea8290cb13b7   attached   healthy      True        53687091200   harvester-x9jqw   3d13h

    Here we can see that the volume pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 is degraded.

Note

Users need to check all degraded volumes.
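
One way to list only the degraded volumes is to filter on the robustness reported in the Volume CR status. This is a sketch that assumes jq is available on the management host:

    # Print the names of all Longhorn volumes whose robustness is "degraded"
    kubectl get volumes -n longhorn-system -o json \
      | jq -r '.items[] | select(.status.robustness == "degraded") | .metadata.name'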

  • Check the replica status of the degraded volume:

    kubectl get replicas -n longhorn-system --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -o json | jq '.items[] | {replica: .metadata.name, healthyAt: .spec.healthyAt, nodeID: .spec.nodeID, state: .status.currentState}'

    Example output:

    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246",
      "healthyAt": "2022-07-25T07:33:16Z",
      "nodeID": "harvester-z7j2g",
      "state": "running"
    }
    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-22974d0f",
      "healthyAt": "",
      "nodeID": "harvester-279l2",
      "state": "running"
    }
    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5",
      "healthyAt": "",
      "nodeID": "harvester-x9jqw",
      "state": "stopped"
    }

    The only healthy replica here is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246, and it is on the node harvester-z7j2g. This confirms that the instance-manager-r-* pod is located on the node harvester-z7j2g and cannot be drained, because evicting it would take the volume's only healthy replica offline.
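
    The same check can be scripted. The sketch below (again assuming jq) prints only the replicas of the degraded volume that have a non-empty healthyAt timestamp, together with the node each one lives on:

    # Show only the healthy replicas of the volume and the nodes they are on
    kubectl get replicas -n longhorn-system \
      --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -o json \
      | jq -r '.items[] | select(.spec.healthyAt != "") | "\(.metadata.name) -> \(.spec.nodeID)"'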

Related issue

Workaround

We need to start the replica that is in the "Stopped" state. In the example above, the name of the stopped replica is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5.

  • Check the longhorn-manager logs, where we can see that the replica is waiting for a backing image. First, get the name of the manager pod on the replica's node:

    kubectl get pods -n longhorn-system --selector app=longhorn-manager --field-selector spec.nodeName=harvester-x9jqw

    Example output:

    NAME                     READY   STATUS    RESTARTS   AGE
    longhorn-manager-zmfbw   1/1     Running   0          3d10h
  • Get the pod logs:

    kubectl logs longhorn-manager-zmfbw -n longhorn-system | grep pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5

    Example output:

    (...)
    time="2022-07-28T04:35:34Z" level=debug msg="Prepare to create instance pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"
    time="2022-07-28T04:35:34Z" level=debug msg="Replica pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5 is waiting for backing image harvester-system-harvester-iso-n7bxh downloading file to node harvester-x9jqw disk 3830342d-c13d-4e55-ac74-99cad529e9d4, the current state is in-progress" controller=longhorn-replica dataPath= node=harvester-x9jqw nodeID=harvester-x9jqw ownerID=harvester-x9jqw replica=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5
    time="2022-07-28T04:35:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5\", UID:\"c511630f-2fe2-4cf9-97a4-21bce73782b1\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"632926\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"

    Here we can determine that the replica is waiting for the backing image harvester-system-harvester-iso-n7bxh.

  • Get the disk file map from the backing image:

    kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system

    Example output:

    (...)
    Disk File Status Map:
      3830342d-c13d-4e55-ac74-99cad529e9d4:
        Last State Transition Time:  2022-07-25T08:30:34Z
        Message:
        Progress:                    29
        State:                       in-progress
      3aa804e1-229d-4141-8816-1f6a7c6c3096:
        Last State Transition Time:  2022-07-25T08:33:20Z
        Message:
        Progress:                    100
        State:                       ready
      92726efa-bfb3-478e-8553-3206ad34ce70:
        Last State Transition Time:  2022-07-28T04:31:49Z
        Message:
        Progress:                    100
        State:                       ready

    The state of the disk file with UUID 3830342d-c13d-4e55-ac74-99cad529e9d4 is in-progress.

  • Next, find the backing-image-manager that contains this disk file:

    kubectl get pod -n longhorn-system --selector=longhorn.io/disk-uuid=3830342d-c13d-4e55-ac74-99cad529e9d4

    Example output:

    NAME                              READY   STATUS    RESTARTS   AGE
    backing-image-manager-c00e-3830   1/1     Running   0          3d1h
  • Restart the backing-image-manager by deleting its pod:

    kubectl delete pod -n longhorn-system backing-image-manager-c00e-3830
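
    After the pod is recreated, the backing image download should resume. As a rough verification (reusing the commands from the steps above), check that the state of the disk file eventually becomes ready and that the stopped replica starts running:

    # The disk file 3830342d-c13d-4e55-ac74-99cad529e9d4 should move from "in-progress" to "ready"
    kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system

    # Watch the volume's replicas; the stopped replica should become running
    kubectl get replicas -n longhorn-system \
      --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -w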

3. An upgrade is stuck with a node in the "Pre-drained" state (case 2)

Description

A node might be stuck in the Pre-drained state for a long time (> 30 minutes).


Here are the steps to verify whether this issue has occurred:

  • Visit the Longhorn GUI at https://{{VIP}}/k8s/clusters/local/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/#/volume (replace VIP with the appropriate value) and check for degraded volumes. A degraded volume may contain only one healthy replica (shown with a blue background), and that healthy replica is located on the "Pre-drained" node.


  • Hover over the red scheduled icon; the reason shown is toomanysnapshots (a CLI alternative is sketched after this list).

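If the Longhorn GUI is not convenient, the same information may also be visible on the Volume custom resource. This is only a sketch: it assumes the scheduling failure reason (toomanysnapshots) is surfaced under status.conditions of the Volume CR, and <volume-name> is a placeholder for the degraded volume found above:

    # Inspect the degraded volume; the scheduling failure reason
    # (e.g. toomanysnapshots) should appear under status.conditions
    kubectl get volumes.longhorn.io <volume-name> -n longhorn-system -o yaml
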

Related issue

Workaround

  • In the "Snapshots and Backup" panel, toggle the "Show System Hidden" switch and delete the latest system snapshot (the one immediately before "Volume Head").


  • The volume will continue rebuilding, and the upgrade then resumes.
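
To confirm that the rebuild completes, you can watch the volumes from the command line; once the previously degraded volume reports healthy again, the node should finish draining and the upgrade should continue:

    # Watch the volumes until the previously degraded one is "healthy" again
    kubectl get volumes -n longhorn-system -w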