Upgrade from v1.0.2 to v1.0.3

General information

There is an upgrade button on the Harvester GUI Dashboard page. For details, see Start an upgrade.

For air-gapped environment upgrades, see Prepare an air-gapped upgrade.

Known issues


1. Fail to download the upgrade image

Description

The download of the upgrade image cannot complete, or it fails with an error.


Related issue

Workaround

Delete the current upgrade and start over. See Restart an upgrade.
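
If you prefer the command line, the following is a minimal sketch of the same restart. It assumes the in-progress upgrade is tracked as an Upgrade custom resource (upgrades.harvesterhci.io) in the harvester-system namespace, as described in the Restart an upgrade page, and that <upgrade-name> is replaced with the name reported by the first command:

    # List the Upgrade objects and identify the one that is stuck
    kubectl get upgrades.harvesterhci.io -n harvester-system

    # Delete the stuck upgrade so a new one can be started from the GUI
    kubectl delete upgrades.harvesterhci.io <upgrade-name> -n harvester-system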


2. An upgrade is stuck with a node in the "Pre-drained" state (case 1)

Description

A node might be stuck in the Pre-drained state for a long time (> 30 minutes).

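Besides the Dashboard, the affected node can also be spotted from the command line: while a node is being pre-drained it is cordoned, so it reports SchedulingDisabled (this is standard Kubernetes behavior, not a Harvester-specific check):

    # The node being pre-drained is cordoned and shows SchedulingDisabled
    kubectl get nodes
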

This can be caused by an instance-manager-r-* pod on the node harvester-z7j2g that cannot be drained. To verify this:

  • Check the Rancher server logs:

    kubectl logs deployment/rancher -n cattle-system

    Example output:

    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod longhorn-system/instance-manager-r-10dd59c4
    error when evicting pods/"instance-manager-r-10dd59c4" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
  • Verify that the pod longhorn-system/instance-manager-r-10dd59c4 is on the stuck node:

    kubectl get pod instance-manager-r-10dd59c4 -n longhorn-system -o=jsonpath='{.spec.nodeName}'

    Example output:

    harvester-z7j2g
  • Check for degraded volumes:

    kubectl get volumes -n longhorn-system

    Example output:

    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE              AGE
    pvc-08c34593-8225-4be6-9899-10a978df6ea1   attached   healthy      True        10485760      harvester-279l2   3d13h
    pvc-526600f5-bde2-4244-bb8e-7910385cbaeb   attached   healthy      True        21474836480   harvester-x9jqw   3d1h
    pvc-7b3fc2c3-30eb-48b8-8a98-11913f8314c2   attached   healthy      True        10737418240   harvester-x9jqw   3d
    pvc-8065ed6c-a077-472c-920e-5fe9eacff96e   attached   healthy      True        21474836480   harvester-x9jqw   3d
    pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599   attached   degraded     True        10737418240   harvester-x9jqw   2d23h
    pvc-9a6539b8-44e5-430e-9b24-ea8290cb13b7   attached   healthy      True        53687091200   harvester-x9jqw   3d13h

    Here we can see that the volume pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 is degraded.

Note

Users need to check all degraded volumes.
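
One way to list only the degraded volumes is to filter on the robustness reported in the Volume CR status. This is a sketch that assumes jq is available on the management host:

    # Print the names of all Longhorn volumes whose robustness is "degraded"
    kubectl get volumes -n longhorn-system -o json \
      | jq -r '.items[] | select(.status.robustness == "degraded") | .metadata.name'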

  • Check the replica status of the degraded volume:

    kubectl get replicas -n longhorn-system --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -o json | jq '.items[] | {replica: .metadata.name, healthyAt: .spec.healthyAt, nodeID: .spec.nodeID, state: .status.currentState}'

    Example output:

    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246",
      "healthyAt": "2022-07-25T07:33:16Z",
      "nodeID": "harvester-z7j2g",
      "state": "running"
    }
    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-22974d0f",
      "healthyAt": "",
      "nodeID": "harvester-279l2",
      "state": "running"
    }
    {
      "replica": "pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5",
      "healthyAt": "",
      "nodeID": "harvester-x9jqw",
      "state": "stopped"
    }

    The only healthy replica here is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-15e31246, and it is on the node harvester-z7j2g. This confirms that the instance-manager-r-* pod is located on the node harvester-z7j2g and cannot be drained, because evicting it would take the volume's only healthy replica offline.
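
    The same check can be scripted. The sketch below (again assuming jq) prints only the replicas of the degraded volume that have a non-empty healthyAt timestamp, together with the node each one lives on:

    # Show only the healthy replicas of the volume and the nodes they are on
    kubectl get replicas -n longhorn-system \
      --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -o json \
      | jq -r '.items[] | select(.spec.healthyAt != "") | "\(.metadata.name) -> \(.spec.nodeID)"'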

Related issue

Workaround

We need to start the replica that is in the "Stopped" state. In the example above, the name of the stopped replica is pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5.

  • Check the longhorn-manager logs, where we can see that the replica is waiting for a backing image. First, get the name of the manager pod on the replica's node:

    kubectl get pods -n longhorn-system --selector app=longhorn-manager --field-selector spec.nodeName=harvester-x9jqw

    Example output:

    NAME                     READY   STATUS    RESTARTS   AGE
    longhorn-manager-zmfbw   1/1     Running   0          3d10h
  • Get the pod logs:

    kubectl logs longhorn-manager-zmfbw -n longhorn-system | grep pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5

    Example output:

    (...)
    time="2022-07-28T04:35:34Z" level=debug msg="Prepare to create instance pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"
    time="2022-07-28T04:35:34Z" level=debug msg="Replica pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5 is waiting for backing image harvester-system-harvester-iso-n7bxh downloading file to node harvester-x9jqw disk 3830342d-c13d-4e55-ac74-99cad529e9d4, the current state is in-progress" controller=longhorn-replica dataPath= node=harvester-x9jqw nodeID=harvester-x9jqw ownerID=harvester-x9jqw replica=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5
    time="2022-07-28T04:35:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5\", UID:\"c511630f-2fe2-4cf9-97a4-21bce73782b1\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"632926\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599-r-bc6f7fa5"

    Here we can determine that the replica is waiting for the backing image harvester-system-harvester-iso-n7bxh.

  • Get the disk file map from the backing image:

    kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system

    Example output:

    (...)
    Disk File Status Map:
      3830342d-c13d-4e55-ac74-99cad529e9d4:
        Last State Transition Time:  2022-07-25T08:30:34Z
        Message:
        Progress:                    29
        State:                       in-progress
      3aa804e1-229d-4141-8816-1f6a7c6c3096:
        Last State Transition Time:  2022-07-25T08:33:20Z
        Message:
        Progress:                    100
        State:                       ready
      92726efa-bfb3-478e-8553-3206ad34ce70:
        Last State Transition Time:  2022-07-28T04:31:49Z
        Message:
        Progress:                    100
        State:                       ready

    The state of the disk file with UUID 3830342d-c13d-4e55-ac74-99cad529e9d4 is in-progress.

  • Next, find the backing-image-manager that contains this disk file:

    kubectl get pod -n longhorn-system --selector=longhorn.io/disk-uuid=3830342d-c13d-4e55-ac74-99cad529e9d4

    Example output:

    NAME                              READY   STATUS    RESTARTS   AGE
    backing-image-manager-c00e-3830   1/1     Running   0          3d1h
  • Restart the backing-image-manager by deleting its pod:

    kubectl delete pod -n longhorn-system backing-image-manager-c00e-3830
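
    After the pod is recreated, the backing image download should resume. As a rough verification (reusing the commands from the steps above), check that the state of the disk file eventually becomes ready and that the stopped replica starts running:

    # The disk file 3830342d-c13d-4e55-ac74-99cad529e9d4 should move from "in-progress" to "ready"
    kubectl describe backingimage harvester-system-harvester-iso-n7bxh -n longhorn-system

    # Watch the volume's replicas; the stopped replica should become running
    kubectl get replicas -n longhorn-system \
      --selector longhornvolume=pvc-9a40e5b9-543a-4c90-aafd-ac78b05d7599 -w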

3. An upgrade is stuck with a node in the "Pre-drained" state (case 2)

Description

A node might be stuck in the Pre-drained state for a long time (> 30 minutes).


Here are the steps to verify whether this issue has occurred:

  • Visit the Longhorn GUI at https://{{VIP}}/k8s/clusters/local/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/#/volume (replace VIP with the appropriate value) and check for degraded volumes. A degraded volume may contain only one healthy replica (shown with a blue background), and that healthy replica is located on the "Pre-drained" node.


  • Hover over the red scheduled icon; the reason shown is toomanysnapshots (a CLI alternative is sketched after this list).

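If the Longhorn GUI is not convenient, the same information may also be visible on the Volume custom resource. This is only a sketch: it assumes the scheduling failure reason (toomanysnapshots) is surfaced under status.conditions of the Volume CR, and <volume-name> is a placeholder for the degraded volume found above:

    # Inspect the degraded volume; the scheduling failure reason
    # (e.g. toomanysnapshots) should appear under status.conditions
    kubectl get volumes.longhorn.io <volume-name> -n longhorn-system -o yaml
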

Related issue

Workaround

  • In the "Snapshots and Backup" panel, toggle the "Show System Hidden" switch and delete the latest system snapshot (the one immediately before "Volume Head").


  • The volume will continue rebuilding, and the upgrade then resumes.
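
To confirm that the rebuild completes, you can watch the volumes from the command line; once the previously degraded volume reports healthy again, the node should finish draining and the upgrade should continue:

    # Watch the volumes until the previously degraded one is "healthy" again
    kubectl get volumes -n longhorn-system -w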