In versions before v1.0, OVN running in RAFT mode can fail to recover after a machine reboot; upgrading to a release after 1.0 is recommended. On pre-1.0 versions, the fix requires restarting every Pod on the container network, and the problem may still reappear after a machine reboot.

    Check /var/log/openvswitch/ovsdb-server-nb.log and /var/log/openvswitch/ovsdb-server-sb.log on the affected machine. If ERR logs similar to the following appear, the problem is caused by the OVN Raft implementation:

        2020-05-15T02:52:55.703Z|00335|raft|ERR|Dropped 15 log messages in last 13 seconds (most recently, 2 seconds ago) due to excessive rate
        2020-05-15T02:52:55.703Z|00336|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53161 last_log_index=61188 last_log_term=52219
        2020-05-15T02:53:06.803Z|00337|raft|ERR|Dropped 15 log messages in last 11 seconds (most recently, 2 seconds ago) due to excessive rate
        2020-05-15T02:53:06.803Z|00338|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53169 last_log_index=61188 last_log_term=52219
        2020-05-15T02:53:18.409Z|00339|raft|ERR|Dropped 13 log messages in last 12 seconds (most recently, 2 seconds ago) due to excessive rate
        2020-05-15T02:53:18.409Z|00340|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53176 last_log_index=61188 last_log_term=52219
        2020-05-15T02:53:30.920Z|00341|raft|ERR|Dropped 15 log messages in last 12 seconds (most recently, 1 seconds ago) due to excessive rate
        2020-05-15T02:53:30.920Z|00342|raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 61188 is past last synced index 0: 3a8b vote_request: term=53184 last_log_index=61188 last_log_term=52219
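
    As a quick check, a grep over both log files for the recurring message is enough to spot the pattern above (a minimal sketch using the same default log paths):

        grep 'not ready to send because message index' \
            /var/log/openvswitch/ovsdb-server-nb.log \
            /var/log/openvswitch/ovsdb-server-sb.log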

    Fix procedure:

    1. Record the current replica counts of ovn-central and kube-ovn-controller and the nodes they run on, then stop both deployments (a sketch for recording this state follows the list):
        1. kubectl scale deployment -n kube-ovn --replicas=0 ovn-central
        2. kubectl scale deployment -n kube-ovn --replicas=0 kube-ovn-controller
    2. On every node that runs ovn-central, delete the db files under /etc/origin/openvswitch:
        1. rm -rf /etc/origin/openvswitch/ovnnb_db.db
        2. rm -rf /etc/origin/openvswitch/ovnsb_db.db
    3. Back up and then delete the metis webhook so that Pod creation is not blocked:
        1. kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io metis -o yaml > metis.yaml
        2. kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io metis
    4. Scale ovn-central back to its previous replica count and wait for all of its Pods to become Ready (see the readiness sketch after this list):
        1. kubectl scale deployment -n kube-ovn ovn-central --replicas=<XXX>
    5. Scale kube-ovn-controller back to its previous replica count and wait for all of its Pods to become Ready:
        1. kubectl scale deployment -n kube-ovn kube-ovn-controller --replicas=<XXX>
    6. Delete all Pods running on the container network so that they are recreated (a quick verification sketch follows the list):
        for ns in $(kubectl get ns --no-headers -o custom-columns=NAME:.metadata.name); do
          for pod in $(kubectl get pod --no-headers -n "$ns" --field-selector spec.restartPolicy=Always -o custom-columns=NAME:.metadata.name,HOST:spec.hostNetwork | awk '{if ($2!="true") print $1}'); do
            kubectl delete pod "$pod" -n "$ns"
          done
        done
    7. Recreate the metis webhook: remove fields such as creationTimestamp, resourceVersion, selfLink, and uid from metis.yaml, then apply it (a cleanup sketch follows the list):
        1. kubectl apply -f metis.yaml
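
    For step 1, the replica counts and node placement can be captured before scaling down, for example as below. The label selectors app=ovn-central and app=kube-ovn-controller are assumptions based on the default kube-ovn manifests; adjust them if your deployment uses different labels.

        kubectl -n kube-ovn get deployment ovn-central kube-ovn-controller -o wide
        kubectl -n kube-ovn get pod -o wide -l app=ovn-central
        kubectl -n kube-ovn get pod -o wide -l app=kube-ovn-controller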
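
    For steps 4 and 5, one way to wait until the deployments are fully Ready before moving on is kubectl rollout status (a plain kubectl sketch, nothing kube-ovn specific):

        kubectl -n kube-ovn rollout status deployment ovn-central
        kubectl -n kube-ovn rollout status deployment kube-ovn-controller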
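
    After step 6, a rough way to confirm that the recreated Pods came back is to list anything not yet Running (a coarse check only, not a substitute for inspecting individual workloads):

        kubectl get pod -A -o wide | grep -v Running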
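
    For step 7, the read-only metadata fields can be stripped from the backup before reapplying; a rough sed sketch is below. Verify the edited file before applying, since the exact layout of metis.yaml depends on how the backup was produced.

        # drop read-only metadata fields (assumes they sit at the usual two-space indent under metadata:)
        sed -i -e '/^  creationTimestamp:/d' \
               -e '/^  resourceVersion:/d' \
               -e '/^  selfLink:/d' \
               -e '/^  uid:/d' metis.yaml
        kubectl apply -f metis.yaml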