
Koordinator defines a CRD-based Pod migration API called PodMigrationJob, through which the descheduler or other automatic fault recovery components can evict or delete Pods more safely.


Migrating Pods is an important capability that many components (such as deschedulers) rely on. It can be used to optimize scheduling or to help resolve workload runtime quality issues. Pod migration is a complex process that involves steps such as auditing, resource allocation, and application startup, and it often overlaps with application upgrades, scaling, and resource operation and maintenance work by cluster administrators. Managing the stability risk of this process, so that applications do not fail because their Pods are migrated, is therefore a critical problem that must be solved.

Because the PodMigrationJob CRD is final-state oriented (declarative), the status of each step can be tracked during migration, and scenarios such as application upgrades and scaling can be detected, which helps ensure the stability of the workload.



  • Kubernetes >= 1.18
  • Koordinator >= 0.6


Please make sure Koordinator components are correctly installed in your cluster. If not, please refer to Installation.


PodMigrationJob is enabled by default. You can use it without modifying the koord-descheduler configuration.

Use PodMigrationJob

Quick Start

  1. Create a Deployment pod-demo with the YAML file below.
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: pod-demo
        namespace: default
      spec:
        progressDeadlineSeconds: 600
        replicas: 1
        revisionHistoryLimit: 10
        selector:
          matchLabels:
            app: pod-demo
        strategy:
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
          type: RollingUpdate
        template:
          metadata:
            creationTimestamp: null
            labels:
              app: pod-demo
            name: stress
          spec:
            containers:
            - args:
              - -c
              - "1"
              command:
              - stress
              image: polinux/stress
              imagePullPolicy: Always
              name: stress
              resources:
                limits:
                  cpu: "2"
                  memory: 4Gi
                requests:
                  cpu: 200m
                  memory: 400Mi
            restartPolicy: Always
            schedulerName: koord-scheduler

      $ kubectl create -f pod-demo.yaml
      deployment.apps/pod-demo created
  2. Check the scheduling result of the pod-demo Pod.
      $ kubectl get pod -o wide
      pod-demo-5f9b977566-c7lvk 1/1 Running 0 41s node-0 <none> <none>

pod-demo-5f9b977566-c7lvk is scheduled on the node node-0.

  3. Create a PodMigrationJob with the YAML file below to migrate pod-demo-5f9b977566-c7lvk.
      apiVersion: scheduling.koordinator.sh/v1alpha1
      kind: PodMigrationJob
      metadata:
        name: migrationjob-demo
      spec:
        paused: false
        ttl: 5m
        mode: ReservationFirst
        podRef:
          namespace: default
          name: pod-demo-5f9b977566-c7lvk
      status:
        phase: Pending

      $ kubectl create -f migrationjob-demo.yaml
      podmigrationjob.scheduling.koordinator.sh/migrationjob-demo created
  4. Query the migration status.
      $ kubectl get podmigrationjob migrationjob-demo
      NAME PHASE STATUS AGE NODE RESERVATION PODNAMESPACE POD NEWPOD TTL
      migrationjob-demo Succeed Complete 37s node-1 d56659ab-ba16-47a2-821d-22d6ba49258e default pod-demo-5f9b977566-c7lvk pod-demo-5f9b977566-nxjdf 5m0s

From the above results, it can be observed that:

  • PHASE is Succeed, STATUS is Complete, indicating that the migration is successful.
  • NODE node-1 indicates the node where the new Pod is scheduled after the migration.
  • RESERVATION d56659ab-ba16-47a2-821d-22d6ba49258e is the Reservation created during migration. The PodMigrationJob Controller tries to create the Reservation and reserve resources before it starts to evict the Pod. Eviction is initiated only after the reservation succeeds, which ensures that resources are available for the new Pod once the old Pod is evicted.
  • PODNAMESPACE default is the namespace of the migrated Pod.
  • POD pod-demo-5f9b977566-c7lvk is the Pod to be migrated.
  • NEWPOD pod-demo-5f9b977566-nxjdf is the newly created Pod after migration.
  • TTL indicates the TTL period of the current Job.
  5. Query the migration events.

The PodMigrationJob Controller creates Events for important steps in the migration process to help users diagnose migration problems.

    $ kubectl describe podmigrationjob migrationjob-demo
    ...
    Events:
      Type    Reason                Age    From               Message
      ----    ------                ----   ----               -------
      Normal  ReservationCreated    8m33s  koord-descheduler  Successfully create Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"
      Normal  ReservationScheduled  8m33s  koord-descheduler  Assigned Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e" to node "node-1"
      Normal  Evicting              8m33s  koord-descheduler  Try to evict Pod "default/pod-demo-5f9b977566-c7lvk"
      Normal  EvictComplete         8m     koord-descheduler  Pod "default/pod-demo-5f9b977566-c7lvk" has been evicted
      Normal  Complete              8m     koord-descheduler  Bind Pod "default/pod-demo-5f9b977566-nxjdf" in Reservation "d56659ab-ba16-47a2-821d-22d6ba49258e"
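
The status and event queries above can be combined into a small polling helper. The following is a minimal sketch, not part of Koordinator: the function name wait_for_migration, the 5-second poll interval, and the default 300-second timeout are assumptions made for illustration. It relies only on the status.phase field and on the Succeed and Failed phase values that appear in this document.

```shell
# wait_for_migration JOB [TIMEOUT_SECONDS]
# Hypothetical helper: polls .status.phase of a PodMigrationJob until it
# reaches Succeed or Failed, or until the timeout (default 300s) elapses.
wait_for_migration() {
  job="$1"
  timeout="${2:-300}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    phase="$(kubectl get podmigrationjob "$job" -o jsonpath='{.status.phase}')"
    case "$phase" in
      Succeed)
        echo "migration $job succeeded"
        return 0
        ;;
      Failed)
        # Print the job description, including its Events, for diagnosis.
        echo "migration $job failed; job details:" >&2
        kubectl describe podmigrationjob "$job" >&2
        return 1
        ;;
    esac
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "timed out waiting for $job (last phase: ${phase:-unknown})" >&2
  return 2
}
```

For the job created in the quick start, this could be called as wait_for_migration migrationjob-demo 300. The three return codes (0, 1, 2) let a calling script distinguish success, failure, and timeout.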

Advanced Configurations

The latest API can be found in pod_migration_job_types.go.

Example: Manually confirm whether the migration is allowed

Eviction and migration operations carry stability risks, so you may want to manually check and confirm that nothing is wrong before a migration starts.

To do this, set spec.paused to true when creating the PodMigrationJob, and set spec.paused to false after manually confirming that execution is allowed. To reject the migration, update status.phase to Failed to terminate the PodMigrationJob immediately, or wait for the PodMigrationJob to expire automatically.

    apiVersion: scheduling.koordinator.sh/v1alpha1
    kind: PodMigrationJob
    metadata:
      name: migrationjob-demo
    spec:
      # paused indicates whether the PodMigrationJob should work or not.
      paused: true
      # ttl controls the PodMigrationJob timeout duration.
      ttl: 5m
      mode: ReservationFirst
      podRef:
        namespace: default
        name: pod-demo-5f9b977566-c7lvk
    status:
      phase: Pending
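
Once the migration has been approved, spec.paused has to be set back to false. One way to do this is a merge patch with kubectl, sketched below. The wrapper name resume_migration is hypothetical. A merge patch is used because custom resources do not support the strategic merge patch type.

```shell
# resume_migration JOB
# Hypothetical helper: sets spec.paused to false on a paused PodMigrationJob
# so that the controller can start the migration.
resume_migration() {
  kubectl patch podmigrationjob "$1" --type merge -p '{"spec":{"paused":false}}'
}
```

For the example above, this would be invoked as resume_migration migrationjob-demo. Editing the object with kubectl edit achieves the same result interactively.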

Example: Just want to evict Pods, no need to reserve resources

PodMigrationJob provides two migration modes:

  • EvictDirectly evicts the Pod directly, without reserving resources.
  • ReservationFirst reserves resources first to ensure that resources can be allocated before initiating eviction.

If you only want to evict Pods, set spec.mode to EvictDirectly.

    apiVersion: scheduling.koordinator.sh/v1alpha1
    kind: PodMigrationJob
    metadata:
      name: migrationjob-demo
    spec:
      paused: false
      ttl: 5m
      mode: EvictDirectly
      podRef:
        namespace: default
        name: pod-demo-5f9b977566-c7lvk
    status:
      phase: Pending
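
Since the two modes differ only in spec.mode, the manifest can be generated from a template. The sketch below is illustrative only: the function name new_migration_job and its argument order are assumptions, and the fixed ttl of 5m is copied from the examples in this document. It rejects any mode other than the two described above.

```shell
# new_migration_job NAME NAMESPACE POD [MODE]
# Hypothetical helper: prints a PodMigrationJob manifest to stdout.
# MODE defaults to ReservationFirst; only the two documented modes are accepted.
new_migration_job() {
  mode="${4:-ReservationFirst}"
  case "$mode" in
    ReservationFirst|EvictDirectly) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: PodMigrationJob
metadata:
  name: $1
spec:
  paused: false
  ttl: 5m
  mode: $mode
  podRef:
    namespace: $2
    name: $3
EOF
}
```

The output can be piped straight to kubectl, for example: new_migration_job migrationjob-demo default pod-demo-5f9b977566-c7lvk EvictDirectly | kubectl create -f -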

Example: Use reserved resources when migrating

In some scenarios, resources are reserved first, and a PodMigrationJob is created only after the reservation succeeds. In that case, the job references the existing Reservation, and the arbitration mechanism provided by the PodMigrationJob Controller (to be implemented in v0.7) is reused to ensure workload stability.

    apiVersion: scheduling.koordinator.sh/v1alpha1
    kind: PodMigrationJob
    metadata:
      name: migrationjob-demo
    spec:
      paused: false
      ttl: 5m
      mode: ReservationFirst
      podRef:
        namespace: default
        name: pod-demo-5f9b977566-c7lvk
      reservationOptions:
        # the reservation-0 created before creating PodMigrationJob
        reservationRef:
          name: reservation-0
    status:
      phase: Pending
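
Because this job depends on a Reservation that must already exist, it can be useful to check for it before creating the job. The following is a minimal sketch under assumptions: the helper name require_reservation is hypothetical, and the resource name reservations.scheduling.koordinator.sh is assumed from the API group shown earlier, so verify it with kubectl api-resources in your cluster.

```shell
# require_reservation NAME
# Hypothetical helper: fails if the named Reservation cannot be fetched.
# Assumes the resource is exposed as reservations.scheduling.koordinator.sh.
require_reservation() {
  if ! kubectl get reservations.scheduling.koordinator.sh "$1" >/dev/null 2>&1; then
    echo "Reservation $1 not found; create it before the PodMigrationJob" >&2
    return 1
  fi
}
```

A possible use is require_reservation reservation-0 && kubectl create -f migrationjob-demo.yaml, which creates the job only if the lookup succeeds. The check confirms that the Reservation exists, not that it has been scheduled successfully.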

Example: Evicting Pods Gracefully

PodMigrationJob supports graceful eviction of Pods. Set gracePeriodSeconds under spec.deleteOptions to control how long a Pod is given to terminate.

    apiVersion: scheduling.koordinator.sh/v1alpha1
    kind: PodMigrationJob
    metadata:
      name: migrationjob-demo
    spec:
      paused: true
      ttl: 5m
      mode: ReservationFirst
      podRef:
        namespace: default
        name: pod-demo-5f9b977566-c7lvk
      deleteOptions:
        # The duration in seconds before the object should be deleted. Value must be a non-negative integer.
        # The value zero indicates delete immediately. If this value is nil, the default grace period for the
        # specified type will be used.
        gracePeriodSeconds: 60
    status:
      phase: Pending

Known Issues