Pod Group Status

In the Coscheduling v1alpha1 design, PodGroup's status only includes counters of related pods, which is not enough for PodGroup lifecycle management. This design doc introduces more information about PodGroup's status for lifecycle management, e.g. PodGroupPhase.

Function Detail

To include more information about the PodGroup's current status/phase, the following types are introduced:

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroupPhase is the phase of a pod group at the current time.
type PodGroupPhase string

// These are the valid phases of a PodGroup.
const (
	// PodGroupPending means the pod group has been accepted by the system, but the
	// scheduler cannot allocate enough resources to it.
	PodGroupPending PodGroupPhase = "Pending"

	// PodGroupRunning means `spec.minMember` pods of the PodGroup are in the running phase.
	PodGroupRunning PodGroupPhase = "Running"

	// PodGroupUnknown means some of the `spec.minMember` pods are running while the others
	// cannot be scheduled, e.g. not enough resources; the scheduler will wait for the
	// related controller to recover it.
	PodGroupUnknown PodGroupPhase = "Unknown"
)

// PodGroupConditionType is the type of a PodGroup condition.
type PodGroupConditionType string

const (
	PodGroupUnschedulableType PodGroupConditionType = "Unschedulable"
)

// PodGroupCondition contains details for the current state of this pod group.
type PodGroupCondition struct {
	// Type is the type of the condition.
	Type PodGroupConditionType `json:"type,omitempty" protobuf:"bytes,1,opt,name=type"`

	// Status is the status of the condition.
	Status v1.ConditionStatus `json:"status,omitempty" protobuf:"bytes,2,opt,name=status"`

	// The ID of the condition transition.
	TransitionID string `json:"transitionID,omitempty" protobuf:"bytes,3,opt,name=transitionID"`

	// Last time the phase transitioned from another phase to the current phase.
	// +optional
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastTransitionTime"`

	// Unique, one-word, CamelCase reason for the phase's last transition.
	// +optional
	Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"`

	// Human-readable message indicating details about the last transition.
	// +optional
	Message string `json:"message,omitempty" protobuf:"bytes,6,opt,name=message"`
}

const (
	// PodFailedReason is probed if a pod of the PodGroup failed.
	PodFailedReason string = "PodFailed"

	// PodDeletedReason is probed if a pod of the PodGroup is deleted.
	PodDeletedReason string = "PodDeleted"

	// NotEnoughResourcesReason is probed if there are not enough resources to schedule the pods.
	NotEnoughResourcesReason string = "NotEnoughResources"

	// NotEnoughPodsReason is probed if there are not enough tasks compared to `spec.minMember`.
	NotEnoughPodsReason string = "NotEnoughTasks"
)

// PodGroupStatus represents the current state of a pod group.
type PodGroupStatus struct {
	// Current phase of the PodGroup.
	Phase PodGroupPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase"`

	// The conditions of the PodGroup.
	// +optional
	Conditions []PodGroupCondition `json:"conditions,omitempty" protobuf:"bytes,2,opt,name=conditions"`

	// The number of actively running pods.
	// +optional
	Running int32 `json:"running,omitempty" protobuf:"bytes,3,opt,name=running"`

	// The number of pods which reached phase Succeeded.
	// +optional
	Succeeded int32 `json:"succeeded,omitempty" protobuf:"bytes,4,opt,name=succeeded"`

	// The number of pods which reached phase Failed.
	// +optional
	Failed int32 `json:"failed,omitempty" protobuf:"bytes,5,opt,name=failed"`
}
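
For illustration, here is a minimal sketch of how a scheduler component might record an Unschedulable condition on a PodGroupStatus. The markUnschedulable helper and the use of uuid.NewUUID for TransitionID are assumptions of this sketch, not part of the proposed API:

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// markUnschedulable records an Unschedulable condition on the PodGroup status,
// replacing an existing condition of the same type if one is present.
// This helper is illustrative only.
func markUnschedulable(status *PodGroupStatus, reason, message string) {
	cond := PodGroupCondition{
		Type:               PodGroupUnschedulableType,
		Status:             v1.ConditionTrue,
		TransitionID:       string(uuid.NewUUID()),
		LastTransitionTime: metav1.Now(),
		Reason:             reason,
		Message:            message,
	}
	for i := range status.Conditions {
		if status.Conditions[i].Type == cond.Type {
			status.Conditions[i] = cond
			return
		}
	}
	status.Conditions = append(status.Conditions, cond)
}

A caller would invoke, for example, markUnschedulable(&pg.Status, NotEnoughResourcesReason, "2/4 pods can be scheduled") and then persist the PodGroup through its client.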

According to the PodGroup's lifecycle, the following phase/state transitions are reasonable, and the related reasons will be appended to the Reason field.

| From    | To      | Reason                                                                      |
|---------|---------|-----------------------------------------------------------------------------|
| Pending | Running | When every pod of `spec.minMember` is running                               |
| Running | Unknown | When some pods of `spec.minMember` are restarted but cannot be rescheduled  |
| Unknown | Pending | When all pods (`spec.minMember`) in the PodGroup are deleted                |
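
As a sketch, the transition table could be implemented as a pure function over the status counters. The calculatePhase name and the minMember parameter (taken from `spec.minMember`) are illustrative, not part of the proposed API:

// calculatePhase derives the next phase from the current phase and the pod
// counters in PodGroupStatus, following the transition table above.
func calculatePhase(current PodGroupPhase, status PodGroupStatus, minMember int32) PodGroupPhase {
	switch {
	case status.Running >= minMember:
		// Pending -> Running: every pod of spec.minMember is running.
		return PodGroupRunning
	case current == PodGroupRunning && status.Running > 0:
		// Running -> Unknown: some pods were restarted but cannot be
		// rescheduled, so only part of spec.minMember is still running.
		return PodGroupUnknown
	case current == PodGroupUnknown && status.Running == 0:
		// Unknown -> Pending: all pods of the PodGroup are deleted.
		return PodGroupPending
	default:
		return current
	}
}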

Feature Interaction

Cluster Autoscaler

Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

  • there are pods that failed to run in the cluster due to insufficient resources;
  • there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.

When Cluster Autoscaler scales out a new node, it leverages the scheduler's predicates to check whether pods can be scheduled onto the new node. But Coscheduling is not implemented as a predicate for now, so it will not work well together with Cluster Autoscaler right now. An alternative solution will be proposed for this later.

Operators/Controllers

The lifecycle of a PodGroup is managed by operators/controllers; the scheduler only probes and records the related state for those controllers. For example, if the PodGroup of an MPI job goes Unknown, the controller needs to restart all pods in the PodGroup, as sketched below.
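
A minimal sketch of such a reconcile step, assuming the v1alpha1 PodGroup type with the status above; the function name, the label selector, and the use of client-go's DeleteCollection are assumptions of this sketch rather than a prescribed implementation:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcilePodGroup restarts the whole group when the PodGroup phase is
// Unknown: it deletes all pods of the group so that the owning controller
// recreates them and the coscheduler can admit them together again.
// The label key below is hypothetical.
func reconcilePodGroup(ctx context.Context, kube kubernetes.Interface, pg *PodGroup) error {
	if pg.Status.Phase != PodGroupUnknown {
		return nil
	}
	return kube.CoreV1().Pods(pg.Namespace).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "pod-group.example.io/name=" + pg.Name},
	)
}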
