WorkloadSpread
FEATURE STATE: Kruise v0.10.0
WorkloadSpread distributes the Pods of a workload across different types of Nodes according to configurable policies, which gives a single workload the ability to do multi-domain and elastic deployment.
Some common policies include:
- fault-tolerant spread (for example, spread evenly across hosts, AZs, etc.)
- spread according to a specified ratio (for example, deploy Pods to several specified AZs in a given proportion)
- subset management with priority, such as
  - deploy Pods to ecs first, and then to eci when ecs resources are insufficient.
  - deploy a fixed number of Pods to ecs first, and the remaining Pods to eci.
- subset management with customization, such as
  - control how many Pods of a workload are deployed on each CPU architecture
  - let Pods on different CPU architectures have different resource requirements
WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains, called subsets. Each subset may set a limit on the number of Pod replicas it runs, called maxReplicas. WorkloadSpread injects the subset configuration into Pods through a webhook, and it also controls the order of scale-in and scale-out.
Currently supported workloads: CloneSet, Deployment, ReplicaSet.
Demo
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    preferredNodeSelectorTerms:
    - weight: 1
      preference:
        matchExpressions:
        - key: another-node-label-key
          operator: In
          values:
          - another-node-label-value
    maxReplicas: 3
    tolerations: []
    patch:
      metadata:
        labels:
          xxx-specific-label: xxx
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b
  scheduleStrategy:
    type: Adaptive | Fixed
    adaptive:
      rescheduleCriticalSeconds: 30
targetRef: specifies the target workload. It cannot be changed after creation, and one workload can correspond to only one WorkloadSpread.
subsets
subsets consists of multiple domains, each called a subset, and each subset has its own configuration.
sub-fields
name: the name of the subset. It must be unique within a WorkloadSpread and represents a topology.
maxReplicas: the replica limit of the subset. It must be an integer >= 0. If maxReplicas is nil, the subset has no replica limit. Percentage values are not supported in the current version.
requiredNodeSelectorTerm: hard match on the zone.
preferredNodeSelectorTerms: soft match on the zone.
Caution: requiredNodeSelectorTerm corresponds to requiredDuringSchedulingIgnoredDuringExecution of nodeAffinity. preferredNodeSelectorTerms corresponds to preferredDuringSchedulingIgnoredDuringExecution of nodeAffinity.
tolerations: the tolerations of the Pods in the subset.
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
patch: customizes the Pod configuration of the subset, such as annotations, labels, and env.
Example:
# patch pod with a topology label:
patch:
  metadata:
    labels:
      topology.application.deploy/zone: "zone-a"

# patch pod container resources:
patch:
  spec:
    containers:
    - name: main
      resources:
        limits:
          cpu: "2"
          memory: 800Mi

# patch pod container env with a zone name:
patch:
  spec:
    containers:
    - name: main
      env:
      - name: K8S_AZ_NAME
        value: zone-a
Schedule strategy
WorkloadSpread provides two scheduling strategies. The default strategy is Fixed.
scheduleStrategy:
  type: Adaptive | Fixed
  adaptive:
    rescheduleCriticalSeconds: 30
Fixed:
The workload is spread strictly according to the subset definitions.
Adaptive:
Reschedule: Kruise checks for unschedulable Pods in each subset. If a Pod stays unschedulable longer than the configured duration (rescheduleCriticalSeconds), it is rescheduled to another subset.
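The Adaptive check can be pictured with a small sketch. This is only an illustration under stated assumptions, not Kruise's implementation: the `PodStatus` type, its fields, and the `pods_to_reschedule` function are hypothetical names, and the sketch assumes "exceeds" means strictly longer than `rescheduleCriticalSeconds`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodStatus:
    """Hypothetical view of a Pod as seen by the reschedule check."""
    name: str
    subset: str
    # Seconds the Pod has been unschedulable; None if it is already scheduled.
    unschedulable_seconds: Optional[float]


def pods_to_reschedule(pods: List[PodStatus],
                       reschedule_critical_seconds: float = 30) -> List[str]:
    """Return names of Pods that stayed unschedulable longer than the threshold."""
    return [
        p.name
        for p in pods
        if p.unschedulable_seconds is not None
        and p.unschedulable_seconds > reschedule_critical_seconds
    ]


pods = [
    PodStatus("pod-1", "subset-a", None),   # scheduled, never a candidate
    PodStatus("pod-2", "subset-a", 45.0),   # pending longer than 30s
    PodStatus("pod-3", "subset-a", 10.0),   # pending, but still within 30s
]
print(pods_to_reschedule(pods, reschedule_critical_seconds=30))
```

With the Fixed strategy no such check applies, so a pending Pod stays in its subset.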
Requirements
WorkloadSpread is disabled by default. You have to enable the WorkloadSpread feature-gate when you install or upgrade Kruise:
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
Pod Webhook
WorkloadSpread uses a webhook to inject fault domain rules.
If the PodWebhook feature-gate is set to false, WorkloadSpread is also disabled.
deletion-cost feature
CloneSet supports the deletion-cost feature in recent versions.
Other native workloads need Kubernetes version >= 1.21. (In 1.21, users need to enable the PodDeletionCost feature-gate; since 1.22 it is enabled by default.)
Scale order:
A workload managed by WorkloadSpread scales according to the order defined in spec.subsets.
The order of the subsets in spec.subsets can be changed, which adjusts the scale order of the workload.
Scale out
- Pods are scheduled in the subset order defined in spec.subsets. Once the number of replicas in a subset reaches its maxReplicas, Pods are scheduled to the next subset.
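As a rough illustration of the scale-out rule, the sketch below fills subsets in order up to each `maxReplicas`. It is a simplified model, not Kruise's code: `spread_replicas` is a hypothetical helper, `None` stands in for a nil `maxReplicas`, and replicas that fit in no subset are simply reported as unassigned, because how the controller treats them is not covered here.

```python
from typing import Dict, List, Optional, Tuple


def spread_replicas(
    replicas: int,
    subsets: List[Tuple[str, Optional[int]]],
) -> Tuple[Dict[str, int], int]:
    """Fill subsets in spec.subsets order.

    subsets is an ordered list of (name, max_replicas); max_replicas=None
    means the subset has no limit. Returns (placement, unassigned).
    """
    placement: Dict[str, int] = {}
    remaining = replicas
    for name, max_replicas in subsets:
        # An unlimited subset takes everything that is left.
        take = remaining if max_replicas is None else min(remaining, max_replicas)
        placement[name] = take
        remaining -= take
    return placement, remaining


# Two subsets: the first capped at 100, the second unlimited.
order = [("subset-a", 100), ("subset-b", None)]
print(spread_replicas(80, order))
print(spread_replicas(150, order))
```

Swapping the two entries in `order` changes which subset fills first, which mirrors how reordering spec.subsets adjusts the scale order.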
Scale in
- When the number of replicas in a subset is greater than its maxReplicas, the extra Pods are removed with high priority.
- Following the subset order in spec.subsets, Pods of subsets at the back are deleted before Pods of subsets at the front.
#                  subset-a   subset-b   subset-c
# maxReplicas      10         10         nil
# pods number      10         10         10
# deletion order: c -> b -> a

#                  subset-a   subset-b   subset-c
# maxReplicas      10         10         nil
# pods number      20         20         20
# deletion order: b -> a -> c
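The two cases above can be reproduced with a small model of the scale-in rules. This is a sketch under assumptions, not the controller's actual logic: `deletion_order` is a hypothetical helper that ranks whole subsets (not individual Pods), and `None` stands in for a nil `maxReplicas`.

```python
from typing import List, Optional, Tuple


def deletion_order(subsets: List[Tuple[str, Optional[int], int]]) -> List[str]:
    """Rank subsets by scale-in priority.

    subsets is an ordered list of (name, max_replicas, pod_count).
    Subsets holding more Pods than max_replicas come first; within each
    group, subsets at the back of the list come before those at the front.
    """
    def over_limit(max_replicas: Optional[int], pod_count: int) -> bool:
        return max_replicas is not None and pod_count > max_replicas

    over = [name for name, limit, count in subsets if over_limit(limit, count)]
    within = [name for name, limit, count in subsets if not over_limit(limit, count)]
    return list(reversed(over)) + list(reversed(within))


case_1 = [("subset-a", 10, 10), ("subset-b", 10, 10), ("subset-c", None, 10)]
case_2 = [("subset-a", 10, 20), ("subset-b", 10, 20), ("subset-c", None, 20)]
print(deletion_order(case_1))  # ['subset-c', 'subset-b', 'subset-a']
print(deletion_order(case_2))  # ['subset-b', 'subset-a', 'subset-c']
```

In the second case subset-c has no limit, so it can never be over its limit and falls behind the two subsets that are.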
feature-gates
The WorkloadSpread feature is turned off by default. To turn it on, set the WorkloadSpread feature-gate.
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
Example
Elastic deployment
zone-a (ACK) holds 100 Pods; zone-b (ECI), as an elastic zone, holds the additional Pods.
- Create a WorkloadSpread instance.
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: ws-demo
  namespace: deploy
spec:
  targetRef: # workload in the same namespace
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: workload-xxx
  subsets:
  - name: ACK # zone ACK
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - ack
    maxReplicas: 100
    patch: # inject label.
      metadata:
        labels:
          deploy/zone: ack
  - name: ECI # zone ECI
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eci
    patch:
      metadata:
        labels:
          deploy/zone: eci
- Create a corresponding workload. The number of replicas can be adjusted freely.
Effect
- When the number of replicas <= 100, the Pods are scheduled in the ACK zone.
- When the number of replicas > 100, 100 Pods are in the ACK zone and the extra Pods are scheduled in the ECI zone.
- The Pods in the ECI elastic zone are removed first when scaling in.
Multi-domain deployment
Deploy 100 Pods to each of two zones (zone-a, zone-b).
- Create a WorkloadSpread instance.
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: ws-demo
  namespace: deploy
spec:
  targetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 100
    patch:
      metadata:
        labels:
          deploy/zone: zone-a
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b
    maxReplicas: 100
    patch:
      metadata:
        labels:
          deploy/zone: zone-b
- Create a corresponding workload with 200 replicas, or perform a rolling update on an existing workload.
- If the spread across zones needs to change, first adjust the maxReplicas of the subsets, and then change the replicas of the workload.