SLO Configuration

Introduction

Koordinator manages the SLO configurations with a ConfigMap. The ConfigMap is used by the slo-controller; its name and namespace can be specified via the startup arguments of koord-manager (koordinator-system/slo-controller-config by default). It contains the following keys:

  • colocation-config: the configuration for colocation, e.g., whether to enable the colocated Batch resources, the colocation watermark.
  • resource-threshold-config: the configuration for threshold-based suppression/eviction strategies, e.g., the CPU suppression threshold, the memory eviction threshold.
  • resource-qos-config: the configuration for QoS features, e.g., Group Identity for BE pods, Memory QoS for LS pods, Last-Level Cache partitioning for BE pods.
  • cpu-burst-config: the configuration for the CPU Burst feature, e.g., the maximum burst ratio of pods.
  • system-config: the configuration for system-level settings, e.g., the global minimum free memory watermark factor `min_free_kbytes`.
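
All of these keys live side by side in the `data` section of that ConfigMap. A minimal skeleton is sketched below with every value set to an empty JSON object (i.e., all built-in defaults apply); the full template is listed in the Configurations section:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      # default name and namespace; both can be changed via the koord-manager startup arguments
      name: slo-controller-config
      namespace: koordinator-system
    data:
      colocation-config: |
        {}
      resource-threshold-config: |
        {}
      resource-qos-config: |
        {}
      cpu-burst-config: |
        {}
      system-config: |
        {}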

Configuration Levels

Each config is defined in both a cluster-level and a node-level form.

For example,

    type ColocationCfg struct {
        ColocationStrategy `json:",inline"`
        NodeConfigs        []NodeColocationCfg `json:"nodeConfigs,omitempty"`
    }

    type ResourceQOSCfg struct {
        ClusterStrategy *slov1alpha1.ResourceQOSStrategy `json:"clusterStrategy,omitempty"`
        NodeStrategies  []NodeResourceQOSStrategy        `json:"nodeStrategies,omitempty"`
    }

The cluster-level config sets the global configuration, while the node-level config lets users adjust the configuration of specific nodes, which is especially useful for gray-scale (canary) deployment.
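
For example, the colocation feature can be rolled out gradually: keep `enable` false at the cluster level and turn it on only for nodes that carry a canary label. The sketch below illustrates this with `colocation-config`; the label `node-pool=gray` is an assumed example, not a predefined Koordinator label:

    # the label "node-pool=gray" below is hypothetical and only used for illustration
    colocation-config: |
      {
        "enable": false,
        "nodeConfigs": [
          {
            "name": "gray-pool",
            "nodeSelector": {
              "matchLabels": {
                "node-pool": "gray"
              }
            },
            "enable": true
          }
        ]
      }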

Note that most configurable fields have default values inside the components (koordlet, koord-manager), so usually only the changed parameters need to be edited.
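
For instance, to enable colocation and only raise the memory reclaim threshold, a `colocation-config` carrying just those two fields is sufficient; every field that is omitted keeps its built-in default:

    colocation-config: |
      {
        "enable": true,
        "memoryReclaimThresholdPercent": 70
      }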

NodeSLO

The data field of the SLO config is parsed by koord-manager. Koord-manager checks whether the config data is legal, and then updates the parsed config into the NodeSLO object of each node. If the parsing fails, koord-manager records events on the ConfigMap object to warn about the unmarshal errors. For the agent component koordlet, it watches the spec of the NodeSLO and reconciles the node-level QoS features accordingly.

    apiVersion: slo.koordinator.sh/v1alpha1
    kind: NodeSLO
    metadata:
      name: test-node
    spec:
      cpuBurstStrategy: {}
      extensions: {}
      resourceQOSStrategy: {}
      systemStrategy: {}
      # parsed from the `resource-threshold-config` data
      resourceUsedThresholdWithBE:
        cpuSuppressPolicy: cpuset
        cpuSuppressThresholdPercent: 65
        enable: true
        memoryEvictThresholdPercent: 70

Configurations

Reference version: Koordinator v1.2

The template of the SLO configurations is as follows:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: slo-controller-config
      namespace: koordinator-system
    data:
      # colocation-config is the configuration for colocation.
      # Related features: Dynamic resource over-commitment, Load-aware scheduling, Load-aware descheduling.
      # - enable: whether to enable the colocation. If false, the reclaimed resources of the node allocatable (e.g. `kubernetes.io/batch-cpu`) will be removed.
      # - metricAggregateDurationSeconds: the aggregated duration of node metrics reporting.
      # - metricReportIntervalSeconds: the reporting interval of the node metrics.
      # - metricAggregatePolicy: policies of reporting node metrics in different durations.
      # - cpuReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed cpu resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold.
      # - memoryReclaimThresholdPercent: the reclaim threshold for calculating the reclaimed memory resource. Basically, the reclaimed resource cannot reclaim the unused resources which are exceeding the threshold.
      # - memoryCalculatePolicy: the policy for calculating the reclaimable memory resource. If set to `request`, only the unallocated memory resources of high-priority pods are reclaimable, and no allocated memory can be reclaimed.
      # - degradeTimeMinutes: the threshold duration after which the colocation is degraded if the node metrics have not been updated.
      # - updateTimeThresholdSeconds: the threshold duration to force updating the reclaimed resources with the latest calculated result.
      # - resourceDiffThreshold: the threshold of the difference between the calculated reclaimed resources and the current values; the reclaimed resources are updated when the difference exceeds this threshold.
      # - nodeConfigs: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
      colocation-config: |
        {
          "enable": false,
          "metricAggregateDurationSeconds": 300,
          "metricReportIntervalSeconds": 60,
          "metricAggregatePolicy": {
            "durations": [
              "5m",
              "10m",
              "15m"
            ]
          },
          "cpuReclaimThresholdPercent": 60,
          "memoryReclaimThresholdPercent": 65,
          "memoryCalculatePolicy": "usage",
          "degradeTimeMinutes": 15,
          "updateTimeThresholdSeconds": 300,
          "resourceDiffThreshold": 0.1,
          "nodeConfigs": [
            {
              "name": "anolis",
              "nodeSelector": {
                "matchLabels": {
                  "kubernetes.io/kernel": "anolis"
                }
              },
              "updateTimeThresholdSeconds": 360,
              "resourceDiffThreshold": 0.2
            }
          ]
        }
      # The configuration for threshold-based strategies.
      # Related features: BECPUSuppress, BEMemoryEvict, BECPUEvict.
      # - clusterStrategy: the cluster-level configuration.
      # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
      # - enable: whether to enable the threshold-based strategies or not. If false, all threshold-based strategies are disabled. If set to true, CPU Suppress and Memory Evict are enabled by default.
      # - cpuSuppressThresholdPercent: the node cpu utilization threshold to suppress BE pods' usage.
      # - cpuSuppressPolicy: the policy of cpu suppression. If set to `cpuset`, the BE pods' `cpuset.cpus` will be reconciled when suppression. If set to `cfsQuota`, the BE pods' `cpu.cfs_quota_us` will be reconciled.
      # - memoryEvictThresholdPercent: the node memory utilization threshold to evict BE pods.
      # - memoryEvictLowerPercent: the node memory utilization threshold to stop the memory eviction. By default, `lowerPercent = thresholdPercent - 2`.
      # - cpuEvictBESatisfactionLowerPercent: the cpu satisfaction threshold to start the cpu eviction (also requires the BE util threshold to be met).
      # - cpuEvictBEUsageThresholdPercent: the BE utilization (BEUsage / BERealLimit) threshold to start the cpu eviction (also requires the cpu satisfaction threshold to be met).
      # - cpuEvictBESatisfactionUpperPercent: the cpu satisfaction threshold to stop the cpu eviction.
      # - cpuEvictTimeWindowSeconds: the time window of the cpu metrics for the cpu eviction.
      resource-threshold-config: |
        {
          "clusterStrategy": {
            "enable": false,
            "cpuSuppressThresholdPercent": 65,
            "cpuSuppressPolicy": "cpuset",
            "memoryEvictThresholdPercent": 70,
            "memoryEvictLowerPercent": 65,
            "cpuEvictBESatisfactionUpperPercent": 90,
            "cpuEvictBESatisfactionLowerPercent": 60,
            "cpuEvictBEUsageThresholdPercent": 90
          },
          "nodeStrategies": [
            {
              "name": "anolis",
              "nodeSelector": {
                "matchLabels": {
                  "kubernetes.io/kernel": "anolis"
                }
              },
              "cpuEvictBEUsageThresholdPercent": 80
            }
          ]
        }
      # The configuration for QoS-based features.
      # Related features: CPUQoS (GroupIdentity), MemoryQoS (CgroupReconcile), ResctrlQoS.
      # - clusterStrategy: the cluster-level configuration.
      # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
      # - lsrClass/lsClass/beClass: the configuration for pods of QoS LSR/LS/BE respectively.
      # - cpuQOS: the configuration of CPU QoS.
      #   - enable: whether to enable CPU QoS. If set to `false`, the related cgroup configs will be reset to the system default.
      #   - groupIdentity: the priority level of the Group Identity ([-1, 2]). `2` means the highest priority, while `-1` means the lowest priority. Anolis OS required.
      # - memoryQOS: the configuration of Memory QoS.
      #   - enable: whether to enable Memory QoS. If set to `false`, the related cgroup configs will be reset to the system default.
      #   - minLimitPercent: the scale percentage for setting the `memory.min` based on the container's request. It enables the memory protection from the Linux memory reclaim.
      #   - lowLimitPercent: the scale percentage for setting the `memory.low` based on the container's request. It enables the memory soft protection from the Linux memory reclaim.
      #   - throttlingPercent: the scale percentage for setting the `memory.high` based on the container's limit. It enables the memory throttling in cgroup level.
      #   - wmarkRatio: the ratio of container-level asynchronous memory reclaim based on the container's limit. Anolis OS required.
      #   - wmarkScalePermill: the per-mill of container memory to reclaim in once asynchronous memory reclaim. Anolis OS required.
      #   - wmarkMinAdj: the adjustment percentage of global memory min watermark. It affects the reclaim priority when the node free memory is quite low. Anolis OS required.
      # - resctrlQOS: the configuration of Resctrl (Intel RDT) QoS.
      #   - enable: whether to enable Resctrl QoS.
      #   - catRangeStartPercent: the starting percentage of the L3 Cache way partitioning. L3 CAT required.
      #   - catRangeEndPercent: the ending percentage of the L3 Cache way partitioning. L3 CAT required.
      #   - mbaPercent: the allocation percentage of the memory bandwidth. MBA required.
      resource-qos-config: |
        {
          "clusterStrategy": {
            "lsrClass": {
              "cpuQOS": {
                "enable": false,
                "groupIdentity": 2
              },
              "memoryQOS": {
                "enable": false,
                "minLimitPercent": 0,
                "lowLimitPercent": 0,
                "throttlingPercent": 0,
                "wmarkRatio": 95,
                "wmarkScalePermill": 20,
                "wmarkMinAdj": -25,
                "priorityEnable": 0,
                "priority": 0,
                "oomKillGroup": 0
              },
              "resctrlQOS": {
                "enable": false,
                "catRangeStartPercent": 0,
                "catRangeEndPercent": 100,
                "mbaPercent": 100
              }
            },
            "lsClass": {
              "cpuQOS": {
                "enable": false,
                "groupIdentity": 2
              },
              "memoryQOS": {
                "enable": false,
                "minLimitPercent": 0,
                "lowLimitPercent": 0,
                "throttlingPercent": 0,
                "wmarkRatio": 95,
                "wmarkScalePermill": 20,
                "wmarkMinAdj": -25,
                "priorityEnable": 0,
                "priority": 0,
                "oomKillGroup": 0
              },
              "resctrlQOS": {
                "enable": false,
                "catRangeStartPercent": 0,
                "catRangeEndPercent": 100,
                "mbaPercent": 100
              }
            },
            "beClass": {
              "cpuQOS": {
                "enable": false,
                "groupIdentity": -1
              },
              "memoryQOS": {
                "enable": false,
                "minLimitPercent": 0,
                "lowLimitPercent": 0,
                "throttlingPercent": 0,
                "wmarkRatio": 95,
                "wmarkScalePermill": 20,
                "wmarkMinAdj": 50,
                "priorityEnable": 0,
                "priority": 0,
                "oomKillGroup": 0
              },
              "resctrlQOS": {
                "enable": false,
                "catRangeStartPercent": 0,
                "catRangeEndPercent": 30,
                "mbaPercent": 100
              }
            }
          },
          "nodeStrategies": [
            {
              "name": "anolis",
              "nodeSelector": {
                "matchLabels": {
                  "kubernetes.io/kernel": "anolis"
                }
              },
              "beClass": {
                "memoryQOS": {
                  "wmarkRatio": 90
                }
              }
            }
          ]
        }
      # The configuration for the CPU Burst.
      # Related features: CPUBurst.
      # - clusterStrategy: the cluster-level configuration.
      # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
      # - policy: the policy of CPU Burst. If set to `none`, the CPU Burst is disabled. If set to `auto`, the CPU Burst is fully enabled. If set to `cpuBurstOnly`, only the Linux CFS Burst feature is enabled. If set to `cfsQuotaBurstOnly`, only the cfs quota burst is enabled.
      # - cpuBurstPercent: the percentage of Linux CFS Burst. It affects the value of `cpu.cfs_burst_us` of pod/container cgroups. It specifies the percentage to which the CPU limit can be increased by CPU Burst.
      # - cfsQuotaBurstPercent: the percentage of cfs quota burst. It affects the scaled ratio of `cpu.cfs_quota_us` of pod/container cgroups. It specifies the maximum percentage to which the value of cfs_quota in the cgroup parameters can be increased.
      # - cfsQuotaBurstPeriodSeconds: the maximum period of once cfs quota burst. `-1` indicates that the time period in which the container can run with an increased CFS quota is unlimited.
      # - sharePoolThresholdPercent: the threshold of share pool utilization. If the share pool utilization is too high, CPU Burst will be stopped and reset to avoid machine overload.
      cpu-burst-config: |
        {
          "clusterStrategy": {
            "policy": "none",
            "cpuBurstPercent": 1000,
            "cfsQuotaBurstPercent": 300,
            "cfsQuotaBurstPeriodSeconds": -1,
            "sharePoolThresholdPercent": 50
          },
          "nodeStrategies": [
            {
              "name": "anolis",
              "nodeSelector": {
                "matchLabels": {
                  "kubernetes.io/kernel": "anolis"
                }
              },
              "policy": "cfsQuotaBurstOnly",
              "cfsQuotaBurstPercent": 400
            }
          ]
        }
      # The configuration for system-level settings.
      # Related features: SystemConfig.
      # - clusterStrategy: the cluster-level configuration.
      # - nodeStrategies: the node-level configurations which match the nodes via the node selector and override the cluster configuration.
      # - minFreeKbytesFactor: the factor for calculating the global minimum memory free watermark `/proc/sys/vm/min_free_kbytes`. `min_free_kbytes = minFreeKbytesFactor * nodeTotalMemory / 10000`.
      # - watermarkScaleFactor: the reclaim factor `/proc/sys/vm/watermark_scale_factor` in once global memory reclaim.
      # - memcgReapBackGround: whether to enable the reaper for orphan memory cgroups.
      system-config: |-
        {
          "clusterStrategy": {
            "minFreeKbytesFactor": 100,
            "watermarkScaleFactor": 150,
            "memcgReapBackGround": 0
          },
          "nodeStrategies": [
            {
              "name": "anolis",
              "nodeSelector": {
                "matchLabels": {
                  "kubernetes.io/kernel": "anolis"
                }
              },
              "minFreeKbytesFactor": 100,
              "watermarkScaleFactor": 150
            }
          ]
        }
      # The configuration for host application settings.
      # - name: name of the host application.
      # - qos: QoS class of the application.
      # - cgroupPath: cgroup path of the application, the directory equals to `${base}/${parentDir}/${relativePath}`.
      #   - cgroupPath.base: cgroup base dir of the application, the format varies across cgroup drivers.
      #   - cgroupPath.parentDir: cgroup parent path under the base dir. By default it is "host-latency-sensitive/" for LS and "host-latency-sensitive/" for BE.
      #   - cgroupPath.relativePath: cgroup relative path under the parent dir.
      host-application-config: |
        {
          "applications": [
            {
              "name": "nginx",
              "qos": "LS",
              "cgroupPath": {
                "base": "CgroupRoot",
                "parentDir": "host-latency-sensitive/",
                "relativePath": "nginx/"
              }
            }
          ]
        }
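
The node-level entries above all select nodes via `matchLabels`. Assuming the `nodeSelector` field follows the standard Kubernetes label selector semantics, `matchExpressions` can be used in the same place; the sketch below is illustrative only, and the label key `example.com/pool` is made up:

    # assumption: nodeSelector is a standard Kubernetes LabelSelector, so matchExpressions is accepted;
    # the label key "example.com/pool" is hypothetical
    resource-threshold-config: |
      {
        "clusterStrategy": {
          "enable": true
        },
        "nodeStrategies": [
          {
            "name": "pool-a-or-b",
            "nodeSelector": {
              "matchExpressions": [
                {
                  "key": "example.com/pool",
                  "operator": "In",
                  "values": ["pool-a", "pool-b"]
                }
              ]
            },
            "cpuSuppressThresholdPercent": 60
          }
        ]
      }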

For more information, please check the user manuals and design docs of the related features.

Quick Start

  1. Check the current SLO configurations via the ConfigMap `koordinator-system/slo-controller-config`.

    $ kubectl get configmap -n koordinator-system slo-controller-config -o yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      annotations:
        meta.helm.sh/release-name: koordinator
        meta.helm.sh/release-namespace: default
      labels:
        app.kubernetes.io/managed-by: Helm
      name: slo-controller-config
      namespace: koordinator-system
    data:
      colocation-config: |
        {
          "enable": false,
          "metricAggregateDurationSeconds": 300,
          "metricReportIntervalSeconds": 60,
          "cpuReclaimThresholdPercent": 60,
          "memoryReclaimThresholdPercent": 65,
          "memoryCalculatePolicy": "usage",
          "degradeTimeMinutes": 15,
          "updateTimeThresholdSeconds": 300,
          "resourceDiffThreshold": 0.1
        }
      resource-threshold-config: |
        {
          "clusterStrategy": {
            "enable": false
          }
        }
  2. Edit the ConfigMap `koordinator-system/slo-controller-config` to update the SLO configurations.

    $ kubectl edit configmap -n koordinator-system slo-controller-config

For example, the ConfigMap is edited as follows:

    data:
      # ...
      resource-threshold-config: |
        {
          "clusterStrategy": {
            "enable": true,
            "cpuSuppressThresholdPercent": 60,
            "cpuSuppressPolicy": "cpuset",
            "memoryEvictThresholdPercent": 60
          }
        }
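
Node-level overrides can also be added in the same edit if only part of the nodes should use different thresholds. Below is a sketch that keeps the cluster-level strategy above and adds a `nodeStrategies` entry; the node label `node-pool=gray` is only an assumed example for illustration:

    data:
      # ...
      # the label "node-pool=gray" below is hypothetical and only used for illustration
      resource-threshold-config: |
        {
          "clusterStrategy": {
            "enable": true,
            "cpuSuppressThresholdPercent": 60,
            "cpuSuppressPolicy": "cpuset",
            "memoryEvictThresholdPercent": 60
          },
          "nodeStrategies": [
            {
              "name": "gray-pool",
              "nodeSelector": {
                "matchLabels": {
                  "node-pool": "gray"
                }
              },
              "memoryEvictThresholdPercent": 55
            }
          ]
        }
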
  3. Verify that the NodeSLO is successfully dispatched.

NOTE: The default values are omitted in the NodeSLO.

    $ kubectl get nodeslo.slo.koordinator.sh test-node -o yaml
    apiVersion: slo.koordinator.sh/v1alpha1
    kind: NodeSLO
    metadata:
      name: test-node
    spec:
      # ...
      extensions: {}
      resourceUsedThresholdWithBE:
        cpuSuppressPolicy: cpuset
        cpuSuppressThresholdPercent: 60
        enable: true
        memoryEvictThresholdPercent: 60