Autoscale InferenceService with inference workload

InferenceService with target concurrency

Create InferenceService

Apply the TensorFlow example CR with the scaling target set to 1. The annotation `autoscaling.knative.dev/target` is a soft limit rather than a strictly enforced limit; if there is a sudden burst of requests, this value can be exceeded.

yaml

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
  annotations:
    autoscaling.knative.dev/target: "1"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
```

kubectl

```bash
kubectl apply -f autoscale.yaml
```

Expected Output

```
$ inferenceservice.serving.kserve.io/flowers-sample created
```
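Before sending traffic, you can confirm that the InferenceService has become ready. A minimal check (the output columns vary slightly across KServe versions):

```bash
# Wait for READY to report True before starting the load test.
kubectl get inferenceservice flowers-sample
```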

Predict InferenceService with concurrent requests

The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
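How you resolve these values depends on your ingress setup. A minimal sketch, assuming the Istio ingress gateway is exposed as a LoadBalancer service in the `istio-system` namespace (adjust the namespace, service name, and port name for your cluster):

```bash
# Look up the external IP and HTTP port of the Istio ingress gateway.
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```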

Send traffic in 30-second spurts, maintaining 5 in-flight requests.
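The load is generated with the hey load-testing tool; if it is not installed, it can typically be installed with Go:

```bash
# Installs the hey CLI into $GOPATH/bin (or $HOME/go/bin).
go install github.com/rakyll/hey@latest
```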

```bash
MODEL_NAME=flowers-sample
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
```

Expected Output

```
Summary:
  Total:        30.0193 secs
  Slowest:      10.1458 secs
  Fastest:      0.0127 secs
  Average:      0.0364 secs
  Requests/sec: 137.4449

  Total data:   1019122 bytes
  Size/request: 247 bytes

Response time histogram:
  0.013 [1]     |
  1.026 [4120]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.039 [0]     |
  3.053 [0]     |
  4.066 [0]     |
  5.079 [0]     |
  6.093 [0]     |
  7.106 [0]     |
  8.119 [0]     |
  9.133 [0]     |
  10.146 [5]    |

Latency distribution:
  10% in 0.0178 secs
  25% in 0.0188 secs
  50% in 0.0199 secs
  75% in 0.0210 secs
  90% in 0.0231 secs
  95% in 0.0328 secs
  99% in 0.1501 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0002 secs, 0.0127 secs, 10.1458 secs
  DNS-lookup:   0.0002 secs, 0.0000 secs, 0.1502 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0020 secs
  resp wait:    0.0360 secs, 0.0125 secs, 9.9791 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 4126 responses
```

Check the number of running pods now. KServe uses the Knative Serving autoscaler, which is based on the average number of in-flight requests per pod (concurrency). Since the scaling target is set to 1 and we load the service with 5 concurrent requests, the autoscaler tries to scale up to 5 pods. Notice that out of all the requests, there are 5 requests in the histogram that take around 10s; that is the cold-start cost of initially spawning the pods and downloading the model before they are ready to serve. The cold start may take longer (to pull the serving image) if the image is not cached on the node that the pod is scheduled on.

```
$ kubectl get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
flowers-sample-default-7kqt6-deployment-75d577dcdb-sr5wd    3/3     Running   0          42s
flowers-sample-default-7kqt6-deployment-75d577dcdb-swnk5    3/3     Running   0          62s
flowers-sample-default-7kqt6-deployment-75d577dcdb-t2njf    3/3     Running   0          62s
flowers-sample-default-7kqt6-deployment-75d577dcdb-vdlp9    3/3     Running   0          64s
flowers-sample-default-7kqt6-deployment-75d577dcdb-vm58d    3/3     Running   0          42s
```
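To watch the scale-up (and later the scale-down) in real time, you can follow the predictor pods. This assumes the `serving.kserve.io/inferenceservice` label that KServe applies to the pods it creates:

```bash
# Stream pod additions/terminations for this InferenceService during the load test.
kubectl get pods -l serving.kserve.io/inferenceservice=flowers-sample --watch
```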

Check Dashboard

View the Knative Serving Scaling dashboards (if configured).

kubectl

```bash
kubectl port-forward --namespace knative-monitoring $(kubectl get pods --namespace knative-monitoring --selector=app=grafana --output=jsonpath="{.items..metadata.name}") 3000
```

scaling dashboard

InferenceService with target QPS

Create the InferenceService

Apply the same TensorFlow example CR.

kubectl

```bash
kubectl apply -f autoscale.yaml
```

Expected Output

```
$ inferenceservice.serving.kserve.io/flowers-sample created
```

Predict InferenceService with target QPS

The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT, as in the previous example.

Send 30 seconds of traffic maintaining 50 qps.

```bash
MODEL_NAME=flowers-sample
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
hey -z 30s -q 50 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
```

Expected Output

```
Summary:
  Total:        30.0264 secs
  Slowest:      10.8113 secs
  Fastest:      0.0145 secs
  Average:      0.0731 secs
  Requests/sec: 683.5644

  Total data:   5069675 bytes
  Size/request: 247 bytes

Response time histogram:
  0.014 [1]     |
  1.094 [20474] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  2.174 [0]     |
  3.254 [0]     |
  4.333 [0]     |
  5.413 [0]     |
  6.493 [0]     |
  7.572 [0]     |
  8.652 [0]     |
  9.732 [0]     |
  10.811 [50]   |

Latency distribution:
  10% in 0.0284 secs
  25% in 0.0334 secs
  50% in 0.0408 secs
  75% in 0.0527 secs
  90% in 0.0765 secs
  95% in 0.0949 secs
  99% in 0.1334 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0001 secs, 0.0145 secs, 10.8113 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0196 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0031 secs
  resp wait:    0.0728 secs, 0.0144 secs, 10.7688 secs
  resp read:    0.0000 secs, 0.0000 secs, 0.0031 secs

Status code distribution:
  [200] 20525 responses
```

Check the number of running pods now. We are loading the service with 50 requests per second, and from the dashboard you can see that it hits an average concurrency of 10, so the autoscaler tries to scale up to 10 pods.

Check Dashboard

View the Knative Serving Scaling dashboards (if configured).

```bash
kubectl port-forward --namespace knative-monitoring $(kubectl get pods --namespace knative-monitoring --selector=app=grafana --output=jsonpath="{.items..metadata.name}") 3000
```

scaling dashboard

The autoscaler calculates average concurrency over a 60-second window, so it takes a minute to stabilize at the desired concurrency level. However, it also calculates a 6-second panic window and will enter panic mode if that window reaches 2x the target concurrency. From the dashboard you can see that it enters panic mode, in which the autoscaler operates on the shorter and more sensitive window. Once the panic conditions are no longer met for 60 seconds, the autoscaler returns to the 60-second stable window.
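These windows are cluster-wide Knative settings and can be tuned in the config-autoscaler ConfigMap in the knative-serving namespace. A minimal sketch showing the relevant keys with their Knative default values (verify the key names against your Knative version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  stable-window: "60s"                  # averaging window used in stable mode
  panic-window-percentage: "10.0"       # panic window = 10% of the stable window (6s)
  panic-threshold-percentage: "200.0"   # enter panic mode at 2x the target concurrency
```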

Autoscaling on GPU!

Autoscaling on GPUs is hard with GPU metrics; however, thanks to Knative's concurrency-based autoscaler, scaling on GPUs is pretty easy and effective!

Create the InferenceService with GPU resource

Apply the TensorFlow GPU example CR.

yaml

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
spec:
  predictor:
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
      runtimeVersion: "1.14.0-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
```

kubectl

```bash
kubectl apply -f autoscale-gpu.yaml
```
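Once the pods are up, you can sanity-check that the predictor actually requested a GPU. The label selector below is an assumption based on KServe's default pod labels:

```bash
# Show the GPU resource limits on the pods backing the GPU InferenceService.
kubectl describe pod -l serving.kserve.io/inferenceservice=flowers-sample-gpu | grep -i "nvidia.com/gpu"
```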

Predict InferenceService with concurrent requests

The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT, as in the previous examples.

Send 30 seconds of traffic maintaining 5 in-flight requests.

```bash
MODEL_NAME=flowers-sample-gpu
INPUT_PATH=input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -o jsonpath='{.status.url}' | cut -d "/" -f 3)
hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
```

Expected Output

```
Summary:
  Total:        30.0152 secs
  Slowest:      9.7581 secs
  Fastest:      0.0142 secs
  Average:      0.0350 secs
  Requests/sec: 142.9942

  Total data:   948532 bytes
  Size/request: 221 bytes

Response time histogram:
  0.014 [1]     |
  0.989 [4286]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.963 [0]     |
  2.937 [0]     |
  3.912 [0]     |
  4.886 [0]     |
  5.861 [0]     |
  6.835 [0]     |
  7.809 [0]     |
  8.784 [0]     |
  9.758 [5]     |

Latency distribution:
  10% in 0.0181 secs
  25% in 0.0189 secs
  50% in 0.0198 secs
  75% in 0.0210 secs
  90% in 0.0230 secs
  95% in 0.0276 secs
  99% in 0.0511 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0142 secs, 9.7581 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0291 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0023 secs
  resp wait:    0.0348 secs, 0.0141 secs, 9.7158 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 4292 responses
```

Autoscaling Customization

Autoscaling with ContainerConcurrency

ContainerConcurrency determines the number of simultaneous requests that can be processed by each replica of the InferenceService at any given time. It is a hard limit: if the concurrency reaches the hard limit, surplus requests are buffered and must wait until enough capacity is free to execute them.

yaml

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    containerConcurrency: 10
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
```

kubectl

```bash
kubectl apply -f autoscale-custom.yaml
```
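The containerConcurrency value is propagated to the underlying Knative Revision, so one way to verify it took effect is to inspect the Revisions in the namespace (a rough check; Revision names are generated by Knative):

```bash
# List each Knative Revision with its hard concurrency limit.
kubectl get revisions \
  -o custom-columns=NAME:.metadata.name,CONCURRENCY:.spec.containerConcurrency
```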

Enable scale down to zero

KServe sets minReplicas to 1 by default. If you want to enable scaling down to zero, especially for use cases like serving on GPUs, you can set minReplicas to 0 so that the pods automatically scale down to zero when no traffic is received.

yaml

```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  predictor:
    minReplicas: 0
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
```

kubectl

```bash
kubectl apply -f scale-down-to-zero.yaml
```
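After the service has been idle for a while (the stable window plus Knative's scale-to-zero grace period, roughly a minute or two with default settings), the predictor pods should terminate. One way to confirm, assuming KServe's default labels are present on the generated Deployment:

```bash
# With no traffic, the predictor Deployment should eventually report 0 replicas.
kubectl get deployment -l serving.kserve.io/inferenceservice=flowers-sample
```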