Deploy Tensorflow Model with InferenceService

Create the HTTP InferenceService

Create an InferenceService YAML which specifies the tensorflow framework and a storageUri pointing to a saved TensorFlow model, and name it tensorflow.yaml.

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "flower-sample"
    spec:
      predictor:
        tensorflow:
          storageUri: "gs://kfserving-samples/models/tensorflow/flowers"

Apply tensorflow.yaml to create the InferenceService. By default, it exposes an HTTP/REST endpoint.


    kubectl apply -f tensorflow.yaml

Expected Output

    $ inferenceservice.serving.kserve.io/flower-sample created

Wait for the InferenceService to be in the ready state.

    kubectl get isvc flower-sample
    NAME            URL                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
    flower-sample   http://flower-sample.default.example.com   True           100                              flower-sample-predictor-default-n9zs6   7m15s

Run a prediction

The first step is to determine the ingress IP and port and to set the INGRESS_HOST and INGRESS_PORT environment variables accordingly. Then send a prediction request with curl:

    MODEL_NAME=flower-sample
    INPUT_PATH=@./input.json
    SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH

Expected Output

    * Connected to localhost (::1) port 8080 (#0)
    > POST /v1/models/flower-sample:predict HTTP/1.1
    > Host: flower-sample.default.example.com
    > User-Agent: curl/7.73.0
    > Accept: */*
    > Content-Length: 16201
    > Content-Type: application/x-www-form-urlencoded
    >
    * upload completely sent off: 16201 out of 16201 bytes
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < content-length: 222
    < content-type: application/json
    < date: Sun, 31 Jan 2021 01:01:50 GMT
    < x-envoy-upstream-service-time: 280
    < server: istio-envoy
    <
    {
        "predictions": [
            {
                "scores": [0.999114931, 9.20987877e-05, 0.000136786213, 0.000337257545, 0.000300532585, 1.84813616e-05],
                "prediction": 0,
                "key": " 1"
            }
        ]
    }
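
If you prefer to send the request from Python rather than curl, a minimal sketch using the requests library is shown below. It is not part of the sample; it assumes INGRESS_HOST, INGRESS_PORT and SERVICE_HOSTNAME are exported as in the shell snippet above and that input.json carries the v1 "instances" payload.

    # Minimal Python alternative to the curl call above (a sketch, not part of the sample).
    import json
    import os

    import requests

    model_name = "flower-sample"
    url = (
        f"http://{os.environ['INGRESS_HOST']}:{os.environ['INGRESS_PORT']}"
        f"/v1/models/{model_name}:predict"
    )

    # input.json is assumed to contain the v1 payload: {"instances": [...]}
    with open("input.json") as f:
        payload = json.load(f)

    # The Host header is what routes the request through the ingress gateway
    # to this particular InferenceService.
    headers = {"Host": os.environ["SERVICE_HOSTNAME"]}

    resp = requests.post(url, json=payload, headers=headers)
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=2))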

Canary Rollout

Canary rollout is a great way to control the risk of rolling out a new model: first move a small percentage of the traffic to the new model, then gradually increase it. To run a canary rollout, apply canary.yaml with the canaryTrafficPercent field specified.

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "flower-sample"
    spec:
      predictor:
        canaryTrafficPercent: 20
        tensorflow:
          storageUri: "gs://kfserving-samples/models/tensorflow/flowers-2"


    kubectl apply -f canary.yaml

To verify that the traffic split percentage has been applied correctly, run the following command:


    kubectl get isvc flower-sample
    NAME            URL                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION                   LATESTREADYREVISION                     AGE
    flower-sample   http://flower-sample.default.example.com   True    80     20       flower-sample-predictor-default-n9zs6   flower-sample-predictor-default-2kwtr   7m15s

As you can see, the traffic is split between the last rolled-out revision and the current latest ready revision. KServe automatically tracks the last rolled-out (stable) revision for you, so you no longer need to maintain both default and canary specs on the InferenceService as in v1alpha2.
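
If you would rather adjust the canary percentage programmatically than edit and re-apply the YAML, a sketch along these lines using the official kubernetes Python client should work. It assumes a local kubeconfig with access to the cluster; depending on the client version you may need to force the merge-patch content type.

    # Sketch: bump canaryTrafficPercent on the InferenceService via the Kubernetes API.
    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() when running inside a pod
    api = client.CustomObjectsApi()

    # Merge-patch only the canaryTrafficPercent field; KServe reconciles the traffic split.
    api.patch_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace="default",
        plural="inferenceservices",
        name="flower-sample",
        body={"spec": {"predictor": {"canaryTrafficPercent": 50}}},
    )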

Create the gRPC InferenceService

Create an InferenceService that exposes the gRPC port; by default it listens on port 9000.

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "flower-grpc"
    spec:
      predictor:
        tensorflow:
          storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
          ports:
            - containerPort: 9000
              name: h2c
              protocol: TCP

Apply grpc.yaml to create the gRPC InferenceService.


    kubectl apply -f grpc.yaml

Expected Output

    $ inferenceservice.serving.kserve.io/flower-grpc created

Run a prediction

We use a Python gRPC client for the prediction, so you need to create a Python virtual environment and install the tensorflow-serving-api package.

    # The prediction script is written in TensorFlow 1.x
    pip install "tensorflow-serving-api>=1.14.0,<2.0.0"
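
The grpc_client.py invoked in the next step ships with the KServe samples and is not reproduced here. As a rough, hypothetical sketch of what such a client does, it builds a TensorFlow Serving PredictRequest from input.json and sends it through the ingress gateway, using the service hostname as the gRPC authority so the request is routed to the right InferenceService. Input names and the payload layout below follow the flowers sample and are assumptions, not the exact sample code.

    # Hypothetical sketch of a gRPC prediction client (the real grpc_client.py
    # ships with the KServe samples); tensor names follow the flowers sample.
    import argparse
    import base64
    import json

    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    parser = argparse.ArgumentParser()
    parser.add_argument("--host", required=True)
    parser.add_argument("--port", required=True)
    parser.add_argument("--model", required=True)
    parser.add_argument("--hostname", required=True, help="virtual host used for ingress routing")
    parser.add_argument("--input_path", required=True)
    args = parser.parse_args()

    # Connect to the ingress gateway; the authority (Host) header selects the InferenceService.
    channel = grpc.insecure_channel(
        f"{args.host}:{args.port}",
        options=(("grpc.default_authority", args.hostname),),
    )
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # input.json is assumed to follow the flowers sample layout:
    # {"instances": [{"image_bytes": {"b64": "..."}, "key": "1"}]}
    with open(args.input_path) as f:
        instance = json.load(f)["instances"][0]
    image = base64.b64decode(instance["image_bytes"]["b64"])

    request = predict_pb2.PredictRequest()
    request.model_spec.name = args.model
    request.model_spec.signature_name = "serving_default"
    request.inputs["image_bytes"].CopyFrom(tf.make_tensor_proto(image, shape=[1]))
    request.inputs["key"].CopyFrom(tf.make_tensor_proto(instance["key"], shape=[1]))

    print(stub.Predict(request, timeout=10.0))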

Run the prediction script

    MODEL_NAME=flower-grpc
    INPUT_PATH=./input.json
    SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    python grpc_client.py --host $INGRESS_HOST --port $INGRESS_PORT --model $MODEL_NAME --hostname $SERVICE_HOSTNAME --input_path $INPUT_PATH

Expected Output

    outputs {
      key: "key"
      value {
        dtype: DT_STRING
        tensor_shape {
          dim {
            size: 1
          }
        }
        string_val: " 1"
      }
    }
    outputs {
      key: "prediction"
      value {
        dtype: DT_INT64
        tensor_shape {
          dim {
            size: 1
          }
        }
        int64_val: 0
      }
    }
    outputs {
      key: "scores"
      value {
        dtype: DT_FLOAT
        tensor_shape {
          dim {
            size: 1
          }
          dim {
            size: 6
          }
        }
        float_val: 0.9991149306297302
        float_val: 9.209887502947822e-05
        float_val: 0.00013678647519554943
        float_val: 0.0003372581850271672
        float_val: 0.0003005331673193723
        float_val: 1.848137799242977e-05
      }
    }
    model_spec {
      name: "flower-grpc"
      version {
        value: 1
      }
      signature_name: "serving_default"
    }