End-to-end Kubeflow on AWS

Running Kubeflow using AWS services

This guide describes how to deploy Kubeflow using AWS services such as EKS and Cognito. It consists of three parts: deploying the Kubernetes infrastructure, deploying Kubeflow, and finally deploying models using KFServing.

The target audience is a member of an SRE team that builds this platform and provides a dashboard to data scientists. The data scientists, in turn, can run their training workflows in their dedicated namespace and serve their models via a public endpoint.

AWS services used

  • Managed kubernetes (EKS) started with eksctl
  • Kubernetes nodegroups (in EC2 auto-scaling groups) managed by eksctl
  • ALB for istio-ingressgateway in front of all virtual services
  • Cognito for user and api authentication
  • Certificate manager for SSL certificates
  • Route53 to manage the domain

Prerequisites

Access to an AWS account via the command line is required; make sure you are able to execute AWS CLI commands. Install the following programs on the system from which you provision the infrastructure (a laptop or a configuration management tool):

  • eksctl
  • kubectl
  • istioctl
  • kn
  • kfctl

Deploy the Kubernetes cluster

This step is only required once, when building the infra for the platform.

Create a cluster.yaml file:

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: aiplatform
      region: eu-west-1
    nodeGroups:
      - name: ng
        desiredCapacity: 6
        instanceType: m5.xlarge

Then spin up the cluster using eksctl:

    eksctl create cluster -f cluster.yaml

That starts a CloudFormation stack for the EKS control plane and one stack for each nodegroup (a single nodegroup in our case). You can follow the progress of the creation on the CloudFormation page in the console.

The cluster is ready when kubectl reports that the nodes are Ready:

    kubectl get nodes

    NAME                                           STATUS   ROLES    AGE   VERSION
    ip-192-168-10-217.eu-west-1.compute.internal   Ready    <none>   18d   v1.14.7-eks-1861c5
    ip-192-168-28-92.eu-west-1.compute.internal    Ready    <none>   18d   v1.14.7-eks-1861c5
    ip-192-168-51-201.eu-west-1.compute.internal   Ready    <none>   18d   v1.14.7-eks-1861c5
    ip-192-168-63-25.eu-west-1.compute.internal    Ready    <none>   18d   v1.14.7-eks-1861c5
    ip-192-168-68-104.eu-west-1.compute.internal   Ready    <none>   18d   v1.14.7-eks-1861c5
    ip-192-168-77-56.eu-west-1.compute.internal    Ready    <none>   18d   v1.14.7-eks-1861c5
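When provisioning is scripted, the readiness check can be automated. The following is a minimal sketch and not part of the original setup: the helper names count_ready_nodes and wait_for_nodes, and the 10-second polling interval, are illustrative assumptions.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Poll until at least the expected number of nodes is Ready
# (hypothetical helper; assumes kubectl is configured for the cluster).
wait_for_nodes() {
  expected="$1"
  while [ "$(kubectl get nodes --no-headers | count_ready_nodes)" -lt "$expected" ]; do
    sleep 10
  done
}
```

With the cluster.yaml above, wait_for_nodes 6 would return once all six nodes report Ready. Nodes in a state such as Ready,SchedulingDisabled are deliberately not counted.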

If you’d like to change the nodegroup scaling there are two options, either via the EC2 auto-scaling group or using eksctl:

    eksctl scale nodegroup --cluster=aiplatform --nodes=4 ng

Deploy the kubernetes dashboard

To deploy the Kubernetes dashboard as described in the AWS guide for deploying the Kubernetes web UI, first download and install the metrics server:

    wget https://api.github.com/repos/kubernetes-sigs/metrics-server/tarball/v0.3.6
    tar zxvf v0.3.6
    kubectl apply -f kubernetes-sigs-metrics-server-d1f4f6f/deploy/1.8+

Validate:

    kubectl get deployment metrics-server -n kube-system

    NAME             READY   UP-TO-DATE   AVAILABLE   AGE
    metrics-server   1/1     1            1           18d

To install the dashboard and create a user to access it, first define an eks-admin service account in a file named eks-admin-service-account.yaml:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: eks-admin
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1beta1
    kind: ClusterRoleBinding
    metadata:
      name: eks-admin
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
      - kind: ServiceAccount
        name: eks-admin
        namespace: kube-system

Then deploy the dashboard and apply the service account:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc5/aio/deploy/recommended.yaml
    kubectl apply -f eks-admin-service-account.yaml

To access the Kubernetes dashboard, bring it to your localhost with a proxy:

    kubectl proxy

Then open the Kubernetes dashboard UI in your browser.

Exposing the kubernetes dashboard via an istio virtual service is not recommended.

To log in, get the token using the following command:

    kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep eks-admin | awk '{print $1}')
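The describe output contains several fields besides the token. As a convenience, a small filter can print only the token value. This is a sketch under the assumption that the output contains a line starting with "token:" followed by the value; the helper name extract_token is illustrative.

```shell
# Print only the value of the "token:" line from
# `kubectl describe secret` output read on stdin.
extract_token() {
  awk '$1 == "token:" { print $2 }'
}
```

Piping the describe command above into extract_token prints the bearer token, ready to paste into the dashboard login form.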

See the AWS documentation for more information on creating and managing EKS clusters.

Deploy Kubeflow

In this section you will prepare the ecosystem required by kubeflow, and you will configure the kfctl.yaml file with the custom information for your environment.

Cognito and certificates

Route53

It is handy to have a domain managed by Route53 to deal with all the DNS records you will have to add (wildcard for istio-ingressgateway, validation for the certificate manager, etc).

If your domain.com zone is not managed by Route53, delegate the management of a subdomain to a Route53 hosted zone; in our example we delegate the subdomain platform.domain.com. To do that, create a new hosted zone platform.domain.com, copy the NS entries that Route53 generates for it, and create the same NS records in the domain.com zone.

In the following example, domain.com is hosted at GoDaddy and has no subdomain there. We create a subdomain that uses Amazon Route53 as its DNS service. If you already have a subdomain in your domain service, you can use Route53 for it as well. See the Route53 documentation for details on both cases.

Route53 Hosted Zone

As you can see, Route53 created four name servers, which need to be configured in your domain service. Add an NS record whose key is the subdomain name platform and whose values are the name servers from Route53.

Note: different domain providers have different settings, so check the guidance from your domain provider.

Route53 Hosted Zone

For Cognito to use a custom domain name, an A record is required that resolves platform.domain.com as the root domain; it can be a Route53 alias to the ALB. You can use an arbitrary IP address for now and update the value once the ALB has been created.

If you’re not using Route53, you can point that A record anywhere.

Route53 A Record

The remaining record sets in the hosted zone will be created in the next sections of this guide.

Certificate Manager

Create two certificates in Certificate Manager for *.platform.domain.com, one in N. Virginia (us-east-1) and one in the region of your choice. Cognito requires a certificate in N. Virginia in order to have a custom domain for a user pool. The second certificate is required by the ingress gateway when the platform does not run in N. Virginia; in our example it runs in Dublin (eu-west-1). For the validation of both certificates, you will be asked to create one record in the hosted zone we created above.
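The two certificate requests can also be made from the command line. This is a sketch under assumptions rather than the procedure this guide was written with: it assumes DNS validation, and the function name request_wildcard_cert is illustrative. The domain and regions follow the example in this guide.

```shell
# Request a wildcard certificate for the platform subdomain in the given
# region, using DNS validation (an assumption), and print its ARN.
request_wildcard_cert() {
  region="$1"
  aws acm request-certificate \
    --domain-name "*.platform.domain.com" \
    --validation-method DNS \
    --region "$region" \
    --query CertificateArn \
    --output text
}
```

Calling request_wildcard_cert us-east-1 covers the certificate for the Cognito custom domain, and request_wildcard_cert eu-west-1 covers the one for the ingress gateway. The validation record still has to be created in the hosted zone afterwards.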

Cognito

Create a user pool in Cognito. Type a pool name and choose Review defaults and Create pool.

Create some users in Users and groups, these are the users who will login to the central dashboard.

Add an App client with any name and the default options.

In the App client settings select Authorization code grant flow and email, openid, aws.cognito.signin.user.admin and profile scopes.

Use https://kubeflow.platform.domain.com/oauth2/idpresponse in the Callback URL(s).

Cognito Custom Domain Callback URL

Under Domain name, choose Use your domain, type auth.platform.domain.com and select the *.platform.domain.com AWS managed certificate you created in N. Virginia. Creating the domain takes up to 15 minutes.

Cognito Custom Domain

When the domain is created, Cognito returns the Alias target CloudFront address, for which you need to create an A record auth.platform.domain.com in the hosted zone.

Route53 auth A Record

Take note of the following 5 values:

  • The ARN of the certificate from the Certificate Manager of N. Virginia (certArn).
  • The Pool ARN (cognitoUserPoolArn) of the user pool found in Cognito general settings.
  • The App client id (cognitoAppClientId), found in Cognito App clients.
  • The auth.platform.domain.com as the cognitoUserPoolDomain.
  • The IAM role name(s) of the created nodegroup(s), obtained using the following command (with AWS_CLUSTER_NAME set to your cluster name, aiplatform in this guide):

        aws iam list-roles \
        | jq -r ".Roles[] \
        | select(.RoleName \
        | startswith(\"eksctl-$AWS_CLUSTER_NAME\") and contains(\"NodeInstanceRole\")) \
        .RoleName"

Download and edit the kfctl manifest file:

    wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_aws_cognito.v1.2.0.yaml

At the end of the file is the KfAwsPlugin plugin section. In the cognito part of the spec, replace the 4 values you recorded above, and list the nodegroup role names under roles.

    - kind: KfAwsPlugin
      metadata:
        name: aws
      spec:
        auth:
          cognito:
            certArn: arn:aws:acm:eu-west-1:xxxxx:certificate/xxxxxxxxxxxxx-xxxx
            cognitoAppClientId: xxxxxbxxxxxx
            cognitoUserPoolArn: arn:aws:cognito-idp:eu-west-1:xxxxx:userpool/eu-west-1_xxxxxx
            cognitoUserPoolDomain: auth.platform.domain.com
        region: eu-west-1
        roles:
        - eksctl-aiplatform-aws-nodegroup-ng-NodeInstanceRole-xxxxx

Now you can build the manifests and then deploy them:

    kfctl build -f kfctl_aws_cognito.v1.2.0.yaml -V
    kfctl apply -f kfctl_aws_cognito.v1.2.0.yaml -V

This should not take long. There should not be any errors, and when the deployment is ready you can validate it by checking that the kubeflow namespace exists.

At this point you will also have an ALB, it takes around 3 minutes to be ready. When ready, copy the DNS name of that load balancer and create 2 CNAME entries to it in Route53:

  • *.platform.domain.com
  • *.default.platform.domain.com
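If you prefer the AWS CLI over the console, the two CNAME entries can be created with a Route53 change batch. The following is a minimal sketch: the function names, the TTL of 300 seconds, and the ZONE_ID and ALB_DNS_NAME variables mentioned below are illustrative assumptions, not values from this guide.

```shell
# Print a Route53 change batch that upserts a single CNAME record.
# The TTL of 300 seconds is an arbitrary choice.
cname_change_batch() {
  name="$1"
  target="$2"
  cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$name",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "$target" }]
      }
    }
  ]
}
EOF
}

# Apply the change batch to a hosted zone.
# Arguments: hosted zone ID, record name, CNAME target.
upsert_cname() {
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$1" \
    --change-batch "$(cname_change_batch "$2" "$3")"
}
```

With ZONE_ID holding the ID of the platform.domain.com hosted zone and ALB_DNS_NAME the DNS name of the load balancer, upsert_cname "$ZONE_ID" "*.platform.domain.com" "$ALB_DNS_NAME" creates the first entry; the second call uses "*.default.platform.domain.com".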

Also remember to update the A record for platform.domain.com so that it points to the ALB DNS name.

Route53 platform A Record

Here’s the full snapshot of record sets in your hosted zone.

Route53 Record Sets


The central dashboard should now be available at https://kubeflow.platform.domain.com. The first visit redirects to Cognito for login.

Deploy knative

Download the knative manifests from https://github.com/kubeflow/manifests/tree/master/knative

Edit configmap config-domain in file knative-serving-install/base/config-map.yaml and use the following config-domain (replace example.com):

    apiVersion: v1
    data:
      platform.domain.com: ""
    kind: ConfigMap
    metadata:
      labels:
        serving.knative.dev/release: "v0.8.0"
      name: config-domain
      namespace: knative-serving

Build and apply knative:

    cd knative/knative-serving-crds/base
    kustomize build . | kubectl apply -f -
    cd -
    cd knative/knative-serving-install/base
    kustomize build . | kubectl apply -f -
    cd -

That will create a knative-serving namespace with all 6 pods running:

    NAME                                READY   STATUS    RESTARTS   AGE
    activator-7746448cf9-ggk98          2/2     Running   2          18d
    autoscaler-548ccfcc57-zsfpw         2/2     Running   2          18d
    autoscaler-hpa-669647f4f4-mx5q7     1/1     Running   0          18d
    controller-655b8c8fb8-g89x7         1/1     Running   0          18d
    networking-istio-75ff868647-k95mz   1/1     Running   0          18d
    webhook-5846486ff4-4ltjq            1/1     Running   0          18d

Deploy kfserving

Install KFserving using the manifest file:

    kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/install/v0.4.1/kfserving.yaml

That will create a kfserving-system namespace with one pod running.

Deploy models

Deploy a Tensorflow, a PyTorch and a Scikit-learn model using KFserving:

    kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/samples/v1alpha2/tensorflow/tensorflow.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/samples/v1alpha2/pytorch/pytorch.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/samples/v1alpha2/sklearn/sklearn.yaml

Validate that all three inference services are available:

    kubectl get inferenceservice

or alternatively through the knative cli:

    kn service list

    NAME                                   URL                                                                       LATEST                                       AGE     CONDITIONS   READY   REASON
    pytorch-cifar10-predictor-default      http://pytorch-cifar10-predictor-default.default.platform.domain.com      pytorch-cifar10-predictor-default-vfz8r      18d     3 OK / 3     True
    sklearn-iris-predictor-default         http://sklearn-iris-predictor-default.default.platform.domain.com         sklearn-iris-predictor-default-pbx2x         6d22h   3 OK / 3     True
    tensorflow-flowers-predictor-default   http://tensorflow-flowers-predictor-default.default.platform.domain.com   tensorflow-flowers-predictor-default-6zp4q   18d     3 OK / 3     True

That simple action loads a model from Google Cloud Storage and serves it through the same Istio ingress gateway. To test an inference request, post one of the example data points of a model to its endpoint, using the cookie from a browser that has visited the central dashboard:

    POST https://sklearn-iris-predictor-default.default.platform.domain.com/v1/models/sklearn-iris:predict HTTP/1.1
    Host: sklearn-iris-predictor-default.default.platform.domain.com
    Content-Type: application/json
    Cookie: AWSELBAuthSessionCookie-0=TBLc8+Mz0hSZp...

    {
      "instances": [
        [6.8, 2.8, 4.8, 1.4],
        [6.0, 3.4, 4.5, 1.6]
      ]
    }

That request runs the inference and returns the classes for the two data points:

    {"predictions": [1, 1]}
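The same request can be sent with curl. This is a sketch under assumptions: the function names are illustrative, the cookie value has to be copied from a logged-in browser session, and only the AWSELBAuthSessionCookie-0 cookie shown in the request above is sent.

```shell
# Build the prediction URL for a predictor host and model name.
predict_url() {
  printf 'https://%s/v1/models/%s:predict' "$1" "$2"
}

# POST a JSON file of instances to the model endpoint.
# Arguments: host, model name, session cookie value, payload file.
predict() {
  curl -s -X POST "$(predict_url "$1" "$2")" \
    -H "Content-Type: application/json" \
    -H "Cookie: AWSELBAuthSessionCookie-0=$3" \
    -d @"$4"
}
```

For the scikit-learn example, the host is sklearn-iris-predictor-default.default.platform.domain.com, the model name is sklearn-iris, and the payload file contains the JSON body with the instances array.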

Store models in S3 bucket

Copy the models to S3:

    gsutil -m cp -r gs://kfserving-samples/models/tensorflow/flowers s3://domain.com-models/flowers

Create a kubernetes secret to access the S3 bucket by creating a kfserving-s3-secret.yaml file:

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
      annotations:
        serving.kubeflow.org/s3-endpoint: s3.eu-west-1.amazonaws.com
        serving.kubeflow.org/s3-usehttps: "1"
        serving.kubeflow.org/s3-verifyssl: "1"
        serving.kubeflow.org/s3-region: eu-west-1
    type: Opaque
    data:
      # echo -ne "AKIAxxx" | base64
      awsAccessKeyID: QUtJQVhxxxVXVjQ=
      awsSecretAccessKey: QzR0UnxxxVNOd0NQQQ==
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: sa
    secrets:
      - name: mysecret
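The values under data must be base64-encoded without a trailing newline, as the comment in the secret hints. A small helper makes this less error-prone; the function name b64 is an illustrative assumption.

```shell
# Base64-encode a value without a trailing newline in the input and
# without line wrapping in the output.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}
```

For example, b64 "AKIAxxx" prints QUtJQXh4eA==. Using printf instead of a plain echo avoids accidentally encoding a newline into the credential.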

And change the inference service accordingly by creating a tensorflow.yaml file:

    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "tensorflow-flowers"
    spec:
      default:
        predictor:
          serviceAccountName: sa
          tensorflow:
            storageUri: "s3://domain.com-models/flowers"

Apply the changes:

    kubectl apply -f kfserving-s3-secret.yaml
    kubectl apply -f tensorflow.yaml

Summary and access

Overview of the installed components, endpoints and the tools used:

KFServing

Debug

Custom domain is not a valid subdomain

An A record is needed to resolve the root domain, so add this record to the hosted zone. If you missed this step, see the Route53 section.

Cognito Invalid Subdomain

Last modified 04.05.2021: refactor and refresh aws docs (#2688) (ef4cda60)