Troubleshooting Deployments on Amazon EKS

Help diagnose and fix issues you may encounter in your Kubeflow deployment

ALB cannot be created

  kubectl get ingress -n istio-system
  NAME            HOSTS   ADDRESS   PORTS   AGE
  istio-ingress   *                 80      3min

If the ADDRESS of your istio-ingress is still empty after a few minutes, something is likely wrong with your ALB ingress controller.

  E1024 09:02:59.934318 1 :0] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/<cluster name> tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []" "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}

If you see this error, you probably did not use your cluster name as the folder name during setup. Check your cluster settings with kubectl get configmaps aws-alb-ingress-controller-config -n kubeflow -o yaml and correct the value. Another possible cause is that your subnets are not tagged, so Kubernetes does not know which subnets to use for external load balancers. To fix this, ensure the public subnets are tagged with the Key: kubernetes.io/role/elb and Value: 1. See the Amazon EKS documentation for further details.
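For the subnet-tagging case, a minimal sketch of tagging two public subnets with the aws CLI. The subnet IDs and cluster name below are placeholders, not values from your account:

```shell
# Placeholder subnet IDs and cluster name -- substitute your own values.
CLUSTER_NAME=test-cluster
for subnet in subnet-0aaa1111 subnet-0bbb2222; do
  aws ec2 create-tags --resources "$subnet" \
    --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=shared" \
           "Key=kubernetes.io/role/elb,Value=1"
done
```

After tagging, delete and recreate the Ingress (or wait for the controller to reconcile) so the ALB is provisioned against the newly discovered subnets.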

Kubeflow Uninstallation Failure

  Error: couldn't delete KfApp: (kubeflow.error): Code 500 with message: kfApp Delete failed for kustomize: (kubeflow.error): Code 500 with message: error deleting kustomize manifests: [error evaluating kustomization manifest for knative: Timed out waiting for resource /knative-serving to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for cert-manager: Timed out waiting for resource /cert-manager to be deleted. Error deleted resource is not cleaned up yet]

  Usage:
    kfctl delete [flags]

  Flags:
        --delete_storage   Set if you want to delete app's storage cluster used for mlpipeline.
    -f, --file string      The local config file of KfDef.
        --force-deletion   force-deletion output default is false
    -h, --help             help for delete
    -V, --verbose          verbose output default is false

  kfctl exited with error: couldn't delete KfApp: (kubeflow.error): Code 500 with message: kfApp Delete failed for kustomize: (kubeflow.error): Code 500 with message: error deleting kustomize manifests: [error evaluating kustomization manifest for knative: Timed out waiting for resource /knative-serving to be deleted. Error deleted resource is not cleaned up yet, error evaluating kustomization manifest for cert-manager: Timed out waiting for resource /cert-manager to be delete

Due to a kustomize issue, deletion can take up to 10 minutes and may show the errors above. Don't worry about them: all resources have been deleted. You can build kfctl with the upstream fix to mitigate this issue.
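If you want to confirm that cleanup actually finished, you can check for the namespaces directly; NotFound responses mean the resources are gone:

```shell
# Each namespace should return a NotFound error once cleanup completes.
kubectl get namespace knative-serving cert-manager kubeflow
```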

EKS Cluster Creation Failure

Several problems can lead to cluster creation failure. If you see errors when creating your cluster with eksctl, open the CloudFormation console and check your stacks. To recover from a failure, follow the guidance in the eksctl output logs. Once you understand the root cause, delete your cluster and rerun kfctl apply -V -f ${CONFIG_FILE}.

Common issues:

  1. The default VPC limit is 5 VPCs per region
  2. Invalid command arguments
  + eksctl create cluster --config-file=/tmp/cluster_config.yaml
  [ℹ] using region us-west-2
  [ℹ] subnets for us-west-2b - public:192.168.0.0/19 private:192.168.96.0/19
  [ℹ] subnets for us-west-2c - public:192.168.32.0/19 private:192.168.128.0/19
  [ℹ] subnets for us-west-2d - public:192.168.64.0/19 private:192.168.160.0/19
  [ℹ] nodegroup "general" will use "ami-0280ac619ed294a8a" [AmazonLinux2/1.12]
  [ℹ] importing SSH public key "/Users/ubuntu/.ssh/id_rsa.pub" as "eksctl-test-cluster-nodegroup-general-11:2a:f6:ba:b0:98:da:b4:24:db:18:3d:e3:3f:f5:fb"
  [ℹ] creating EKS cluster "test-cluster" in "us-west-2" region
  [ℹ] will create a CloudFormation stack for cluster itself and 1 nodegroup stack(s)
  [ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --name=test-cluster'
  [ℹ] building cluster stack "eksctl-test-cluster-cluster"
  [✖] unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-test-cluster-cluster"
  [ℹ] fetching stack events in attempt to troubleshoot the root cause of the failure
  [ℹ] AWS::CloudFormation::Stack/eksctl-test-cluster-cluster: ROLLBACK_IN_PROGRESS "The following resource(s) failed to create: [InternetGateway, ServiceRole, NATIP, VPC]. . Rollback requested by user."
  [✖] AWS::EC2::EIP/NATIP: CREATE_FAILED "Resource creation cancelled"
  [✖] AWS::IAM::Role/ServiceRole: CREATE_FAILED "Resource creation cancelled"
  [ℹ] AWS::EC2::EIP/NATIP: CREATE_IN_PROGRESS "Resource creation Initiated"
  [✖] AWS::EC2::VPC/VPC: CREATE_FAILED "The maximum number of VPCs has been reached. (Service: AmazonEC2; Status Code: 400; Error Code: VpcLimitExceeded; Request ID: xxxxxxxxxx)"
  [ℹ] AWS::IAM::Role/ServiceRole: CREATE_IN_PROGRESS "Resource creation Initiated"
  [ℹ] AWS::EC2::EIP/NATIP: CREATE_IN_PROGRESS
  [✖] AWS::EC2::InternetGateway/InternetGateway: CREATE_FAILED "The maximum number of internet gateways has been reached. (Service: AmazonEC2; Status Code: 400; Error Code: InternetGatewayLimitExceeded; Request ID: 7b3c9620-d1fa-4893-9e91-fb94eb3f2ef3)"
  [ℹ] AWS::EC2::VPC/VPC: CREATE_IN_PROGRESS
  [ℹ] AWS::IAM::Role/ServiceRole: CREATE_IN_PROGRESS
  [ℹ] AWS::EC2::InternetGateway/InternetGateway: CREATE_IN_PROGRESS
  [ℹ] AWS::CloudFormation::Stack/eksctl-test-cluster-cluster: CREATE_IN_PROGRESS "User Initiated"
  [ℹ] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
  [ℹ] to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=test-cluster'
  [✖] waiting for CloudFormation stack "eksctl-test-cluster-cluster" to reach "CREATE_COMPLETE" status: ResourceNotReady: failed waiting for successful resource state
  [✖] failed to create cluster "test-cluster"
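For the VpcLimitExceeded case in the log above, you can check how many VPCs the region already has before retrying (assuming the aws CLI is configured):

```shell
# Count VPCs in the region. The default limit is 5 per region;
# if you are at the limit, delete unused VPCs or request a quota increase.
aws ec2 describe-vpcs --region us-west-2 --query 'length(Vpcs)' --output text
```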

Resource Not Found in delete all

  + kubectl get ns/kubeflow
  Error from server (NotFound): namespaces "kubeflow" not found
  + kubectl get ns/kubeflow
  Error from server (NotFound): namespaces "kubeflow" not found
  + echo 'namespace kubeflow successfully deleted.'

You can ignore any Kubernetes “resource not found” errors that occur during the deletion phase.
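If you script the cleanup yourself, kubectl can suppress these errors entirely; a minimal sketch:

```shell
# --ignore-not-found makes the delete idempotent:
# nothing is reported as an error if the namespace is already gone.
kubectl delete namespace kubeflow --ignore-not-found=true
```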

InvalidParameterException in UpdateClusterConfig

  + logging_components='"api","audit","authenticator","controllerManager","scheduler"'
  ++ aws eks update-cluster-config --name benchmark-0402222-sunday-satur --region us-west-2 --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

  An error occurred (InvalidParameterException) when calling the UpdateClusterConfig operation: No changes needed for the logging config provided

The Amazon EKS UpdateClusterConfig API operation fails if you pass invalid parameters. For example, if logging is already enabled on your EKS cluster and you choose to create Kubeflow on that existing cluster with logging enabled as well, you will get this error.
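One way to avoid this is to inspect the current logging configuration first and skip the update when the desired log types are already enabled; a sketch, assuming your cluster name is in ${CLUSTER_NAME}:

```shell
# Shows which control-plane log types are currently enabled or disabled.
# Only call update-cluster-config if the output differs from what you want.
aws eks describe-cluster --name ${CLUSTER_NAME} --region us-west-2 \
  --query 'cluster.logging.clusterLogging'
```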

Amazon FSx Mount Failure

  Mounting command: mount
  Mounting arguments: -t lustre fs-0xxxxx2a216cf.us-west-2.amazonaws.com@tcp:/fsx /var/lib/kubelet/pods/224c2c96-5a91-11e9-b7e6-0a2a42c99f84/volumes/kubernetes.io~csi/fsx-static/mount
  Output: mount.lustre: Can't parse NID 'fs-0xxxxx2a216cf.us-west-2.amazonaws.com@tcp:/fsx'

  This mount helper should only be invoked via the mount (8) command,
  e.g. mount -t lustre dev dir

  usage: mount.lustre [-fhnvV] [-o <mntopt>] <device> <mountpt>
         <device>: the disk device, or for a client:
                   <mgsnid>[:<altmgsnid>...]:/<filesystem>[/<subdir>]
         <filesystem>: name of the Lustre filesystem (e.g. lustre1)
         <mountpt>: filesystem mountpoint (e.g. /mnt/lustre)
         -f|--fake: fake mount (updates /etc/mtab)

The Amazon FSx dnsName is incorrect. Delete the pod that uses this persistent volume claim, then delete the PV and PVC. Finally, correct your input and reapply the PV and PVC.

  kubectl delete pod ${pod_using_pvc}
  ks delete default -c ${COMPONENT}
  ks param set ${COMPONENT} dnsName fs-0xxxxx2a216cf.fsx.us-west-2.amazonaws.com
  ks apply default -c ${COMPONENT}
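A valid FSx for Lustre DNS name includes the .fsx. component, as in the corrected value above. Before reapplying, you can verify from inside the VPC that the name resolves:

```shell
# Should resolve to private IPs inside your VPC; run from a host or pod with VPC access.
nslookup fs-0xxxxx2a216cf.fsx.us-west-2.amazonaws.com
```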

Amazon RDS Connectivity Issues

If you run into CloudFormation deployment errors, you can use the AWS CloudFormation troubleshooting guide to find a resolution.

If you have connectivity issues with Amazon RDS, try launching a mysql-client container and connecting to your RDS endpoint. This tells you whether you have network connectivity to the database and whether the database was created properly.

  # Remember to change your RDS endpoint, DB username, and DB password
  $ kubectl run -it --rm --image=mysql:5.7 --restart=Never mysql-client -- mysql -h <YOUR RDS ENDPOINT> -u admin -pKubefl0w
  If you don't see a command prompt, try pressing enter.

  mysql> show databases;
  +--------------------+
  | Database           |
  +--------------------+
  | information_schema |
  | kubeflow           |
  | mlpipeline         |
  | mysql              |
  | performance_schema |
  +--------------------+
  5 rows in set (0.00 sec)

  mysql> use mlpipeline; show tables;
  Reading table information for completion of table and column names
  You can turn off this feature to get a quicker startup with -A

  Database changed
  +----------------------+
  | Tables_in_mlpipeline |
  +----------------------+
  | db_statuses          |
  | default_experiments  |
  | experiments          |
  | jobs                 |
  | pipeline_versions    |
  | pipelines            |
  | resource_references  |
  | run_details          |
  | run_metrics          |
  +----------------------+
  9 rows in set (0.00 sec)

Incompatible eksctl version

If you see this error when you apply the platform, your eksctl CLI version is not compatible with the eksctl.io API version in cluster_config.yaml. Please upgrade eksctl and try again; v1alpha5 was introduced in eksctl 0.1.31.

We are working with the eksctl team to ensure that new feature releases remain backward compatible for at least one version.

  loading config file "${KF_DIR}/aws_config/cluster_config.yaml": no kind "ClusterConfig" is registered for version "eksctl.io/v1alpha5" in scheme "k8s.io/client-go/kubernetes/scheme/register.go:60"
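You can check the installed version with:

```shell
# v1alpha5 configs require eksctl 0.1.31 or later.
eksctl version
```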

Last modified 04.08.2020: Remove outdate banner for AWS docs (#2080) (efc5b0cf)