Troubleshooting Deployments on GKE

Help fixing problems on GKE and Google Cloud

Out of date

This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.

This guide helps diagnose and fix issues you may encounter with Kubeflow on Google Kubernetes Engine (GKE) and Google Cloud.

Before you start

This guide covers troubleshooting specifically for Kubeflow deployments on Google Cloud.

For more help, try the general Kubeflow troubleshooting guide.

This guide assumes the following settings:

  • The ${KF_DIR} environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example, /opt/gcp-blueprints/kubeflow/.

    1. export KF_DIR=<path to your Kubeflow application directory>
  • The ${CONFIG_FILE} environment variable contains the path to your Kubeflow configuration file.

    1. export CONFIG_FILE=${KF_DIR}/kfctl_gcp_iap.v1.0.2.yaml

    Or:

    1. export CONFIG_FILE=${KF_DIR}/kfctl_gcp_basic_auth.v1.0.2.yaml
  • The ${KF_NAME} environment variable contains the name of your Kubeflow deployment. You can find the name in your ${CONFIG_FILE} configuration file, as the value for the metadata.name key.

    1. export KF_NAME=<the name of your Kubeflow deployment>
  • The ${PROJECT} environment variable contains the ID of your Google Cloud project. You can find the project ID in your ${CONFIG_FILE} configuration file, as the value for the project key.

    1. export PROJECT=<your Google Cloud project ID>
  • The ${ZONE} environment variable contains the Google Cloud zone where your Kubeflow resources are deployed.

    1. export ZONE=<your Google Cloud zone>
  • For further background about the above settings, see the guide to deploying Kubeflow with the CLI.

Troubleshooting Kubeflow deployment on Google Cloud

Here are some tips for troubleshooting Google Cloud.

  • Make sure you are a Google Cloud project owner.
  • Make sure you are using HTTPS.
  • Check project quota page to see if any service’s current usage reached quota limit, increase them as needed.
  • Check deployment manager page and see if there’s a failed deployment.
  • Check if endpoint is up: do DNS lookup against your Cloud Identity-Aware Proxy (Cloud IAP) URL and see if it resolves to the correct IP address.
  • Check if certificate succeeded: kubectl describe certificates -n istio-system should give you certificate status.
  • Check ingress status: kubectl describe ingress -n istio-system
  • Check if endpoint entry is created. There should be one entry with name <deployment>.endpoints.<project>.cloud.goog
    • If endpoint entry doesn’t exist, check kubectl describe cloudendpoint -n istio-system
  • If using IAP: make sure you added https://<deployment>.endpoints.<project>.cloud.goog/_gcp_gatekeeper/authenticate as an authorized redirect URI for the OAUTH credentials used to create the deployment.
  • If using IAP: see the guide to monitoring your Cloud IAP setup.
  • See the sections below for troubleshooting specific problems.
  • Please report a bug if you can’t resolve the problem by following the above steps.

DNS name not registered

This section provides troubleshooting information for problems creating a DNS entry for your ingress. The ingress is a K8s resource that creates a Google Cloud loadbalancer to enable http(s) access to Kubeflow web services from outside the cluster. This section assumes you are using Cloud Endpoints and a DNS name of the following pattern

  1. https://${KF_NAME}.endpoints.${PROJECT}.cloud.goog

Symptoms:

  • When you access the URL in Chrome you get the error: server IP address could not be found

  • nslookup for the domain name doesn’t return the IP address associated with the ingress

    1. nslookup ${KF_NAME}.endpoints.${PROJECT}.cloud.goog
    2. Server: 127.0.0.1
    3. Address: 127.0.0.1#53
    4. ** server can't find ${KF_NAME}.endpoints.${PROJECT}.cloud.goog: NXDOMAIN

Troubleshooting

  1. Check the cloudendpoints resource

    1. kubectl get cloudendpoints -o yaml ${KF_NAME}
    2. kubectl describe cloudendpoints ${KF_NAME}
    • Check if there are errors indicating problems creating the endpoint
  2. The status of the cloudendpoints object will contain the cloud operation used to register the operation

    • For example

      1. status:
      2. config: ""
      3. configMapHash: ""
      4. configSubmit: operations/serviceConfigs.jlewi-1218-001.endpoints.cloud-ml-dev.cloud.goog:43fe6c6f-eb9c-41d0-ac85-b547fc3e6e38
      5. endpoint: jlewi-1218-001.endpoints.cloud-ml-dev.cloud.goog
      6. ingressIP: 35.227.243.83
      7. jwtAudiences: null
      8. lastAppliedSig: 4f3b903a06a683b380bf1aac1deca72792472429
      9. observedGeneration: 1
      10. stateCurrent: ENDPOINT_SUBMIT_PENDING
  • You can check the status of the operation by running:

    1. gcloud --project=${PROJECT} endpoints operations describe ${OPERATION}
    • Operation is everything after operations/ in the configSubmit field

404 Page Not Found When Accessing Central Dashboard

This section provides troubleshooting information for 404s, page not found, being return by the central dashboard which is served at

  1. https://${KUBEFLOW_FQDN}/
  • KUBEFLOW_FQDN is your project’s OAuth web app URI domain name <name>.endpoints.<project>.cloud.goog
  • Since we were able to sign in this indicates the Ambassador reverse proxy is up and healthy we can confirm this is the case by running the following command
  1. kubectl -n ${NAMESPACE} get pods -l service=envoy
  2. NAME READY STATUS RESTARTS AGE
  3. envoy-76774f8d5c-lx9bd 2/2 Running 2 4m
  4. envoy-76774f8d5c-ngjnr 2/2 Running 2 4m
  5. envoy-76774f8d5c-sg555 2/2 Running 2 4m
  • Try other services to see if they’re accessible for example

    1. https://${KUBEFLOW_FQDN}/whoami
    2. https://${KUBEFLOW_FQDN}/tfjobs/ui
    3. https://${KUBEFLOW_FQDN}/hub
  • If other services are accessible then we know its a problem specific to the central dashboard and not ingress

  • Check that the centraldashboard is running

    1. kubectl get pods -l app=centraldashboard
    2. NAME READY STATUS RESTARTS AGE
    3. centraldashboard-6665fc46cb-592br 1/1 Running 0 7h
  • Check a service for the central dashboard exists

    1. kubectl get service -o yaml centraldashboard
  • Check that an Ambassador route is properly defined

    1. kubectl get service centraldashboard -o jsonpath='{.metadata.annotations.getambassador\.io/config}'
    2. apiVersion: ambassador/v0
    3. kind: Mapping
    4. name: centralui-mapping
    5. prefix: /
    6. rewrite: /
    7. service: centraldashboard.kubeflow,
  • Check the logs of Ambassador for errors. See if there are errors like the following indicating an error parsing the route.If you are using the new Stackdriver Kubernetes monitoring you can use the following filter in the stackdriver console

    1. resource.type="k8s_container"
    2. resource.labels.location=${ZONE}
    3. resource.labels.cluster_name=${CLUSTER}
    4. metadata.userLabels.service="ambassador"
    5. "could not parse YAML"

502 Server Error

A 502 usually means traffic isn’t even making it to the envoy reverse proxy. And it usually indicates the loadbalancer doesn’t think any backends are healthy.

  • In Cloud Console select Network Services -> Load Balancing

    • Click on the load balancer (the name should contain the name of the ingress)

    • The exact name can be found by looking at the ingress.kubernetes.io/url-map annotation on your ingress object

      1. URLMAP=$(kubectl --namespace=${NAMESPACE} get ingress envoy-ingress -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/url-map}')
      2. echo ${URLMAP}
    • Click on your loadbalancer

    • This will show you the backend services associated with the load balancer

      • There is 1 backend service for each K8s service the ingress rule routes traffic too

      • The named port will correspond to the NodePort a service is using

        1. NODE_PORT=$(kubectl --namespace=${NAMESPACE} get svc envoy -o jsonpath='{.spec.ports[0].nodePort}')
        2. BACKEND_NAME=$(gcloud compute --project=${PROJECT} backend-services list --filter=name~k8s-be-${NODE_PORT}- --format='value(name)')
        3. gcloud compute --project=${PROJECT} backend-services get-health --global ${BACKEND_NAME}
    • Make sure the load balancer reports the backends as healthy

      • If the backends aren’t reported as healthy check that the pods associated with the K8s service are up and running

      • Check that health checks are properly configured

        • Click on the health check associated with the backend service for envoy
        • Check that the path is /healthz and corresponds to the path of the readiness probe on the envoy pods
        • See K8s docs for important information about how health checks are determined from readiness probes.
      • Check firewall rules to ensure traffic isn’t blocked from the Google Cloud loadbalancer

        • The firewall rule should be added automatically by the ingress but its possible it got deleted if you have some automatic firewall policy enforcement. You can recreate the firewall rule if needed with a rule like this

          1. gcloud compute firewall-rules create $NAME \
          2. --project $PROJECT \
          3. --allow tcp:$PORT \
          4. --target-tags $NODE_TAG \
          5. --source-ranges 130.211.0.0/22,35.191.0.0/16
        • To get the node tag

          1. # From the Kubernetes Engine cluster get the name of the managed instance group
          2. gcloud --project=$PROJECT container clusters --zone=$ZONE describe $CLUSTER
          3. # Get the template associated with the MIG
          4. gcloud --project=kubeflow-rl compute instance-groups managed describe --zone=${ZONE} ${MIG_NAME}
          5. # Get the instance tags from the template
          6. gcloud --project=kubeflow-rl compute instance-templates describe ${TEMPLATE_NAME}

          For more info see Google Cloud HTTP health check docs

    • In Stackdriver Logging look at the Cloud Http Load Balancer logs

      • Logs are labeled with the forwarding rule
      • The forwarding rules are available via the annotations on the ingress

        1. ingress.kubernetes.io/forwarding-rule
        2. ingress.kubernetes.io/https-forwarding-rule
    • Verify that requests are being properly routed within the cluster

    • Connect to one of the envoy proxies

      1. kubectl exec -ti `kubectl get pods --selector=service=envoy -o jsonpath='{.items[0].metadata.name}'` /bin/bash
    • Install curl in the pod

      1. apt-get update && apt-get install -y curl
    • Verify access to the whoami app

      1. curl -L -s -i http://envoy:8080/noiap/whoami
    • If this doesn’t return a 200 OK response; then there is a problem with the K8s resources

      • Check the pods are running
      • Check services are pointing at the points (look at the endpoints for the various services)

GKE Certificate Fails To Be Provisioned

A common symptom of your certificate failing to be provisioned is SSL errors like ERR_SSL_VERSION_OR_CIPHER_MISMATCH when you try to access the Kubeflow https endpoint.

To troubleshoot check the status of your GKE managed certificate

  1. kubectl -n istio-system describe managedcertificate

If the certificate is in status FailedNotVisible then it means Google Cloud failed to provision the certificate because it could not verify that you owned the domain by doing an ACME challenge. In order for Google Cloud to provision your certificate

  1. Your ingress must be created in order to associated a Google Cloud Load Balancer(GCLB) with the IP address for your endpoint
  2. There must be a DNS entry mapping your domain name to the IP.

If there is a problem preventing either of the above then Google Cloud will be unable to provision your certificate and eventually enter the permanent failure state FailedNotVisible indicating your endpoint isn’t accessible. The most common cause is the ingress can’t be created because the K8s secret containing OAuth credentials doesn’t exist.

To fix this you must first resolve the underlying problems preventing your ingress or DNS entry from being created. Once the underlying problem has been fixed you can follow the steps below to force a new certificate to be generated.

You can fix the certificate by performing the following steps to delete the existing certificate and create a new one.

  1. Get the name of the Google Cloud certificate

    1. kubectl -n istio-system describe managedcertificate gke-certificate
    • The status will contain Certificate Name which will start with mcrt make a note of this.
  2. Delete the ingress

    1. kubectl -n istio-system delete ingress envoy-ingress
  3. Ensure the certificate was deleted

    1. gcloud --project=${PROJECT} compute ssl-certificates list
    • Make sure the certificate obtained in the first step no longer exists
  4. Reapply kubeflow in order to recreate the ingress and certificate

    • If you deployed with kfctl rerun kfctl apply
    • If you deployed using the Google Cloud blueprint rerun make apply-kubeflow
  5. Monitor the certificate to make sure it can be provisioned

    1. kubectl --context=gcp-private-0527 -n istio-system describe managedcertificate gke-certificate
  6. Since the ingress has been recreated we need to restart the pods that configure it

    1. kubectl -n istio-system delete pods -l service=backend-updater
    2. kubectl -n istio-system delete pods -l service=iap-enabler

Problems with SSL certificate from Let’s Encrypt

As of Kubeflow 1.0, Kubeflow should be using GKE Managed Certificates and no longer using Let’s Encrypt.

See the guide to monitoring your Cloud IAP setup.

Envoy pods crash-looping: root cause is backend quota exceeded

If your logs show the Envoy pods crash-looping, the root cause may be that you have exceeded your quota for some backend services such as loadbalancers. This is particularly likely if you have multiple, differently named deployments in the same Google Cloud project using Cloud IAP.

The error

The error looks like this for the pod’s Envoy container:

  1. kubectl logs -n kubeflow envoy-79ff8d86b-z2snp envoy
  2. [2019-01-22 00:19:44.400][1][info][main] external/envoy/source/server/server.cc:184] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
  3. [2019-01-22 00:19:44.400][1][critical][main] external/envoy/source/server/server.cc:71] error initializing configuration '/etc/envoy/envoy-config.json': unable to read file: /etc/envoy/envoy-config.json

And the Cloud IAP container shows a message like this:

  1. Waiting for backend id PROJECT=<your-project> NAMESPACE=kubeflow SERVICE=envoy filter=name~k8s-be-30352-...

Diagnosing the cause

You can verify the cause of the problem by entering the following command:

  1. kubectl -n istio-system describe ingress

Look for something like this in the output:

  1. Events:
  2. Type Reason Age From Message
  3. ---- ------ ---- ---- -------
  4. Warning Sync 14m (x193 over 19h) loadbalancer-controller Error during sync: googleapi: Error 403: Quota 'BACKEND_SERVICES' exceeded. Limit: 5.0 globally., quotaExceeded

Fixing the problem

If you have any redundant Kubeflow deployments, you can delete them using the Deployment Manager.

Alternatively, you can request more backend services quota on the Google Cloud Console.

  1. Go to the quota settings for backend services on the Google Cloud Console.
  2. Click EDIT QUOTAS. A quota editing form opens on the right of the screen.
  3. Follow the form instructions to apply for more quota.

Legacy networks are not supported

Cloud Filestore and GKE try to use the network named default by default. For older projects, this will be a legacy network which is incompatible with Cloud Filestore and newer GKE features like private clusters. This will manifest as the error “default is invalid; legacy networks are not supported” when deploying Kubeflow.

Here’s an example error when deploying Cloud Filestore:

  1. ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1533189457517-5726d7cfd19c9-e1b0b0b5-58ca11b8]: errors:
  2. - code: RESOURCE_ERROR
  3. location: /deployments/jl-0801-b-gcfs/resources/filestore
  4. message: '{"ResourceType":"gcp-types/file-v1beta1:projects.locations.instances","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"network
  5. default is invalid; legacy networks are not supported.","status":"INVALID_ARGUMENT","statusMessage":"Bad
  6. Request","requestPath":"https://file.googleapis.com/v1beta1/projects/cloud-ml-dev/locations/us-central1-a/instances","httpMethod":"POST"}}'

To fix this we can create a new network:

  1. cd ${KF_DIR}
  2. cp .cache/master/deployment/gke/deployment_manager_configs/network.* \
  3. ./gcp_config/

Edit network.yaml to set the name for the network.

Edit gcfs.yaml to use the name of the newly created network.

Apply the changes.

  1. cd ${KF_DIR}
  2. kfctl apply -V -f ${CONFIG}

CPU platform unavailable in requested zone

By default, we set minCpuPlatform to Intel Haswell to make sure AVX2 is supported. See troubleshooting for more details.

If you encounter this CPU platform unavailable error (might manifest as Cluster is currently being created, deleted, updated or repaired and cannot be updated.), you can change the zone or change the minCpuPlatform. See here for available zones and cpu platforms.

Changing the OAuth client used by IAP

If you need to change the OAuth client used by IAP, you can run the following commands to replace the Kubernetes secret containing the ID and secret.

  1. kubectl -n kubeflow delete secret kubeflow-oauth
  2. kubectl -n kubeflow create secret generic kubeflow-oauth \
  3. --from-literal=client_id=${CLIENT_ID} \
  4. --from-literal=client_secret=${CLIENT_SECRET}

Troubleshooting SSL certificate errors

This section describes how to enable service management API to avoid managed certificates failure.

To check your certificate:

  1. Run the following command:

    1. kubectl -n istio-system describe managedcertificate gke-certificate

    Make sure the certificate status is either Active or Provisioning which means it is not ready. For more details on certificate status, refer to the certificate statuses descriptions section. Also, make sure the domain name is correct.

  2. Run the following command to look for the errors using the certificate name from the previous step:

    1. gcloud beta --project=${PROJECT} compute ssl-certificates describe --global ${CERTIFICATE_NAME}
  3. Run the following command:

    1. kubectl -n istio-system get ingress envoy-ingress -o yaml

    Make sure of the following:

    • networking.gke.io/managed-certificates annotation value points to the name of the Kubernetes managed certificate resource and is gke-certificate;

    • public IP address that is displayed in the status is assigned. See the example of IP address below:

      1. status:
      2. loadBalancer:
      3. ingress:
      4. - ip: 35.186.212.202
    • DNS entry for the domain has propagated. To verify this, use the following nslookup command example:

      1. `nslookup ${DOMAIN}`
    • domain name is the fully qualified domain name which be the host value in the ingress. See the example below:

      1. ${KF_APP_NAME}.endpoints.${PROJECT}.cloud.goog

    Note that managed certificates cannot provision the certificate if the DNS lookup does not work properly.

Last modified 17.05.2021: (gke) Full Kubeflow Deployment guide on GKE (revision for Kubeflow 1.3) (#2686) (f50693a3)