Logging and monitoring

Logging and monitoring for Kubeflow

This guide describes how to set up logging and monitoring for your Kubeflow deployment.

Logging

Stackdriver on GKE

The default on GKE is to send logs to Stackdriver Logging.

Stackdriver recently introduced new features for Kubernetes Monitoring that are currently in Beta. These features are only available on Kubernetes v1.10 or later and must be explicitly installed. Below are instructions for both versions of Stackdriver Kubernetes support.

Default Stackdriver

This section contains instructions for using the existing Stackdriver support for GKE, which is the default.

To get the logs for a particular pod, you can use the following advanced filter in Stackdriver Logging’s search UI.

  resource.type="container"
  resource.labels.cluster_name="${CLUSTER}"
  resource.labels.pod_id="${POD_NAME}"

where ${POD_NAME} is the name of the pod and ${CLUSTER} is the name of your cluster.

The equivalent gcloud command would be

  gcloud --project=${PROJECT} logging read \
  --freshness=24h \
  --order asc \
  "resource.type=\"container\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_id=\"${POD_NAME}\" "

Kubernetes events for the TFJob are also available in Stackdriver and can be obtained using the following query in the UI

  resource.labels.cluster_name="${CLUSTER}"
  logName="projects/${PROJECT}/logs/events"
  jsonPayload.involvedObject.name="${TFJOB}"

The equivalent gcloud command is

  gcloud --project=${PROJECT} logging read \
  --freshness=24h \
  --order asc \
  "resource.labels.cluster_name=\"${CLUSTER}\" jsonPayload.involvedObject.name=\"${TFJOB}\" logName=\"projects/${PROJECT}/logs/events\" "

Stackdriver Kubernetes

This section contains the relevant Stackdriver queries and gcloud commands if you are using the new Stackdriver Kubernetes Monitoring.

To get the stdout/stderr logs for a particular container, you can use the following advanced filter in Stackdriver Logging’s search UI.

  resource.type="k8s_container"
  resource.labels.cluster_name="${CLUSTER}"
  resource.labels.pod_name="${POD_NAME}"

where ${POD_NAME} is the name of the pod and ${CLUSTER} is the name of your cluster.

The equivalent gcloud command would be

  gcloud --project=${PROJECT} logging read \
  --freshness=24h \
  --order asc \
  "resource.type=\"k8s_container\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_name=\"${POD_NAME}\" "

Events about individual pods can be obtained with the following query

  resource.type="k8s_pod"
  resource.labels.cluster_name="${CLUSTER}"
  resource.labels.pod_name="${POD_NAME}"

or via gcloud

  gcloud --project=${PROJECT} logging read \
  --freshness=24h \
  --order asc \
  "resource.type=\"k8s_pod\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_name=\"${POD_NAME}\" "

Filter with labels

The new agents also support querying for logs using pod labels. For example:

  resource.type="k8s_container"
  resource.labels.cluster_name="${CLUSTER}"
  metadata.userLabels.${LABEL_KEY}="${LABEL_VALUE}"
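The document gives gcloud equivalents for the other filters; a similar command for the label-based query might look like this (a sketch, assuming the same ${PROJECT}, ${CLUSTER}, ${LABEL_KEY}, and ${LABEL_VALUE} variables used above):

```shell
# Hypothetical gcloud equivalent of the label-based filter above;
# substitute your own project, cluster, and label key/value.
gcloud --project=${PROJECT} logging read \
  --freshness=24h \
  --order asc \
  "resource.type=\"k8s_container\" resource.labels.cluster_name=\"${CLUSTER}\" metadata.userLabels.${LABEL_KEY}=\"${LABEL_VALUE}\" "
```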

Monitoring

Stackdriver on GKE

The new Stackdriver Kubernetes Monitoring provides single-dashboard observability and is compatible with the Prometheus data model.

See this doc for more details on the dashboard.

Stackdriver by default provides container-level CPU/memory metrics. We can also define custom Prometheus metrics and view them on the Stackdriver dashboard.

Prometheus

Kubeflow Prometheus component

Kubeflow provides a Prometheus component. To deploy the Prometheus component:

  ks generate prometheus prom --projectId=YOUR_PROJECT --clusterName=YOUR_CLUSTER --zone=ZONE
  ks apply YOUR_ENV -c prom

The Prometheus server will scrape the services that have the annotation prometheus.io/scrape=true.
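As an illustration, a Service can opt into scraping via that annotation. The names and port below are placeholders, not taken from the Kubeflow manifests:

```yaml
# Hypothetical Service annotated for Prometheus scraping.
apiVersion: v1
kind: Service
metadata:
  name: my-model-service        # placeholder name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"  # port serving /metrics; assumed convention
spec:
  selector:
    app: my-model               # placeholder selector
  ports:
  - port: 8080
```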

Export metrics to Stackdriver

The Prometheus server will export metrics to Stackdriver, as configured. We are using an image provided by Stackdriver. See the Stackdriver doc for more detail, but you don’t need to change anything here.

If you don’t want to export metrics to Stackdriver, remove the remote_write section in prometheus.yml and use a native Prometheus image.
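For reference, a remote_write stanza in prometheus.yml looks roughly like this; the exact URL in the shipped configuration may differ, so treat this as a sketch of what to look for and delete:

```yaml
# In prometheus.yml: remove this block to stop exporting metrics.
remote_write:
- url: "http://localhost:9091/write"   # export endpoint; placeholder value
```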

Metric collector component for IAP (GKE only)

Kubeflow also provides a metric-collector component. This component periodically pings your Kubeflow endpoint and provides a metric indicating whether the endpoint is up. To deploy it:

  ks generate metric-collector mc --targetUrl=YOUR_KF_ENDPOINT
  ks apply YOUR_ENV -c mc