Hyperparameter Tuning (Katib)

Using Katib to tune your model’s hyperparameters on Kubernetes

The Katib project is inspired by Google Vizier. Katib is a scalable and flexible hyperparameter tuning framework that is tightly integrated with Kubernetes. It does not depend on any specific deep learning framework (such as TensorFlow, MXNet, or PyTorch).

Installing Katib

To run Katib jobs, you must install the required packages as shown in this section. You can do so by following the Kubeflow deployment guide, or by installing Katib directly from its repository:

  git clone https://github.com/kubeflow/katib
  ./katib/scripts/v1alpha2/deploy.sh
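
To verify the installation, you can check that the Katib pods are running. This assumes the deploy script installs Katib into the kubeflow namespace, which is the namespace used in the examples below:

  kubectl -n kubeflow get pods | grep katib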

Persistent Volumes

If you want to use Katib outside Google Kubernetes Engine (GKE) and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create a persistent volume (PV) to bind your persistent volume claim (PVC).

This is the YAML file for a PV:

  apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: katib-mysql
    labels:
      type: local
      app: katib
  spec:
    capacity:
      storage: 10Gi
    accessModes:
      - ReadWriteOnce
    hostPath:
      path: /data/katib

After deploying the Katib package, run the following command to create the PV:

  kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml

Running examples

After deploying everything, you can run some examples.

Example using random algorithm

You can create an Experiment for Katib by defining an Experiment config file. See the random algorithm example.

  kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/random-example.yaml

Running this command launches an Experiment. It runs a series of training jobs to train models using different hyperparameters and saves the results.

The configurations for the experiment (hyperparameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on) are defined in random-example.yaml.
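
For orientation, the sketch below shows the general shape of such a config. The field names and values are taken from the kubectl describe output shown later in this section, but the YAML is abridged; treat random-example.yaml in the repository as the authoritative version:

  apiVersion: "kubeflow.org/v1alpha2"
  kind: Experiment
  metadata:
    namespace: kubeflow
    name: random-example
  spec:
    objective:
      type: maximize
      goal: 0.99
      objectiveMetricName: Validation-accuracy
    algorithm:
      algorithmName: random
    parallelTrialCount: 10
    maxTrialCount: 100
    maxFailedTrialCount: 3
    parameters:
      - name: --lr
        parameterType: double
        feasibleSpace:
          min: "0.01"
          max: "0.03"
      - name: --num-layers
        parameterType: int
        feasibleSpace:
          min: "2"
          max: "5"
      - name: --optimizer
        parameterType: categorical
        feasibleSpace:
          list:
            - sgd
            - adam
            - ftrl
    # trialTemplate omitted here; see the template sketch below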

In this demo, hyperparameters are embedded as args. You can embed hyperparameters in another way (for example, as environment variables) by using the template defined in TrialTemplate.GoTemplate.RawTemplate. It is written in Go template format.
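
As a sketch of that mechanism (not the exact template shipped with the example), the snippet below shows how a rawTemplate written in Go template format can render each suggested hyperparameter into a container argument. The image name and script path are placeholders; the template variables (.Trial, .NameSpace, .HyperParameters) follow the upstream v1alpha2 examples:

  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: <your-training-image>   # placeholder image
                command:
                - "python"
                - "/opt/train.py"              # placeholder training script
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"       # renders e.g. "--lr=0.02"
                {{- end}}
                {{- end}}
              restartPolicy: Never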

This demo randomly generates 3 hyperparameters:

  • Learning Rate (--lr) - type: double
  • Number of NN Layers (--num-layers) - type: int
  • Optimizer (--optimizer) - type: categorical

Check the experiment status:

  $ kubectl -n kubeflow describe experiment random-example
  Name:         random-example
  Namespace:    kubeflow
  Labels:       controller-tools.k8s.io=1.0
  Annotations:  <none>
  API Version:  kubeflow.org/v1alpha2
  Kind:         Experiment
  Metadata:
    Creation Timestamp:  2019-01-18T16:30:46Z
    Finalizers:
      clean-data-in-db
    Generation:        5
    Resource Version:  1777650
    Self Link:         /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/experiments/random-example
    UID:               687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
  Spec:
    Algorithm:
      Algorithm Name:        random
      Algorithm Settings:
    Max Failed Trial Count:  3
    Max Trial Count:         100
    Objective:
      Additional Metric Names:
        accuracy
      Goal:                   0.99
      Objective Metric Name:  Validation-accuracy
      Type:                   maximize
    Parallel Trial Count:     10
    Parameters:
      Feasible Space:
        Max:           0.03
        Min:           0.01
      Name:            --lr
      Parameter Type:  double
      Feasible Space:
        Max:           5
        Min:           2
      Name:            --num-layers
      Parameter Type:  int
      Feasible Space:
        List:
          sgd
          adam
          ftrl
      Name:            --optimizer
      Parameter Type:  categorical
    Trial Template:
      Go Template:
        Template Spec:
          Config Map Name:       trial-template
          Config Map Namespace:  kubeflow
          Template Path:         mnist-trial-template
  Status:
    Completion Time:  2019-06-20T00:12:07Z
    Conditions:
      Last Transition Time:  2019-06-19T23:20:56Z
      Last Update Time:      2019-06-19T23:20:56Z
      Message:               Experiment is created
      Reason:                ExperimentCreated
      Status:                True
      Type:                  Created
      Last Transition Time:  2019-06-20T00:12:07Z
      Last Update Time:      2019-06-20T00:12:07Z
      Message:               Experiment is running
      Reason:                ExperimentRunning
      Status:                False
      Type:                  Running
      Last Transition Time:  2019-06-20T00:12:07Z
      Last Update Time:      2019-06-20T00:12:07Z
      Message:               Experiment has succeeded because max trial count has reached
      Reason:                ExperimentSucceeded
      Status:                True
      Type:                  Succeeded
    Current Optimal Trial:
      Observation:
        Metrics:
          Name:   Validation-accuracy
          Value:  0.982483983039856
      Parameter Assignments:
        Name:   --lr
        Value:  0.026666666666666665
        Name:   --num-layers
        Value:  2
        Name:   --optimizer
        Value:  sgd
    Start Time:         2019-06-19T23:20:55Z
    Trials:             100
    Trials Succeeded:   100
  Events:  <none>

The demo starts an experiment and runs a series of training jobs with different parameters. When the condition of type Succeeded in the experiment's status becomes True, the experiment is finished.
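
If you prefer a quick check over reading the full output of kubectl describe, a JSONPath query like the one below (assuming your kubectl version supports filter expressions) prints True once the Succeeded condition is set:

  kubectl -n kubeflow get experiment random-example -o=jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'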

TensorFlow operator example

To run the TensorFlow operator example, you must first set up a volume.

If you are using GKE with the default StorageClass, you must create this PVC:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: tfevent-volume
    namespace: kubeflow
    labels:
      type: local
      app: tfjob
  spec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
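
Assuming you save the manifest above to a local file such as tfevent-pvc.yaml (the filename is only illustrative), you can create the claim with:

  kubectl apply -f tfevent-pvc.yaml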

If you are not using GKE and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create a PVC and a PV:

  kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml
  kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml

Now you can run the TensorFlow operator example:

  kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfjob-example.yaml

You can check the status of the experiment:

  kubectl -n kubeflow describe experiment tfjob-example

PyTorch example

This is an example for the PyTorch operator:

  kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/pytorchjob-example.yaml

You can check the status of the experiment:

  kubectl -n kubeflow describe experiment pytorchjob-example

Monitoring results

You can monitor your results in the Katib UI. If you installed Kubeflow using the deployment guide, you can access the Katib UI at

  https://<your kubeflow endpoint>/katib/

For example, if you deployed Kubeflow on GKE, your endpoint would be

  https://<deployment_name>.endpoints.<project>.cloud.goog/

Otherwise, you can set port-forwarding for the Katib UI service:

  kubectl port-forward svc/katib-ui -n kubeflow 8080:80

Now you can access the Katib UI at this URL: http://localhost:8080/katib/.

Cleanup

Delete the installed components:

  ./katib/scripts/v1alpha2/undeploy.sh

If you created a PV for Katib, delete it:

  kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml

If you created a PV and PVC for the TensorFlow operator, delete them:

  kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml
  kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml

Metrics collector

Katib has a metrics collector that gathers metrics from each trial. Katib collects the metrics from the stdout of each trial. Metrics must be printed in the following format: {metric name}={value}. For example, if your objective value name is loss and the additional metrics are recall and precision, your training container should print output like this:

  epoch 1:
  loss=0.3
  recall=0.5
  precision=0.4
  epoch 2:
  loss=0.2
  recall=0.55
  precision=0.5

Katib periodically launches CronJobs to collect metrics from pods.
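
If you want to see those collectors, you can list the CronJobs in the namespace where your trials run; in the examples above that is the kubeflow namespace:

  kubectl -n kubeflow get cronjobs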