Getting started with Katib

How to set up Katib and run some hyperparameter tuning examples

This page gets you started with Katib. Follow this guide to perform any additional setup you may need, depending on your environment, and to run a few examples using the command line and the Katib user interface (UI).

For an overview of the concepts around Katib and hyperparameter tuning, read the introduction to Katib.

Katib setup

This section describes some configurations that you may need to add to your Kubernetes cluster, depending on the way you’re using Kubeflow and Katib.

Installing Katib

You can skip this step if you have already installed Kubeflow. Your Kubeflow deployment includes Katib.

To install Katib as part of Kubeflow, follow the Kubeflow installation guide.

If you want to install Katib separately from Kubeflow, or to get a later version of Katib, run the following commands to install Katib directly from its repository on GitHub and deploy Katib to your cluster:

  git clone https://github.com/kubeflow/katib
  bash ./katib/scripts/v1alpha3/deploy.sh
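Once the deploy script finishes, it can help to confirm that the pods have started before moving on. The following is a minimal sketch rather than part of the Katib scripts: it assumes Katib was deployed into the kubeflow namespace (the namespace used elsewhere on this page), and the helper names count_not_ready and katib_pods_pending are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count pods in the Katib namespace that are not yet up.
# Assumption: Katib runs in the "kubeflow" namespace.

# Reads `kubectl get pods --no-headers` output on stdin and prints how many
# pods have a STATUS (third column) other than Running or Completed.
count_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Prints the number of pods in the namespace that are still starting or failing.
katib_pods_pending() {
  local namespace="${1:-kubeflow}"
  kubectl -n "${namespace}" get pods --no-headers | count_not_ready
}

# Usage (requires access to the cluster):
#   katib_pods_pending kubeflow
```

A result of 0 suggests every pod in the namespace is running. If Katib is installed as part of Kubeflow, the count also covers the other Kubeflow pods in that namespace.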

Setting up persistent volumes

You can skip this step if you’re using Kubeflow on Google Kubernetes Engine (GKE) or if your Kubernetes cluster includes a StorageClass for dynamic volume provisioning. For more information, see the Kubernetes documentation on dynamic provisioning and persistent volumes.

If you’re using Katib outside GKE and your cluster doesn’t include a StorageClass for dynamic volume provisioning, you must create a persistent volume (PV) to bind to the persistent volume claim (PVC) required by Katib.

After deploying Katib to your cluster, run the following command to create the PV:

  kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha3/pv/pv.yaml

The above kubectl apply command uses a YAML file (pv.yaml) that defines the properties of the PV.
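To confirm that the PV was picked up, you can check that no PVC in the namespace is left unbound. This is a minimal sketch, assuming the kubeflow namespace. The function name unbound_pvcs is hypothetical, and the jsonpath expression relies on the standard status.phase field of a PersistentVolumeClaim.

```shell
#!/usr/bin/env bash
# Sketch: list PVCs that have not been bound to a volume.
# Assumption: Katib's PVC lives in the "kubeflow" namespace.

# Prints the name of every PVC in the namespace whose phase is not "Bound".
# Empty output means all claims are bound.
unbound_pvcs() {
  local namespace="${1:-kubeflow}"
  kubectl -n "${namespace}" get pvc \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    awk 'NF >= 1 && $2 != "Bound" { print $1 }'
}

# Usage (requires access to the cluster):
#   unbound_pvcs kubeflow
```

If the function prints a claim name, the claim is most likely still Pending, which usually means no matching PV or StorageClass is available yet.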

Accessing the Katib UI

You can use the Katib user interface (UI) to submit experiments and to monitor your results. The Katib home page within Kubeflow looks like this:

The Katib home page within the Kubeflow UI

If you installed Katib as part of Kubeflow, you can access the Katib UI from the Kubeflow UI:

  • Open the Kubeflow UI. See the guide to accessing the Kubeflow UI.
  • Click Katib in the left-hand menu.

Alternatively, you can set port-forwarding for the Katib UI service:

  kubectl port-forward svc/katib-ui -n kubeflow 8080:80

Then you can access the Katib UI at this URL:

  http://localhost:8080/katib/

Examples

This section introduces some examples that you can run to try Katib.

Example using random algorithm

You can create an experiment for Katib by defining the experiment in a YAML configuration file. The YAML file defines the configurations for the experiment, including the hyperparameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on.

This example uses the YAML file for the random algorithm example.

The random algorithm example uses an MXNet neural network to train an image classification model using the MNIST dataset. The experiment runs up to 12 training jobs (trials), three at a time, each with different hyperparameter values, and saves the results.

Run the following command to launch an experiment using the random algorithm example:

  kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml

This example embeds the hyperparameters as arguments. You can embed hyperparameters in another way (for example, using environment variables) by using the template defined in the TrialTemplate.GoTemplate.RawTemplate section of the YAML file. The template uses the Go template format.

This example randomly generates the following hyperparameters:

  • --lr: Learning rate. Type: double.
  • --num-layers: Number of layers in the neural network. Type: integer.
  • --optimizer: Optimizer. Type: categorical.

Check the experiment status:

  kubectl -n kubeflow describe experiment random-example

The output of the above command should look similar to this:

  Name: random-example
  Namespace: kubeflow
  Labels: controller-tools.k8s.io=1.0
  Annotations: <none>
  API Version: kubeflow.org/v1alpha3
  Kind: Experiment
  Metadata:
    Creation Timestamp: 2019-12-22T22:53:25Z
    Finalizers:
      update-prometheus-metrics
    Generation: 2
    Resource Version: 720692
    Self Link: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/random-example
    UID: dc6bc15a-250d-11ea-8cae-42010a80010f
  Spec:
    Algorithm:
      Algorithm Name: random
      Algorithm Settings: <nil>
    Max Failed Trial Count: 3
    Max Trial Count: 12
    Metrics Collector Spec:
      Collector:
        Kind: StdOut
    Objective:
      Additional Metric Names:
        accuracy
      Goal: 0.99
      Objective Metric Name: Validation-accuracy
      Type: maximize
    Parallel Trial Count: 3
    Parameters:
      Feasible Space:
        Max: 0.03
        Min: 0.01
      Name: --lr
      Parameter Type: double
      Feasible Space:
        Max: 5
        Min: 2
      Name: --num-layers
      Parameter Type: int
      Feasible Space:
        List:
          sgd
          adam
          ftrl
      Name: --optimizer
      Parameter Type: categorical
    Trial Template:
      Go Template:
        Raw Template: apiVersion: batch/v1
  kind: Job
  metadata:
    name: {{.Trial}}
    namespace: {{.NameSpace}}
  spec:
    template:
      spec:
        containers:
        - name: {{.Trial}}
          image: docker.io/kubeflowkatib/mxnet-mnist-example
          command:
          - "python"
          - "/mxnet/example/image-classification/train_mnist.py"
          - "--batch-size=64"
          {{- with .HyperParameters}}
          {{- range .}}
          - "{{.Name}}={{.Value}}"
          {{- end}}
          {{- end}}
        restartPolicy: Never
  Status:
    Conditions:
      Last Transition Time: 2019-12-22T22:53:25Z
      Last Update Time: 2019-12-22T22:53:25Z
      Message: Experiment is created
      Reason: ExperimentCreated
      Status: True
      Type: Created
      Last Transition Time: 2019-12-22T22:55:10Z
      Last Update Time: 2019-12-22T22:55:10Z
      Message: Experiment is running
      Reason: ExperimentRunning
      Status: True
      Type: Running
    Current Optimal Trial:
      Observation:
        Metrics:
          Name: Validation-accuracy
          Value: 0.981091
      Parameter Assignments:
        Name: --lr
        Value: 0.025139701133432946
        Name: --num-layers
        Value: 4
        Name: --optimizer
        Value: sgd
    Start Time: 2019-12-22T22:53:25Z
    Trials: 12
    Trials Running: 2
    Trials Succeeded: 10
  Events: <none>

When the last value in Status.Conditions.Type is Succeeded, the experiment is complete.
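Rather than re-running the describe command by hand, you could poll the condition list from a script. The sketch below assumes the experiment lives in the kubeflow namespace and that a failed experiment ends with a condition of type Failed. That is an assumption, since this page only shows the Created, Running, and Succeeded types. The function names and the default timeout and interval are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wait for a Katib experiment to finish.

# Prints the Type of the most recent entry in status.conditions.
experiment_last_condition() {
  local name="$1" namespace="${2:-kubeflow}"
  kubectl -n "${namespace}" get experiment "${name}" \
    -o jsonpath='{.status.conditions[*].type}' |
    awk '{ print $NF }'
}

# Polls until the experiment finishes. Returns 0 on Succeeded, 1 on Failed
# (assumed condition type), and 2 if the timeout (in seconds) expires first.
wait_for_experiment() {
  local name="$1" namespace="${2:-kubeflow}" timeout="${3:-1800}" interval="${4:-30}"
  local waited=0 condition=""
  while [ "${waited}" -le "${timeout}" ]; do
    condition="$(experiment_last_condition "${name}" "${namespace}")"
    case "${condition}" in
      Succeeded) echo "${name}: Succeeded"; return 0 ;;
      Failed)    echo "${name}: Failed";    return 1 ;;
    esac
    sleep "${interval}"
    waited=$((waited + interval))
  done
  echo "${name}: timed out after ${timeout}s (last condition: ${condition})"
  return 2
}

# Usage (requires access to the cluster):
#   wait_for_experiment random-example kubeflow 3600 60
```

The jsonpath expression prints every condition type in order, and awk keeps the last one, which mirrors the "last value in Status.Conditions.Type" rule described above.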

View the results of the experiment in the Katib UI:

  • Open the Katib UI as described above.
  • Click Hyperparameter Tuning on the Katib home page.
  • Open the Katib menu panel on the left, then open the HP section and click Monitor:

The Katib menu panel

  • Click on the right-hand panel to close the menu panel. You should see the list of experiments:

The random example in the list of Katib experiments

  • Click the name of the experiment, random-example.

  • You should see a graph showing the level of accuracy for various combinations of the hyperparameter values (learning rate, number of layers, and optimizer):

Graph produced by the random example

  • Below the graph is a list of trials that ran within the experiment:

Trials that ran during the experiment
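The best result so far is also recorded on the experiment object itself, under the Current Optimal Trial field in the describe output earlier in this example, so you can read it from the command line as well. The following is a sketch only: it assumes the JSON field names are the camelCase forms of the labels in that output (status.currentOptimalTrial.parameterAssignments and status.currentOptimalTrial.observation.metrics), and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the current optimal trial of a Katib experiment.

# Prints the best trial's hyperparameters, then its metrics, one
# "name=value" pair per line.
experiment_best_trial() {
  local name="$1" namespace="${2:-kubeflow}"
  local params='{range .status.currentOptimalTrial.parameterAssignments[*]}{.name}{"="}{.value}{"\n"}{end}'
  local metrics='{range .status.currentOptimalTrial.observation.metrics[*]}{.name}{"="}{.value}{"\n"}{end}'
  kubectl -n "${namespace}" get experiment "${name}" -o jsonpath="${params}${metrics}"
}

# Reads "name=value" lines on stdin and prints the value for the given name.
value_of() {
  awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2) }'
}

# Usage (requires access to the cluster):
#   experiment_best_trial random-example
#   experiment_best_trial random-example | value_of Validation-accuracy
```

For the experiment state shown in the describe output, the first command would list --lr, --num-layers, and --optimizer followed by Validation-accuracy.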

TensorFlow example

Run the following command to launch an experiment using Kubeflow’s TensorFlow training job operator, TFJob:

  kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml

You can check the status of the experiment:

  kubectl -n kubeflow describe experiment tfjob-example

Follow the steps described for the random algorithm example above to see the results of the experiment in the Katib UI.

PyTorch example

Run the following command to launch an experiment using Kubeflow’s PyTorch training job operator, PyTorchJob:

  kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/pytorchjob-example.yaml

You can check the status of the experiment:

  kubectl -n kubeflow describe experiment pytorchjob-example

Follow the steps described for the random algorithm example above to see the results of the experiment in the Katib UI.

Cleanup

Delete the installed components. Run the following command from the directory where you cloned the Katib repository:

  bash ./katib/scripts/v1alpha3/undeploy.sh

If you created a PV for Katib, delete it:

  kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha3/pv/pv.yaml

Next steps

For details of how to configure and run your experiment, see the guide to running an experiment.