MXNet Training

Instructions for using MXNet

Out of date

This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1.

Alpha

This Kubeflow component has alpha status with limited support. See the Kubeflow versioning policies. The Kubeflow team is interested in your feedback about the usability of the feature.

This guide walks you through using MXNet with Kubeflow.

Installing MXNet Operator

If you haven’t already done so, follow the Getting Started Guide to deploy Kubeflow.

MXNet support was first introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.

Verify that MXNet support is included in your Kubeflow deployment

Check that the MXNet custom resource is installed:

  kubectl get crd

The output should include mxjobs.kubeflow.org:

  NAME                  AGE
  ...
  mxjobs.kubeflow.org   4d
  ...
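The check above can also be scripted. The following is a minimal sketch, assuming kubectl is on the PATH and already points at the right cluster; the function name mxjob_crd_installed is made up for illustration. It asks for the CRD by name instead of scanning the full list.

```shell
# Succeeds (exit status 0) only when the MXJob CRD is registered in the cluster.
# The function name is illustrative and not part of Kubeflow or mxnet-operator.
mxjob_crd_installed() {
  kubectl get crd mxjobs.kubeflow.org >/dev/null 2>&1
}
```

For example, `mxjob_crd_installed || echo "MXJob CRD is missing"` prints the message only when the lookup fails. The lookup also fails when the cluster is unreachable, so a failure does not by itself prove that the CRD is absent.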

If it is not included, you can add it as follows:

  git clone https://github.com/kubeflow/manifests
  cd manifests/mxnet-job/mxnet-operator
  kubectl kustomize base | kubectl apply -f -

Alternatively, you can deploy the operator with default settings without using kustomize by running the following from the repo:

  git clone https://github.com/kubeflow/mxnet-operator.git
  cd mxnet-operator
  kubectl create -f manifests/crd-v1beta1.yaml
  kubectl create -f manifests/rbac.yaml
  kubectl create -f manifests/deployment.yaml

Creating an MXNet training job

Create a training job by defining an MXJob with MXTrain mode and then creating it with:

  kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml

Creating a TVM tuning job (AutoTVM)

TVM is an end-to-end deep learning compiler stack, and you can run AutoTVM with mxnet-operator. Create an auto-tuning job by defining an MXJob with MXTune mode and then creating it with:

  kubectl create -f examples/v1beta1/tune/mx_job_tune_gpu.yaml

Before you use the auto-tuning example, some preparation is required. First, to let TVM tune your network, create a Docker image that includes the TVM module. Second, write an auto-tuning script that specifies which network to tune and sets the auto-tuning parameters; for details, see https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py. Finally, write a startup script that launches the auto-tuning program: mxnet-operator sets all the parameters as environment variables, and the startup script must read these variables and pass them to the auto-tuning script. An example is provided under examples/v1beta1/tune/; in that example, the tuning result is saved in a log file such as resnet-18.log. Refer to the example for details.
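The startup-script step can be pictured with a small sketch. Every name in it is a placeholder: TUNE_NETWORK, TUNE_TRIALS, auto-tuning.py, and its flags are hypothetical and are not the variables that mxnet-operator actually sets. Check the example under examples/v1beta1/tune/ for the real ones. The sketch only shows the general pattern of reading environment variables, failing early when one is missing, and turning them into arguments for the tuning script.

```shell
# HYPOTHETICAL names throughout: TUNE_NETWORK, TUNE_TRIALS, auto-tuning.py and
# its flags are placeholders, not names defined by mxnet-operator.
build_tune_command() {
  # ${VAR:?message} stops with an error when VAR is unset or empty.
  : "${TUNE_NETWORK:?TUNE_NETWORK is not set}"
  : "${TUNE_TRIALS:?TUNE_TRIALS is not set}"
  # Print the command line that the startup script would go on to run.
  printf 'python auto-tuning.py --network %s --n-trial %s\n' \
    "$TUNE_NETWORK" "$TUNE_TRIALS"
}
```

Printing the command instead of running it keeps the sketch easy to inspect; a real startup script would run the tuning program directly with the same arguments.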

Monitoring an MXNet Job

To get the status of your job:

  kubectl get -o yaml mxjobs ${JOB}

Here is sample output for an example job:

  apiVersion: kubeflow.org/v1beta1
  kind: MXJob
  metadata:
    creationTimestamp: 2019-03-19T09:24:27Z
    generation: 1
    name: mxnet-job
    namespace: default
    resourceVersion: "3681685"
    selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
    uid: cb11013b-4a28-11e9-b7f4-704d7bb59f71
  spec:
    cleanPodPolicy: All
    jobMode: MXTrain
    mxReplicaSpecs:
      Scheduler:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              ports:
              - containerPort: 9091
                name: mxjob-port
              resources: {}
      Server:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              ports:
              - containerPort: 9091
                name: mxjob-port
              resources: {}
      Worker:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - args:
              - /incubator-mxnet/example/image-classification/train_mnist.py
              - --num-epochs
              - "10"
              - --num-layers
              - "2"
              - --kv-store
              - dist_device_sync
              - --gpus
              - "0"
              command:
              - python
              image: mxjob/mxnet:gpu
              name: mxnet
              ports:
              - containerPort: 9091
                name: mxjob-port
              resources:
                limits:
                  nvidia.com/gpu: "1"
  status:
    completionTime: 2019-03-19T09:25:11Z
    conditions:
    - lastTransitionTime: 2019-03-19T09:24:27Z
      lastUpdateTime: 2019-03-19T09:24:27Z
      message: MXJob mxnet-job is created.
      reason: MXJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-03-19T09:24:27Z
      lastUpdateTime: 2019-03-19T09:24:29Z
      message: MXJob mxnet-job is running.
      reason: MXJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-03-19T09:24:27Z
      lastUpdateTime: 2019-03-19T09:25:11Z
      message: MXJob mxnet-job is successfully completed.
      reason: MXJobSucceeded
      status: "True"
      type: Succeeded
    mxReplicaStatuses:
      Scheduler: {}
      Server: {}
      Worker: {}
    startTime: 2019-03-19T09:24:29Z
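In the sample output, the job is finished when the condition with type Succeeded has status "True". When only that answer is needed, a JSONPath query can pull out the single field instead of the full YAML. The following is a minimal sketch, assuming kubectl is configured for the cluster and namespace that hold the job; the function name mxjob_succeeded is made up for illustration.

```shell
# Succeeds only when the named MXJob reports a Succeeded condition with
# status "True". $1 is the job name, for example mxnet-job.
# The function name is illustrative and not part of Kubeflow or mxnet-operator.
mxjob_succeeded() {
  job_status=$(kubectl get mxjobs "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}')
  [ "$job_status" = "True" ]
}
```

For example, `mxjob_succeeded "${JOB}" && echo "done"` prints the message once the job has completed successfully. A job that has no Succeeded condition yet yields an empty string, which the function treats as not succeeded.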

Last modified 03.08.2020: Added outdated banner to non-index docs unchanged in last 30d (#2072) (e56f3650)