MPI Training

Instructions for using MPI for training

Alpha

This Kubeflow component has alpha status with limited support. See the Kubeflow versioning policies. The Kubeflow team is interested in your feedback about the usability of the feature.

This guide walks you through using MPI for training.

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. Please check out this blog post for an introduction to MPI Operator and its industry adoption.

Installation

You can deploy the operator with default settings by running the following commands:

  git clone https://github.com/kubeflow/mpi-operator
  cd mpi-operator
  kubectl create -f deploy/v1alpha2/mpi-operator.yaml
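
To verify that the operator is running, check its controller deployment (a quick sanity check; this assumes the default manifest installs the controller into the mpi-operator namespace):

  kubectl get deployment -n mpi-operator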

Alternatively, follow the getting started guide to deploy Kubeflow.

An alpha version of MPI support was introduced with Kubeflow 0.2.0, so you must be running Kubeflow 0.2.0 or later.

You can check whether the MPI Job custom resource is installed via:

  kubectl get crd

The output should include mpijobs.kubeflow.org, similar to the following:

  NAME                      AGE
  ...
  mpijobs.kubeflow.org      4d
  ...

If it is not included, you can add it using kustomize as follows:

  git clone https://github.com/kubeflow/mpi-operator
  cd mpi-operator/manifests
  kustomize build overlays/kubeflow | kubectl apply -f -

Note that since Kubernetes v1.14, kustomize has been a subcommand of kubectl, so you can also run the following command instead:

  kubectl kustomize base | kubectl apply -f -

Creating an MPI Job

You can create an MPI job by defining an MPIJob config file. See the TensorFlow benchmark example config file for launching a multi-node TensorFlow benchmark training job; a trimmed sketch of its key fields follows the command below. You may modify the config file to suit your requirements.

  cat examples/v1alpha2/tensorflow-benchmarks.yaml
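
At its core, the config declares one Launcher replica that runs mpirun and a set of Worker replicas that host the training processes. Here is a trimmed sketch of the key fields (values taken from the full spec shown in the monitoring section below; most mpirun flags are omitted for brevity):

  apiVersion: kubeflow.org/v1alpha2
  kind: MPIJob
  metadata:
    name: tensorflow-benchmarks
  spec:
    slotsPerWorker: 2        # MPI slots (ranks) offered by each worker pod
    cleanPodPolicy: Running
    mpiReplicaSpecs:
      Launcher:
        replicas: 1          # one pod that runs mpirun
        template:
          spec:
            containers:
            - name: tensorflow-benchmarks
              image: mpioperator/tensorflow-benchmarks:latest
              # mpirun flags trimmed; see the full spec in the next section
              command: ["mpirun", "-np", "2", "python",
                        "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py",
                        "--variable_update=horovod"]
      Worker:
        replicas: 1          # pods that host the training processes
        template:
          spec:
            containers:
            - name: tensorflow-benchmarks
              image: mpioperator/tensorflow-benchmarks:latest
              resources:
                limits:
                  nvidia.com/gpu: 2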

Deploy the MPIJob resource to start training:

  kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml

Monitoring an MPI Job

Once the MPIJob resource is created, you should be able to see the launcher pod and the worker pods matching the specified replica counts (here, one worker exposing two GPU slots, which matches the -np 2 passed to mpirun). You can also monitor the job status from the status section. Here is sample output when the job has completed successfully.

  kubectl get -o yaml mpijobs tensorflow-benchmarks

  apiVersion: kubeflow.org/v1alpha2
  kind: MPIJob
  metadata:
    creationTimestamp: "2019-07-09T22:15:51Z"
    generation: 1
    name: tensorflow-benchmarks
    namespace: default
    resourceVersion: "5645868"
    selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
    uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
  spec:
    cleanPodPolicy: Running
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          spec:
            containers:
            - command:
              - mpirun
              - --allow-run-as-root
              - -np
              - "2"
              - -bind-to
              - none
              - -map-by
              - slot
              - -x
              - NCCL_DEBUG=INFO
              - -x
              - LD_LIBRARY_PATH
              - -x
              - PATH
              - -mca
              - pml
              - ob1
              - -mca
              - btl
              - ^openib
              - python
              - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              - --model=resnet101
              - --batch_size=64
              - --variable_update=horovod
              image: mpioperator/tensorflow-benchmarks:latest
              name: tensorflow-benchmarks
      Worker:
        replicas: 1
        template:
          spec:
            containers:
            - image: mpioperator/tensorflow-benchmarks:latest
              name: tensorflow-benchmarks
              resources:
                limits:
                  nvidia.com/gpu: 2
    slotsPerWorker: 2
  status:
    completionTime: "2019-07-09T22:17:06Z"
    conditions:
    - lastTransitionTime: "2019-07-09T22:15:51Z"
      lastUpdateTime: "2019-07-09T22:15:51Z"
      message: MPIJob default/tensorflow-benchmarks is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2019-07-09T22:15:54Z"
      lastUpdateTime: "2019-07-09T22:15:54Z"
      message: MPIJob default/tensorflow-benchmarks is running.
      reason: MPIJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2019-07-09T22:17:06Z"
      lastUpdateTime: "2019-07-09T22:17:06Z"
      message: MPIJob default/tensorflow-benchmarks successfully completed.
      reason: MPIJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Launcher:
        succeeded: 1
      Worker: {}
    startTime: "2019-07-09T22:15:51Z"
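
If you only need the terminal condition rather than the whole object, you can filter with kubectl's JSONPath output (a sketch; the field paths follow the status block above):

  kubectl get mpijob tensorflow-benchmarks \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'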

Training runs for 100 steps and takes a few minutes on a GPU cluster. You can inspect the logs to watch the training progress. Once the job starts, access the logs from the launcher pod:

  PODNAME=$(kubectl get pods -l mpi_job_name=tensorflow-benchmarks,mpi_role_type=launcher -o name)
  kubectl logs -f ${PODNAME}

  TensorFlow:  1.14
  Model:       resnet101
  Dataset:     imagenet (synthetic)
  Mode:        training
  SingleSess:  False
  Batch size:  128 global
               64 per device
  Num batches: 100
  Num epochs:  0.01
  Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
  NUMA bind:   False
  Data format: NCHW
  Optimizer:   sgd
  Variables:   horovod
  ...
  40   images/sec: 154.4 +/- 0.7 (jitter = 4.0)   8.280
  40   images/sec: 154.4 +/- 0.7 (jitter = 4.1)   8.482
  50   images/sec: 154.8 +/- 0.6 (jitter = 4.0)   8.397
  50   images/sec: 154.8 +/- 0.6 (jitter = 4.2)   8.450
  60   images/sec: 154.5 +/- 0.5 (jitter = 4.1)   8.321
  60   images/sec: 154.5 +/- 0.5 (jitter = 4.4)   8.349
  70   images/sec: 154.5 +/- 0.5 (jitter = 4.0)   8.433
  70   images/sec: 154.5 +/- 0.5 (jitter = 4.4)   8.430
  80   images/sec: 154.8 +/- 0.4 (jitter = 3.6)   8.199
  80   images/sec: 154.8 +/- 0.4 (jitter = 3.8)   8.404
  90   images/sec: 154.6 +/- 0.4 (jitter = 3.7)   8.418
  90   images/sec: 154.6 +/- 0.4 (jitter = 3.6)   8.459
  100  images/sec: 154.2 +/- 0.4 (jitter = 4.0)   8.372
  100  images/sec: 154.2 +/- 0.4 (jitter = 4.0)   8.542
  ----------------------------------------------------------------
  total images/sec: 308.27

Docker Images

Docker images are built and pushed automatically to mpioperator on Docker Hub. You can also use the Dockerfiles in the repository to build the images yourself.
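
For example, to build the benchmark image locally and push it to your own registry (a sketch; the Dockerfile path and the registry name are placeholders you will need to adapt):

  docker build -t <your-registry>/tensorflow-benchmarks:latest -f path/to/Dockerfile .
  docker push <your-registry>/tensorflow-benchmarks:latest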
