VolcanoJob

Introduction

VolcanoJob, referred to as vcjob, is a custom resource (CRD) provided by Volcano. Unlike a Kubernetes Job, it offers more advanced features such as a specified scheduler, a minimum number of members, task definitions, lifecycle management, a specific queue, and a specific priority. VolcanoJob is ideal for high-performance computing scenarios such as machine learning, big data applications, and scientific computing.

Example

  apiVersion: batch.volcano.sh/v1alpha1
  kind: Job
  metadata:
    name: test-job
  spec:
    minAvailable: 3
    schedulerName: volcano
    priorityClassName: high-priority
    policies:
      - event: PodEvicted
        action: RestartJob
    plugins:
      ssh: []
      env: []
      svc: []
    maxRetry: 5
    queue: default
    volumes:
      - mountPath: "/myinput"
      - mountPath: "/myoutput"
        volumeClaimName: "testvolumeclaimname"
        volumeClaim:
          accessModes: [ "ReadWriteOnce" ]
          storageClassName: "my-storage-class"
          resources:
            requests:
              storage: 1Gi
    tasks:
      - replicas: 6
        name: "default-nginx"
        template:
          metadata:
            name: web
          spec:
            containers:
              - image: nginx
                imagePullPolicy: IfNotPresent
                name: nginx
                resources:
                  requests:
                    cpu: "1"
            restartPolicy: OnFailure

Key Fields

schedulerName

schedulerName indicates the scheduler that will schedule the job. Currently, the value can be volcano or default-scheduler, with volcano used by default.

minAvailable

minAvailable represents the minimum number of running pods required for the job. The job is considered running only when the number of running pods is not less than minAvailable.
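
For illustration, the spec fragment below (the task names and replica counts are hypothetical, and the pod templates are omitted) declares four pods in total but a gang of three: in line with Volcano's gang scheduling, the job is scheduled only when three pods can start together, and it stays pending until then.

  # Hypothetical VolcanoJob spec fragment: 4 pods in total, gang-scheduled as a group of 3.
  minAvailable: 3
  tasks:
    - replicas: 2
      name: "master"
      template: {}      # pod template omitted for brevity
    - replicas: 2
      name: "worker"
      template: {}      # pod template omitted for brevity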

volumes

volumes indicates the configuration of the volumes mounted for the job. It complies with the volume configuration requirements in Kubernetes.

tasks.replicas

tasks.replicas indicates the number of pod replicas in a task.

tasks.template

tasks.template defines the pod configuration of a task. It is the same as a pod template in Kubernetes.

tasks.policies

tasks.policies defines the lifecycle policy of a task.

policies

policies defines the default lifecycle policy for all tasks when tasks.policies is not set.
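
The fragment below sketches how the two levels interact, reusing the events and actions that appear in the examples in this document (pod templates omitted): the job-level policy applies to every task, while the worker task declares its own policy.

  # Hypothetical VolcanoJob spec fragment showing job-level and task-level policies.
  spec:
    policies:                 # default for all tasks: restart the whole job if any pod is evicted
      - event: PodEvicted
        action: RestartJob
    tasks:
      - replicas: 1
        name: "ps"
        template: {}          # pod template omitted for brevity
      - replicas: 2
        name: "worker"
        policies:             # task-level policy: complete the job when this task finishes
          - event: TaskCompleted
            action: CompleteJob
        template: {}          # pod template omitted for brevity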

plugins

plugins indicates the plugins used by Volcano when the job is scheduled.
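
As a rough guide (based on how the plugins are used in the TensorFlow example later in this document), the three plugins that appear in the examples serve the following purposes:

  plugins:
    ssh: []    # distributes SSH keys so the job's pods can log in to one another without passwords (useful for MPI-style jobs)
    svc: []    # creates a service for the job and hostname files such as /etc/volcano/ps.host and /etc/volcano/worker.host
    env: []    # injects the task index into each container, e.g. the VK_TASK_INDEX variable read in the TensorFlow example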

queue

queue indicates the queue to which the job belongs.
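
Queues are themselves Volcano custom resources. As a sketch (the queue name, weight, and reclaimable setting below are assumptions rather than part of the examples), a dedicated queue could be created and then referenced from the job's queue field:

  apiVersion: scheduling.volcano.sh/v1beta1
  kind: Queue
  metadata:
    name: test-queue      # hypothetical queue name; the examples in this document use the built-in default queue
  spec:
    weight: 1             # relative share of cluster resources compared with other queues
    reclaimable: true     # idle resources in this queue may be reclaimed by other queues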

priorityClassName

priorityClassName indicates the priority of the job. It is used in preemptive scheduling.
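
priorityClassName refers to a standard Kubernetes PriorityClass. A minimal sketch of the high-priority class referenced in the first example might look as follows (the value shown is an assumption; choose one that fits your cluster):

  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: high-priority
  value: 1000000          # jobs using this class are scheduled, and preempt, ahead of lower values
  globalDefault: false
  description: "Hypothetical priority class for high-priority VolcanoJobs."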

maxRetry

maxRetry indicates the maximum number of retries allowed for the job.

Status

pending

pending indicates that the job is waiting to be scheduled.

aborting

aborting indicates that the job is being aborted because of some external factors.

aborted

aborted indicates that the job has already been aborted because of some external factors.

running

running indicates that there are at least minAvailable pods running.

restarting

restarting indicates that the job is restarting.

completing

completing indicates that there are at least minAvailable pods in the completing state. The job is doing cleanup.

completed

completed indicates that there are at least minAvailable pods in the completed state. The job has completed cleanup.

terminating

terminating indicates that the job is being terminated because of some internal factors. The job is waiting for its pods to release resources.

terminated

terminated indicates that the job has already been terminated because of some internal factors.

failed

failed indicates that the job still cannot start after maxRetry retries.

Usage

TensorFlow Workload

Create a TensorFlow workload with one ps pod and two worker pods.

  apiVersion: batch.volcano.sh/v1alpha1
  kind: Job
  metadata:
    name: tensorflow-dist-mnist
  spec:
    minAvailable: 3           # There must be at least 3 available pods.
    schedulerName: volcano    # The volcano scheduler is specified.
    plugins:
      env: []
      svc: []
    policies:
      - event: PodEvicted     # Restart the job when a pod is evicted.
        action: RestartJob
    tasks:
      - replicas: 1           # One ps pod specified
        name: ps
        template:             # Definition of the ps pod
          spec:
            containers:
              - command:
                  - sh
                  - -c
                  - |
                    PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                    WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                    export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                    python /var/tf_dist_mnist/dist_mnist.py
                image: volcanosh/dist-mnist-tf-example:0.0.1
                name: tensorflow
                ports:
                  - containerPort: 2222
                    name: tfjob-port
                resources: {}
            restartPolicy: Never
      - replicas: 2           # Two worker pods specified
        name: worker
        policies:
          - event: TaskCompleted    # The job is marked as completed when the two worker pods finish their tasks.
            action: CompleteJob
        template:             # Definition of the worker pods
          spec:
            containers:
              - command:
                  - sh
                  - -c
                  - |
                    PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                    WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                    export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                    python /var/tf_dist_mnist/dist_mnist.py
                image: volcanosh/dist-mnist-tf-example:0.0.1
                name: tensorflow
                ports:
                  - containerPort: 2222
                    name: tfjob-port
                resources: {}
            restartPolicy: Never

Argo Workload

Create an Argo workflow in which each step creates a VolcanoJob with two pod replicas. Each job is considered normal when at least one of its pod replicas works normally.

  apiVersion: argoproj.io/v1alpha1
  kind: Workflow
  metadata:
    generateName: volcano-step-job-
  spec:
    entrypoint: volcano-step-job
    serviceAccountName: argo
    templates:
      - name: volcano-step-job
        steps:
          - - name: hello-1
              template: hello-tmpl
              arguments:
                parameters: [{name: message, value: hello1}, {name: task, value: hello1}]
          - - name: hello-2a
              template: hello-tmpl
              arguments:
                parameters: [{name: message, value: hello2a}, {name: task, value: hello2a}]
            - name: hello-2b
              template: hello-tmpl
              arguments:
                parameters: [{name: message, value: hello2b}, {name: task, value: hello2b}]
      - name: hello-tmpl
        inputs:
          parameters:
            - name: message
            - name: task
        resource:
          action: create
          successCondition: status.state.phase = Completed
          failureCondition: status.state.phase = Failed
          manifest: |     # Definition of the VolcanoJob
            apiVersion: batch.volcano.sh/v1alpha1
            kind: Job
            metadata:
              generateName: step-job-{{inputs.parameters.task}}-
              ownerReferences:
                - apiVersion: argoproj.io/v1alpha1
                  blockOwnerDeletion: true
                  kind: Workflow
                  name: "{{workflow.name}}"
                  uid: "{{workflow.uid}}"
            spec:
              minAvailable: 1
              schedulerName: volcano
              policies:
                - event: PodEvicted
                  action: RestartJob
              plugins:
                ssh: []
                env: []
                svc: []
              maxRetry: 1
              queue: default
              tasks:
                - replicas: 2
                  name: "default-hello"
                  template:
                    metadata:
                      name: helloworld
                    spec:
                      containers:
                        - image: docker/whalesay
                          imagePullPolicy: IfNotPresent
                          command: [cowsay]
                          args: ["{{inputs.parameters.message}}"]
                          name: hello
                          resources:
                            requests:
                              cpu: "100m"
                      restartPolicy: OnFailure

MindSpore Workload

Create a MindSpore workload with eight pod replicas. The workload is considered normal when at least one pod replica works normally.

  apiVersion: batch.volcano.sh/v1alpha1
  kind: Job
  metadata:
    name: mindspore-cpu
  spec:
    minAvailable: 1
    schedulerName: volcano
    policies:
      - event: PodEvicted
        action: RestartJob
    plugins:
      ssh: []
      env: []
      svc: []
    maxRetry: 5
    queue: default
    tasks:
      - replicas: 8
        name: "pod"
        template:
          spec:
            containers:
              - command: ["/bin/bash", "-c", "python /tmp/lenet.py"]
                image: lyd911/mindspore-cpu-example:0.2.0
                imagePullPolicy: IfNotPresent
                name: mindspore-cpu-job
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "1"
            restartPolicy: OnFailure

Note

Supported Frameworks

Volcano supports almost all mainstream computing frameworks, including:

  1. TensorFlow
  2. PyTorch
  3. MindSpore
  4. PaddlePaddle
  5. Spark
  6. Flink
  7. Open MPI
  8. Horovod
  9. MXNet
  10. Kubeflow
  11. Argo
  12. KubeGene

volcano or default-scheduler

Compared with default-scheduler, Volcano provides enhanced batch scheduling capabilities, which makes it ideal for high-performance computing scenarios such as machine learning, big data applications, and scientific computing.