Spark on Volcano

Last updated on Jul 31, 2021

Introduction to Spark

Spark is a fast, general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Spark on Volcano

There are two ways to run Spark on Volcano. Here we use the simpler spark-operator form [1]; a more complex deployment method is described in [2].
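
Whichever form you choose, Volcano itself must already be running in the cluster. A quick sanity check, assuming Volcano was installed into its default volcano-system namespace:

  $ kubectl get po -n volcano-system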

Install spark-operator via Helm.

  $ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
  $ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace
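
If the installation succeeded, Helm should report the release as deployed. One way to confirm, reusing the my-release name from the command above:

  $ helm status my-release -n spark-operator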

To make sure spark-operator is running normally, check with the following command.

  $ kubectl get po -n spark-operator
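
The chart also registers the operator's CRDs, which the manifest in the next step depends on. As an optional extra check (the group name here matches the apiVersion used in spark-pi.yaml below):

  $ kubectl get crd | grep sparkoperator.k8s.io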

Here we use the official spark-pi.yaml example.

  apiVersion: "sparkoperator.k8s.io/v1beta2"
  kind: SparkApplication
  metadata:
    name: spark-pi
    namespace: default
  spec:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v3.0.0"
    imagePullPolicy: Always
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
    sparkVersion: "3.0.0"
    batchScheduler: "volcano" # Note: the batch scheduler name must be specified with `volcano`
    restartPolicy:
      type: Never
    volumes:
      - name: "test-volume"
        hostPath:
          path: "/tmp"
          type: Directory
    driver:
      cores: 1
      coreLimit: "1200m"
      memory: "512m"
      labels:
        version: 3.0.0
      serviceAccount: spark
      volumeMounts:
        - name: "test-volume"
          mountPath: "/tmp"
    executor:
      cores: 1
      instances: 1
      memory: "512m"
      labels:
        version: 3.0.0
      volumeMounts:
        - name: "test-volume"
          mountPath: "/tmp"
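
Two details in this manifest are worth calling out. First, batchScheduler: "volcano" is what hands the application's pods over to Volcano for gang scheduling. The v1beta2 API also exposes a batchSchedulerOptions field, so a sketch of pinning the job to a Volcano queue (the default queue exists out of the box) would look like:

  spec:
    batchScheduler: "volcano"
    batchSchedulerOptions:
      queue: "default"

Second, the driver runs under the spark service account, which the manifest assumes already exists. If it does not, a minimal sketch of creating it with broad edit rights (tighten the RBAC on real clusters):

  $ kubectl create serviceaccount spark
  $ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark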

Deploy the Spark application and check its status.

  $ kubectl apply -f spark-pi.yaml
  $ kubectl get SparkApplication
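
If everything worked, the application should move through RUNNING to COMPLETED. The operator names the driver pod <app-name>-driver by convention, so the computed result for this example can be pulled from the driver log, and the PodGroup created for Volcano can be listed to confirm that Volcano actually scheduled the pods (names below follow those conventions, so adjust if yours differ):

  $ kubectl logs spark-pi-driver | grep "Pi is roughly"
  $ kubectl get podgroup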