QuickStart in Kubernetes

Kubernetes deployment runs DolphinScheduler in a Kubernetes cluster. It can schedule a massive number of tasks and is suitable for production use.

If you are new to DolphinScheduler and want to try out its features, we recommend the Standalone deployment. If you want to try more complete features and schedule a large number of tasks, we recommend the pseudo-cluster deployment. If you want to deploy DolphinScheduler in production, we recommend the cluster deployment or the Kubernetes deployment.

Prerequisites

Helm version 3.1.0+
Kubernetes version 1.12+
PV provisioner support in the underlying infrastructure
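
One way to check the first two prerequisites is to compare the installed versions against the minimums listed above. The sketch below is only an illustration: the helper name version_ge is hypothetical, the version strings are example values, and it assumes a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  # After a version sort, the first line is the smaller version.
  # A >= B exactly when that smaller version is B.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with literal values; substitute the versions reported by
# `helm version` and `kubectl version` on your machine.
if version_ge "3.8.0" "3.1.0"; then
  echo "Helm version is sufficient"
fi
```

The same helper can be used for the Kubernetes minimum, for example version_ge "<your cluster version>" "1.12".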

Install DolphinScheduler

Download the source code package apache-dolphinscheduler-<version>-src.tar.gz from the download page.

To install a release named dolphinscheduler, run the following commands:

$ tar -zxvf apache-dolphinscheduler-<version>-src.tar.gz
$ cd apache-dolphinscheduler-<version>-src/deploy/kubernetes/dolphinscheduler
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm dependency update .
$ helm install dolphinscheduler . --set image.tag=<version>

To install the release named dolphinscheduler into the test namespace:

$ helm install dolphinscheduler . -n test

Tip: If you use a namespace named test, add the optional parameter -n test to the helm and kubectl commands.

These commands deploy DolphinScheduler on the Kubernetes cluster with the default configuration. The Appendix-Configuration section lists the parameters that can be configured during installation.
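
The two install variants can be wrapped in one small helper. This is a minimal sketch, not part of DolphinScheduler: the function name ds_install is hypothetical, and it assumes that the target namespace has to exist before helm install -n can use it, so it creates the namespace when it is missing. Run it from the deploy/kubernetes/dolphinscheduler directory.

```shell
# ds_install VERSION [NAMESPACE]: install the chart in the current directory
# as a release named dolphinscheduler. The function name is illustrative.
ds_install() {
  version="$1"
  namespace="${2:-}"
  if [ -n "$namespace" ]; then
    # Assumption: the namespace must exist before installing into it,
    # so create it if `kubectl get namespace` cannot find it.
    kubectl get namespace "$namespace" >/dev/null 2>&1 ||
      kubectl create namespace "$namespace"
    helm install dolphinscheduler . --set image.tag="$version" -n "$namespace"
  else
    helm install dolphinscheduler . --set image.tag="$version"
  fi
}
```

For example, ds_install <version> installs into the current namespace, and ds_install <version> test installs into the test namespace.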

Tip: List all releases using helm list

By default, the PostgreSQL service (username root, password root, database dolphinscheduler) and the ZooKeeper service are started.

Access DolphinScheduler UI

If ingress.enabled in values.yaml is set to true, you can access http://${ingress.host}/dolphinscheduler in a browser.

Tip: If you have problems with ingress access, contact your Kubernetes administrator and refer to the Ingress documentation.

Otherwise, when api.service.type=ClusterIP, you need to run a port-forward command:

$ kubectl port-forward --address 0.0.0.0 svc/dolphinscheduler-api 12345:12345
$ kubectl port-forward --address 0.0.0.0 -n test svc/dolphinscheduler-api 12345:12345 # with test namespace

Tip: If the error unable to do port forwarding: socat not found appears, install socat first.

Then open http://localhost:12345/dolphinscheduler/ui in a browser (modify the IP address if needed).

Or, when api.service.type=NodePort, run the following commands (replace default with your namespace, for example test):

NAMESPACE=default
NODE_IP=$(kubectl get no -o jsonpath="{.items[0].status.addresses[0].address}")
NODE_PORT=$(kubectl get svc dolphinscheduler-api -n "$NAMESPACE" -o jsonpath="{.spec.ports[0].nodePort}")
echo http://$NODE_IP:$NODE_PORT/dolphinscheduler

Access the web: http://$NODE_IP:$NODE_PORT/dolphinscheduler.

The default username is admin and the default password is dolphinscheduler123.

Refer to the Quick Start chapter to explore how to use DolphinScheduler.

Uninstall the Chart

To uninstall or delete the dolphinscheduler deployment:

$ helm uninstall dolphinscheduler

The command removes all the Kubernetes components associated with the dolphinscheduler release (except PVCs) and deletes the release.

Run the command below to delete the PVCs associated with dolphinscheduler:

$ kubectl delete pvc -l app.kubernetes.io/instance=dolphinscheduler

Note: Deleting the PVCs also deletes all data. Proceed with caution.
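
Because the deletion cannot be undone, it can help to list the matching PVCs before removing them. The sketch below reuses the label selector from the delete command; the function name ds_list_pvcs is hypothetical and not part of DolphinScheduler.

```shell
# ds_list_pvcs [NAMESPACE]: list the PVCs that the delete command would remove.
# The function name is illustrative; the label selector is the one used above.
ds_list_pvcs() {
  if [ -n "${1:-}" ]; then
    kubectl get pvc -l app.kubernetes.io/instance=dolphinscheduler -n "$1"
  else
    kubectl get pvc -l app.kubernetes.io/instance=dolphinscheduler
  fi
}
```

Run ds_list_pvcs (or ds_list_pvcs test for the test namespace), review the output, and only then run the kubectl delete pvc command.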

Configuration

The configuration file is values.yaml. The Appendix-Configuration tables list the configurable parameters of DolphinScheduler and their default values.

Support Matrix

Type	Support	Notes
Shell	Yes
Python2	Yes
Python3	Indirect Yes	Refer to FAQ
Hadoop2	Indirect Yes	Refer to FAQ
Hadoop3	Not Sure	Not tested
Spark-Local(client)	Indirect Yes	Refer to FAQ
Spark-YARN(cluster)	Indirect Yes	Refer to FAQ
Spark-Standalone(cluster)	Not Yet
Spark-Kubernetes(cluster)	Not Yet
Flink-Local(local>=1.11)	Not Yet	Generic CLI mode is not yet supported
Flink-YARN(yarn-cluster)	Indirect Yes	Refer to FAQ
Flink-YARN(yarn-session/yarn-per-job/yarn-application>=1.11)	Not Yet	Generic CLI mode is not yet supported
Flink-Standalone(default)	Not Yet
Flink-Standalone(remote>=1.11)	Not Yet	Generic CLI mode is not yet supported
Flink-Kubernetes(default)	Not Yet
Flink-Kubernetes(remote>=1.11)	Not Yet	Generic CLI mode is not yet supported
Flink-NativeKubernetes(kubernetes-session/application>=1.11)	Not Yet	Generic CLI mode is not yet supported
MapReduce	Indirect Yes	Refer to FAQ
Kerberos	Indirect Yes	Refer to FAQ
HTTP	Yes
DataX	Indirect Yes	Refer to FAQ
Sqoop	Indirect Yes	Refer to FAQ
SQL-MySQL	Indirect Yes	Refer to FAQ
SQL-PostgreSQL	Yes
SQL-Hive	Indirect Yes	Refer to FAQ
SQL-Spark	Indirect Yes	Refer to FAQ
SQL-ClickHouse	Indirect Yes	Refer to FAQ
SQL-Oracle	Indirect Yes	Refer to FAQ
SQL-SQLServer	Indirect Yes	Refer to FAQ
SQL-DB2	Indirect Yes	Refer to FAQ

FAQ

How to View the Logs of a Pod Container?

List all pods (aka po):

kubectl get po
kubectl get po -n test # with test namespace

View the logs of a pod container named dolphinscheduler-master-0:

kubectl logs dolphinscheduler-master-0
kubectl logs -f dolphinscheduler-master-0 # follow log output
kubectl logs --tail 10 dolphinscheduler-master-0 -n test # show last 10 lines from the end of the logs

How to Scale API, master and worker on Kubernetes?

List all deployments (aka deploy):

kubectl get deploy
kubectl get deploy -n test # with test namespace

Scale api to 3 replicas:

kubectl scale --replicas=3 deploy dolphinscheduler-api
kubectl scale --replicas=3 deploy dolphinscheduler-api -n test # with test namespace

List all stateful sets (aka sts):

kubectl get sts
kubectl get sts -n test # with test namespace

Scale master to 2 replicas:

kubectl scale --replicas=2 sts dolphinscheduler-master
kubectl scale --replicas=2 sts dolphinscheduler-master -n test # with test namespace

Scale worker to 6 replicas:

kubectl scale --replicas=6 sts dolphinscheduler-worker
kubectl scale --replicas=6 sts dolphinscheduler-worker -n test # with test namespace
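
The scale commands above differ only in the resource kind (deploy for api, sts for master and worker), the component name, and the optional namespace. A minimal sketch of a wrapper that captures this mapping is shown below; the function name ds_scale is hypothetical, and it assumes the release is named dolphinscheduler as in the rest of this page.

```shell
# ds_scale COMPONENT REPLICAS [NAMESPACE]
# COMPONENT is one of: api, master, worker. The function name is illustrative.
ds_scale() {
  component="$1"
  replicas="$2"
  namespace="${3:-}"
  case "$component" in
    api)           kind="deploy" ;;  # the API server runs as a deployment
    master|worker) kind="sts" ;;     # master and worker run as stateful sets
    *) echo "unknown component: $component" >&2; return 1 ;;
  esac
  if [ -n "$namespace" ]; then
    kubectl scale --replicas="$replicas" "$kind" "dolphinscheduler-$component" -n "$namespace"
  else
    kubectl scale --replicas="$replicas" "$kind" "dolphinscheduler-$component"
  fi
}
```

For example, ds_scale worker 6 test runs the same command as the last line above.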

How to Use MySQL as the DolphinScheduler’s Database Instead of PostgreSQL?

Because of its commercial license, we cannot bundle the MySQL driver directly.

If you want to use MySQL, you can build a new image based on the apache/dolphinscheduler-<service> image by following the instructions below.

Since version 3.0.0, DolphinScheduler has been split into microservices, so changing the metadata storage requires adding the MySQL driver to every service image: dolphinscheduler-tools, dolphinscheduler-master, dolphinscheduler-worker, dolphinscheduler-api and dolphinscheduler-alert-server.

Download the MySQL driver mysql-connector-java-8.0.16.jar.
Create a new Dockerfile to add MySQL driver:

FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-<service>:<version>
# For example:
# FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-tools:<version>
# Note: when building the dolphinscheduler-tools image, change the COPY line below to:
#   COPY mysql-connector-java-8.0.16.jar /opt/dolphinscheduler/tools/libs
# The other services need no changes.
COPY mysql-connector-java-8.0.16.jar /opt/dolphinscheduler/libs

Build a new docker image including MySQL driver:

docker build -t apache/dolphinscheduler-<service>:mysql-driver .

Push the docker image apache/dolphinscheduler-<service>:mysql-driver to a docker registry.
Modify image repository and update tag to mysql-driver in values.yaml.
Set postgresql.enabled to false in values.yaml.
Modify externalDatabase (especially modify host, username and password) in values.yaml:

externalDatabase:
  type: "mysql"
  host: "localhost"
  port: "3306"
  username: "root"
  password: "root"
  database: "dolphinscheduler"
  params: "useUnicode=true&characterEncoding=UTF-8"

Run a DolphinScheduler release in Kubernetes (See Install DolphinScheduler).
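
Since five images need the driver and only the dolphinscheduler-tools image uses a different library path, the Dockerfile can be generated per service. The sketch below only prints the Dockerfile text. The function name ds_mysql_dockerfile is hypothetical, and <version> remains a placeholder to fill in.

```shell
# ds_mysql_dockerfile SERVICE VERSION: print a Dockerfile that adds the MySQL driver.
# SERVICE is the suffix after "dolphinscheduler-", e.g. tools, master, worker.
# The function name is illustrative and not part of DolphinScheduler.
ds_mysql_dockerfile() {
  svc="$1"
  ver="$2"
  if [ "$svc" = "tools" ]; then
    libs="/opt/dolphinscheduler/tools/libs"   # tools image keeps its jars here
  else
    libs="/opt/dolphinscheduler/libs"         # all other service images
  fi
  printf 'FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-%s:%s\n' "$svc" "$ver"
  printf 'COPY mysql-connector-java-8.0.16.jar %s\n' "$libs"
}

# Print the Dockerfile for each of the five services; VERSION is a placeholder.
VERSION="<version>"
for service in tools master worker api alert-server; do
  echo "# --- dolphinscheduler-$service ---"
  ds_mysql_dockerfile "$service" "$VERSION"
done
```

To build one image, redirect the output for that service into a file named Dockerfile next to the driver jar, then run the docker build command shown above.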

How to Support MySQL or Oracle Datasource in `Datasource manage`?

Because of their commercial licenses, we cannot bundle the MySQL or Oracle drivers directly.

If you want to add a MySQL or Oracle datasource, you can build a new image based on the apache/dolphinscheduler-<service> image by following the instructions below.

You need to change two service images: dolphinscheduler-worker and dolphinscheduler-api.

Download the MySQL driver mysql-connector-java-8.0.16.jar, or download the Oracle driver ojdbc8.jar (such as ojdbc8-19.9.0.0.jar).
Create a new Dockerfile to add MySQL or Oracle driver:

FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-<service>:<version>
# For example
# FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-worker:<version>
# If you want to support MySQL Datasource
COPY mysql-connector-java-8.0.16.jar /opt/dolphinscheduler/libs
# If you want to support Oracle Datasource
COPY ojdbc8-19.9.0.0.jar /opt/dolphinscheduler/libs

Build a new docker image including MySQL or Oracle driver:

docker build -t apache/dolphinscheduler-<service>:new-driver .

Push the docker image apache/dolphinscheduler-<service>:new-driver to a docker registry.
Modify image repository and update tag to new-driver in values.yaml.
Run a DolphinScheduler release in Kubernetes (See Install DolphinScheduler).
Add a MySQL or Oracle datasource in Datasource manage.

How to Support Python 2 pip and Custom requirements.txt?

Just change the image of the dolphinscheduler-worker service.

Create a new Dockerfile to install pip:

FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-worker:<version>
COPY requirements.txt /tmp
RUN apt-get update && \
    apt-get install -y --no-install-recommends python-pip && \
    pip install --no-cache-dir -r /tmp/requirements.txt && \
    rm -rf /var/lib/apt/lists/*

This installs the default pip 18.1. To upgrade pip, add the following line to the RUN command before the pip install -r line:

pip install --no-cache-dir -U pip && \

Build a new docker image including pip:

docker build -t apache/dolphinscheduler-worker:pip .

Push the docker image apache/dolphinscheduler-worker:pip to a docker registry.
Modify image repository and update tag to pip in values.yaml.
Run a DolphinScheduler release in Kubernetes (See Install DolphinScheduler).
Verify pip under a new Python task.

How to Support Python 3?

Just change the image of the dolphinscheduler-worker service.

Create a new Dockerfile to install Python 3:

FROM dolphinscheduler.docker.scarf.sh/apache/dolphinscheduler-worker:<version>
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 && \
    rm -rf /var/lib/apt/lists/*

This installs the default Python 3.7.3. If you also want to install pip3, replace python3 with python3-pip:

apt-get install -y --no-install-recommends python3-pip && \

Build a new docker image including Python 3:

docker build -t apache/dolphinscheduler-worker:python3 .

Push the docker image apache/dolphinscheduler-worker:python3 to a docker registry.
Modify image repository and update tag to python3 in values.yaml.
Modify PYTHON_HOME to /usr/bin/python3 in values.yaml.
Run a DolphinScheduler release in Kubernetes (See Install DolphinScheduler).
Verify Python 3 under a new Python task.

How to Support Hadoop, Spark, Flink, Hive or DataX?

Take Spark 2.4.7 as an example:

Download the Spark 2.4.7 release binary spark-2.4.7-bin-hadoop2.7.tgz.
Ensure that common.sharedStoragePersistence.enabled is turned on.
Run a DolphinScheduler release in Kubernetes (See Install DolphinScheduler).
Copy the Spark 2.4.7 release binary into the Docker container.

kubectl cp spark-2.4.7-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft
kubectl cp -n test spark-2.4.7-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft # with test namespace

Because the sharedStoragePersistence volume is mounted on /opt/soft, files in /opt/soft are preserved.

Attach to the container and ensure that SPARK_HOME2 exists.

kubectl exec -it dolphinscheduler-worker-0 -- bash
kubectl exec -n test -it dolphinscheduler-worker-0 -- bash # with test namespace
cd /opt/soft
tar zxf spark-2.4.7-bin-hadoop2.7.tgz
rm -f spark-2.4.7-bin-hadoop2.7.tgz
ln -s spark-2.4.7-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version

The last command will print the Spark version if everything goes well.

Verify Spark under a Shell task.

$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.11-2.4.7.jar

Check whether the task log contains output like Pi is roughly 3.146015.

Verify Spark under a Spark task.

Upload the file spark-examples_2.11-2.4.7.jar to the resources first, then create a Spark task with:

Spark Version: SPARK2
Main Class: org.apache.spark.examples.SparkPi
Main Package: spark-examples_2.11-2.4.7.jar
Deploy Mode: local

Similarly, check whether the task log contains output like Pi is roughly 3.146015.

Verify Spark on YARN.

Spark on YARN (Deploy Mode is cluster or client) requires Hadoop support. Adding Hadoop support follows almost the same steps as adding Spark support above.

Ensure that $HADOOP_HOME and $HADOOP_CONF_DIR exist.
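
A quick way to perform this check inside the worker container is to verify that each variable is set and points to an existing directory. The sketch below is an illustration only; the function name require_dir_var is hypothetical.

```shell
# require_dir_var NAME: succeed only if the variable NAME is set and
# refers to an existing directory. The function name is illustrative.
require_dir_var() {
  name="$1"
  # Accept only plain variable names, because the name is passed to eval below.
  case "$name" in
    ''|*[!A-Za-z0-9_]*) echo "invalid variable name: $name" >&2; return 2 ;;
  esac
  # Portable indirect lookup: read the value of the variable called $name.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "$name is not set" >&2
    return 1
  fi
  if [ ! -d "$value" ]; then
    echo "$name=$value is not a directory" >&2
    return 1
  fi
  echo "$name=$value"
}
```

For example, require_dir_var HADOOP_HOME && require_dir_var HADOOP_CONF_DIR prints both paths when the environment is set up and reports the first missing one otherwise. The same check works for SPARK_HOME2.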

How to Support Spark 3?

In fact, applications are submitted with spark-submit in the same way regardless of whether Spark 1, 2 or 3 is used. In other words, SPARK_HOME2 means the second SPARK_HOME rather than the home of Spark 2, so just set SPARK_HOME2=/path/to/spark3.

Take Spark 3.1.1 as an example:

Download the Spark 3.1.1 release binary spark-3.1.1-bin-hadoop2.7.tgz.
Ensure that common.sharedStoragePersistence.enabled is turned on.
Run a DolphinScheduler release in Kubernetes (See Install DolphinScheduler).
Copy the Spark 3.1.1 release binary into the Docker container.

kubectl cp spark-3.1.1-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft
kubectl cp -n test spark-3.1.1-bin-hadoop2.7.tgz dolphinscheduler-worker-0:/opt/soft # with test namespace

Attach to the container and ensure that SPARK_HOME2 exists.

kubectl exec -it dolphinscheduler-worker-0 -- bash
kubectl exec -n test -it dolphinscheduler-worker-0 -- bash # with test namespace
cd /opt/soft
tar zxf spark-3.1.1-bin-hadoop2.7.tgz
rm -f spark-3.1.1-bin-hadoop2.7.tgz
ln -s spark-3.1.1-bin-hadoop2.7 spark2 # or just mv
$SPARK_HOME2/bin/spark-submit --version

The last command will print the Spark version if everything goes well.

Verify Spark under a Shell task.

$SPARK_HOME2/bin/spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME2/examples/jars/spark-examples_2.12-3.1.1.jar

Check whether the task log contains output like Pi is roughly 3.146015.

How to Support Shared Storage Between Master, Worker and API Server?

For example, Master, Worker and API server may use Hadoop at the same time.

Modify the following configuration in values.yaml, setting enabled to true (the snippet shows the default values):

common:
  sharedStoragePersistence:
    enabled: false
    mountPath: "/opt/soft"
    accessModes:
    - "ReadWriteMany"
    storageClassName: "-"
    storage: "20Gi"

Set storageClassName and storage to match your environment.

Note: storageClassName must support the access mode: ReadWriteMany.

Copy Hadoop into the directory /opt/soft.
Ensure that $HADOOP_HOME and $HADOOP_CONF_DIR are correct.

How to Support Local File Resource Storage Instead of HDFS and S3?

Modify the following configurations in values.yaml:

common:
  configmap:
    RESOURCE_STORAGE_TYPE: "HDFS"
    RESOURCE_UPLOAD_PATH: "/dolphinscheduler"
    FS_DEFAULT_FS: "file:///"
  fsFileResourcePersistence:
    enabled: true
    accessModes:
    - "ReadWriteMany"
    storageClassName: "-"
    storage: "20Gi"