Spark Doris Connector

Spark Doris Connector lets Spark read data stored in Doris and write data to Doris.

GitHub: https://github.com/apache/incubator-doris-spark-connector

  • Supports reading data from Doris.
  • Supports writing Spark DataFrames to Doris in batch or streaming mode.
  • A Doris table can be mapped to a DataFrame or an RDD; DataFrame is recommended.
  • Supports pushing data filtering down to Doris to reduce the amount of data transferred.

Version Compatibility

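| Connector     | Spark | Doris  | Java | Scala |
|---------------|-------|--------|------|-------|
| 2.3.4-2.11.xx | 2.x   | 0.12+  | 8    | 2.11  |
| 3.1.2-2.12.xx | 3.x   | 0.12.+ | 8    | 2.12  |
| 3.2.0-2.12.xx | 3.2.x | 0.12.+ | 8    | 2.12  |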

Build and Install

Preparation

1. Modify the custom_env.sh.tpl file and rename it to custom_env.sh.

2. Specify the thrift installation directory in custom_env.sh:

    ## original file content
    #export THRIFT_BIN=
    #export MVN_BIN=
    #export JAVA_HOME=

    ## amended as below, using macOS as an example
    export THRIFT_BIN=/opt/homebrew/Cellar/thrift@0.13.0/0.13.0/bin/thrift
    #export MVN_BIN=
    #export JAVA_HOME=

Install `thrift` 0.13.0 (Note: `Doris` 0.15 and the latest builds are based on `thrift` 0.13.0; previous versions are still built with `thrift` 0.9.3).

Windows:
  1. Download: `http://archive.apache.org/dist/thrift/0.13.0/thrift-0.13.0.exe`
  2. Rename thrift-0.13.0.exe to thrift

macOS:
  1. Install: `brew install thrift@0.13.0`
  2. Default path: /opt/homebrew/Cellar/thrift@0.13.0/0.13.0/bin/thrift

  Note: Running `brew install thrift@0.13.0` on macOS may report that the version cannot be found. If so, run the following in the terminal:
  1. `brew tap-new $USER/local-tap`
  2. `brew extract --version='0.13.0' thrift $USER/local-tap`
  3. `brew install thrift@0.13.0`
  Reference link: `https://gist.github.com/tonydeng/02e571f273d6cce4230dc8d5f394493c`

Linux:
  1. Download the source package: `wget https://archive.apache.org/dist/thrift/0.13.0/thrift-0.13.0.tar.gz`
  2. Install dependencies: `yum install -y autoconf automake libtool cmake ncurses-devel openssl-devel lzo-devel zlib-devel gcc gcc-c++`
  3. `tar zxvf thrift-0.13.0.tar.gz`
  4. `cd thrift-0.13.0`
  5. `./configure --without-tests`
  6. `make`
  7. `make install`

After installation is complete, check the version: `thrift --version`

Note: If you have compiled Doris, you do not need to install thrift; you can use $DORIS_HOME/thirdparty/installed/bin/thrift directly.

Run one of the following commands in the source directory:

    sh build.sh --spark 2.3.4 --scala 2.11   # Spark 2.3.4, Scala 2.11
    sh build.sh --spark 3.1.2 --scala 2.12   # Spark 3.1.2, Scala 2.12
    sh build.sh --spark 3.2.0 --scala 2.12 \
      --mvn-args "-Dnetty.version=4.1.68.Final -Dfasterxml.jackson.version=2.12.3"   # Spark 3.2.0, Scala 2.12

Note: If you check out the source code from a tag, you can just run `sh build.sh --tag` without specifying the Spark and Scala versions, because the versions are fixed in the tagged source code.

After successful compilation, the file doris-spark-2.3.4-2.11-1.0.0-SNAPSHOT.jar is generated in the output/ directory. Copy this file to the Spark ClassPath to use Spark-Doris-Connector. For example, if Spark runs in Local mode, put the file in the jars/ folder. If Spark runs in Yarn cluster mode, put the file in the pre-deployment package: for example, upload doris-spark-2.3.4-2.11-1.0.0-SNAPSHOT.jar to HDFS and add the HDFS file path to spark.yarn.jars.

  1. Upload doris-spark-connector-3.1.2-2.12-1.0.0.jar to HDFS.

    hdfs dfs -mkdir /spark-jars/
    hdfs dfs -put /your_local_path/doris-spark-connector-3.1.2-2.12-1.0.0.jar /spark-jars/

  2. Add the doris-spark-connector-3.1.2-2.12-1.0.0.jar dependency in the cluster.

    spark.yarn.jars=hdfs:///spark-jars/doris-spark-connector-3.1.2-2.12-1.0.0.jar

Using Maven

    <dependency>
      <groupId>org.apache.doris</groupId>
      <artifactId>spark-doris-connector-3.1_2.12</artifactId>
      <!--artifactId>spark-doris-connector-2.3_2.11</artifactId-->
      <version>1.0.1</version>
    </dependency>

Notes

Please replace the Connector version according to the different Spark and Scala versions.
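
The two artifactId values in the dependency snippet above suggest a naming pattern of `spark-doris-connector-<Spark major.minor>_<Scala version>`. The Python sketch below derives an artifactId under that assumption. The helper name `connector_artifact_id` is hypothetical, and the pattern is inferred from those two examples only, so check the published artifacts before relying on it for other versions.

```python
def connector_artifact_id(spark_version: str, scala_version: str) -> str:
    """Build a Maven artifactId from a Spark version and a Scala version.

    Assumes the pattern spark-doris-connector-<spark major.minor>_<scala>,
    inferred from the two artifactIds shown in the dependency snippet.
    """
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {spark_version!r}")
    spark_major_minor = ".".join(parts[:2])
    return f"spark-doris-connector-{spark_major_minor}_{scala_version}"


print(connector_artifact_id("3.1.2", "2.12"))
print(connector_artifact_id("2.3.4", "2.11"))
```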

Example

Read

SQL

    CREATE TEMPORARY VIEW spark_doris
    USING doris
    OPTIONS(
      "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
      "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
      "user"="$YOUR_DORIS_USERNAME",
      "password"="$YOUR_DORIS_PASSWORD"
    );

    SELECT * FROM spark_doris;

DataFrame

    val dorisSparkDF = spark.read.format("doris")
      .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
      .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
      .option("user", "$YOUR_DORIS_USERNAME")
      .option("password", "$YOUR_DORIS_PASSWORD")
      .load()

    dorisSparkDF.show(5)

RDD

    import org.apache.doris.spark._

    val dorisSparkRDD = sc.dorisRDD(
      tableIdentifier = Some("$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME"),
      cfg = Some(Map(
        "doris.fenodes" -> "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
        "doris.request.auth.user" -> "$YOUR_DORIS_USERNAME",
        "doris.request.auth.password" -> "$YOUR_DORIS_PASSWORD"
      ))
    )

    dorisSparkRDD.collect()

pySpark

    # The parentheses let the method chain span several lines in Python
    dorisSparkDF = (
        spark.read.format("doris")
        .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
        .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
        .option("user", "$YOUR_DORIS_USERNAME")
        .option("password", "$YOUR_DORIS_PASSWORD")
        .load()
    )
    # show 5 rows of data
    dorisSparkDF.show(5)
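
The read options above can also be collected in one place and passed to the reader together. The following is a minimal sketch, assuming an existing SparkSession is passed in as `spark`. The helper names `doris_read_options` and `read_doris_table` are hypothetical. The FE addresses are joined with commas because doris.fenodes accepts multiple addresses.

```python
def doris_read_options(database, table, fe_nodes, user, password):
    """Return the option map used by the DataFrame read examples.

    fe_nodes is a list of "host:port" strings; they are joined with commas
    because doris.fenodes accepts multiple FE addresses.
    """
    return {
        "doris.table.identifier": f"{database}.{table}",
        "doris.fenodes": ",".join(fe_nodes),
        "user": user,
        "password": password,
    }


def read_doris_table(spark, options):
    """Load a Doris table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("doris").options(**options).load()
```

With a live session, `read_doris_table(spark, doris_read_options(...)).show(5)` would behave like the pySpark example above.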

Write

SQL

    CREATE TEMPORARY VIEW spark_doris
    USING doris
    OPTIONS(
      "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
      "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
      "user"="$YOUR_DORIS_USERNAME",
      "password"="$YOUR_DORIS_PASSWORD"
    );

    INSERT INTO spark_doris VALUES ("VALUE1","VALUE2",...);
    -- or
    INSERT INTO spark_doris SELECT * FROM YOUR_TABLE;

DataFrame(batch/stream)

    // batch sink
    import spark.implicits._ // needed for toDF

    val mockDataDF = List(
      (3, "440403001005", "21.cn"),
      (1, "4404030013005", "22.cn"),
      (33, null, "23.cn")
    ).toDF("id", "mi_code", "mi_name")
    mockDataDF.show(5)

    mockDataDF.write.format("doris")
      .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
      .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
      .option("user", "$YOUR_DORIS_USERNAME")
      .option("password", "$YOUR_DORIS_PASSWORD")
      // other options
      // specify the fields to write
      .option("doris.write.fields", "$YOUR_FIELDS_TO_WRITE")
      .save()

    // stream sink (Structured Streaming)
    val kafkaSource = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "$YOUR_KAFKA_SERVERS")
      .option("startingOffsets", "latest")
      .option("subscribe", "$YOUR_KAFKA_TOPICS")
      .load()

    kafkaSource.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("doris")
      .option("checkpointLocation", "$YOUR_CHECKPOINT_LOCATION")
      .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
      .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
      .option("user", "$YOUR_DORIS_USERNAME")
      .option("password", "$YOUR_DORIS_PASSWORD")
      // other options
      // specify the fields to write
      .option("doris.write.fields", "$YOUR_FIELDS_TO_WRITE")
      .start()
      .awaitTermination()
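
The batch sink above is shown in Scala only. A PySpark equivalent might look like the sketch below. It assumes the `doris` format accepts the same write options from Python as from Scala, which this document does not state explicitly. The function name `write_to_doris` and its parameters are hypothetical.

```python
def write_to_doris(df, table_identifier, fe_nodes, user, password, write_fields=None):
    """Write a Spark DataFrame to Doris in batch mode.

    Mirrors the Scala batch sink; `df` is an existing PySpark DataFrame.
    write_fields is an optional list of column names for doris.write.fields.
    """
    writer = (
        df.write.format("doris")
        .option("doris.table.identifier", table_identifier)
        .option("doris.fenodes", fe_nodes)
        .option("user", user)
        .option("password", password)
    )
    if write_fields:
        # doris.write.fields takes a comma-separated list of field names
        writer = writer.option("doris.write.fields", ",".join(write_fields))
    writer.save()
```

When `write_fields` is omitted, the option is not set and all fields are written in the order of the Doris table fields, as described for doris.write.fields in the Configuration section.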

Configuration

General

| Key | Default Value | Comment |
|-----|---------------|---------|
| doris.fenodes | -- | Doris FE HTTP address; multiple addresses are supported, separated by commas |
| doris.table.identifier | -- | Doris table identifier, e.g. db1.tbl1 |
| doris.request.retries | 3 | Number of retries when sending requests to Doris |
| doris.request.connect.timeout.ms | 30000 | Connection timeout for requests sent to Doris |
| doris.request.read.timeout.ms | 30000 | Read timeout for requests sent to Doris |
| doris.request.query.timeout.s | 3600 | Doris query timeout; the default is 1 hour, and -1 means no timeout limit |
| doris.request.tablet.size | Integer.MAX_VALUE | The number of Doris Tablets corresponding to one RDD Partition. The smaller this value, the more partitions are generated. This increases parallelism on the Spark side but also puts more pressure on Doris. |
| doris.batch.size | 1024 | The maximum number of rows read from BE at one time. Increasing this value reduces the number of connections between Spark and Doris, and thus the extra time overhead caused by network latency. |
| doris.exec.mem.limit | 2147483648 | Memory limit for a single query, in bytes. The default is 2 GB. |
| doris.deserialize.arrow.async | false | Whether to asynchronously convert the Arrow format to the RowBatch required for spark-doris-connector iteration |
| doris.deserialize.queue.size | 64 | Internal processing queue for asynchronous Arrow conversion; takes effect when doris.deserialize.arrow.async is true |
| doris.write.fields | -- | Specifies the fields (or the order of the fields) to write to the Doris table, separated by commas. By default, all fields are written in the order of the Doris table fields. |
| sink.batch.size | 10000 | Maximum number of rows in a single write to BE |
| sink.max-retries | 1 | Number of retries after a write to BE fails |
| sink.properties.* | -- | The stream load parameters, e.g. 'sink.properties.column_separator' = ',' |
| doris.sink.task.partition.size | -- | The number of partitions for the write task. After filtering and other operations, a Spark RDD may have many partitions that each hold relatively few records, which increases write frequency and wastes computing resources. The smaller this value, the lower the Doris write frequency and the lower the Doris merge pressure. It is generally used with doris.sink.task.use.repartition. |
| doris.sink.task.use.repartition | false | Whether to use repartition to control the number of partitions written to Doris. The default is false, which uses coalesce (note: if there is no Spark action before the write, the whole computation will be less parallel). If set to true, repartition is used (note: you can set the final number of partitions, at the cost of a shuffle). |
| doris.sink.batch.interval.ms | 50 | The interval between batch sinks, in ms |
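
To make the effect of doris.request.tablet.size concrete, the sketch below splits a list of tablet IDs into groups of at most that size, one group per RDD Partition. This is a simplified model based only on the description above: the connector's real planning may also take other factors into account, such as which backend hosts each tablet. The name `group_tablets` and the tablet IDs are hypothetical.

```python
def group_tablets(tablet_ids, tablet_size):
    """Split tablet IDs into groups of at most tablet_size (one group per partition)."""
    if tablet_size < 1:
        raise ValueError("tablet_size must be at least 1")
    return [
        tablet_ids[i:i + tablet_size]
        for i in range(0, len(tablet_ids), tablet_size)
    ]


tablets = list(range(10))  # ten hypothetical tablet IDs
for size in (10, 4, 1):
    # Smaller tablet_size values yield more partitions
    print(size, len(group_tablets(tablets, size)))
```

In this model, ten tablets give 1, 3, and 10 partitions for sizes 10, 4, and 1: more Spark parallelism, but more concurrent requests against Doris.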

SQL & DataFrame Configuration

| Key | Default Value | Comment |
|-----|---------------|---------|
| user | -- | Doris username |
| password | -- | Doris password |
| doris.filter.query.in.max.count | 100 | In predicate pushdown, the maximum number of elements in the value list of an IN expression. If this number is exceeded, the IN-expression filtering is processed on the Spark side. |
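
The rule for doris.filter.query.in.max.count can be read as a simple threshold check. The sketch below follows the wording above (Spark handles the filter only when the limit is exceeded). The connector's exact comparison at the boundary is not stated here, so treat the `<=` as an assumption. The function name `in_filter_side` is hypothetical.

```python
def in_filter_side(values, max_count=100):
    """Return "doris" if an IN list may be pushed down, otherwise "spark".

    max_count plays the role of doris.filter.query.in.max.count.
    """
    return "doris" if len(values) <= max_count else "spark"


print(in_filter_side(list(range(50))))
print(in_filter_side(list(range(500))))
```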

RDD Configuration

| Key | Default Value | Comment |
|-----|---------------|---------|
| doris.request.auth.user | -- | Doris username |
| doris.request.auth.password | -- | Doris password |
| doris.read.field | -- | List of column names in the Doris table, separated by commas |
| doris.filter.query | -- | Filter expression of the query, passed through to Doris unchanged. Doris uses this expression to filter data on the source side. |
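
One way to picture doris.read.field and doris.filter.query is as the column list and the WHERE clause of a scan. The sketch below renders that idea as SQL text. It is an illustration of the effect only, not the query string the connector actually builds, and the function name `sketch_scan_sql` and the sample column names are hypothetical.

```python
def sketch_scan_sql(table_identifier, read_field=None, filter_query=None):
    """Illustrate the effect of doris.read.field and doris.filter.query as SQL text."""
    columns = read_field if read_field else "*"
    sql = f"SELECT {columns} FROM {table_identifier}"
    if filter_query:
        sql += f" WHERE {filter_query}"
    return sql


# Keys as they would appear in the RDD configuration map
cfg = {
    "doris.read.field": "id,mi_code",
    "doris.filter.query": "id > 10",
}
print(sketch_scan_sql("db1.tbl1", cfg["doris.read.field"], cfg["doris.filter.query"]))
```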

Doris & Spark Column Type Mapping

| Doris Type | Spark Type |
|------------|------------|
| NULL_TYPE | DataTypes.NullType |
| BOOLEAN | DataTypes.BooleanType |
| TINYINT | DataTypes.ByteType |
| SMALLINT | DataTypes.ShortType |
| INT | DataTypes.IntegerType |
| BIGINT | DataTypes.LongType |
| FLOAT | DataTypes.FloatType |
| DOUBLE | DataTypes.DoubleType |
| DATE | DataTypes.StringType [1] |
| DATETIME | DataTypes.StringType [1] |
| BINARY | DataTypes.BinaryType |
| DECIMAL | DecimalType |
| CHAR | DataTypes.StringType |
| LARGEINT | DataTypes.StringType |
| VARCHAR | DataTypes.StringType |
| DECIMALV2 | DecimalType |
| TIME | DataTypes.DoubleType |
| HLL | Unsupported datatype |

  • [1] Note: In the Connector, DATE and DATETIME are mapped to String. Because of the processing logic of the underlying Doris storage engine, using the time types directly would not cover the required time range, so the Connector returns the corresponding human-readable time text as a String.
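
Because DATE, DATETIME, and LARGEINT arrive as strings, values collected on the driver may need converting before use. The sketch below shows one way to do that with the Python standard library. It assumes the text formats are `yyyy-MM-dd` for DATE and `yyyy-MM-dd HH:mm:ss` for DATETIME, which this document does not specify. The function names are hypothetical.

```python
from datetime import date, datetime


def parse_doris_date(text):
    """Parse a DATE string, assuming the form yyyy-MM-dd."""
    return date.fromisoformat(text)


def parse_doris_datetime(text):
    """Parse a DATETIME string, assuming the form yyyy-MM-dd HH:mm:ss."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def parse_doris_largeint(text):
    """Parse a LARGEINT string; Python integers have arbitrary precision."""
    return int(text)


print(parse_doris_date("2022-01-31"))
print(parse_doris_datetime("2022-01-31 08:30:00"))
print(parse_doris_largeint("170141183460469231731687303715884105727"))
```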