ROUTINE LOAD

description

Routine Load function allows users to submit a resident load task, and continuously load data into Doris by continuously reading data from the specified data source. Currently, only text data format (CSV) data is loaded from Kakfa by means of no authentication or SSL authentication.

Syntax:

  1. CREATE ROUTINE LOAD [db.]job_name ON tbl_name
  2. [load_properties]
  3. [job_properties]
  4. FROM data_source
  5. [data_source_properties]
  1. [db.]job_name

    The name of the load job, in the same database, only one job can run with the same name.

  2. tbl_name

    Specifies the name of the table that needs to be loaded.

  3. load_properties

    Used to describe the load data. grammar:

    1. [column_separator],
    2. [columns_mapping],
    3. [where_predicates],
    4. [partitions]
    1. column_separator:

      Specify column separators, such as:

      COLUMNS TERMINATED BY ","

      The default is: \t

    2. columns_mapping:

      Specifies the mapping of columns in the source data and defines how the derived columns are generated.

      1. Map column:

        Specify in order, which columns in the source data correspond to which columns in the destination table. For columns that you want to skip, you can specify a column name that does not exist.

        Suppose the destination table has three columns k1, k2, v1. The source data has 4 columns, of which columns 1, 2, and 4 correspond to k2, k1, and v1, respectively. Write as follows:

        COLUMNS (k2, k1, xxx, v1)

        Where xxx is a column that does not exist and is used to skip the third column in the source data.

      2. Derived columns:

        A column represented in the form of col_name = expr, which we call a derived column. That is, the value of the corresponding column in the destination table is calculated by expr.

        Derived columns are usually arranged after the mapped column. Although this is not mandatory, Doris always parses the mapped columns first and then parses the derived columns.

        Following an example, assume that the destination table also has column 4, v2, which is generated by the sum of k1 and k2. You can write as follows:

        COLUMNS (k2, k1, xxx, v1, v2 = k1 + k2);

    3. where_predicates

      Used to specify filter criteria to filter out unwanted columns. Filter columns can be either mapped columns or derived columns.

      For example, if we only want to load a column with k1 greater than 100 and k2 equal to 1000, we would write as follows:

      WHERE k1 > 100 and k2 = 1000

    4. partitions

      Specifies which partitions of the load destination table. If not specified, it will be automatically loaded into the corresponding partition.

      Example:

      PARTITION(p1, p2, p3)

  4. job_properties

    A generic parameter that specifies a routine load job.

    syntax:

    1. PROPERTIES (
    2. "key1" = "val1",
    3. "key2" = "val2"
    4. )

    Currently we support the following parameters:

    1. desired_concurrent_number

      The degree of concurrency desired. A routine load job is split into multiple subtasks. This parameter specifies how many tasks can be executed simultaneously in a job. Must be greater than 0. The default is 3.

      This concurrency is not the actual concurrency. The actual concurrency will be considered by the number of nodes in the cluster, the load, and the data source.

      example:

      "desired_concurrent_number" = "3"

    2. max_batch_interval/max_batch_rows/max_batch_size

      These three parameters represent:

      1. The maximum execution time of each subtask, in seconds. The range is 5 to 60. The default is 10.

      2. The maximum number of rows read per subtask. Must be greater than or equal to 200,000. The default is 200000.

      3. The maximum number of bytes read per subtask. The unit is byte and the range is 100MB to 1GB. The default is 100MB.

  1. These three parameters are used to control the execution time and throughput of a subtask. When either one reaches the threshold, the task ends.
  2. example:
  3. ```
  4. "max_batch_interval" = "20",
  5. "max_batch_rows" = "300000",
  6. "max_batch_size" = "209715200"
  7. ```
  8. 3. `max_error_number`
  9. The maximum number of error lines allowed in the sampling window. Must be greater than or equal to 0. The default is 0, which means that no error lines are allowed.
  10. The sampling window is max\_batch\_rows \* 10. That is, if the number of error lines is greater than max\_error\_number in the sampling window, the routine job will be suspended, and manual intervention is required to check the data quality problem.
  11. Lines that are filtered by the where condition are not counted as error lines.
  12. 4. `strict_mode`
  13. Whether to enable strict mode, the default is on. If turned on, the column type transformation of non-null raw data is filtered if the result is NULL. Specified as "strict\_mode" = "true"
  14. 5. timezone
  15. Specifies the time zone in which the job will be loaded. The default by using session variable's timezone. This parameter affects all function results related to the time zone involved in the load.
  1. data_source

    The type of data source. Current support:

    KAFKA

  2. data_source_properties

    Specify information about the data source.

    syntax:

    1. (
    2. "key1" = "val1",
    3. "key2" = "val2"
    4. )
    1. KAFKA data source

      Kafka_broker_list

      Kafka’s broker connection information. The format is ip:host. Multiple brokare separated by commas.

      Example:

      "kafka_broker_list" = "broker1:9092,broker2:9092"

    2. kafka_topic

      Specify the topic of Kafka to subscribe to.

      Example:

      "kafka_topic" = "my_topic"

    3. kafka_partitions/kafka_offsets

      Specify the kafka partition to be subscribed to, and the corresponding star offset for each partition.

      Offset can specify a specific offset from 0 or greater, or:

      1. OFFSET_BEGINNING: Subscribe from the location where the data is avaie.

      2. OFFSET_END: ​​Subscribe from the end.

  1. If not specified, all partitions under topic are subscribed by default fromSET\_END.
  2. Example:
  3. ```
  4. "kafka_partitions" = "0,1,2,3",
  5. "kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END"
  6. ```
  7. 4. property
  8. Specify custom kafka parameters.
  9. The function is equivalent to the "--property" parameter in the kafka shel
  10. When the value of the parameter is a file, you need to add the keyword: "FILbefore the value.
  11. For information on how to create a file, see "HELP CREATE FILE;"
  12. For more supported custom parameters, see the configuration items on the nt side in the official CONFIGURATION documentation for librdkafka.
  13. Example:
  14. ```
  15. "property.client.id" = "12345",
  16. "property.ssl.ca.location" = "FILE:ca.pem"
  17. ```
  18. 1. When connecting to Kafka using SSL, you need to specify the follg parameters:
  19. ```
  20. "property.security.protocol" = "ssl",
  21. "property.ssl.ca.location" = "FILE:ca.pem",
  22. "property.ssl.certificate.location" = "FILE:client.pem",
  23. "property.ssl.key.location" = "FILE:client.key",
  24. "property.ssl.key.password" = "abcdefg"
  25. ```
  26. among them:
  27. "property.security.protocol" and "property.ssl.ca.location" are requ to indicate the connection method is SSL and the location of the CA certate.
  28. If the client authentication is enabled on the Kafka server, you alsod to set:
  29. ```
  30. "property.ssl.certificate.location"
  31. "property.ssl.key.location"
  32. "property.ssl.key.password"
  33. ```
  34. Used to specify the public key of the client, the private key, and the word of the private key.
  35. 2. Specify the default starting offset for kafka partition
  36. If kafka\_partitions/kafka\_offsets is not specified, all partitions are umed by default, and you can specify kafka\_default\_offsets to specify the star offset. The default is OFFSET\_END, which starts at the end of the substion.
  37. Values:
  38. 1. OFFSET\_BEGINNING: Subscribe from the location where the data is avaie.
  39. 2. OFFSET\_END: Subscribe from the end.
  40. Example:
  41. `"property.kafka_default_offsets" = "OFFSET_BEGINNING"`
  1. load data format sample

    Integer class (TINYINT/SMALLINT/INT/BIGINT/LARGEINT): 1, 1000, 1234

    Floating point class (FLOAT/DOUBLE/DECIMAL): 1.1, 0.23, .356 Date class (DATE/DATETIME): 2017-10-03, 2017-06-13 12:34:03.

    String class (CHAR/VARCHAR) (without quotes): I am a student, a

    NULL value: \N

example

  1. Create a Kafka routine load task named test1 for the example_tbl of example_db. Specify group.id and client.id, and automatically consume all partitions by default, with subscriptions starting at the end (OFFSET_END)

    1. CREATE ROUTINE LOAD example_db.test1 ON example_tbl
    2. COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100)
    3. PROPERTIES
    4. (
    5. "desired_concurrent_number"="3",
    6. "max_batch_interval" = "20",
    7. "max_batch_rows" = "300000",
    8. "max_batch_size" = "209715200",
    9. "strict_mode" = "false"
    10. )
    11. FROM KAFKA
    12. (
    13. "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
    14. "kafka_topic" = "my_topic",
    15. "property.group.id" = "xxx",
    16. "property.client.id" = "xxx"
    17. );
  2. Create a Kafka routine load task named test1 for the example_tbl of example_db. The load task is in strict mode.

    1. CREATE ROUTINE LOAD example_db.test1 ON example_tbl
    2. COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100),
    3. WHERE k1 > 100 and k2 like "%doris%"
    4. PROPERTIES
    5. (
    6. "desired_concurrent_number"="3",
    7. "max_batch_interval" = "20",
    8. "max_batch_rows" = "300000",
    9. "max_batch_size" = "209715200",
    10. "strict_mode" = "false"
    11. )
    12. FROM KAFKA
    13. (
    14. "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
    15. "kafka_topic" = "my_topic",
    16. "kafka_partitions" = "0,1,2,3",
    17. "kafka_offsets" = "101,0,0,200"
    18. );
  3. load data from Kafka clusters via SSL authentication. Also set the client.id parameter. The load task is in non-strict mode and the time zone is Africa/Abidjan

    1. CREATE ROUTINE LOAD example_db.test1 ON example_tbl
    2. COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100),
    3. WHERE k1 > 100 and k2 like "%doris%"
    4. PROPERTIES
    5. (
    6. "desired_concurrent_number"="3",
    7. "max_batch_interval" = "20",
    8. "max_batch_rows" = "300000",
    9. "max_batch_size" = "209715200",
    10. "strict_mode" = "false",
    11. "timezone" = "Africa/Abidjan"
    12. )
    13. FROM KAFKA
    14. (
    15. "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
    16. "kafka_topic" = "my_topic",
    17. "property.security.protocol" = "ssl",
    18. "property.ssl.ca.location" = "FILE:ca.pem",
    19. "property.ssl.certificate.location" = "FILE:client.pem",
    20. "property.ssl.key.location" = "FILE:client.key",
    21. "property.ssl.key.password" = "abcdefg",
    22. "property.client.id" = "my_client_id"
    23. );

keyword

  1. CREATE, ROUTINE, LOAD