Export Query Result

This document describes how to use the SELECT INTO OUTFILE command to export query results.

Syntax

The SELECT INTO OUTFILE statement can export the query results to a file. Currently, it only supports exporting to remote storage such as HDFS, S3, and BOS through the Broker process. The syntax is as follows:

  1. query_stmt
  2. INTO OUTFILE "file_path"
  3. [format_as]
  4. WITH BROKER `broker_name`
  5. [broker_properties]
  6. [other_properties]
  • file_path

    file_path specify the file path and file name prefix. Like: hdfs://path/to/my_file_.

    The final file name will be assembled as my_file_, file seq no and the format suffix. File seq no starts from 0, determined by the number of split.

    1. my_file_0.csv
    2. my_file_1.csv
    3. my_file_2.csv
  • [format_as]

    1. FORMAT AS CSV

    Specify the export format. The default is CSV.

  • [properties]

    Specify the relevant attributes. Currently only export through Broker process is supported. Broker related attributes need to be prefixed with broker.. For details, please refer to Broker.

    1. PROPERTIES
    2. ("broker.prop_key" = "broker.prop_val", ...)

    Other properties

    1. PROPERTIELS
    2. ("key1" = "val1", "key2" = "val2", ...)

    currently supports the following properties:

    • column_separator: Column separator, only applicable to CSV format. The default is \t.
    • line_delimiter: Line delimiter, only applicable to CSV format. The default is \n.
    • max_file_size:The max size of a single file. Default is 1GB. Range from 5MB to 2GB. Files exceeding this size will be splitted.
  1. Example 1

    Export simple query results to the file hdfs:/path/to/result.txt. Specify the export format as CSV. Use my_broker and set kerberos authentication information. Specify the column separator as , and the line delimiter as \n.

    1. SELECT * FROM tbl
    2. INTO OUTFILE "hdfs:/path/to/result_"
    3. FORMAT AS CSV
    4. PROPERTIELS
    5. (
    6. "broker.name" = "my_broker",
    7. "broker.hadoop.security.authentication" = "kerberos",
    8. "broker.kerberos_principal" = "doris@YOUR.COM",
    9. "broker.kerberos_keytab" = "/home/doris/my.keytab"
    10. "column_separator" = ",",
    11. "line_delimiter" = "\n",
    12. "max_file_size" = "100MB"
    13. );

    If the result is less than 100MB, file will be: result_0.csv.

    If larger than 100MB, may be: result_0.csv, result_1.csv, ....

  2. Example 2

    Export the query result of the CTE statement to the file hdfs:/path/to/result.txt. The default export format is CSV. Use my_broker and set hdfs high availability information. Use the default column separators and line delimiter.

    1. WITH
    2. x1 AS
    3. (SELECT k1, k2 FROM tbl1),
    4. x2 AS
    5. (SELECT k3 FROM tbl2)
    6. SELEC k1 FROM x1 UNION SELECT k3 FROM x2
    7. INTO OUTFILE "hdfs:/path/to/result_"
    8. PROPERTIELS
    9. (
    10. "broker.name" = "my_broker",
    11. "broker.username"="user",
    12. "broker.password"="passwd",
    13. "broker.dfs.nameservices" = "my_ha",
    14. "broker.dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
    15. "broker.dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
    16. "broker.dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
    17. "broker.dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    18. );

    If the result is less than 1GB, file will be: result_0.csv.

    If larger than 1GB, may be: result_0.csv, result_1.csv, ....

  3. Example 3

    Export the query results of the UNION statement to the file bos://bucket/result.parquet. Specify the export format as PARQUET. Use my_broker and set hdfs high availability information. PARQUET format does not need to specify the column separator and line delimiter.

    1. SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
    2. INTO OUTFILE "bos://bucket/result_"
    3. FORMAT AS PARQUET
    4. PROPERTIELS
    5. (
    6. "broker.name" = "my_broker",
    7. "broker.bos_endpoint" = "http://bj.bcebos.com",
    8. "broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    9. "broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy"
    10. );

    If the result is less than 1GB, file will be: result_0.parquet.

    If larger than 1GB, may be: result_0.parquet, result_1.parquet, ....

Return result

The command is a synchronization command. The command returns, which means the operation is over.

If it exports and returns normally, the result is as follows:

  1. mysql> SELECT * FROM tbl INTO OUTFILE ... Query OK, 100000 row affected (5.86 sec)

100000 row affected Indicates the size of the exported result set.

If the execution is incorrect, an error message will be returned, such as:

  1. mysql> SELECT * FROM tbl INTO OUTFILE ... ERROR 1064 (HY000): errCode = 2, detailMessage = Open broker writer failed ...

Notice

  • The CSV format does not support exporting binary types, such as BITMAP and HLL types. These types will be output as \N, which is null.
  • The query results are exported from a single BE node and a single thread. Therefore, the export time and the export result set size are positively correlated.
  • The export command does not check whether the file and file path exist. Whether the path will be automatically created or whether the existing file will be overwritten is entirely determined by the semantics of the remote storage system.
  • If an error occurs during the export process, the exported file may remain on the remote storage system. Doris will not clean these files. The user needs to manually clean up.
  • The timeout of the export command is the same as the timeout of the query. It can be set by SET query_timeout = xxx.
  • For empty result query, there will be an empty file.
  • File spliting will ensure that a row of data is stored in a single file. Therefore, the size of the file is not strictly equal to max_file_size.
  • For functions whose output is invisible characters, such as BITMAP and HLL types, the output is \N, which is NULL.
  • At present, the output type of some geo functions, such as ST_Point is VARCHAR, but the actual output value is an encoded binary character. Currently these functions will output garbled characters. For geo functions, use ST_AsText for output.