Hive External Table of Doris

Hive External Table of Doris provides Doris with direct access to Hive external tables, which eliminates the need for cumbersome data import and solves the problem of analyzing Hive tables with the help of Doris’ OLAP capabilities:

  1. support for Hive data sources to access Doris
  2. Support joint queries between Doris and Hive data sources to perform more complex analysis operations
  3. Support access to kerberos-enabled Hive data sources

This document introduces how to use this feature and the considerations.

Glossary

Noun in Doris

  • FE: Frontend, the front-end node of Doris, responsible for metadata management and request access.
  • BE: Backend, the backend node of Doris, responsible for query execution and data storage

How To Use

Create Hive External Table

  1. -- Syntax
  2. CREATE [EXTERNAL] TABLE table_name (
  3. col_name col_type [NULL | NOT NULL] [COMMENT "comment"]
  4. ) ENGINE=HIVE
  5. [COMMENT "comment"] )
  6. PROPERTIES (
  7. 'property_name'='property_value',
  8. ...
  9. );
  10. -- Example 1: Create the hive_table table under hive_db in a Hive cluster
  11. CREATE TABLE `t_hive` (
  12. `k1` int NOT NULL COMMENT "",
  13. `k2` char(10) NOT NULL COMMENT "",
  14. `k3` datetime NOT NULL COMMENT "",
  15. `k5` varchar(20) NOT NULL COMMENT "",
  16. `k6` double NOT NULL COMMENT ""
  17. ) ENGINE=HIVE
  18. COMMENT "HIVE"
  19. PROPERTIES (
  20. 'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
  21. 'database' = 'hive_db',
  22. 'table' = 'hive_table'
  23. );
  24. -- Example 2: Create the hive_table table under hive_db in a Hive cluster with HDFS HA configuration.
  25. CREATE TABLE `t_hive` (
  26. `k1` int NOT NULL COMMENT "",
  27. `k2` char(10) NOT NULL COMMENT "",
  28. `k3` datetime NOT NULL COMMENT "",
  29. `k5` varchar(20) NOT NULL COMMENT "",
  30. `k6` double NOT NULL COMMENT ""
  31. ) ENGINE=HIVE
  32. COMMENT "HIVE"
  33. PROPERTIES (
  34. 'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
  35. 'database' = 'hive_db',
  36. 'table' = 'hive_table',
  37. 'dfs.nameservices'='hacluster',
  38. 'dfs.ha.namenodes.hacluster'='n1,n2',
  39. 'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
  40. 'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
  41. 'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
  42. );
  43. -- Example 3: Create the hive external table under hive_db in Hive cluster with HDFS HA and enable kerberos authentication.
  44. CREATE TABLE `t_hive` (
  45. `k1` int NOT NULL COMMENT "",
  46. `k2` char(10) NOT NULL COMMENT "",
  47. `k3` datetime NOT NULL COMMENT "",
  48. `k5` varchar(20) NOT NULL COMMENT "",
  49. `k6` double NOT NULL COMMENT ""
  50. ) ENGINE=HIVE
  51. COMMENT "HIVE"
  52. PROPERTIES (
  53. 'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
  54. 'database' = 'hive_db',
  55. 'table' = 'hive_table',
  56. 'dfs.nameservices'='hacluster',
  57. 'dfs.ha.namenodes.hacluster'='n1,n2',
  58. 'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
  59. 'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
  60. 'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
  61. 'hadoop.security.authentication'='kerberos',
  62. 'dfs.namenode.kerberos.principal'='hadoop/_HOST@REALM.COM'
  63. 'hadoop.kerberos.principal'='doris_test@REALM.COM',
  64. 'hadoop.kerberos.keytab'='/path/to/doris_test.keytab'
  65. );
  66. -- Example 4: Create the hive_table table under hive_db in a Hive cluster with data stored on S3
  67. CREATE TABLE `t_hive` (
  68. `k1` int NOT NULL COMMENT "",
  69. `k2` char(10) NOT NULL COMMENT "",
  70. `k3` datetime NOT NULL COMMENT "",
  71. `k5` varchar(20) NOT NULL COMMENT "",
  72. `k6` double NOT NULL COMMENT ""
  73. ) ENGINE=HIVE
  74. COMMENT "HIVE"
  75. PROPERTIES (
  76. 'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
  77. 'database' = 'hive_db',
  78. 'table' = 'hive_table',
  79. 'AWS_ACCESS_KEY'='your_access_key',
  80. 'AWS_SECRET_KEY'='your_secret_key',
  81. 'AWS_ENDPOINT'='s3.us-east-1.amazonaws.com',
  82. 'AWS_REGION'='us-east-1'
  83. );

Parameter Description

  • External Table Columns
    • Column names should correspond to the Hive table
    • The order of the columns should be the same as the Hive table
    • Must contain all the columns in the Hive table
    • Hive table partition columns do not need to be specified, they can be defined as normal columns.
  • ENGINE should be specified as HIVE
  • PROPERTIES attribute.
    • hive.metastore.uris: Hive Metastore service address
    • database: the name of the database to which Hive is mounted
    • table: the name of the table to which Hive is mounted
    • hadoop.username: the username to visit HDFS (need to specify it when the authentication type is simple)
    • dfs.nameservices: the logical name for this new nameservice. See hdfs-site.xml
    • dfs.ha.namenodes.[nameservice ID]:unique identifiers for each NameNode in the nameservice. See hdfs-site.xml
    • dfs.namenode.rpc-address.[nameservice ID].[name node ID]:the fully-qualified RPC address for each NameNode to listen on. See hdfs-site.xml
    • dfs.client.failover.proxy.provider.[nameservice ID]:the Java class that HDFS clients use to contact the Active NameNode, usually it is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
  • For a kerberos enabled Hive datasource, additional properties need to be set:
    • dfs.namenode.kerberos.principal: HDFS namenode service principal
    • hadoop.security.authentication: HDFS authentication type please set kerberos, default simple
    • hadoop.kerberos.principal: The Kerberos pincipal that Doris will use when connectiong to HDFS.
    • hadoop.kerberos.keytab: HDFS client keytab location.
    • AWS_ACCESS_KEY: AWS access key id.
    • AWS_SECRET_KEY: AWS secret access key.
    • AWS_ENDPOINT: S3 endpoint. e.g. s3.us-east-1.amazonaws.com
    • AWS_REGION: AWS region. e.g. us-east-1

Note:

  • To enable Doris to access the hadoop cluster with kerberos authentication enabled, you need to deploy the Kerberos client kinit on the Doris all FE and BE nodes, configure krb5.conf, and fill in the KDC service information.
  • The value of the PROPERTIES property hadoop.kerberos.keytab needs to specify the absolute path of the keytab local file and allow the Doris process to access it.
  • The configuration of the HDFS cluster can be written into the hdfs-site.xml file. The configuration file is in the conf directory of fe and be. When users create a Hive table, they do not need to fill in the relevant information of the HDFS cluster configuration.

Data Type Matching

The supported Hive column types correspond to Doris in the following table.

HiveDorisDescription
BOOLEANBOOLEAN
CHARCHAROnly UTF8 encoding is supported
VARCHARVARCHAROnly UTF8 encoding is supported
TINYINTTINYINT
SMALLINTSMALLINT
INTINT
BIGINTBIGINT
FLOATFLOAT
DOUBLEDOUBLE
DECIMALDECIMAL
DATEDATE
TIMESTAMPDATETIMETimestamp to Datetime will lose precision

Note:

  • Hive table Schema changes are not automatically synchronized and require rebuilding the Hive external table in Doris.
  • The current Hive storage format only supports Text, Parquet and ORC types
  • The Hive version currently supported by default is 2.3.7、3.1.2, which has not been tested in other versions. More versions will be supported in the future.

Query Usage

After you finish building the Hive external table in Doris, it is no different from a normal Doris OLAP table except that you cannot use the data model in Doris (rollup, preaggregation, materialized view, etc.)

  1. select * from t_hive where k1 > 1000 and k3 = 'term' or k4 like '%doris';