HBase SQL Connector

Scan Source: Bounded Lookup Source: Sync Mode Sink: Batch Sink: Streaming Upsert Mode

The HBase connector allows for reading from and writing to an HBase cluster. This document describes how to setup the HBase Connector to run SQL queries against HBase.

HBase always works in upsert mode for exchange changelog messages with the external system using a primary key defined on the DDL. The primary key must be defined on the HBase rowkey field (rowkey field must be declared). If the PRIMARY KEY clause is not declared, the HBase connector will take rowkey as the primary key by default.

Dependencies

There is no connector (yet) available for Flink version 1.19.

The HBase connector is not part of the binary distribution. See how to link with it for cluster execution here.

How to use HBase table

All the column families in HBase table must be declared as ROW type, the field name maps to the column family name, and the nested field names map to the column qualifier names. There is no need to declare all the families and qualifiers in the schema, users can declare what’s used in the query. Except the ROW type fields, the single atomic type field (e.g. STRING, BIGINT) will be recognized as HBase rowkey. The rowkey field can be arbitrary name, but should be quoted using backticks if it is a reserved keyword.

  1. -- register the HBase table 'mytable' in Flink SQL
  2. CREATE TABLE hTable (
  3. rowkey INT,
  4. family1 ROW<q1 INT>,
  5. family2 ROW<q2 STRING, q3 BIGINT>,
  6. family3 ROW<q4 DOUBLE, q5 BOOLEAN, q6 STRING>,
  7. PRIMARY KEY (rowkey) NOT ENFORCED
  8. ) WITH (
  9. 'connector' = 'hbase-1.4',
  10. 'table-name' = 'mytable',
  11. 'zookeeper.quorum' = 'localhost:2181'
  12. );
  13. -- use ROW(...) construction function construct column families and write data into the HBase table.
  14. -- assuming the schema of "T" is [rowkey, f1q1, f2q2, f2q3, f3q4, f3q5, f3q6]
  15. INSERT INTO hTable
  16. SELECT rowkey, ROW(f1q1), ROW(f2q2, f2q3), ROW(f3q4, f3q5, f3q6) FROM T;
  17. -- scan data from the HBase table
  18. SELECT rowkey, family1, family3.q4, family3.q6 FROM hTable;
  19. -- temporal join the HBase table as a dimension table
  20. SELECT * FROM myTopic
  21. LEFT JOIN hTable FOR SYSTEM_TIME AS OF myTopic.proctime
  22. ON myTopic.key = hTable.rowkey;

Available Metadata

The following connector metadata can be accessed as metadata columns in a table definition.

The R/W column defines whether a metadata field is readable (R) and/or writable (W). Read-only columns must be declared VIRTUAL to exclude them during an INSERT INTO operation.

KeyData TypeDescriptionR/W
timestampTIMESTAMP_LTZ(3) NOT NULLTimestamp for the HBase mutation.W
ttlBIGINT NOT NULLTime-to-live for the HBase mutation, in milliseconds.W

Connector Options

OptionRequiredForwardedDefaultTypeDescription
connector
requiredno(none)StringSpecify what connector to use, valid values are:
  • hbase-1.4: connect to HBase 1.4.x cluster
  • hbase-2.2: connect to HBase 2.2.x cluster
table-name
requiredyes(none)StringThe name of HBase table to connect. By default, the table is in ‘default’ namespace. To assign the table a specified namespace you need to use ‘namespace:table’.
zookeeper.quorum
requiredyes(none)StringThe HBase Zookeeper quorum.
zookeeper.znode.parent
optionalyes/hbaseStringThe root dir in Zookeeper for HBase cluster.
null-string-literal
optionalyesnullStringRepresentation for null values for string fields. HBase source and sink encodes/decodes empty bytes as null values for all types except string type.
sink.buffer-flush.max-size
optionalyes2mbMemorySizeWriting option, maximum size in memory of buffered rows for each writing request. This can improve performance for writing data to HBase database, but may increase the latency. Can be set to ‘0’ to disable it.
sink.buffer-flush.max-rows
optionalyes1000IntegerWriting option, maximum number of rows to buffer for each writing request. This can improve performance for writing data to HBase database, but may increase the latency. Can be set to ‘0’ to disable it.
sink.buffer-flush.interval
optionalyes1sDurationWriting option, the interval to flush any buffered rows. This can improve performance for writing data to HBase database, but may increase the latency. Can be set to ‘0’ to disable it. Note, both ‘sink.buffer-flush.max-size’ and ‘sink.buffer-flush.max-rows’ can be set to ‘0’ with the flush interval set allowing for complete async processing of buffered actions.
sink.ignore-null-value
optionalyesfalseBooleanWriting option, whether ignore null value or not.
sink.parallelism
optionalno(none)IntegerDefines the parallelism of the HBase sink operator. By default, the parallelism is determined by the framework using the same parallelism of the upstream chained operator.
lookup.async
optionalnofalseBooleanWhether async lookup are enabled. If true, the lookup will be async. Note, async only supports hbase-2.2 connector.
lookup.cache
optionalyesNONE

Enum

Possible values: NONE, PARTIAL
The cache strategy for the lookup table. Currently supports NONE (no caching) and PARTIAL (caching entries on lookup operation in external database).
lookup.partial-cache.max-rows
optionalyes(none)LongThe max number of rows of lookup cache, over this value, the oldest rows will be expired. “lookup.cache” must be set to “PARTIAL” to use this option.
lookup.partial-cache.expire-after-write
optionalyes(none)DurationThe max time to live for each rows in lookup cache after writing into the cache “lookup.cache” must be set to “PARTIAL” to use this option.
lookup.partial-cache.expire-after-access
optionalyes(none)DurationThe max time to live for each rows in lookup cache after accessing the entry in the cache. “lookup.cache” must be set to “PARTIAL” to use this option.
lookup.partial-cache.caching-missing-key
optionalyestrueBooleanWhether to store an empty value into the cache if the lookup key doesn’t match any rows in the table. “lookup.cache” must be set to “PARTIAL” to use this option.
lookup.max-retries
optionalyes3IntegerThe max retry times if lookup database failed.
properties.*
optionalno(none)StringThis can set and pass arbitrary HBase configurations. Suffix names must match the configuration key defined in HBase Configuration documentation. Flink will remove the “properties.” key prefix and pass the transformed key and values to the underlying HBaseClient. For example, you can add a kerberos authentication parameter ‘properties.hbase.security.authentication’ = ‘kerberos’.

Deprecated Options

These deprecated options has been replaced by new options listed above and will be removed eventually. Please consider using new options first.

OptionRequiredForwardedDefaultTypeDescription
lookup.cache.max-rows
optionalyes(none)IntegerPlease set “lookup.cache” = “PARTIAL” and use “lookup.partial-cache.max-rows” instead.
lookup.cache.ttl
optionalyes(none)DurationPlease set “lookup.cache” = “PARTIAL” and use “lookup.partial-cache.expire-after-write” instead.

Data Type Mapping

HBase stores all data as byte arrays. The data needs to be serialized and deserialized during read and write operation

When serializing and de-serializing, Flink HBase connector uses utility class org.apache.hadoop.hbase.util.Bytes provided by HBase (Hadoop) to convert Flink Data Types to and from byte arrays.

Flink HBase connector encodes null values to empty bytes, and decode empty bytes to null values for all data types except string type. For string type, the null literal is determined by null-string-literal option.

The data type mappings are as follows:

Flink SQL typeHBase conversion
CHAR / VARCHAR / STRING
  1. byte[] toBytes(String s)
  2. String toString(byte[] b)
BOOLEAN
  1. byte[] toBytes(boolean b)
  2. boolean toBoolean(byte[] b)
BINARY / VARBINARYReturns byte[] as is.
DECIMAL
  1. byte[] toBytes(BigDecimal v)
  2. BigDecimal toBigDecimal(byte[] b)
TINYINT
  1. new byte[] { val }
  2. bytes[0] // returns first and only byte from bytes
SMALLINT
  1. byte[] toBytes(short val)
  2. short toShort(byte[] bytes)
INT
  1. byte[] toBytes(int val)
  2. int toInt(byte[] bytes)
BIGINT
  1. byte[] toBytes(long val)
  2. long toLong(byte[] bytes)
FLOAT
  1. byte[] toBytes(float val)
  2. float toFloat(byte[] bytes)
DOUBLE
  1. byte[] toBytes(double val)
  2. double toDouble(byte[] bytes)
DATEStores the number of days since epoch as int value.
TIMEStores the number of milliseconds of the day as int value.
TIMESTAMPStores the milliseconds since epoch as long value.
ARRAYNot supported
MAP / MULTISETNot supported
ROWNot supported