Parquet Format
Format: Serialization Schema Format: Deserialization Schema
The Apache Parquet format allows to read and write Parquet data.
Dependencies
In order to setup the Parquet format, the following table provides dependency information for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
Maven dependency | SQL Client JAR |
---|---|
flink-parquet_2.11 | Download |
How to create a table with Parquet format
Here is an example to create a table using Filesystem connector and Parquet format.
CREATE TABLE user_behavior (
user_id BIGINT,
item_id BIGINT,
category_id BIGINT,
behavior STRING,
ts TIMESTAMP(3),
dt STRING
) PARTITIONED BY (dt) WITH (
'connector' = 'filesystem',
'path' = '/tmp/user_behavior',
'format' = 'parquet'
)
Format Options
Option | Required | Default | Type | Description |
---|---|---|---|---|
format | required | (none) | String | Specify what format to use, here should be ‘parquet’. |
parquet.utc-timezone | optional | false | Boolean | Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone. |
Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP
to enable gzip compression.
Data Type Mapping
Currently, Parquet format type mapping is compatible with Apache Hive, but different with Apache Spark:
- Timestamp: mapping timestamp type to int96 whatever the precision is.
- Decimal: mapping decimal type to fixed length byte array according to the precision.
The following table lists the type mapping from Flink type to Parquet type.
Flink Data Type | Parquet type | Parquet logical type |
---|---|---|
CHAR / VARCHAR / STRING | BINARY | UTF8 |
BOOLEAN | BOOLEAN | |
BINARY / VARBINARY | BINARY | |
DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
TINYINT | INT32 | INT_8 |
SMALLINT | INT32 | INT_16 |
INT | INT32 | |
BIGINT | INT64 | |
FLOAT | FLOAT | |
DOUBLE | DOUBLE | |
DATE | INT32 | DATE |
TIME | INT32 | TIME_MILLIS |
TIMESTAMP | INT96 |
Attention Composite data type: Array, Map and Row are not supported.