Parquet Format
Format: Serialization Schema Format: Deserialization Schema
The Apache Parquet format allows to read and write Parquet data.
Dependencies
In order to use the Parquet format the following dependencies are required for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
| Maven dependency | SQL Client |
|---|---|
Copied to clipboard! | Download |
How to create a table with Parquet format
Here is an example to create a table using Filesystem connector and Parquet format.
CREATE TABLE user_behavior (user_id BIGINT,item_id BIGINT,category_id BIGINT,behavior STRING,ts TIMESTAMP(3),dt STRING) PARTITIONED BY (dt) WITH ('connector' = 'filesystem','path' = '/tmp/user_behavior','format' = 'parquet')
Format Options
| Option | Required | Default | Type | Description |
|---|---|---|---|---|
format | required | (none) | String | Specify what format to use, here should be ‘parquet’. |
parquet.utc-timezone | optional | false | Boolean | Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone. |
Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.
Data Type Mapping
Currently, Parquet format type mapping is compatible with Apache Hive, but different with Apache Spark:
- Timestamp: mapping timestamp type to int96 whatever the precision is.
- Decimal: mapping decimal type to fixed length byte array according to the precision.
The following table lists the type mapping from Flink type to Parquet type.
| Flink Data Type | Parquet type | Parquet logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 |
| BOOLEAN | BOOLEAN | |
| BINARY / VARBINARY | BINARY | |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
| TINYINT | INT32 | INT_8 |
| SMALLINT | INT32 | INT_16 |
| INT | INT32 | |
| BIGINT | INT64 | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| TIMESTAMP | INT96 | |
| ARRAY | - | LIST |
| MAP | - | MAP |
| ROW | - | STRUCT |