Hive-TsFile

About Hive-TsFile-Connector

Hive-TsFile-Connector adds support in Hive for external data sources of the TsFile type, so that users can operate on TsFiles through Hive.

With this connector, you can:

  • Load a single TsFile, from either the local file system or HDFS, into Hive
  • Load all files in a specific directory, from either the local file system or HDFS, into Hive
  • Query the TsFile through HQL

Note: The hive-connector does not currently support write operations, so INSERT statements in HQL are not allowed when operating on TsFiles through Hive.

System Requirements

Hadoop Version    Hive Version      Java Version    TsFile
2.7.3 or 3.2.1    2.3.6 or 3.1.2    1.8             0.10.0+

Note: For more information about how to download and use TsFile, please see the following link: https://github.com/apache/iotdb/tree/master/tsfile.

Data Type Correspondence

TsFile data type    Hive field type
BOOLEAN             Boolean
INT32               INT
INT64               BIGINT
FLOAT               Float
DOUBLE              Double
TEXT                STRING

Add Dependency For Hive

To use the hive-connector in Hive, add the hive-connector jar to Hive.

After downloading the IoTDB source code from https://github.com/apache/iotdb, run mvn clean package -pl hive-connector -am -Dmaven.test.skip=true -P get-jar-with-dependencies to build hive-connector-X.X.X-jar-with-dependencies.jar.

Then, in Hive, use the add jar XXX command to add the dependency. For example:

    hive> add jar /Users/hive/iotdb/hive-connector/target/hive-connector-0.10.0-jar-with-dependencies.jar;
    Added [/Users/hive/iotdb/hive-connector/target/hive-connector-0.10.0-jar-with-dependencies.jar] to class path
    Added resources: [/Users/hive/iotdb/hive-connector/target/hive-connector-0.10.0-jar-with-dependencies.jar]

Create TsFile-backed Hive tables

To create a TsFile-backed table, specify the serde as org.apache.iotdb.hive.TsFileSerDe, the inputformat as org.apache.iotdb.hive.TSFHiveInputFormat, and the outputformat as org.apache.iotdb.hive.TSFHiveOutputFormat.

Also provide a schema for the table that contains only two fields: time_stamp and sensor_id. time_stamp is the time value of the time series, and sensor_id is the name of the sensor to extract from the TsFile into Hive, such as sensor_1. The table name can be any valid Hive table name.

Also provide a location from which the hive-connector pulls the most current data for the table.

The location should be a specific directory on your local file system or, if Hadoop is set up, on HDFS. If it is on your local file system, the location should look like file:///data/data/sequence/root.baic2.WWS.leftfrontdoor/

Finally, set the device_id in TBLPROPERTIES to the name of the device you want to analyze.

For example:

    CREATE EXTERNAL TABLE IF NOT EXISTS only_sensor_1(
      time_stamp TIMESTAMP,
      sensor_1 BIGINT)
    ROW FORMAT SERDE 'org.apache.iotdb.hive.TsFileSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.iotdb.hive.TSFHiveInputFormat'
      OUTPUTFORMAT 'org.apache.iotdb.hive.TSFHiveOutputFormat'
    LOCATION '/data/data/sequence/root.baic2.WWS.leftfrontdoor/'
    TBLPROPERTIES ('device_id'='root.baic2.WWS.leftfrontdoor.plc1');

In this example, the data of root.baic2.WWS.leftfrontdoor.plc1.sensor_1 is pulled from the directory /data/data/sequence/root.baic2.WWS.leftfrontdoor/. Describing the resulting table gives the following:

    hive> describe only_sensor_1;
    OK
    time_stamp    timestamp    from deserializer
    sensor_1      bigint       from deserializer
    Time taken: 0.053 seconds, Fetched: 2 row(s)

At this point, the TsFile-backed table can be used in Hive like any other table.
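The same pattern applies to a sensor of another type, with the Hive column type chosen from the data type correspondence listed earlier. The following is a minimal sketch, assuming a hypothetical sensor_2 of TsFile type DOUBLE on the same device. The table name only_sensor_2 and the sensor itself are illustrative and are not part of the example data. The location is written with an explicit file:// scheme, as described for local directories.

```sql
-- Hypothetical: sensor_2 is assumed to be a DOUBLE series on the same device.
CREATE EXTERNAL TABLE IF NOT EXISTS only_sensor_2(
  time_stamp TIMESTAMP,
  sensor_2 DOUBLE)   -- TsFile DOUBLE corresponds to Hive Double
ROW FORMAT SERDE 'org.apache.iotdb.hive.TsFileSerDe'
STORED AS
  INPUTFORMAT 'org.apache.iotdb.hive.TSFHiveInputFormat'
  OUTPUTFORMAT 'org.apache.iotdb.hive.TSFHiveOutputFormat'
LOCATION 'file:///data/data/sequence/root.baic2.WWS.leftfrontdoor/'
TBLPROPERTIES ('device_id'='root.baic2.WWS.leftfrontdoor.plc1');
```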

Query from TsFile-backed Hive tables

Before running any queries, set hive.input.format in Hive by executing the following command:

    hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

We now have an external table named only_sensor_1 in Hive, and we can analyze it with any HQL query operation.

For example:

Select Clause Example

    hive> select * from only_sensor_1 limit 10;
    OK
    1     1000000
    2     1000001
    3     1000002
    4     1000003
    5     1000004
    6     1000005
    7     1000006
    8     1000007
    9     1000008
    10    1000009
    Time taken: 1.464 seconds, Fetched: 10 row(s)
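Filters and ordering can be combined with the projection in the usual HQL way. The following is a minimal sketch against the same table. The threshold value is chosen only for illustration, and the output is omitted because it depends on the data and on the execution engine.

```sql
-- Return the first rows whose sensor value is at or above an illustrative threshold.
SELECT time_stamp, sensor_1
FROM only_sensor_1
WHERE sensor_1 >= 1000005
ORDER BY time_stamp
LIMIT 5;
```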

Aggregate Clause Example

    hive> select count(*) from only_sensor_1;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = jackietien_20191016202416_d1e3e233-d367-4453-b39a-2aac9327a3b6
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Job running in-process (local Hadoop)
    2019-10-16 20:24:18,305 Stage-1 map = 0%, reduce = 0%
    2019-10-16 20:24:27,443 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_local867757288_0002
    MapReduce Jobs Launched:
    Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1000000
    Time taken: 11.334 seconds, Fetched: 1 row(s)
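Other built-in Hive aggregate functions can be applied to the sensor column in the same way. The following is a minimal sketch using min, max, and avg on the same table. The column aliases are illustrative, and no output is shown because the values depend on the data. As with count(*), such a query would typically be run as a job on the configured execution engine.

```sql
-- Summarize the sensor column with standard Hive aggregate functions.
SELECT min(sensor_1) AS min_value,
       max(sensor_1) AS max_value,
       avg(sensor_1) AS avg_value
FROM only_sensor_1;
```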