查询外部数据

对文件进行查询

目前,我们支持 ParquetCSVNDJson 格式文件的查询。

Taxi Zone Lookup Table 数据为例。

bash

  1. curl "https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv" -o /tmp/taxi+_zone_lookup.csv

创建一个外部表:

sql

  1. CREATE EXTERNAL TABLE taxi_zone_lookup with (location='/tmp/taxi+_zone_lookup.csv',format='csv');

检查外部表的组织和结构:

sql

  1. DESC TABLE taxi_zone_lookup;

sql

  1. +--------------+--------+------+------+---------+---------------+
  2. | Column | Type | Key | Null | Default | Semantic Type |
  3. +--------------+--------+------+------+---------+---------------+
  4. | LocationID | Int64 | | YES | | FIELD |
  5. | Borough | String | | YES | | FIELD |
  6. | Zone | String | | YES | | FIELD |
  7. | service_zone | String | | YES | | FIELD |
  8. +--------------+--------+------+------+---------+---------------+
  9. 4 rows in set (0.00 sec)

现在就可以查询外部表了:

sql

  1. SELECT "Zone","Borough" FROM taxi_zone_lookup LIMIT 5;

sql

  1. +-------------------------+---------------+
  2. | Zone | Borough |
  3. +-------------------------+---------------+
  4. | Newark Airport | EWR |
  5. | Jamaica Bay | Queens |
  6. | Allerton/Pelham Gardens | Bronx |
  7. | Alphabet City | Manhattan |
  8. | Arden Heights | Staten Island |
  9. +-------------------------+---------------+

对目录进行查询

首先下载一些数据:

bash

  1. mkdir /tmp/external
  2. curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet" -o /tmp/external/yellow_tripdata_2022-01.parquet
  3. curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet" -o /tmp/external/yellow_tripdata_2022-02.parquet

验证下载情况:

bash

  1. ls -l /tmp/external
  2. total 165368
  3. -rw-r--r-- 1 wenyxu wheel 38139949 Apr 28 14:35 yellow_tripdata_2022-01.parquet
  4. -rw-r--r-- 1 wenyxu wheel 45616512 Apr 28 14:36 yellow_tripdata_2022-02.parquet

创建外部表

sql

  1. CREATE EXTERNAL TABLE yellow_tripdata with(location='/tmp/external/',format='parquet');

执行查询:

sql

  1. SELECT count(*) FROM yellow_tripdata;

sql

  1. +-----------------+
  2. | COUNT(UInt8(1)) |
  3. +-----------------+
  4. | 5443362 |
  5. +-----------------+
  6. 1 row in set (0.48 sec)

sql

  1. SELECT * FROM yellow_tripdata LIMIT 5;

sql

  1. +----------+--------------------------+--------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+
  2. | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
  3. +----------+--------------------------+--------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+
  4. | 1 | 2022-01-01 09:35:40+0900 | 2022-01-01 09:53:29+0900 | 2 | 3.8 | 1 | N | 142 | 236 | 1 | 14.5 | 3 | 0.5 | 3.65 | 0 | 0.3 | 21.95 | 2.5 | 0 |
  5. | 1 | 2022-01-01 09:33:43+0900 | 2022-01-01 09:42:07+0900 | 1 | 2.1 | 1 | N | 236 | 42 | 1 | 8 | 0.5 | 0.5 | 4 | 0 | 0.3 | 13.3 | 0 | 0 |
  6. | 2 | 2022-01-01 09:53:21+0900 | 2022-01-01 10:02:19+0900 | 1 | 0.97 | 1 | N | 166 | 166 | 1 | 7.5 | 0.5 | 0.5 | 1.76 | 0 | 0.3 | 10.56 | 0 | 0 |
  7. | 2 | 2022-01-01 09:25:21+0900 | 2022-01-01 09:35:23+0900 | 1 | 1.09 | 1 | N | 114 | 68 | 2 | 8 | 0.5 | 0.5 | 0 | 0 | 0.3 | 11.8 | 2.5 | 0 |
  8. | 2 | 2022-01-01 09:36:48+0900 | 2022-01-01 10:14:20+0900 | 1 | 4.3 | 1 | N | 68 | 163 | 1 | 23.5 | 0.5 | 0.5 | 3 | 0 | 0.3 | 30.3 | 2.5 | 0 |
  9. +----------+--------------------------+--------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+----------------------+-------------+
  10. 5 rows in set (0.11 sec)