Complete a NYC Test with MatrixOne

New York City (NYC) Taxi data set captures detailed information on billions of individual taxi trips in New York City, including pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts(Most of the raw data comes from the NYC Taxi & Limousine Commission).

By going through this tutorial, you’ll learn how to complete some queries on NYC Taxi data with MatrixOne.

For detail description and instructions for downloading about NYC Taxi Data, you can see:

Before you begin

Make sure you have already installed MatrixOne and connected to MatrixOne Server.

1. Download and Import data

This section of the tutorial references here, and you can get original information about how to download and import raw data there.

The data set has 1.7 billion rows data and takes up 450 GB of space, so make sure there are enough space to hold the data.

  • Install PostgreSQL and PostGIS

Both are available via Homebrew on Mac.

  • Download raw data
  1. ./download_raw_data.sh && ./remove_bad_rows.sh

The remove_bad_rows.sh script fixes two particular files that have a few rows with too many columns.
For more detailed information about this, you can see the original references.

  • Initialize database and set up schema
  1. ./initialize_database.sh
  • Import taxi and FHV data
  1. ./import_trip_data.sh
  2. ./import_fhv_trip_data.sh
  • Optional: download and import 2014 Uber data

The FiveThirtyEight Uber dataset contains Uber trip records from Apr–Sep 2014. Uber and other FHV (Lyft, Juno, Via, etc.) data is available since Jan 2015 in the TLC’s data.

  1. ./download_raw_2014_uber_data.sh
  2. ./import_2014_uber_trip_data.sh

2. Exporting the data from PostgreSQL

  1. COPY
  2. (
  3. SELECT trips.id,
  4. trips.vendor_id,
  5. trips.pickup_datetime,
  6. trips.dropoff_datetime,
  7. trips.store_and_fwd_flag,
  8. trips.rate_code_id,
  9. trips.pickup_longitude,
  10. trips.pickup_latitude,
  11. trips.dropoff_longitude,
  12. trips.dropoff_latitude,
  13. trips.passenger_count,
  14. trips.trip_distance,
  15. trips.fare_amount,
  16. trips.extra,
  17. trips.mta_tax,
  18. trips.tip_amount,
  19. trips.tolls_amount,
  20. trips.ehail_fee,
  21. trips.improvement_surcharge,
  22. trips.total_amount,
  23. trips.payment_type,
  24. trips.trip_type,
  25. trips.pickup_location_id,
  26. trips.dropoff_location_id,
  27. cab_types.type cab_type,
  28. weather.precipitation rain,
  29. weather.snow_depth,
  30. weather.snowfall,
  31. weather.max_temperature max_temp,
  32. weather.min_temperature min_temp,
  33. weather.average_wind_speed wind,
  34. pick_up.gid pickup_nyct2010_gid,
  35. pick_up.ctlabel pickup_ctlabel,
  36. pick_up.borocode pickup_borocode,
  37. pick_up.boroname pickup_boroname,
  38. pick_up.ct2010 pickup_ct2010,
  39. pick_up.boroct2010 pickup_boroct2010,
  40. pick_up.cdeligibil pickup_cdeligibil,
  41. pick_up.ntacode pickup_ntacode,
  42. pick_up.ntaname pickup_ntaname,
  43. pick_up.puma pickup_puma,
  44. drop_off.gid dropoff_nyct2010_gid,
  45. drop_off.ctlabel dropoff_ctlabel,
  46. drop_off.borocode dropoff_borocode,
  47. drop_off.boroname dropoff_boroname,
  48. drop_off.ct2010 dropoff_ct2010,
  49. drop_off.boroct2010 dropoff_boroct2010,
  50. drop_off.cdeligibil dropoff_cdeligibil,
  51. drop_off.ntacode dropoff_ntacode,
  52. drop_off.ntaname dropoff_ntaname,
  53. drop_off.puma dropoff_puma
  54. FROM trips
  55. LEFT JOIN cab_types
  56. ON trips.cab_type_id = cab_types.id
  57. LEFT JOIN central_park_weather_observations weather
  58. ON weather.date = trips.pickup_datetime::date
  59. LEFT JOIN nyct2010 pick_up
  60. ON pick_up.gid = trips.pickup_nyct2010_gid
  61. LEFT JOIN nyct2010 drop_off
  62. ON drop_off.gid = trips.dropoff_nyct2010_gid
  63. ) TO '/matrixone/export_data/trips.tsv';

3. Create tables in MatrixOne

  1. CREATE TABLE trips
  2. (
  3. trip_id int unsigned,
  4. vendor_id varchar(64),
  5. pickup_datetime bigint unsigned,
  6. dropoff_datetime bigint unsigned,
  7. store_and_fwd_flag char(1),
  8. rate_code_id smallint unsigned,
  9. pickup_longitude double,
  10. pickup_latitude double,
  11. dropoff_longitude double,
  12. dropoff_latitude double,
  13. passenger_count smallint unsigned,
  14. trip_distance double,
  15. distance bigint,
  16. fare_amount float,
  17. extra float,
  18. mta_tax float,
  19. tip_amount float,
  20. tolls_amount float,
  21. ehail_fee float,
  22. improvement_surcharge float,
  23. total_amount float,
  24. payment_type varchar(64),
  25. trip_type smallint unsigned,
  26. pickup varchar(64),
  27. dropoff varchar(64),
  28. cab_type varchar(64),
  29. precipitation float,
  30. snow_depth float,
  31. snowfall float,
  32. max_temperature smallint,
  33. min_temperature smallint,
  34. average_wind_speed float,
  35. pickup_nyct2010_gid smallint unsigned,
  36. pickup_ctlabel varchar(64),
  37. pickup_borocode smallint unsigned,
  38. pickup_boroname varchar(64),
  39. pickup_ct2010 varchar(64),
  40. pickup_boroct2010 varchar(64),
  41. pickup_cdeligibil char(1),
  42. pickup_ntacode varchar(64),
  43. pickup_ntaname varchar(64),
  44. pickup_puma varchar(64),
  45. dropoff_nyct2010_gid smallint unsigned,
  46. dropoff_ctlabel varchar(64),
  47. dropoff_borocode smallint unsigned,
  48. dropoff_boroname varchar(64),
  49. dropoff_ct2010 varchar(64),
  50. dropoff_boroct2010 varchar(64),
  51. dropoff_cdeligibil varchar(64),
  52. dropoff_ntacode varchar(64),
  53. dropoff_ntaname varchar(64),
  54. dropoff_puma varchar(64)
  55. ) ;

4. Insert data into the created tables

  1. load data infile '/matrixone/export_data/trips.tsv ' into table trips FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

Then you can query data in MatrixOne with the created table.

5. Run Queries

  1. # Q1
  2. SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
  3. # Q2
  4. SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
  5. # Q3
  6. SELECT passenger_count, year(pickup_datetime) as year, count(*) FROM trips GROUP BY passenger_count, year;
  7. # Q4
  8. SELECT passenger_count, year(pickup_datetime) as year, round(trip_distance) AS distance, count(*) as count
  9. FROM trips
  10. GROUP BY passenger_count, year, distance
  11. ORDER BY year,count DESC;