Query the data

When you have your dataset loaded, you can start constructing some queries to discover what your data tells you. In this section, you learn how to write queries that answer these questions:

How many rides take place every day?

This dataset contains ride data for January 2016. To find out how many rides took place each day, you can use a SELECT statement. In this case, you want to count the total number of rides each day, and show them in a list by date.

Finding how many rides take place every day

  1. Connect to the Timescale Cloud database that contains the NYC taxi dataset.

  2. At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return a count of rides for each day:

    1. SELECT date_trunc('day', pickup_datetime) as day,
    2. COUNT(*) FROM rides
    3. WHERE pickup_datetime < '2016-01-08'
    4. GROUP BY day
    5. ORDER BY day;

    The result of the query looks like this:

    1. day | count
    2. ---------------------+--------
    3. 2016-01-01 00:00:00 | 345037
    4. 2016-01-02 00:00:00 | 312831
    5. 2016-01-03 00:00:00 | 302878
    6. 2016-01-04 00:00:00 | 316171
    7. 2016-01-05 00:00:00 | 343251
    8. 2016-01-06 00:00:00 | 348516
    9. 2016-01-07 00:00:00 | 364894

What is the average fare amount?

You can include a function in your SELECT query to determine the average fare paid by each passenger.

Finding the average fare amount

  1. Connect to the Timescale Cloud database that contains the NYC taxi dataset.

  2. At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return the average fare paid on each day:

    1. SELECT date_trunc('day', pickup_datetime)
    2. AS day, avg(fare_amount)
    3. FROM rides
    4. WHERE pickup_datetime < '2016-01-08'
    5. GROUP BY day
    6. ORDER BY day;

    The result of the query looks like this:

    1. day | avg
    2. ---------------------+---------------------
    3. 2016-01-01 00:00:00 | 12.8569325028909943
    4. 2016-01-02 00:00:00 | 12.4344713599355563
    5. 2016-01-03 00:00:00 | 13.0615900461571986
    6. 2016-01-04 00:00:00 | 12.2072927308323660
    7. 2016-01-05 00:00:00 | 12.0018670885154013
    8. 2016-01-06 00:00:00 | 12.0002329017893009
    9. 2016-01-07 00:00:00 | 12.1234180337303436

How many rides of each rate type were taken?

Taxis in New York City use a range of different rate types for different kinds of trips. For example, trips to the airport are charged at a flat rate from any location within the city. This section shows you how to construct a query that shows you the nuber of trips taken for each different fare type. It also uses a JOIN statement to present the data in a more informative way.

Finding the number of rides for each fare type

  1. Connect to the Timescale Cloud database that contains the NYC taxi dataset.

  2. At the psql prompt, use this query to select all rides taken in the first week of January 2016, and return the total number of trips taken for each rate code:

    1. SELECT rate_code, COUNT(vendor_id) AS num_trips
    2. FROM rides
    3. WHERE pickup_datetime < '2016-01-08'
    4. GROUP BY rate_code
    5. ORDER BY rate_code;

    The result of the query looks like this:

    1. rate_code | num_trips
    2. -----------+-----------
    3. 1 | 10626315
    4. 2 | 225019
    5. 3 | 16822
    6. 4 | 4696
    7. 5 | 33688
    8. 6 | 102
    9. 99 | 216

This output is correct, but it’s not very easy to read, because you probably don’t know what the different rate codes mean. However, the rates table in the dataset contains a human-readable description of each code. You can use a JOIN statement in your query to connect the rides and rates tables, and present information from both in your results.

Displaying the number of rides for each fare type

  1. Connect to the Timescale Cloud database that contains the NYC taxi dataset.

  2. At the psql prompt, copy this query to select all rides taken in the first week of January 2016, join the rides and rates tables, and return the total number of trips taken for each rate code, with a description of the rate code:

    1. SELECT rates.description, COUNT(vendor_id) AS num_trips
    2. FROM rides
    3. JOIN rates ON rides.rate_code = rates.rate_code
    4. WHERE pickup_datetime < '2016-01-08'
    5. GROUP BY rates.description
    6. ORDER BY LOWER(rates.description);

    The result of the query looks like this:

    1. description | num_trips
    2. -----------------------+-----------
    3. group ride | 17
    4. JFK | 54832
    5. Nassau or Westchester | 967
    6. negotiated fare | 7193
    7. Newark | 4126
    8. standard rate | 2266401

What kind of trips are going to and from airports

There are two primary airports in the dataset: John F. Kennedy airport, or JFK, is represented by rate code 2; Newark airport, or EWR, is represented by rate code 3.

Information about the trips that are going to and from the two airports is useful for city planning, as well as for organizations like the NYC Tourism Bureau.

This section shows you how to construct a query that returns trip information for trips going only to the new main airports.

Finding what kind of trips are going to and from airports

  1. Connect to the Timescale Cloud database that contains the NYC taxi dataset.

  2. At the psql prompt, use this query to select all rides taken to and from JFK and Newark airports, in the first week of January 2016, and return the number of trips to that airport, the average trip duration, average trip cost, and average number of passengers:

    1. SELECT rates.description,
    2. COUNT(vendor_id) AS num_trips,
    3. AVG(dropoff_datetime - pickup_datetime) AS avg_trip_duration,
    4. AVG(total_amount) AS avg_total,
    5. AVG(passenger_count) AS avg_passengers
    6. FROM rides
    7. JOIN rates ON rides.rate_code = rates.rate_code
    8. WHERE rides.rate_code IN (2,3) AND pickup_datetime < '2016-01-08'
    9. GROUP BY rates.description
    10. ORDER BY rates.description;

    The result of the query looks like this:

    1. description | num_trips | avg_trip_duration | avg_total | avg_passengers
    2. -------------+-----------+-------------------+---------------------+--------------------
    3. JFK | 54832 | 00:46:44.614222 | 63.7791311642836300 | 1.8062080536912752
    4. Newark | 4126 | 00:34:45.575618 | 84.3841783809985458 | 1.8979641299079011

How many rides took place on New Year’s Day 2016?

New York City is famous for the Ball Drop New Year’s Eve celebration in Times Square. Thousands of people gather to bring in the New Year and then head out into the city: to their favorite bar, to gather with friends for a meal, or back home. This section shows you how to construct a query that returns the number of taxi trips taken on 1 January, 2016, in 30 minute intervals.

In PotsgreSQL, it’s not particularly easy to segment the data by 30 minute time intervals. To do this, you would need to use a TRUNC function to calculate the quotient of the minute that a ride began in divided by 30, then truncate the result to take the floor of that quotient. When you had that result, you could multiply the truncated quotient by 30.

In your Timescale database, you can use the time_bucket function to segment the data into time intervals instead.

Finding how many rides took place on New Year’s Day 2016

  1. Connect to the Timescale Cloud database that contains the NYC taxi dataset.

  2. At the psql prompt, use this query to select all rides taken on the first day of January 2016, and return a count of rides for each 30 minute interval:

    1. SELECT time_bucket('30 minute', pickup_datetime) AS thirty_min, count(*)
    2. FROM rides
    3. WHERE pickup_datetime < '2016-01-02 00:00'
    4. GROUP BY thirty_min
    5. ORDER BY thirty_min;

    The result of the query starts like this:

    1. thirty_min | count
    2. ---------------------+-------
    3. 2016-01-01 00:00:00 | 10920
    4. 2016-01-01 00:30:00 | 14350
    5. 2016-01-01 01:00:00 | 14660
    6. 2016-01-01 01:30:00 | 13851
    7. 2016-01-01 02:00:00 | 13260
    8. 2016-01-01 02:30:00 | 12230
    9. 2016-01-01 03:00:00 | 11362