Aggregation

Aggregation is one of those capabilities that really make you appreciate the power of relational database systems. It allows you to move beyond merely persisting your data, into the realm of asking truly interesting questions that can be used to inform decision making. This category covers aggregation at length, making use of standard grouping as well as more recent window functions.

Count the number of facilities

For our first foray into aggregates, we’re going to stick to something simple. We want to know how many facilities exist - simply produce a total count.

  1. SELECT COUNT(facid) FROM facilities;
  1. query = Facility.select(fn.COUNT(Facility.facid))
  2. count = query.scalar()
  3. # OR:
  4. count = Facility.select().count()

Count the number of expensive facilities

Produce a count of the number of facilities that have a cost to guests of 10 or more.

  1. SELECT COUNT(facid) FROM facilities WHERE guestcost >= 10
  1. query = Facility.select(fn.COUNT(Facility.facid)).where(Facility.guestcost >= 10)
  2. count = query.scalar()
  3. # OR:
  4. # count = Facility.select().where(Facility.guestcost >= 10).count()

Count the number of recommendations each member makes.

Produce a count of the number of recommendations each member has made. Order by member ID.

  1. SELECT recommendedby, COUNT(memid) FROM members
  2. WHERE recommendedby IS NOT NULL
  3. GROUP BY recommendedby
  4. ORDER BY recommendedby
  1. query = (Member
  2. .select(Member.recommendedby, fn.COUNT(Member.memid))
  3. .where(Member.recommendedby.is_null(False))
  4. .group_by(Member.recommendedby)
  5. .order_by(Member.recommendedby))

List the total slots booked per facility

Produce a list of the total number of slots booked per facility. For now, just produce an output table consisting of facility id and slots, sorted by facility id.

  1. SELECT facid, SUM(slots) FROM bookings GROUP BY facid ORDER BY facid;
  1. query = (Booking
  2. .select(Booking.facid, fn.SUM(Booking.slots))
  3. .group_by(Booking.facid)
  4. .order_by(Booking.facid))

List the total slots booked per facility in a given month

Produce a list of the total number of slots booked per facility in the month of September 2012. Produce an output table consisting of facility id and slots, sorted by the number of slots.

  1. SELECT facid, SUM(slots)
  2. FROM bookings
  3. WHERE (date_trunc('month', starttime) = '2012-09-01'::dates)
  4. GROUP BY facid
  5. ORDER BY SUM(slots)
  1. query = (Booking
  2. .select(Booking.facility, fn.SUM(Booking.slots))
  3. .where(fn.date_trunc('month', Booking.starttime) == datetime.date(2012, 9, 1))
  4. .group_by(Booking.facility)
  5. .order_by(fn.SUM(Booking.slots)))

List the total slots booked per facility per month

Produce a list of the total number of slots booked per facility per month in the year of 2012. Produce an output table consisting of facility id and slots, sorted by the id and month.

  1. SELECT facid, date_part('month', starttime), SUM(slots)
  2. FROM bookings
  3. WHERE date_part('year', starttime) = 2012
  4. GROUP BY facid, date_part('month', starttime)
  5. ORDER BY facid, date_part('month', starttime)
  1. month = fn.date_part('month', Booking.starttime)
  2. query = (Booking
  3. .select(Booking.facility, month, fn.SUM(Booking.slots))
  4. .where(fn.date_part('year', Booking.starttime) == 2012)
  5. .group_by(Booking.facility, month)
  6. .order_by(Booking.facility, month))

Find the count of members who have made at least one booking

Find the total number of members who have made at least one booking.

  1. SELECT COUNT(DISTINCT memid) FROM bookings
  2. -- OR --
  3. SELECT COUNT(1) FROM (SELECT DISTINCT memid FROM bookings) AS _
  1. query = Booking.select(fn.COUNT(Booking.member.distinct()))
  2. # OR:
  3. query = Booking.select(Booking.member).distinct()
  4. count = query.count() # count() wraps in SELECT COUNT(1) FROM (...)

List facilities with more than 1000 slots booked

Produce a list of facilities with more than 1000 slots booked. Produce an output table consisting of facility id and hours, sorted by facility id.

  1. SELECT facid, SUM(slots) FROM bookings
  2. GROUP BY facid
  3. HAVING SUM(slots) > 1000
  4. ORDER BY facid;
  1. query = (Booking
  2. .select(Booking.facility, fn.SUM(Booking.slots))
  3. .group_by(Booking.facility)
  4. .having(fn.SUM(Booking.slots) > 1000)
  5. .order_by(Booking.facility))

Find the total revenue of each facility

Produce a list of facilities along with their total revenue. The output table should consist of facility name and revenue, sorted by revenue. Remember that there’s a different cost for guests and members!

  1. SELECT f.name, SUM(b.slots * (
  2. CASE WHEN b.memid = 0 THEN f.guestcost ELSE f.membercost END)) AS revenue
  3. FROM bookings AS b
  4. INNER JOIN facilities AS f ON b.facid = f.facid
  5. GROUP BY f.name
  6. ORDER BY revenue;
  1. revenue = fn.SUM(Booking.slots * Case(None, (
  2. (Booking.member == 0, Facility.guestcost),
  3. ), Facility.membercost))
  4. query = (Facility
  5. .select(Facility.name, revenue.alias('revenue'))
  6. .join(Booking)
  7. .group_by(Facility.name)
  8. .order_by(SQL('revenue')))

Find facilities with a total revenue less than 1000

Produce a list of facilities with a total revenue less than 1000. Produce an output table consisting of facility name and revenue, sorted by revenue. Remember that there’s a different cost for guests and members!

  1. SELECT f.name, SUM(b.slots * (
  2. CASE WHEN b.memid = 0 THEN f.guestcost ELSE f.membercost END)) AS revenue
  3. FROM bookings AS b
  4. INNER JOIN facilities AS f ON b.facid = f.facid
  5. GROUP BY f.name
  6. HAVING SUM(b.slots * ...) < 1000
  7. ORDER BY revenue;
  1. # Same definition as previous example.
  2. revenue = fn.SUM(Booking.slots * Case(None, (
  3. (Booking.member == 0, Facility.guestcost),
  4. ), Facility.membercost))
  5. query = (Facility
  6. .select(Facility.name, revenue.alias('revenue'))
  7. .join(Booking)
  8. .group_by(Facility.name)
  9. .having(revenue < 1000)
  10. .order_by(SQL('revenue')))

Output the facility id that has the highest number of slots booked

Output the facility id that has the highest number of slots booked.

  1. SELECT facid, SUM(slots) FROM bookings
  2. GROUP BY facid
  3. ORDER BY SUM(slots) DESC
  4. LIMIT 1
  1. query = (Booking
  2. .select(Booking.facility, fn.SUM(Booking.slots))
  3. .group_by(Booking.facility)
  4. .order_by(fn.SUM(Booking.slots).desc())
  5. .limit(1))
  6. # Retrieve multiple scalar values by calling scalar() with as_tuple=True.
  7. facid, nslots = query.scalar(as_tuple=True)

List the total slots booked per facility per month, part 2

Produce a list of the total number of slots booked per facility per month in the year of 2012. In this version, include output rows containing totals for all months per facility, and a total for all months for all facilities. The output table should consist of facility id, month and slots, sorted by the id and month. When calculating the aggregated values for all months and all facids, return null values in the month and facid columns.

Postgres ONLY.

  1. SELECT facid, date_part('month', starttime), SUM(slots)
  2. FROM booking
  3. WHERE date_part('year', starttime) = 2012
  4. GROUP BY ROLLUP(facid, date_part('month', starttime))
  5. ORDER BY facid, date_part('month', starttime)
  1. month = fn.date_part('month', Booking.starttime)
  2. query = (Booking
  3. .select(Booking.facility,
  4. month.alias('month'),
  5. fn.SUM(Booking.slots))
  6. .where(fn.date_part('year', Booking.starttime) == 2012)
  7. .group_by(fn.ROLLUP(Booking.facility, month))
  8. .order_by(Booking.facility, month))

List the total hours booked per named facility

Produce a list of the total number of hours booked per facility, remembering that a slot lasts half an hour. The output table should consist of the facility id, name, and hours booked, sorted by facility id.

  1. SELECT f.facid, f.name, SUM(b.slots) * .5
  2. FROM facilities AS f
  3. INNER JOIN bookings AS b ON (f.facid = b.facid)
  4. GROUP BY f.facid, f.name
  5. ORDER BY f.facid
  1. query = (Facility
  2. .select(Facility.facid, Facility.name, fn.SUM(Booking.slots) * .5)
  3. .join(Booking)
  4. .group_by(Facility.facid, Facility.name)
  5. .order_by(Facility.facid))

List each member’s first booking after September 1st 2012

Produce a list of each member name, id, and their first booking after September 1st 2012. Order by member ID.

  1. SELECT m.surname, m.firstname, m.memid, min(b.starttime) as starttime
  2. FROM members AS m
  3. INNER JOIN bookings AS b ON b.memid = m.memid
  4. WHERE starttime >= '2012-09-01'
  5. GROUP BY m.surname, m.firstname, m.memid
  6. ORDER BY m.memid;
  1. query = (Member
  2. .select(Member.surname, Member.firstname, Member.memid,
  3. fn.MIN(Booking.starttime).alias('starttime'))
  4. .join(Booking)
  5. .where(Booking.starttime >= datetime.date(2012, 9, 1))
  6. .group_by(Member.surname, Member.firstname, Member.memid)
  7. .order_by(Member.memid))

Produce a list of member names, with each row containing the total member count

Produce a list of member names, with each row containing the total member count. Order by join date.

Postgres ONLY (as written).

  1. SELECT COUNT(*) OVER(), firstname, surname
  2. FROM members ORDER BY joindate
  1. query = (Member
  2. .select(fn.COUNT(Member.memid).over(), Member.firstname,
  3. Member.surname)
  4. .order_by(Member.joindate))

Produce a numbered list of members

Produce a monotonically increasing numbered list of members, ordered by their date of joining. Remember that member IDs are not guaranteed to be sequential.

Postgres ONLY (as written).

  1. SELECT row_number() OVER (ORDER BY joindate), firstname, surname
  2. FROM members ORDER BY joindate;
  1. query = (Member
  2. .select(fn.row_number().over(order_by=[Member.joindate]),
  3. Member.firstname, Member.surname)
  4. .order_by(Member.joindate))

Output the facility id that has the highest number of slots booked, again

Output the facility id that has the highest number of slots booked. Ensure that in the event of a tie, all tieing results get output.

Postgres ONLY (as written).

  1. SELECT facid, total FROM (
  2. SELECT facid, SUM(slots) AS total,
  3. rank() OVER (order by SUM(slots) DESC) AS rank
  4. FROM bookings
  5. GROUP BY facid
  6. ) AS ranked WHERE rank = 1
  1. rank = fn.rank().over(order_by=[fn.SUM(Booking.slots).desc()])
  2. subq = (Booking
  3. .select(Booking.facility, fn.SUM(Booking.slots).alias('total'),
  4. rank.alias('rank'))
  5. .group_by(Booking.facility))
  6. # Here we use a plain Select() to create our query.
  7. query = (Select(columns=[subq.c.facid, subq.c.total])
  8. .from_(subq)
  9. .where(subq.c.rank == 1)
  10. .bind(db)) # We must bind() it to the database.
  11. # To iterate over the query results:
  12. for facid, total in query.tuples():
  13. print(facid, total)

Rank members by (rounded) hours used

Produce a list of members, along with the number of hours they’ve booked in facilities, rounded to the nearest ten hours. Rank them by this rounded figure, producing output of first name, surname, rounded hours, rank. Sort by rank, surname, and first name.

Postgres ONLY (as written).

  1. SELECT firstname, surname,
  2. ((SUM(bks.slots)+10)/20)*10 as hours,
  3. rank() over (order by ((sum(bks.slots)+10)/20)*10 desc) as rank
  4. FROM members AS mems
  5. INNER JOIN bookings AS bks ON mems.memid = bks.memid
  6. GROUP BY mems.memid
  7. ORDER BY rank, surname, firstname;
  1. hours = ((fn.SUM(Booking.slots) + 10) / 20) * 10
  2. query = (Member
  3. .select(Member.firstname, Member.surname, hours.alias('hours'),
  4. fn.rank().over(order_by=[hours.desc()]).alias('rank'))
  5. .join(Booking)
  6. .group_by(Member.memid)
  7. .order_by(SQL('rank'), Member.surname, Member.firstname))

Find the top three revenue generating facilities

Produce a list of the top three revenue generating facilities (including ties). Output facility name and rank, sorted by rank and facility name.

Postgres ONLY (as written).

  1. SELECT name, rank FROM (
  2. SELECT f.name, RANK() OVER (ORDER BY SUM(
  3. CASE WHEN memid = 0 THEN slots * f.guestcost
  4. ELSE slots * f.membercost END) DESC) AS rank
  5. FROM bookings
  6. INNER JOIN facilities AS f ON bookings.facid = f.facid
  7. GROUP BY f.name) AS subq
  8. WHERE rank <= 3
  9. ORDER BY rank;
  1. total_cost = fn.SUM(Case(None, (
  2. (Booking.member == 0, Booking.slots * Facility.guestcost),
  3. ), (Booking.slots * Facility.membercost)))
  4. subq = (Facility
  5. .select(Facility.name,
  6. fn.RANK().over(order_by=[total_cost.desc()]).alias('rank'))
  7. .join(Booking)
  8. .group_by(Facility.name))
  9. query = (Select(columns=[subq.c.name, subq.c.rank])
  10. .from_(subq)
  11. .where(subq.c.rank <= 3)
  12. .order_by(subq.c.rank)
  13. .bind(db)) # Here again we used plain Select, and call bind().

Classify facilities by value

Classify facilities into equally sized groups of high, average, and low based on their revenue. Order by classification and facility name.

Postgres ONLY (as written).

  1. SELECT name,
  2. CASE class WHEN 1 THEN 'high' WHEN 2 THEN 'average' ELSE 'low' END
  3. FROM (
  4. SELECT f.name, ntile(3) OVER (ORDER BY SUM(
  5. CASE WHEN memid = 0 THEN slots * f.guestcost ELSE slots * f.membercost
  6. END) DESC) AS class
  7. FROM bookings INNER JOIN facilities AS f ON bookings.facid = f.facid
  8. GROUP BY f.name
  9. ) AS subq
  10. ORDER BY class, name;
  1. cost = fn.SUM(Case(None, (
  2. (Booking.member == 0, Booking.slots * Facility.guestcost),
  3. ), (Booking.slots * Facility.membercost)))
  4. subq = (Facility
  5. .select(Facility.name,
  6. fn.NTILE(3).over(order_by=[cost.desc()]).alias('klass'))
  7. .join(Booking)
  8. .group_by(Facility.name))
  9. klass_case = Case(subq.c.klass, [(1, 'high'), (2, 'average')], 'low')
  10. query = (Select(columns=[subq.c.name, klass_case])
  11. .from_(subq)
  12. .order_by(subq.c.klass, subq.c.name)
  13. .bind(db))