Top N objects per group

These examples describe several ways to query the top N items per group reasonably efficiently. For a thorough discussion of various techniques, check out my blog post Querying the top N objects per group with Peewee ORM.

In these examples we will use the User and Tweet models to find each user and their three most-recent tweets.

Postgres lateral joins

Lateral joins are a neat Postgres feature that allow reasonably efficient correlated subqueries. They are often described as SQL for each loops.

The desired SQL is:

  1. SELECT * FROM
  2. (SELECT id, username FROM user) AS uq
  3. LEFT JOIN LATERAL
  4. (SELECT message, created_date
  5. FROM tweet
  6. WHERE (user_id = uq.id)
  7. ORDER BY created_date DESC LIMIT 3)
  8. AS pq ON true

To accomplish this with peewee is quite straightforward:

  1. subq = (Tweet
  2. .select(Tweet.message, Tweet.created_date)
  3. .where(Tweet.user == User.id)
  4. .order_by(Tweet.created_date.desc())
  5. .limit(3))
  6.  
  7. query = (User
  8. .select(User, subq.c.content, subq.c.created_date)
  9. .join(subq, JOIN.LEFT_LATERAL)
  10. .order_by(User.username, subq.c.created_date.desc()))
  11.  
  12. # We queried from the "perspective" of user, so the rows are User instances
  13. # with the addition of a "content" and "created_date" attribute for each of
  14. # the (up-to) 3 most-recent tweets for each user.
  15. for row in query:
  16. print(row.username, row.content, row.created_date)

To implement an equivalent query from the “perspective” of the Tweet model, wecan instead write:

  1. # subq is the same as the above example.
  2. subq = (Tweet
  3. .select(Tweet.message, Tweet.created_date)
  4. .where(Tweet.user == User.id)
  5. .order_by(Tweet.created_date.desc())
  6. .limit(3))
  7.  
  8. query = (Tweet
  9. .select(User.username, subq.c.content, subq.c.created_date)
  10. .from_(User)
  11. .join(subq, JOIN.LEFT_LATERAL)
  12. .order_by(User.username, subq.c.created_date.desc()))
  13.  
  14. # Each row is a "tweet" instance with an additional "username" attribute.
  15. # This will print the (up-to) 3 most-recent tweets from each user.
  16. for tweet in query:
  17. print(tweet.username, tweet.content, tweet.created_date)

Window functions

Window functions, which are supported by peewee, provide scalable, efficient performance.

The desired SQL is:

  1. SELECT subq.message, subq.username
  2. FROM (
  3. SELECT
  4. t2.message,
  5. t3.username,
  6. RANK() OVER (
  7. PARTITION BY t2.user_id
  8. ORDER BY t2.created_date DESC
  9. ) AS rnk
  10. FROM tweet AS t2
  11. INNER JOIN user AS t3 ON (t2.user_id = t3.id)
  12. ) AS subq
  13. WHERE (subq.rnk <= 3)

To accomplish this with peewee, we will wrap the ranked Tweets in an outer query that performs the filtering.

  1. TweetAlias = Tweet.alias()
  2.  
  3. # The subquery will select the relevant data from the Tweet and
  4. # User table, as well as ranking the tweets by user from newest
  5. # to oldest.
  6. subquery = (TweetAlias
  7. .select(
  8. TweetAlias.message,
  9. User.username,
  10. fn.RANK().over(
  11. partition_by=[TweetAlias.user],
  12. order_by=[TweetAlias.created_date.desc()]).alias('rnk'))
  13. .join(User, on=(TweetAlias.user == User.id))
  14. .alias('subq'))
  15.  
  16. # Since we can't filter on the rank, we are wrapping it in a query
  17. # and performing the filtering in the outer query.
  18. query = (Tweet
  19. .select(subquery.c.message, subquery.c.username)
  20. .from_(subquery)
  21. .where(subquery.c.rnk <= 3))

Other methods

If you’re not using Postgres, then unfortunately you’re left with options that exhibit less-than-ideal performance. For a more complete overview of common methods, check out this blog post. Below I will summarize the approaches and the corresponding SQL.

Using COUNT, we can get all tweets where there exist less than N tweets with more recent timestamps:

  1. TweetAlias = Tweet.alias()
  2.  
  3. # Create a correlated subquery that calculates the number of
  4. # tweets with a higher (newer) timestamp than the tweet we're
  5. # looking at in the outer query.
  6. subquery = (TweetAlias
  7. .select(fn.COUNT(TweetAlias.id))
  8. .where(
  9. (TweetAlias.created_date >= Tweet.created_date) &
  10. (TweetAlias.user == Tweet.user)))
  11.  
  12. # Wrap the subquery and filter on the count.
  13. query = (Tweet
  14. .select(Tweet, User)
  15. .join(User)
  16. .where(subquery <= 3))

We can achieve similar results by doing a self-join and performing the filtering in the HAVING clause:

  1. TweetAlias = Tweet.alias()
  2.  
  3. # Use a self-join and join predicates to count the number of
  4. # newer tweets.
  5. query = (Tweet
  6. .select(Tweet.id, Tweet.message, Tweet.user, User.username)
  7. .join(User)
  8. .switch(Tweet)
  9. .join(TweetAlias, on=(
  10. (TweetAlias.user == Tweet.user) &
  11. (TweetAlias.created_date >= Tweet.created_date)))
  12. .group_by(Tweet.id, Tweet.content, Tweet.user, User.username)
  13. .having(fn.COUNT(Tweet.id) <= 3))

The last example uses a LIMIT clause in a correlated subquery.

  1. TweetAlias = Tweet.alias()
  2.  
  3. # The subquery here will calculate, for the user who created the
  4. # tweet in the outer loop, the three newest tweets. The expression
  5. # will evaluate to `True` if the outer-loop tweet is in the set of
  6. # tweets represented by the inner query.
  7. query = (Tweet
  8. .select(Tweet, User)
  9. .join(User)
  10. .where(Tweet.id << (
  11. TweetAlias
  12. .select(TweetAlias.id)
  13. .where(TweetAlias.user == Tweet.user)
  14. .order_by(TweetAlias.created_date.desc())
  15. .limit(3))))