Automatic DISTINCT

Pony tries to avoid duplicates in a query result by automatically adding the DISTINCT SQL keyword where it is necessary, because useful queries with duplicates are very rare. When someone wants to retrieve objects with a specific criteria, they typically don’t expect that the same object will be returned more than once. Also, avoiding duplicates makes the query result more predictable: you don’t need to filter duplicates out of a query result.

Pony adds the DISCTINCT keyword only when there could be potential duplicates. Let’s consider a couple of examples.

  1. Retrieving objects with a criteria:
  1. Person.select(lambda p: p.age > 20 and p.name == 'John')

In this example, the query doesn’t return duplicates, because the result contains the primary key column of a Person. Since duplicates are not possible here, there is no need in the DISTINCT keyword, and Pony doesn’t add it:

  1. SELECT "p"."id", "p"."name", "p"."age"
  2. FROM "Person" "p"
  3. WHERE "p"."age" > 20
  4. AND "p"."name" = 'John'
  1. Retrieving object attributes:
  1. select(p.name for p in Person)

The result of this query returns not objects, but its attribute. This query result can contain duplicates, so Pony will add DISTINCT to this query:

  1. SELECT DISTINCT "p"."name"
  2. FROM "Person" "p"

The result of a such query typically used for a dropdown list, where duplicates are not expected. It is not easy to come up with a real use-case when you want to have duplicates here.

If you need to count persons with the same name, you’d better use an aggregate query:

  1. select((p.name, count(p)) for p in Person)

But if it is absolutely necessary to get all person’s names, including duplicates, you can do so by using the Query.without_distinct() method:

  1. select(p.name for p in Person).without_distinct()
  1. Retrieving objects using joins:
  1. select(p for p in Person for c in p.cars if c.make in ("Toyota", "Honda"))

This query can contain duplicates, so Pony eliminates them using DISTINCT:

  1. SELECT DISTINCT "p"."id", "p"."name", "p"."age"
  2. FROM "Person" "p", "Car" "c"
  3. WHERE "c"."make" IN ('Toyota', 'Honda')
  4. AND "p"."id" = "c"."owner"

Without using DISTINCT the duplicates are possible, because the query uses two tables (Person and Car), but only one table is used in the SELECT section. The query above returns only persons (and not their cars), and therefore it is typically not desirable to get the same person in the result more than once. We believe that without duplicates the result looks more intuitive.

But if for some reason you don’t need to exclude duplicates, you always can add without_distinct() to the query:

  1. select(p for p in Person for c in p.cars
  2. if c.make in ("Toyota", "Honda")).without_distinct()

The user probably would like to see the Person objects duplicates if the query result contains cars owned by each person. In this case the Pony query would be different:

  1. select((p, c) for p in Person for c in p.cars if c.make in ("Toyota", "Honda"))

And in this case Pony will not add the DISTINCT keyword to SQL query.

To summarize:

  1. The principle “all queries do not return duplicates by default” is easy to understand and doesn’t lead to surprises.

  2. Such behavior is what most users want in most cases.

  3. Pony doesn’t add DISTINCT when a query is not supposed to have duplicates.

  4. The query method without_distinct() can be used for forcing Pony do not eliminate duplicates.