How do I: remove duplicates?

PRQL doesn’t have a specific distinct keyword. Instead duplicate tuples in a relation can be removed by using group and take 1:

PRQL

  1. from employees
  2. select department
  3. group employees.* (
  4. take 1
  5. )

SQL

  1. SELECT
  2. DISTINCT department
  3. FROM
  4. employees

This also works with a wildcard:

PRQL

  1. from employees
  2. group employees.* (take 1)

SQL

  1. SELECT
  2. DISTINCT *
  3. FROM
  4. employees

Remove duplicates from each group?

To select a single row from each group group can be combined with sort and take:

PRQL

  1. # youngest employee from each department
  2. from employees
  3. group department (
  4. sort age
  5. take 1
  6. )

SQL

  1. WITH table_0 AS (
  2. SELECT
  3. *,
  4. ROW_NUMBER() OVER (
  5. PARTITION BY department
  6. ORDER BY
  7. age
  8. ) AS _expr_0
  9. FROM
  10. employees
  11. )
  12. SELECT
  13. *
  14. FROM
  15. table_0
  16. WHERE
  17. _expr_0 <= 1

Note that we can’t always compile to DISTINCT; when the columns in the group aren’t all the available columns, we need to use a window function:

PRQL

  1. from employees
  2. group {first_name, last_name} (take 1)

SQL

  1. WITH table_0 AS (
  2. SELECT
  3. *,
  4. ROW_NUMBER() OVER (PARTITION BY first_name, last_name) AS _expr_0
  5. FROM
  6. employees
  7. )
  8. SELECT
  9. *
  10. FROM
  11. table_0
  12. WHERE
  13. _expr_0 <= 1

When compiling to Postgres or DuckDB dialect, such queries will be compiled to DISTINCT ON, which is the most performant option.

PRQL

  1. prql target:sql.postgres
  2. from employees
  3. group department (
  4. sort age
  5. take 1
  6. )

SQL

  1. SELECT
  2. DISTINCT ON (department) *
  3. FROM
  4. employees
  5. ORDER BY
  6. department,
  7. age