6.1.6. The GROUP BY clause

GROUP BY merges output rows that have the same combination of values in its item list into a single row. Aggregate functions in the select list are applied to each group individually instead of to the dataset as a whole.

If the select list only contains aggregate columns or, more generally, columns whose values don’t depend on individual rows in the underlying set, GROUP BY is optional. When omitted, the final result set of will consist of a single row (provided that at least one aggregated column is present).

If the select list contains both aggregate columns and columns whose values may vary per row, the GROUP BY clause becomes mandatory.

Syntax

  1. SELECT ... FROM ...
  2. GROUP BY <grouping-item> [, <grouping-item> ...]
  3. [HAVING <grouped-row-condition>]
  4. ...
  5. <grouping-item> ::=
  6. <non-aggr-select-item>
  7. | <non-aggr-expression>
  8. <non-aggr-select-item> ::=
  9. column-copy
  10. | column-alias
  11. | column-position
Table 66. Arguments for the GROUP BY Clause
ArgumentDescription

non-aggr-expression

Any non-aggregating expression that is not included in the SELECT list, i.e. unselected columns from the source set or expressions that do not depend on the data in the set at all

column-copy

A literal copy, from the SELECT list, of an expression that contains no aggregate function

column-alias

The alias, from the SELECT list, of an expression (column) that contains no aggregate function

column-position

The position number, in the SELECT list, of an expression (column) that contains no aggregate function

A general rule of thumb is that every non-aggregate item in the SELECT list must also be in the GROUP BY list. You can do this in three ways:

  1. By copying the item verbatim from the select list, e.g. “class” or “'D:' || upper(doccode)”.

  2. By specifying the column alias, if it exists.

  3. By specifying the column position as an integer literal between 1 and the number of columns. Integer values resulting from expressions or parameter substitutions are simply invariables and will be used as such in the grouping. They will have no effect though, as their value is the same for each row.

If you group by a column position, the expression at that position is copied internally from the select list. If it concerns a subquery, that subquery will be executed again in the grouping phase. That is to say, grouping by the column position, rather than duplicating the subquery expression in the grouping clause, saves keystrokes and bytes, but it is not a way of saving processing cycles!

In addition to the required items, the grouping list may also contain:

  • Columns from the source table that are not in the select list, or non-aggregate expressions based on such columns. Adding such columns may further subdivide the groups. But since these columns are not in the select list, you can’t tell which aggregated row corresponds to which value in the column. So, in general, if you are interested in this information, you also include the column or expression in the select list — which brings you back to the rule: “every non-aggregate column in the select list must also be in the grouping list”.

  • Expressions that aren’t dependent on the data in the underlying set, e.g. constants, context variables, single-value non-correlated subselects etc. This is only mentioned for completeness, as adding such items is utterly pointless: they don’t affect the grouping at all. “Harmless but useless” items like these may also figure in the select list without being copied to the grouping list.

Examples

When the select list contains only aggregate columns, GROUP BY is not mandatory:

  1. select count(*), avg(age) from students
  2. where sex = 'M';

This will return a single row listing the number of male students and their average age. Adding expressions that don’t depend on values in individual rows of table STUDENTS doesn’t change that:

  1. select count(*), avg(age), current_date from students
  2. where sex = 'M';

The row will now have an extra column showing the current date, but other than that, nothing fundamental has changed. A GROUP BY clause is still not required.

However, in both the above examples it is allowed. This is perfectly valid:

  1. select count(*), avg(age) from students
  2. where sex = 'M'
  3. group by class;

and will return a row for each class that has boys in it, listing the number of boys and their average age in that particular class. (If you also leave the current_date field in, this value will be repeated on every row, which is not very exciting.)

The above query has a major drawback though: it gives you information about the different classes, but it doesn’t tell you which row applies to which class. In order to get that extra bit of information, the non-aggregate column CLASS must be added to the select list:

  1. select class, count(*), avg(age) from students
  2. where sex = 'M'
  3. group by class;

Now we have a useful query. Notice that the addition of column CLASS also makes the GROUP BY clause mandatory. We can’t drop that clause anymore, unless we also remove CLASS from the column list.

The output of our last query may look something like this:

CLASSCOUNTAVG

2A

12

13.5

2B

9

13.9

3A

11

14.6

3B

12

14.4

…​

…​

…​

The headings “COUNT” and “AVG” are not very informative. In a simple case like this, you might get away with that, but in general you should give aggregate columns a meaningful name by aliasing them:

  1. select class,
  2. count(*) as num_boys,
  3. avg(age) as boys_avg_age
  4. from students
  5. where sex = 'M'
  6. group by class;

As you may recall from the formal syntax of the columns list, the AS keyword is optional.

Adding more non-aggregate (or rather: row-dependent) columns requires adding them to the GROUP BY clause too. For instance, you might want to see the above information for girls as well; and you may also want to differentiate between boarding and day students:

  1. select class,
  2. sex,
  3. boarding_type,
  4. count(*) as number,
  5. avg(age) as avg_age
  6. from students
  7. group by class, sex, boarding_type;

This may give you the following result:

CLASSSEXBOARDING_TYPENUMBERAVG_AGE

2A

F

BOARDING

9

13.3

2A

F

DAY

6

13.5

2A

M

BOARDING

7

13.6

2A

M

DAY

5

13.4

2B

F

BOARDING

11

13.7

2B

F

DAY

5

13.7

2B

M

BOARDING

6

13.8

…​

…​

…​

…​

…​

Each row in the result set corresponds to one particular combination of the variables class, sex and boarding type. The aggregate results — number and average age — are given for each of these rather specific groups individually. In a query like this, you don’t see a total for boys as a whole, or day students as a whole. That’s the tradeoff: the more non-aggregate columns you add, the more you can pinpoint very specific groups, but the more you also lose sight of the general picture. Of course, you can still obtain the “coarser” aggregates through separate queries.

HAVING

Just as a WHERE clause limits the rows in a dataset to those that meet the search condition, so the HAVING subclause imposes restrictions on the aggregated rows in a grouped set. HAVING is optional, and can only be used in conjunction with GROUP BY.

The condition(s) in the HAVING clause can refer to:

  • Any aggregated column in the select list. This is the most widely used alternative.

  • Any aggregated expression that is not in the select list, but allowed in the context of the query. This is sometimes useful too.

  • Any column in the GROUP BY list. While legal, it is more efficient to filter on these non-aggregated data at an earlier stage: in the WHERE clause.

  • Any expression whose value doesn’t depend on the contents of the dataset (like a constant or a context variable). This is valid but utterly pointless, because it will either suppress the entire set or leave it untouched, based on conditions that have nothing to do with the set itself.

A HAVING clause can not contain:

  • Non-aggregated column expressions that are not in the GROUP BY list.

  • Column positions. An integer in the HAVING clause is just an integer.

  • Column aliases –- not even if they appear in the GROUP BY clause!

Examples

Building on our earlier examples, this could be used to skip small groups of students:

  1. select class,
  2. count(*) as num_boys,
  3. avg(age) as boys_avg_age
  4. from students
  5. where sex = 'M'
  6. group by class
  7. having count(*) >= 5;

To select only groups that have a minimum age spread:

  1. select class,
  2. count(*) as num_boys,
  3. avg(age) as boys_avg_age
  4. from students
  5. where sex = 'M'
  6. group by class
  7. having max(age) - min(age) > 1.2;

Notice that if you’re really interested in this information, you’d normally include min(age) and max(age) -– or the expression “max(age) - min(age)” –- in the select list as well!

To include only 3rd classes:

  1. select class,
  2. count(*) as num_boys,
  3. avg(age) as boys_avg_age
  4. from students
  5. where sex = 'M'
  6. group by class
  7. having class starting with '3';

Better would be to move this condition to the WHERE clause:

  1. select class,
  2. count(*) as num_boys,
  3. avg(age) as boys_avg_age
  4. from students
  5. where sex = 'M' and class starting with '3'
  6. group by class;