Aggregate functions

Aggregate functions operate on subsets defined by the GROUP BY clause. In the absence of a GROUP BY clause, aggregate functions operate on all elements of the result set. You can use aggregate functions in the GROUP BY, SELECT, and HAVING clauses.

OpenSearch supports the following aggregate functions.

Function | Description
AVG | Returns the average of the results.
COUNT | Returns the number of results.
SUM | Returns the sum of the results.
MIN | Returns the minimum of the results.
MAX | Returns the maximum of the results.
VAR_POP or VARIANCE | Returns the population variance of the results after discarding nulls. Returns 0 when there is only one row of results.
VAR_SAMP | Returns the sample variance of the results after discarding nulls. Returns null when there is only one row of results.
STD or STDDEV | Returns the population standard deviation of the results. Returns 0 when there is only one row of results.
STDDEV_POP | Returns the population standard deviation of the results. Returns 0 when there is only one row of results.
STDDEV_SAMP | Returns the sample standard deviation of the results. Returns null when there is only one row of results.
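
The difference between the population and sample variants is easiest to see on a small input. The following is a minimal Python sketch, not OpenSearch code: it uses the standard `statistics` module as a stand-in, and the helper names `var_samp` and `stddev_samp` are hypothetical wrappers that return `None` (standing in for SQL null) when there is only one value.

```python
import statistics

def var_samp(values):
    """Sample variance (divides by n - 1); None when fewer than two values."""
    return statistics.variance(values) if len(values) > 1 else None

def stddev_samp(values):
    """Sample standard deviation; None when fewer than two values."""
    return statistics.stdev(values) if len(values) > 1 else None

two_rows = [32838, 4180]   # the two sales values of department 2
one_row = [1356]           # a single row of results

# Population variants divide by n, so a single row yields 0.
var_pop_two = statistics.pvariance(two_rows)
var_pop_one = statistics.pvariance(one_row)
stddev_pop_two = statistics.pstdev(two_rows)

# Sample variants divide by n - 1, so a single row has no defined result.
var_samp_two = var_samp(two_rows)
var_samp_one = var_samp(one_row)
stddev_samp_one = stddev_samp(one_row)
```

With two rows, the sample variance is exactly twice the population variance because the divisor drops from 2 to 1.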

The examples below reference an employees table. You can try out the examples by indexing the following documents into OpenSearch using the bulk index operation:

  PUT employees/_bulk?refresh
  {"index":{"_id":"1"}}
  {"employee_id": 1, "department":1, "firstname":"Amber", "lastname":"Duke", "sales":1356, "sale_date":"2020-01-23"}
  {"index":{"_id":"2"}}
  {"employee_id": 1, "department":1, "firstname":"Amber", "lastname":"Duke", "sales":39224, "sale_date":"2021-01-06"}
  {"index":{"_id":"6"}}
  {"employee_id":6, "department":1, "firstname":"Hattie", "lastname":"Bond", "sales":5686, "sale_date":"2021-06-07"}
  {"index":{"_id":"7"}}
  {"employee_id":6, "department":1, "firstname":"Hattie", "lastname":"Bond", "sales":12434, "sale_date":"2022-05-18"}
  {"index":{"_id":"13"}}
  {"employee_id":13,"department":2, "firstname":"Nanette", "lastname":"Bates", "sales":32838, "sale_date":"2022-04-11"}
  {"index":{"_id":"18"}}
  {"employee_id":18,"department":2, "firstname":"Dale", "lastname":"Adams", "sales":4180, "sale_date":"2022-11-05"}
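
If you generate the bulk request from application code, the body is newline-delimited JSON: an action line followed by a document line for each document, with a trailing newline at the end. The following Python sketch builds such a body for two of the documents above. The function name `build_bulk_body` is hypothetical, and sending the request to a cluster is left out.

```python
import json

def build_bulk_body(docs_by_id):
    """Return a newline-delimited JSON body: one action line, then one document line, per entry."""
    lines = []
    for doc_id, doc in docs_by_id.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"

docs = {
    "13": {"employee_id": 13, "department": 2, "firstname": "Nanette",
           "lastname": "Bates", "sales": 32838, "sale_date": "2022-04-11"},
    "18": {"employee_id": 18, "department": 2, "firstname": "Dale",
           "lastname": "Adams", "sales": 4180, "sale_date": "2022-11-05"},
}

body = build_bulk_body(docs)
```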

GROUP BY

The GROUP BY clause defines subsets of a result set. Aggregate functions operate on these subsets and return one result row for each subset.

You can use an identifier, ordinal, or expression in the GROUP BY clause.

Using an identifier in GROUP BY

You can specify the field name (column name) to aggregate on in the GROUP BY clause. For example, the following query returns the department numbers and the total sales for each department:

  SELECT department, sum(sales)
  FROM employees
  GROUP BY department;

department | sum(sales)
1 | 58700
2 | 37018

Using an ordinal in GROUP BY

You can specify the column number to aggregate on in the GROUP BY clause. The column number is determined by the column position in the SELECT clause. For example, the following query is equivalent to the query above. It returns the department numbers and the total sales for each department. It groups the results by the first column of the result set, which is department:

  SELECT department, sum(sales)
  FROM employees
  GROUP BY 1;

department | sum(sales)
1 | 58700
2 | 37018

Using an expression in GROUP BY

You can use an expression in the GROUP BY clause. For example, the following query returns the average sales for each year:

  SELECT year(sale_date), avg(sales)
  FROM employees
  GROUP BY year(sale_date);

year(sale_date) | avg(sales)
2020 | 1356.0
2021 | 22455.0
2022 | 16484.0
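
The three GROUP BY forms can be compared side by side outside of OpenSearch. The following Python sketch uses an in-memory SQLite database as a stand-in, so it illustrates the grouping semantics rather than OpenSearch itself. It loads a four-row excerpt of the employees data, and because SQLite has no `year()` function, it assumes `strftime('%Y', ...)` as a substitute.

```python
import sqlite3

# Four-row excerpt of the employees data: (employee_id, department, sales, sale_date)
rows = [
    (1, 1, 1356, "2020-01-23"),
    (1, 1, 39224, "2021-01-06"),
    (13, 2, 32838, "2022-04-11"),
    (18, 2, 4180, "2022-11-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, department INTEGER, sales INTEGER, sale_date TEXT)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)

# Group by identifier (the column name).
by_identifier = conn.execute(
    "SELECT department, sum(sales) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

# Group by ordinal (the first column of the SELECT list).
by_ordinal = conn.execute(
    "SELECT department, sum(sales) FROM employees GROUP BY 1 ORDER BY 1"
).fetchall()

# Group by expression (the year extracted from sale_date).
by_expression = conn.execute(
    "SELECT CAST(strftime('%Y', sale_date) AS INTEGER) AS sale_year, avg(sales) "
    "FROM employees "
    "GROUP BY CAST(strftime('%Y', sale_date) AS INTEGER) "
    "ORDER BY sale_year"
).fetchall()
```

Because the excerpt omits two rows, the totals for department 1 and the averages per year differ from the full-data results shown above.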

SELECT

You can use aggregate expressions in the SELECT clause either directly or as part of a larger expression. In addition, you can use expressions as arguments of aggregate functions.

Using aggregate expressions directly in SELECT

The following query returns the average sales for each department:

  SELECT department, avg(sales)
  FROM employees
  GROUP BY department;

department | avg(sales)
1 | 14675.0
2 | 18509.0

Using aggregate expressions as part of larger expressions in SELECT

The following query calculates the average commission for the employees of each department as 5% of the average sales:

  SELECT department, avg(sales) * 0.05 as avg_commission
  FROM employees
  GROUP BY department;

department | avg_commission
1 | 733.75
2 | 925.45

Using expressions as arguments to aggregate functions

The following query calculates the average commission amount for each department. First it calculates the commission amount for each sales value as 5% of the sales. Then it determines the average of all commission values:

  SELECT department, avg(sales * 0.05) as avg_commission
  FROM employees
  GROUP BY department;

department | avg_commission
1 | 733.75
2 | 925.45
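
The two commission queries return the same values because averaging is linear: multiplying every value by a constant before averaging gives the same result as multiplying the average by that constant. The following Python sketch checks this for the two sales values of department 2. It is only an illustration of the arithmetic, and it compares with a tolerance because floating-point results can differ in the last digits.

```python
import math
from statistics import mean

dept2_sales = [32838, 4180]  # sales values of department 2
rate = 0.05                  # commission rate

# Scale the aggregate: avg(sales) * 0.05
scaled_after = mean(dept2_sales) * rate

# Aggregate the scaled values: avg(sales * 0.05)
scaled_before = mean([s * rate for s in dept2_sales])

same_result = math.isclose(scaled_after, scaled_before)
```

This equivalence does not carry over to nonlinear expressions. For example, the average of squared values is generally not the square of the average.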

COUNT

The COUNT function accepts a field or expression, the wildcard *, or a literal, such as 1. The following table describes how the various forms of the COUNT function operate.

Function type | Description
COUNT(field) | Counts the number of rows where the value of the given field (or expression) is not null.
COUNT(*) | Counts the total number of rows in a table.
COUNT(1) (same as COUNT(*)) | Counts any non-null literal.

For example, the following query returns the count of sales for each year:

  SELECT year(sale_date), count(sales)
  FROM employees
  GROUP BY year(sale_date);

year(sale_date) | count(sales)
2020 | 1
2021 | 2
2022 | 3
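
The employees data contains no null values, so the forms of COUNT all return the same number there. The following Python sketch shows where they diverge. It uses an in-memory SQLite database as a stand-in for OpenSearch and adds a hypothetical third row with no sales value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        (1, 1356),
        (1, 39224),
        (99, None),  # hypothetical row whose sales value is null
    ],
)

# count(sales) skips the null; count(*) and count(1) count every row.
count_field, count_star, count_literal = conn.execute(
    "SELECT count(sales), count(*), count(1) FROM employees"
).fetchone()
```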

HAVING

Both WHERE and HAVING are used to filter results. The WHERE filter is applied before the GROUP BY phase, so you cannot use aggregate functions in a WHERE clause. However, you can use the WHERE clause to limit the rows to which the aggregate is then applied.

The HAVING filter is applied after the GROUP BY phase, so you can use the HAVING clause to limit the groups that are included in the results.
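
The order of the two filters changes the result. The following Python sketch runs the same HAVING condition with and without a WHERE clause. It uses an in-memory SQLite database as a stand-in for OpenSearch and a four-row excerpt of the employees data, and the thresholds are chosen only for illustration.

```python
import sqlite3

# Four-row excerpt of the employees data: (department, sales)
rows = [(1, 1356), (1, 39224), (2, 32838), (2, 4180)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department INTEGER, sales INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# HAVING only: every row contributes to its group's sum before groups are filtered.
having_only = conn.execute(
    "SELECT department, sum(sales) FROM employees "
    "GROUP BY department HAVING sum(sales) > 35000 ORDER BY department"
).fetchall()

# WHERE then HAVING: rows with sales <= 5000 are removed before the sums are computed.
where_then_having = conn.execute(
    "SELECT department, sum(sales) FROM employees "
    "WHERE sales > 5000 "
    "GROUP BY department HAVING sum(sales) > 35000 ORDER BY department"
).fetchall()
```

Without the WHERE clause, both departments pass the HAVING condition. With it, department 2 loses its smaller sale before aggregation, so its sum falls below the threshold and the group is dropped.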

HAVING with GROUP BY

You can use aggregate expressions or their aliases defined in a SELECT clause in a HAVING condition.

The following query uses an aggregate expression in the HAVING clause. It returns the number of sales for each employee who made more than one sale:

  SELECT employee_id, count(sales)
  FROM employees
  GROUP BY employee_id
  HAVING count(sales) > 1;

employee_id | count(sales)
1 | 2
6 | 2

The aggregations in a HAVING clause do not have to be the same as the aggregations in a SELECT list. The following query uses the count function in the HAVING clause but the sum function in the SELECT clause. It returns the total sales amount for each employee who made more than one sale:

  SELECT employee_id, sum(sales)
  FROM employees
  GROUP BY employee_id
  HAVING count(sales) > 1;

employee_id | sum(sales)
1 | 40580
6 | 18120

As an extension of the SQL standard, you are not restricted to using only identifiers in the GROUP BY clause. The following query uses an alias in the GROUP BY clause and is equivalent to the previous query:

  SELECT employee_id as id, sum(sales)
  FROM employees
  GROUP BY id
  HAVING count(sales) > 1;

id | sum(sales)
1 | 40580
6 | 18120

You can also use an alias for an aggregate expression in the HAVING clause. The following query returns the total sales for each department where sales exceed $40,000:

  SELECT department, sum(sales) as total
  FROM employees
  GROUP BY department
  HAVING total > 40000;

department | total
1 | 58700

If an identifier is ambiguous (for example, present both as a SELECT alias and as an index field), the alias takes precedence. In the following query, the identifier sales in the HAVING clause is replaced with the expression aliased in the SELECT clause, sum(sales), rather than being read as the sales field:

  SELECT department, sum(sales) as sales
  FROM employees
  GROUP BY department
  HAVING sales > 40000;

department | sales
1 | 58700

HAVING without GROUP BY

You can use a HAVING clause without a GROUP BY clause. In this case, the whole set of data is considered one group. The following query returns True if there is more than one value in the department column:

  SELECT 'True' as more_than_one_department FROM employees HAVING min(department) < max(department);

more_than_one_department
True

If all employees in the employees table belonged to the same department, the result would contain zero rows:

more_than_one_department
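
The behavior of HAVING without GROUP BY can be modeled as a single group that is either kept or dropped. The following Python sketch mirrors the query above in plain Python rather than in OpenSearch. The function name `more_than_one_department` is hypothetical, and an empty list stands in for a result with zero rows.

```python
def more_than_one_department(departments):
    """Treat all rows as one group; return one result row if min < max, otherwise no rows."""
    if departments and min(departments) < max(departments):
        return [{"more_than_one_department": "True"}]
    return []

# Department values from the employees data: four rows in department 1, two in department 2.
mixed = more_than_one_department([1, 1, 1, 1, 2, 2])

# All employees in the same department: the single group fails the condition.
uniform = more_than_one_department([1, 1, 1])
```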