topn_agg()

Produces an aggregate that can be passed to the topn function to calculate the n most-frequent values in a column.

  1. topn_agg (
  2. n INTEGER,
  3. value AnyElement
  4. ) RETURNS SpaceSavingAggregate
warning

Experimental features could have bugs. They might not be backwards compatible, and could be removed in future releases. Use these features at your own risk, and do not use any experimental features in production.

Required arguments

NameTypeDescription
nINTEGERThe target number of most-frequent values
valueAnyElementColumn to aggregate

Optional arguments

NameTypeDescription
skewDOUBLE PRECISIONThe estimated skew of the data, defined as the s parameter of a zeta distribution. Must be greater than 1.0. Defaults to 1.1.

topn_agg assumes that the data is skewed. In other words, some values are more frequent than others. The degree of skew is defined by the s parameter of a zeta distribution.

The default value of 1.1 works on data with the following, or a more extreme, distribution:

NMinimum percentage of all values represented by the top N (approximate)
520%
1025%
2030%
5036%
10040%

Returns

ColumnTypeDescription
aggSpaceSavingAggregateAn object storing the most-frequent values of the given column and their estimated frequency.

Sample usage

Create a topN aggregate over the country column of the users table. Targets the top 10 most-frequent values:

  1. SELECT toolkit_experimental.topn_agg(10, country) FROM users;

Create a topN aggregate over the type column of the devices table. Estimates the skew of the data to be 1.05, and targets the 5 most-frequent values:

  1. SELECT toolkit_experimental.topn_agg(5, 1.05, type) FROM devices;

Get the 20 most frequent zip_code values of the employees table. Uses topn_agg as an intermediate step. topn_agg creates an aggregate for use in the topn function:

  1. SELECT topn(topn_agg(20, zip_code)) FROM employees;