hyperloglog() and approx_count_distinct() functions
Aggregate
Alternate aggregate
Accessor
Rollup
Function details
Extended examples
- Roll up two hyperloglogs
Approximate relative errors

hyperloglog() and approx_count_distinct() functions

Introduction

Estimate the number of distinct values in a dataset. This is also known as cardinality estimation. For large datasets and datasets with high cardinality (many distinct values), this can be much more efficient in both CPU and memory than an exact count using count(DISTINCT).

The estimation uses the hyperloglog++ algorithm. If you aren’t sure what parameters to set for the hyperloglog, try using the approx_count_distinct aggregate, which sets some reasonable default values.

Two-step aggregation

Hide content

This group of functions uses the two-step aggregation pattern.

Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function.

Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions.

The two-step aggregation pattern has several advantages:

More efficient because multiple accessors can reuse the same aggregate
Easier to reason about performance, because aggregation is separate from final computation
Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
Can perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result

To learn more, see the blog post on two-step aggregates.

Functions in this group

warning

This function group includes some experimental functions. Experimental functions might change or be removed in future releases. We do not recommend using them in production. Experimental functions are marked with an Experimental tag.

Aggregate

hyperloglog

Aggregate data into a hyperloglog for approximate counting

Alternate aggregate

approx_count_distinct

ExperimentalAggregate data into a hyperloglog for approximate counting without specifying the number of buckets

Accessor

distinct_count

Estimate the number of distinct values from a hyperloglog

stderror

Estimate the relative standard error of a hyperloglog

Rollup

rollup

Roll up multiple hyperloglogs

Function details

hyperloglog()

Stabilized in Toolkit v1.3.0

Hide content

hyperloglog(

    buckets INTEGER,
    value AnyElement

) RETURNS Hyperloglog

This is the first step for estimating the approximate number of distinct values using the hyperloglog algorithm. Use hyperloglog to create an intermediate aggregate from your raw data. This intermediate form can then be used by one or more accessors in this group to compute final results.

Optionally, multiple such intermediate aggregate objects can be combined using rollup() before an accessor is applied.

If you’re not sure what value to set for buckets, try using the alternate aggregate function, approx_count_distinct(). approx_count_distinct also creates a hyperloglog, but it sets a default bucket value that should work for many use cases.

Required arguments

Name	Type	Description
`buckets`	`INTEGER`	Number of buckets in the hyperloglog. Increasing the number of buckets improves accuracy but increases memory use. Value is rounded up to the next power of 2, and must be between 2^4 (16) and 2^18. Setting a value less than 2^10 (1,024) may result in poor accuracy if the true cardinality is high and is not recommended. If unsure, start experimenting with 8,192 (2^13) which has an approximate error rate of 1.15%.
`value`	`AnyElement`	The column containing the elements to count. The type must have an extended, 64-bit, hash function.

Returns

Column	Type	Description
`hyperloglog`	`Hyperloglog`	A `hyperloglog` object which can be passed to other hyperloglog APIs for rollups and final calculation

Examples

Given a table called samples, with a column called weights, return a hyperloglog over the weights column:


SELECT hyperloglog(32768, weights) FROM samples;

Using the same data, build a view from the aggregate that you can pass to other hyperloglog functions:


CREATE VIEW hll AS SELECT hyperloglog(32768, data) FROM samples;

approx_count_distinct()

ExperimentalExperimental TimescaleDB Toolkit functions are not suitable for production environments. They may have bugs and may cause data loss. Click to learn more.

Introduced in Toolkit v1.8.0

Hide content

approx_count_distinct(

    value AnyElement

) RETURNS Hyperloglog

This is an alternate first step for approximating the number of distinct values. It provides some added convenience by using some sensible default parameters to create a hyperloglog.

Use approx_count_distinct to create an intermediate aggregate from your raw data. This intermediate form can then be used by one or more accessors in this group to compute final results.

Optionally, multiple such intermediate aggregate objects can be combined using rollup() before an accessor is applied.

Required arguments

Name	Type	Description
`value`	`AnyElement`	The column containing the elements to count. The type must have an extended, 64-bit, hash function.

Returns

Column	Type	Description
`hyperloglog`	`Hyperloglog`	A `hyperloglog` object which can be passed to other hyperloglog APIs for rollups and final calculation

Examples

Given a table called samples, with a column called weights, return a hyperloglog over the weights column::


SELECT toolkit_experimental.approx_count_distinct(weights) FROM samples;

Using the same data, build a view from the aggregate that you can pass to other hyperloglog functions:


CREATE VIEW hll AS SELECT toolkit_experimental.approx_count_distinct(data) FROM samples;

distinct_count()

Stabilized in Toolkit v1.3.0

Hide content

distinct_count(

    hyperloglog Hyperloglog

) RETURNS BIGINT

Estimate the number of distinct values from a hyperloglog

Required arguments

Name	Type	Description
`hyperloglog`	`Hyperloglog`	The hyperloglog to extract the count from.

Returns

Column	Type	Description
`distinct_count`	`BIGINT`	The number of distinct elements counted by the hyperloglog.

Examples

Estimate the number of distinct values from a hyperloglog named hyperloglog. The expected output is 98,814:


SELECT distinct_count(hyperloglog(8192, data))
  FROM generate_series(1, 100000) data
distinct_count
----------------
        98814

stderror()

Stabilized in Toolkit v1.3.0

Hide content

stderror(

    hyperloglog Hyperloglog

) RETURNS DOUBLE PRECISION

Estimate the relative standard error of a Hyperloglog. For approximate relative errors by number of buckets, see the relative errors section.

Required arguments

Name	Type	Description
`hyperloglog`	`Hyperloglog`	The hyperloglog to estimate the error of.

Returns

Column	Type	Description
`stderror`	`DOUBLE PRECISION`	The approximate relative standard error of the hyperloglog.

Examples

Estimate the relative standard error of a hyperloglog named hyperloglog. The expected output is 0.011490485194281396:


SELECT stderror(hyperloglog(8192, data))
  FROM generate_series(1, 100000) data
stderror       
----------------------
0.011490485194281396

rollup()

Stabilized in Toolkit v1.3.0

Hide content

rollup(

  hyperloglog Hyperloglog

) RETURNS Hyperloglog

Combine multiple intermediate hyperloglog aggregates, produced by hyperloglog, into a single intermediate hyperloglog aggregate. For example, you can use rollup to combine hyperloglog from 15-minute buckets into daily buckets.

Required arguments

Name	Type	Description
`hyperloglog`	`Hyperloglog`	The hyperloglog aggregates to roll up.

Returns

Column	Type	Description
`rollup`	`Hyperloglog`	A new hyperloglog aggregate created by combining the input hyperloglog aggregates.

Extended examples

Roll up two hyperloglogs

Roll up two hyperloglogs. The first hyperloglog buckets the integers from 1 to 100,000, and the second hyperloglog buckets the integers from 50,000 to 150,000. Accounting for overlap, the exact number of distinct values in the combined set is 150,000.

Calling distinct_count on the rolled-up hyperloglog yields a final value of 150,552, so the approximation is off by only 0.368%:


SELECT distinct_count(rollup(logs))
FROM (
    (SELECT hyperloglog(4096, v::text) logs FROM generate_series(1, 100000) v)
    UNION ALL
    (SELECT hyperloglog(4096, v::text) FROM generate_series(50000, 150000) v)
) hll;

Output:


distinct_count 
----------------
         150552

Approximate relative errors

These are the approximate errors for each bucket size:

precision	registers (bucket size)	error	column size (in bytes)
4	16	0.2600	12
5	32	0.1838	24
6	64	0.1300	48
7	128	0.0919	96
8	256	0.0650	192
9	512	0.0460	384
10	1024	0.0325	768
11	2048	0.0230	1536
12	4096	0.0163	3072
13	8192	0.0115	6144
14	16384	0.0081	12288
15	32768	0.0057	24576
16	65536	0.0041	49152
17	131072	0.0029	98304
18	262144	0.0020	196608