Get faster DISTINCT queries with SkipScan

SkipScan improves query times for DISTINCT queries. It works on PostgreSQL tables, TimescaleDB hypertables, and TimescaleDB distributed hypertables. SkipScan is included in TimescaleDB 2.2.1 and above.

Speed up DISTINCT queries

To query your database and find the most recent value of an item, you could use a DISTINCT query. For example, you might want to find the latest stock or cryptocurrency price for each of your investments. Or you might have graphs and alarms that repeatedly query the most recent values for every device or service.

As your tables get larger, DISTINCT queries tend to get slower. This is because PostgreSQL does not currently have a good mechanism for pulling a list of unique values from an ordered index. Even when you have an index that matches the exact order and columns for these kinds of queries, PostgreSQL scans the entire index to find all unique values. As a table grows, this operation keeps getting slower.

SkipScan allows queries to incrementally jump from one ordered value to the next without reading all of the rows in between. Without support for this feature, the database engine has to scan the entire ordered index and then de-duplicate at the end, which is a much slower process.

note

PostgreSQL has plans to implement a native feature like SkipScan, but it is unlikely to be included until at least PostgreSQL 15. This section documents TimescaleDB SkipScan, which is not a native PostgreSQL feature.

SkipScan is an optimization for queries of the form SELECT DISTINCT ON column_name. Conceptually, SkipScan is a regular IndexScan that skips across an index looking for the next value that is greater than the current value.

When you issue a query that uses SkipScan, the EXPLAIN output includes a new operator, or node, that can quickly return distinct items from a properly ordered index. With an IndexOnly scan, PostgreSQL has to scan the entire index, but SkipScan incrementally searches for each successive item in the ordered index. As it locates one item, the SkipScan node quickly restarts the search for the next item. This is a much more efficient way of finding distinct items in an ordered index.

For benchmarking information on how SkipScan compares to regular DISTINCT queries, see our SkipScan blog post.

Use SkipScan queries

SkipScan is included in TimescaleDB 2.2.1 and above. This section describes how to set up your database index and query to use a SkipScan node.

Your index must:

  • Contain the DISTINCT column as the first column.
  • Be a BTREE index.
  • Match the ORDER BY used in your query.

Your query must:

  • Use the DISTINCT keyword on a single column.

If the DISTINCT column is not the first column of the index, ensure any leading columns are used as constraints in your query. This means that if you are asking a question such as “retrieve a list of unique IDs in order” and “retrieve the last reading of each ID,” you need at least one index like this:

  1. CREATE INDEX "cpu_customer_tags_id_time_idx" \
  2. ON readings (customer_id, tags_id, time DESC)

With your index set up correctly, you should start to see immediate benefit for DISTINCT queries. When SkipScan is chosen for your query, the EXPLAIN ANALYZE output shows one or more Custom Scan (SkipScan) nodes, like this:

  1. -> Unique
  2. -> Merge Append
  3. Sort Key: _hyper_8_79_chunk.tags_id, _hyper_8_79_chunk."time" DESC
  4. -> Custom Scan (SkipScan) on _hyper_8_79_chunk
  5. -> Index Only Scan using _hyper_8_79_chunk_cpu_tags_id_time_idx on _hyper_8_79_chunk
  6. Index Cond: (tags_id > NULL::integer)
  7. -> Custom Scan (SkipScan) on _hyper_8_80_chunk
  8. -> Index Only Scan using _hyper_8_80_chunk_cpu_tags_id_time_idx on _hyper_8_80_chunk
  9. Index Cond: (tags_id > NULL::integer)