Downsample and retain data

InfluxDB can handle hundreds of thousands of data points per second. Working with that much data over a long period of time can create storage concerns. A natural solution is to downsample the data; keep the high precision raw data for only a limited time, and store the lower precision, summarized data longer. This guide describes how to automate the process of downsampling data and expiring old data using InfluxQL. To downsample and retain data using Flux and InfluxDB 2.0, see Process Data with InfluxDB tasks.

Definitions

  • Continuous query (CQ) is an InfluxQL query that runs automatically and periodically within a database. CQs require a function in the SELECT clause and must include a GROUP BY time() clause.

  • Retention policy (RP) is the part of InfluxDB data structure that describes for how long InfluxDB keeps data. InfluxDB compares your local server’s timestamp to the timestamps on your data and deletes data older than the RP’s DURATION. A single database can have several RPs and RPs are unique per database.

This guide doesn’t go into detail about the syntax for creating and managing CQs and RPs or tasks. If you’re new to these concepts, we recommend reviewing the following:

Sample data

This section uses fictional real-time data to track the number of food orders to a restaurant via phone and via website at ten second intervals. We store this data in a database or [bucket]() called food_data, in the measurement orders, and in the fields phone and website.

Sample:

  1. name: orders
  2. ------------
  3. time phone website
  4. 2016-05-10T23:18:00Z 10 30
  5. 2016-05-10T23:18:10Z 12 39
  6. 2016-05-10T23:18:20Z 11 56

Goal

Assume that, in the long run, we’re only interested in the average number of orders by phone and by website at 30 minute intervals. In the next steps, we use RPs and CQs to:

  • Automatically aggregate the ten-second resolution data to 30-minute resolution data
  • Automatically delete the raw, ten-second resolution data that are older than two hours
  • Automatically delete the 30-minute resolution data that are older than 52 weeks

Database preparation

We perform the following steps before writing the data to the database food_data. We do this before inserting any data because CQs only run against recent data; that is, data with timestamps that are no older than now() minus the FOR clause of the CQ, or now() minus the GROUP BY time() interval if the CQ has no FOR clause.

1. Create the database

  1. > CREATE DATABASE "food_data"

2. Create a two-hour DEFAULT retention policy

InfluxDB writes to the DEFAULT retention policy if we do not supply an explicit RP when writing a point to the database. We make the DEFAULT RP keep data for two hours, because we want InfluxDB to automatically write the incoming ten-second resolution data to that RP.

Use the CREATE RETENTION POLICY statement to create a DEFAULT RP:

  1. > CREATE RETENTION POLICY "two_hours" ON "food_data" DURATION 2h REPLICATION 1 DEFAULT

That query creates an RP called two_hours that exists in the database food_data. two_hours keeps data for a DURATION of two hours (2h) and it’s the DEFAULT RP for the database food_data.

The replication factor (REPLICATION 1) is a required parameter but must always be set to 1 for single node instances.

Note: When we created the food_data database in step 1, InfluxDB automatically generated an RP named autogen and set it as the DEFAULT RP for the database. The autogen RP has an infinite retention period. With the query above, the RP two_hours replaces autogen as the DEFAULT RP for the food_data database.

3. Create a 52-week retention policy

Next we want to create another retention policy that keeps data for 52 weeks and is not the DEFAULT retention policy (RP) for the database. Ultimately, the 30-minute rollup data will be stored in this RP.

Use the CREATE RETENTION POLICY statement to create a non-DEFAULT retention policy:

  1. > CREATE RETENTION POLICY "a_year" ON "food_data" DURATION 52w REPLICATION 1

That query creates a retention policy (RP) called a_year that exists in the database food_data. The a_year setting keeps data for a DURATION of 52 weeks (52w). Leaving out the DEFAULT argument ensures that a_year is not the DEFAULT RP for the database food_data. That is, write and read operations against food_data that do not specify an RP will still go to the two_hours RP (the DEFAULT RP).

4. Create the continuous query

Now that we’ve set up our RPs, we want to create a continuous query (CQ) that will automatically and periodically downsample the ten-second resolution data to the 30-minute resolution, and then store those results in a different measurement with a different retention policy.

Use the CREATE CONTINUOUS QUERY statement to generate a CQ:

  1. > CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN
  2. SELECT mean("website") AS "mean_website",mean("phone") AS "mean_phone"
  3. INTO "a_year"."downsampled_orders"
  4. FROM "orders"
  5. GROUP BY time(30m)
  6. END

That query creates a CQ called cq_30m in the database food_data. cq_30m tells InfluxDB to calculate the 30-minute average of the two fields website and phone in the measurement orders and in the DEFAULT RP two_hours. It also tells InfluxDB to write those results to the measurement downsampled_orders in the retention policy a_year with the field keys mean_website and mean_phone. InfluxDB will run this query every 30 minutes for the previous 30 minutes.

Note: Notice that we fully qualify (that is, we use the syntax "<retention_policy>"."<measurement>") the measurement in the INTO clause. InfluxDB requires that syntax to write data to an RP other than the DEFAULT RP.

Results

With the new CQ and two new RPs, food_data is ready to start receiving data. After writing data to our database and letting things run for a bit, we see two measurements: orders and downsampled_orders.

  1. > SELECT * FROM "orders" LIMIT 5
  2. name: orders
  3. ---------
  4. time phone website
  5. 2016-05-13T23:00:00Z 10 30
  6. 2016-05-13T23:00:10Z 12 39
  7. 2016-05-13T23:00:20Z 11 56
  8. 2016-05-13T23:00:30Z 8 34
  9. 2016-05-13T23:00:40Z 17 32
  10. > SELECT * FROM "a_year"."downsampled_orders" LIMIT 5
  11. name: downsampled_orders
  12. ---------------------
  13. time mean_phone mean_website
  14. 2016-05-13T15:00:00Z 12 23
  15. 2016-05-13T15:30:00Z 13 32
  16. 2016-05-13T16:00:00Z 19 21
  17. 2016-05-13T16:30:00Z 3 26
  18. 2016-05-13T17:00:00Z 4 23

The data in orders are the raw, ten-second resolution data that reside in the two-hour RP. The data in downsampled_orders are the aggregated, 30-minute resolution data that are subject to the 52-week RP.

Notice that the first timestamps in downsampled_orders are older than the first timestamps in orders. This is because InfluxDB has already deleted data from orders with timestamps that are older than our local server’s timestamp minus two hours (assume we executed the SELECT queries at 2016-05-14T00:59:59Z). InfluxDB will only start dropping data from downsampled_orders after 52 weeks.

Notes:

  • Notice that we fully qualify (that is, we use the syntax "<retention_policy>"."<measurement>") downsampled_orders in the second SELECT statement. We must specify the RP in that query to SELECT data that reside in an RP other than the DEFAULT RP.

  • By default, InfluxDB checks to enforce an RP every 30 minutes. Between checks, orders may have data that are older than two hours. The rate at which InfluxDB checks to enforce an RP is a configurable setting, see Database Configuration.

Using a combination of RPs and CQs, we’ve successfully set up our database to automatically keep the high precision raw data for a limited time, create lower precision data, and store that lower precision data for a longer period of time. Now that you have a general understanding of how these features can work together, check out the detailed documentation on CQs and RPs to see all that they can do for you.