Window functions

A window function refers to an aggregate function that operates on a sliding window of data that is being processed as part of a SELECT query. Window functions make it possible to do things like:

  • Perform aggregations against subsets of a result-set.
  • Calculate a running total.
  • Rank results.
  • Compare a row value to a value in the preceding (or succeeding!) row(s).

peewee comes with support for SQL window functions, which can be created by calling Function.over() and passing in your partitioning or ordering parameters.

For the following examples, we’ll use the following model and sample data:

    class Sample(Model):
        counter = IntegerField()
        value = FloatField()

    data = [(1, 10),
            (1, 20),
            (2, 1),
            (2, 3),
            (3, 100)]
    Sample.insert_many(data, fields=[Sample.counter, Sample.value]).execute()

Our sample table now contains:

    id | counter | value
    ---+---------+------
    1  | 1       |  10.0
    2  | 1       |  20.0
    3  | 2       |   1.0
    4  | 2       |   3.0
    5  | 3       | 100.0

Ordered Windows

Let’s calculate a running sum of the value field. In order for it to be a “running” sum, we need it to be ordered, so we’ll order with respect to the Sample’s id field:

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).over(order_by=[Sample.id]).alias('total'))

    for sample in query:
        print(sample.counter, sample.value, sample.total)

    # 1 10.  10.
    # 1 20.  30.
    # 2 1.   31.
    # 2 3.   34.
    # 3 100. 134.
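Under the hood, the over(order_by=...) call renders an OVER (ORDER BY ...) clause in the generated SQL. As a rough sketch of the equivalent raw SQL (the exact SQL peewee emits differs in quoting and aliasing), using the stdlib sqlite3 module against an in-memory copy of the sample table (window functions require SQLite 3.25+):

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# The SQL analogue of fn.SUM(Sample.value).over(order_by=[Sample.id]).
rows = db.execute(
    "SELECT counter, value, SUM(value) OVER (ORDER BY id) AS total "
    "FROM sample ORDER BY id").fetchall()
for counter, value, total in rows:
    print(counter, value, total)
```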

For another example, we’ll calculate the difference between the current value and the previous value, when ordered by the id:

    difference = Sample.value - fn.LAG(Sample.value, 1).over(order_by=[Sample.id])
    query = Sample.select(
        Sample.counter,
        Sample.value,
        difference.alias('diff'))

    for sample in query:
        print(sample.counter, sample.value, sample.diff)

    # 1 10.  NULL
    # 1 20.  10.   -- (20 - 10)
    # 2 1.   -19.  -- (1 - 20)
    # 2 3.   2.    -- (3 - 1)
    # 3 100. 97.   -- (100 - 3)
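The same query in raw SQL applies the LAG window function directly; a sketch with the stdlib sqlite3 module, using an in-memory copy of the sample table:

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# value - LAG(value, 1) OVER (ORDER BY id): difference from the previous row.
# The first row has no predecessor, so LAG() yields NULL (None in Python).
rows = db.execute(
    "SELECT counter, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff "
    "FROM sample ORDER BY id").fetchall()
for counter, value, diff in rows:
    print(counter, value, diff)
```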

Partitioned Windows

Let’s calculate the average value for each distinct “counter” value. Notice that there are three possible values for the counter field (1, 2, and 3). We can do this by calculating the AVG() of the value column over a window that is partitioned depending on the counter field:

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.AVG(Sample.value).over(partition_by=[Sample.counter]).alias('cavg'))

    for sample in query:
        print(sample.counter, sample.value, sample.cavg)

    # 1 10.  15.
    # 1 20.  15.
    # 2 1.   2.
    # 2 3.   2.
    # 3 100. 100.

We can use ordering within partitions by specifying both the order_by and partition_by parameters. For an example, let’s rank the samples by value within each distinct counter group.

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.RANK().over(
            order_by=[Sample.value],
            partition_by=[Sample.counter]).alias('rank'))

    for sample in query:
        print(sample.counter, sample.value, sample.rank)

    # 1 10.  1
    # 1 20.  2
    # 2 1.   1
    # 2 3.   2
    # 3 100. 1
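In SQL terms, this combines PARTITION BY and ORDER BY inside a single OVER clause; a sketch of the equivalent raw SQL with the stdlib sqlite3 module:

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# RANK() restarts at 1 inside each counter partition, ordered by value.
rows = db.execute(
    "SELECT counter, value, "
    "RANK() OVER (PARTITION BY counter ORDER BY value) AS rnk "
    "FROM sample ORDER BY counter, value").fetchall()
for counter, value, rnk in rows:
    print(counter, value, rnk)
```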

Bounded windows

By default, window functions are evaluated using an unbounded preceding start for the window, and the current row as the end. We can change the bounds of the window our aggregate functions operate on by specifying a start and/or end in the call to Function.over(). Additionally, Peewee comes with helper-methods on the Window object for generating the appropriate boundary references:

  • Window.CURRENT_ROW – attribute that references the current row.
  • Window.preceding() – specify the number of row(s) preceding, or omit the number to indicate all preceding rows.
  • Window.following() – specify the number of row(s) following, or omit the number to indicate all following rows.

To examine how boundaries work, we’ll calculate a running total of the value column, ordered with respect to id, but we’ll only look at the running total of the current row and its two preceding rows:

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).over(
            order_by=[Sample.id],
            start=Window.preceding(2),
            end=Window.CURRENT_ROW).alias('rsum'))

    for sample in query:
        print(sample.counter, sample.value, sample.rsum)

    # 1 10.  10.
    # 1 20.  30.  -- (20 + 10)
    # 2 1.   31.  -- (1 + 20 + 10)
    # 2 3.   24.  -- (3 + 1 + 20)
    # 3 100. 104. -- (100 + 3 + 1)
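In raw SQL this frame corresponds to ROWS BETWEEN 2 PRECEDING AND CURRENT ROW; a sketch with the stdlib sqlite3 module:

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# Each frame covers at most three rows: the current row and two before it.
rows = db.execute(
    "SELECT counter, value, SUM(value) OVER ("
    "  ORDER BY id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rsum "
    "FROM sample ORDER BY id").fetchall()
for counter, value, rsum in rows:
    print(counter, value, rsum)
```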

Note

Technically we did not need to specify end=Window.CURRENT_ROW, because that is the default; it was included for demonstration.

Let’s look at another example. This time we will calculate the “opposite” of a running total, in which the total sum of all values is decreased by the value of the samples, ordered by id. To accomplish this, we’ll calculate the sum from the current row to the last row.

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).over(
            order_by=[Sample.id],
            start=Window.CURRENT_ROW,
            end=Window.following()).alias('rsum'))

    for sample in query:
        print(sample.counter, sample.value, sample.rsum)

    # 1 10.  134. -- (10 + 20 + 1 + 3 + 100)
    # 1 20.  124. -- (20 + 1 + 3 + 100)
    # 2 1.   104. -- (1 + 3 + 100)
    # 2 3.   103. -- (3 + 100)
    # 3 100. 100. -- (100)
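With no argument, Window.following() indicates all following rows, so the frame corresponds to ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING in SQL; a sqlite3 sketch:

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# Each frame runs from the current row to the end of the window,
# producing the "opposite" of a running total.
rows = db.execute(
    "SELECT counter, value, SUM(value) OVER ("
    "  ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS rsum "
    "FROM sample ORDER BY id").fetchall()
for counter, value, rsum in rows:
    print(counter, value, rsum)
```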

Filtered Aggregates

Aggregate functions may also support filter functions (Postgres and SQLite 3.25+), which get translated into a FILTER (WHERE…) clause. Filter expressions are added to an aggregate function with the Function.filter() method.

For an example, we will calculate the running sum of the value field with respect to the id, but we will filter out any samples whose counter=2.

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).filter(Sample.counter != 2).over(
            order_by=[Sample.id]).alias('csum'))

    for sample in query:
        print(sample.counter, sample.value, sample.csum)

    # 1 10.  10.
    # 1 20.  30.
    # 2 1.   30.
    # 2 3.   30.
    # 3 100. 130.
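In raw SQL the filter sits between the aggregate and the OVER clause; a sqlite3 sketch (note that SQLite accepts FILTER on window functions only in newer releases, roughly 3.30+):

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# Rows with counter = 2 stay in the result but are excluded from the sum.
rows = db.execute(
    "SELECT counter, value, "
    "SUM(value) FILTER (WHERE counter != 2) OVER (ORDER BY id) AS csum "
    "FROM sample ORDER BY id").fetchall()
for counter, value, csum in rows:
    print(counter, value, csum)
```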

Note

The call to filter() must precede the call to over().

Reusing Window Definitions

If you intend to use the same window definition for multiple aggregates, you can create a Window object. The Window object takes the same parameters as Function.over(), and can be passed to the over() method in place of the individual parameters.

Here we’ll declare a single window, ordered with respect to the sample id,and call several window functions using that window definition:

    win = Window(order_by=[Sample.id])
    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.LEAD(Sample.value).over(win),
        fn.LAG(Sample.value).over(win),
        fn.SUM(Sample.value).over(win)
    ).window(win)  # Include our window definition in query.

    for row in query.tuples():
        print(row)

    # counter value lead() lag() sum()
    # 1       10.   20.    NULL  10.
    # 1       20.   1.     10.   30.
    # 2       1.    3.     20.   31.
    # 2       3.    100.   1.    34.
    # 3       100.  NULL   3.    134.
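Reusing a window definition corresponds to SQL’s named-window syntax (WINDOW w AS (...)); a sqlite3 sketch:

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# The named window "w" is defined once and referenced by three functions.
rows = db.execute(
    "SELECT counter, value, "
    "LEAD(value) OVER w, LAG(value) OVER w, SUM(value) OVER w "
    "FROM sample WINDOW w AS (ORDER BY id) ORDER BY id").fetchall()
for row in rows:
    print(row)
```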

Multiple window definitions

In the previous example, we saw how to declare a Window definition and re-use it for multiple different aggregations. You can include as many window definitions as you need in your queries, but it is necessary to ensure each window has a unique alias:

    w1 = Window(order_by=[Sample.id]).alias('w1')
    w2 = Window(partition_by=[Sample.counter]).alias('w2')
    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).over(w1).alias('rsum'),  # Running total.
        fn.AVG(Sample.value).over(w2).alias('cavg')   # Avg per category.
    ).window(w1, w2)  # Include our window definitions.

    for sample in query:
        print(sample.counter, sample.value, sample.rsum, sample.cavg)

    # counter value rsum cavg
    # 1       10.   10.  15.
    # 1       20.   30.  15.
    # 2       1.    31.  2.
    # 2       3.    34.  2.
    # 3       100.  134. 100.

Similarly, if you have multiple windows that share similar definitions, it is possible to extend a previously-defined window. For example, here we will be partitioning the data-set by the counter value, so we’ll be doing our aggregations with respect to the counter. Then we’ll define a second window that extends this partitioning and adds an ordering clause:

    w1 = Window(partition_by=[Sample.counter]).alias('w1')

    # By extending w1, this window definition will also be partitioned
    # by "counter".
    w2 = Window(extends=w1, order_by=[Sample.value.desc()]).alias('w2')

    query = (Sample
             .select(Sample.counter, Sample.value,
                     fn.SUM(Sample.value).over(w1).alias('group_sum'),
                     fn.RANK().over(w2).alias('revrank'))
             .window(w1, w2)
             .order_by(Sample.id))

    for sample in query:
        print(sample.counter, sample.value, sample.group_sum, sample.revrank)

    # counter value group_sum revrank
    # 1       10.   30.       2
    # 1       20.   30.       1
    # 2       1.    4.        2
    # 2       3.    4.        1
    # 3       100.  100.      1
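The extends parameter corresponds to SQL window chaining, where one named window is defined in terms of another; a sqlite3 sketch (window chaining requires a reasonably recent SQLite, roughly 3.28+):

```python
import sqlite3

# In-memory copy of the sample data (mirrors the Sample model above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100)])

# w2 names w1 as its base, inheriting the PARTITION BY counter clause
# and adding a descending ordering on value.
rows = db.execute(
    "SELECT counter, value, SUM(value) OVER w1 AS group_sum, "
    "RANK() OVER w2 AS revrank "
    "FROM sample "
    "WINDOW w1 AS (PARTITION BY counter), w2 AS (w1 ORDER BY value DESC) "
    "ORDER BY id").fetchall()
for row in rows:
    print(row)
```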

Frame types: RANGE vs ROWS vs GROUPS

Depending on the frame type, the database will process ordered groups differently. Let’s create two additional Sample rows to visualize the difference:

    >>> Sample.create(counter=1, value=20.)
    <Sample 6>
    >>> Sample.create(counter=2, value=1.)
    <Sample 7>

Our table now contains:

    id | counter | value
    ---+---------+------
    1  | 1       |  10.0
    2  | 1       |  20.0
    3  | 2       |   1.0
    4  | 2       |   3.0
    5  | 3       | 100.0
    6  | 1       |  20.0
    7  | 2       |   1.0

Let’s examine the difference by calculating a “running sum” of the samples, ordered with respect to the counter and value fields. To specify the frame type, we can use either:

  • Window.RANGE
  • Window.ROWS
  • Window.GROUPS

The behavior of RANGE, when there are logical duplicates, may lead to unexpected results:

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).over(
            order_by=[Sample.counter, Sample.value],
            frame_type=Window.RANGE).alias('rsum'))

    for sample in query.order_by(Sample.counter, Sample.value):
        print(sample.counter, sample.value, sample.rsum)

    # counter value rsum
    # 1       10.   10.
    # 1       20.   50.
    # 1       20.   50.
    # 2       1.    52.
    # 2       1.    52.
    # 2       3.    55.
    # 3       100.  155.

With the inclusion of the new rows we now have some rows that have duplicate counter and value values. The RANGE frame type causes these duplicates to be evaluated together rather than separately.

The more expected result can be achieved by using ROWS asthe frame-type:

    query = Sample.select(
        Sample.counter,
        Sample.value,
        fn.SUM(Sample.value).over(
            order_by=[Sample.counter, Sample.value],
            frame_type=Window.ROWS).alias('rsum'))

    for sample in query.order_by(Sample.counter, Sample.value):
        print(sample.counter, sample.value, sample.rsum)

    # counter value rsum
    # 1       10.   10.
    # 1       20.   30.
    # 1       20.   50.
    # 2       1.    51.
    # 2       1.    52.
    # 2       3.    55.
    # 3       100.  155.
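The two frame types can be compared side-by-side in raw SQL; a sqlite3 sketch using the seven-row data-set:

```python
import sqlite3

# In-memory copy of the seven-row data-set (mirrors the table above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100), (1, 20), (2, 1)])

# RANGE: rows with duplicate (counter, value) are peers and share one frame.
range_sums = [r[0] for r in db.execute(
    "SELECT SUM(value) OVER ("
    "  ORDER BY counter, value"
    "  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "
    "FROM sample ORDER BY counter, value").fetchall()]

# ROWS: every row gets its own frame, even among duplicates.
rows_sums = [r[0] for r in db.execute(
    "SELECT SUM(value) OVER ("
    "  ORDER BY counter, value"
    "  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) "
    "FROM sample ORDER BY counter, value").fetchall()]

print(range_sums)  # duplicates evaluated together
print(rows_sums)   # strictly row-by-row
```

Note that with ROWS, the two tied rows may come back in either order, so the individual sums within a tie group (30./50. and 51./52. here) can swap positions between runs.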

Peewee uses these rules for determining what frame-type to use:

  • If the user specifies a frame_type, that frame type will be used.
  • If start and/or end boundaries are specified, Peewee will default to using ROWS.
  • If the user did not specify a frame type or start/end boundaries, Peewee will use the database default, which is RANGE.

The Window.GROUPS frame type looks at the window range specification in terms of groups of rows, based on the ordering term(s). Using GROUPS, we can define the frame so it covers distinct groupings of rows. Let’s look at an example:

    query = (Sample
             .select(Sample.counter, Sample.value,
                     fn.SUM(Sample.value).over(
                         order_by=[Sample.counter, Sample.value],
                         frame_type=Window.GROUPS,
                         start=Window.preceding(1)).alias('gsum'))
             .order_by(Sample.counter, Sample.value))

    for sample in query:
        print(sample.counter, sample.value, sample.gsum)

    # counter value gsum
    # 1       10    10
    # 1       20    50   -- (10) + (20+20)
    # 1       20    50   -- (10) + (20+20)
    # 2       1     42   -- (20+20) + (1+1)
    # 2       1     42   -- (20+20) + (1+1)
    # 2       3     5    -- (1+1) + 3
    # 3       100   103  -- (3) + 100
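In raw SQL this frame is GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW; a sqlite3 sketch (GROUPS frames require roughly SQLite 3.28+):

```python
import sqlite3

# In-memory copy of the seven-row data-set (mirrors the table above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL)")
db.executemany("INSERT INTO sample (counter, value) VALUES (?, ?)",
               [(1, 10), (1, 20), (2, 1), (2, 3), (3, 100), (1, 20), (2, 1)])

# GROUPS counts peer-groups of the ORDER BY terms rather than individual
# rows: each frame spans the previous (counter, value) group plus the
# current group.
gsums = [r[0] for r in db.execute(
    "SELECT SUM(value) OVER ("
    "  ORDER BY counter, value"
    "  GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW) "
    "FROM sample ORDER BY counter, value").fetchall()]
print(gsums)
```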

As you can hopefully infer, the window is grouped by its ordering term, which is (counter, value). We are looking at a window that extends between one previous group and the current group.

Note

For information about the window function APIs, see the Function.over() and Window API documentation.

For general information on window functions, read the Postgres window functions tutorial.

Additionally, the Postgres docs and the SQLite docs contain a lot of good information.