Doris Window function usage

Window function introduction

Window functions (also called analysis functions or analytic functions) are a special kind of built-in function. Like an aggregate function, a window function computes a value from multiple input rows. The difference is that a window function processes the input within a specific window instead of collapsing rows into groups with GROUP BY. The rows in each window can be partitioned and ordered with the OVER() clause. A window function computes a separate value for every row of the result set, rather than one value per GROUP BY group. This flexibility lets users add extra columns to the SELECT list, which gives more ways to reorganize and filter the result set. Window functions can only appear in the SELECT list and in the outermost ORDER BY clause. During query execution they take effect last, that is, after the JOIN, WHERE and GROUP BY operations have completed. Window functions are often used in finance and scientific computing to analyze trends, calculate outliers, and perform bucket analysis on large amounts of data.
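Because a window function is evaluated after WHERE, its result cannot be filtered in the WHERE clause of the same query block. A common workaround is to compute the function in a subquery and filter in the outer query. The following is a minimal sketch, assuming the stock_ticker table used in the examples later in this document; the aliases rn and t are illustrative names, not part of the original examples.

```sql
-- Keep only the most recent closing row for each stock.
-- The window function is computed in the inner query; the outer WHERE
-- then filters on its result.
SELECT stock_symbol, closing_date, closing_price
FROM (
    SELECT stock_symbol, closing_date, closing_price,
           row_number() OVER (PARTITION BY stock_symbol
                              ORDER BY closing_date DESC) AS rn
    FROM stock_ticker
) t
WHERE rn = 1;
```

For the sample JDR rows, this would return only the row dated 2014-10-08.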

The syntax of the analysis function:

  function(args) OVER(partition_by_clause order_by_clause [window_clause])
  partition_by_clause ::= PARTITION BY expr [, expr ...]
  order_by_clause ::= ORDER BY expr [ASC | DESC] [, expr [ASC | DESC] ...]

Function

Currently supported functions include AVG(), COUNT(), DENSE_RANK(), FIRST_VALUE(), LAG(), LAST_VALUE(), LEAD(), MAX(), MIN(), RANK(), ROW_NUMBER() and SUM().

Partition By clause

The Partition By clause is similar to Group By. It groups the input rows by one or more specified columns, and rows with the same values fall into the same partition.
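The difference between the two clauses is easiest to see side by side. The sketch below assumes the stock_ticker table used in the examples later in this document; the alias avg_price is illustrative.

```sql
-- GROUP BY: one output row per stock_symbol.
SELECT stock_symbol, avg(closing_price) AS avg_price
FROM stock_ticker
GROUP BY stock_symbol;

-- PARTITION BY: every input row is kept, and each row carries the
-- average computed over its own partition.
SELECT stock_symbol, closing_date, closing_price,
       avg(closing_price) OVER (PARTITION BY stock_symbol) AS avg_price
FROM stock_ticker;
```

The first query collapses each stock into a single row, while the second keeps one row per closing date and repeats the per-stock average on each of them.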

Order By clause

The Order By clause is basically the same as the outer Order By. It defines the order of the input rows. If Partition By is specified, Order By defines the order within each partition. The only difference from the outer Order By is that Order By n (where n is a positive integer) inside the OVER clause has no effect, while the outer Order By n sorts by the nth column of the select list.

For example:

This example adds an id column to the select list. Its values are 1, 2, 3, and so on, assigned in the order of the date_and_time column of the events table.

  SELECT
    row_number() OVER (ORDER BY date_and_time) AS id,
    c1, c2, c3, c4
  FROM events;

Window clause

The Window clause specifies the range of rows that the analysis function operates on: relative to the current row, a number of rows before and after it are taken as the input of the function. The functions that support the Window clause are AVG(), COUNT(), FIRST_VALUE(), LAST_VALUE() and SUM(). For MAX() and MIN(), the Window clause can specify UNBOUNDED PRECEDING as the start of the range.

Syntax:

  ROWS BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ]

Example:

Suppose we have the following stock data, where the stock symbol is JDR and closing_price is the daily closing price.

  create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp);
  ...load some data...
  select * from stock_ticker order by stock_symbol, closing_date;
  | stock_symbol | closing_price | closing_date        |
  |--------------|---------------|---------------------|
  | JDR          | 12.86         | 2014-10-02 00:00:00 |
  | JDR          | 12.89         | 2014-10-03 00:00:00 |
  | JDR          | 12.94         | 2014-10-04 00:00:00 |
  | JDR          | 12.55         | 2014-10-05 00:00:00 |
  | JDR          | 14.03         | 2014-10-06 00:00:00 |
  | JDR          | 14.75         | 2014-10-07 00:00:00 |
  | JDR          | 13.98         | 2014-10-08 00:00:00 |

This query uses an analysis function to generate the moving_average column, whose value is the 3-day average price of the stock, that is, the average of the previous day, the current day, and the next day. The first day has no previous day and the last day has no next day, so those two rows average only two days. Partition By has no visible effect here because all the rows belong to JDR, but if the table held other stocks, Partition By would ensure that the function is computed separately within each stock's partition.

  select stock_symbol, closing_date, closing_price,
    avg(closing_price) over (partition by stock_symbol order by closing_date
    rows between 1 preceding and 1 following) as moving_average
  from stock_ticker;
  | stock_symbol | closing_date        | closing_price | moving_average |
  |--------------|---------------------|---------------|----------------|
  | JDR          | 2014-10-02 00:00:00 | 12.86         | 12.87          |
  | JDR          | 2014-10-03 00:00:00 | 12.89         | 12.89          |
  | JDR          | 2014-10-04 00:00:00 | 12.94         | 12.79          |
  | JDR          | 2014-10-05 00:00:00 | 12.55         | 13.17          |
  | JDR          | 2014-10-06 00:00:00 | 14.03         | 13.77          |
  | JDR          | 2014-10-07 00:00:00 | 14.75         | 14.25          |
  | JDR          | 2014-10-08 00:00:00 | 13.98         | 14.36          |
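The Window clause syntax also allows a frame that ends at the current row. As a hedged variation on the query above, the following sketch computes a trailing average over the current day and the two days before it; the alias trailing_average is illustrative.

```sql
-- Trailing 3-day average: the current row and the two rows before it
-- within the same stock, ordered by closing date.
SELECT stock_symbol, closing_date, closing_price,
       avg(closing_price) OVER (PARTITION BY stock_symbol
                                ORDER BY closing_date
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS trailing_average
FROM stock_ticker;
```

With this frame, the first row of each partition averages only itself and the second row averages two days, because fewer than two preceding rows exist.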

Function example

This section introduces the functions that can be used as analysis functions in Doris.

AVG()

Syntax:

  AVG([DISTINCT | ALL] expression) [OVER (analytic_clause)]

For example:

Calculate the average of x over the current row, the row before it, and the row after it.

  select x, property,
    avg(x) over
    (
      partition by property
      order by x
      rows between 1 preceding and 1 following
    ) as 'moving average'
  from int_t where property in ('odd','even');
  | x  | property | moving average |
  |----|----------|----------------|
  | 2  | even     | 3              |
  | 4  | even     | 4              |
  | 6  | even     | 6              |
  | 8  | even     | 8              |
  | 10 | even     | 9              |
  | 1  | odd      | 2              |
  | 3  | odd      | 3              |
  | 5  | odd      | 5              |
  | 7  | odd      | 7              |
  | 9  | odd      | 8              |

COUNT()

Syntax:

  COUNT([DISTINCT | ALL] expression) [OVER (analytic_clause)]

For example:

Count the occurrences of x from the first row of the partition up to the current row.

  select x, property,
    count(x) over
    (
      partition by property
      order by x
      rows between unbounded preceding and current row
    ) as 'cumulative count'
  from int_t where property in ('odd','even');
  | x  | property | cumulative count |
  |----|----------|------------------|
  | 2  | even     | 1                |
  | 4  | even     | 2                |
  | 6  | even     | 3                |
  | 8  | even     | 4                |
  | 10 | even     | 5                |
  | 1  | odd      | 1                |
  | 3  | odd      | 2                |
  | 5  | odd      | 3                |
  | 7  | odd      | 4                |
  | 9  | odd      | 5                |

DENSE_RANK()

The DENSE_RANK() function returns the ranking of each row. Unlike RANK(), DENSE_RANK() leaves no gaps in the ranking. For example, if two rows tie for rank 1, the third row gets rank 2 with DENSE_RANK() and rank 3 with RANK().

Syntax:

  DENSE_RANK() OVER(partition_by_clause order_by_clause)

For example:

The following example ranks the rows by the y column within each partition of the x column:

  select x, y, dense_rank() over(partition by x order by y) as rank from int_t;
  | x | y | rank |
  |---|---|------|
  | 1 | 1 | 1    |
  | 1 | 2 | 2    |
  | 1 | 2 | 2    |
  | 2 | 1 | 1    |
  | 2 | 2 | 2    |
  | 2 | 3 | 3    |
  | 3 | 1 | 1    |
  | 3 | 1 | 1    |
  | 3 | 2 | 2    |

FIRST_VALUE()

FIRST_VALUE() returns the first value in the window range.

Syntax:

  FIRST_VALUE(expr) OVER(partition_by_clause order_by_clause [window_clause])

For example:

We have the following data:

  select name, country, greeting from mail_merge;
  | name    | country | greeting     |
  |---------|---------|--------------|
  | Pete    | USA     | Hello        |
  | John    | USA     | Hi           |
  | Boris   | Germany | Guten tag    |
  | Michael | Germany | Guten morgen |
  | Bjorn   | Sweden  | Hej          |
  | Mats    | Sweden  | Tja          |

Use FIRST_VALUE() to group by country and return the value of the first greeting in each group:

  select country, name,
    first_value(greeting)
    over (partition by country order by name, greeting) as greeting from mail_merge;
  | country | name    | greeting  |
  |---------|---------|-----------|
  | Germany | Boris   | Guten tag |
  | Germany | Michael | Guten tag |
  | Sweden  | Bjorn   | Hej       |
  | Sweden  | Mats    | Hej       |
  | USA     | John    | Hi        |
  | USA     | Pete    | Hi        |

LAG()

The LAG() function returns the value of expr from the row that is offset rows before the current row. If no such row exists, default is returned.

Syntax:

  LAG (expr, offset, default) OVER (partition_by_clause order_by_clause)

For example:

Calculate the closing price of the previous day. The first row has no previous day, so it gets the default value 0.

  select stock_symbol, closing_date, closing_price,
    lag(closing_price,1, 0) over (partition by stock_symbol order by closing_date) as "yesterday closing"
  from stock_ticker
  order by closing_date;
  | stock_symbol | closing_date        | closing_price | yesterday closing |
  |--------------|---------------------|---------------|-------------------|
  | JDR          | 2014-09-13 00:00:00 | 12.86         | 0                 |
  | JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
  | JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
  | JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
  | JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
  | JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
  | JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |

LAST_VALUE()

LAST_VALUE() returns the last value in the window range. It is the counterpart of FIRST_VALUE().

Syntax:

  LAST_VALUE(expr) OVER(partition_by_clause order_by_clause [window_clause])

Use the data in the FIRST_VALUE() example:

  select country, name,
    last_value(greeting)
    over (partition by country order by name, greeting) as greeting
  from mail_merge;
  | country | name    | greeting     |
  |---------|---------|--------------|
  | Germany | Boris   | Guten morgen |
  | Germany | Michael | Guten morgen |
  | Sweden  | Bjorn   | Tja          |
  | Sweden  | Mats    | Tja          |
  | USA     | John    | Hello        |
  | USA     | Pete    | Hello        |
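When the goal is the last value of the whole partition, the frame can be written out explicitly so that the result does not depend on whatever default window applies. The following is a sketch under that assumption; it reuses the mail_merge table and relies only on the Window clause syntax given earlier.

```sql
-- Explicit frame covering the entire partition: every row sees all rows
-- of its country, so last_value() returns the greeting of the last name.
SELECT country, name,
       last_value(greeting) OVER (
           PARTITION BY country
           ORDER BY name, greeting
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS greeting
FROM mail_merge;
```

With the sample data, each row would carry the greeting of the alphabetically last name in its country: Guten morgen for Germany, Tja for Sweden, and Hello for USA.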

LEAD()

The LEAD() function returns the value of expr from the row that is offset rows after the current row. If no such row exists, default is returned.

Syntax:

  LEAD (expr, offset, default) OVER (partition_by_clause order_by_clause)

For example:

Calculate the trend of the next day's closing price relative to the current day's, that is, whether the next day's closing price is higher or lower than the current day's. The last row has no next day, so LEAD() returns the default 0 and the row is reported as flat or lower.

  select stock_symbol, closing_date, closing_price,
    case
      (lead(closing_price,1, 0)
        over (partition by stock_symbol order by closing_date)-closing_price) > 0
      when true then "higher"
      when false then "flat or lower"
    end as "trending"
  from stock_ticker
  order by closing_date;
  | stock_symbol | closing_date        | closing_price | trending      |
  |--------------|---------------------|---------------|---------------|
  | JDR          | 2014-09-13 00:00:00 | 12.86         | higher        |
  | JDR          | 2014-09-14 00:00:00 | 12.89         | higher        |
  | JDR          | 2014-09-15 00:00:00 | 12.94         | flat or lower |
  | JDR          | 2014-09-16 00:00:00 | 12.55         | higher        |
  | JDR          | 2014-09-17 00:00:00 | 14.03         | higher        |
  | JDR          | 2014-09-18 00:00:00 | 14.75         | flat or lower |
  | JDR          | 2014-09-19 00:00:00 | 13.98         | flat or lower |

MAX()

Syntax:

  MAX([DISTINCT | ALL] expression) [OVER (analytic_clause)]

For example:

Calculate the maximum value from the first row through the row after the current row.

  select x, property,
    max(x) over
    (
      order by property, x
      rows between unbounded preceding and 1 following
    ) as 'local maximum'
  from int_t where property in ('prime','square');
  | x | property | local maximum |
  |---|----------|---------------|
  | 2 | prime    | 3             |
  | 3 | prime    | 5             |
  | 5 | prime    | 7             |
  | 7 | prime    | 7             |
  | 1 | square   | 7             |
  | 4 | square   | 9             |
  | 9 | square   | 9             |
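A frame that starts at UNBOUNDED PRECEDING and ends at the current row gives a running maximum. The sketch below applies it to the stock_ticker table from the Window clause section to track the highest closing price seen so far; the alias running_high is illustrative.

```sql
-- Running high: the largest closing price from the first row of the
-- partition up to and including the current row.
SELECT stock_symbol, closing_date, closing_price,
       max(closing_price) OVER (PARTITION BY stock_symbol
                                ORDER BY closing_date
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_high
FROM stock_ticker
ORDER BY stock_symbol, closing_date;
-- For the seven JDR rows dated 2014-10-02 to 2014-10-08, running_high
-- would be 12.86, 12.89, 12.94, 12.94, 14.03, 14.75, 14.75.
```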

MIN()

Syntax:

  MIN([DISTINCT | ALL] expression) [OVER (analytic_clause)]

For example:

Calculate the minimum value from the first row through the row after the current row.

  select x, property,
    min(x) over
    (
      order by property, x desc
      rows between unbounded preceding and 1 following
    ) as 'local minimum'
  from int_t where property in ('prime','square');
  | x | property | local minimum |
  |---|----------|---------------|
  | 7 | prime    | 5             |
  | 5 | prime    | 3             |
  | 3 | prime    | 2             |
  | 2 | prime    | 2             |
  | 9 | square   | 2             |
  | 4 | square   | 1             |
  | 1 | square   | 1             |

RANK()

The RANK() function returns the ranking of each row. Unlike DENSE_RANK(), RANK() leaves gaps in the ranking after ties. For example, if two rows tie for rank 1, the third row gets rank 3, not 2.

Syntax:

  RANK() OVER(partition_by_clause order_by_clause)

For example:

Rank the rows by y within each partition of x:

  select x, y, rank() over(partition by x order by y) as rank from int_t;
  | x | y | rank |
  |---|---|------|
  | 1 | 1 | 1    |
  | 1 | 2 | 2    |
  | 1 | 2 | 2    |
  | 2 | 1 | 1    |
  | 2 | 2 | 2    |
  | 2 | 3 | 3    |
  | 3 | 1 | 1    |
  | 3 | 1 | 1    |
  | 3 | 2 | 3    |

ROW_NUMBER()

For each row of each partition, ROW_NUMBER() returns an integer that starts at 1 and increases by 1 with every row. Unlike RANK() and DENSE_RANK(), the values returned by ROW_NUMBER() contain neither repeats nor gaps.

Syntax:

  ROW_NUMBER() OVER(partition_by_clause order_by_clause)

For example:

  select x, y, row_number() over(partition by x order by y) as rank from int_t;
  | x | y | rank |
  |---|---|------|
  | 1 | 1 | 1    |
  | 1 | 2 | 2    |
  | 1 | 2 | 3    |
  | 2 | 1 | 1    |
  | 2 | 2 | 2    |
  | 2 | 3 | 3    |
  | 3 | 1 | 1    |
  | 3 | 1 | 2    |
  | 3 | 2 | 3    |
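The three ranking functions differ only in how they treat ties, which is easiest to see in a single query. The following sketch assumes the same nine (x, y) rows of int_t shown above; the aliases rnk, dense_rnk and row_num are illustrative.

```sql
-- Compare RANK(), DENSE_RANK() and ROW_NUMBER() over the same window.
SELECT x, y,
       rank()       OVER (PARTITION BY x ORDER BY y) AS rnk,
       dense_rank() OVER (PARTITION BY x ORDER BY y) AS dense_rnk,
       row_number() OVER (PARTITION BY x ORDER BY y) AS row_num
FROM int_t
ORDER BY x, y, row_num;
-- Expected values for the partition x = 3, where y is 1, 1, 2:
--   rnk       1, 1, 3   (gap after the tie)
--   dense_rnk 1, 1, 2   (no gap)
--   row_num   1, 2, 3   (ties broken arbitrarily, no repeats)
```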

SUM()

Syntax:

  SUM([DISTINCT | ALL] expression) [OVER (analytic_clause)]

For example:

Partition by property, and within each partition calculate the sum of x over the current row, the row before it, and the row after it.

  select x, property,
    sum(x) over
    (
      partition by property
      order by x
      rows between 1 preceding and 1 following
    ) as 'moving total'
  from int_t where property in ('odd','even');
  | x  | property | moving total |
  |----|----------|--------------|
  | 2  | even     | 6            |
  | 4  | even     | 12           |
  | 6  | even     | 18           |
  | 8  | even     | 24           |
  | 10 | even     | 18           |
  | 1  | odd      | 4            |
  | 3  | odd      | 9            |
  | 5  | odd      | 15           |
  | 7  | odd      | 21           |
  | 9  | odd      | 16           |
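Changing the frame to end at the current row turns the moving total into a running total. This is a sketch over the same int_t rows; the alias running_total is illustrative.

```sql
-- Running total: sum of x from the first row of the partition up to
-- and including the current row.
SELECT x, property,
       sum(x) OVER (PARTITION BY property
                    ORDER BY x
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM int_t
WHERE property IN ('odd', 'even')
ORDER BY property, x;
-- For the rows shown above, running_total would be
--   even: 2, 6, 12, 20, 30
--   odd:  1, 4, 9, 16, 25
```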