业务背景

按分组取出TOP值,是非常常见的业务需求。

比如提取每位歌手的下载量TOP 10的曲目、提取每个城市纳税前10的人或企业。

传统方法

传统的方法是使用窗口查询,PostgreSQL是支持窗口查询的。

例子

测试表和测试数据,生成10000个分组,1000万条记录。

  1. postgres=# create table tbl(c1 int, c2 int, c3 int);
  2. CREATE TABLE
  3. postgres=# create index idx1 on tbl(c1,c2);
  4. CREATE INDEX
  5. postgres=# insert into tbl select mod(trunc(random()*10000)::int, 10000), trunc(random()*10000000) from generate_series(1,10000000);
  6. INSERT 0 10000000

使用窗口查询的执行计划

  1. postgres=# explain select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
  2. QUERY PLAN
  3. ----------------------------------------------------------------------------------------
  4. Subquery Scan on t (cost=0.43..770563.03 rows=3333326 width=20)
  5. Filter: (t.rn <= 10)
  6. -> WindowAgg (cost=0.43..645563.31 rows=9999977 width=12)
  7. -> Index Scan using idx1 on tbl (cost=0.43..470563.72 rows=9999977 width=12)
  8. (4 rows)

使用窗口查询的结果举例

  1. postgres=# select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
  2. rn | c1 | c2 | c3
  3. ----+------+--------+----
  4. 1 | 0 | 1657 |
  5. 2 | 0 | 3351 |
  6. 3 | 0 | 6347 |
  7. 4 | 0 | 12688 |
  8. 5 | 0 | 16991 |
  9. 6 | 0 | 19584 |
  10. 7 | 0 | 24694 |
  11. 8 | 0 | 36646 |
  12. 9 | 0 | 40882 |
  13. 10 | 0 | 41599 |
  14. 1 | 1 | 14465 |
  15. 2 | 1 | 29032 |
  16. 3 | 1 | 39969 |
  17. 4 | 1 | 41094 |
  18. 5 | 1 | 69481 |
  19. 6 | 1 | 70919 |
  20. 7 | 1 | 75575 |
  21. 8 | 1 | 81102 |
  22. 9 | 1 | 87496 |
  23. 10 | 1 | 90603 |
  24. ......

使用窗口查询的效率,20.1秒

  1. postgres=# explain (analyze,verbose,costs,timing,buffers) select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
  2. QUERY PLAN
  3. ----------------------------------------------------------------------------------------------------------------------------------------------------
  4. Subquery Scan on t (cost=0.43..770563.03 rows=3333326 width=20) (actual time=0.040..20813.469 rows=100000 loops=1)
  5. Output: t.rn, t.c1, t.c2, t.c3
  6. Filter: (t.rn <= 10)
  7. Rows Removed by Filter: 9900000
  8. Buffers: shared hit=10035535
  9. -> WindowAgg (cost=0.43..645563.31 rows=9999977 width=12) (actual time=0.035..18268.027 rows=10000000 loops=1)
  10. Output: row_number() OVER (?), tbl.c1, tbl.c2, tbl.c3
  11. Buffers: shared hit=10035535
  12. -> Index Scan using idx1 on public.tbl (cost=0.43..470563.72 rows=9999977 width=12) (actual time=0.026..11913.677 rows=10000000 loops=1)
  13. Output: tbl.c1, tbl.c2, tbl.c3
  14. Buffers: shared hit=10035535
  15. Planning time: 0.110 ms
  16. Execution time: 20833.747 ms
  17. (13 rows)

雕虫小技

如何优化?

可以参考我之前写的,使用递归查询,优化count distinct的方法

本文同样需要用到递归查询,获得分组ID

  1. postgres=# with recursive t1 as (
  2. postgres(# (select min(c1) c1 from tbl )
  3. postgres(# union all
  4. postgres(# (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
  5. postgres(# )
  6. postgres-# select * from t1;

写成SRF函数,如下

  1. postgres=# create or replace function f() returns setof tbl as $$
  2. postgres$# declare
  3. postgres$# v int;
  4. postgres$# begin
  5. postgres$# for v in with recursive t1 as (
  6. postgres$# (select min(c1) c1 from tbl )
  7. postgres$# union all
  8. postgres$# (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
  9. postgres$# )
  10. postgres$# select * from t1
  11. postgres$# LOOP
  12. postgres$# return query select * from tbl where c1=v order by c2 limit 10;
  13. postgres$# END LOOP;
  14. postgres$# return;
  15. postgres$#
  16. postgres$# end;
  17. postgres$# $$ language plpgsql strict;
  18. CREATE FUNCTION

优化后的查询结果例子

  1. postgres=# select * from f();
  2. c1 | c2 | c3
  3. ------+--------+----
  4. 0 | 1657 |
  5. 0 | 3351 |
  6. 0 | 6347 |
  7. 0 | 12688 |
  8. 0 | 16991 |
  9. 0 | 19584 |
  10. 0 | 24694 |
  11. 0 | 36646 |
  12. 0 | 40882 |
  13. 0 | 41599 |
  14. 1 | 14465 |
  15. 1 | 29032 |
  16. 1 | 39969 |
  17. 1 | 41094 |
  18. 1 | 69481 |
  19. 1 | 70919 |
  20. 1 | 75575 |
  21. 1 | 81102 |
  22. 1 | 87496 |
  23. 1 | 90603 |
  24. ......

优化后,只需要464毫秒返回10000个分组的TOP 10。

  1. postgres=# explain (analyze,verbose,timing,costs,buffers) select * from f();
  2. QUERY PLAN
  3. ---------------------------------------------------------------------------------------------------------------------
  4. Function Scan on public.f (cost=0.25..10.25 rows=1000 width=12) (actual time=419.218..444.810 rows=100000 loops=1)
  5. Output: c1, c2, c3
  6. Function Call: f()
  7. Buffers: shared hit=170407, temp read=221 written=220
  8. Planning time: 0.037 ms
  9. Execution time: 464.257 ms
  10. (6 rows)

小结

  1. 传统的方法使用窗口查询,输出多个每个分组的TOP 10,需要扫描所有的记录。效率较低。

  2. 由于分组不是非常多,只有10000个,所以可以选择使用递归的方法,用上索引取TOP 10,速度非常快。

  3. 目前PostgreSQL的递归语法不支持递归的启动表写在subquery里面,也不支持启动表在递归查询中使用order by,所以不能直接使用递归得出结果,目前需要套一层函数。