背景

有一个这样的场景,一张小表A,里面存储了一些ID,大约几百个。

(比如说巡逻车辆ID,环卫车辆的ID,公交车,微公交的ID)。

另外有一张日志表B,每条记录中的ID是来自前面那张小表的,但不是每个ID都出现在这张日志表中,比如说一天可能只有几十个ID会出现在这个日志表的当天的数据中。

(比如车辆的行车轨迹数据,每秒上报轨迹,数据量就非常庞大)。

那么我怎么快速的找出今天没有出现的ID呢。

(哪些巡逻车辆没有出现在这个片区,是不是偷懒了?哪些环卫车辆没有出行,哪些公交或微公交没有出行)?

select id from A where id not in (select id from B where time between ? and ?);

这个QUERY会很慢,有什么优化方法呢。

当然,你还可以让车辆签到的方式来解决这个问题,但是总有未签到的,或者没有这种设计的时候,那么怎么解决呢?

优化方法

其实方法也很精妙,和我之前做的两个CASE很相似。

《时序数据合并场景加速分析和实现 - 复合索引,窗口分组查询加速,变态递归加速》

《distinct xx和count(distinct xx)的变态递归优化方法 - 索引收敛(skip scan)扫描》

在B表中,其实ID的值是很稀疏的,只是由于是流水,所以总量大。

优化的手段就是对B的取值区间,做递归的收敛查询,然后再做NOT IN就很快了。

例子

建表

  1. create table a(id int primary key, info text);
  2. create table b(id int primary key, aid int, crt_time timestamp);
  3. create index b_aid on b(aid);

插入测试数据

  1. -- a表插入1000
  2. insert into a select generate_series(1,1000), md5(random()::text);
  3. -- b表插入500万条,只包含aid500id
  4. insert into b select generate_series(1,5000000), generate_series(1,500), clock_timestamp();

优化前的性能

  1. \timing
  2. explain (analyze,verbose,timing,costs,buffers) select * from a where id not in (select aid from b);
  3. QUERY PLAN
  4. ----------------------------------------------------------------------------------------------------------------------------------
  5. Seq Scan on public.a (cost=0.00..67030021.50 rows=500 width=37) (actual time=2932.080..618776.881 rows=500 loops=1)
  6. Output: a.id, a.info
  7. Filter: (NOT (SubPlan 1))
  8. Rows Removed by Filter: 500
  9. Buffers: shared hit=27037, temp read=4264454 written=8545
  10. SubPlan 1
  11. -> Materialize (cost=0.00..121560.00 rows=5000000 width=4) (actual time=0.002..298.049 rows=2500125 loops=1000)
  12. Output: b.aid
  13. Buffers: shared hit=27028, temp read=4264454 written=8545
  14. -> Seq Scan on public.b (cost=0.00..77028.00 rows=5000000 width=4) (actual time=0.009..888.427 rows=5000000 loops=1)
  15. Output: b.aid
  16. Buffers: shared hit=27028
  17. Planning time: 0.969 ms
  18. Execution time: 618794.299 ms
  19. (14 rows)

另外你有一种选择是使用outer join, b表同样需要全扫一遍,有很大的改进,不过还可以更好,继续往后看。

  1. postgres=# explain (analyze,verbose,timing,costs,buffers) select a.id from a left join b on (a.id=b.aid) where b.* is null;
  2. QUERY PLAN
  3. ----------------------------------------------------------------------------------------------------------------------------
  4. Hash Right Join (cost=31.50..145809.50 rows=25000 width=4) (actual time=2376.777..2376.862 rows=500 loops=1)
  5. Output: a.id
  6. Hash Cond: (b.aid = a.id)
  7. Filter: (b.* IS NULL)
  8. Rows Removed by Filter: 5000000
  9. Buffers: shared hit=27037
  10. -> Seq Scan on public.b (cost=0.00..77028.00 rows=5000000 width=44) (actual time=0.012..1087.997 rows=5000000 loops=1)
  11. Output: b.aid, b.*
  12. Buffers: shared hit=27028
  13. -> Hash (cost=19.00..19.00 rows=1000 width=4) (actual time=0.355..0.355 rows=1000 loops=1)
  14. Output: a.id
  15. Buckets: 1024 Batches: 1 Memory Usage: 44kB
  16. Buffers: shared hit=9
  17. -> Seq Scan on public.a (cost=0.00..19.00 rows=1000 width=4) (actual time=0.010..0.183 rows=1000 loops=1)
  18. Output: a.id
  19. Buffers: shared hit=9
  20. Planning time: 0.302 ms
  21. Execution time: 2376.934 ms
  22. (18 rows)

递归收敛优化后的性能

  1. explain (analyze,verbose,timing,costs,buffers)
  2. select * from a where id not in
  3. (
  4. with recursive skip as (
  5. (
  6. select min(aid) aid from b where aid is not null
  7. )
  8. union all
  9. (
  10. select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)
  11. from skip s where s.aid is not null
  12. ) -- 这里的where s.aid is not null 一定要加,否则就死循环了.
  13. )
  14. select aid from skip where aid is not null
  15. );
  16. QUERY PLAN
  17. ------------------------------------------------------------------------------------------------------
  18. Seq Scan on public.a (cost=54.98..76.48 rows=500 width=37) (actual time=10.837..10.957 rows=500 loops=1)
  19. Output: a.id, a.info
  20. Filter: (NOT (hashed SubPlan 5))
  21. Rows Removed by Filter: 500
  22. Buffers: shared hit=2012
  23. SubPlan 5
  24. -> CTE Scan on skip (cost=52.71..54.73 rows=100 width=4) (actual time=0.042..10.386 rows=500 loops=1)
  25. Output: skip.aid
  26. Filter: (skip.aid IS NOT NULL)
  27. Rows Removed by Filter: 1
  28. Buffers: shared hit=2003
  29. CTE skip
  30. -> Recursive Union (cost=0.46..52.71 rows=101 width=4) (actual time=0.037..10.104 rows=501 loops=1)
  31. Buffers: shared hit=2003
  32. -> Result (cost=0.46..0.47 rows=1 width=4) (actual time=0.036..0.036 rows=1 loops=1)
  33. Output: $1
  34. Buffers: shared hit=4
  35. InitPlan 3 (returns $1)
  36. -> Limit (cost=0.43..0.46 rows=1 width=4) (actual time=0.031..0.032 rows=1 loops=1)
  37. Output: b_1.aid
  38. Buffers: shared hit=4
  39. -> Index Only Scan using b_aid on public.b b_1 (cost=0.43..131903.43 rows=5000000 width=4) (actual time=0.030..0.030 rows=1 loops=1)
  40. Output: b_1.aid
  41. Index Cond: (b_1.aid IS NOT NULL)
  42. Heap Fetches: 1
  43. Buffers: shared hit=4
  44. -> WorkTable Scan on skip s (cost=0.00..5.02 rows=10 width=4) (actual time=0.019..0.019 rows=1 loops=501)
  45. Output: (SubPlan 2)
  46. Filter: (s.aid IS NOT NULL)
  47. Rows Removed by Filter: 0
  48. Buffers: shared hit=1999
  49. SubPlan 2
  50. -> Result (cost=0.47..0.48 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=500)
  51. Output: $3
  52. Buffers: shared hit=1999
  53. InitPlan 1 (returns $3)
  54. -> Limit (cost=0.43..0.47 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=500)
  55. Output: b.aid
  56. Buffers: shared hit=1999
  57. -> Index Only Scan using b_aid on public.b (cost=0.43..66153.48 rows=1666667 width=4) (actual time=0.017..0.017 rows=1 loops=500)
  58. Output: b.aid
  59. Index Cond: ((b.aid > s.aid) AND (b.aid IS NOT NULL))
  60. Heap Fetches: 499
  61. Buffers: shared hit=1999
  62. Planning time: 0.323 ms
  63. Execution time: 11.082 ms
  64. (46 rows)

采用收敛查询优化后,耗时从最初的 618794毫秒 降低到了 11毫秒 ,感觉一下子节约了好多青春。