查询改写 - 基于规则的查询改写 - 《OceanBase v1.4 官方教程》

子查询相关改写

优化器对于子查询一般使用嵌套执行的方式，也就是父查询每生成一行数据后，都需要执行一次子查询，使用这种方式需要多次执行子查询，执行效率很低，对于子查询的优化，一般会使用改写先转换为连接操作，可大大提高执行效率，主要好处有：

可避免子查询多次执行
优化器可根据统计信息选择更优的连接顺序和连接方法
子查询的连接条件、过滤条件改写为父查询的条件后，优化器可以进行进一步优化，比如条件下压等。

视图合并

视图合并是指将代表一个视图的子查询合并到包含该视图的查询中，视图合并后，有助于优化器增加连接顺序的选择、访问路径的选择以及进一步做其他改写操作，从而选择更优的执行计划。OceanBase支持对SPJ(select-project-join)的视图进行视图合并。如下示例中，SQL_A可改写为SQL_B,

create table t1 (c1 int, c2 int);
create table t2 (c1 int primary key, c2 int);
create table t3 (c1 int primary key, c2 int);
SQL_A: select t1.c1, v.c1 
       from t1, (select t2.c1, t3.c2 
                 from t2, t3 
                 where t2.c1 = t3.c1) v 
       where t1.c2 = v.c2;
<==>
SQL_B: select t1.c1, t2.c1 
       from t1, t2, t3 
       where t2.c1 = t3.c1 and t1.c2 = t3.c2;

如果SQL_A不进行改写，则其连接顺序有以下几种：

t1, v(t2,t3)
t1, v(t3,t2)
v(t2,t3), t1
v(t3,t2), t1进行视图合并改写后，可选择的连接顺序有：
t1, t2, t3
t1, t3, t2
t2, t1, t3
t2, t3, t1
t3, t1, t2
t3, t2, t1可以看出，进行view merge后，连接顺序可选择空间增加，对于复杂查询，视图合并后，对路径的选择和可改写的空间均会增大，从而使得优化器可生成更优的计划。

子查询展开

子查询展开是指将where条件中子查询提升到父查询中，并作为连接条件与父查询并列进行展开。转换后子查询将不存在，外层父查询中会变成多表连接。好处是优化器在进行路径选择，连接方法和连接排序是都会考虑到子查询中的表，从而可以获得更优的执行计划, 一般涉及的子查询表达式有not in、in、not exist、exist、any、all；

改写条件当生成的连接语句能返回与原始语句相同的行
展开为半连接(SEMI JOIN / ANTI JOIN)如下示例中，t2.c2不具有唯一性，改为semi join，该语句改写后执行计划为：

create table t1 (c1 int, c2 int);
create table t2 (c1 int primary key, c2 int);
explain select * from t1 where t1.c1 in  (select t2.c2 from t2)\G;
*************************** 1. row ***************************
Query Plan: =======================================
|ID|OPERATOR      |NAME|EST. ROWS|COST|
---------------------------------------
|0 |HASH SEMI JOIN|    |495      |3931|
|1 | TABLE SCAN   |t1  |1000     |499 |
|2 | TABLE SCAN   |t2  |1000     |433 |
=======================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c2]), other_conds(nil)
  1 - output([t1.c1], [t1.c2]), filter(nil),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c2]), filter(nil),
      access([t2.c2]), partitions(p0)

将上面子查询前面操作符改为not in后，可改写为anti join，具体计划如下：

explain select * from t1 where t1.c1 not in  (select t2.c2 from t2)\G;
*************************** 1. row ***************************
Query Plan: ================================================
|ID|OPERATOR             |NAME|EST. ROWS|COST  |
------------------------------------------------
|0 |NESTED-LOOP ANTI JOIN|    |0        |520245|
|1 | TABLE SCAN          |t1  |1000     |499   |
|2 | TABLE SCAN          |t2  |22       |517   |
================================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter(nil),
      conds(nil), nl_params_([t1.c1], [(T_OP_IS, t1.c1, NULL, 0)])
  1 - output([t1.c1], [t1.c2], [(T_OP_IS, t1.c1, NULL, 0)]), filter(nil),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c2]), filter([(T_OP_OR, ? = t2.c2, ?, (T_OP_IS, t2.c2, NULL, 0))]),
      access([t2.c2]), partitions(p0)

子查询展开为内连接将上面示例SQL_A中如果将t2.c2改为t2.c1，由于t2.c1 为主键，子查询输出具有唯一性，此时可以直接转换为内连接：

SQL_A: select * from t1 where t1.c1 in (select t2.c1 from t2);
<==>
SQL_B: select t1.* from t1, t2 where t.c1 = t2.c1;

以上SQL_A改写后计划如下：

explain select * from t1 where t1.c1 in (select t2.c1 from t2)\G;
*************************** 1. row ***************************
Query Plan: ====================================
|ID|OPERATOR   |NAME|EST. ROWS|COST|
------------------------------------
|0 |HASH JOIN  |    |1980     |3725|
|1 | TABLE SCAN|t2  |1000     |411 |
|2 | TABLE SCAN|t1  |1000     |499 |
====================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c1]), other_conds(nil)
  1 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0)
  2 - output([t1.c1], [t1.c2]), filter(nil),
      access([t1.c1], [t1.c2]), partitions(p0)

对于not in、in、not exist、exist、any、all都可以对应做以上类似的改写操作；

any/all使用MAX／MIN改写

对于any/all的子查询，如果子查询中没有group by子句、聚集函数以及having时，则以下这些表达式可以使用聚集函数MIN/MAX进行等价转换, 其中col_item为单独列且有非NULL属性：

val > ALL(SELECT col_item ...)  <==> val > ALL(SELECT MAX(col_item) ...);
val >= ALL(SELECT col_item ...) <==> val >= ALL(SELECT MAX(col_item) ...);
val < ALL(SELECT col_item ...)  <==> val < ALL(SELECT MIN(col_item) ...);
val <= ALL(SELECT col_item ...) <==> val <= ALL(SELECT MIN(col_item) ...);
val > ANY(SELECT col_item ...)  <==> val > ANY(SELECT MIN(col_item) ...);
val >= ANY(SELECT col_item ...) <==> val >= ANY(SELECT MIN(col_item) ...);
val < ANY(SELECT col_item ...)  <==> val < ANY(SELECT MAX(col_item) ...);
val <= ANY(SELECT col_item ...) <==> val <= ANY(SELECT MAX(col_item) ...);

将子查询更改为含有max/min的子查询后，再结合使用MAX/MIN的改写，可减少改写前对内表的多次扫描, 举例如下；

select c1 from t1 where c1 > any(select c1 from t2);
<==>
select c1 from t1 where c1 > any(select min(c1) from t2);

结合MAX/MIN的改写后, 可利用t2.c1的主键序将limit 1直接下压到table scan，将min值输出，执行计划为：

explain select c1 from t1 where c1 > any(select c1 from t2)\G;
*************************** 1. row ***************************
Query Plan: ===================================================
|ID|OPERATOR        |NAME          |EST. ROWS|COST|
---------------------------------------------------
|0 |SUBPLAN FILTER  |              |1        |73  |
|1 | TABLE SCAN     |t1            |1        |37  |
|2 | SCALAR GROUP BY|              |1        |37  |
|3 |  SUBPLAN SCAN  |subquery_table|1        |37  |
|4 |   TABLE SCAN   |t2            |1        |36  |
===================================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1]), filter([t1.c1 > ANY(subquery(1))]),
      exec_params_(nil), onetime_exprs_(nil), init_plan_idxs_([1])
  1 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0)
  2 - output([T_FUN_MIN(subquery_table.c1)]), filter(nil),
      group(nil), agg_func([T_FUN_MIN(subquery_table.c1)])
  3 - output([subquery_table.c1]), filter(nil),
      access([subquery_table.c1])
  4 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0),
      limit(1), offset(nil)

外连接消除

外连接操作可分为左外连接，右外连接和全外连接，连接过程中，外连接左右顺序不能变换，这使得优化器对连接顺序的选择受到限制。外连接消除是指将外连接转换成内连接，从而可以提供更多可选择的连接路径，供优化器考虑。

进行外连接消除，需要存在“空值拒绝条件”，即where条件中，存在当内表生成的值为null时，使得输出为false的条件。例如：

select t1.c1, t2.c2 from t1 left join t2 on t1.c2 = t2.c2

这是一个外连接，在其输出行中t2.c2可能为null。如果加上一个条件t2.c2 > 5，则通过该条件过滤后，t2.c1输出不可能为NULL，从而可以将外连接转换为内连接。

select t1.c1, t2.c2 from t1 left join t2 on t1.c2 = t2.c2 where t2.c2 > 5
<==>
select t1.c1, t2.c2 from t1 inner join t2 on t1.c2 = t2.c2 where t2.c2 > 5

简化条件改写

having条件消除

如果查询中没有聚集操操作及group by则having可以合并到where条件中，并将having条件删除，从而可以将having条件在where条件中同一管理优化，并进行进一步相关优化。

select * from t1, t2 where t1.c1 = t2.c1  having t1.c2 > 1
<==>
select * from t1, t2 where t1.c1 = t2.c1  and t1.c2 > 1

改写后计划如下, 可以看到t1.c2 > 1条件被下压到了TABLE SCAN层；

explain select * from t1, t2 where t1.c1 = t2.c1  having t1.c2 > 1\G;
*************************** 1. row ***************************
Query Plan: =========================================
|ID|OPERATOR        |NAME|EST. ROWS|COST|
-----------------------------------------
|0 |NESTED-LOOP JOIN|    |1        |59  |
|1 | TABLE SCAN     |t1  |1        |37  |
|2 | TABLE GET      |t2  |1        |36  |
=========================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2], [t2.c1], [t2.c2]), filter(nil),
      conds(nil), nl_params_([t1.c1])
  1 - output([t1.c1], [t1.c2]), filter([t1.c2 > 1]),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c1], [t2.c2]), filter(nil),
      access([t2.c1], [t2.c2]), partitions(p0)

等价关系推导

等价关系推导是指利用比较操作符的传递性，推倒出新的条件表达式, 从而减少需要处理的行数或者选择到更有效的索引。OceanBase可对等值连接进行推导，比如a ＝ b and a > 1 可以推导出 a ＝ b and a > 1 and b > 1，如果b上有索引，且b > 1在该索引选择率很低，则可以大大提升访问b列所在表的性能。

如下举例可以看出，条件t1.c1 = t2.c2 and t1.c1 > 2，等价推导后为t1.c1 = t2.c2 and t1.c1 > 2 and t2.c2 > 2, 从计划中可以看到t2.c2已下压到TABLE SCAN，且使用t2.c2对应的索引。

create table t1(c1 int primary key, c2 int);
create table t2(c1 int primary key, c2 int, c3 int, key idx_c2(c2));
explain extended_noaddr select t1.c1, t2.c2 
                        from t1, t2 
                        where t1.c1 = t2.c2 and t1.c1 > 2\G;
*************************** 1. row ***************************
Query Plan: ==========================================
|ID|OPERATOR   |NAME      |EST. ROWS|COST|
------------------------------------------
|0 |MERGE JOIN |          |5        |78  |
|1 | TABLE SCAN|t2(idx_c2)|5        |37  |
|2 | TABLE SCAN|t1        |3        |37  |
==========================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t2.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c2]), other_conds(nil)
  1 - output([t2.c2]), filter(nil),
      access([t2.c2]), partitions(p0),
      is_index_back=false,
      range_key([t2.c2], [t2.c1]), range(2,MAX ; MAX,MAX),
      range_cond([t2.c2 > 2])
  2 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0),
      is_index_back=false,
      range_key([t1.c1]), range(2 ; MAX),
      range_cond([t1.c1 > 2])

恒真/假消除

对于如下恒真恒假条件：

false and expr ＝恒false；
true or expr = 恒true；可以将这些恒真恒假条件消除，比如以下SQL， where 0 > 1 and c1 ＝ 3，由于0 > 1使得and恒假，所以该SQL不用执行，可直接返回，从而加快查询的执行。

explain extended_noaddr select * from t1 where 0 > 1 and c1 = 3\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |0        |38  |
===================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter([0], [t1.c1 = 3]), startup_filter([0]),
      access([t1.c1], [t1.c2]), partitions(p0),
      is_index_back=false, filter_before_indexback[false,false],
      range_key([t1.__pk_increment], [t1.__pk_cluster_id], [t1.__pk_partition_id]),
      range(MAX,MAX,MAX ; MIN,MIN,MIN)always false

非SPJ的改写

冗余排序消除

冗余排序消除是指删除order item中不需要的项，减少排序开销，以下三种情况可进行排序消除：

ORDER BY表达式列表中有重复列，可进行去重后排序；

Select * from t1 where c2 = 5 order by c1, c1, c2, c3 
<==>
Select * from t1 where c2 = 5 order by c1, c2, c3

ORDER BY列中存在where中有单值条件的列，则该列排序可删除

Select * from t1 where c2 = 5 order by c1, c2, c3
<==>
Select * from t1 where c2 = 5 order by c1, c3

如果本层查询有order by但是没有limit，且本层查询位于父查询的集合操作中，则order by可消除。因为对两个有序的集合做union操作，其结果是乱序的。但是如果order by中有limit，则语义是取最大/小的N个，此时不能消除order by，否则有语义错误。

(select c1,c2 from t1 order by c1) union (select c3,c4 from t2 order by c3)
<==>
(select c1,c2 from t1) union (select c3,c4 from t2)

limit下压

limit下压改写是指将limit下降到子查询中，OceanBase现在支持在不改变语义的情况下，将limit下压到视图(示例1)及union对应子查询(示例2)中。

select * from (select * from t1 order by c1) a limit 1; 
<==>
select * from (select * from t1 order by c1 limit 1) a limit 1;

(select c1,c2 from t1) union all (select c3,c4 from t2) limit 5 
<==>
(select c1,c2 from t1 limit 5) union all (select c3,c4 from t2 limit 5) limit 5

distinct消除

如果select item中只包含常量，则可以消除distinct, 并加上limit 1；

Select distinct 1,2 from t1 
<==> 
Select 1,2 from t1 limit 1
create table t1 (c1 int primary key, c2 int);
explain extended_noaddr Select distinct 1,2 from t1\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |1        |36  |
===================================
Outputs & filters:
-------------------------------------
  0 - output([1], [2]), filter(nil),
      access([t1.c1]), partitions(p0),
      limit(1), offset(nil),
      is_index_back=false,
      range_key([t1.c1]), range(MIN ; MAX)always true

如果select item中包含确保唯一性约束的列，则distinct 能够消除, 如下举例中 (c1, c2)为主键，可确保c1, c2, c3唯一性，从而distinct 可消除；

create table t2(c1 int, c2 int, c3 int, primary key(c1, c2));
select distinct c1, c2, c3 from t2
<==>
select c1, c2 c3 from t2;
 explain select distinct c1, c2, c3 from t2\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t2  |1000     |455 |
===================================
Outputs & filters:
-------------------------------------
  0 - output([t2.c1], [t2.c2], [t2.c3]), filter(nil),
      access([t2.c1], [t2.c2], [t2.c3]), partitions(p0)

MIN/MAX改写

当MIN/MAX函数中参数为索引前缀列, 且不含group by时，可将该scalar aggregate转换为走索引扫描1行的情况，举例如下：

create table t1 (c1 int primary key, c2 int, c3 int, key idx_c2_c3(c2,c3));
select min(c2) from t1;
<==>
select min(c2) from (select c2 from t2 order by c2 limit 1) as t;
explain select min(c2) from t1\G;
*************************** 1. row ***************************
Query Plan: ==================================================
|ID|OPERATOR       |NAME          |EST. ROWS|COST|
--------------------------------------------------
|0 |SCALAR GROUP BY|              |1        |37  |
|1 | SUBPLAN SCAN  |subquery_table|1        |37  |
|2 |  TABLE SCAN   |t1(idx_c2_c3) |1        |36  |
==================================================
Outputs & filters:
-------------------------------------
  0 - output([T_FUN_MIN(subquery_table.c2)]), filter(nil),
      group(nil), agg_func([T_FUN_MIN(subquery_table.c2)])
  1 - output([subquery_table.c2]), filter(nil),
      access([subquery_table.c2])
  2 - output([t1.c2]), filter([(T_OP_IS_NOT, t1.c2, NULL, 0)]),
      access([t1.c2]), partitions(p0),
      limit(1), offset(nil)

如果select MIN/MAX的参数为常量，且包含group by则可已将MIN/MAX改为常量，从而减少MIN/MAX的计算开销。

select max(1) from t1 group by c1;
<==>
select 1 from t1 group by c1;
explain extended_noaddr select max(1) from t1 group by c1\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |1000     |411 |
===================================
Outputs & filters:
-------------------------------------
  0 - output([1]), filter(nil),
      access([t1.c1]), partitions(p0),
      is_index_back=false,
      range_key([t1.c1]), range(MIN ; MAX)always true

如果select MIN/MAX的参数为常量，且不含group by，可进行如下改写，从而走索引只需扫描1行。

select max(1) from t1;
<==> 
select max(t.a) from (select 1 as a from t1 limit 1) t;
explain extended_noaddr select max(1) from t1\G;
*************************** 1. row ***************************
Query Plan: ==================================================
|ID|OPERATOR       |NAME          |EST. ROWS|COST|
--------------------------------------------------
|0 |SCALAR GROUP BY|              |1        |37  |
|1 | SUBPLAN SCAN  |subquery_table|1        |37  |
|2 |  TABLE SCAN   |t1            |1        |36  |
==================================================
Outputs & filters:
-------------------------------------
  0 - output([T_FUN_MAX(subquery_table.subquery_col_alias)]), filter(nil),
      group(nil), agg_func([T_FUN_MAX(subquery_table.subquery_col_alias)])
  1 - output([subquery_table.subquery_col_alias]), filter(nil),
      access([subquery_table.subquery_col_alias])
  2 - output([1]), filter(nil),
      access([t1.c1]), partitions(p0),
      limit(1), offset(nil),
      is_index_back=false,
      range_key([t1.c1]), range(MIN ; MAX)always true