查询改写 - 基于规则的查询改写 - 《OceanBase v2.2.50 官方教程》

子查询相关改写
外连接消除
简化条件改写
非SPJ的改写

子查询相关改写

优化器对于子查询一般使用嵌套执行的方式，也就是父查询每生成一行数据后，都需要执行一次子查询，使用这种方式需要多次执行子查询，执行效率很低，对于子查询的优化，一般会使用改写先转换为连接操作，可大大提高执行效率，主要优点如下：

可避免子查询多次执行；
优化器可根据统计信息选择更优的连接顺序和连接方法；
子查询的连接条件、过滤条件改写为父查询的条件后，优化器可以进行进一步优化，比如条件下压等。

视图合并

视图合并是指将代表一个视图的子查询合并到包含该视图的查询中，视图合并后，有助于优化器增加连接顺序的选择、访问路径的选择以及进一步做其他改写操作，从而选择更优的执行计划。OceanBase 支持对 SPJ (select-project-join) 的视图进行视图合并。
如下示例为 SQL_A 改写为 SQL_B：

create table t1 (c1 int, c2 int);
create table t2 (c1 int primary key, c2 int);
create table t3 (c1 int primary key, c2 int);
SQL_A: select t1.c1, v.c1 
       from t1, (select t2.c1, t3.c2 
                 from t2, t3 
                 where t2.c1 = t3.c1) v 
       where t1.c2 = v.c2;
<==>
SQL_B: select t1.c1, t2.c1 
       from t1, t2, t3 
       where t2.c1 = t3.c1 and t1.c2 = t3.c2;

如果 SQL_A 不进行改写，则其连接顺序有以下几种：

t1, v(t2,t3)
t1, v(t3,t2)
v(t2,t3), t1
v(t3,t2), t1

进行视图合并改写后，可选择的连接顺序有：

t1, t2, t3
t1, t3, t2
t2, t1, t3
t2, t3, t1
t3, t1, t2
t3, t2, t1

可以看出，进行 view merge 后，连接顺序可选择空间增加，对于复杂查询，视图合并后，对路径的选择和可改写的空间均会增大，从而使得优化器可生成更优的计划。

子查询展开

子查询展开是指将 where 条件中子查询提升到父查询中，并作为连接条件与父查询并列进行展开。转换后子查询将不存在，外层父查询中会变成多表连接。好处是优化器在进行路径选择，连接方法和连接排序是都会考虑到子查询中的表，从而可以获得更优的执行计划, 一般涉及的子查询表达式有 not in、in、not exist、exist、any、all。

改写条件
当生成的连接语句能返回与原始语句相同的行。
展开为半连接(SEMI JOIN / ANTI JOIN)
如下例所示，t2.c2 不具有唯一性，改为 semi join，该语句改写后执行计划为：

create table t1 (c1 int, c2 int);
create table t2 (c1 int primary key, c2 int);
explain select * from t1 where t1.c1 in  (select t2.c2 from t2)\G;
*************************** 1. row ***************************
Query Plan: =======================================
|ID|OPERATOR      |NAME|EST. ROWS|COST|
---------------------------------------
|0 |HASH SEMI JOIN|    |495      |3931|
|1 | TABLE SCAN   |t1  |1000     |499 |
|2 | TABLE SCAN   |t2  |1000     |433 |
=======================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c2]), other_conds(nil)
  1 - output([t1.c1], [t1.c2]), filter(nil),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c2]), filter(nil),
      access([t2.c2]), partitions(p0)

将查询前面操作符改为 not in 后，可改写为 anti join，具体计划如下例所示：

explain select * from t1 where t1.c1 not in  (select t2.c2 from t2)\G;
*************************** 1. row ***************************
Query Plan: ================================================
|ID|OPERATOR             |NAME|EST. ROWS|COST  |
------------------------------------------------
|0 |NESTED-LOOP ANTI JOIN|    |0        |520245|
|1 | TABLE SCAN          |t1  |1000     |499   |
|2 | TABLE SCAN          |t2  |22       |517   |
================================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter(nil),
      conds(nil), nl_params_([t1.c1], [(T_OP_IS, t1.c1, NULL, 0)])
  1 - output([t1.c1], [t1.c2], [(T_OP_IS, t1.c1, NULL, 0)]), filter(nil),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c2]), filter([(T_OP_OR, ? = t2.c2, ?, (T_OP_IS, t2.c2, NULL, 0))]),
      access([t2.c2]), partitions(p0)

子查询展开为内连接
上面示例的 SQL_A 中如果将 t2.c2 改为 t2.c1，由于 t2.c1 为主键，子查询输出具有唯一性，此时可以直接转换为内连接，如下例所示：

SQL_A: select * from t1 where t1.c1 in (select t2.c1 from t2);
<==>
SQL_B: select t1.* from t1, t2 where t.c1 = t2.c1;

SQL_A 改写后计划如下例所示：

explain select * from t1 where t1.c1 in (select t2.c1 from t2)\G;
*************************** 1. row ***************************
Query Plan: ====================================
|ID|OPERATOR   |NAME|EST. ROWS|COST|
------------------------------------
|0 |HASH JOIN  |    |1980     |3725|
|1 | TABLE SCAN|t2  |1000     |411 |
|2 | TABLE SCAN|t1  |1000     |499 |
====================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c1]), other_conds(nil)
  1 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0)
  2 - output([t1.c1], [t1.c2]), filter(nil),
      access([t1.c1], [t1.c2]), partitions(p0)

对于not in、in、not exist、exist、any、all 都可以对应做类似的改写操作。

any/all 使用 MAX/MIN 改写

对于 any/all 的子查询，如果子查询中没有 group by 子句、聚集函数以及 having 时，则以下表达式可以使用聚集函数 MIN/MAX 进行等价转换, 其中 col_item 为单独列且有非 NULL 属性：

val > ALL(SELECT col_item ...)  <==> val > ALL(SELECT MAX(col_item) ...);
val >= ALL(SELECT col_item ...) <==> val >= ALL(SELECT MAX(col_item) ...);
val < ALL(SELECT col_item ...)  <==> val < ALL(SELECT MIN(col_item) ...);
val <= ALL(SELECT col_item ...) <==> val <= ALL(SELECT MIN(col_item) ...);
val > ANY(SELECT col_item ...)  <==> val > ANY(SELECT MIN(col_item) ...);
val >= ANY(SELECT col_item ...) <==> val >= ANY(SELECT MIN(col_item) ...);
val < ANY(SELECT col_item ...)  <==> val < ANY(SELECT MAX(col_item) ...);
val <= ANY(SELECT col_item ...) <==> val <= ANY(SELECT MAX(col_item) ...);

将子查询更改为含有 max/min 的子查询后，再结合使用 MAX/MIN 的改写，可减少改写前对内表的多次扫描, 如下例所示：

select c1 from t1 where c1 > any(select c1 from t2);
<==>
select c1 from t1 where c1 > any(select min(c1) from t2);

结合 MAX/MIN 的改写后, 可利用 t2.c1 的主键序将 limit 1 直接下压到 table scan，将 min 值输出，执行计划为：

explain select c1 from t1 where c1 > any(select c1 from t2)\G;
*************************** 1. row ***************************
Query Plan: ===================================================
|ID|OPERATOR        |NAME          |EST. ROWS|COST|
---------------------------------------------------
|0 |SUBPLAN FILTER  |              |1        |73  |
|1 | TABLE SCAN     |t1            |1        |37  |
|2 | SCALAR GROUP BY|              |1        |37  |
|3 |  SUBPLAN SCAN  |subquery_table|1        |37  |
|4 |   TABLE SCAN   |t2            |1        |36  |
===================================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1]), filter([t1.c1 > ANY(subquery(1))]),
      exec_params_(nil), onetime_exprs_(nil), init_plan_idxs_([1])
  1 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0)
  2 - output([T_FUN_MIN(subquery_table.c1)]), filter(nil),
      group(nil), agg_func([T_FUN_MIN(subquery_table.c1)])
  3 - output([subquery_table.c1]), filter(nil),
      access([subquery_table.c1])
  4 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0),
      limit(1), offset(nil)

外连接消除

外连接操作可分为左外连接，右外连接和全外连接，连接过程中，外连接左右顺序不能变换，这使得优化器对连接顺序的选择受到限制。外连接消除是指将外连接转换成内连接，从而可以提供更多可选择的连接路径，供优化器考虑。

进行外连接消除，需要存在“空值拒绝条件”，即 where 条件中，存在当内表生成的值为 null 时，使得输出为 false 的条件。

如下例所示：

select t1.c1, t2.c2 from t1 left join t2 on t1.c2 = t2.c2

这是一个外连接，在其输出行中 t2.c2 可能为 null。如果加上一个条件 t2.c2 > 5，则通过该条件过滤后，t2.c1 输出不可能为 NULL，从而可以将外连接转换为内连接。

select t1.c1, t2.c2 from t1 left join t2 on t1.c2 = t2.c2 where t2.c2 > 5
<==>
select t1.c1, t2.c2 from t1 inner join t2 on t1.c2 = t2.c2 where t2.c2 > 5

简化条件改写

having条件消除

如果查询中没有聚集操操作及 group by 则 having 可以合并到 where 条件中，并将 having 条件删除，从而可以将 having 条件在 where 条件中同一管理优化，并进行进一步相关优化。

select * from t1, t2 where t1.c1 = t2.c1  having t1.c2 > 1
<==>
select * from t1, t2 where t1.c1 = t2.c1  and t1.c2 > 1

改写后计划如下例所示, t1.c2 > 1 条件被下压到了 TABLE SCAN 层。

explain select * from t1, t2 where t1.c1 = t2.c1  having t1.c2 > 1\G;
*************************** 1. row ***************************
Query Plan: =========================================
|ID|OPERATOR        |NAME|EST. ROWS|COST|
-----------------------------------------
|0 |NESTED-LOOP JOIN|    |1        |59  |
|1 | TABLE SCAN     |t1  |1        |37  |
|2 | TABLE GET      |t2  |1        |36  |
=========================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2], [t2.c1], [t2.c2]), filter(nil),
      conds(nil), nl_params_([t1.c1])
  1 - output([t1.c1], [t1.c2]), filter([t1.c2 > 1]),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c1], [t2.c2]), filter(nil),
      access([t2.c1], [t2.c2]), partitions(p0)

等价关系推导

等价关系推导是指利用比较操作符的传递性，推倒出新的条件表达式, 从而减少需要处理的行数或者选择到更有效的索引。OceanBase 可对等值连接进行推导，比如 a ＝ b and a > 1 可以推导出 a ＝ b and a > 1 and b > 1，如果 b 上有索引，且 b > 1 在该索引选择率很低，则可以大大提升访问 b 列所在表的性能。

如下例所示，条件 t1.c1 = t2.c2 and t1.c1 > 2，等价推导后为 t1.c1 = t2.c2 and t1.c1 > 2 and t2.c2 > 2, 从计划中可以看到 t2.c2 已下压到 TABLE SCAN，且使用 t2.c2 对应的索引。

create table t1(c1 int primary key, c2 int);
create table t2(c1 int primary key, c2 int, c3 int, key idx_c2(c2));
explain extended_noaddr select t1.c1, t2.c2 
                        from t1, t2 
                        where t1.c1 = t2.c2 and t1.c1 > 2\G;
*************************** 1. row ***************************
Query Plan: ==========================================
|ID|OPERATOR   |NAME      |EST. ROWS|COST|
------------------------------------------
|0 |MERGE JOIN |          |5        |78  |
|1 | TABLE SCAN|t2(idx_c2)|5        |37  |
|2 | TABLE SCAN|t1        |3        |37  |
==========================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t2.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c2]), other_conds(nil)
  1 - output([t2.c2]), filter(nil),
      access([t2.c2]), partitions(p0),
      is_index_back=false,
      range_key([t2.c2], [t2.c1]), range(2,MAX ; MAX,MAX),
      range_cond([t2.c2 > 2])
  2 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0),
      is_index_back=false,
      range_key([t1.c1]), range(2 ; MAX),
      range_cond([t1.c1 > 2])

恒真/假消除

对于如下恒真恒假条件：

false and expr ＝恒false；
true or expr = 恒true；

可以将这些恒真恒假条件消除，比如以下 SQL， where 0 > 1 and c1 ＝ 3，由于0 > 1使得 and 恒假，所以该 SQL 不用执行，可直接返回，从而加快查询的执行。

explain extended_noaddr select * from t1 where 0 > 1 and c1 = 3\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |0        |38  |
===================================
Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter([0], [t1.c1 = 3]), startup_filter([0]),
      access([t1.c1], [t1.c2]), partitions(p0),
      is_index_back=false, filter_before_indexback[false,false],
      range_key([t1.__pk_increment], [t1.__pk_cluster_id], [t1.__pk_partition_id]),
      range(MAX,MAX,MAX ; MIN,MIN,MIN)always false

非SPJ的改写

冗余排序消除

冗余排序消除是指删除 order item 中不需要的项，减少排序开销，以下三种情况可进行排序消除：

ORDER BY 表达式列表中有重复列，可进行去重后排序；

Select * from t1 where c2 = 5 order by c1, c1, c2, c3 
<==>
Select * from t1 where c2 = 5 order by c1, c2, c3

ORDER BY 列中存在 where 中有单值条件的列，则该列排序可删除；

Select * from t1 where c2 = 5 order by c1, c2, c3
<==>
Select * from t1 where c2 = 5 order by c1, c3

如果本层查询有 order by 但是没有 limit，且本层查询位于父查询的集合操作中，则 order by 可消除。因为对两个有序的集合做 union 操作，其结果是乱序的。但是如果 order by 中有 limit，则语义是取最大/小的 N 个，此时不能消除order by，否则有语义错误。

(select c1,c2 from t1 order by c1) union (select c3,c4 from t2 order by c3)
<==>
(select c1,c2 from t1) union (select c3,c4 from t2)

limit 下压

limit 下压改写是指将 limit 下降到子查询中，OceanBase 现在支持在不改变语义的情况下，将 limit 下压到视图(示例1)及union 对应子查询(示例2)中。

select * from (select * from t1 order by c1) a limit 1; 
<==>
select * from (select * from t1 order by c1 limit 1) a limit 1;

(select c1,c2 from t1) union all (select c3,c4 from t2) limit 5 
<==>
(select c1,c2 from t1 limit 5) union all(select c3,c4 from t2 limit 5) limit 5

distinct消除

如果 select item 中只包含常量，则可以消除 distinct, 并加上 limit 1。

Select distinct 1,2 from t1 
<==> 
Select 1,2 from t1 limit 1
create table t1 (c1 int primary key, c2 int);
explain extended_noaddr Select distinct 1,2 from t1\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |1        |36  |
===================================
Outputs & filters:
-------------------------------------
  0 - output([1], [2]), filter(nil),
      access([t1.c1]), partitions(p0),
      limit(1), offset(nil),
      is_index_back=false,
      range_key([t1.c1]), range(MIN ; MAX)always true

如果 select item 中包含确保唯一性约束的列，则 distinct 能够消除, 如下举例中 (c1, c2)为主键，可确保 c1, c2, c3 唯一性，从而 distinct 可消除。

create table t2(c1 int, c2 int, c3 int, primary key(c1, c2));
select distinct c1, c2, c3 from t2
<==>
select c1, c2 c3 from t2;
 explain select distinct c1, c2, c3 from t2\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t2  |1000     |455 |
===================================
Outputs & filters:
-------------------------------------
  0 - output([t2.c1], [t2.c2], [t2.c3]), filter(nil),
      access([t2.c1], [t2.c2], [t2.c3]), partitions(p0)

MIN/MAX改写

当 MIN/MAX 函数中参数为索引前缀列, 且不含 group by 时，可将该 scalar aggregate 转换为走索引扫描1行的情况，如下例所示：

create table t1 (c1 int primary key, c2 int, c3 int, key idx_c2_c3(c2,c3));
select min(c2) from t1;
<==>
select min(c2) from (select c2 from t2 order by c2 limit 1) as t;
explain select min(c2) from t1\G;
*************************** 1. row ***************************
Query Plan: ==================================================
|ID|OPERATOR       |NAME          |EST. ROWS|COST|
--------------------------------------------------
|0 |SCALAR GROUP BY|              |1        |37  |
|1 | SUBPLAN SCAN  |subquery_table|1        |37  |
|2 |  TABLE SCAN   |t1(idx_c2_c3) |1        |36  |
==================================================
Outputs & filters:
-------------------------------------
  0 - output([T_FUN_MIN(subquery_table.c2)]), filter(nil),
      group(nil), agg_func([T_FUN_MIN(subquery_table.c2)])
  1 - output([subquery_table.c2]), filter(nil),
      access([subquery_table.c2])
  2 - output([t1.c2]), filter([(T_OP_IS_NOT, t1.c2, NULL, 0)]),
      access([t1.c2]), partitions(p0),
      limit(1), offset(nil)

如果 select MIN/MAX 的参数为常量，且包含 group by 则可已将 MIN/MAX 改为常量，从而减少 MIN/MAX 的计算开销。

select max(1) from t1 group by c1;
<==>
select 1 from t1 group by c1;
explain extended_noaddr select max(1) from t1 group by c1\G;
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |1000     |411 |
===================================
Outputs & filters:
-------------------------------------
  0 - output([1]), filter(nil),
      access([t1.c1]), partitions(p0),
      is_index_back=false,
      range_key([t1.c1]), range(MIN ; MAX)always true

如果 select MIN/MAX 的参数为常量，且不含 group by，可按如下例所示进行改写，从而走索引只需扫描1行。

select max(1) from t1;
<==> 
select max(t.a) from (select 1 as a from t1 limit 1) t;
explain extended_noaddr select max(1) from t1\G;
*************************** 1. row ***************************
Query Plan: ==================================================
|ID|OPERATOR       |NAME          |EST. ROWS|COST|
--------------------------------------------------
|0 |SCALAR GROUP BY|              |1        |37  |
|1 | SUBPLAN SCAN  |subquery_table|1        |37  |
|2 |  TABLE SCAN   |t1            |1        |36  |
==================================================
Outputs & filters:
-------------------------------------
  0 - output([T_FUN_MAX(subquery_table.subquery_col_alias)]), filter(nil),
      group(nil), agg_func([T_FUN_MAX(subquery_table.subquery_col_alias)])
  1 - output([subquery_table.subquery_col_alias]), filter(nil),
      access([subquery_table.subquery_col_alias])
  2 - output([1]), filter(nil),
      access([t1.c1]), partitions(p0),
      limit(1), offset(nil),
      is_index_back=false,
      range_key([t1.c1]), range(MIN ; MAX)always true