问题定义

Join graph

为了应用经典的图算法,我们需要通过join构建出一张图,该图的定义为

  • join语句中的每一个table都可以作为一个节点
  • join语句中的每一个predicate都可以作为一条边

Join tree

在经典的volcano模型中,每个query 会转成一颗树去执行,而join 部分也会转为其中的子树。不同的join tree对应不同的执行顺序,经典的join tree可以分为三类:left-deep、zig-zag 和 bushy

问题叙述

输入:一个join graph $G = (V, E)$

输出:一个join tree $T$,使得$C(T)$最小

DP算法(bottom-up)

Join Ordering的动态规划算法可以避免无谓的计算,比如:

更形式化的描述是:

如果一个join tree T是最优的,那么T的每个子树也是最优的

根据上面的理论,我们可以确定DP算法的大致思路:

  • 将大图划分为子图,首先求得子图的最优join tree
  • 通过组合子图的join tree来进一步求得大图的join tree

比如一个join graph其可能有四个表${R_1, R_2, R_3, R_4}$,动态规划算法流程如下

  • 我们先考虑集合大小为1的解
  • 当求解集合大小为$k$ 的解时,我们会将其划分为一对子集来求解

pic

该算法需要注意

  1. 我们在将子集对的join tree合并成大的时,需要判断两个子集是否可以连接
  2. 不同的join tree对应不同的划分方式
    1. 如果是left-deep tree,我们只需要结合大小为1大小为k-1的子树
    2. 如果是bushy tree,如果需要考虑所有的组合情况

Dpccp:基于连通子图的枚举

上述算法考虑了所有可能的组合方案。然而,当join graph不是clique的时候,这些方案有很多不合法(两个子图不连通),为了避免这种不合法的方案,Dpccp被提出。 Dpccp的核心策略就是我们只会枚举一对连通的子图。更形式化的说法是

对于一个图$G=(V, E); V是点的集合,E是边的集合$,假设我们已经解出了一个子图$S_1$,我们需要枚举所有可能的子图$S_2$,$S_2$需要满足以下条件:

  1. $S_1$和 $S_2$至少存在一条边相连:$\exist u \in S_1, \exist v \in S_2, (u, v) \in E$
  2. $S_1$和$S_2$不相交
  3. $S_1$和 $S_2$是连通图(连通图也就是其内部任意两个点都可以找到一条path连接)

而根据动态规划算法的要求,$S_2$还需要满足:

  1. $S_2$存在一个最优解

在Dpccp里$S_1$被称之为csg(connect-subgraph),$S_2$被称之为cmp(complement),$(S_1, S_2)$也被称之为csg-cmp pair。由此我们可以得到基本的算法:pic

枚举补图

如果给定一个连通子图$S_1$,DPccp如何枚举它的子图?在算法中,我们是通过邻居节点来枚举补图的,步骤如下:

  • 找到该csg的邻居节点:即该节点和$S_1$中的节点存在一条边相连
  • 以邻居节点为起点,枚举出候选节点的所有的连通子图

pic

枚举顺序

在枚举csg-cmp-pair的时候,我们需要确定一个枚举顺序,使得

  • 所有的子图和补图都枚举一遍
  • 在枚举一个csg-cmp-pair $<S_1,S_2>$的时候,保证我们已经有$S_1, S_2$的最优解

为了满足上述要求Dpccp会给所有节点排序编号,即使得所有节点有全序关系。 然后按照如下步骤枚举子图$S_1$:

  1. 按照从大到小的顺序确定一个初始节点$S ={u}$,其编号为$#u$
  2. 找到这个集合$S$的邻居集 $N$,将$N$的所有的子集加入$S$以扩张$S$
    1. $N$中所有节点需要满足其编号 $\le #u$
  3. 递归的进行第二步

整体的算法步骤如下:

  1. 找到 csg
  2. 枚举csg的所有补图
  3. 用邻居扩张csg,然后对新的csg应用步骤2

问题

  • 在join中,谓词可能会引用多个表:$T_1.x + T_2.y = T_3.x + T_4.y$(HyperGraph)

Dphyp: 基于hyper graph的枚举

为了解决谓词多多引用问题,Dphyp定义了HyperGraph。HyperGraph与上述Join Graph不同的是

在HyperGraph中,每个节点是表的集合,其可能包括一个或多个表,我们将其称之为hyper node

HyperGraph相比较普通的图而言,我们需要注意

  • 在扩张子图的时候,每个hyper node要么全加入,要么不加入
  • Hyper node可能互相包含或者重叠

在Dpccp的时候,我们是通过找到Neighborhood节点,然后逐步加入Neighborhood节点来扩张子图的。所以我们的问题就是:

  1. 对于给定的子图 $S_1$,如何正确的计算它的相邻集合
  2. 给定相邻集合,我们如何得到合法的补图

编号

对于HyperGraph中的所有表(不是节点),我们同样将他们编号得到一个全序关系。 对于每个hyper node,其编号为包含表的最小编号,记作: pic

相邻集合

首先我们给出对于子图$S$相邻集合的定义,其中$X$是排除掉的节点(比如编号小于起点的表) pic 为了简化实现,我们可以只记录每个hyper node的编号,也就是 pic

枚举补图

这样我们便可通过将这些编号作为起点来进一步的枚举出所有补图 伪代码如下,该代码的功能是给定$S_1$,枚举其所有补图$S_2$。(其中$B_{min}(S_1) = {t | #t \geq min(S_1)}$):

  1. 计算邻居节点
  2. 列举所有可以成为cmp的邻居节点
  3. 以邻居节点为初始集,扩张新的补图 image.png 其中EnumerateCmpRec函数的功能是扩张补图$S_2$:

  4. 找到出事集的邻居节点集合

  5. 将集合加入$S_2$扩张 $S_2$
    1. 列举出所有合法的$S_2$:
    2. 进一步扩张$S_2$

image.png

去重

因为hyper node可能会存在包含或者重叠的关系,所以上述的枚举可能会做很多的冗余工作(比如对于$u_1 ={R_1, R_2, R_3}, u_2={R_2, R_3}$就会重复枚举${R_2, R_3}$)。

所以为了提高效率,我们需要尽量避免这种重复过程。已知在上述过程中,我们会让相邻表作为起点,然后逐步扩张它来生成子图。所以如果有多个hyper node重叠/包含,我们只需要保存一个表即可。

pic

代码实现

数据结构

在Hyper graph中,没个表用其编号(实际就是索引)表示

每个边包含其链接的hyper node,每个hyper node使用位图表示(如果这条边的节点只包含一个表,被称为simple edge)

  1. struct Hyperedge {
  2. // The endpoints (hypernodes) of this hyperedge. See the comment about
  3. // duplicated edges in Node.
  4. //
  5. // left and right may not overlap, and both must have at least one bit set.
  6. NodeMap left;
  7. NodeMap right;
  8. };

每个节点Node包含所有引用自己的边:

  1. struct Node {
  2. std::vector<unsigned> complex_edges, simple_edges;
  3. NodeMap simple_neighborhood = 0;
  4. };

HyperGraph就是持有所有边和节点的结构体:

  1. struct Hypergraph {
  2. std::vector<Node> nodes; // Maximum 8*sizeof(NodeMap) elements.
  3. std::vector<Hyperedge> edges;
  4. };

流程入口

入口代码:

  1. AccessPath *FindBestQueryPlan(THD *thd, Query_block *query_block, string *trace) {
  2. MakeJoinHypergraph();
  3. if (EnumerateAllConnectedPartitions()) {
  4. SimplifyQueryGraph();
  5. EnumerateAllConnectedPartitions());
  6. }
  7. .....construct access path
  8. }

构建HyperGraph

自底向上的构建节点和边

  1. bool MakeJoinHypergraph(THD *thd, string *trace, JoinHypergraph *graph) {
  2. // 1. 构建expression tree
  3. RelationalExpression *root =
  4. MakeRelationalExpressionFromJoinList(thd, query_block->top_join_list);
  5. // 2. 谓词下推:对于无法推到 join/table 中到where 条件,会构建出inner join
  6. PushDownJoinConditions(thd, root, table_num_to_companion_set, &table_filters,
  7. &cycle_inducing_edges, trace);
  8. // 3. 构建HyperGraph
  9. MakeJoinGraphFromRelationalExpression(thd, root, trace, graph);
  10. }
  1. void MakeJoinGraphFromRelationalExpression(THD *thd, RelationalExpression *expr,
  2. string *trace,
  3. JoinHypergraph *graph) {
  4. if (expr->type == RelationalExpression::TABLE) {
  5. graph->nodes.push_back(JoinHypergraph::Node{expr});
  6. return;
  7. }
  8. MakeJoinGraphFromRelationalExpression(thd, expr->left, trace, graph);
  9. MakeJoinGraphFromRelationalExpression(thd, expr->right, trace, graph);
  10. const Hyperedge edge = FindHyperedgeAndJoinConflicts(thd, used_nodes, expr);
  11. graph->graph.AddEdge(edge.left, edge.right);
  12. }

构建边(基于CD-C冲突检测算法)

  1. Hyperedge FindHyperedgeAndJoinConflicts(THD *thd, NodeMap used_nodes,
  2. Mem_root_array<ConflictRule> conflict_rules(thd->mem_root);
  3. ForEachJoinOperator(
  4. expr->left, [expr, &conflict_rules](RelationalExpression *child) {
  5. if (!OperatorsAreAssociative(*child, *expr)) {
  6. const NodeMap left = IntersectIfNotDegenerate(
  7. child->conditions_used_tables, child->left->nodes_in_subtree);
  8. conflict_rules.push_back(
  9. ConflictRule{child->right->nodes_in_subtree, left});
  10. }
  11. if (!OperatorsAreLeftAsscom(*child, *expr)) {
  12. const NodeMap right = IntersectIfNotDegenerate(
  13. child->conditions_used_tables, child->right->nodes_in_subtree);
  14. conflict_rules.push_back(
  15. ConflictRule{child->left->nodes_in_subtree, right});
  16. }
  17. });
  18. ForEachJoinOperator(
  19. expr->right, [expr, &conflict_rules](RelationalExpression *child) {
  20. if (!OperatorsAreAssociative(*expr, *child)) {
  21. const NodeMap right = IntersectIfNotDegenerate(
  22. child->conditions_used_tables, child->right->nodes_in_subtree);
  23. conflict_rules.push_back(
  24. ConflictRule{child->left->nodes_in_subtree, right});
  25. }
  26. if (!OperatorsAreRightAsscom(*expr, *child)) {
  27. const NodeMap left = IntersectIfNotDegenerate(
  28. child->conditions_used_tables, child->left->nodes_in_subtree);
  29. conflict_rules.push_back(
  30. ConflictRule{child->right->nodes_in_subtree, left});
  31. }
  32. });
  33. NodeMap total_eligibility_set =
  34. AbsorbConflictRulesIntoTES(used_nodes, &conflict_rules);
  35. if (!Overlaps(total_eligibility_set, expr->left->nodes_in_subtree)) {
  36. total_eligibility_set |= expr->left->nodes_in_subtree;
  37. total_eligibility_set =
  38. AbsorbConflictRulesIntoTES(total_eligibility_set, &conflict_rules);
  39. }
  40. if (!Overlaps(total_eligibility_set, expr->right->nodes_in_subtree)) {
  41. total_eligibility_set |= expr->right->nodes_in_subtree;
  42. total_eligibility_set =
  43. AbsorbConflictRulesIntoTES(total_eligibility_set, &conflict_rules);
  44. }
  45. expr->conflict_rules = std::move(conflict_rules);
  46. const NodeMap left = total_eligibility_set & expr->left->nodes_in_subtree;
  47. const NodeMap right = total_eligibility_set & expr->right->nodes_in_subtree;
  48. return {left, right};
  49. }

DPhyp 算法

入口函数:

  1. bool EnumerateAllConnectedPartitions(const Hypergraph &g, Receiver *receiver) {
  2. for (int seed_idx = g.nodes.size() - 1; seed_idx >= 0; --seed_idx) {
  3. NodeMap seed = TableBitmap(seed_idx);
  4. NodeMap forbidden = TablesBetween(0, seed_idx);
  5. NodeMap full_neighborhood = 0;
  6. NeighborhoodCache cache(0);
  7. // 1. 找到邻居节点
  8. NodeMap neighborhood =
  9. FindNeighborhood(g, seed, forbidden, seed, &cache, &full_neighborhood);
  10. // 2.枚举所有到补图
  11. EnumerateComplementsTo(g, seed_idx, seed, full_neighborhood,
  12. neighborhood, receiver));
  13. // 3. 扩张子图,然后再继续枚举补图
  14. ExpandSubgraph(g, seed_idx, seed, full_neighborhood, neighborhood,
  15. forbidden | seed, receiver));
  16. }
  17. return false;
  18. }

计算邻居节点

和上述伪代码完全一致,不过这里引入了一个新的数据结构full_neighborhood,该数据结构用于来判定后面的补图和子图是否连通。

  1. inline NodeMap FindNeighborhood(const Hypergraph &g, NodeMap subgraph,
  2. NodeMap forbidden, NodeMap just_grown_by,
  3. NeighborhoodCache *cache,
  4. NodeMap *full_neighborhood_arg) {
  5. NodeMap full_neighborhood =
  6. *full_neighborhood_arg; // Keep us out of aliasing trouble.
  7. NodeMap neighborhood = 0;
  8. NodeMap to_search =
  9. cache->InitSearch(just_grown_by, &neighborhood, &full_neighborhood);
  10. for (size_t node_idx : BitsSetIn(to_search)) {
  11. // Simple edges.
  12. // NOTE: This node's simple neighborhood will be added lazily to
  13. // full_neighborhood below. Forbidden nodes will also be removed below.
  14. neighborhood |= g.nodes[node_idx].simple_neighborhood;
  15. // Now go through the complex edges and see which ones point out of the
  16. // subgraph.
  17. for (size_t edge_idx : g.nodes[node_idx].complex_edges) {
  18. const Hyperedge e = g.edges[edge_idx];
  19. if (IsSubset(e.left, subgraph) &&
  20. !Overlaps(e.right, subgraph | forbidden)) {
  21. // e.right is an interesting hypernode (part of E↓'(S,X)).
  22. full_neighborhood |= e.right;
  23. if (!Overlaps(e.right, neighborhood)) {
  24. neighborhood |= IsolateLowestBit(e.right);
  25. }
  26. }
  27. }
  28. }
  29. neighborhood &= ~(subgraph | forbidden);
  30. full_neighborhood |= neighborhood;
  31. full_neighborhood_arg = full_neighborhood;
  32. return neighborhood;
  33. }

枚举补图

  1. template <class Receiver>
  2. [[nodiscard]] bool EnumerateComplementsTo(
  3. const Hypergraph &g, size_t lowest_node_idx, NodeMap subgraph,
  4. NodeMap full_neighborhood, NodeMap neighborhood, Receiver *receiver) {
  5. NodeMap forbidden = TablesBetween(0, lowest_node_idx);
  6. for (size_t seed_idx : BitsSetInDescending(neighborhood)) {
  7. // 1. 首先考虑simple node(也就是指包括一个表的hyper node)
  8. // 1.1. simple node <-> simple node
  9. NodeMap seed = TableBitmap(seed_idx);
  10. if (Overlaps(g.nodes[seed_idx].simple_neighborhood, subgraph)) {
  11. // 1.1.1 找到对应的那条边
  12. for (size_t edge_idx : g.nodes[seed_idx].simple_edges) {
  13. const Hyperedge e = g.edges[edge_idx];
  14. if (Overlaps(e.right, subgraph)) {
  15. // 1.1.2 将csg(subgraph)-cmp(seed)-pair加入dpTable中
  16. receiver->FoundSubgraphPair(subgraph, seed, edge_idx / 2));
  17. }
  18. }
  19. }
  20. // 1.2 simple node <-> complex node
  21. for (size_t edge_idx : g.nodes[seed_idx].complex_edges) {
  22. const Hyperedge e = g.edges[edge_idx];
  23. if (e.left == seed && IsSubset(e.right, subgraph)) {
  24. receiver->FoundSubgraphPair(subgraph, seed, edge_idx / 2));
  25. }
  26. }
  27. // 2. 继续扩张补图来进一步完成枚举(感觉这里不应该将TablesBetween(0, seed_idx)加入forbidden)
  28. NodeMap new_forbidden =
  29. forbidden | subgraph | (neighborhood & TablesBetween(0, seed_idx));
  30. NodeMap new_full_neighborhood = 0; // Unused; see comment on TryConnecting.
  31. NodeMap new_neighborhood = FindNeighborhood(g, seed, new_forbidden, seed,
  32. &cache, &new_full_neighborhood);
  33. ExpandComplement(g, lowest_node_idx, subgraph, full_neighborhood, seed,
  34. new_neighborhood, new_forbidden, receiver));
  35. }
  36. return false;
  37. }

如果判断子图和补图链接

  1. template <class Receiver>
  2. [[nodiscard]] bool TryConnecting(const Hypergraph &g, NodeMap subgraph,
  3. NodeMap subgraph_full_neighborhood,
  4. NodeMap complement, Receiver *receiver) {
  5. for (size_t node_idx : BitsSetIn(complement & subgraph_full_neighborhood)) {
  6. // 1. 遍历所有既在补图又在full neighborhod中的节点
  7. // 2. 查看它们是否存在一条边,使得这条边的一个节点完整的包含在子图内
  8. // Simple edges.
  9. if (Overlaps(g.nodes[node_idx].simple_neighborhood, subgraph)) {
  10. for (size_t edge_idx : g.nodes[node_idx].simple_edges) {
  11. const Hyperedge e = g.edges[edge_idx];
  12. if (Overlaps(e.right, subgraph) && Overlaps(e.left, complement)) {
  13. receiver->FoundSubgraphPair(subgraph, complement, edge_idx / 2));
  14. }
  15. }
  16. }
  17. // Complex edges.
  18. NodeMap node = TableBitmap(node_idx);
  19. for (size_t edge_idx : g.nodes[node_idx].complex_edges) {
  20. const Hyperedge e = g.edges[edge_idx];
  21. // NOTE: We call IsolateLowestBit() so that we only see the edge once.
  22. if (IsolateLowestBit(e.left) == node && IsSubset(e.left, complement) &&
  23. IsSubset(e.right, subgraph)) {
  24. receiver->FoundSubgraphPair(subgraph, complement, edge_idx / 2)
  25. }
  26. }
  27. }
  28. return false;
  29. }

扩张子图

  1. template <class Receiver>
  2. [[nodiscard]] bool ExpandSubgraph(const Hypergraph &g, size_t lowest_node_idx,
  3. NodeMap subgraph, NodeMap full_neighborhood,
  4. NodeMap neighborhood, NodeMap forbidden,
  5. Receiver *receiver) {
  6. // 1. 针对那些已经求得最优解的子图,我们可以去枚举它们的补图
  7. for (NodeMap grow_by : NonzeroSubsetsOf(neighborhood)) {
  8. NodeMap grown_subgraph = subgraph | grow_by;
  9. if (receiver->HasSeen(grown_subgraph)) {
  10. NodeMap new_full_neighborhood = full_neighborhood;
  11. NodeMap new_neighborhood =
  12. FindNeighborhood(g, subgraph | grow_by, forbidden, grow_by, &cache,
  13. &new_full_neighborhood);
  14. new_neighborhood |= forbidden & ~TablesBetween(0, lowest_node_idx);
  15. new_neighborhood |= neighborhood;
  16. EnumerateComplementsTo(g, lowest_node_idx, grown_subgraph,
  17. new_full_neighborhood, new_neighborhood,
  18. receiver))
  19. }
  20. }
  21. }
  22. // 2. 进一步去扩张csg
  23. for (NodeMap grow_by : NonzeroSubsetsOf(neighborhood)) {
  24. NodeMap grown_subgraph = subgraph | grow_by;
  25. NodeMap new_forbidden = (forbidden | neighborhood) & ~grown_subgraph;
  26. NodeMap new_full_neighborhood = full_neighborhood;
  27. NodeMap new_neighborhood =
  28. FindNeighborhood(g, subgraph | grow_by, new_forbidden, grow_by, &cache,
  29. &new_full_neighborhood);
  30. ExpandSubgraph(g, lowest_node_idx, grown_subgraph,
  31. new_full_neighborhood, new_neighborhood, new_forbidden,
  32. receiver));
  33. }
  34. }
  35. return false;
  36. }