diff options
| author | Richard Guo <rguo@postgresql.org> | 2025-08-19 09:35:40 +0900 |
|---|---|---|
| committer | Richard Guo <rguo@postgresql.org> | 2025-08-19 09:35:40 +0900 |
| commit | 24225ad9aafc576295e210026d8ffa9f50d61145 (patch) | |
| tree | 3634e26824f97b86244458a5fdae827034ac6bdf /src/backend/optimizer/path/joinpath.c | |
| parent | 3c07944d048c82ae67884c1f09e81f53f769ac2a (diff) | |
Pathify RHS unique-ification for semijoin planning
There are two implementation techniques for semijoins: one uses the
JOIN_SEMI jointype, where the executor emits at most one matching row
per left-hand side (LHS) row; the other unique-ifies the right-hand
side (RHS) and then performs a plain inner join.
The latter technique currently has some drawbacks related to the
unique-ification step.
* Only the cheapest-total path of the RHS is considered during
unique-ification. This may cause us to miss some optimization
opportunities; for example, a path with a better sort order might be
overlooked simply because it is not the cheapest in total cost. Such
a path could help avoid a sort at a higher level, potentially
resulting in a cheaper overall plan.
* We currently rely on heuristics to choose between hash-based and
sort-based unique-ification. A better approach would be to generate
paths for both methods and allow add_path() to decide which one is
preferable, consistent with how path selection is handled elsewhere in
the planner.
* In the sort-based implementation, we currently pay no attention to
the pathkeys of the input subpath or the resulting output. This can
result in redundant sort nodes being added to the final plan.
This patch improves semijoin planning by creating a new RelOptInfo for
the RHS rel to represent its unique-ified version. It then generates
multiple paths that represent elimination of distinct rows from the
RHS, considering both a hash-based implementation using the cheapest
total path of the original RHS rel, and sort-based implementations
that either exploit presorted input paths or explicitly sort the
cheapest total path. All resulting paths compete in add_path(), and
those deemed worthy of consideration are added to the new RelOptInfo.
Finally, the unique-ified rel is joined with the other side of the
semijoin using a plain inner join.
As a side effect, most of the code related to the JOIN_UNIQUE_OUTER
and JOIN_UNIQUE_INNER jointypes -- used to indicate that the LHS or
RHS path should be made unique -- has been removed. Besides, the
T_Unique path now has the same meaning for both semijoins and upper
DISTINCT clauses: it represents adjacent-duplicate removal on
presorted input. This patch unifies their handling by sharing the
same data structures and functions.
This patch also removes the UNIQUE_PATH_NOOP related code along the
way, as it is dead code -- if the RHS rel is provably unique, the
semijoin should have already been simplified to a plain inner join by
analyzejoins.c.
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
Diffstat (limited to 'src/backend/optimizer/path/joinpath.c')
| -rw-r--r-- | src/backend/optimizer/path/joinpath.c | 338 |
1 files changed, 95 insertions, 243 deletions
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index ebedc5574ca..3b9407eb2eb 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -112,12 +112,12 @@ static void generate_mergejoin_paths(PlannerInfo *root, * "flipped around" if we are considering joining the rels in the opposite * direction from what's indicated in sjinfo. * - * Also, this routine and others in this module accept the special JoinTypes - * JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER to indicate that we should - * unique-ify the outer or inner relation and then apply a regular inner - * join. These values are not allowed to propagate outside this module, - * however. Path cost estimation code may need to recognize that it's - * dealing with such a case --- the combination of nominal jointype INNER + * Also, this routine accepts the special JoinTypes JOIN_UNIQUE_OUTER and + * JOIN_UNIQUE_INNER to indicate that the outer or inner relation has been + * unique-ified and a regular inner join should then be applied. These values + * are not allowed to propagate outside this routine, however. Path cost + * estimation code, as well as match_unsorted_outer, may need to recognize that + * it's dealing with such a case --- the combination of nominal jointype INNER * with sjinfo->jointype == JOIN_SEMI indicates that. */ void @@ -129,6 +129,7 @@ add_paths_to_joinrel(PlannerInfo *root, SpecialJoinInfo *sjinfo, List *restrictlist) { + JoinType save_jointype = jointype; JoinPathExtraData extra; bool mergejoin_allowed = true; ListCell *lc; @@ -165,10 +166,10 @@ add_paths_to_joinrel(PlannerInfo *root, * reduce_unique_semijoins would've simplified it), so there's no point in * calling innerrel_is_unique. However, if the LHS covers all of the * semijoin's min_lefthand, then it's appropriate to set inner_unique - * because the path produced by create_unique_path will be unique relative - * to the LHS. (If we have an LHS that's only part of the min_lefthand, - * that is *not* true.) For JOIN_UNIQUE_OUTER, pass JOIN_INNER to avoid - * letting that value escape this module. + * because the unique relation produced by create_unique_paths will be + * unique relative to the LHS. (If we have an LHS that's only part of the + * min_lefthand, that is *not* true.) For JOIN_UNIQUE_OUTER, pass + * JOIN_INNER to avoid letting that value escape this module. */ switch (jointype) { @@ -200,6 +201,13 @@ add_paths_to_joinrel(PlannerInfo *root, } /* + * If the outer or inner relation has been unique-ified, handle as a plain + * inner join. + */ + if (jointype == JOIN_UNIQUE_OUTER || jointype == JOIN_UNIQUE_INNER) + jointype = JOIN_INNER; + + /* * Find potential mergejoin clauses. We can skip this if we are not * interested in doing a mergejoin. However, mergejoin may be our only * way of implementing a full outer join, so override enable_mergejoin if @@ -329,7 +337,7 @@ add_paths_to_joinrel(PlannerInfo *root, joinrel->fdwroutine->GetForeignJoinPaths) joinrel->fdwroutine->GetForeignJoinPaths(root, joinrel, outerrel, innerrel, - jointype, &extra); + save_jointype, &extra); /* * 6. Finally, give extensions a chance to manipulate the path list. They @@ -339,7 +347,7 @@ add_paths_to_joinrel(PlannerInfo *root, */ if (set_join_pathlist_hook) set_join_pathlist_hook(root, joinrel, outerrel, innerrel, - jointype, &extra); + save_jointype, &extra); } /* @@ -1364,7 +1372,6 @@ sort_inner_and_outer(PlannerInfo *root, JoinType jointype, JoinPathExtraData *extra) { - JoinType save_jointype = jointype; Path *outer_path; Path *inner_path; Path *cheapest_partial_outer = NULL; @@ -1403,37 +1410,15 @@ sort_inner_and_outer(PlannerInfo *root, return; /* - * If unique-ification is requested, do it and then handle as a plain - * inner join. - */ - if (jointype == JOIN_UNIQUE_OUTER) - { - outer_path = (Path *) create_unique_path(root, outerrel, - outer_path, extra->sjinfo); - Assert(outer_path); - jointype = JOIN_INNER; - } - else if (jointype == JOIN_UNIQUE_INNER) - { - inner_path = (Path *) create_unique_path(root, innerrel, - inner_path, extra->sjinfo); - Assert(inner_path); - jointype = JOIN_INNER; - } - - /* * If the joinrel is parallel-safe, we may be able to consider a partial - * merge join. However, we can't handle JOIN_UNIQUE_OUTER, because the - * outer path will be partial, and therefore we won't be able to properly - * guarantee uniqueness. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT - * and JOIN_RIGHT_ANTI, because they can produce false null extended rows. + * merge join. However, we can't handle JOIN_FULL, JOIN_RIGHT and + * JOIN_RIGHT_ANTI, because they can produce false null extended rows. * Also, the resulting path must not be parameterized. */ if (joinrel->consider_parallel && - save_jointype != JOIN_UNIQUE_OUTER && - save_jointype != JOIN_FULL && - save_jointype != JOIN_RIGHT && - save_jointype != JOIN_RIGHT_ANTI && + jointype != JOIN_FULL && + jointype != JOIN_RIGHT && + jointype != JOIN_RIGHT_ANTI && outerrel->partial_pathlist != NIL && bms_is_empty(joinrel->lateral_relids)) { @@ -1441,7 +1426,7 @@ sort_inner_and_outer(PlannerInfo *root, if (inner_path->parallel_safe) cheapest_safe_inner = inner_path; - else if (save_jointype != JOIN_UNIQUE_INNER) + else cheapest_safe_inner = get_cheapest_parallel_safe_total_inner(innerrel->pathlist); } @@ -1580,13 +1565,9 @@ generate_mergejoin_paths(PlannerInfo *root, List *trialsortkeys; Path *cheapest_startup_inner; Path *cheapest_total_inner; - JoinType save_jointype = jointype; int num_sortkeys; int sortkeycnt; - if (jointype == JOIN_UNIQUE_OUTER || jointype == JOIN_UNIQUE_INNER) - jointype = JOIN_INNER; - /* Look for useful mergeclauses (if any) */ mergeclauses = find_mergeclauses_for_outer_pathkeys(root, @@ -1636,10 +1617,6 @@ generate_mergejoin_paths(PlannerInfo *root, extra, is_partial); - /* Can't do anything else if inner path needs to be unique'd */ - if (save_jointype == JOIN_UNIQUE_INNER) - return; - /* * Look for presorted inner paths that satisfy the innersortkey list --- * or any truncation thereof, if we are allowed to build a mergejoin using @@ -1819,7 +1796,6 @@ match_unsorted_outer(PlannerInfo *root, JoinType jointype, JoinPathExtraData *extra) { - JoinType save_jointype = jointype; bool nestjoinOK; bool useallclauses; Path *inner_cheapest_total = innerrel->cheapest_total_path; @@ -1855,12 +1831,6 @@ match_unsorted_outer(PlannerInfo *root, nestjoinOK = false; useallclauses = true; break; - case JOIN_UNIQUE_OUTER: - case JOIN_UNIQUE_INNER: - jointype = JOIN_INNER; - nestjoinOK = true; - useallclauses = false; - break; default: elog(ERROR, "unrecognized join type: %d", (int) jointype); @@ -1873,24 +1843,20 @@ match_unsorted_outer(PlannerInfo *root, * If inner_cheapest_total is parameterized by the outer rel, ignore it; * we will consider it below as a member of cheapest_parameterized_paths, * but the other possibilities considered in this routine aren't usable. + * + * Furthermore, if the inner side is a unique-ified relation, we cannot + * generate any valid paths here, because the inner rel's dependency on + * the outer rel makes unique-ification meaningless. */ if (PATH_PARAM_BY_REL(inner_cheapest_total, outerrel)) + { inner_cheapest_total = NULL; - /* - * If we need to unique-ify the inner path, we will consider only the - * cheapest-total inner. - */ - if (save_jointype == JOIN_UNIQUE_INNER) - { - /* No way to do this with an inner path parameterized by outer rel */ - if (inner_cheapest_total == NULL) + if (RELATION_WAS_MADE_UNIQUE(innerrel, extra->sjinfo, jointype)) return; - inner_cheapest_total = (Path *) - create_unique_path(root, innerrel, inner_cheapest_total, extra->sjinfo); - Assert(inner_cheapest_total); } - else if (nestjoinOK) + + if (nestjoinOK) { /* * Consider materializing the cheapest inner path, unless @@ -1915,20 +1881,6 @@ match_unsorted_outer(PlannerInfo *root, continue; /* - * If we need to unique-ify the outer path, it's pointless to consider - * any but the cheapest outer. (XXX we don't consider parameterized - * outers, nor inners, for unique-ified cases. Should we?) - */ - if (save_jointype == JOIN_UNIQUE_OUTER) - { - if (outerpath != outerrel->cheapest_total_path) - continue; - outerpath = (Path *) create_unique_path(root, outerrel, - outerpath, extra->sjinfo); - Assert(outerpath); - } - - /* * The result will have this sort order (even if it is implemented as * a nestloop, and even if some of the mergeclauses are implemented by * qpquals rather than as true mergeclauses): @@ -1936,21 +1888,7 @@ match_unsorted_outer(PlannerInfo *root, merge_pathkeys = build_join_pathkeys(root, joinrel, jointype, outerpath->pathkeys); - if (save_jointype == JOIN_UNIQUE_INNER) - { - /* - * Consider nestloop join, but only with the unique-ified cheapest - * inner path - */ - try_nestloop_path(root, - joinrel, - outerpath, - inner_cheapest_total, - merge_pathkeys, - jointype, - extra); - } - else if (nestjoinOK) + if (nestjoinOK) { /* * Consider nestloop joins using this outer path and various @@ -2001,17 +1939,13 @@ match_unsorted_outer(PlannerInfo *root, extra); } - /* Can't do anything else if outer path needs to be unique'd */ - if (save_jointype == JOIN_UNIQUE_OUTER) - continue; - /* Can't do anything else if inner rel is parameterized by outer */ if (inner_cheapest_total == NULL) continue; /* Generate merge join paths */ generate_mergejoin_paths(root, joinrel, innerrel, outerpath, - save_jointype, extra, useallclauses, + jointype, extra, useallclauses, inner_cheapest_total, merge_pathkeys, false); } @@ -2019,41 +1953,35 @@ match_unsorted_outer(PlannerInfo *root, /* * Consider partial nestloop and mergejoin plan if outerrel has any * partial path and the joinrel is parallel-safe. However, we can't - * handle JOIN_UNIQUE_OUTER, because the outer path will be partial, and - * therefore we won't be able to properly guarantee uniqueness. Nor can - * we handle joins needing lateral rels, since partial paths must not be - * parameterized. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT and + * handle joins needing lateral rels, since partial paths must not be + * parameterized. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT and * JOIN_RIGHT_ANTI, because they can produce false null extended rows. */ if (joinrel->consider_parallel && - save_jointype != JOIN_UNIQUE_OUTER && - save_jointype != JOIN_FULL && - save_jointype != JOIN_RIGHT && - save_jointype != JOIN_RIGHT_ANTI && + jointype != JOIN_FULL && + jointype != JOIN_RIGHT && + jointype != JOIN_RIGHT_ANTI && outerrel->partial_pathlist != NIL && bms_is_empty(joinrel->lateral_relids)) { if (nestjoinOK) consider_parallel_nestloop(root, joinrel, outerrel, innerrel, - save_jointype, extra); + jointype, extra); /* * If inner_cheapest_total is NULL or non parallel-safe then find the - * cheapest total parallel safe path. If doing JOIN_UNIQUE_INNER, we - * can't use any alternative inner path. + * cheapest total parallel safe path. */ if (inner_cheapest_total == NULL || !inner_cheapest_total->parallel_safe) { - if (save_jointype == JOIN_UNIQUE_INNER) - return; - - inner_cheapest_total = get_cheapest_parallel_safe_total_inner(innerrel->pathlist); + inner_cheapest_total = + get_cheapest_parallel_safe_total_inner(innerrel->pathlist); } if (inner_cheapest_total) consider_parallel_mergejoin(root, joinrel, outerrel, innerrel, - save_jointype, extra, + jointype, extra, inner_cheapest_total); } } @@ -2118,24 +2046,17 @@ consider_parallel_nestloop(PlannerInfo *root, JoinType jointype, JoinPathExtraData *extra) { - JoinType save_jointype = jointype; Path *inner_cheapest_total = innerrel->cheapest_total_path; Path *matpath = NULL; ListCell *lc1; - if (jointype == JOIN_UNIQUE_INNER) - jointype = JOIN_INNER; - /* - * Consider materializing the cheapest inner path, unless: 1) we're doing - * JOIN_UNIQUE_INNER, because in this case we have to unique-ify the - * cheapest inner path, 2) enable_material is off, 3) the cheapest inner - * path is not parallel-safe, 4) the cheapest inner path is parameterized - * by the outer rel, or 5) the cheapest inner path materializes its output - * anyway. + * Consider materializing the cheapest inner path, unless: 1) + * enable_material is off, 2) the cheapest inner path is not + * parallel-safe, 3) the cheapest inner path is parameterized by the outer + * rel, or 4) the cheapest inner path materializes its output anyway. */ - if (save_jointype != JOIN_UNIQUE_INNER && - enable_material && inner_cheapest_total->parallel_safe && + if (enable_material && inner_cheapest_total->parallel_safe && !PATH_PARAM_BY_REL(inner_cheapest_total, outerrel) && !ExecMaterializesOutput(inner_cheapest_total->pathtype)) { @@ -2169,23 +2090,6 @@ consider_parallel_nestloop(PlannerInfo *root, if (!innerpath->parallel_safe) continue; - /* - * If we're doing JOIN_UNIQUE_INNER, we can only use the inner's - * cheapest_total_path, and we have to unique-ify it. (We might - * be able to relax this to allow other safe, unparameterized - * inner paths, but right now create_unique_path is not on board - * with that.) - */ - if (save_jointype == JOIN_UNIQUE_INNER) - { - if (innerpath != innerrel->cheapest_total_path) - continue; - innerpath = (Path *) create_unique_path(root, innerrel, - innerpath, - extra->sjinfo); - Assert(innerpath); - } - try_partial_nestloop_path(root, joinrel, outerpath, innerpath, pathkeys, jointype, extra); @@ -2227,7 +2131,6 @@ hash_inner_and_outer(PlannerInfo *root, JoinType jointype, JoinPathExtraData *extra) { - JoinType save_jointype = jointype; bool isouterjoin = IS_OUTER_JOIN(jointype); List *hashclauses; ListCell *l; @@ -2290,6 +2193,8 @@ hash_inner_and_outer(PlannerInfo *root, Path *cheapest_startup_outer = outerrel->cheapest_startup_path; Path *cheapest_total_outer = outerrel->cheapest_total_path; Path *cheapest_total_inner = innerrel->cheapest_total_path; + ListCell *lc1; + ListCell *lc2; /* * If either cheapest-total path is parameterized by the other rel, we @@ -2301,114 +2206,64 @@ hash_inner_and_outer(PlannerInfo *root, PATH_PARAM_BY_REL(cheapest_total_inner, outerrel)) return; - /* Unique-ify if need be; we ignore parameterized possibilities */ - if (jointype == JOIN_UNIQUE_OUTER) - { - cheapest_total_outer = (Path *) - create_unique_path(root, outerrel, - cheapest_total_outer, extra->sjinfo); - Assert(cheapest_total_outer); - jointype = JOIN_INNER; - try_hashjoin_path(root, - joinrel, - cheapest_total_outer, - cheapest_total_inner, - hashclauses, - jointype, - extra); - /* no possibility of cheap startup here */ - } - else if (jointype == JOIN_UNIQUE_INNER) - { - cheapest_total_inner = (Path *) - create_unique_path(root, innerrel, - cheapest_total_inner, extra->sjinfo); - Assert(cheapest_total_inner); - jointype = JOIN_INNER; + /* + * Consider the cheapest startup outer together with the cheapest + * total inner, and then consider pairings of cheapest-total paths + * including parameterized ones. There is no use in generating + * parameterized paths on the basis of possibly cheap startup cost, so + * this is sufficient. + */ + if (cheapest_startup_outer != NULL) try_hashjoin_path(root, joinrel, - cheapest_total_outer, + cheapest_startup_outer, cheapest_total_inner, hashclauses, jointype, extra); - if (cheapest_startup_outer != NULL && - cheapest_startup_outer != cheapest_total_outer) - try_hashjoin_path(root, - joinrel, - cheapest_startup_outer, - cheapest_total_inner, - hashclauses, - jointype, - extra); - } - else + + foreach(lc1, outerrel->cheapest_parameterized_paths) { + Path *outerpath = (Path *) lfirst(lc1); + /* - * For other jointypes, we consider the cheapest startup outer - * together with the cheapest total inner, and then consider - * pairings of cheapest-total paths including parameterized ones. - * There is no use in generating parameterized paths on the basis - * of possibly cheap startup cost, so this is sufficient. + * We cannot use an outer path that is parameterized by the inner + * rel. */ - ListCell *lc1; - ListCell *lc2; - - if (cheapest_startup_outer != NULL) - try_hashjoin_path(root, - joinrel, - cheapest_startup_outer, - cheapest_total_inner, - hashclauses, - jointype, - extra); + if (PATH_PARAM_BY_REL(outerpath, innerrel)) + continue; - foreach(lc1, outerrel->cheapest_parameterized_paths) + foreach(lc2, innerrel->cheapest_parameterized_paths) { - Path *outerpath = (Path *) lfirst(lc1); + Path *innerpath = (Path *) lfirst(lc2); /* - * We cannot use an outer path that is parameterized by the - * inner rel. + * We cannot use an inner path that is parameterized by the + * outer rel, either. */ - if (PATH_PARAM_BY_REL(outerpath, innerrel)) + if (PATH_PARAM_BY_REL(innerpath, outerrel)) continue; - foreach(lc2, innerrel->cheapest_parameterized_paths) - { - Path *innerpath = (Path *) lfirst(lc2); - - /* - * We cannot use an inner path that is parameterized by - * the outer rel, either. - */ - if (PATH_PARAM_BY_REL(innerpath, outerrel)) - continue; + if (outerpath == cheapest_startup_outer && + innerpath == cheapest_total_inner) + continue; /* already tried it */ - if (outerpath == cheapest_startup_outer && - innerpath == cheapest_total_inner) - continue; /* already tried it */ - - try_hashjoin_path(root, - joinrel, - outerpath, - innerpath, - hashclauses, - jointype, - extra); - } + try_hashjoin_path(root, + joinrel, + outerpath, + innerpath, + hashclauses, + jointype, + extra); } } /* * If the joinrel is parallel-safe, we may be able to consider a - * partial hash join. However, we can't handle JOIN_UNIQUE_OUTER, - * because the outer path will be partial, and therefore we won't be - * able to properly guarantee uniqueness. Also, the resulting path - * must not be parameterized. + * partial hash join. However, the resulting path must not be + * parameterized. */ if (joinrel->consider_parallel && - save_jointype != JOIN_UNIQUE_OUTER && outerrel->partial_pathlist != NIL && bms_is_empty(joinrel->lateral_relids)) { @@ -2421,11 +2276,9 @@ hash_inner_and_outer(PlannerInfo *root, /* * Can we use a partial inner plan too, so that we can build a - * shared hash table in parallel? We can't handle - * JOIN_UNIQUE_INNER because we can't guarantee uniqueness. + * shared hash table in parallel? */ if (innerrel->partial_pathlist != NIL && - save_jointype != JOIN_UNIQUE_INNER && enable_parallel_hash) { cheapest_partial_inner = @@ -2441,19 +2294,18 @@ hash_inner_and_outer(PlannerInfo *root, * Normally, given that the joinrel is parallel-safe, the cheapest * total inner path will also be parallel-safe, but if not, we'll * have to search for the cheapest safe, unparameterized inner - * path. If doing JOIN_UNIQUE_INNER, we can't use any alternative - * inner path. If full, right, right-semi or right-anti join, we - * can't use parallelism (building the hash table in each backend) + * path. If full, right, right-semi or right-anti join, we can't + * use parallelism (building the hash table in each backend) * because no one process has all the match bits. */ - if (save_jointype == JOIN_FULL || - save_jointype == JOIN_RIGHT || - save_jointype == JOIN_RIGHT_SEMI || - save_jointype == JOIN_RIGHT_ANTI) + if (jointype == JOIN_FULL || + jointype == JOIN_RIGHT || + jointype == JOIN_RIGHT_SEMI || + jointype == JOIN_RIGHT_ANTI) cheapest_safe_inner = NULL; else if (cheapest_total_inner->parallel_safe) cheapest_safe_inner = cheapest_total_inner; - else if (save_jointype != JOIN_UNIQUE_INNER) + else cheapest_safe_inner = get_cheapest_parallel_safe_total_inner(innerrel->pathlist); |
