Allow the planner to collapse explicit inner JOINs together, rather than necessarily following the JOIN syntax to develop the query plan. The old behavior is still available by setting GUC variable JOIN_COLLAPSE_LIMIT to 1. Also create a GUC variable FROM_COLLAPSE_LIMIT to control the similar decision about when to collapse sub-SELECT lists into their parent lists. (This behavior existed already, but the limit was always GEQO_THRESHOLD/2; now it's separately adjustable.)
author: Tom Lane <tgl@sss.pgh.pa.us> 2003-01-25 23:10:30 +0000
committer: Tom Lane <tgl@sss.pgh.pa.us> 2003-01-25 23:10:30 +0000
commit: 9f5f2124754ccd605671bfe952c220b46a0e730b (patch)
tree: e6be6eab43ffe733b9c785d62cec74497098c694 /doc/src/sgml
parent: 15ab7a87206d657a4182d2932970384d540004d0 (diff)
3 files changed, 105 insertions, 34 deletions
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 4c150d50220..6928f1d853e 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.23 2003/01/12 18:42:59 tgl Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.24 2003/01/25 23:10:27 tgl Exp $
 -->
 
  <chapter id="performance-tips">
@@ -591,53 +591,93 @@ SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
   </para>
 
   <para>
-   The <productname>PostgreSQL</productname> query planner treats all
-   explicit <literal>JOIN</> syntaxes as constraining the join order, even though
-   it is not logically necessary to make such a constraint for inner
-   joins.  Therefore, although all of these queries give the same result:
+   Explicit inner join syntax (<literal>INNER JOIN</>, <literal>CROSS
+   JOIN</>, or unadorned <literal>JOIN</>) is semantically the same as
+   listing the input relations in <literal>FROM</>, so it does not need to
+   constrain the join order.  But it is possible to instruct the
+   <productname>PostgreSQL</productname> query planner to treat
+   explicit inner <literal>JOIN</>s as constraining the join order anyway.
+   For example, these three queries are logically equivalent:
 <programlisting>
 SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
 SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
 SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
 </programlisting>
+   But if we tell the planner to honor the <literal>JOIN</> order,
    the second and third take less time to plan than the first.  This effect
    is not worth worrying about for only three tables, but it can be a
    lifesaver with many tables.
   </para>
 
   <para>
+   To force the planner to follow the <literal>JOIN</> order for inner joins,
+   set the <varname>JOIN_COLLAPSE_LIMIT</> run-time parameter to 1.
+   (Other possible values are discussed below.)
+  </para>
+
+  <para>
    You do not need to constrain the join order completely in order to
-   cut search time, because it's OK to use <literal>JOIN</> operators in a plain
-   <literal>FROM</> list.  For example,
+   cut search time, because it's OK to use <literal>JOIN</> operators
+   within items of a plain <literal>FROM</> list.  For example, consider
 <programlisting>
 SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
 </programlisting>
+   With <varname>JOIN_COLLAPSE_LIMIT</> = 1, this
    forces the planner to join A to B before joining them to other tables,
    but doesn't constrain its choices otherwise.  In this example, the
    number of possible join orders is reduced by a factor of 5.
   </para>
 
   <para>
-   If you have a mix of outer and inner joins in a complex query, you
-   might not want to constrain the planner's search for a good ordering
-   of inner joins inside an outer join.  You can't do that directly in the
-   <literal>JOIN</> syntax, but you can get around the syntactic limitation by using
-   subselects.  For example,
+   Constraining the planner's search in this way is a useful technique
+   both for reducing planning time and for directing the planner to a
+   good query plan.  If the planner chooses a bad join order by default,
+   you can force it to choose a better order via <literal>JOIN</> syntax
+   --- assuming that you know of a better order, that is.  Experimentation
+   is recommended.
+  </para>
+
+  <para>
+   A closely related issue that affects planning time is collapsing of
+   sub-SELECTs into their parent query.  For example, consider
+<programlisting>
+SELECT *
+FROM x, y,
+     (SELECT * FROM a, b, c WHERE something) AS ss
+WHERE somethingelse
+</programlisting>
+   This situation might arise from use of a view that contains a join;
+   the view's SELECT rule will be inserted in place of the view reference,
+   yielding a query much like the above.  Normally, the planner will try
+   to collapse the sub-query into the parent, yielding
 <programlisting>
-SELECT * FROM d LEFT JOIN
-        (SELECT * FROM a, b, c WHERE ...) AS ss
-        ON (...);
+SELECT * FROM x, y, a, b, c WHERE something AND somethingelse
 </programlisting>
-   Here, joining to D must be the last step in the query plan, but the
-   planner is free to consider various join orders for A, B, and C.
+   This usually results in a better plan than planning the sub-query
+   separately.  (For example, the outer WHERE conditions might be such that
+   joining X to A first eliminates many rows of A, thus avoiding the need to
+   form the full logical output of the sub-select.)  But at the same time,
+   we have increased the planning time; here, we have a five-way join
+   problem replacing two separate three-way join problems.  Because of the
+   exponential growth of the number of possibilities, this makes a big
+   difference.  The planner tries to avoid getting stuck in huge join search
+   problems by not collapsing a sub-query if more than
+   <varname>FROM_COLLAPSE_LIMIT</> FROM-items would result in the parent
+   query.  You can trade off planning time against quality of plan by
+   adjusting this run-time parameter up or down.
   </para>
 
   <para>
-   Constraining the planner's search in this way is a useful technique
-   both for reducing planning time and for directing the planner to a
-   good query plan.  If the planner chooses a bad join order by default,
-   you can force it to choose a better order via <literal>JOIN</> syntax --- assuming
-   that you know of a better order, that is.  Experimentation is recommended.
+   <varname>FROM_COLLAPSE_LIMIT</> and <varname>JOIN_COLLAPSE_LIMIT</>
+   are similarly named because they do almost the same thing: one controls
+   when the planner will <quote>flatten out</> sub-SELECTs, and the
+   other controls when it will flatten out explicit inner JOINs.  Typically
+   you would either set <varname>JOIN_COLLAPSE_LIMIT</> equal to
+   <varname>FROM_COLLAPSE_LIMIT</> (so that explicit JOINs and sub-SELECTs
+   act similarly) or set <varname>JOIN_COLLAPSE_LIMIT</> to 1 (if you want
+   to control join order with explicit JOINs).  But you might set them
+   differently if you are trying to fine-tune the tradeoff between planning
+   time and run time.
   </para>
  </sect1>
 
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
index 18d71f5866b..daccd63da77 100644
--- a/doc/src/sgml/release.sgml
+++ b/doc/src/sgml/release.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/release.sgml,v 1.180 2003/01/23 23:38:51 petere Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/release.sgml,v 1.181 2003/01/25 23:10:27 tgl Exp $
 -->
 
 <appendix id="release">
@@ -24,6 +24,7 @@ CDATA means the content is "SGML-free", so you can write without
 worries about funny characters.
 -->
 <literallayout><![CDATA[
+Explicit JOINs no longer constrain query plan, unless JOIN_COLLAPSE_LIMIT = 1
 Performance of "foo IN (SELECT ...)" queries has been considerably improved
 FETCH 0 now re-fetches cursor's current row, per SQL spec
 Revised executor state representation; plan trees are read-only to executor now
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 34321cae52d..cf8a648ba84 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.166 2003/01/11 05:04:14 momjian Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.167 2003/01/25 23:10:27 tgl Exp $
 -->
 
 <Chapter Id="runtime">
@@ -774,6 +774,19 @@ env PGOPTIONS='-c geqo=off' psql
      </varlistentry>
 
      <varlistentry>
+      <term><varname>FROM_COLLAPSE_LIMIT</varname> (<type>integer</type>)</term>
+      <listitem>
+       <para>
+        The planner will merge sub-queries into upper queries if the resulting
+	FROM list would have no more than this many items.  Smaller values
+	reduce planning time but may yield inferior query plans.
+	The default is 8.  It is usually wise to keep this less than
+	<literal>GEQO_THRESHOLD</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <indexterm>
        <primary>genetic query optimization</primary>
       </indexterm>
@@ -826,12 +839,27 @@ env PGOPTIONS='-c geqo=off' psql
       <listitem>
        <para>
         Use genetic query optimization to plan queries with at least
-        this many <literal>FROM</> items involved. (Note that a
+        this many <literal>FROM</> items involved. (Note that an outer
         <literal>JOIN</> construct counts as only one <literal>FROM</>
         item.) The default is 11. For simpler queries it is usually best
-        to use the deterministic, exhaustive planner. This parameter
-        also controls how hard the optimizer will try to merge subquery
-        <literal>FROM</literal> clauses into the upper query.
+        to use the deterministic, exhaustive planner, but for queries with
+	many tables the deterministic planner takes too long.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>JOIN_COLLAPSE_LIMIT</varname> (<type>integer</type>)</term>
+      <listitem>
+       <para>
+	The planner will flatten explicit inner <literal>JOIN</> constructs
+	into lists of <literal>FROM</> items whenever a list of no more than
+	this many items would result.  Usually this is set the same as
+	<literal>FROM_COLLAPSE_LIMIT</>.  Setting it to 1 prevents any
+	flattening of inner <literal>JOIN</>s, allowing explicit
+	<literal>JOIN</> syntax to be used to control the join order.
+	Intermediate values might be useful to trade off planning time
+	against quality of plan.
        </para>
       </listitem>
      </varlistentry>
@@ -1842,8 +1870,8 @@ dynamic_library_path = '/usr/local/lib/postgresql:/home/my_project/lib:$libdir'
         server. The default is 64. Each buffer is typically 8192
         bytes. This must be greater than 16, as well as at least twice
         the value of <varname>MAX_CONNECTIONS</varname>; however, a
-        higher value can often improve performance on modern
-        machines. Values of at least a few thousand are recommended
+        higher value can often improve performance.
+	Values of a few thousand are recommended
         for production installations. This option can only be set at
         server start.
        </para>
@@ -1878,15 +1906,17 @@ dynamic_library_path = '/usr/local/lib/postgresql:/home/my_project/lib:$libdir'
       <listitem>
        <para>
 	Specifies the amount of memory to be used by internal sorts and
-	hashes before switching to temporary disk files. The value is
+	hash tables before switching to temporary disk files. The value is
 	specified in kilobytes, and defaults to 1024 kilobytes (1 MB).
-	Note that for a complex query, several sorts might be running in
-	parallel, and each one will be allowed to use as much memory as
-	this value specifies before it starts to put data into temporary
+	Note that for a complex query, several sorts or hashes might be
+	running in parallel; each one will be allowed to use as much memory
+	as this value specifies before it starts to put data into temporary
 	files. Also, each running backend could be doing one or more
 	sorts simultaneously, so the total memory used could be many
 	times the value of <varname>SORT_MEM</varname>. Sorts are used
 	by <literal>ORDER BY</>, merge joins, and <command>CREATE INDEX</>.
+	Hash tables are used in hash joins, hash-based aggregation, and
+	hash-based processing of <literal>IN</> sub-selects.
        </para>
       </listitem>
      </varlistentry>
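
Reviewer's sketch (not part of the commit): the perform.sgml text in this patch says that writing `a CROSS JOIN b, c, d, e` with JOIN_COLLAPSE_LIMIT = 1 reduces the number of possible join orders "by a factor of 5", and that collapsing the sub-SELECT example replaces two three-way join problems with one five-way problem. The arithmetic below reproduces those figures under one loudly stated assumption: that the join orders for n FROM-list items can be counted as n! simple permutations. That is a simplification for illustration, not a description of how the planner actually enumerates plans, and the helper name `join_orders` is hypothetical.

```python
from math import factorial


def join_orders(n_items: int) -> int:
    """Count orderings of n FROM-list items as plain permutations.

    Assumption for illustration only: the real planner's search space
    is not simply n!.
    """
    return factorial(n_items)


# SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
unconstrained = join_orders(5)  # a, b, c, d, e are all free to reorder
constrained = join_orders(4)    # (a JOIN b) is fixed as one item, plus c, d, e
reduction_factor = unconstrained // constrained

# SELECT * FROM x, y, (SELECT * FROM a, b, c WHERE something) AS ss ...
separate = join_orders(3) + join_orders(3)  # parent (x, y, ss) + sub-query (a, b, c)
collapsed = join_orders(5)                  # x, y, a, b, c in one FROM list

print(f"constrained join: {unconstrained} -> {constrained} orders "
      f"(factor of {reduction_factor})")
print(f"sub-SELECT: {separate} orders planned separately, "
      f"{collapsed} after collapsing")
```

Under this counting model the constrained form leaves 24 of 120 orderings, and collapsing the sub-SELECT raises the count from 12 to 120, which is the planning-time cost that FROM_COLLAPSE_LIMIT is meant to bound.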