Implement operator class parameters

PostgreSQL provides set of template index access methods, where opclasses have much freedom in the semantics of indexing. These index AMs are GiST, GIN, SP-GiST and BRIN. There opclasses define representation of keys, operations on them and supported search strategies. So, it's natural that opclasses may be faced some tradeoffs, which require user-side decision. This commit implements opclass parameters allowing users to set some values, which tell opclass how to index the particular dataset. This commit doesn't introduce new storage in system catalog. Instead it uses pg_attribute.attoptions, which is used for table column storage options but unused for index attributes. In order to evade changing signature of each opclass support function, we implement unified way to pass options to opclass support functions. Options are set to fn_expr as the constant bytea expression. It's possible due to the fact that opclass support functions are executed outside of expressions, so fn_expr is unused for them. This commit comes with some examples of opclass options usage. We parametrize signature length in GiST. That applies to multiple opclasses: tsvector_ops, gist__intbig_ops, gist_ltree_ops, gist__ltree_ops, gist_trgm_ops and gist_hstore_ops. Also we parametrize maximum number of integer ranges for gist__int_ops. However, the main future usage of this feature is expected to be json, where users would be able to specify which way to index particular json parts. Catversion is bumped. Discussion: https://postgr.es/m/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru Author: Nikita Glukhov, revised by me Reviwed-by: Nikolay Shaplov, Robert Haas, Tom Lane, Tomas Vondra, Alvaro Herrera
author: Alexander Korotkov <akorotkov@postgresql.org> 2020-03-30 19:17:11 +0300
committer: Alexander Korotkov <akorotkov@postgresql.org> 2020-03-30 19:17:23 +0300
commit: 911e70207703799605f5a0e8aad9f06cff067c63 (patch)
tree: 5cdbffea1945c2c0b766db195a55bcb372d60537 /doc/src
parent: 1d53432ff940b789c2431ba476a2a6e2db3edf84 (diff)
7 files changed, 113 insertions, 14 deletions
diff --git a/doc/src/sgml/hstore.sgml b/doc/src/sgml/hstore.sgml
index 64c2477fffc..f1f2b08cd79 100644
--- a/doc/src/sgml/hstore.sgml
+++ b/doc/src/sgml/hstore.sgml
@@ -468,6 +468,23 @@ CREATE INDEX hidx ON testhstore USING GIN (h);
 </programlisting>
 
   <para>
+   <literal>gist_hstore_ops</literal> GiST opclass approximates set of
+   key/value pairs as a bitmap signature.  Optional integer parameter
+   <literal>siglen</literal> of <literal>gist_hstore_ops</literal> determines
+   signature length in bytes.  Default signature length is 16 bytes.
+   Valid values of signature length are between 1 and 2024 bytes.  Longer
+   signatures leads to more precise search (scan less fraction of index, scan
+   less heap pages), but larger index.
+  </para>
+
+  <para>
+   Example of creating such an index with a signature length of 32 bytes:
+  </para>
+<programlisting>
+  CREATE INDEX hidx ON testhstore USING GIST (h gist_hstore_ops(siglen=32));
+</programlisting>
+
+  <para>
    <type>hstore</type> also supports <type>btree</type> or <type>hash</type> indexes for
    the <literal>=</literal> operator. This allows <type>hstore</type> columns to be
    declared <literal>UNIQUE</literal>, or to be used in <literal>GROUP BY</literal>,
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 86539a781c5..1be209a2fe7 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1316,7 +1316,7 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
    An index definition can specify an <firstterm>operator
    class</firstterm> for each column of an index.
 <synopsis>
-CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <replaceable>opclass</replaceable> <optional><replaceable>sort options</replaceable></optional> <optional>, ...</optional>);
+CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <replaceable>opclass</replaceable> [ ( <replaceable>opclass_options</replaceable> ) ] <optional><replaceable>sort options</replaceable></optional> <optional>, ...</optional>);
 </synopsis>
    The operator class identifies the operators to be used by the index
    for that column.  For example, a B-tree index on the type <type>int4</type>
diff --git a/doc/src/sgml/intarray.sgml b/doc/src/sgml/intarray.sgml
index 025cbca616e..72b4b23c158 100644
--- a/doc/src/sgml/intarray.sgml
+++ b/doc/src/sgml/intarray.sgml
@@ -265,7 +265,7 @@
   </para>
 
   <para>
-   Two GiST index operator classes are provided:
+   Two parametrized GiST index operator classes are provided:
    <literal>gist__int_ops</literal> (used by default) is suitable for
    small- to medium-size data sets, while
    <literal>gist__intbig_ops</literal> uses a larger signature and is more
@@ -274,6 +274,25 @@
    The implementation uses an RD-tree data structure with
    built-in lossy compression.
   </para>
+   
+  <para>
+   <literal>gist__int_ops</literal> approximates integer set as an array of
+   integer ranges.  Optional integer parameter <literal>numranges</literal> of
+   <literal>gist__int_ops</literal> determines maximum number of ranges in
+   one index key.  Default value of <literal>numranges</literal> is 100.
+   Valid values are between 1 and 253.  Using larger arrays as GiST index
+   keys leads to more precise search (scan less fraction of index, scan less
+   heap pages), but larger index.
+  </para>
+   
+  <para>
+   <literal>gist__intbig_ops</literal> approximates integer set as a bitmap
+   signature.  Optional integer parameter <literal>siglen</literal> of
+   <literal>gist__intbig_ops</literal> determines signature length in bytes.
+   Default signature length is 16 bytes.  Valid values of signature length
+   are between 1 and 2024 bytes.  Longer signatures leads to more precise
+   search (scan less fraction of index, scan less heap pages), but larger index.
+  </para>
 
   <para>
    There is also a non-default GIN operator class
@@ -293,8 +312,8 @@
 -- a message can be in one or more <quote>sections</quote>
 CREATE TABLE message (mid INT PRIMARY KEY, sections INT[], ...);
 
--- create specialized index
-CREATE INDEX message_rdtree_idx ON message USING GIST (sections gist__int_ops);
+-- create specialized index with sigature length of 32 bytes
+CREATE INDEX message_rdtree_idx ON message USING GIST (sections gist__int_ops(siglen=32));
 
 -- select messages in section 1 OR 2 - OVERLAP operator
 SELECT message.mid FROM message WHERE message.sections &amp;&amp; '{1,2}';
diff --git a/doc/src/sgml/ltree.sgml b/doc/src/sgml/ltree.sgml
index 2d539f23fd8..ae4b33ec85e 100644
--- a/doc/src/sgml/ltree.sgml
+++ b/doc/src/sgml/ltree.sgml
@@ -498,32 +498,61 @@ Europe &amp; Russia*@ &amp; !Transportation
    </listitem>
    <listitem>
     <para>
-     GiST index over <type>ltree</type>:
+     GiST index over <type>ltree</type> (<literal>gist_ltree_ops</literal>
+     opclass):
      <literal>&lt;</literal>, <literal>&lt;=</literal>, <literal>=</literal>,
      <literal>&gt;=</literal>, <literal>&gt;</literal>,
      <literal>@&gt;</literal>, <literal>&lt;@</literal>,
      <literal>@</literal>, <literal>~</literal>, <literal>?</literal>
     </para>
     <para>
-     Example of creating such an index:
+     <literal>gist_ltree_ops</literal> GiST opclass approximates set of
+     path labels as a bitmap signature.  Optional integer parameter
+     <literal>siglen</literal> of <literal>gist_ltree_ops</literal> determines
+     signature length in bytes.  Default signature length is 8 bytes.
+     Valid values of signature length are between 1 and 2024 bytes.  Longer
+     signatures leads to more precise search (scan less fraction of index, scan
+     less heap pages), but larger index.
+    </para>
+    <para>
+     Example of creating such an index with a default signature length of 8 bytes:
     </para>
 <programlisting>
 CREATE INDEX path_gist_idx ON test USING GIST (path);
 </programlisting>
+    <para>
+     Example of creating such an index with a signature length of 100 bytes:
+    </para>
+<programlisting>
+CREATE INDEX path_gist_idx ON test USING GIST (path gist_ltree_ops(siglen=100));
+</programlisting>
    </listitem>
    <listitem>
     <para>
-     GiST index over <type>ltree[]</type>:
+     GiST index over <type>ltree[]</type> (<literal>gist__ltree_ops</literal>
+     opclass):
      <literal>ltree[] &lt;@ ltree</literal>, <literal>ltree @&gt; ltree[]</literal>,
      <literal>@</literal>, <literal>~</literal>, <literal>?</literal>
     </para>
     <para>
-     Example of creating such an index:
+     <literal>gist__ltree_ops</literal> GiST opclass works similar to
+     <literal>gist_ltree_ops</literal> and also takes signature length as
+     a parameter.  Default value of <literal>siglen</literal> in
+      <literal>gist__ltree_ops</literal> is 28 bytes.
+    </para>
+    <para>
+     Example of creating such an index with a default signature length of 28 bytes:
     </para>
 <programlisting>
 CREATE INDEX path_gist_idx ON test USING GIST (array_path);
 </programlisting>
     <para>
+     Example of creating such an index with a signature length of 100 bytes:
+    </para>
+<programlisting>
+CREATE INDEX path_gist_idx ON test USING GIST (array_path gist__ltree_ops(siglen=100));
+</programlisting>
+    <para>
      Note: This index type is lossy.
     </para>
    </listitem>
diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
index 049f496869c..dde02634ae4 100644
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -391,6 +391,23 @@ CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
   </para>
 
   <para>
+   <literal>gist_trgm_ops</literal> GiST opclass approximates set of
+   trigrams as a bitmap signature.  Optional integer parameter
+   <literal>siglen</literal> of <literal>gist_trgm_ops</literal> determines
+   signature length in bytes.  Default signature length is 12 bytes.
+   Valid values of signature length are between 1 and 2024 bytes.  Longer
+   signatures leads to more precise search (scan less fraction of index, scan
+   less heap pages), but larger index.
+  </para>
+
+  <para>
+   Example of creating such an index with a signature length of 32 bytes:
+  </para>
+<programlisting>
+CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=32));
+</programlisting>
+
+  <para>
    At this point, you will have an index on the <structfield>t</structfield> column that
    you can use for similarity searching.  A typical query is
 <programlisting>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index f0fe6fb874b..3f902dcf84f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -22,7 +22,7 @@ PostgreSQL documentation
  <refsynopsisdiv>
 <synopsis>
 CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class="parameter">name</replaceable> ] ON [ ONLY ] <replaceable class="parameter">table_name</replaceable> [ USING <replaceable class="parameter">method</replaceable> ]
-    ( { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ COLLATE <replaceable class="parameter">collation</replaceable> ] [ <replaceable class="parameter">opclass</replaceable> ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
+    ( { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ COLLATE <replaceable class="parameter">collation</replaceable> ] { <replaceable class="parameter">opclass</replaceable> | DEFAULT } [ ( <replaceable class="parameter">opclass_parameter</replaceable> = <replaceable class="parameter">value</replaceable> [, ... ] ) ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
     [ INCLUDE ( <replaceable class="parameter">column_name</replaceable> [, ...] ) ]
     [ WITH ( <replaceable class="parameter">storage_parameter</replaceable> = <replaceable class="parameter">value</replaceable> [, ... ] ) ]
     [ TABLESPACE <replaceable class="parameter">tablespace_name</replaceable> ]
@@ -286,6 +286,15 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      </varlistentry>
 
      <varlistentry>
+      <term><replaceable class="parameter">opclass_parameter</replaceable></term>
+      <listitem>
+       <para>
+        The name of an operator class parameter. See below for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><literal>ASC</literal></term>
       <listitem>
        <para>
@@ -679,8 +688,9 @@ Indexes:
   </para>
 
   <para>
-   An <firstterm>operator class</firstterm> can be specified for each
-   column of an index. The operator class identifies the operators to be
+   An <firstterm>operator class</firstterm> with its optional parameters 
+   can be specified for each column of an index.
+   The operator class identifies the operators to be
    used by the index for that column. For example, a B-tree index on
    four-byte integers would use the <literal>int4_ops</literal> class;
    this operator class includes comparison functions for four-byte
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index c5b254c7ca9..2217fcd6c2f 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -3637,7 +3637,7 @@ SELECT plainto_tsquery('supernovae stars');
       <tertiary>text search</tertiary>
      </indexterm>
 
-      <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIST (<replaceable>column</replaceable>);</literal>
+      <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIST (<replaceable>column</replaceable> [ { DEFAULT | tsvector_ops } (siglen = <replaceable>number</replaceable>) ] );</literal>
      </term>
 
      <listitem>
@@ -3645,6 +3645,8 @@ SELECT plainto_tsquery('supernovae stars');
        Creates a GiST (Generalized Search Tree)-based index.
        The <replaceable>column</replaceable> can be of <type>tsvector</type> or
        <type>tsquery</type> type.
+       Optional integer parameter <literal>siglen</literal> determines
+       signature length in bytes (see below for details).
       </para>
      </listitem>
     </varlistentry>
@@ -3668,12 +3670,17 @@ SELECT plainto_tsquery('supernovae stars');
    to check the actual table row to eliminate such false matches.
    (<productname>PostgreSQL</productname> does this automatically when needed.)
    GiST indexes are lossy because each document is represented in the
-   index by a fixed-length signature. The signature is generated by hashing
+   index by a fixed-length signature.  Signature length in bytes is determined
+   by the value of the optional integer parameter <literal>siglen</literal>.
+   Default signature length (when <literal>siglen</literal> is not specied) is
+   124 bytes, maximal length is 2024 bytes. The signature is generated by hashing
    each word into a single bit in an n-bit string, with all these bits OR-ed
    together to produce an n-bit document signature.  When two words hash to
    the same bit position there will be a false match.  If all words in
    the query have matches (real or false) then the table row must be
-   retrieved to see if the match is correct.
+   retrieved to see if the match is correct.  Longer signatures leads to more
+   precise search (scan less fraction of index, scan less heap pages), but
+   larger index.
   </para>
 
   <para>
author	Alexander Korotkov <akorotkov@postgresql.org>	2020-03-30 19:17:11 +0300
committer	Alexander Korotkov <akorotkov@postgresql.org>	2020-03-30 19:17:23 +0300
commit	911e70207703799605f5a0e8aad9f06cff067c63 (patch)
tree	5cdbffea1945c2c0b766db195a55bcb372d60537 /doc/src
parent	1d53432ff940b789c2431ba476a2a6e2db3edf84 (diff)