Phrase full text search.

Patch introduces a new text search operator (<-> or <DISTANCE>) into tsquery.
The on-disk and binary in/out formats of tsquery are backward compatible.
It has two side effects:
- it changes the sort order for tsquery, so users who have a btree index
  over tsquery should reindex it
- tsquery output contains fewer parentheses and becomes more readable

Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
author: Teodor Sigaev <teodor@sigaev.ru> 2016-04-07 18:44:18 +0300
committer: Teodor Sigaev <teodor@sigaev.ru> 2016-04-07 18:44:18 +0300
commit: bb140506df605fab58f48926ee1db1f80bdafb59 (patch)
tree: 581f9aeb71e3596000af3b4904e0c62a372d77b3 /doc/src
parent: 015e88942aa50f0d419ddac00e63bb06d6e62e86 (diff)
3 files changed, 215 insertions, 15 deletions
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 7c3ef92cd2e..0b60c61d480 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -3924,8 +3924,9 @@ SELECT to_tsvector('english', 'The Fat Rats');
     <para>
      A <type>tsquery</type> value stores lexemes that are to be
      searched for, and combines them honoring the Boolean operators
-     <literal>&amp;</literal> (AND), <literal>|</literal> (OR), and
-     <literal>!</> (NOT).  Parentheses can be used to enforce grouping
+     <literal>&amp;</literal> (AND), <literal>|</literal> (OR),
+     <literal>!</> (NOT) and <literal>&lt;-&gt;</> (FOLLOWED BY) phrase search
+     operator.  Parentheses can be used to enforce grouping
      of the operators:
 
 <programlisting>
@@ -3946,8 +3947,8 @@ SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
 </programlisting>
 
      In the absence of parentheses, <literal>!</> (NOT) binds most tightly,
-     and <literal>&amp;</literal> (AND) binds more tightly than
-     <literal>|</literal> (OR).
+     and <literal>&amp;</literal> (AND) and <literal>&lt;-&gt;</literal> (FOLLOWED BY)
+     both bind more tightly than <literal>|</literal> (OR).
     </para>
 
     <para>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 15b6b4eb3d5..9b0778baa99 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9128,6 +9128,12 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         <entry><literal>!'cat'</literal></entry>
        </row>
        <row>
+        <entry> <literal>&lt;-&gt;</literal> </entry>
+        <entry><type>tsquery</> followed by <type>tsquery</></entry>
+        <entry><literal>to_tsquery('fat') &lt;-&gt; to_tsquery('rat')</literal></entry>
+        <entry><literal>'fat' &lt;-&gt; 'rat'</literal></entry>
+       </row>
+       <row>
         <entry> <literal>@&gt;</literal> </entry>
         <entry><type>tsquery</> contains another ?</entry>
         <entry><literal>'cat'::tsquery @&gt; 'cat &amp; rat'::tsquery</literal></entry>
@@ -9222,6 +9228,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
        <row>
         <entry>
          <indexterm>
+          <primary>phraseto_tsquery</primary>
+         </indexterm>
+         <literal><function>phraseto_tsquery(<optional> <replaceable class="PARAMETER">config</> <type>regconfig</> , </optional> <replaceable class="PARAMETER">query</> <type>text</type>)</function></literal>
+        </entry>
+        <entry><type>tsquery</type></entry>
+        <entry>produce <type>tsquery</> ignoring punctuation</entry>
+        <entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
+        <entry><literal>'fat' &lt;-&gt; 'rat'</literal></entry>
+       </row>
+       <row>
+        <entry>
+         <indexterm>
           <primary>querytree</primary>
          </indexterm>
          <literal><function>querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>)</function></literal>
@@ -9424,6 +9442,27 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
        <row>
         <entry>
          <indexterm>
+          <primary>tsquery_phrase</primary>
+         </indexterm>
+         <literal><function>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</>)</function></literal>
+        </entry>
+        <entry><type>tsquery</type></entry>
+        <entry>implementation of <literal>&lt;-&gt;</> (FOLLOWED BY) operator</entry>
+        <entry><literal>tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'))</literal></entry>
+        <entry><literal>'fat' &lt;-&gt; 'cat'</literal></entry>
+       </row>
+       <row>
+        <entry>
+         <literal><function>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">distance</replaceable> <type>integer</>)</function></literal>
+        </entry>
+        <entry><type>tsquery</type></entry>
+        <entry>phrase-concatenate with distance</entry>
+        <entry><literal>tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10)</literal></entry>
+        <entry><literal>'fat' &lt;10&gt; 'cat'</literal></entry>
+       </row>
+       <row>
+        <entry>
+         <indexterm>
           <primary>tsvector_update_trigger</primary>
          </indexterm>
          <literal><function>tsvector_update_trigger()</function></literal>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index ea3abc9e15a..930c8f0a5dc 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -263,9 +263,10 @@ SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t
     As the above example suggests, a <type>tsquery</type> is not just raw
     text, any more than a <type>tsvector</type> is.  A <type>tsquery</type>
     contains search terms, which must be already-normalized lexemes, and
-    may combine multiple terms using AND, OR, and NOT operators.
+    may combine multiple terms using AND, OR, NOT and FOLLOWED BY operators.
     (For details see <xref linkend="datatype-textsearch">.)  There are
-    functions <function>to_tsquery</> and <function>plainto_tsquery</>
+    functions <function>to_tsquery</>, <function>plainto_tsquery</>
+    and <function>phraseto_tsquery</>
     that are helpful in converting user-written text into a proper
     <type>tsquery</type>, for example by normalizing words appearing in
     the text.  Similarly, <function>to_tsvector</> is used to parse and
@@ -294,6 +295,35 @@ SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
    </para>
 
    <para>
+    Phrase search is made possible with the help of the <literal>&lt;-&gt;</>
+    (FOLLOWED BY) operator, which enforces lexeme order. This allows you
+    to discard strings not containing the desired phrase, for example:
+
+<programlisting>
+SELECT q @@ to_tsquery('fatal &lt;-&gt; error')
+FROM unnest(array[to_tsvector('fatal error'),
+                  to_tsvector('error is not fatal')]) AS q;
+ ?column?
+----------
+ t
+ f
+</programlisting>
+
+    A more generic version of the FOLLOWED BY operator takes form of
+    <literal>&lt;N&gt;</>, where N stands for the greatest allowed distance
+    between the specified lexemes. The <literal>phraseto_tsquery</>
+    function makes use of this behavior in order to construct a
+    <literal>tsquery</> capable of matching the provided phrase:
+
+<programlisting>
+SELECT phraseto_tsquery('cat ate some rats');
+       phraseto_tsquery
+-------------------------------
+ ( 'cat' &lt;-&gt; 'ate' ) &lt;2&gt; 'rat'
+</programlisting>
+   </para>
+
+   <para>
     The <literal>@@</literal> operator also
     supports <type>text</type> input, allowing explicit conversion of a text
     string to <type>tsvector</type> or <type>tsquery</> to be skipped
@@ -709,11 +739,14 @@ UPDATE tt SET ti =
 
    <para>
     <productname>PostgreSQL</productname> provides the
-    functions <function>to_tsquery</function> and
-    <function>plainto_tsquery</function> for converting a query to
-    the <type>tsquery</type> data type.  <function>to_tsquery</function>
-    offers access to more features than <function>plainto_tsquery</function>,
-    but is less forgiving about its input.
+    functions <function>to_tsquery</function>,
+    <function>plainto_tsquery</function> and
+    <function>phraseto_tsquery</function>
+    for converting a query to the <type>tsquery</type> data type.
+    <function>to_tsquery</function> offers access to more features
+    than both <function>plainto_tsquery</function> and
+    <function>phraseto_tsquery</function>, but is less forgiving
+    about its input.
    </para>
 
    <indexterm>
@@ -728,7 +761,8 @@ to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>
     <function>to_tsquery</function> creates a <type>tsquery</> value from
     <replaceable>querytext</replaceable>, which must consist of single tokens
     separated by the Boolean operators <literal>&amp;</literal> (AND),
-    <literal>|</literal> (OR) and <literal>!</literal> (NOT).  These operators
+    <literal>|</literal> (OR), <literal>!</literal> (NOT), and also the
+    <literal>&lt;-&gt;</literal> (FOLLOWED BY) phrase search operator. These operators
     can be grouped using parentheses.  In other words, the input to
     <function>to_tsquery</function> must already follow the general rules for
     <type>tsquery</> input, as described in <xref
@@ -814,8 +848,8 @@ SELECT plainto_tsquery('english', 'The Fat Rats');
 </screen>
 
     Note that <function>plainto_tsquery</> cannot
-    recognize Boolean operators, weight labels, or prefix-match labels
-    in its input:
+    recognize Boolean and phrase search operators, weight labels,
+    or prefix-match labels in its input:
 
 <screen>
 SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
@@ -827,6 +861,57 @@ SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
     Here, all the input punctuation was discarded as being space symbols.
    </para>
 
+   <indexterm>
+    <primary>phraseto_tsquery</primary>
+   </indexterm>
+
+<synopsis>
+phraseto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
+</synopsis>
+
+   <para>
+    <function>phraseto_tsquery</> behaves much like
+    <function>plainto_tsquery</>, with the exception
+    that it utilizes the <literal>&lt;-&gt;</literal> (FOLLOWED BY) phrase search
+    operator instead of the <literal>&amp;</literal> (AND) Boolean operator.
+    This is particularly useful when searching for exact lexeme sequences,
+    since the phrase search operator helps to maintain lexeme order.
+   </para>
+
+   <para>
+    Example:
+
+<screen>
+SELECT phraseto_tsquery('english', 'The Fat Rats');
+ phraseto_tsquery
+------------------
+ 'fat' &lt;-&gt; 'rat'
+</screen>
+
+    Just like the <function>plainto_tsquery</>, the
+    <function>phraseto_tsquery</> function cannot
+    recognize Boolean and phrase search operators, weight labels,
+    or prefix-match labels in its input:
+
+<screen>
+SELECT phraseto_tsquery('english', 'The Fat &amp; Rats:C');
+      phraseto_tsquery
+-----------------------------
+ ( 'fat' &lt;-&gt; 'rat' ) &lt;-&gt; 'c'
+</screen>
+
+    It is possible to specify the configuration to be used to parse the document,
+    for example, we could create a new one using the hunspell dictionary
+    (namely 'eng_hunspell') in order to match phrases with different word forms:
+
+<screen>
+SELECT phraseto_tsquery('eng_hunspell', 'developer of the building which collapsed');
+                                      phraseto_tsquery
+--------------------------------------------------------------------------------------------
+ ( 'developer' &lt;3&gt; 'building' ) &lt;2&gt; 'collapse' | ( 'developer' &lt;3&gt; 'build' ) &lt;2&gt; 'collapse'
+</screen>
+   </para>
+
   </sect2>
 
   <sect2 id="textsearch-ranking">
@@ -1390,6 +1475,81 @@ FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
     <varlistentry>
 
      <term>
+      <literal><type>tsquery</> &lt;-&gt; <type>tsquery</></literal>
+     </term>
+
+     <listitem>
+      <para>
+       Returns the phrase-concatenation of the two given queries.
+
+<screen>
+SELECT to_tsquery('fat') &lt;-&gt; to_tsquery('cat | rat');
+             ?column?
+-----------------------------------
+ 'fat' &lt;-&gt; 'cat' | 'fat' &lt;-&gt; 'rat'
+</screen>
+      </para>
+     </listitem>
+
+    </varlistentry>
+
+    <varlistentry>
+
+     <term>
+     <indexterm>
+      <primary>tsquery_phrase</primary>
+     </indexterm>
+
+      <literal>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</> [, <replaceable class="PARAMETER">distance</replaceable> <type>integer</> ]) returns <type>tsquery</></literal>
+     </term>
+
+     <listitem>
+      <para>
+       Returns the distanced phrase-concatenation of the two given queries.
+       This function lies in the implementation of the <literal>&lt;-&gt;</> operator.
+
+<screen>
+SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
+  tsquery_phrase
+------------------
+ 'fat' &lt;10&gt; 'cat'
+</screen>
+      </para>
+     </listitem>
+
+    </varlistentry>
+
+    <varlistentry>
+
+     <term>
+     <indexterm>
+      <primary>setweight</primary>
+     </indexterm>
+
+      <literal>setweight(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsquery</></literal>
+     </term>
+
+     <listitem>
+      <para>
+       <function>setweight</> returns a copy of the input query in which every
+       position has been labeled with the given <replaceable>weight</>(s), either
+       <literal>A</literal>, <literal>B</literal>, <literal>C</literal>,
+       <literal>D</literal> or their combination. These labels are retained when
+       queries are concatenated, allowing words from different parts of a document
+       to be weighted differently by ranking functions.
+      </para>
+
+      <para>
+       Note that weight labels apply to <emphasis>positions</>, not
+       <emphasis>lexemes</>.  If the input query has been stripped of
+       positions then <function>setweight</> does nothing.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+
+     <term>
      <indexterm>
       <primary>numnode</primary>
      </indexterm>
@@ -2428,7 +2588,7 @@ more sample word(s) : more indexed word(s)
 
    <para>
     Specific stop words recognized by the subdictionary cannot be
-    specified;  instead use <literal>?</> to mark the location where any
+    specified;  instead use <literal>&lt;-&gt;</> to mark the location where any
     stop word can appear.  For example, assuming that <literal>a</> and
     <literal>the</> are stop words according to the subdictionary: