author | Teodor Sigaev <teodor@sigaev.ru> | 2016-04-07 18:44:18 +0300 |
---|---|---|
committer | Teodor Sigaev <teodor@sigaev.ru> | 2016-04-07 18:44:18 +0300 |
commit | bb140506df605fab58f48926ee1db1f80bdafb59 (patch) | |
tree | 581f9aeb71e3596000af3b4904e0c62a372d77b3 /doc/src | |
parent | 015e88942aa50f0d419ddac00e63bb06d6e62e86 (diff) |
Phrase full text search.
This patch introduces a new text search operator (<-> or <DISTANCE>) into tsquery.
The on-disk and binary in/out formats of tsquery are backward compatible.
It has two side effects:
- the sort order of tsquery values changes, so users who have a btree index
over tsquery should reindex it
- tsquery output contains fewer parentheses, which makes it more
readable
Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
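
A minimal sketch of the reindex step mentioned in the commit message, assuming a hypothetical table "queries" with a tsquery column "q" and a btree index named "queries_q_idx". These names are illustrative only and do not come from the patch. The second statement uses the new operator and is adapted from the documentation example added in this diff.

```sql
-- Hypothetical setup, shown for context only (not part of the patch):
--   CREATE TABLE queries (q tsquery);
--   CREATE INDEX queries_q_idx ON queries USING btree (q);

-- The ordering of tsquery values changes with this patch, so a btree
-- index built before the upgrade should be rebuilt.
REINDEX INDEX queries_q_idx;

-- Phrase search with the new FOLLOWED BY operator: the match requires
-- 'fatal' to be immediately followed by 'error'.
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');
```

REINDEX rebuilds a single index in place; which indexes need it depends on the installation, so the statement above is a template, not a complete upgrade procedure.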
Diffstat (limited to 'doc/src')
-rw-r--r-- | doc/src/sgml/datatype.sgml | 9 |
-rw-r--r-- | doc/src/sgml/func.sgml | 39 |
-rw-r--r-- | doc/src/sgml/textsearch.sgml | 182 |
3 files changed, 215 insertions, 15 deletions
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 7c3ef92cd2e..0b60c61d480 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -3924,8 +3924,9 @@ SELECT to_tsvector('english', 'The Fat Rats'); <para> A <type>tsquery</type> value stores lexemes that are to be searched for, and combines them honoring the Boolean operators - <literal>&</literal> (AND), <literal>|</literal> (OR), and - <literal>!</> (NOT). Parentheses can be used to enforce grouping + <literal>&</literal> (AND), <literal>|</literal> (OR), + <literal>!</> (NOT) and <literal><-></> (FOLLOWED BY) phrase search + operator. Parentheses can be used to enforce grouping of the operators: <programlisting> @@ -3946,8 +3947,8 @@ SELECT 'fat & rat & ! cat'::tsquery; </programlisting> In the absence of parentheses, <literal>!</> (NOT) binds most tightly, - and <literal>&</literal> (AND) binds more tightly than - <literal>|</literal> (OR). + and <literal>&</literal> (AND) and <literal><-></literal> (FOLLOWED BY) + both bind more tightly than <literal>|</literal> (OR). 
</para> <para> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 15b6b4eb3d5..9b0778baa99 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9128,6 +9128,12 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple <entry><literal>!'cat'</literal></entry> </row> <row> + <entry> <literal><-></literal> </entry> + <entry><type>tsquery</> followed by <type>tsquery</></entry> + <entry><literal>to_tsquery('fat') <-> to_tsquery('rat')</literal></entry> + <entry><literal>'fat' <-> 'rat'</literal></entry> + </row> + <row> <entry> <literal>@></literal> </entry> <entry><type>tsquery</> contains another ?</entry> <entry><literal>'cat'::tsquery @> 'cat & rat'::tsquery</literal></entry> @@ -9222,6 +9228,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple <row> <entry> <indexterm> + <primary>phraseto_tsquery</primary> + </indexterm> + <literal><function>phraseto_tsquery(<optional> <replaceable class="PARAMETER">config</> <type>regconfig</> , </optional> <replaceable class="PARAMETER">query</> <type>text</type>)</function></literal> + </entry> + <entry><type>tsquery</type></entry> + <entry>produce <type>tsquery</> ignoring punctuation</entry> + <entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry> + <entry><literal>'fat' <-> 'rat'</literal></entry> + </row> + <row> + <entry> + <indexterm> <primary>querytree</primary> </indexterm> <literal><function>querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>)</function></literal> @@ -9424,6 +9442,27 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple <row> <entry> <indexterm> + <primary>tsquery_phrase</primary> + </indexterm> + <literal><function>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</>)</function></literal> + </entry> + 
<entry><type>tsquery</type></entry> + <entry>implementation of <literal><-></> (FOLLOWED BY) operator</entry> + <entry><literal>tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'))</literal></entry> + <entry><literal>'fat' <-> 'cat'</literal></entry> + </row> + <row> + <entry> + <literal><function>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">distance</replaceable> <type>integer</>)</function></literal> + </entry> + <entry><type>tsquery</type></entry> + <entry>phrase-concatenate with distance</entry> + <entry><literal>tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10)</literal></entry> + <entry><literal>'fat' <10> 'cat'</literal></entry> + </row> + <row> + <entry> + <indexterm> <primary>tsvector_update_trigger</primary> </indexterm> <literal><function>tsvector_update_trigger()</function></literal> diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index ea3abc9e15a..930c8f0a5dc 100644 --- a/doc/src/sgml/textsearch.sgml +++ b/doc/src/sgml/textsearch.sgml @@ -263,9 +263,10 @@ SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t As the above example suggests, a <type>tsquery</type> is not just raw text, any more than a <type>tsvector</type> is. A <type>tsquery</type> contains search terms, which must be already-normalized lexemes, and - may combine multiple terms using AND, OR, and NOT operators. + may combine multiple terms using AND, OR, NOT and FOLLOWED BY operators. (For details see <xref linkend="datatype-textsearch">.) There are - functions <function>to_tsquery</> and <function>plainto_tsquery</> + functions <function>to_tsquery</>, <function>plainto_tsquery</> + and <function>phraseto_tsquery</> that are helpful in converting user-written text into a proper <type>tsquery</type>, for example by normalizing words appearing in the text. 
Similarly, <function>to_tsvector</> is used to parse and @@ -294,6 +295,35 @@ SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat'); </para> <para> + Phrase search is made possible with the help of the <literal><-></> + (FOLLOWED BY) operator, which enforces lexeme order. This allows you + to discard strings not containing the desired phrase, for example: + +<programlisting> +SELECT q @@ to_tsquery('fatal <-> error') +FROM unnest(array[to_tsvector('fatal error'), + to_tsvector('error is not fatal')]) AS q; + ?column? +---------- + t + f +</programlisting> + + A more generic version of the FOLLOWED BY operator takes form of + <literal><N></>, where N stands for the greatest allowed distance + between the specified lexemes. The <literal>phraseto_tsquery</> + function makes use of this behavior in order to construct a + <literal>tsquery</> capable of matching the provided phrase: + +<programlisting> +SELECT phraseto_tsquery('cat ate some rats'); + phraseto_tsquery +------------------------------- + ( 'cat' <-> 'ate' ) <2> 'rat' +</programlisting> + </para> + + <para> The <literal>@@</literal> operator also supports <type>text</type> input, allowing explicit conversion of a text string to <type>tsvector</type> or <type>tsquery</> to be skipped @@ -709,11 +739,14 @@ UPDATE tt SET ti = <para> <productname>PostgreSQL</productname> provides the - functions <function>to_tsquery</function> and - <function>plainto_tsquery</function> for converting a query to - the <type>tsquery</type> data type. <function>to_tsquery</function> - offers access to more features than <function>plainto_tsquery</function>, - but is less forgiving about its input. + functions <function>to_tsquery</function>, + <function>plainto_tsquery</function> and + <function>phraseto_tsquery</function> + for converting a query to the <type>tsquery</type> data type. 
+ <function>to_tsquery</function> offers access to more features + than both <function>plainto_tsquery</function> and + <function>phraseto_tsquery</function>, but is less forgiving + about its input. </para> <indexterm> @@ -728,7 +761,8 @@ to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type> <function>to_tsquery</function> creates a <type>tsquery</> value from <replaceable>querytext</replaceable>, which must consist of single tokens separated by the Boolean operators <literal>&</literal> (AND), - <literal>|</literal> (OR) and <literal>!</literal> (NOT). These operators + <literal>|</literal> (OR), <literal>!</literal> (NOT), and also the + <literal><-></literal> (FOLLOWED BY) phrase search operator. These operators can be grouped using parentheses. In other words, the input to <function>to_tsquery</function> must already follow the general rules for <type>tsquery</> input, as described in <xref @@ -814,8 +848,8 @@ SELECT plainto_tsquery('english', 'The Fat Rats'); </screen> Note that <function>plainto_tsquery</> cannot - recognize Boolean operators, weight labels, or prefix-match labels - in its input: + recognize Boolean and phrase search operators, weight labels, + or prefix-match labels in its input: <screen> SELECT plainto_tsquery('english', 'The Fat & Rats:C'); @@ -827,6 +861,57 @@ SELECT plainto_tsquery('english', 'The Fat & Rats:C'); Here, all the input punctuation was discarded as being space symbols. 
</para> + <indexterm> + <primary>phraseto_tsquery</primary> + </indexterm> + +<synopsis> +phraseto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</> +</synopsis> + + <para> + <function>phraseto_tsquery</> behaves much like + <function>plainto_tsquery</>, with the exception + that it utilizes the <literal><-></literal> (FOLLOWED BY) phrase search + operator instead of the <literal>&</literal> (AND) Boolean operator. + This is particularly useful when searching for exact lexeme sequences, + since the phrase search operator helps to maintain lexeme order. + </para> + + <para> + Example: + +<screen> +SELECT phraseto_tsquery('english', 'The Fat Rats'); + phraseto_tsquery +------------------ + 'fat' <-> 'rat' +</screen> + + Just like the <function>plainto_tsquery</>, the + <function>phraseto_tsquery</> function cannot + recognize Boolean and phrase search operators, weight labels, + or prefix-match labels in its input: + +<screen> +SELECT phraseto_tsquery('english', 'The Fat & Rats:C'); + phraseto_tsquery +----------------------------- + ( 'fat' <-> 'rat' ) <-> 'c' +</screen> + + It is possible to specify the configuration to be used to parse the document, + for example, we could create a new one using the hunspell dictionary + (namely 'eng_hunspell') in order to match phrases with different word forms: + +<screen> +SELECT phraseto_tsquery('eng_hunspell', 'developer of the building which collapsed'); + phraseto_tsquery +-------------------------------------------------------------------------------------------- + ( 'developer' <3> 'building' ) <2> 'collapse' | ( 'developer' <3> 'build' ) <2> 'collapse' +</screen> + </para> + </sect2> <sect2 id="textsearch-ranking"> @@ -1390,6 +1475,81 @@ FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank <varlistentry> <term> + <literal><type>tsquery</> <-> <type>tsquery</></literal> + 
</term> + + <listitem> + <para> + Returns the phrase-concatenation of the two given queries. + +<screen> +SELECT to_tsquery('fat') <-> to_tsquery('cat | rat'); + ?column? +----------------------------------- + 'fat' <-> 'cat' | 'fat' <-> 'rat' +</screen> + </para> + </listitem> + + </varlistentry> + + <varlistentry> + + <term> + <indexterm> + <primary>tsquery_phrase</primary> + </indexterm> + + <literal>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</> [, <replaceable class="PARAMETER">distance</replaceable> <type>integer</> ]) returns <type>tsquery</></literal> + </term> + + <listitem> + <para> + Returns the distanced phrase-concatenation of the two given queries. + This function lies in the implementation of the <literal><-></> operator. + +<screen> +SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10); + tsquery_phrase +------------------ + 'fat' <10> 'cat' +</screen> + </para> + </listitem> + + </varlistentry> + + <varlistentry> + + <term> + <indexterm> + <primary>setweight</primary> + </indexterm> + + <literal>setweight(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsquery</></literal> + </term> + + <listitem> + <para> + <function>setweight</> returns a copy of the input query in which every + position has been labeled with the given <replaceable>weight</>(s), either + <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, + <literal>D</literal> or their combination. These labels are retained when + queries are concatenated, allowing words from different parts of a document + to be weighted differently by ranking functions. + </para> + + <para> + Note that weight labels apply to <emphasis>positions</>, not + <emphasis>lexemes</>. If the input query has been stripped of + positions then <function>setweight</> does nothing. 
+ </para> + </listitem> + </varlistentry> + + <varlistentry> + + <term> <indexterm> <primary>numnode</primary> </indexterm> @@ -2428,7 +2588,7 @@ more sample word(s) : more indexed word(s) <para> Specific stop words recognized by the subdictionary cannot be - specified; instead use <literal>?</> to mark the location where any + specified; instead use <literal><-></> to mark the location where any stop word can appear. For example, assuming that <literal>a</> and <literal>the</> are stop words according to the subdictionary: |