summaryrefslogtreecommitdiff
path: root/doc/src/sgml/trgm.sgml
blob: 63f677a6475142d39cc41882bf1420368df1fc43 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
<sect1 id="pgtrgm">
 <title>pg_trgm</title>
 
 <indexterm zone="pgtrgm">
  <primary>pgtrgm</primary>
 </indexterm>

 <para>
  The <literal>pg_trgm</literal> module provides functions and index classes
  for determining the similarity of text based on trigram matching.
 </para>

 <sect2>
  <title>Trigram (or Trigraph)</title>
  <para>
   A trigram is a set of three consecutive characters taken
   from a string.  A string is considered to have two spaces
   prefixed and one space suffixed when determining the set
   of trigrams that comprise the string.
  </para>
  <para>
   eg. The set of trigrams in the word "cat" is "  c", " ca", 
   "at " and "cat".
  </para>
 </sect2>

 <sect2>
  <title>Public Functions</title>
  <table>
   <title><literal>pg_trgm</literal> functions</title>
   <tgroup cols="2">
    <thead>
     <row>
      <entry>Function</entry>
      <entry>Description</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry><literal>real similarity(text, text)</literal></entry>
      <entry>
       <para>
        Returns a number that indicates how closely matches the two
        arguments are.  A zero result indicates that the two words
        are completely dissimilar, and a result of one indicates that
        the two words are identical.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real show_limit()</literal></entry>
      <entry>
       <para>
        Returns the current similarity threshold used by the '%'
        operator.  This in effect sets the minimum similarity between
        two words in order that they be considered similar enough to
        be misspellings of each other, for example.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real set_limit(real)</literal></entry>
      <entry>
       <para>
        Sets the current similarity threshold that is used by the '%'
        operator, and is returned by the show_limit() function.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>text[] show_trgm(text)</literal></entry>
      <entry>
       <para>
        Returns an array of all the trigrams of the supplied text
        parameter.
       </para>
      </entry>
     </row>
     <row>
      <entry>Operator: <literal>text % text (returns boolean)</literal></entry> 
      <entry>
       <para>
        The '%' operator returns TRUE if its two arguments have a similarity
        that is greater than the similarity threshold set by set_limit(). It
        will return FALSE if the similarity is less than the current
        threshold.
       </para>
      </entry>
     </row>
    </tbody>
   </tgroup>
  </table>
 </sect2>

 <sect2>
  <title>Public Index Operator Class</title>
  <para>
   The <literal>pg_trgm</literal> module comes with the 
   <literal>gist_trgm_ops</literal> index operator class that allows a
   developer to create an index over a text column for the purpose
   of very fast similarity searches.
  </para>
  <para>
   To use this index, the '%' operator must be used and an appropriate
   similarity threshold for the application must be set. Example:
  </para>
  <programlisting>
CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
  </programlisting>
  <para>
   At this point, you will have an index on the t text column that you
   can use for similarity searching. Example:
  </para>
  <programlisting>
SELECT
        t,
        similarity(t, 'word') AS sml
FROM
        test_trgm
WHERE
        t % 'word'
ORDER BY
        sml DESC, t;
  </programlisting>
  <para>
   This will return all values in the text column that are sufficiently
   similar to 'word', sorted from best match to worst.  The index will
   be used to make this a fast operation over very large data sets.
  </para>
 </sect2>

 <sect2>
  <title>Text Search Integration</title>
  <para>
   Trigram matching is a very useful tool when used in conjunction
   with a full text index.
  </para>
  <para>
   The first step is to generate an auxiliary table containing all
   the unique words in the documents:
  </para>
  <programlisting>
CREATE TABLE words AS SELECT word FROM 
        stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
  </programlisting>
  <para>
   where <structname>documents</> is a table that has a text field
   <structfield>bodytext</> that we wish to search.  The use of the
   <literal>simple</> configuration with the <function>to_tsvector</>
   function, instead of just using the already
   existing vector is to avoid creating a list of already stemmed
   words.  This way, only the original, unstemmed words are added
   to the word list.
  </para>
  <para>
   Next, create a trigram index on the word column:
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
  </programlisting>
  <para>
   or
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
  </programlisting>
  <para>
   Now, a <literal>SELECT</literal> query similar to the example above can be 
   used to suggest spellings for misspelled words in user search terms. A
   useful extra clause is to ensure that the similar words are also
   of similar length to the misspelled word.
  </para>
  <para>
   <note>
    <para>
     Since the <structname>words</> table has been generated as a separate,
     static table, it will need to be periodically regenerated so that
     it remains up to date with the document collection.
    </para>
   </note>
  </para>
 </sect2>

 <sect2>
  <title>References</title>
  <para>
   GiST Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
  </para>
  <para>
   Tsearch2 Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
  </para>
 </sect2>

 <sect2>
  <title>Authors</title>
  <para>
   Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
  </para>
  <para>
   Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
  </para>
  <para>
   Documentation: Christopher Kings-Lynne 
  </para>
  <para>
   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
  </para>
 </sect2>

</sect1>