| The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference
by The PostgreSQL Global Development Group Paperback (6"x9"), 454 pages ISBN 9781906966041 RRP £14.95 ($19.95) Sales of this book support the PostgreSQL project! Get a printed copy>>> |
10.6.2 Simple Dictionary
The simple dictionary template operates by converting the
input token to lower case and checking it against a file of stop words.
If it is found in the file then an empty array is returned, causing
the token to be discarded. If not, the lower-cased form of the word
is returned as the normalized lexeme. Alternatively, the dictionary
can be configured to report non-stop-words as unrecognized, allowing
them to be passed on to the next dictionary in the list.
Here is an example of a dictionary definition using the simple
template:
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
Here, english is the base name of a file of stop words.
The file's full name will be
‘$SHAREDIR/tsearch_data/english.stop’,
where $SHAREDIR means the
PostgreSQL installation's shared-data directory,
often ‘/usr/local/share/postgresql’ (use pg_config
--sharedir to determine it if you're not sure).
The file format is simply a list
of words, one per line. Blank lines and trailing spaces are ignored,
and upper case is folded to lower case, but no other processing is done
on the file contents.
Now we can test our dictionary:
SELECT ts_lexize('public.simple_dict','YeS');
ts_lexize
-----------
{yes}
SELECT ts_lexize('public.simple_dict','The');
ts_lexize
-----------
{}
We can also choose to return NULL, instead of the lower-cased
word, if it is not found in the stop words file. This behavior is
selected by setting the dictionary's Accept parameter to
false. Continuing the example:
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept =
false );
SELECT ts_lexize('public.simple_dict','YeS');
ts_lexize
-----------
SELECT ts_lexize('public.simple_dict','The');
ts_lexize
-----------
{}
With the default setting of Accept = true,
it is only useful to place a simple dictionary at the end
of a list of dictionaries, since it will never pass on any token to
a following dictionary. Conversely, Accept = false
is only useful when there is at least one following dictionary.
Caution: Most types of dictionaries rely on configuration files, such as files of stop words. These files must be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.
Caution: Normally, a database session will read a dictionary configuration file only once, when it is first used within the session. If you modify a configuration file and want to force existing sessions to pick up the new contents, issue an
ALTER TEXT SEARCH DICTIONARYcommand on the dictionary. This can be a “dummy” update that doesn't actually change any parameter values.
| ISBN 9781906966041 | The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference | See the print edition |