| The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference
by The PostgreSQL Global Development Group Paperback (6"x9"), 454 pages ISBN 9781906966041 RRP £14.95 ($19.95) Sales of this book support the PostgreSQL project! Get a printed copy>>> |
10.6 Dictionaries
Dictionaries are used to eliminate words that should not be considered in a
search (stop words), and to normalize words so
that different derived forms of the same word will match. A successfully
normalized word is called a lexeme. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the tsvector representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
Some examples of normalization:
- Linguistic - Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
-
URL locations can be canonicalized to make
equivalent URLs match:
- http://www.pgsql.ru/db/mw/index.html
- http://www.pgsql.ru/db/mw/
- http://www.pgsql.ru/db/../db/mw/index.html
-
Color names can be replaced by their hexadecimal values, e.g.,
red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF - If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so for example 3.14159265359, 3.1415926, 3.14 will be the same after normalization if only two digits are kept after the decimal point.
A dictionary is a program that accepts a token as input and returns:
- an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
-
a single lexeme with the
TSL_FILTERflag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary) - an empty array if the dictionary knows the token, but it is a stop word
-
NULLif the dictionary does not recognize the input token
PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters. Each predefined dictionary template is described below. If no existing template is suitable, it is possible to create new ones; see the ‘contrib/’ area of the PostgreSQL distribution for examples.
A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token
type that the parser can return, a separate list of dictionaries is
specified by the configuration. When a token of that type is found
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for.
Normally, the first dictionary that returns a non-NULL
output determines the result, and any remaining dictionaries are not
consulted; but a filtering dictionary can replace the given word
with a modified word, which is then passed to subsequent dictionaries.
The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like
a Snowball stemmer or simple, which
recognizes everything. For example, for an astronomy-specific search
(astro_en configuration) one could bind token type
asciiword (ASCII word) to a synonym dictionary of astronomical
terms, a general English dictionary and a Snowball English
stemmer:
ALTER TEXT SEARCH CONFIGURATION astro_en
ADD MAPPING FOR asciiword WITH astrosyn, english_ispell,
english_stem;
A filtering dictionary can be placed anywhere in the list, except at the end where it'd be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the ‘contrib/unaccent’ (unaccent) extension module.
| ISBN 9781906966041 | The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference | See the print edition |