| The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference
by The PostgreSQL Global Development Group Paperback (6"x9"), 454 pages ISBN 9781906966041 RRP £14.95 ($19.95) Sales of this book support the PostgreSQL project! Get a printed copy>>> |
10.6.4 Thesaurus Dictionary
A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.
Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well. PostgreSQL's current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added phrase support. A thesaurus dictionary requires a configuration file of the following format:
# this is a comment sample word(s) : indexed word(s) more sample word(s) : more indexed word(s) ...
where the colon (:) symbol acts as a delimiter between a
a phrase and its replacement.
A thesaurus dictionary uses a subdictionary (which
is specified in the dictionary's configuration) to normalize the input
text before checking for phrase matches. It is only possible to select one
subdictionary. An error is reported if the subdictionary fails to
recognize a word. In that case, you should remove the use of the word or
teach the subdictionary about it. You can place an asterisk
(*) at the beginning of an indexed word to skip applying
the subdictionary to it, but all sample words must be known
to the subdictionary.
The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.
Specific stop words recognized by the subdictionary cannot be
specified; instead use ? to mark the location where any
stop word can appear. For example, assuming that a and
the are stop words according to the subdictionary:
? one ? two : swsw
matches a one the two and the one a two;
both would be replaced by swsw.
Since a thesaurus dictionary has the capability to recognize phrases it
must remember its state and interact with the parser. A thesaurus dictionary
uses these assignments to check if it should handle the next word or stop
accumulation. The thesaurus dictionary must be configured
carefully. For example, if the thesaurus dictionary is assigned to handle
only the asciiword token, then a thesaurus dictionary
definition like one 7 will not work since token type
uint is not assigned to the thesaurus dictionary.
Caution: Thesauruses are used during indexing so any change in the thesaurus dictionary's parameters requires reindexing. For most other dictionary types, small changes such as adding or removing stopwords does not force reindexing.
| ISBN 9781906966041 | The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference | See the print edition |