|The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference
by The PostgreSQL Global Development Group
Paperback (6"x9"), 454 pages
RRP £14.95 ($19.95)
Sales of this book support the PostgreSQL project! Get a printed copy>>>
Full Text Searching (or just text search) provides
the capability to identify natural-language documents that
satisfy a query, and optionally to sort them by
relevance to the query. The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query. Notions of
similarity are very flexible and depend on the specific
application. The simplest search considers
query as a
set of words and
similarity as the frequency of query
words in the document.
Textual search operators have existed in databases for years.
ILIKE operators for textual data types, but they lack
many essential properties required by modern information systems:
There is no linguistic support, even for English. Regular expressions
are not sufficient because they cannot easily handle derived words, e.g.,
satisfy. You might miss documents that contain
satisfies, although you probably would like to find them when searching for
satisfy. It is possible to use
ORto search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
- They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
- They tend to be slow because there is no index support, so they must process all documents for every search.
Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
- Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
Converting tokens into lexemes.
A lexeme is a string, just like a token, but it has been
normalized so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as
esin English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
- Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for proximity ranking, so that a document that contains a more “dense” region of query words is assigned a higher rank than one with scattered query words.
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
- Define stop words that should not be indexed.
- Map synonyms to a single word using Ispell.
- Map phrases to a single word using a thesaurus.
- Map different variations of a word to a canonical form using an Ispell dictionary.
- Map different variations of a word to a canonical form using Snowball stemmer rules.
A data type
tsvector is provided for storing preprocessed
documents, along with a type
tsquery for representing processed
queries (section 6.11 Text Search Types). There are many
functions and operators available for these data types
(section 7.13 Text Search Functions and Operators), the most important of which is
the match operator
@@, which we introduce in
section 10.1.2 Basic Text Matching. Full text searches can be accelerated
using indexes (section 10.9 GiST and GIN Index Types).
|ISBN 9781906966041||The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference||See the print edition|