- publishing free software manuals
The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference
by The PostgreSQL Global Development Group
Paperback (6"x9"), 454 pages
ISBN 9781906966041
RRP £14.95 ($19.95)

Sales of this book support the PostgreSQL project! Get a printed copy>>>

10.1.1 What Is a Document?

A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.

For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:

SELECT title || ' ' ||  author || ' ' ||  abstract || ' ' || 
  body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' 
  || d.body AS document
FROM messages m, docs d
WHERE mid = did AND mid = 12;

Note: Actually, in these example queries, coalesce should be used to prevent a single NULL attribute from causing a NULL result for the whole document.

Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL. Also, keeping everything inside the database allows easy access to document metadata to assist in indexing and display.

For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document--the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the tsvector as being the document, but of course it is only a compact representation of the full document.

ISBN 9781906966041The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language ReferenceSee the print edition