| The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference
by The PostgreSQL Global Development Group Paperback (6"x9"), 454 pages ISBN 9781906966041 RRP £14.95 ($19.95) Sales of this book support the PostgreSQL project! Get a printed copy>>> |
6.11.1 tsvector
A tsvector value is a sorted list of distinct
lexemes, which are words that have been
normalized to merge different variants of the same word
(see section 10 Full Text Search for details). Sorting and
duplicate-elimination are done automatically during input, as shown in
this example:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
tsvector
--------------------------------------------------
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
To represent lexemes containing whitespace or punctuation, surround them with quotes:
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
tsvector
-------------------------------------------
' ' 'contains' 'lexeme' 'spaces' 'the'
(We use dollar-quoted string literals in this example and the next one to avoid the confusion of having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled:
SELECT $$the lexeme 'Joe”s' contains a quote$$::tsvector;
tsvector
------------------------------------------------
'Joe”s' 'a' 'contains' 'lexeme' 'quote' 'the'
Optionally, integer positions can be attached to lexemes:
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10
fat:11 rat:12'::tsvector;
tsvector
--------------------------------------------------
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5
'rat':12 'sat':4
A position normally indicates the source word's location in the document. Positional information can be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme are discarded.
Lexemes that have positions can further be labeled with a
weight, which can be A,
B, C, or D.
D is the default and hence is not shown on output:
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
tsvector
----------------------------
'a':1A 'cat':5 'fat':2B,4C
Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.
It is important to understand that the
tsvector type itself does not perform any normalization;
it assumes the words it is given are normalized appropriately
for the application. For example,
select 'The Fat Rats'::tsvector;
tsvector
--------------------
'Fat' 'Rats' 'The'
For most English-text-searching applications the above words would
be considered non-normalized, but tsvector doesn't care.
Raw document text should usually be passed through
to_tsvector to normalize the words appropriately
for searching:
SELECT to_tsvector('english', 'The Fat Rats');
to_tsvector
-----------------
'fat':2 'rat':3
Again, see section 10 Full Text Search for more detail.
| ISBN 9781906966041 | The PostgreSQL 9.0 Reference Manual - Volume 1A - SQL Language Reference | See the print edition |