Greenplum Database has additional functions and operators you can use to manipulate search and query vectors, and to rewrite search queries.
This section contains the following subtopics:
Parsing Documents showed how raw textual documents can be converted into tsvector
values. Greenplum Database also provides functions and operators that can be used to manipulate documents that are already in tsvector
form.
tsvector || tsvector
tsvector
concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing
to_tsvector
on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)
One advantage of using concatenation in the vector form, rather than concatenating text before applying to_tsvector
, is that you can use different configurations to parse different sections of the document. Also, because the setweight
function marks all lexemes of the given vector the same way, it is necessary to parse the text and do setweight
before concatenating if you want to label different parts of the document with different weights.
setweight(<vector> tsvector, <weight> "char") returns tsvector
setweight
returns a copy of the input vector in which every position has been labeled with the given
<weight>
, either
A
,
B
,
C
, or
D
. (
D
is the default for new vectors and as such is not displayed on output.) These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.
Note that weight labels apply to positions, not lexemes. If the input vector has been stripped of positions then setweight
does nothing.
length(<vector> tsvector) returns integer
strip(vector tsvector) returns tsvector
Parsing Queries showed how raw textual queries can be converted into tsquery
values. Greenplum Database also provides functions and operators that can be used to manipulate queries that are already in tsquery
form.
tsquery && tsquery
tsquery || tsquery
!! tsquery
numnode(<query> tsquery) returns integer
SELECT numnode(plainto_tsquery('the any'));
NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
numnode
---------
0
SELECT numnode('foo & bar'::tsquery);
numnode
---------
3
querytree(<query> tsquery) returns text
SELECT querytree(to_tsquery('!defined'));
querytree
-----------
The ts_rewrite
family of functions search a given tsquery
for occurrences of a target subquery, and replace each occurrence with a substitute subquery. In essence this operation is a tsquery
-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule. A collection of such rewrite rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., new york
, big apple
, nyc
, gotham
) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries (Thesaurus Dictionary). However, you can modify a set of rewrite rules on-the-fly without reindexing, whereas updating a thesaurus requires reindexing to be effective.
ts_rewrite(<query> tsquery, <target> tsquery, <substitute> tsquery) returns tsquery
ts_rewrite
simply applies a single rewrite rule:
<target>
is replaced by
<substitute>
wherever it appears in
<query>
. For example:
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
ts_rewrite
------------
'b' & 'c'
ts_rewrite(<query> tsquery, <select> text) returns tsquery
ts_rewrite
accepts a starting
<query>
and a SQL
<select>
command, which is given as a text string. The
<select>
must yield two columns of
tsquery
type. For each row of the
<select>
result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current
<query>
value. For example:
CREATE TABLE aliases (id int, t tsquery, s tsquery);
INSERT INTO aliases VALUES(1, 'a', 'c');
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
ts_rewrite
------------
'b' & 'c'
Note that when multiple rewrite rules are applied in this way, the order of application can be important; so in practice you will want the source query to ORDER BY
some ordering key.
Let's consider a real-life astronomical example. We'll expand query supernovae
using table-driven rewriting rules:
CREATE TABLE aliases (id int, t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(1, to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT t, s FROM aliases');
ts_rewrite
---------------------------------
'crab' & ( 'supernova' | 'sn' )
We can change the rewriting rules just by updating the table:
UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');
SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT t, s FROM aliases');
ts_rewrite
---------------------------------------------
'crab' & ( 'supernova' | 'sn' & !'nebula' )
Rewriting can be slow when there are many rewriting rules, since it checks every rule for a possible match. To filter out obvious non-candidate rules we can use the containment operators for the tsquery
type. In the example below, we select only those rules which might match the original query:
SELECT ts_rewrite('a & b'::tsquery,
'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
ts_rewrite
------------
'b' & 'c'
The function ts_stat
is useful for checking your configuration and for finding stop-word candidates.
ts_stat(<sqlquery> text, [ <weights> text, ]
OUT <word> text, OUT <ndoc> integer,
OUT <nentry> integer) returns setof record
<sqlquery>
is a text value containing an SQL query which must return a single tsvector
column. ts_stat
runs the query and returns statistics about each distinct lexeme (word) contained in the tsvector
data. The columns returned are
<word> text
— the value of a lexeme<ndoc> integer
— number of documents (tsvector
s) the word occurred in<nentry> integer
— total number of occurrences of the wordIf weights is supplied, only occurrences having one of those weights are counted.
For example, to find the ten most frequent words in a document collection:
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
The same, but counting only word occurrences with weight A
or B
:
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
Parent topic: Using Full Text Search