To retrieve data, you submit a query that performs a search based on criteria that you specify. Simple queries return straight-forward results. You can use the default query parser, or specify a different query parser at query time.
After a Solr index is committed, you can run queries with the gptext.search()
function, which has this syntax:
gptext.search(<src_table>, <index_name>, <search_query>, <filter_queries>[, <options>])
The <search_query>
argument is a text value that contains a Solr query. The <filter_queries>
argument is an array of queries that restrict the set of documents to search.
The default Solr Standard Query Parser has a rich query syntax that includes wildcard characters, Boolean operators, proximity and range searches, and fuzzy searches. See The Standard Query Parser at the Solr website for examples.
Solr has additional query processors that you can specify in the <search_query>
argument to access additional features. The VMware Greenplum Text Universal Query Parser, gptextqp
, allows queries that mix features from all of the supported query parsers.
See Selecting a Query Parser for a list of the supported query parsers and how to request them in your queries. See Using the Universal Query Parser for examples using the VMware Greenplum Text Universal Query Parser.
The following sections show how to use the gptext.search()
function, including example queries that demonstrate Solr search features.
This search finds documents in the wikipedia.articles
index that contain both search terms "solar" and "battery". The 'rows=5'
argument is a Solr option that specifies the top 5 results are to be returned from each segment. In a VMware Greenplum cluster with two segments, this query returns up to 10 rows.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'solar AND battery', null, 'rows=5') q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+---------------------+-----------
13690575 | 2017-08-24 02:34:00-05 | Solar power | 2.7128658
2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid | 2.5810153
4711003 | 2017-08-10 18:56:00-05 | Osmotic power | 2.2073007
25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.1295567
213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
27743 | 2017-08-20 15:56:00-05 | Solar energy | 1.6916461
608623 | 2017-08-27 03:56:00-05 | Ethanol fuel | 1.4619896
(7 rows)
See Solr options for more about Solr options.
By using the OR keyword, this search matches more documents than the AND example. The total number of rows returned is limited by the rows=5
Solr option.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'solar OR battery', null, 'rows=5') q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
---------+------------------------+---------------------+-----------
2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid | 2.5810153
25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.1295567
2120798 | 2017-01-28 00:59:00-06 | Lithium economy | 2.0416002
213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
27743 | 2017-08-20 15:56:00-05 | Solar energy | 1.6916461
608623 | 2017-08-27 03:56:00-05 | Ethanol fuel | 1.4619896
533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.0239072
2988035 | 2017-03-12 06:39:00-05 | Vortex engine | 0.9519546
113728 | 2017-08-15 09:59:00-05 | Geothermal energy | 0.6801035
55017 | 2017-08-28 19:24:00-05 | Fusion power | 0.6432224
(10 rows)
A VMware Greenplum Text index has a default search column, specified when the index is created with the gptext.create_index()
function. If you have included additional columns to index, you can reference them in your queries. This query searches for documents with the word "solar" in the title
column.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'title:solar', null, null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+---------------------+-----------
13690575 | 2017-08-24 02:34:00-05 | Solar power | 1.6547729
27743 | 2017-08-20 15:56:00-05 | Solar energy | 1.6547729
533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.1132113
213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 1.1132113
(4 rows)
This example finds documents where the title
column matches "Solar power" or "Solar energy".
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'title:(solar AND (power OR energy))', null, null) q
WHERE q.id::int8 = a.id;
id | date_time | title | score
----------+------------------------+--------------+-----------
27743 | 2017-08-20 15:56:00-05 | Solar energy | 3.3095458
13690575 | 2017-08-24 02:34:00-05 | Solar power | 2.9718256
(2 rows)
This example searches for articles that have "photosynthesis" in the content
column but that do not have "solar" in the title
column.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'photosynthesis and -title:solar', null, null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+------------------+-----------
25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.9720955
53716476 | 2017-08-28 20:40:00-05 | Seaweed fuel | 1.4240221
14205946 | 2017-08-28 08:46:00-05 | Algae fuel | 1.3022419
608623 | 2017-08-27 03:56:00-05 | Ethanol fuel | 0.7614042
(4 rows)
A filter query applies filters to the results returned by the query. The <filter_queries>
argument of the gptext.search()
function is an array, so you can apply multiple filters to the search results.
The following example finds articles that have the word "nuclear" in the content
column and then applies two filter queries to remove articles that have "solar" in the title
column and articles that do not have "power" in the title
column.
=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'nuclear', '{-title:solar,title:power}', null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
id | date_time | title | score
----------+------------------------+------------------+------------
14090587 | 2017-08-14 14:00:00-05 | Low-carbon power | 1.1897897
55017 | 2017-08-28 19:24:00-05 | Fusion power | 1.1753609
13021878 | 2017-08-09 05:03:00-05 | Geothermal power | 0.99499804
(3 rows)
The following example searches the demo.twitter.message
table for messages that contain the text "iphone" and either "hate" or "love" and filters for authors who specified English language in their twitter profile.
=# SELECT t.id, q.score, t.author_screen_name, t.message_text
FROM twitter.message t,
gptext.search(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message',
'(iphone AND (hate OR love))', '{author_lang:en}', 'rows=5') q
WHERE t.id = q.id::int4
ORDER BY score DESC;
id | score | author_screen_name |
message_text
----------+-----------+--------------------+-----------------------------------------------
-------------------------------------------------------------
19424811 | 3.446217 | kennediiscool | I hate
: iPhones:
20663075 | 2.9209785 | Hi_imMac | RT @indigoFKNvanity: I hate the auto correct o
n iPhones !!!!!!!!!
20042822 | 2.9209785 | renadrian | @KDMC23 ohhhh!!! I hate Iphone Talk!
20759274 | 2.5128412 | SteLala | Dropped frutopia on
: My phone... #ciaowaterdamage I hate iPhones.
19416451 | 2.1448703 | ShayFknShay | I'm in love with my new iPhone(:
20350436 | 2.102924 | mahhnamestj | I absolutely love how fast this phone works. L
ove the iPhone.
19284329 | 1.9478481 | popolvuhplaya | #nowplaying on my iPhone: Daft Punk - "Digital
Love"
19714120 | 1.9478481 | BipolarBearApp | @ayee_Eddy2011 I love pancakes too! #iPhone #
app
20257190 | 1.6903389 | alasco | Love my #iphone - only problem now? I want an
#Ipad!
20473459 | 1.379696 | ArniBella | i love my iphone4 but I'm excited to see what
the iphone5 has to offer #gadgets #iphone #apple #technology
(10 rows)
Faceting breaks query results into multiple categories with a count of the number of documents in the index for each category. There are three VMware Greenplum Text faceted search functions:
gptext.faceted_field_search()
– the categories are the values of one or more fields in VMware Greenplum Text index.gptext.faceted_query_search()
– the categories are a list of search queries.gptext.faceted_range_search()
– the categories are a list of ranges calculated from minimum value, maximum value, and the size of the range (gap).Note: The examples in this section use the store.products
table. See Setting Up the Demo Database for commands to create and load data into this table.
After the table is created and the data loaded, create the VMware Greenplum Text index, index the data, and then commit the index as shown in this example.
=# SELECT * FROM gptext.create_index('store', 'products', '{id, title, category, brand, price}', '{int, text_intl, string, string, float}', 'id', 'title'); =# SELECT * FROM gptext.index(TABLE(SELECT * FROM store.products), 'demo.store.products'); dbid | num_docs ------+---------- 2 | 25 3 | 25 (2 rows) =# SELECT * FROM gptext.commit_index('demo.store.products'); commit_index -------------- t (1 row)
With the gptext.faceted_field_search()
function, the categories are values of one or more fields in the index. Here is the syntax for the gptext.faceted_field_search()
function:
gptext.faceted_field_search(<index_name>, <query>, <filter_queries>,
<facet_fields>, <facet_limit>, <minimum>[, <options>])
<index_name>
is the name of the VMware Greenplum Text index with fields to facet.
<query>
is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'
.
<filter_queries>
is an array of queries that filter documents from the set returned by the <query>
, or null
if none. Only documents that match all queries in the list are included in the counts.
<facet_fields>
is an array of index fields to facet.
<facet_limit>
is the maximum number of results to report for any one category. Use -1
to report all results.
<minimum>
is the minimum number of results a category must have in order to be included in the results.
This example facets all documents in the demo.store.products
index on the category field.
=# SELECT * FROM gptext.faceted_field_search(
'demo.store.products', '*:*', null, '{category}', -1, 1);
field_name | field_value | value_count
------------+--------------+-------------
category | Pot | 11
category | Desktops | 10
category | Tablets | 8
category | Monitors | 7
category | Tent | 6
category | Luggage | 5
category | Sleeping Bag | 3
(7 rows)
This example facets all documents on two fields, category
and brand
. Only facets with a count of 2 or more are included in the results.
=# SELECT * FROM gptext.faceted_field_search(
'demo.store.products', '*:*', null, '{category, brand}', -1, 2);
field_name | field_value | value_count
------------+----------------+-------------
brand | ASUS | 7
brand | Dell | 5
brand | HP | 4
brand | Samsung | 4
brand | Apple | 2
brand | Utopia Kitchen | 2
brand | Big Agnes | 2
brand | Yaheetech | 2
brand | Kelty | 2
brand | Huawei | 2
category | Pot | 11
category | Desktops | 10
category | Tablets | 8
category | Monitors | 7
category | Tent | 6
category | Luggage | 5
category | Sleeping Bag | 3
(17 rows)
The next example uses a filter query to facet the brand
field for just the 10 documents with category "Desktops".
=# SELECT * FROM gptext.faceted_field_search(
'demo.store.products', '*:*', '{category:Desktops}', '{brand}', -1, 1);
field_name | field_value | value_count
------------+-------------+-------------
brand | Dell | 5
brand | ASUS | 3
brand | HP | 2
(3 rows)
With the faceted_query_search()
function, the categories are VMware Greenplum Text search queries. The counts are a report of the numbers of documents that match each search query. Here is the syntax for the faceted_field_search()
function:
gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, <facet_queries>);
<index_name>
is the name of the VMware Greenplum Text index with fields to facet.
<query>
is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'
.
<filter_queries>
is an array of queries that filter documents from the set returned by the <query>
, or null
if none. Only documents that match all queries in the list are included in the counts.
<facet_queries>
is an array of search queries. Each query in the array is a category in the results.
This example reports the number of documents that contain "windows", "intel", and both "windows" and "intel" in the default search column (title
).
=# SELECT * FROM gptext.faceted_query_search(
'demo.store.products', '*:*', null,
'{windows, intel, windows AND intel}');
query_name | value_count
-------------------+-------------
intel | 7
windows | 4
windows AND intel | 2
(3 rows)
The facet queries in this example are Solr range queries that define four custom ranges over the price
field.
=# SELECT * FROM gptext.faceted_query_search(
'demo.store.products', '*:*', null,
'{price:[* TO 200],price:[201 TO 250],price:[251 TO 300],price:[301 TO *]}');
query_name | value_count
--------------------+-------------
price:[201 TO 250] | 2
price:[251 TO 300] | 2
price:[301 TO *] | 11
price:[* TO 200] | 35
(4 rows)
The gptext.faceted_range_search()
function facets a single field in the VMware Greenplum Text index into ranges specified with start, end, and gap values. The faceted field must be a numeric type.
gptext.faceted_range_search(<index_name>, <query>, <filter_queries>,
<field_name>, <range_start>, <range_end>, <range_gap>, <options>)
<index_name>
is the name of the VMware Greenplum Text index with fields to facet.
<query>
is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'
.
<filter_queries>
is an array of queries that filter documents from the set returned by the <query>
, or null
if none. Only documents that match all queries in the list are included in the results.
<field_name>
is the name of the field to facet. The field must have numeric content. The calculated ranges will have the same data type as the field.
<range_start>
is the smallest value of the first range category.
<range_limit>
is the highest value of the top range.
<range_gap>
is the size of each range category.
<options>
is an optional string containing Solr query options.
This range search example facets the price field into ranges between 0 and 1200 with a gap of 100. The range_value
column in the results is a text value, so the ORDER BY
clause casts the value to a float type.
=# SELECT * from gptext.faceted_range_search(
'demo.store.products', '*:*', null, 'price', '0', '1200', '100')
ORDER BY range_value::float;
field_name | range_value | value_count
------------+-------------+-------------
price | 0.0 | 23
price | 100.0 | 12
price | 200.0 | 4
price | 300.0 | 6
price | 400.0 | 0
price | 500.0 | 1
price | 600.0 | 1
price | 700.0 | 1
price | 800.0 | 0
price | 900.0 | 1
price | 1000.0 | 0
price | 1100.0 | 1
(12 rows)
Highlighting inserts markup tags before and after each occurrence of the search terms in a query. For example, if the search term is "iphone", each occurrence of "iphone" in the field is marked up:
<em>iphone</em>
You can change the default markup strings from <em>
and </em>
by setting the gptext.hl_pre_tag
and gptext.hl_post_tag
server configuration options.
There are three ways to highlight search terms:
If the field is indexed, but not stored, you must join the search results with the database table and use the gptext.highlight()
function to apply markup tags to the column data.
If the field is indexed and stored, Solr can apply the markup tags and return the marked-up field in the results of the search query. This is the same way highlighting works for VMware Greenplum Text external indexes. (See Highlighting External Index Search Results.) Using this method with regular VMware Greenplum Text indexes requires modifying the solrconfig.xml
configuration file for the index.
With the gptext.highlight_instant_content()
function, you can create an index without highlighting, and highlight only the search results.
To use gptext.highlight()
to highlight terms, one of the following conditions must hold:
stored=true
).storeOffsetsWithPositions=true
and termVectors=true
). See Configuring Postings with Term Vectors for the instructions to edit the managed-schema
file to store these additional fields.termVectors=true
, termPositions=true
, and termOffsets=true
). This configuration is equivalent to invokinggptext.enable_terms()
on the field. See gptext.enable_terms()
in the VMware Greenplum Text Function Reference.Refer to Schema Options and Performance Considerations in the Solr Reference Guide for more information on the trade-offs between these options.
This example enables terms for the message_text
field in the demo.twitter.message
index, reindexes the data, and commits the changes to the index:
=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
=# SELECT * FROM gptext.index(TABLE (SELECT * FROM twitter.message), 'demo.twitter.message');
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
Warning: Highlighting adds overhead to the query, including index space, indexing time, and search time.
The gptext.highlight()
syntax is:
gptext.highlight(<column_data>, <column_name>, <offsets>)
The <column_data>
argument contains the text data that will be marked up with highlighting tags.
The <column_name>
argument is the name of the corresponding table column.
The <offsets>
argument is a VMware Greenplum Text hstore
type that contains key-value pairs that specify the locations of the search term in the text data. This value is constructed by the gptext.search()
function when highlighting is enabled. The key contains the column name and the value is a comma-separated list of offsets where the data appears.
To enable highlighting in a gptext.search()
query, add the hl
and hl.fl
options in the <options>
argument:
hl=true&hl.fl=<field1>,<field2>
Setting the hl=true
option enables highlighting for the search. The hl.fl
option specifies a list of the field names to highlight.
This example returns up to five rows from each segment with the text "iphone" highlighted in the message_text
field.
=# SELECT t.id, gptext.highlight(t.message_text, 'message_text', s.hs)
FROM twitter.message t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.twitter.message', '{!gptextqp}iphone', null,
'rows=5&hl=true&hl.fl=message_text' ) s
WHERE t.id = s.id::int8;
id | highlight
----------+----------------------------------------------------------------------------------
-----------------------------------
20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #
<em>iphone</em> #apple #technology
19424811 | I hate
: <em>iPhones</em>:
20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
19714120 | @ayee_Eddy2011 I love pancakes too! #<em>iPhone</em> #app
19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
19416451 | I'm in love with my new <em>iPhone</em>(:
20257190 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
20759274 | Dropped frutopia on
: My phone... #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)
If the field to be highlighted is stored in the index, Solr can return the field in the search results with markup tags applied. The gptext.highlight()
function is not used and it is not necessary to enable terms for the field. This is the default behavior for VMware Greenplum Text external indexes, but for regular VMware Greenplum Text indexes you must enable it by editing the solrconfig.xml
configuration file for the index.
Use the gptext-config
utility to open the solrconfig.xml
configuration file for the index in the editor.
$ gptext-config edit -i demo.twitter.message -f solrconfig.xml
Search for <!-- Search Components -->
and add the following element.
<searchComponent class="solr.HighlightComponent" name="highlight" />
Search for <requestHandler name="/select" class="solr.SearchHandler">
. In the <arr name="components">
child element, change <str>termoffsets</str>
to <str>highlight</str>
. The complete <requestHandler>
entry should be:
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">message_text</str>
</lst>
<arr name="components">
<str>query</str>
<str>facet</str>
<str>mlt</str>
<str>highlight</str>
<str>stats</str>
<str>debug</str>
</arr>
</requestHandler>
Save your changes.
Update the field definitions in the managed-schema
configuration file to store the fields that will be highlighted. See Storing Additional Fields in an Index for instructions. Be sure to reindex the data after changing storage options.
The following query searches the message_text
field for messages containing the text "iphone" and highlights "iphone" in the text returned in the hs
column.
=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
id | score | hs
| rf
----------+-----------+---------------------------------------------------------------------------------------
--------------------------------------------------+----
19284329 | 0.8176138 | {"columnValue":[{"name":"message_text","value":"#nowplaying on my \u003cem\u003eiPhone
\u003c/em\u003e: Daft Punk - \"Digital Love\""}]} |
19416451 | 0.9003142 | {"columnValue":[{"name":"message_text","value":"I'm in love with my new \u003cem\u003e
iPhone\u003c/em\u003e(:"}]} |
19424811 | 1.0051261 | {"columnValue":[{"name":"message_text","value":"I hate\n\u003cem\u003eiPhones\u003c/em
\u003e:"}]} |
20042822 | 0.8519347 | {"columnValue":[{"name":"message_text","value":"I hate \u003cem\u003eIphone\u003c/em\u
003e Talk!"}]} |
(4 rows)
You can use the gptext.gptext_retrieve_field()
function to extract the highlighted text from the columnValue
array in the hs
column. Compare the previous results to the results from this query.
=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'message_text') message_text
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
id | score | message_text
----------+------------+--------------------------------------------------------------------------------------
-------------------------------
19424811 | 1.0051261 | I hate
: <em>iPhones</em>:
20042822 | 0.8519347 | I hate <em>Iphone</em> Talk!
20350436 | 0.7387052 | Love the <em>iPhone</em>.
20473459 | 0.59349346 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #<em>
iphone</em> #apple #technology
20663075 | 0.8519347 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
19284329 | 0.8176138 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
19416451 | 0.9003142 | I'm in love with my new <em>iPhone</em>(:
19714120 | 0.8176138 | #<em>iPhone</em> #app
20257190 | 0.7095236 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
20759274 | 0.7095236 | #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)
Enabling text highlighting increases disk space usage. You may choose to create an index without highlighting, and only highlight search results. Greenplum Text provides more performant and disk-efficient text highlighting with the gptext.highlight_instant_content()
function.
The gptext.highlight_instant_content()
syntax is:
gptext.highlight_instant_content(<table_query>, <index_name>, <highlight_query>)
<table_query>
is a table SELECT
query that must return two columns: the first column is an identifier, the second column is the content to be highlighted.
<index_name>
is the name of the index containing data to highlight.
<highlight_query>
is a text value containing a Solr text search query.
The gptext.highlight_instant_content()
function returns one or more records containing the following two columns:
id
- The first column returned by the <table_query>
. You can use this id
to join the results with other tables.highlighted_content
- The result of applying the <highlight_query>
on the original content.The following example illustrates an instant content highlight use case:
Create a VMware Greenplum table with some sample data. For example, to create a table named hli_test
:
CREATE TABLE hli_test(id bigint primary key, content text, date date);
INSERT INTO hli_test VALUES(1, 'some words that include distribute', now());
INSERT INTO hli_test VALUES(2, 'other words like distro', now());
INSERT INTO hli_test VALUES(3, 'distributed instances', now());
INSERT INTO hli_test VALUES(4, 'word instance', now());
Create and commit an index:
SELECT * FROM gptext.create_index('public', 'hli_test', 'id', 'content');
SELECT * FROM gptext.index(table(SELECT * FROM hli_test), 'demo.public.hli_test');
SELECT * FROM gptext.commit_index('demo.public.hli_test');
This command searches for the term word
and returns the results:
SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.hli_test', 'word', null, 'rows=10');
Join the search result with the original table to get the original text for gptext.highlight_instant_content()
:
SELECT * FROM gptext.highlight_instant_content(
TABLE(SELECT t.id, content FROM gptext.search(
TABLE(SELECT 1 SCATTER BY 1), 'demo.public.hli_test', 'word', null, 'rows=10')
s JOIN hli_test t ON s.id::bigint=t.id), 'demo.public.hli_test', 'word');
Join the result of gptext.highlight_instant_content()
with the original table to get the highlighted content:
SELECT t.id, highlighted_content, date FROM
gptext.highlight_instant_content(
TABLE(SELECT t.id, content FROM gptext.search(
TABLE(SELECT 1 SCATTER BY 1), 'demo.public.hli_test', 'word', null, 'rows=10')
s JOIN hli_test t ON s.id::bigint=t.id), 'demo.public.hli_test', 'word')
h JOIN hli_test t ON h.id::bigint=t.id;
id | highlighted_content | date
----+------------------------------------------------------------------+------------
1 | some <em>words</em> that include distributed database management | 2021-11-08
4 | <em>word</em> instance | 2021-11-08
2 | other <em>words</em> like distro | 2021-11-08
(3 rows)
A VMware Greenplum Text index for a partitioned VMware Greenplum table has a __partition
field that contains the name of the child partition. When you query the index, you can use the __partition
field to restrict the partitions to search.
Search all partitions in an index by calling gptext.search()
with the root partition name:
=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'{!gptextqp} blackberry', null, null);
id | score | hs | rf
-----------+-----------+----+----
71559892 | 5.670539 | |
127444971 | 5.1496587 | |
127024083 | 5.1496587 | |
65596365 | 4.4688635 | |
79177658 | 4.4688635 | |
78934938 | 4.4688635 | |
111566417 | 4.4688635 | |
65058966 | 3.5941496 | |
92240815 | 5.212467 | |
38424415 | 4.730712 | |
96811329 | 4.730712 | |
146782767 | 4.730712 | |
41409575 | 4.1019597 | |
104198393 | 4.1019597 | |
86943734 | 3.2956126 | |
89120464 | 3.2956126 | |
153181836 | 3.2956126 | |
139227011 | 3.2956126 | |
20664699 | 2.8236253 | |
(19 rows)
You can search a single partition by calling gptext.search()
with the child partition name. Use the gptext.partition_status(<index_name>)
function to see the partition names. For example:
=# SELECT partition_name, level
FROM gptext.partition_status('demo.twitter.message');
partition_name | level
------------------------------+-------
demo.twitter.message_1_prt_1 | 1
demo.twitter.message_1_prt_2 | 1
demo.twitter.message_1_prt_3 | 1
demo.twitter.message_1_prt_4 | 1
(4 rows)
This example searches only the demo.twitter.message_1_prt_3
partition:
=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.twitter.message_1_prt_3', '{!gptextqp} blackberry', null);
id | score
-----------+-----------
71559892 | 5.670539
79177658 | 4.4688635
78934938 | 4.4688635
111566417 | 4.4688635
92240815 | 5.212467
96811329 | 4.730712
104198393 | 4.1019597
86943734 | 3.2956126
89120464 | 3.2956126
(9 rows)
You can also specify a partition name or a range of partitions in the query filter argument of the gptext.search()
function. This example searches the partitions between message_1_prt_2
and message_1_prt_4
.
=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 'android',
'{__partition:[message_1_prt_2 TO message_1_prt_4]}');
id | score
-----------+-----------
42474603 | 5.770868
95666225 | 5.670539
68701747 | 4.4688635
56900818 | 4.4688635
111566417 | 4.4688635
120764432 | 4.4688635
115326522 | 4.4688635
67269000 | 3.5941496
99959486 | 6.413594
104293903 | 3.1360807
(10 rows)
A VMware Greenplum Text index does not, by default, store the contents of database columns in the index, with the exception of the unique id column. When you search the index, you must join the search results with the original database table on the id column in order to access other table columns.
You can configure a VMware Greenplum Text index to store content of fields when documents are indexed. The additional stored fields can be returned with the search results so that it is unnecessary to join with the database original table. For some applications, you can even delete data from the database table or drop the table after the data has been added to the index.
Retrieve the additional field values in a VMware Greenplum Text search by specifying a list of fields in the gptext.search()
options argument. In this example, the demo.wikipedia.articles
index has been configured to store the content
, title
, and refs
fields, in addition to the id
field. See Storing Field Content in an Index for instructions to edit the managed-schema
file to store these additional fields. In the option
argument, the Solr fl
parameter requests that contents of the id
and title
fields be included in the results.
=# SELECT *
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'+grid +economy', null, 'fl=id,title&rows=2');
id | score | hs | rf
---------+-----------+----+---------------------------------------------------------------
-------------------------------------------
533423 | 2.4593863 | | column_value { name: "id" value: "533423" } column_value { nam
e: "title" value: "Solar water heating" }
7906908 | 2.0646634 | | column_value { name: "id" value: "7906908" } column_value { na
me: "title" value: "Biomass" }
27743 | 1.823319 | | column_value { name: "id" value: "27743" } column_value { name
: "title" value: "Solar energy" }
113728 | 1.2235354 | | column_value { name: "id" value: "113728" } column_value { nam
e: "title" value: "Geothermal energy" }
(4 rows)
To retrieve all fields stored in the index, use the *
wildcard for the field list: 'fl=*'
.
In the results, the requested fields are packed into an field named rf
added to the results. The rf
field is a text value containing a structure with the following format:
column_value { name: "<field1_name>" value: "<field1_value>"}
[column_value { name: "<field2_name>" value: "<field2_value>"}] ...
The VMware Greenplum Text function gptext.gptext_retrieve_field(rf,<column_name>)
retrieves a single field value by name from this structure as a text value. VMware Greenplum Text provides variations to retrieve the field values as int
or float
values. If the specified field name does not exist in the rf
structure, the function returns NULL
.
This example shows how you can use the gptext.gptext_retrieve*()
functions to unpack search results into separate result columns.
=# SELECT score,
gptext.gptext_retrieve_field_int(rf, 'id') id,
gptext.gptext_retrieve_field(rf, 'title') title,
substring(gptext.gptext_retrieve_field(rf, 'content'),1,15) content
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'+grid +economy', null, 'fl=*');
score | id | title | content
-----------+----------+---------------------+-----------------
2.4593863 | 533423 | Solar water heating | '''Solar water
2.0646634 | 7906908 | Biomass | '''Biomass''' i
2.0444229 | 13690575 | Solar power | '''Solar power'
1.823319 | 27743 | Solar energy | '''Solar energy
1.2235354 | 113728 | Geothermal energy | '''Geothermal e
1.0890164 | 14205946 | Algae fuel | '''Algae fuel''
(6 rows)
When you submit a query, Solr processes the query using a query parser. There are several Solr query parsers with different capabilities. For example, the ComplexPhraseQueryParser
can parse wildcards, and the SurroundQueryParser
supports span queries—finding words in the vicinity of a search term in a document.
VMware Greenplum Text supports these query parsers:
QParserPlugin
, the default VMware Greenplum Text query parser. QParserPlugin
is a superset of the LuceneQParserPlugin
, Solr’s native Lucene query parser. QParserPlugin
is a general purpose query parser with broad capabilities. QParserPlugin
does not support span queries and handles operator precedence in an unintuitive manner. The support for field selection is also rather weak. See http://wiki.apache.org/solr/SolrQuerySyntax.ComplexPhraseQueryParser
supports wildcards, ORs, ranges, and fuzzies inside phrase queries. See https://issues.apache.org/jira/browse/SOLR-1604.DisMax
(or eDisMax
) handles operator precedence in an intuitive manner and is well-suited for user queries since it is similar to popular search engines on the web. See Using the DisMax and Extended DisMax Query Processors.SurroundQueryParser
, supports the family of span queries. See Proximity Search Queries and Surround Query Parser in the Apache Solr Reference Guide.gptextqp
, the VMware Greenplum Text Unified Query Parser, can use all of the above query parsers in combination. See Using the Universal Query Parser for more information.Note: The default query parser is specified in the requestHandler
definitions in solrconfig.xml
. You can edit solrconfig.xml
with the management utility command gptext-config edit
.
You can specify the query parser to use at query time by setting the Solr defType
option in the options
argument of the search function or by setting the type
as a Solr LocalParam embedded in the query.
This query specifies the dismax
query parser in the options
argument of the gptext.search()
function:
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.wikipedia.articles', '+hydroelectric -solar', null,
'defType=dismax') q
WHERE a.id = q.id::int8;
title | score
------------------------+-----------
Forward osmosis | 0.9552469
Liquid nitrogen engine | 1.0126935
(2 rows)
The following query uses the ComplexPhraseQueryParser, setting the type
parameter in a Solr LocalParam.
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!type=complexphrase} sequester AND carbon', null, null) q
WHERE a.id = q.id::int8;
title | score
---------+---------
Biomass | 3.83572
(1 row)
In the LocalParam, the type=
specifier can be omitted because type
is the default parameter:
'{!complexphrase} sequester AND carbon'
Proximity search queries find documents that have search terms within a specified distance. The distance is measured as the number of term moves that would be needed to make the terms adjacent.
With the standard query parser, the terms to match are placed in quotes and the distance between them is specified by adding a tilde ~
and an integer after the closing quote. The following search query finds documents with the terms "solar" and "fossil" within five terms of each other.
=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'"solar fossil"~5', null, null) s
WHERE s.id::int8 = t.id;
id | score | title
----------+------------+------------------
25784 | 0.4855828 | Renewable energy
14090587 | 0.30585092 | Low-carbon power
13690575 | 0.62667537 | Solar power
(3 rows)
The search terms inside the quotes can appear in either order. However, if the terms occur in the opposite order in the document, the distance between them is one greater than if the terms occur in the specified order.
The Surround query parser allows ordered and unordered proximity searches. The W
operator specifies an ordered search and the N
operator specifies an unordered search. The maximum distance between the terms is specified by prefixing the W
or N
operator with an integer, for example 3W
.
The proximity query can be written with prefix or infix notation.
Prefix notation: '{!surround} 3W(solar, fossil)'
Infix notation: '{!surround} solar 3W fossil'
Here are some proximity query examples using the Surround query parser.
Note: The Surround query parser does not analyze query text like the other query parsers. VMware Greenplum Text indexes are by default built with lowercase and stemming filters, for example, so surround queries containing capital letters or unstemmed terms will return no results.
The wikipedia.articles
index contains a document with the title "Solar water heating". The following example search, however, cannot find it.
=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!surround} title: 2w(Solar, heating)', null, null) s
WHERE s.id::int8 = t.id;
id | score | title
----+-------+-------
(0 rows)
When you rewrite the query to use only lowercase characters and remove the suffix from "heating", the document is found.
=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!surround} title: 2w(solar, heat)', null, null)
WHERE s.id::int8 = t.id;
id | score | title
--------+-----------+---------------------
533423 | 1.5434089 | Solar water heating
(1 row)
An easy way to avoid this limitation is to use the VMware Greenplum Text Universal Query Parser, which does analyze the query text and also supports the Surround query parser's proximity syntax.
With the VMware Greenplum Text Universal Query Parser, you can perform searches using features from any of the other supported query parsers, combined into one search string. Invoke the Universal Query Parser by setting the Solr type
parameter in a Solr LocalParam with this format:
'{!gptextqp} <search_query>'
The search query in the following example includes syntax from three query parsers:
sea*
– Complex query with wildcard2W
– Proximity query requesting a maximum of two words distance between the terms "sea*" and "oil" or "fuel"oil OR fuel
– Solr Standard Query Processor=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),'demo.wikipedia.articles',
'{!gptextqp} sea* 2W (oil OR fuel)', null, null) q
WHERE a.id = q.id::int8;
title | score
--------------+-----------
Seaweed fuel | 55.250305
(1 row)
In the following example, title:n(power, geothermal)
specifies that the terms "power" and "geothermal" in the title
field must be adjacent, but they can occur in either order.
=# SELECT a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
'{!gptextqp} title:n(power, geothermal)', null, null) q
WHERE a.id = q.id::int8;
title | score
------------------+---------
Geothermal power | 2.05577
(1 row)
This query uses the fuzzy search operator ~
to find articles with titles containing a term similar to "lethiam" and a complex query that finds articles with "ocean" and "wind" in the content.
=# SELECT t.id, score, title
FROM wikipedia.articles t,
gptext.search(TABLE(SELECT 1 SCATTER by 1), 'demo.wikipedia.articles',
'{!gptextqp} title:lethiam~ OR content:(ocean AND wind)', null, null) s
WHERE t.id=s.id::int8;
id | score | title
----------+------------+-------------------
2120798 | 1.3326647 | Lithium economy
4711003 | 2.6328268 | Osmotic power
25784 | 3.3899183 | Renewable energy
55017 | 0.95579207 | Fusion power
113728 | 1.3909805 | Geothermal energy
27743 | 2.114852 | Solar energy
13690575 | 1.4488393 | Solar power
(7 rows)
The DisMax query parser supports a subset of the Solr Standard Query Parser syntax. It is useful for queries from end users who are familiar with common search systems, such as Google search. It supports quoted phrases, AND and OR operators, and + and - operators. The Extended DisMax query parser improves upon the DisMax query parser, supporting the full Standard query parser syntax.
The DisMax and Extended DisMax query parser behaviors can be customized at query time by setting parameters in the Solr options argument of the gptext.search()
function or as local parameters in the query text. See DisMax Parameters and Extended DisMax Parameters for details. One useful parameter is the qf
(query fields) parameter, which specifies a list of fields to search. Using this parameter avoids having to write a query that searches each field individually. For example, instead of writing this query:
'content:nuclear OR title:nuclear OR links:nuclear'
you can write:
{!edismax qf="content title links"} nuclear
The following example queries illustrate features of the DisMax and Extended DisMax query parsers.