To retrieve data, you submit a query that performs a search based on criteria that you specify. Simple queries return straight-forward results. You can use the default query parser, or specify a different query parser at query time.

Creating a Simple Search Query

After a Solr index is committed, you can run queries with the gptext.search() function, which has this syntax:

gptext.search(<src_table>, <index_name>, <search_query>, <filter_queries>[, <options>])

The <search_query> argument is a text value that contains a Solr query. The <filter_queries> argument is an array of queries that restrict the set of documents to search.

The default Solr Standard Query Parser has a rich query syntax that includes wildcard characters, Boolean operators, proximity and range searches, and fuzzy searches. See The Standard Query Parser at the Solr website for examples.

Solr has additional query processors that you can specify in the <search_query> argument to access additional features. The VMware Greenplum Text Universal Query Parser, gptextqp, allows queries that mix features from all of the supported query parsers.

See Selecting a Query Parser for a list of the supported query parsers and how to request them in your queries. See Using the Universal Query Parser for examples using the VMware Greenplum Text Universal Query Parser.

The following sections show how to use the gptext.search() function, including example queries that demonstrate Solr search features.

An AND search example with top 5 results

This search finds documents in the wikipedia.articles index that contain both search terms "solar" and "battery". The 'rows=5' argument is a Solr option that specifies the top 5 results are to be returned from each segment. In a VMware Greenplum cluster with two segments, this query returns up to 10 rows.

=# SELECT a.id, a.date_time, a.title, q.score
 FROM wikipedia.articles a,
      gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 
        'solar AND battery', null, 'rows=5') q
 WHERE q.id::int8 = a.id
 ORDER BY score DESC;
    id    |       date_time        |        title        |   score   
----------+------------------------+---------------------+-----------
 13690575 | 2017-08-24 02:34:00-05 | Solar power         | 2.7128658
  2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid     | 2.5810153
  4711003 | 2017-08-10 18:56:00-05 | Osmotic power       | 2.2073007
    25784 | 2017-08-26 07:10:00-05 | Renewable energy    | 2.1295567
   213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
    27743 | 2017-08-20 15:56:00-05 | Solar energy        | 1.6916461
   608623 | 2017-08-27 03:56:00-05 | Ethanol fuel        | 1.4619896
(7 rows)

See Solr options for more about Solr options.

An OR search example with top 5 results

By using the OR keyword, this search matches more documents than the AND example. The total number of rows returned is limited by the rows=5 Solr option.

=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a, 
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
       'solar OR battery', null, 'rows=5') q
WHERE  q.id::int8 = a.id 
ORDER BY score DESC;
   id    |       date_time        |        title        |   score   
---------+------------------------+---------------------+-----------
 2008322 | 2017-08-05 02:09:00-05 | Vehicle-to-grid     | 2.5810153
   25784 | 2017-08-26 07:10:00-05 | Renewable energy    | 2.1295567
 2120798 | 2017-01-28 00:59:00-06 | Lithium economy     | 2.0416002
  213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 2.0210648
   27743 | 2017-08-20 15:56:00-05 | Solar energy        | 1.6916461
  608623 | 2017-08-27 03:56:00-05 | Ethanol fuel        | 1.4619896
  533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.0239072
 2988035 | 2017-03-12 06:39:00-05 | Vortex engine       | 0.9519546
  113728 | 2017-08-15 09:59:00-05 | Geothermal energy   | 0.6801035
   55017 | 2017-08-28 19:24:00-05 | Fusion power        | 0.6432224
(10 rows)

Search non-default fields

A VMware Greenplum Text index has a default search column, specified when the index is created with the gptext.create_index() function. If you have included additional columns to index, you can reference them in your queries. This query searches for documents with the word "solar" in the title column.

=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
       'title:solar', null, null) q
WHERE q.id::int8 = a.id
ORDER BY score DESC;
    id    |       date_time        |        title        |   score
----------+------------------------+---------------------+-----------
 13690575 | 2017-08-24 02:34:00-05 | Solar power         | 1.6547729
    27743 | 2017-08-20 15:56:00-05 | Solar energy        | 1.6547729
   533423 | 2017-08-28 00:52:00-05 | Solar water heating | 1.1132113
   213555 | 2017-08-27 12:48:00-05 | Solar updraft tower | 1.1132113
(4 rows)

This example finds documents where the title column matches "Solar power" or "Solar energy".

=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
       'title:(solar AND (power OR energy))', null, null) q
WHERE q.id::int8 = a.id;
    id    |       date_time        |    title     |   score   
----------+------------------------+--------------+-----------
    27743 | 2017-08-20 15:56:00-05 | Solar energy | 3.3095458
 13690575 | 2017-08-24 02:34:00-05 | Solar power  | 2.9718256
(2 rows)

This example searches for articles that have "photosynthesis" in the content column but that do not have "solar" in the title column.

=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
       'photosynthesis and -title:solar', null, null) q 
WHERE q.id::int8 = a.id
ORDER BY score DESC;
    id    |       date_time        |      title       |   score   
----------+------------------------+------------------+-----------
    25784 | 2017-08-26 07:10:00-05 | Renewable energy | 2.9720955
 53716476 | 2017-08-28 20:40:00-05 | Seaweed fuel     | 1.4240221
 14205946 | 2017-08-28 08:46:00-05 | Algae fuel       | 1.3022419
   608623 | 2017-08-27 03:56:00-05 | Ethanol fuel     | 0.7614042
(4 rows)

Filtering search results

A filter query applies filters to the results returned by the query. The <filter_queries> argument of the gptext.search() function is an array, so you can apply multiple filters to the search results.

The following example finds articles that have the word "nuclear" in the content column and then applies two filter queries to remove articles that have "solar" in the title column and articles that do not have "power" in the title column.

=# SELECT a.id, a.date_time, a.title, q.score
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
       'nuclear', '{-title:solar,title:power}', null) q
WHERE  q.id::int8 = a.id
ORDER BY score DESC;
    id    |       date_time        |      title       |   score    
----------+------------------------+------------------+------------
 14090587 | 2017-08-14 14:00:00-05 | Low-carbon power |  1.1897897
    55017 | 2017-08-28 19:24:00-05 | Fusion power     |  1.1753609
 13021878 | 2017-08-09 05:03:00-05 | Geothermal power | 0.99499804
(3 rows)

The following example searches the demo.twitter.message table for messages that contain the text "iphone" and either "hate" or "love" and filters for authors who specified English language in their twitter profile.

=# SELECT t.id, q.score, t.author_screen_name, t.message_text
FROM twitter.message t,
     gptext.search(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message',
       '(iphone AND (hate OR love))', '{author_lang:en}', 'rows=5') q
WHERE t.id = q.id::int4 
ORDER BY score DESC;
    id    |   score   | author_screen_name |
 message_text
----------+-----------+--------------------+-----------------------------------------------
-------------------------------------------------------------
 19424811 |  3.446217 | kennediiscool      | I hate

                                           : iPhones:
 20663075 | 2.9209785 | Hi_imMac           | RT @indigoFKNvanity: I hate the auto correct o
n iPhones !!!!!!!!!
 20042822 | 2.9209785 | renadrian          | @KDMC23 ohhhh!!! I hate Iphone Talk!
 20759274 | 2.5128412 | SteLala            | Dropped frutopia on

                                           : My phone... #ciaowaterdamage I hate iPhones.
 19416451 | 2.1448703 | ShayFknShay        | I'm in love with my new iPhone(:
 20350436 |  2.102924 | mahhnamestj        | I absolutely love how fast this phone works. L
ove the iPhone.
 19284329 | 1.9478481 | popolvuhplaya      | #nowplaying on my iPhone: Daft Punk - "Digital
 Love"
 19714120 | 1.9478481 | BipolarBearApp     | @ayee_Eddy2011 I love pancakes too!  #iPhone #
app
 20257190 | 1.6903389 | alasco             | Love my #iphone - only problem now?  I want an
 #Ipad!
 20473459 |  1.379696 | ArniBella          | i love my iphone4 but I'm excited to see what
the iphone5 has to offer #gadgets #iphone #apple #technology
(10 rows)

Creating Faceted Search Queries

Faceting breaks query results into multiple categories with a count of the number of documents in the index for each category. There are three VMware Greenplum Text faceted search functions:

  • gptext.faceted_field_search() – the categories are the values of one or more fields in VMware Greenplum Text index.
  • gptext.faceted_query_search() – the categories are a list of search queries.
  • gptext.faceted_range_search() – the categories are a list of ranges calculated from minimum value, maximum value, and the size of the range (gap).

Note: The examples in this section use the store.products table. See Setting Up the Demo Database for commands to create and load data into this table.

After the table is created and the data loaded, create the VMware Greenplum Text index, index the data, and then commit the index as shown in this example.

=# SELECT * FROM gptext.create_index('store', 'products', 
    '{id, title, category, brand, price}', 
    '{int, text_intl, string, string, float}', 
    'id', 'title');
    
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM store.products), 
       'demo.store.products');
 dbid | num_docs
------+----------
    2 |       25
    3 |       25
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.store.products');
 commit_index
--------------
 t
(1 row)

Faceting on Fields

With the gptext.faceted_field_search() function, the categories are values of one or more fields in the index. Here is the syntax for the gptext.faceted_field_search() function:

gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, 
    <facet_fields>, <facet_limit>, <minimum>[, <options>])

<index_name> is the name of the VMware Greenplum Text index with fields to facet.

<query> is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'.

<filter_queries> is an array of queries that filter documents from the set returned by the <query>, or null if none. Only documents that match all queries in the list are included in the counts.

<facet_fields> is an array of index fields to facet.

<facet_limit> is the maximum number of results to report for any one category. Use -1 to report all results.

<minimum> is the minimum number of results a category must have in order to be included in the results.

This example facets all documents in the demo.store.products index on the category field.

=# SELECT * FROM gptext.faceted_field_search(
    'demo.store.products', '*:*', null, '{category}', -1, 1);
 field_name | field_value  | value_count
------------+--------------+-------------
 category   | Pot          |          11
 category   | Desktops     |          10
 category   | Tablets      |           8
 category   | Monitors     |           7
 category   | Tent         |           6
 category   | Luggage      |           5
 category   | Sleeping Bag |           3
(7 rows)

This example facets all documents on two fields, category and brand. Only facets with a count of 2 or more are included in the results.

=# SELECT * FROM gptext.faceted_field_search(
    'demo.store.products', '*:*', null, '{category, brand}', -1, 2);
 field_name |  field_value   | value_count
------------+----------------+-------------
 brand      | ASUS           |           7
 brand      | Dell           |           5
 brand      | HP             |           4
 brand      | Samsung        |           4
 brand      | Apple          |           2
 brand      | Utopia Kitchen |           2
 brand      | Big Agnes      |           2
 brand      | Yaheetech      |           2
 brand      | Kelty          |           2
 brand      | Huawei         |           2
 category   | Pot            |          11
 category   | Desktops       |          10
 category   | Tablets        |           8
 category   | Monitors       |           7
 category   | Tent           |           6
 category   | Luggage        |           5
 category   | Sleeping Bag   |           3
(17 rows)

The next example uses a filter query to facet the brand field for just the 10 documents with category "Desktops".

=# SELECT * FROM gptext.faceted_field_search(
    'demo.store.products', '*:*', '{category:Desktops}', '{brand}', -1, 1);
 field_name | field_value | value_count
------------+-------------+-------------
 brand      | Dell        |           5
 brand      | ASUS        |           3
 brand      | HP          |           2
(3 rows)

Faceting on search queries

With the faceted_query_search() function, the categories are VMware Greenplum Text search queries. The counts are a report of the numbers of documents that match each search query. Here is the syntax for the faceted_field_search() function:

gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, <facet_queries>);

<index_name> is the name of the VMware Greenplum Text index with fields to facet.

<query> is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'.

<filter_queries> is an array of queries that filter documents from the set returned by the <query>, or null if none. Only documents that match all queries in the list are included in the counts.

<facet_queries> is an array of search queries. Each query in the array is a category in the results.

This example reports the number of documents that contain "windows", "intel", and both "windows" and "intel" in the default search column (title).

=# SELECT * FROM gptext.faceted_query_search(
    'demo.store.products', '*:*', null, 
    '{windows, intel, windows AND intel}');
    query_name     | value_count
-------------------+-------------
 intel             |           7
 windows           |           4
 windows AND intel |           2
(3 rows)

The facet queries in this example are Solr range queries that define four custom ranges over the price field.

=# SELECT * FROM gptext.faceted_query_search(
    'demo.store.products', '*:*', null,
    '{price:[* TO 200],price:[201 TO 250],price:[251 TO 300],price:[301 TO *]}');
     query_name     | value_count
--------------------+-------------
 price:[201 TO 250] |           2
 price:[251 TO 300] |           2
 price:[301 TO *]   |          11
 price:[* TO 200]   |          35
(4 rows)

Faceting on Ranges

The gptext.faceted_range_search() function facets a single field in the VMware Greenplum Text index into ranges specified with start, end, and gap values. The faceted field must be a numeric type.

gptext.faceted_range_search(<index_name>, <query>, <filter_queries>, 
    <field_name>, <range_start>, <range_end>, <range_gap>, <options>)

<index_name> is the name of the VMware Greenplum Text index with fields to facet.

<query> is a search query that selects the set of documents to be faceted. To facet all documents in the index specify '*:*'.

<filter_queries> is an array of queries that filter documents from the set returned by the <query>, or null if none. Only documents that match all queries in the list are included in the results.

<field_name> is the name of the field to facet. The field must have numeric content. The calculated ranges will have the same data type as the field.

<range_start> is the smallest value of the first range category.

<range_limit> is the highest value of the top range.

<range_gap> is the size of each range category.

<options> is an optional string containing Solr query options.

This range search example facets the price field into ranges between 0 and 1200 with a gap of 100. The range_value column in the results is a text value, so the ORDER BY clause casts the value to a float type.

=# SELECT * from gptext.faceted_range_search(
    'demo.store.products', '*:*', null, 'price', '0', '1200', '100')
    ORDER BY range_value::float;
 field_name | range_value | value_count
------------+-------------+-------------
 price      | 0.0         |          23
 price      | 100.0       |          12
 price      | 200.0       |           4
 price      | 300.0       |           6
 price      | 400.0       |           0
 price      | 500.0       |           1
 price      | 600.0       |           1
 price      | 700.0       |           1
 price      | 800.0       |           0
 price      | 900.0       |           1
 price      | 1000.0      |           0
 price      | 1100.0      |           1
(12 rows)

Highlighting Search Terms in Query Results

Highlighting inserts markup tags before and after each occurrence of the search terms in a query. For example, if the search term is "iphone", each occurrence of "iphone" in the field is marked up:

<em>iphone</em>

You can change the default markup strings from <em> and </em> by setting the gptext.hl_pre_tag and gptext.hl_post_tag server configuration options.

There are three ways to highlight search terms:

  • If the field is indexed, but not stored, you must join the search results with the database table and use the gptext.highlight() function to apply markup tags to the column data.

  • If the field is indexed and stored, Solr can apply the markup tags and return the marked-up field in the results of the search query. This is the same way highlighting works for VMware Greenplum Text external indexes. (See Highlighting External Index Search Results.) Using this method with regular VMware Greenplum Text indexes requires modifying the solrconfig.xml configuration file for the index.

  • With the gptext.highlight_instant_content() function, you can create an index without highlighting, and highlight only the search results.

Highlighting Terms with gptext.highlight()

To use gptext.highlight() to highlight terms, one of the following conditions must hold:

  • The field must be stored (stored=true).
  • The field must be configured with postings for term vectors (storeOffsetsWithPositions=true and termVectors=true). See Configuring Postings with Term Vectors for the instructions to edit the managed-schema file to store these additional fields.
  • The field must be configured with full term vectors (termVectors=true, termPositions=true, and termOffsets=true). This configuration is equivalent to invokinggptext.enable_terms() on the field. See gptext.enable_terms() in the VMware Greenplum Text Function Reference.

Refer to Schema Options and Performance Considerations in the Solr Reference Guide for more information on the trade-offs between these options.

Example: Full Term Vector Highlighting

This example enables terms for the message_text field in the demo.twitter.message index, reindexes the data, and commits the changes to the index:

=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
=# SELECT * FROM gptext.index(TABLE (SELECT * FROM twitter.message), 'demo.twitter.message');
=# SELECT * FROM gptext.commit_index('demo.twitter.message');

Warning: Highlighting adds overhead to the query, including index space, indexing time, and search time.

The gptext.highlight() syntax is:

gptext.highlight(<column_data>, <column_name>, <offsets>)

The <column_data> argument contains the text data that will be marked up with highlighting tags.

The <column_name> argument is the name of the corresponding table column.

The <offsets> argument is a VMware Greenplum Text hstore type that contains key-value pairs that specify the locations of the search term in the text data. This value is constructed by the gptext.search() function when highlighting is enabled. The key contains the column name and the value is a comma-separated list of offsets where the data appears.

To enable highlighting in a gptext.search() query, add the hl and hl.fl options in the <options> argument:

hl=true&hl.fl=<field1>,<field2>

Setting the hl=true option enables highlighting for the search. The hl.fl option specifies a list of the field names to highlight.

This example returns up to five rows from each segment with the text "iphone" highlighted in the message_text field.

=# SELECT t.id, gptext.highlight(t.message_text, 'message_text', s.hs)
FROM twitter.message t,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1),
      'demo.twitter.message', '{!gptextqp}iphone', null,
      'rows=5&hl=true&hl.fl=message_text' ) s
WHERE t.id = s.id::int8;
    id    |                                                      highlight

----------+----------------------------------------------------------------------------------
-----------------------------------
 20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #
<em>iphone</em> #apple #technology
 19424811 | I hate

          : <em>iPhones</em>:
 20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
 20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
 20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
 19714120 | @ayee_Eddy2011 I love pancakes too!  #<em>iPhone</em> #app
 19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
 19416451 | I'm in love with my new <em>iPhone</em>(:
 20257190 | Love my #<em>iphone</em> - only problem now?  I want an #Ipad!
 20759274 | Dropped frutopia on

          : My phone... #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)

Highlighting Terms in Stored Fields

If the field to be highlighted is stored in the index, Solr can return the field in the search results with markup tags applied. The gptext.highlight() function is not used and it is not necessary to enable terms for the field. This is the default behavior for VMware Greenplum Text external indexes, but for regular VMware Greenplum Text indexes you must enable it by editing the solrconfig.xml configuration file for the index.

  1. Use the gptext-config utility to open the solrconfig.xml configuration file for the index in the editor.

    $ gptext-config edit -i demo.twitter.message -f solrconfig.xml
    
  2. Search for <!-- Search Components --> and add the following element.

    <searchComponent class="solr.HighlightComponent" name="highlight" />
    
  3. Search for <requestHandler name="/select" class="solr.SearchHandler">. In the <arr name="components"> child element, change <str>termoffsets</str> to <str>highlight</str>. The complete <requestHandler> entry should be:

    <requestHandler name="/select" class="solr.SearchHandler">
      <!-- default values for query parameters can be specified, these
         will be overridden by parameters in the request
        -->
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
        <str name="df">message_text</str>
      </lst>
      <arr name="components">
        <str>query</str>
        <str>facet</str>
        <str>mlt</str>
        <str>highlight</str>
        <str>stats</str>
        <str>debug</str>
      </arr>
    </requestHandler>
    
  4. Save your changes.

  5. Update the field definitions in the managed-schema configuration file to store the fields that will be highlighted. See Storing Additional Fields in an Index for instructions. Be sure to reindex the data after changing storage options.

The following query searches the message_text field for messages containing the text "iphone" and highlights "iphone" in the text returned in the hs column.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 
      '{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
      id    |   score   |                                                                   hs
                                                    | rf
  ----------+-----------+---------------------------------------------------------------------------------------
  --------------------------------------------------+----
   19284329 | 0.8176138 | {"columnValue":[{"name":"message_text","value":"#nowplaying on my \u003cem\u003eiPhone
  \u003c/em\u003e: Daft Punk - \"Digital Love\""}]} |
   19416451 | 0.9003142 | {"columnValue":[{"name":"message_text","value":"I'm in love with my new \u003cem\u003e
  iPhone\u003c/em\u003e(:"}]}                       |
   19424811 | 1.0051261 | {"columnValue":[{"name":"message_text","value":"I hate\n\u003cem\u003eiPhones\u003c/em
  \u003e:"}]}                                       |
   20042822 | 0.8519347 | {"columnValue":[{"name":"message_text","value":"I hate \u003cem\u003eIphone\u003c/em\u
  003e Talk!"}]}                                    |
  (4 rows)

You can use the gptext.gptext_retrieve_field() function to extract the highlighted text from the columnValue array in the hs column. Compare the previous results to the results from this query.

=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'message_text') message_text 
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 
      '{!gptextqp}iphone', null, 'rows=5&hl=true&hl.fl=message_text');
    id    |   score    |                                                    message_text

----------+------------+--------------------------------------------------------------------------------------
-------------------------------
 19424811 |  1.0051261 | I hate

                       : <em>iPhones</em>:
 20042822 |  0.8519347 | I hate <em>Iphone</em> Talk!
 20350436 |  0.7387052 | Love the <em>iPhone</em>.
 20473459 | 0.59349346 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #gadgets #<em>
iphone</em> #apple #technology
 20663075 |  0.8519347 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
 19284329 |  0.8176138 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
 19416451 |  0.9003142 | I'm in love with my new <em>iPhone</em>(:
 19714120 |  0.8176138 | #<em>iPhone</em> #app
 20257190 |  0.7095236 | Love my #<em>iphone</em> - only problem now?  I want an #Ipad!
 20759274 |  0.7095236 | #ciaowaterdamage I hate <em>iPhones</em>.
(10 rows)

Instant Content Highlighting

Enabling text highlighting increases disk space usage. You may choose to create an index without highlighting, and only highlight search results. Greenplum Text provides more performant and disk-efficient text highlighting with the gptext.highlight_instant_content() function.

The gptext.highlight_instant_content() syntax is:

gptext.highlight_instant_content(<table_query>, <index_name>, <highlight_query>)

<table_query> is a table SELECT query that must return two columns: the first column is an identifier, the second column is the content to be highlighted.

<index_name> is the name of the index containing data to highlight.

<highlight_query> is a text value containing a Solr text search query.

The gptext.highlight_instant_content() function returns one or more records containing the following two columns:

  • id - The first column returned by the <table_query>. You can use this id to join the results with other tables.
  • highlighted_content - The result of applying the <highlight_query> on the original content.

Example

The following example illustrates an instant content highlight use case:

  1. Create a VMware Greenplum table with some sample data. For example, to create a table named hli_test:

    CREATE TABLE hli_test(id bigint primary key, content text, date date); 
    INSERT INTO hli_test VALUES(1, 'some words that include distribute', now()); 
    INSERT INTO hli_test VALUES(2, 'other words like distro', now()); 
    INSERT INTO hli_test VALUES(3, 'distributed instances', now()); 
    INSERT INTO hli_test VALUES(4, 'word instance', now());
    
  2. Create and commit an index:

    SELECT * FROM gptext.create_index('public', 'hli_test', 'id', 'content'); 
    SELECT * FROM gptext.index(table(SELECT * FROM hli_test), 'demo.public.hli_test');
    SELECT * FROM gptext.commit_index('demo.public.hli_test');
    
  3. This command searches for the term word and returns the results:

    SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.hli_test', 'word', null, 'rows=10');
    
  4. Join the search result with the original table to get the original text for gptext.highlight_instant_content():

    SELECT * FROM gptext.highlight_instant_content(
      TABLE(SELECT t.id, content FROM gptext.search(
        TABLE(SELECT 1 SCATTER BY 1), 'demo.public.hli_test', 'word', null, 'rows=10')
      s JOIN hli_test t ON s.id::bigint=t.id), 'demo.public.hli_test', 'word');
    
  5. Join the result of gptext.highlight_instant_content() with the original table to get the highlighted content:

    SELECT t.id, highlighted_content, date FROM 
      gptext.highlight_instant_content(
        TABLE(SELECT t.id, content FROM gptext.search(
          TABLE(SELECT 1 SCATTER BY 1), 'demo.public.hli_test', 'word', null, 'rows=10')
        s JOIN hli_test t ON s.id::bigint=t.id), 'demo.public.hli_test', 'word')
       h JOIN hli_test t ON h.id::bigint=t.id;
     id |                       highlighted_content                        |    date    
    ----+------------------------------------------------------------------+------------
      1 | some <em>words</em> that include distributed database management | 2021-11-08
      4 | <em>word</em> instance                                           | 2021-11-08
      2 | other <em>words</em> like distro                                 | 2021-11-08
    (3 rows)
    

Searching Partitioned Tables

A VMware Greenplum Text index for a partitioned VMware Greenplum table has a __partition field that contains the name of the child partition. When you query the index, you can use the __partition field to restrict the partitions to search.

Search all partitions in an index by calling gptext.search() with the root partition name:

=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
        '{!gptextqp} blackberry', null, null);
    id     |   score   | hs | rf
-----------+-----------+----+----
 71559892  |  5.670539 |    |
 127444971 | 5.1496587 |    |
 127024083 | 5.1496587 |    |
 65596365  | 4.4688635 |    |
 79177658  | 4.4688635 |    |
 78934938  | 4.4688635 |    |
 111566417 | 4.4688635 |    |
 65058966  | 3.5941496 |    |
 92240815  |  5.212467 |    |
 38424415  |  4.730712 |    |
 96811329  |  4.730712 |    |
 146782767 |  4.730712 |    |
 41409575  | 4.1019597 |    |
 104198393 | 4.1019597 |    |
 86943734  | 3.2956126 |    |
 89120464  | 3.2956126 |    |
 153181836 | 3.2956126 |    |
 139227011 | 3.2956126 |    |
 20664699  | 2.8236253 |    |
(19 rows)

You can search a single partition by calling gptext.search() with the child partition name. Use the gptext.partition_status(<index_name>) function to see the partition names. For example:

=# SELECT partition_name, level
FROM gptext.partition_status('demo.twitter.message');
        partition_name        | level
------------------------------+-------
 demo.twitter.message_1_prt_1 |     1
 demo.twitter.message_1_prt_2 |     1
 demo.twitter.message_1_prt_3 |     1
 demo.twitter.message_1_prt_4 |     1
(4 rows)

This example searches only the demo.twitter.message_1_prt_3 partition:

=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1),
      'demo.twitter.message_1_prt_3', '{!gptextqp} blackberry', null);
    id     |   score
-----------+-----------
 71559892  |  5.670539
 79177658  | 4.4688635
 78934938  | 4.4688635
 111566417 | 4.4688635
 92240815  |  5.212467
 96811329  |  4.730712
 104198393 | 4.1019597
 86943734  | 3.2956126
 89120464  | 3.2956126
(9 rows)

You can also specify a partition name or a range of partitions in the query filter argument of the gptext.search() function. This example searches the partitions between message_1_prt_2 and message_1_prt_4.

=# SELECT *
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 'android',
       '{__partition:[message_1_prt_2 TO message_1_prt_4]}');
    id     |   score
-----------+-----------
 42474603  |  5.770868
 95666225  |  5.670539
 68701747  | 4.4688635
 56900818  | 4.4688635
 111566417 | 4.4688635
 120764432 | 4.4688635
 115326522 | 4.4688635
 67269000  | 3.5941496
 99959486  |  6.413594
 104293903 | 3.1360807
(10 rows)

Retrieving Stored Field Content

A VMware Greenplum Text index does not, by default, store the contents of database columns in the index, with the exception of the unique id column. When you search the index, you must join the search results with the original database table on the id column in order to access other table columns.

You can configure a VMware Greenplum Text index to store content of fields when documents are indexed. The additional stored fields can be returned with the search results so that it is unnecessary to join with the database original table. For some applications, you can even delete data from the database table or drop the table after the data has been added to the index.

Retrieve the additional field values in a VMware Greenplum Text search by specifying a list of fields in the gptext.search() options argument. In this example, the demo.wikipedia.articles index has been configured to store the content, title, and refs fields, in addition to the id field. See Storing Field Content in an Index for instructions to edit the managed-schema file to store these additional fields. In the option argument, the Solr fl parameter requests that contents of the id and title fields be included in the results.

=# SELECT *
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
           '+grid +economy', null, 'fl=id,title&rows=2');
   id    |   score   | hs |                                                    rf

---------+-----------+----+---------------------------------------------------------------
-------------------------------------------
 533423  | 2.4593863 |    | column_value { name: "id" value: "533423" } column_value { nam
e: "title" value: "Solar water heating" }
 7906908 | 2.0646634 |    | column_value { name: "id" value: "7906908" } column_value { na
me: "title" value: "Biomass" }
 27743   |  1.823319 |    | column_value { name: "id" value: "27743" } column_value { name
: "title" value: "Solar energy" }
 113728  | 1.2235354 |    | column_value { name: "id" value: "113728" } column_value { nam
e: "title" value: "Geothermal energy" }
(4 rows)

To retrieve all fields stored in the index, use the * wildcard for the field list: 'fl=*'.

In the results, the requested fields are packed into an field named rf added to the results. The rf field is a text value containing a structure with the following format:

column_value { name: "<field1_name>" value: "<field1_value>"} 
  [column_value { name: "<field2_name>" value: "<field2_value>"}] ...

The VMware Greenplum Text function gptext.gptext_retrieve_field(rf,<column_name>) retrieves a single field value by name from this structure as a text value. VMware Greenplum Text provides variations to retrieve the field values as int or float values. If the specified field name does not exist in the rf structure, the function returns NULL.

This example shows how you can use the gptext.gptext_retrieve*() functions to unpack search results into separate result columns.

=# SELECT score,
    gptext.gptext_retrieve_field_int(rf, 'id') id,
    gptext.gptext_retrieve_field(rf, 'title') title,
    substring(gptext.gptext_retrieve_field(rf, 'content'),1,15) content
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
      '+grid +economy', null, 'fl=*');
   score   |    id    |        title        |     content
-----------+----------+---------------------+-----------------
 2.4593863 |   533423 | Solar water heating | '''Solar water
 2.0646634 |  7906908 | Biomass             | '''Biomass''' i
 2.0444229 | 13690575 | Solar power         | '''Solar power'
  1.823319 |    27743 | Solar energy        | '''Solar energy
 1.2235354 |   113728 | Geothermal energy   | '''Geothermal e
 1.0890164 | 14205946 | Algae fuel          | '''Algae fuel''
(6 rows)

Selecting a Query Parser

When you submit a query, Solr processes the query using a query parser. There are several Solr query parsers with different capabilities. For example, the ComplexPhraseQueryParser can parse wildcards, and the SurroundQueryParser supports span queries—finding words in the vicinity of a search term in a document.

VMware Greenplum Text supports these query parsers:

  • QParserPlugin, the default VMware Greenplum Text query parser. QParserPlugin is a superset of the LuceneQParserPlugin, Solr’s native Lucene query parser. QParserPlugin is a general purpose query parser with broad capabilities. QParserPlugin does not support span queries and handles operator precedence in an unintuitive manner. The support for field selection is also rather weak. See http://wiki.apache.org/solr/SolrQuerySyntax.
  • ComplexPhraseQueryParser supports wildcards, ORs, ranges, and fuzzies inside phrase queries. See https://issues.apache.org/jira/browse/SOLR-1604.
  • DisMax (or eDisMax) handles operator precedence in an intuitive manner and is well-suited for user queries since it is similar to popular search engines on the web. See Using the DisMax and Extended DisMax Query Processors.
  • SurroundQueryParser, supports the family of span queries. See Proximity Search Queries and Surround Query Parser in the Apache Solr Reference Guide.
  • gptextqp, the VMware Greenplum Text Unified Query Parser, can use all of the above query parsers in combination. See Using the Universal Query Parser for more information.

Note: The default query parser is specified in the requestHandler definitions in solrconfig.xml. You can edit solrconfig.xml with the management utility command gptext-config edit.

You can specify the query parser to use at query time by setting the Solr defType option in the options argument of the search function or by setting the type as a Solr LocalParam embedded in the query.

This query specifies the dismax query parser in the options argument of the gptext.search() function:

=# SELECT a.title, q.score
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1),
        'demo.wikipedia.articles', '+hydroelectric -solar', null, 
        'defType=dismax') q
WHERE a.id = q.id::int8;
         title          |   score
------------------------+-----------
 Forward osmosis        | 0.9552469
 Liquid nitrogen engine | 1.0126935
(2 rows)

The following query uses the ComplexPhraseQueryParser, setting the type parameter in a Solr LocalParam.

=# SELECT a.title, q.score
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 
       '{!type=complexphrase} sequester AND carbon', null, null) q
WHERE a.id = q.id::int8;
  title  |  score
---------+---------
 Biomass | 3.83572
(1 row)

In the LocalParam, the type= specifier can be omitted because type is the default parameter:

'{!complexphrase} sequester AND carbon'

Proximity Search Queries

Proximity search queries find documents that have search terms within a specified distance. The distance is measured as the number of term moves that would be needed to make the terms adjacent.

With the standard query parser, the terms to match are placed in quotes and the distance between them is specified by adding a tilde ~ and an integer after the closing quote. The following search query finds documents with the terms "solar" and "fossil" within five terms of each other.

=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
    gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
        '"solar fossil"~5', null, null) s
WHERE s.id::int8 = t.id;
    id    |   score    |      title
----------+------------+------------------
    25784 |  0.4855828 | Renewable energy
 14090587 | 0.30585092 | Low-carbon power
 13690575 | 0.62667537 | Solar power
(3 rows)

The search terms inside the quotes can appear in either order. However, if the terms occur in the opposite order in the document, the distance between them is one greater than if the terms occur in the specified order.

The Surround query parser allows ordered and unordered proximity searches. The W operator specifies an ordered search and the N operator specifies an unordered search. The maximum distance between the terms is specified by prefixing the W or N operator with an integer, for example 3W.

The proximity query can be written with prefix or infix notation.

Prefix notation: '{!surround} 3W(solar, fossil)'

Infix notation: '{!surround} solar 3W fossil'

Here are some proximity query examples using the Surround query parser.

`'{!surround} title:2w(solar, heat)'`
Searches the `title` field for the terms "solar" and "heat" within two terms, and in the specified order. This query uses prefix notation. The `N` and `W` operators are not case-sensitive.
`'{!surround} title:heat 2N solar'`
Searches the `title` field for the terms "heat" and "solar" within two terms, in any order. This query uses infix notation.
`'{!surround} title: W(solar, heat)'`
Searches the `title` field for adjacent terms "solar" and "heat". The default distance is 1, so `1W` can be abbreviated to `W`.

Note: The Surround query parser does not analyze query text like the other query parsers. VMware Greenplum Text indexes are by default built with lowercase and stemming filters, for example, so surround queries containing capital letters or unstemmed terms will return no results.

The wikipedia.articles index contains a document with the title "Solar water heating". The following example search, however, cannot find it.

=# SELECT t.id, s.score, t.title 
FROM wikipedia.articles t, 
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 
         '{!surround} title: 2w(Solar, heating)', null, null) s 
WHERE s.id::int8 = t.id;
 id | score | title
----+-------+-------
(0 rows)

When you rewrite the query to use only lowercase characters and remove the suffix from "heating", the document is found.

=# SELECT t.id, s.score, t.title
FROM wikipedia.articles t,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles',
         '{!surround} title: 2w(solar, heat)', null, null)
WHERE s.id::int8 = t.id;
   id   |   score   |        title
--------+-----------+---------------------
 533423 | 1.5434089 | Solar water heating
(1 row)

An easy way to avoid this limitation is to use the VMware Greenplum Text Universal Query Parser, which does analyze the query text and also supports the Surround query parser's proximity syntax.

Using the Universal Query Parser

With the VMware Greenplum Text Universal Query Parser, you can perform searches using features from any of the other supported query parsers, combined into one search string. Invoke the Universal Query Parser by setting the Solr type parameter in a Solr LocalParam with this format:

'{!gptextqp} <search_query>'

The search query in the following example includes syntax from three query parsers:

  • sea* – Complex query with wildcard
  • 2W – Proximity query requesting a maximum of two words distance between the terms "sea*" and "oil" or "fuel"
  • oil OR fuel – Solr Standard Query Processor
=# SELECT a.title, q.score
FROM wikipedia.articles a, 
    gptext.search(TABLE(SELECT 1 SCATTER BY 1),'demo.wikipedia.articles', 
      '{!gptextqp} sea* 2W (oil OR fuel)', null, null) q
WHERE a.id = q.id::int8;
    title     |   score
--------------+-----------
 Seaweed fuel | 55.250305
(1 row)

In the following example, title:n(power, geothermal) specifies that the terms "power" and "geothermal" in the title field must be adjacent, but they can occur in either order.

=# SELECT a.title, q.score 
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.wikipedia.articles', 
       '{!gptextqp} title:n(power, geothermal)', null, null) q
WHERE a.id = q.id::int8;
      title       |  score
------------------+---------
 Geothermal power | 2.05577
(1 row)

This query uses the fuzzy search operator ~ to find articles with titles containing a term similar to "lethiam" and a complex query that finds articles with "ocean" and "wind" in the content.

=# SELECT t.id, score, title
FROM wikipedia.articles t,
    gptext.search(TABLE(SELECT 1 SCATTER by 1), 'demo.wikipedia.articles', 
        '{!gptextqp} title:lethiam~ OR content:(ocean AND wind)', null, null) s 
WHERE t.id=s.id::int8;
    id    |   score    |       title
----------+------------+-------------------
  2120798 |  1.3326647 | Lithium economy
  4711003 |  2.6328268 | Osmotic power
    25784 |  3.3899183 | Renewable energy
    55017 | 0.95579207 | Fusion power
   113728 |  1.3909805 | Geothermal energy
    27743 |   2.114852 | Solar energy
 13690575 |  1.4488393 | Solar power
(7 rows)

Using the DisMax and Extended DisMax Query Parsers

The DisMax query parser supports a subset of the Solr Standard Query Parser syntax. It is useful for queries from end users who are familiar with common search systems, such as Google search. It supports quoted phrases, AND and OR operators, and + and - operators. The Extended DisMax query parser improves upon the DisMax query parser, supporting the full Standard query parser syntax.

The DisMax and Extended DisMax query parser behaviors can be customized at query time by setting parameters in the Solr options argument of the gptext.search() function or as local parameters in the query text. See DisMax Parameters and Extended DisMax Parameters for details. One useful parameter is the qf (query fields) parameter, which specifies a list of fields to search. Using this parameter avoids having to write a query that searches each field individually. For example, instead of writing this query:

'content:nuclear OR title:nuclear OR links:nuclear'

you can write:

{!edismax qf="content title links"} nuclear

The following example queries illustrate features of the DisMax and Extended DisMax query parsers.

`'{!dismax} +nuclear reactor'`
Finds documents containing the term "nuclear" and, optionally, the term "reactor".
`'{!dismax} +"nuclear reactor"'`
Finds documents containing the phrase "nuclear reactor".
`'{!dismax} +solar -reactor'`
Finds documents containing the term "solar" but not the term "reactor".
`'{!edismax qf="title refs"} solar'`
Finds documents with the term "solar" in the `title` or `refs`fields.
`'{!edismax qf="title"} (solar or renewable) and energy'`
Finds the documents with titles "Solar energy" and "Renewable energy".
check-circle-line exclamation-circle-line close-line
Scroll to top icon