The following functions are available in VMware Greenplum Text.

Indexing Functions List

gptext.create_index() – creates an empty index.

gptext.create_index_external() - creates an index for external documents.

gptext.index() – populates an index.

gptext.index_external() - adds documents to an external index.

gptext.index_external_dir() - adds all documents in a directory in an external document source to a VMware Greenplum Text external index.

gptext.commit_index() – finalizes index operations.

gptext.extract_rich_doc() – shows the content extracted from an external document, without actually adding the content to the index.

gptext.recreate_error_table() – recreates the error table that records errors that occur while adding documents to an external index.

Authenticating with External Document Sources

gptext.external_login() – log in to an external document store that requires authentication.

gptext.external_logout() – log out of an external document store.

Modifying or Deleting an Index

gptext.add_field() – adds a field to an index.

gptext.delete() – deletes documents matching a search query.

gptext.drop_field() – deletes a field from an index.

gptext.drop_index() - deletes an index.

gptext.search() – searches an index.

gptext.search_count() – returns number of documents that match search.

gptext.search_external() - searches a VMware Greenplum Text external index.

gptext.gptext_retrieve_field() – extracts a single field from the rf search result column as text.

gptext.gptext_retrieve_field_int() – extracts a single field from the rf search result column and converts to an integer.

gptext.gptext_retrieve_field_float() – extracts a single field from the rf search result column and converts to a float.

gptext.highlight() – returns search result with search term highlighted.

gptext.highlight_external() – applys highlighting to search results from external indexes.

gptext.highlight_instant_content() – returns local search results with search term highlighted.

gptext.faceted_field_search() – search, faceted by fields.

gptext.faceted_query_search() – search, faceted by queries.

gptext.faceted_range_search() – search, faceted by defined ranges.

Working With Terms

gptext.enable_terms() – enables term vectors and positions to allow extracting terms and their positions from text fields.

gptext.ner_terms() - gets terms tagged with Named Entity Recognition (NER) filters from the term vector for a specified field.

gptext.terms() – gets the term vectors for the indexed documents in a Solr index for the specified field.

VMware Greenplum Text Index Monitoring

gptext.cluster_status() - shows status of indexes managed by the VMware Greenplum Text cluster.

gptext.index_size() - shows the number of documents indexed and total disk space used for VMware Greenplum Text indexes.

gptext.index_status() – shows status of replicas for an index or for all indexes.

gptext.partition_status() - lists partitioned indexes and child partitions.

gptext.index_summary()- shows configuration details, status, and metrics for all of the replicas for a specified index or for all VMware Greenplum Text indexes.

VMware Greenplum Text Index Configuration

gptext.analyzer() - shows the output from each class in the index or query analyzer chain for a given field type and user-supplied input text.

gptext.config_append() - appends the contents of a local file to a ZooKeeper configuration file for an index.

gptext.config_delete() - deletes an index configuration file from ZooKeeper.

gptext.config_get() - displays the contents of a ZooKeeper index configuration file.

gptext.config_list() - lists ZooKeeper configuration files and directories for an index.

gptext.config_upload() - uploads an index configuration file to ZooKeeper.

gptext.get_field_type() - displays the analyzer chain for a field type defined in the configuration for a specified index.

gptext.list_field_types() - lists the text field types defined in the configuration for a specified index.

gptext.reload_index() – reloads Solr configuration files.

VMware Greenplum Text Cluster Monitoring and Management

gptext.live_nodes() – lists active Solr nodes.

gptext.version() – returns version of VMware Greenplum Text installation.

gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.

High Availability

gptext.add_replica() – Adds a replica of an index shard.

gptext.delete_replica() – Deletes a replica of an index shard.

General Purpose Functions

gptext.count_t() – counts number of rows in a table.

Privileges

Your privileges to execute the VMware Greenplum Text functions depend on your VMware Greenplum privileges for the table from which the index is generated. For example, if you have SELECT privileges for a table in VMware Greenplum, you have SELECT privileges for an index generated from that table.

Executing index functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Security section of the VMware Greenplum Text User's Guide for information about setting privileges.

The Privileges required section for each of the VMware Greenplum Text functions specifies the privileges required to execute that function.

Usage

The gptext functions in this section must be executed as SQL queries in the form:

SELECT * FROM gptext.function();

The examples in this document use a VMware Greenplum named demo set up as follows:

  • A table named articles in the wikipedia schema.
  • A table named message in the twitter schema.

See Setting Up the Sample Database for details about these tables.

Indexing Functions

Indexing functions create, set up, populate, and finalize (commit) Solr indexes.

gptext.create_index()

Creates an empty Solr index.

Syntax

gptext.create_index(<schema_name>, <table_name>, <id_col_name>, 
  <def_search_col_name> [, <if_check_id_uniqueness>])

or

gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>, 
  <id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])

Parameters

<schema_name>

The name of the schema in the VMware Greenplum.

<table_name>

The name of the table in the VMware Greenplum. To include all subpartitions of a partitioned table in the index, specify the name of the root table.

<p_columns>

A text array containing the names of the table columns to index. If <p_columns> and <p_types> are omitted, all table columns are indexed.

The columns must be valid columns in the table. The columns identified by the <id_col_name> and <def_search_col_name> must be included in the array.

If the <p_columns> parameter is supplied, the <p_types> parameter must also be supplied. The sizes of the <p_columns> and <p_types> arrays must be the same.

<p_types>

A text array containing the Solr data types of the columns in the <p_columns> array.

Text types can be mapped to the name of an analyzer chain, for example <text_intl>, <text_sm>, or any type defined in the <managed_schema>. See Map VMware Greenplum Data Types to Solr Data Types for equivalent Solr data types for other Greenplum types.

The <p_types> parameter must be supplied if the <p_columns> parameter is supplied.

<p_types> also supports the gptext_json virtual field type, that accepts json formatted input, and enables a schemaless index. The type automatically creates new fields from the json input. It can only be applied to VMware Greenplum column types json and jsonb, and it has to be explicitly specified as the data type during index creation. gptext_json field type contains no data; every child field is mapped into a Solr field with the related data type and data. Users can index the whole json or jsonb column, or part of the json path, by explicitly specifing child fields. Any erros are logged in json_error_table.

For gptext_json example queries refer to Examples below.

<id_col_name>

The name of a column in <table_name> that is unique for each row. The column must be of type int4, int8, varchar, text, or uuid.

<def_search_col_name>

The name of the default column to search in <table_name>, if no other column is named in a query.

<if_check_id_uniqueness>

Optional. A Boolean value. The default is true. Set to false to index a table with a non-unique ID field.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

A VMware Greenplum Text index is a Solr collection.

The name of the index created has the format:

<database_name>.<schema_name>.<table_name>

The contents of the <id_col_name> column should, in most cases, be a unique ID for each row. It must be of type int4, int8, varchar, or text.

The index is broken up into sections called shards. By default, VMware Greenplum Text indexes have one shard per VMware Greenplum segment. You can create an index with fewer shards by setting the VMware Greenplum Text configuration parameter gptext.idx_num_shards to the number of shards before you create the index. See Specifying the Number of Index Shards for more information.

A shard can contain a maximum of 2147483519 documents, so the maximum number of documents a VMware Greenplum Text index can contain is 2147483519 times the number of shards.

If the <if_check_id_uniqueness> argument is true, the default, a document with an ID matching an existing ID cannot be added to the index.

If the <if_check_id_uniqueness> argument is false, documents with duplicate IDs are allowed to be added to the index. The content of other fields may or may not be the same as existing documents with the same ID. When a query returns multiple documents with the same ID, it is the user's responsibility to anticipate and handle the multiple documents. For example, a table could have a revision column that is incremented when a new version of a document is added to the index, allowing queries that omit all but the most recent version from search results.

The gptext.terms() function is not supported with indexes created with the <if_check_id_uniqueness> argument set to false because term vectors require unique document IDs.

VMware Greenplum Text does not support indexes with fewer shards than the number of VMware Greenplum segments with the <if_check_id_uniqueness> argument set to false.

If you specify the name of the root partition for a partitioned table, the VMware Greenplum Text index created for the table will contain records for all partitions. The index records for documents added to the index have a __partition field containing the name of the child partition table. See Searching Partitioned Tables for syntax to search by partitions.

Populate the new index with gptext.index().

The number of replicas for each shard is determined when the index is created. It is the value of the gptext.replication_factor server configuration parameter, 2 by default.

If the gptext.failover_factor server configuration parameter is set, gptext.create_index() fails if the ratio of the number of VMware Greenplum Text nodes that are up to the total number of VMware Greenplum Text nodes is less than the gptext.failover_factor value (from 0.0 to 1.0). Index shards can only be created on active VMware Greenplum Text nodes, so the gptext.failover_factor parameter prevents overloading the active VMware Greenplum Text nodes when too many nodes are down.

Examples

  1. Create an index, demo.wikipedia.articles, with content as the default search field.

    =# SELECT * FROM gptext.create_index('wikipedia', 'articles', 'id', 'content'); 
    
  2. Create an index, demo.wikipedia.articles, with content as the default search field. Index the id, content, and title fields.

    =# SELECT * FROM gptext.create_index('wikipedia', 'articles', 
          '{"id", "content", "title"}', '{"long", "text", "text"}', 
          'id', 'content');
    
  3. Create an index student.list and populate the json fields exams.subject, exams.test, and exams.marks accordingly:

    SELECT * FROM gptext.create_index('student', 'list', '{id, description, stu_exams}',
         '{long, text_intl, gptext_json}', 'id', 'description', true);
    {
      "first": "John",
      "last": "Doe",
      "grade": 8,
      "exams": [
        {
          "subject": "Maths",
          "test"   : "term1",
          "marks"  : 90},
        {
          "subject": "Biology",
          "test"   : "term1",
          "marks"  : 86}
      ]
    }
    

    The command will create the fields:

    first
    last
    grade
    exams
    exams.subject
    exams.test
    exams.marks
    

    To search the content, specify the field name:

    SELECT a.id, a.description
       FROM student.list a,
         gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.student.list', 
            exams.subject:'Biology', null, 'rows=5') q
    WHERE q.id::int8 = a.id;
    

gptext.create_index_external()

Creates an empty Solr index for external documents.

Syntax

gptext.create_index_external(<index_name>)

Parameters

<index_name>

The name of the index to create. The name cannot contain periods (.).

Notes

A VMware Greenplum Text external index is a Solr index for documents external to VMware Greenplum, for example, PDF, Microsoft Word, XML, and HTML files. Unlike regular VMware Greenplum Text indexes, external indexes are not associated with a VMware Greenplum table, but they can be searched with VMware Greenplum Text search functions.

Like regular VMware Greenplum Text indexes, an external index by default has one shard per VMware Greenplum segment. To create an external index with fewer shards, set the gptext.idx_num_shards configuration parameter to the desired number of shards before you create the index. See Specifying the Number of Index Shards for more information.

Example

=# SELECT * FROM gptext.create_index_external('gptext-docs');

gptext.index()

Populates an index by indexing data in a table.

Syntax

gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>) 

Parameters

TABLE(<select-statement>)

The document content to be indexed, with data type anytable. <index_name>

Name of the index that was created with gptext.create_index() and is to be populated.

Return type

SETOF dbid INT, num_docs BIGINT

where dbid is the dbid of the segment that the documents were sent to, and num_docs is the number of documents that were indexed.

Privileges required

You must have the INSERT or UPDATE privilege to execute this function.

Remarks

<index_name> must have been created with gptext.create_index().

The first argument to gptext.index() is a table expression specifying rows to add to the index. To write this argument, wrap a SELECT statement in a TABLE function to select rows to pass to the gptext.index() function. For example, TABLE(SELECT * FROM articles) is a table expression that indexes all columns and rows of the articles table. The gptext.index() function adds each row of the SELECT result set as a document in the VMware Greenplum Text index.

You can selectively index table columns by changing the SELECT list in the query and you can filter rows to index with a WHERE clause.

The SELECT statement does not have to query the base table that was specified when the VMware Greenplum Text index was created, but the results of the query must have the same columns and distribution policy. For example, you can index rows from a temporary table with the same definition as the base table. Or you could index the results from a join query that produces the same columns as the base table. If the results have a different distribution key, or no distribution key, as is the case with a join query, you must specify a compatible distribution policy by adding a SCATTER BY <column> to the SELECT STATEMENT. You can specify the column by name or by column number.

The output of the gptext.index() function includes a two-column table with dbid (the Greenplum segment ID) and num_docs (the number of documents added to the index shard for that segment) as the columns.

After adding documents to the index, you must commit the index with gptext.commit_index(<index_name>).

Example

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
 dbid | num_docs 
------+----------
    3 |        6
    2 |        5
(2 rows)

gptext.index_external()

Adds documents stored outside of VMware Greenplum to a VMware Greenplum Text external index.

Syntax

gptext.index_external(<url-list>, <index-name>)

Parameters

<url-list>

A list of URLs for documents to add to the VMware Greenplum Text external index. The URLs may be expressed as an array or as a table-valued expression.

<index-name>

The name of the index to which the documents are to be added.

Remarks

If the documents to add to the VMware Greenplum Text external index are in a store that requires authentication, use the gptext.external_login() function to log in to the document store before you execute gptext.index_external().

If the document cannot be retrieved at the given URL, or an error occurs while indexing the document, VMware Greenplum Text inserts a row in the gptext.error_table table. You can use gptext.recreate_error_table() to create an empty error table before you call gptext.index().

The value of the VMware Greenplum Text custom server parameter gptext.idx_segment_error_limit (default 10) is the number of errors that can occur on any one segment before the indexing operation is canceled.

When adding a document to an external index, VMware Greenplum Text calculates a 256-bit hash on the contents of the document. The hash is stored as a 64-byte hexadecimal value in the sha256 field. If you later add a document with a URL matching an existing document in the index, the new document is only added to the index if the newly calculated hash differs from the current value in the sha256 field.

Examples

This example adds a single PDF document to the index gptext-docs.

=# SELECT * FROM gptext.index_external(
      '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}',
      'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

This example adds multiple HTML documents to the gptext-docs external index by selecting URLs from a database table. Errors will be logged in the gptext.gptext_errrors table.

=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
      id bigint,
      url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES 
    (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'), 
    (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'), 
    (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
    (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
    (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
    (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
    (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
    (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'), 
    (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
    (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
    (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
    (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
    (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
    (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
    (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
      TABLE(SELECT url FROM gptext_html_docs), 
      'gptext-docs');
dbid | num_docs
------+----------
   3 |        6
   2 |        8
(2 rows)

=# SELECT * FROM gptext.error_table;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------
error_time | 2000-01-01 00:25:11.282769
index_name | gptext-docs
sqlcmd     |
errmsg     | Code: RUNTIME_ERROR, Message: 'http://gptext.docs.pivotal.io/210/topics/ext-indexes.html. '
rawdata    | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
rawbytes   |

=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)

This example indexes a single HTML file in an Amazon S3 bucket named "gptext".

=# SELECT gptext.external_login('s3', 's3://s3-us-west-2.amazonaws.com', 'mys3_auth');
external_login
----------------
t
(1 row)
=# SELECT * FROM gptext.index_external('{s3://gptext/topics/queries.html}', 'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    5 |        0
    4 |        0
    2 |        1
(4 rows)
=# SELECT gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)
=# SELECT gptext.external_logout('s3');
external_logout
-----------------
t
(1 row)

gptext.index_external_dir()

Adds all documents in a directory in an external document source to a VMware Greenplum Text external index.

Syntax

gptext.index_external_dir(<directory_url>, <index_name>)

Parameters

<directory_url>

The URL for the directory with documents to add to the index.

<index_name>

The name of the VMware Greenplum Text external index to which the documents are to be added.

Remarks

The gptext.index_external_dir() function adds the documents in a directory and its subdirectories to a VMware Greenplum Text external index.

Log in to the document source using the gptext.external_login() function before you call the gptext.index_external_dir() function.

If you specify a file instead of a directory, an error is added to the gptext.error_table table.

The ID for each file added to the index is the URL for the file in the external document source.

The Apache Tika library discovers the content_type for each file.

The user who logs in to the external document source must have read permissions on the directory. The gptext.index_external_dir() function adds to the index only those documents and subdirectories that the user has permission to read.

Example

This example adds documents from a directory in an hdfs store to the VMware Greenplum Text webdocs external index.

#= SELECT * FROM gptext.external_login('hdfs', 'hdfs://myhadoop:9000', 'myhadoop');
 external_login
----------------
 t
(1 row)

=# SELECT * FROM gptext.index_external_dir('hdfs://myhadoop:9000/gptext_web/public/300/', 'webdocs');
 num_docs
----------
       37
(1 row)

=# SELECT * FROM gptext.commit_index('webdocs');
 commit_index
--------------
 t
(1 row)

=# SELECT * FROM gptext.external_logout('hdfs');
 external_logout
-----------------
 t
(1 row)

gptext.commit_index()

Finishes an index operation. The results of an indexing operation are not available until this function is called for the index.

Syntax

gptext.commit_index(<index_name>)

Parameters

<index_name>

The name of the index to commit. If the table is partitioned this must be the name of the root table.

Return type

boolean

Privileges required

You must have the INSERT, UPDATE, or DELETE privilege to execute this function.

Remarks

Must be called after gptext.index() and gptext.delete().

Example

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index 
--------------
 t
(1 row)

gptext.extract_rich_doc()

Retrieves metadata and content from an external document but does not add the content to the VMware Greenplum Text external index.

Syntax

gptext.extract_rich_doc(<index-name>, <document-url>)

Parameters

<index-name> The name of the external index to test. Metadata and content are extracted from the document using the current Solr configuration files for this index.

<document-url> The URL for the document to extract.

Return type

Column Type
stream_name text
title text
author text
keywords text
created text
modified text
content text

Remarks

The gptext.extract_rich_doc() function can be used while testing or troubleshooting to verify the metadata and content extracted from external documents. The external document is not added to the VMware Greenplum Text external index. For example, you can use this function to test that changes you make to the solrconfig.xml configuration file for the index have the desired effect.

If the document is in a document store that requires authentication, you must first log in using gptext.external_login().

The stream_name column is the same as the <document-url> function parameter. If the document is added to the index, this will be the unique document ID.

The title, author, keywords, created, and modified columns contain metadata dependent on the document type Apache Tika detects.

The content column contains the extracted document content.

Examples

  1. View the metadata and content that would be extracted from an HTML document.

    =# SELECT * FROM gptext.extract_rich_doc('gptext-docs', 'http://gptext.docs.pivotal.io/latest/topics/intro.html');
    
  2. View the text extracted from an image file when the Tesseract optical character recognition (OCR) engine has been installed on the VMware Greenplum cluster.

    =# SELECT * FROM gptext.extract_rich_doc('gptext-docs', 'http://gptext.docs.pivotal.io/latest/graphics/ocrtest.png');
    

    See Indexing Text Embedded in Images for information about setting up OCR for external image files and documents with embedded images.


gptext.recreate_error_table()

Drops and recreates the gptext.error_table database table.

Syntax

gptext.recreate_error_table()

Return type

Boolean

Remarks

If an error occurs while accessing a document to add to a VMware Greenplum Text external index, VMware Greenplum Text adds a record to the gptext.error_table table. See gptext.error_table for a description of this table. Users should not drop or modify the table.

Rows added to the gptext.error_table table remain until you use the gptext.recreate_error_table() function to create a new empty table.

If you attempt to execute gptext.recreate_error_table() while it is in use for an indexing operation, a warning is raised and the function returns false without recreating the table.

Examples

=# SELECT gptext.recreate_error_table();
recreate_error_table
----------------------
t
(1 row)

Authenticating with External Document Sources

To add documents from a document source that requires authentication, such as Hadoop, an ftp server, or Amazon S3 or other S3-compatible storage, log in before adding documents to a VMware Greenplum Text index and log out when done.


gptext.external_login()

Logs in to an external document source before adding documents from the source to a VMware Greenplum Text external index.

Syntax

gptext.external_login(<type>, <url>, <config-name>)

Parameters

<type>

Identifies the type of the external document source. The valid types are 'ftp', 'hdfs', and 's3'. The type is not case-sensitive.

<url> The URL of the external document source.

<config-name> The name of the configuration uploaded with the gptext-external upload utility command.

Remarks

You can log in to only one external document source at a time.

Use the gptext-external list command to list the configurations that have been uploaded.

Examples

Log in to an hdfs file system using the myhdfs configuration.

=# SELECT * FROM gptext.external_login('HDFS', 'hdfs://198.51.100.23:19000', 'myhdfs');

Log in to a ftp server using the myftp configuration.

=# SELECT gptext.external_login('ftp', 'ftp://198.51.100.23', 'myftp');

The S3 connection <url> has the format <protocol>://<s3-endpoint>[/<region>][/].

  • Amazon S3 — The <protocol> is s3. If the <s3-endpoint> begins with s3- or s3. and is followed by a region code, for example s3-us-east-1.amazonaws.com or s3.ap-southeast-1.amazonaws.com, the /<region> part of the URL is not required. An endpoint like s3.dualstack.us-east-1.amazonaws.com, however, must include the region code at the end: s3://s3.dualstack.us-east-1.amazonaws.com/us-east-1. See Amazon AWS Regions and Endpoints for a list of valid S3 endpoints by region.

    Example: Log in to Amazon S3 using the mys3 configuration.

    =# SELECT gptext.external_login('s3', 's3://s3.us-east-2.amazonaws.com', 'mys3');
    
  • Other S3-compatible storage — <protocol> identifies the login protocol. For example, when you access MinIO, the <protocol> may be http or https. The <s3-endpoint> is an endpoint that identifies the server. The <region>/ part of the URL is optional, and depends on the configuration of the S3 storage server.

    Example: Log in to MinIO using the myminio configuration.

    =# SELECT gptext.external_login('s3', 'http://my.minio.server:9000', 'myminio');
    

gptext.external_logout()

Logs out of an external document source after adding documents from the source to a VMware Greenplum Text external index.

Syntax

gptext.external_logout(<type>)

Parameters

<type>

Identifies the type of the external document source. The supported types are 'ftp', 'hdfs', and 's3'. The type is not case-sensitive.

Remarks

You can log in to only one external document source at a time. To index documents from another source, you must first call gptext.external_logout() and then log in to the new source with gptext.external_login().

Example

Log out of an hdfs file system.

=# SELECT * FROM gptext.external_logout('HDFS');

Log out of an ftp server.

=# SELECT gptext.external_logout('ftp');

Log out of S3.

=# SELECT gptext.external_logout('s3');

Modifying or Deleting an Index

You can change an index by adding or dropping fields, reverting an index to its previous state, or deleting the index.


gptext.add_field()

Adds a field to your schema if the field was added to the database after the index was created.

Syntax

gptext.add_field(<index_name>, <field_name>, <solr_type>[, <is_default_search_col> [, <if_enable_terms>]])

Parameters

<index_name>

The name of the index to which you want to add the field. If the table is partitioned this must be the name of the root table.

<field_name>

The name of the field to be indexed.

<solr_type>

The type of the field that is added to the index.

<is_default_search_col>

Optional. Boolean value. Is this to become the default search column (field)?

<if_enable_terms>

Optional. Boolean value. Enable terms support on this field when added to the VMware Greenplum Text index.

Return type

SETOF boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Call this function for each field you add.

Before and after you add one or more fields, reload the Solr configuration files with gptext.reload_index(). The initial reload_index() call is required because of Solr 4.0 behavior and may not be required in subsequent versions.

After you add one or more fields, you should reindex the table and commit the index with gptext.commit_index().

Example

  1. Adds the field external_links to the index, then recreates, repopulates, and commits the index.

    =# ALTER TABLE wikipedia.articles ADD external_links text;
     ALTER TABLE
    =# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
     reload_index
    --------------
     t
    (1 row)
    =# SELECT * FROM gptext.add_field('demo.wikipedia.articles', 'external_links', false, false);
    INFO:  Add field: external_links for index: demo.wikipedia.articles
     add_field 
    -----------
     t
    (1 row)
    
    =# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
     reload_index 
    --------------
     t
    (1 row)
    
    =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
     commit_index 
    --------------
     t
    (1 row)
    

gptext.delete()

Deletes all documents that match the search query.

Syntax

gptext.delete(<index_name>, <query>)

Parameters

<index_name>

The name of the index. <query>

Documents matching this query will be deleted. To delete all documents use the query'*'or '*:*'.

Return type

boolean

Privileges required

You must have the DELETE privilege to execute this function.

Remarks

After a successful delete, commit the index using gptext.commit_index(<index_name>).

Examples

  • Delete all documents containing the word "unverified" in the default search field:

    =# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'unverified');
     delete 
    --------
     t
    
    (1 row)
    =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
     commit_index 
    --------------
     t
    (1 row)
    
  • Delete all documents from the index:

    =# SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');
     delete 
    --------
     t
    (1 row)
    
    =# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
     commit_index 
    --------------
     t
    (1 row)
    

gptext.drop_field()

Removes a field from your schema.

Syntax

gptext.drop_field(<index_name>, <field_name>)

Parameters

<index_name>

The name of the index from which to drop the field. If the table is partitioned this must be the name of the root table.

<field_name>

The name of the field to drop.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Call this function for each field you drop.

Before and after dropping one or more fields, you must reload the Solr configuration files with gptext.reload_index(), then commit the index with gptext.commit_index().

The column __partition in indexes for partitioned database tables cannot be dropped.

The initial reload_index() is required by Solr 4.0 behavior and may not be necessary in subsequent versions.

Example

Drops the field external_links from the index.

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

=# SELECT * FROM gptext.drop_field('demo.wikipedia.articles', 'external_links');
INFO:  Drop field: external_links for index: demo.wikipedia.articles
 drop_field 
------------
 t
(1 row)

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index 
--------------
 t
(1 row)

gptext.drop_index()

Removes an index.

Syntax

gptext.drop_index(<index_name>)

Parameters

<index_name>

The name of the index to drop. If the database table is partitioned, this must be the name of the root table.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

A dropped index cannot be recovered.

Example

=# SELECT * FROM gptext.drop_index('demo.wikipedia.articles');
 drop_index 
------------
 t
(1 row)

Search

Search functions enable querying an index.

Changing the query parser at query time

When using the search functions, you can change the query parser used by Solr at query time. A different query parser may be required, depending on the nature of the query. See Using Advanced Querying Options for a list of the query parsers VMware Greenplum Text supports.

To change the query parser at query time, use the defType Solr option with the gptext.search() function.

To change the query parser for any search function at query time, use the Solr localParams syntax, replacing the <query> term with '{!type=edismax}<query>'.

With the VMware Greenplum Text Universal Query Parser, you can use features from any of the other supported query parsers in a single query. To use the Universal Query Parser, replace the <query> term with '{!gptextqp}<query>'. See Using the Universal Query Parser for information and examples.


gptext.search()

Searches an index.

Syntax

gptext.search(TABLE(<select-statement>), <index_name>, <search_query>, <filter_queries>[, <options>])

Parameters

TABLE(<select-statement>)

A table-valued expression. This is an anytable pseudo-type, specified using the format:

TABLE(<select_statement>)

The VMware Greenplum query planner processes the <select_statement> to estimate the number of result rows, but the query is not executed by the body of the search function. You can use the expression TABLE(SELECT 1 SCATTER BY 1) as a simple, syntactically correct value for this argument.

<index_name> The name of the index to search. If the database table is partitioned you can specify the name of a sub-partition table to search. <search_query> Text value containing a Solr text search query. <filter_queries> A text array of filter queries, if any. If none, set this parameter to null. <options> An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.search_scored_result

This is a composite type with the following columns:

Column Type
id text
score double precision
hs (conditional) gptext hstore
rf (conditional) text

The id column is returned as text, even if the <id_col> specified in the gptext.create_index() function is an integer type. If you order results by id or join search results with the original table on id, you must cast the returned id column to the correct integer type in your query. For example, the following search query casts the id returned by the search query to an INT8 type to join with the numeric id column in the wikipedia.articles table and to sort the results numerically. However, the id column in the results is a text value and is therefore displayed left-justified.

SELECT s.id, s.score, a.title 
FROM wikipedia.articles a,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 
    'demo.wikipedia.articles', '*:*', null) s
WHERE s.id::INT8 = a.id
ORDER BY s.id::INT8;
    id    | score |           title
----------+-------+---------------------------
 25784    |     1 | Renewable energy
 27743    |     1 | Solar energy
 54838    |     1 | Biogas
 55017    |     1 | Fusion power
 65499    |     1 | Soil salinity
 113728   |     1 | Geothermal energy
 213555   |     1 | Solar updraft tower
 533423   |     1 | Solar water heating
 608623   |     1 | Ethanol fuel
 855056   |     1 | Forward osmosis
 2008322  |     1 | Vehicle-to-grid
 2120798  |     1 | Lithium economy
 2988035  |     1 | Vortex engine
 4711003  |     1 | Osmotic power
 7906908  |     1 | Biomass
 13021878 |     1 | Geothermal power
 13690575 |     1 | Solar power
 14090587 |     1 | Low-carbon power
 14205946 |     1 | Algae fuel
 18965585 |     1 | Pressure-retarded osmosis
 22391303 |     1 | Liquid nitrogen engine
 26787212 |     1 | Reverse electrodialysis
 53716476 |     1 | Seaweed fuel
(23 rows)

If the <options> parameter is included in gptext.search(), the result includes the offsets column. This column contains key-value pairs, where the key is the column name and the value is a comma-separated list of offsets to locations where the search term occurs. This data is used by the gptext.highlight() function to add highlighting tags to the column data. If highlighting is not enabled with the 'hl=true' option, the offsets column is NULL.

If the fl option is included in the <options> parameter to specify additional fields to add to the result, the rf column contains the additional fields in a formatted text value. The gptext.gptext_retrieve_field() function can be used to extract a single field value from the rf column. There are variants of the gptext.gptext_retrieve_field() function to retrieve integer and float values from the rf column value.

Privileges required

You must have the SELECT privilege to execute this function.

Solr options

Solr queries allow the following optional refinements, specified as an ampersand-delimited list in the options parameter.

defType The name of the query parser to use for this query.

Example: defType=edismax rows The maximum number of rows to return per segment. If omitted, all rows are returned.

Example: rows=100 returns 100 rows per segment or all rows if there are fewer than 100. sort Sorts on a field or score in ascending or descending order.

Examples:

  • sort=score desc (default if no sort defined)
  • sort=date_time asc
  • sort=date_time asc score desc sorts on date_time ascending, then on score descending

start The number of the first record to return.

Examples:

  • start=0 default: returned records start with record 0
  • start=25 returned records start with record 25

hl Enable highlighting.

Example: hl=true hl.fl Comma-separated list of field names to consider when highlighting.

Examples:

  • hl.fl=message_text
  • hl.fl=title,content

fl Comma-separated list of fields to include in search results. The fields must have been set to stored=true in the managed-schema for the index.

Example: fl=title,refs

Remarks

  • The output includes a table with columns id (the ID named in gptext.create_index()) and score (the tf-idf score). A column named offsets is included if highlighting is specified in the options parameter. A column named rf is included when a list of additional fields to include is specified in the <options> parameter.

  • To change the query parser at query time, specify the defType option in the options parameter list. For example, setting the options parameter to 'rows=100&defType=edismax' limits the output to 100 rows per segment and will change the query parser to edismax.

  • The TABLE query is planned and affects the estimate for gptext.search(), but does not execute. For example, if your query includes

    gptext.search(TABLE(SELECT * FROM t), ...)
    

    the query planner estimates the number of results as the number of rows in t. This can cause the query planner to ignore the use of an index scan. Use a query like TABLE(SELECT 1 SCATTER BY 1) to avoid this issue.

  • If you do not specify options, gptext.search() returns all rows.

  • The options separator has changed from comma to ampersand (&) in order to support highlighting. If you do not use highlighting, you can revert to using the comma separator by setting the gptext.search_param_separator to ','.

  • The Solr option rows specifies the maximum number of rows to return per segment. For example, if you have four segments,'rows=100' returns at most a total of 400 rows. To limit the number of rows returned for an entire query, set a LIMIT in the SQL query. For example, the following query returns at most 20 rows for the query:

    =# SELECT t.id, q.score, t.message_text FROM twitter.message t, 
        gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message', 
            '*:*', null, null) q 
    WHERE t.id=q.id::int8 LIMIT '20';
    
  • The gptext hstore type is a limited form of the Postgres hstore type. It only has the hstore input and output functions implemented, as gptext_hstore_in and gptext_hstore_out.

Examples

  • Runs a VMware Greenplum Text query that looks for Wikipedia articles that contain the term "optimization", and joins the results to the original VMware Greenplum articles table:

    =# SELECT a.id, a.title, q.score 
    FROM wikipedia.articles a, 
         gptext.search(TABLE(SELECT * FROM wikipedia.articles), 
            'demo.wikipedia.articles', 'optimization', null) q 
    WHERE a.id = q.id::int8 ORDER BY score DESC;
        id    |           title           |   score
    ----------+---------------------------+------------
       213555 | Solar updraft tower       |  1.5528862
     18965585 | Pressure-retarded osmosis | 0.89540845
      7906908 | Biomass                   |  0.8692636
        25784 | Renewable energy          |  0.7473389
       533423 | Solar water heating       |  0.7186527
       608623 | Ethanol fuel              |  0.6943706
        27743 | Solar energy              |  0.6943706
      2008322 | Vehicle-to-grid           |  0.6352812
        55017 | Fusion power              |  0.6347449
     14205946 | Algae fuel                |  0.6347449
     13690575 | Solar power               | 0.58286035
    (11 rows)
    

-   Returns 5 rows from each segment with the text "iphone" highlighted in the `message_text` column. This example requires that you have enabled terms on the `message_text` field in the `demo.twitter.message` table. See the example in the [`gptext.enable_terms()`](#topic19) reference.

    ``` sql
    =# SELECT t.id,
    gptext.highlight(t.message_text, 'message_text', s.hs) message
    FROM twitter.message t,
         gptext.search(TABLE(SELECT 1 SCATTER BY 1),
            'demo.twitter.message','{!gptextqp}iphone', null,
             'rows=5&hl=true&hl.fl=message_text' ) s
    WHERE t.id = s.id::int8;
    id    |                                                       message
----------+---------------------------------------------------------------------------
------------------------------------------
 19714120 | @ayee_Eddy2011 I love pancakes too!  #<em>iPhone</em> #app
 19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
 19416451 | I'm in love with my new <em>iPhone</em>(:
 20257190 | Love my #<em>iphone</em> - only problem now?  I want an #Ipad!
 20759274 | Dropped frutopia on                                                       
                                          
          : My phone... #ciaowaterdamage I hate <em>iPhones</em>.
 20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #ga
dgets #<em>iphone</em> #apple #technology
 19424811 | I hate                                                                    
                                          
          : <em>iPhones</em>:
 20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
 20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
 20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
(10 rows)

gptext.search_count()

Returns the number of documents that match the search query.

Syntax

gptext.search_count(<index_name>, <search_query>, <filter_queries>, <options>)

Parameters

<index_name>

The name of the index. <search_query> The search query. <filter_queries>

A comma-delimited array of filter queries, if any. If none, set this parameter to null.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

bigint

Privileges required

You must have the SELECT privilege to execute this function.

Example

=# SELECT * FROM gptext.search_count('demo.wikipedia.articles', 'bubble', null);
 count 
-------
     3
(1 row)

gptext.search_external()

Searches a VMware Greenplum Text external index.

Syntax

gptext.search_external(<table-exp>, <index_name>, <search_query>, 
      <filter_queries>[, <options>])

Parameters

<table> A table-valued expression. Because external indexes are not associated with a database table, this parameter is ignored. An expression like the following is sufficient:

TABLE(SELECT 1 SCATTER BY 1)

<index_name>

The name of the VMware Greenplum Text external index to search.

<search_query>

Text value containing a Solr text search query.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.search_external_result

This type has the following columns:

Column     |       Type

---------------+------------------ id | text title | text subject | text description | text comments | text author | text keywords | text category | text resourcename | text url | text content_type | text last_modified | text links | text sha256 | text content | text score | double precision meta | text

The last column, meta is present only if the optional <options> argument is included in the search.

Remarks

When you add an external document to the index, Apache Tika extracts a core set of metadata from the document, the columns listed in the Return type section. If any of these core metadata values are not present or do not exist in the document type, the value of the column in the result row is null.

If the <options> argument is supplied, the results contain an additional text column named meta. The meta column contains additional document-type-specific metadata. You can use the gptext.gptext_retrieve_field() function and its variants to extract individual metadata values by name from the meta column.

If the <options> argument contains the fl=<list> Solr option, Solr returns values only for the columns included in <list> and the id, score, and meta columns. Other columns in the result set will have null values. It is more efficient to filter out columns in Solr than to retrieve all columns from Solr and then choose a subset of columns in the SQL SELECT statement.

Examples

  • Finds HTML documents containing the term "facet".

    =# \x on
    =# SELECT id, title 
    FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 
    'facet', '{content_type:*html*}');
    -[ RECORD 1 ]--------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
    title | GPText Function Reference |
          |     VMware Greenplum Text Docs
    -[ RECORD 2 ]--------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/queries.html
    title | Querying GPText Indexes |
          |     VMware Greenplum Text Docs
    -[ RECORD 3 ]--------------------------------------------------------
    id    | http://gptext.docs.pivotal.io/latest/topics/guc_ref.html
    title | GPText Configuration Parameters |
          |     VMware Greenplum Text Docs
    

    Although just two columns are in this result set, the data VMware Greenplum receives from Solr includes all core metadata fields, including the content field, which contains the full text of the document. The next example shows how to limit the data transferred from Solr to VMware Greenplum with the Solr options argument.

  • This example lists the id, title, sha256, and score columns from the core metadata and extracts meta_creation_date from the additional metadata supplied for PDF documents in the gptext-docs external index. The fl=title,sha256 Solr option prevents Solr from transferring unneeded fields from the index to VMware Greenplum. (The id and score columns are always transferred.)

    =# \x on
    =# SELECT id, title, sha256, score,
          gptext.gptext_retrieve_field(meta, 'meta_creation_date') created
    FROM gptext.search_external(TABLE(SELECT 1 SCATTER by 1), 'gptext-docs', '*:*',
          '{content_type:*pdf*}', 'fl=title,sha256');
    -[ RECORD 1 ]-------------------------------------------------------------
    id      | http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf
    title   | VMware Greenplum Text 2.1.3 Documentation | VMware Greenplum Text Docs
    sha256  | 2E063DF5037B9ACC6E180681AE6838077BC5F7A362B4A1E67D9D8FF3E4DD7F3D
    score   | 1
    created | 2017-09-22T17:22:54Z,2017-09-22T17:22:54Z
    -[ RECORD 2 ]-------------------------------------------------------------
    id      | http://gpdb.docs.pivotal.io/latest/pdf/GPDB510Docs.pdf
    title   | Version 5.1.0
    sha256  | AF0B71D032C99A6BE817E1FA2FB774EB7B4D47D75A755ABF54F4F60FEBB92FF7
    score   | 1
    created | 2017-10-31T19:12:41Z,2017-10-31T19:12:41Z
    

gptext.gptext_retrieve_field()

Retrieves a single field from the rf or meta search result column as a text value.

Syntax

gptext.gptext_retrieve_field(rf | meta, <field_name>)

Parameters

rf | meta

The name of the column in which VMware Greenplum Text returns fields. This is rf for search results from regular VMware Greenplum Text indexes and meta for search results from VMware Greenplum Text external indexes.

<field_name>

The name of the field to retrieve.

Remarks

The fl=<field_list> Solr search option is added to the <options> parameter of the gptext.search() function to request additional stored fields. The additional fields are returned in the results in a column named rf (meta for external indexes). This column value has a format like the following:

column_value { name: "_version" value: "1544714234398507008" } column_value { name: "revision" value: "9.70" } column_value { name: "author" value: "jdough" }

The gptext.gptext_retrieve_field() function extracts the value for a single specified field and returns it as a text value. If there is no field with the specified name in the rf column, it returns NULL.

Storing additional fields in an index requires editing managed-schema to specify the fields that should be stored. See Storing Additional Fields in an Index for instructions.


gptext.gptext_retrieve_field_int()

Retrieves a single field from the rf or meta search result column as an integer value.

Syntax

gptext.gptext_retrieve_field_int(rf | meta, <field_name>)

Parameters

rf | meta

The name of the column containing fields to be retrieved. For regular VMware Greenplum Text indexes, it is rf. For VMware Greenplum Text external indexes it is meta.

<field_name>

The name of the integer field to retrieve.

Remarks

The gptext.gptext_retrieve_field_int() function is the same as the gptext.gptext_retrieve_field() function, except that the extracted field value is converted to an integer value.


gptext.gptext_retrieve_field_float()

Retrieves a single field from the search result column as a float value.

Syntax

gptext.gptext_retrieve_field_float(rf | meta, <field_name>)

Parameters

rf | meta

The name of the column containing fields to be retrieved. For regular VMware Greenplum Text indexes, it is rf. For VMware Greenplum Text external indexes it is meta.

<field_name>

The name of the float field to retrieve.

Remarks

The gptext.gptext_retrieve_field_float() function is the same as the gptext.gptext_retrieve_field() function, except that the extracted field value is converted to a float value.


gptext.highlight()

Highlights terms by inserting markup tags into data.

Syntax

gptext.highlight(<column_data>, <column_name>, <offsets>)

Parameters

<column_data>

The text data from the table which is to be tagged with highlighting tags.

<column_name>

The name of the corresponding column from the table. <offsets>

A gptext hstore value that contains key-value pairs that indicate the locations of the text to highlight within the column data. See Remarks for information about the gptext hstore data type.

Prequisite

To use highlighting, full term vectors, postings with term vectors, or stored fields must be enabled before creating the index.

Remarks

  • The offsets parameter is a gptext hstore, where each key is a column name and the value is a comma-separated list of offsets into the column data. This hstore is constructed by gptext.search() with highlighting enabled in the offsets parameter.

    Following is an example of the offsets hstore content:

    "field1"=>"0:5,9:14","field2"=>"13:20"
    

    gptext.highlight() will insert two sets of tags into the field1 data and one set into the field2 data at the indicated offsets.

  • The gptext hstore type is a limited form of the Postgres hstore type. It has only the hstore input and output functions implemented, as gptext_hstore_in and gptext_hstore_out.

  • The highlight tags are defined by the gptext.hl_pre_tag and gptext.hl_post_tag server configuration parameters. Their default values are <em> and </em>, respectively.

Example


gptext.highlight_external()

Highlights terms in search results from external indexes by inserting markup tags.

Syntax

gptext.highlight_external(<table_exp>, <index>, <search_query>, <filter_queries>[, <options>])

Parameters

<table_exp>

A table expression, ignored for external indexes. An expression such as TABLE(SELECT 1 SCATTER BY 1) is sufficient.

<index>

Name of the index containing data to highlight. <search_query>

Text value containing a Solr text search query.

<filter_queries> A text array of filter queries, if any. If none, set this parameter to null.

<options An optional ampersand-delimited list of Solr query parameters. See Solr options.

Remarks

The gptext.highlight_external() function searches a VMware Greenplum Text external index and encloses the search terms in markup tags in the returned results.

Example

Search for and highlight the terms "zookeeper" and "solr" in HTML documents.

=# SELECT id, content FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1),
      'gptext-docs','{!gptextqp}zookeeper AND solr', '{content_type:*html*}',
      'rows=2');
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/administering.html
content |  includes security considerations, monitoring <em>Solr</em> index statistics, managing and monitoring <em>ZooKeeper</em>
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/performance.html
content |  problems can result from resource contention between the VMware Greenplum, <em>Solr</em>, and <em>ZooKeeper</em> clusters
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ha.html
content | /topics/utility_ref.html GPText Management Utilities
        |           rect /210/topics/type_ref.html GPText and <em>Solr</em>
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/indexes.html
content | /topics/utility_ref.html GPText Management Utilities
        |           rect /210/topics/type_ref.html GPText and <em>Solr</em>

gptext.highlight_instant_content()

Highlights terms in local search results by inserting markup tags into data.

Syntax

gptext.highlight_instant_content(<table_query>, <index_name>, <highlight_query>)

Parameters

<table_query>

A table SELECT query that returns two columns: the first column is an <id>, the second column is the text <content> to be highlighted.

<index_name>

Name of the index containing data to highlight. <highlight_query>

Text value containing a Solr text search query.

Remarks

The gptext.highlight_instant_content() function provides a more disk efficient highlighting option. When you invoke this function, Greenplum Text sends the original text to Solr before it applies the highlight.

The function returns records with the following columns:

  • id - The first column returned by the <table_query>.
  • highlighted_content - The result of applying the <highlight_query> on the original content.

You can use the returned id to join the results with other tables.

Example

Search for and highlight the term "solr".

=# SELECT t.id, highlighted_content, date FROM 
     gptext.highlight_instant_content(TABLE(SELECT t.id, content FROM 
       gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.test', 'solr',
          null, 'rows=10') s JOIN hli_test t ON s.id::bigint=t.id),
       'demo.public.test', 'solr') h JOIN hli_test t ON h.id::bigint=t.id;

Faceted Search

Faceting breaks up a search result into multiple categories, showing counts for each.


The faceted_field_search() function breaks search results into field name categories.

Syntax

gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, <facet_fields>, 
    <facet_limit>, <minimum>, <options>)

Parameters

<index_name>

The name of the index. <query>

Query statement. Use *:* to query for all results. <filter_queries>

A text array of filter queries, if any. If none, set this parameter to null. <facet_fields>

An array of field names to facet. Use VMware Greenplum array notation. <facet_limit>

Maximum number of results to be returned for each aggregation (facet). <minimum>

Minimum number of results required before an aggregation (facet) will be returned. Enter 0 to return all facets.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_field_result

This is a composite type with the following columns:

Column Type
field_name text
field_value text
value_count bigint

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

None.

Examples

  • Facet all tweets on spam and truncated fields.

    =# SELECT * 
       FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, 
          '{spam, truncated}', 2, 0);
     field_name | field_value | value_count 
    ------------+-------------+-------------
     spam       | true        |        1730
     truncated  | false       |        1705
     truncated  | true        |          25
    (3 rows)
    
  • Facet on author_id, no limit, with a minimum of five tweets, all tweets. Selects five authors with at least two tweets.

    =# SELECT * 
       FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null, 
          '{author_id}', 5, 2);
     field_name | field_value | value_count 
    ------------+-------------+-------------
     author_id  | 102185050   |           9
     author_id  | 202305785   |           2
     author_id  | 64111799    |           2
     author_id  | 45326213    |           2
     author_id  | 195035308   |           2
    (5 rows)
    

The faceted_query_search() function breaks search results into categories defined by queries that you provide.

Syntax

gptext.faceted_query_search(<index_name>, <query>, <filter_queries>, 
      <facet_queries>, <options>)

Parameters

<index_name>

The name of the index.

<query>

Query statement. Use *:* to query for all results.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<facet_queries>

Type: text[]. Required. An array of facet queries.

<options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_query_result

This is a composite type with the following columns:

Column Type
query_name text
value_count bigint

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

None.

Example

This example uses Solr queries to divide twitter authors into three classes based on number of followers.

=# SELECT * 
FROM gptext.faceted_query_search('demo.twitter.message', '*:*', null,
    '{author_followers_count:[0 TO  5], author_followers_count:[6 TO 10], 
          author_followers_count:[11 TO *]}');
            query_name            | value_count 
----------------------------------+-------------
 author_followers_count:[0 TO  5] |          39
 author_followers_count:[11 TO *] |        1632
 author_followers_count:[6 TO 10] |          36
(3 rows)

The faceted_range_search() function breaks search results into range categories over a numeric or date field, with ranges defined by the <range_start>, <range_end>, and <range_gap> arguments.

Syntax

gptext.faceted_range_search(<index_name>, <query>, <filter_queries>, 
  <field_name>, <range_start>, <range_end>, <range_gap>, <options>)

Parameters

<index_name>

The name of the index.

<query>

Query statement. Use *:* to query for all results.

<filter_queries>

A text array of filter queries, if any. If none, set this parameter to null.

<field_name>

The name of the field on which to facet.

<range_start>

Beginning of the range.

<range_end>

End of the range.

<range_gap>

Size of range increment, a text value. <options>

An optional ampersand-delimited list of Solr query parameters. See Solr options.

Return type

SETOF gptext.facet_range_result

This is a composite type with the following columns:

Column Type
field_name text
range_value text
value_count bigint

Privileges required

You must have the SELECT privilege to execute this function.

Example

Facet on date range from midnight August 1, 2011 to midnight November 1, 2011, with a 7-day gap.

=# SELECT * 
FROM gptext.faceted_range_search('demo.twitter.message', '*:*', null, 
       'created_at', '2011-08-01T00:00:00Z', 
       '2011-11-01T00:00:00Z', '+7DAY');
 field_name |     range_value      | value_count 
------------+----------------------+-------------
 created_at | 2011-08-01T00:00:00Z |           0
 created_at | 2011-08-08T00:00:00Z |           0
 created_at | 2011-08-15T00:00:00Z |           0
 created_at | 2011-08-22T00:00:00Z |          52
 created_at | 2011-08-29T00:00:00Z |         189
 created_at | 2011-09-05T00:00:00Z |         545
 created_at | 2011-09-12T00:00:00Z |           0
 created_at | 2011-09-19T00:00:00Z |         109
 created_at | 2011-09-26T00:00:00Z |          69
 created_at | 2011-10-03T00:00:00Z |          59
 created_at | 2011-10-10T00:00:00Z |         206
 created_at | 2011-10-17T00:00:00Z |         147
 created_at | 2011-10-24T00:00:00Z |         112
 created_at | 2011-10-31T00:00:00Z |          94
(14 rows)

Working with Terms

The optional terms component saves terms output by the Solr index analyzer chain to a vector in the index. Enabling terms is requried to use parts-of-speech and named entity recognition.


gptext.enable_terms()

Enables term vectors and positions to allow extracting terms and their positions from fields of data type text.

Syntax

gptext.enable_terms(<index_name>, <field_name>) 

Parameters

<index_name>

The name of the index for which you want to enable terms.

<field_name>

The name of the field for which you want to enable terms.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

Solr can mark terms and their positions in documents when indexing. This capability is deactivated by default. Use gptext.enable_terms() to activate the capability.

Call gptext.enable_terms() for each field where you want to enable terms.

After calling this function, you must index or re-index with gptext.index().

Examples

=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
WARNING:  Enable terms for field: message_text of index: demo.twitter.message successfully. Reindex data needed.
 enable_terms 
--------------
 t
(1 row)

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
 dbid | num_docs 
------+----------
    3 |      947
    2 |     1020
(2 rows)

=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index 
--------------
 t
(1 row)

gptext.ner_terms()

Gets NER-tagged terms for a text field configured for NER (Named Entity Recognition) from documents in a Solr index that match a query.

Syntax

gptext.ner_terms(TABLE(<select-statement>), <index_name>, <field_name>,
                <search_query>, <filter_queries>[, <options>])
                  
gptext.ner_terms(<index_name>, <field_name>, <search_query>, 
                <filter_queries>[, <options>])

Parameters

TABLE(<select-statement>) This parameter is ignored and can be omitted.

A table expression that specifies a SELECT statement. Specify in the format:

TABLE(SELECT * FROM <src_table>;)

<index_name> The name of the index to query for NER terms. <field_name> The name of the NER-enabled text field to query for NER terms.

<search_query> A query that matches documents to include in the results. <filter_queries> A comma-delimited array of filter queries, if any. If none, set this parameter to null. <options> An optional, comma-delimited list of Solr query parameters.

Return type

SETOF gptext.ner_term_info

This is a composite type with the following columns:

Column Type
id text
term text
entity_type text
frequency integer

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

The gptext.ner_terms() function displays the NER-tagged terms in a field's terms vector for documents that match a query. The <fieldType> of the text field must be configured in the managed-schema for the index with the OpenNLP tokenizers and filters. See Using Named Entity Recognition with GPText for NER configuration instructions.

NER-tagged terms have the form _ner_<entity-type>_<token-text>. Terms that do not have the _ner_ prefix are not included in the output of this function. Use the gptext.terms() function to retrieve all terms from matched documents, including NER-tagged terms.

The output of the gptext.ner_terms() function contains one row for each tagged entity in documents that match the search. The term column contains the term with the _ner_<entity-type>- prefix removed. The entity_type column contains the type of the entity, and the frequency column is the number of times the term appears in the document.

Terms with the same entity type, such as person, organization, and location, that occur in consecutive positions in the document are concatenated to create a compound term in the output. For example, if the terms _ner_organization_federal, _ner_organization_reserve, and _ner_organization_board appear in consecutive positions in the document, the compound term federal reserve board appears in the output. When you perform a search on tagged terms and include highlighting, the compound term is highlighted, rather than each of the consecutive terms.

Example

  1. Get the NER-tagged terms for the content field from the document with id 842613485.

    =# SELECT * FROM gptext.ner_terms('demo.public.news_demo', 'content', 'id:842613485', null);
        id     |              term              | entity_type  | frequency
    -----------+--------------------------------+--------------+-----------
     842613485 |  alan murray                   | person       |         1
     842613485 |  citicorp information services | organization |         1
     842613485 |  morgan stanley                | organization |         1
     842613485 |  stephen roach                 | person       |         1
     842613485 |  mr. murray                    | person       |         1
     842613485 |  alan                          | person       |         1
     842613485 |  mr.                           | person       |         1
     842613485 |  murray                        | person       |         2
     842613485 |  roach                         | person       |         1
     842613485 |  stephen                       | person       |         1
     842613485 |  board                         | organization |         1
     842613485 |  citicorp                      | organization |         1
     842613485 |  consumer                      | organization |         1
     842613485 |  fed                           | organization |         1
     842613485 |  federal                       | organization |         1
     842613485 |  information                   | organization |         1
     842613485 |  morgan                        | organization |         1
     842613485 |  reserve                       | organization |         1
     842613485 |  services                      | organization |         1
     842613485 |  stanley                       | organization |         1
     842613485 |  federal reserve board         | organization |         1
    (21 rows)
    

gptext.terms()

Gets the term vectors for the specified field from documents in a Solr index. You can use gptext.terms() to create terms tables.

Syntax

gptext.terms(TABLE(<select-statement>), <index_name>, <field_name>, 
            <search_query>, <filter_queries>[, <options>])

gptext.terms(<index_name>, <field_name>, <search_query>, <filter_queries>[, options])

Parameters

TABLE(<select-statement>) A table expression that specifies a SELECT statement. Specify in the format:

TABLE(SELECT 1 SCATTER BY 1)

This parameter is ignored and can be omitted.

<index_name> The name of the index to query for terms. The index must have unique document IDs. <field_name> The name of the field to query for terms.

<search_query> A query that the document must match. <filter_queries> A comma-delimited array of filter queries, if any. If none, set this parameter to null. <options> An optional, comma-delimited list of Solr query parameters.

Return type

SETOF gptext.term_info

This is a composite type with the following columns:

Column Type
id text
term text
positions integer[]

Privileges required

You must have the SELECT privilege to execute this function.

Remarks

The gptext.terms() function cannot be used with indexes created with if_check_id_uniqueness set to false.

To enable using gptext.terms(), execute the VMware Greenplum Text function gptext.enable_terms(), then reindex with gptext.index().

If the field has been tagged with Named Entity Recognition (NER) or Parts-of-Speech (POS) filters, the tagged terms appear in the gptext.terms() output in raw format, for example _ner_person_david. Use the gptext.ner_terms() function to view only NER-tagged terms.

Examples

This example creates a terms table from the output of the gptext.terms() function.

=# CREATE TABLE twitter.terms AS 
    SELECT * FROM gptext.terms(
     'demo.twitter.message', 'message_text', 'iphone', null)
   DISTRIBUTED BY (id);
SELECT 5385

VMware Greenplum Text Index Monitoring

These functions provide statistics and status for indexes managed by the VMware Greenplum Text cluster.


gptext.cluster_status()

Shows the status of indexes managed by the VMware Greenplum Text cluster.

Syntax

gptext.cluster_status()

Return Type

SETOF gptext.cluster_status_result

This is a composite type with the following columns:

Column Type
index_name text
max_shards_per_node integer
router text
replication_factor integer
auto_add_replicas boolean
znode_version integer
config_name text
partitioned boolean

Example

=# SELECT * FROM gptext.cluster_status();
index_name        | max_shards_per_node |  router  | replication_factor | auto_add_
replicas | znode_version |       config_name       | partitioned
-------------------------+---------------------+----------+--------------------+----------
---------+---------------+-------------------------+-------------
demo.twitter.message    |                   4 | implicit |                  2 | f
  |             8 | demo.twitter.message    | t
demo.wikipedia.articles |                   4 | implicit |                  2 | f
  |             8 | demo.wikipedia.articles | f
(2 rows)

gptext.index_size()

Shows the number of documents indexed and total disk space used for VMware Greenplum Text indexes.

Syntax

gptext.index_size([<index_name>])

Parameters

<index_name>

The name of the index. Optional. Returns sizes for all indexes if no index is specified.

Return Types

SETOF gptext.index_size_result

This is a composite type with the following columns:

Column Type
index_name text
num_docs bigint
size_in_bytes bigint

Examples

=# SELECT * FROM gptext.index_size();
       index_name        | num_docs | size_in_bytes
-------------------------+----------+---------------
 demo.wikipedia.articles |       23 |        500515
 demo.twitter.message    |     1730 |        767118
 gptext-docs             |       16 |        618231
(3 rows)
=# SELECT * FROM gptext.index_size('demo.wikipedia.articles');
       index_name        | num_docs | size_in_bytes
-------------------------+----------+---------------
 demo.wikipedia.articles |       23 |        500515
(1 row)

gptext.index_status()

Shows status of replicas for a specified index or for all indexes.

Syntax

gptext.index_status([<index_name>])

Parameters

<index_name>

The name of the index. Optional. Returns status for all indexes if no index is specified.

Return Type

SETOF gptext.index_status_result

This is a composite type with the following columns:

Column Type
index_name text
shard_name text
shard_state text
replica_name text
replica_state text
core text
node_name text
base_url text
is_leader boolean
partitioned boolean
external_index boolean

Examples

  1. Show status for a single index.

    =# SELECT * FROM gptext.index_status('demo.wikipedia.articles');
           index_name        | shard_name | shard_state | replica_name | replica_state |                    core
                  |    node_name    |        base_url        | is_leader | partitioned | external_index
    -------------------------+------------+-------------+--------------+---------------+------------------------------
    --------------+-----------------+------------------------+-----------+-------------+----------------
     demo.wikipedia.articles | shard1     | active      | core_node3   | active        | demo.wikipedia.articles_shard
    1_replica_n1  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
     demo.wikipedia.articles | shard1     | active      | core_node5   | active        | demo.wikipedia.articles_shard
    1_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
     demo.wikipedia.articles | shard2     | active      | core_node7   | active        | demo.wikipedia.articles_shard
    2_replica_n4  | sdw1:18984_solr | http://sdw1:18984/solr | f         | f           | f
     demo.wikipedia.articles | shard2     | active      | core_node9   | active        | demo.wikipedia.articles_shard
    2_replica_n6  | sdw2:18984_solr | http://sdw2:18984/solr | t         | f           | f
     demo.wikipedia.articles | shard3     | active      | core_node11  | active        | demo.wikipedia.articles_shard
    3_replica_n8  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
     demo.wikipedia.articles | shard3     | active      | core_node13  | active        | demo.wikipedia.articles_shard
    3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
     demo.wikipedia.articles | shard4     | active      | core_node15  | active        | demo.wikipedia.articles_shard
    4_replica_n12 | sdw1:18984_solr | http://sdw1:18984/solr | t         | f           | f
     demo.wikipedia.articles | shard4     | active      | core_node16  | active        | demo.wikipedia.articles_shard
    4_replica_n14 | sdw2:18984_solr | http://sdw2:18984/solr | f         | f           | f
    (8 rows)    ```
    
    
  2. Show status for all VMware Greenplum Text indexes.

    =# SELECT * FROM gptext.index_status();
        index_name        | shard_name | shard_state | replica_name | replica_state |                    core                    |    node_name    |        base_url        | is_leader | partitioned | external_index 
    -------------------------+------------+-------------+--------------+---------------+--------------------------------------------+-----------------+------------------------+-----------+-------------+----------------
    demo.public.news_demo   | shard1     | active      | core_node3   | active        | demo.public.news_demo_shard1_replica_n1    | sdw2:18984_solr | http://sdw2:18984/solr | t         | f           | f
    demo.public.news_demo   | shard1     | active      | core_node5   | active        | demo.public.news_demo_shard1_replica_n2    | sdw1:18984_solr | http://sdw1:18984/solr | f         | f           | f
    demo.public.news_demo   | shard2     | active      | core_node7   | active        | demo.public.news_demo_shard2_replica_n4    | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
    demo.public.news_demo   | shard2     | active      | core_node8   | active        | demo.public.news_demo_shard2_replica_n6    | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
    demo.twitter.message    | shard1     | active      | core_node3   | active        | demo.twitter.message_shard1_replica_n1     | sdw2:18984_solr | http://sdw2:18984/solr | t         | t           | f
    demo.twitter.message    | shard1     | active      | core_node5   | active        | demo.twitter.message_shard1_replica_n2     | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
    demo.twitter.message    | shard2     | active      | core_node7   | active        | demo.twitter.message_shard2_replica_n4     | sdw2:18983_solr | http://sdw2:18983/solr | f         | t           | f
    demo.twitter.message    | shard2     | active      | core_node9   | active        | demo.twitter.message_shard2_replica_n6     | sdw1:18984_solr | http://sdw1:18984/solr | t         | t           | f
    demo.twitter.message    | shard3     | active      | core_node11  | active        | demo.twitter.message_shard3_replica_n8     | sdw2:18984_solr | http://sdw2:18984/solr | t         | t           | f
    demo.twitter.message    | shard3     | active      | core_node13  | active        | demo.twitter.message_shard3_replica_n10    | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
    demo.twitter.message    | shard4     | active      | core_node15  | active        | demo.twitter.message_shard4_replica_n12    | sdw2:18983_solr | http://sdw2:18983/solr | f         | t           | f
    demo.twitter.message    | shard4     | active      | core_node16  | active        | demo.twitter.message_shard4_replica_n14    | sdw1:18984_solr | http://sdw1:18984/solr | t         | t           | f
    demo.wikipedia.articles | shard1     | active      | core_node3   | active        | demo.wikipedia.articles_shard1_replica_n1  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
    demo.wikipedia.articles | shard1     | active      | core_node5   | active        | demo.wikipedia.articles_shard1_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
    demo.wikipedia.articles | shard2     | active      | core_node7   | active        | demo.wikipedia.articles_shard2_replica_n4  | sdw1:18984_solr | http://sdw1:18984/solr | f         | f           | f
    demo.wikipedia.articles | shard2     | active      | core_node9   | active        | demo.wikipedia.articles_shard2_replica_n6  | sdw2:18984_solr | http://sdw2:18984/solr | t         | f           | f
    demo.wikipedia.articles | shard3     | active      | core_node11  | active        | demo.wikipedia.articles_shard3_replica_n8  | sdw2:18983_solr | http://sdw2:18983/solr | t         | f           | f
    demo.wikipedia.articles | shard3     | active      | core_node13  | active        | demo.wikipedia.articles_shard3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | f           | f
    demo.wikipedia.articles | shard4     | active      | core_node15  | active        | demo.wikipedia.articles_shard4_replica_n12 | sdw1:18984_solr | http://sdw1:18984/solr | t         | f           | f
    demo.wikipedia.articles | shard4     | active      | core_node16  | active        | demo.wikipedia.articles_shard4_replica_n14 | sdw2:18984_solr | http://sdw2:18984/solr | f         | f           | f
    (20 rows)
    
    

gptext.partition_status()

Lists indexes on partitioned tables or child partition names in the current VMware Greenplum.

Syntax

gptext.partition_status([<index_name>])

Parameters

<index_name>

Optional. Returns partition status for all indexes if no index is specified.

Return Type

SETOF gptext.partition_status_result

This is a composite type with the following columns:

Column Type
partition_name text
inherits_name text
level integer
cons text

Example

List partition status for an index.

=# SELECT partition_name, inherits_name, level 
   FROM gptext.partition_status('demo.twitter.message');
        partition_name        |    inherits_name     | level
------------------------------+----------------------+-------
 demo.twitter.message_1_prt_1 | demo.twitter.message |     1
 demo.twitter.message_1_prt_2 | demo.twitter.message |     1
 demo.twitter.message_1_prt_3 | demo.twitter.message |     1
 demo.twitter.message_1_prt_4 | demo.twitter.message |     1
(4 rows)

Remarks

The gptext.partition_status() function can only list the index partitions for tables in the current VMware Greenplum.

gptext.index_summary()

Shows replica status and core statistics for a specified index or for all indexes.

Syntax

gptext.index_summary([<index_name>])

Parameters

<index_name>

Optional. Returns information for all indexes if no index is specified.

Return Type

SETOF gptext.index_summary_result

This is a composite (row) type with the following columns:

Column Type
index_name text
shard_name text
shard_state text
replica_name text
replica_state text
core text
node_name text
base_url text
is_leader boolean
partitioned boolean
external_index boolean
core_name text
instance_dir text
data_dir text
config text
schema text
start_time text
uptime bigint
num_docs integer
max_docs integer
delete_docs integer
index_heap_usage_bytes bigint
version bigint
segment_count integer
current boolean
has_deletions boolean
directory text
last_modified text
size_in_bytes bigint
size text

Notes

The gptext.index_summary() function can appear to hang while it waits on a lock when SolrCloud is performing recovery. If recovery takes a long time, the function times out. The timeout period is 600 seconds by default.

Example

List the Solr node, number of documents, and size of the leader replica for each shard of an index.

=# SELECT index_name, shard_name, node_name, num_docs, size_in_bytes 
   FROM gptext.index_summary('demo.twitter.message') 
   WHERE is_leader;
      index_name      | shard_name |    node_name    | num_docs | size_in_bytes 
----------------------+------------+-----------------+----------+---------------
 demo.twitter.message | shard4     | sdw2:18983_solr |      417 |        295873
 demo.twitter.message | shard3     | sdw2:18984_solr |      449 |        302987
 demo.twitter.message | shard2     | sdw2:18983_solr |      449 |        311868
 demo.twitter.message | shard1     | sdw2:18984_solr |      415 |        282736
(4 rows)

VMware Greenplum Text Index Configuration

The functions in this section provide configuration information and metrics for VMware Greenplum Text indexes and help to configure indexes.


gptext.analyzer()

Shows the output from each class in the index or query analyzer chain for a given field type and user-supplied input text.

Syntax

gptext.analyzer(<index_name>, {'index' | 'query'}, <field_type>, <input>)

Parameters

<index_name> The name of the index. {'index' | 'query'} Show output for the index analysis chain or the query analysis chain. <field_type> The field type to analyze. <input> A text string to pass through the analyzer.

Return Type

Text

Example

=# SELECT gptext.analyzer('demo.wikipedia.articles', 'index', 'text_intl',
    'Chopin takes a wayward sidestep into a delicious series of upward flying scales');
                                                                                          analyzer

----------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------
 (WorldLexerTokenizer,"{{""Chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""}
,{""upward""},{""flying""},{""scales""}}")
 (CJKWidthFilter,"{{""Chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""u
pward""},{""flying""},{""scales""}}")
 (LowerCaseFilter,"{{""chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""
upward""},{""flying""},{""scales""}}")
 (WorldLexerBigramFilter,"{{""chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of
""},{""upward""},{""flying""},{""scales""}}")
 (StopFilter,"{{""chopin""},{""takes""},{},{""wayward""},{""sidestep""},{},{},{""delicious""},{""series""},{},{""upward""},{""flying""},{""sca
les""}}")
 (SetKeywordMarkerFilter,"{{""chopin""},{""takes""},{},{""wayward""},{""sidestep""},{},{},{""delicious""},{""series""},{},{""upward""},{""flyi
ng""},{""scales""}}")
 (PorterStemFilter,"{{""chopin""},{""take""},{},{""wayward""},{""sidestep""},{},{},{""delici""},{""seri""},{},{""upward""},{""fly""},{""scale"
"}}")
(7 rows)

Remarks

Each row of the gptext.analyzer() function result is a single text column containing the name of the analyzer class and a list of the tokens the class produced from the output of the previous class.

You can use this function to compare the output produced by analyzer chains configured for different field types and to test your analyzer configuration.


gptext.config_append()

Appends the contents of a local file to a ZooKeeper index configuration file.

Syntax

gptext.config_append(<index_name>, <local_config_file> [, <index_config_file>])

Parameters

<index_name> The name of the index to configure. <local_config_file> The path and filename of a local file that you will append to the index configuration file.

<index_config_file> Optional. The name of the ZooKeeper configuration file to which you will append the local file. If you omit this parameter, the function appends the local file to a file of the same name that resides in the top-level ZooKeeper directory.

Return Type

boolean

Example

Append the local file /home/gpadmin/stopwords.add to the top-level ZooKeeper file stopwords.txt for index demo.wikipedia.articles:

=# SELECT * FROM gptext.config_append('demo.wikipedia.articles', '/home/gpadmin/stopwords.add', 'stopwords.txt');
 config_append
---------------
 t
(1 row)

gptext.config_delete()

Deletes an index configuration file from ZooKeeper.

Syntax

gptext.config_delete(<index_name>, <index_config_file>)

Parameters

<index_name> The name of the index that has the file to delete. <index_config_file> The ZooKeeper configuration file to delete. Include the path if the file does not reside at the top-level directory.

Return Type

boolean

Example

Delete the file named stopwords.add from the top-level configuration directory for the index demo.wikipedia.articles:

=# select * from gptext.config_delete('demo.wikipedia.articles', 'stopwords.add');
 config_delete
---------------
 t
(1 row)

gptext.config_get()

Displays the contents of a ZooKeeper index configuration file.

Syntax

gptext.config_get(<index_name>, <index_config_file>)

Parameters

<index_name> The name of the index that has the file you want to display. <index_config_file> The ZooKeeper configuration file to display. Include the path if the file does not reside at the top-level ZooKeeper directory for the index.

Return Type

text

Example

Display the contents of synonyms.txt for the index demo.wikipedia.article:

=# select * from gptext.config_get('demo.wikipedia.articles', 'synonyms.txt');
                                 config_get
----------------------------------------------------------------------------
 # The ASF licenses this file to You under the Apache License, Version 2.0
 # (the "License"); you may not use this file except in compliance with
 # the License.  You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.

 #-----------------------------------------------------------------------
 #some test synonym mappings unlikely to appear in real input text
 aaa => aaaa
 bbb => bbbb1 bbbb2
 ccc => cccc1,cccc2
 a\=>a => b\=>b
 a\,a => b\,b
 fooaaa,baraaa,bazaaa

 # Some synonym groups specific to this example
 GB,gib,gigabyte,gigabytes
 MB,mib,megabyte,megabytes
 Television, Televisions, TV, TVs
 #notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
 #after us won't split it into two words.

 # Synonym mappings can be used for spelling correction too
 pixima => pixma


(1 row)

gptext.config_list()

Lists the ZooKeeper configuration files and directories for an index.

Syntax

gptext.config_list(<index_name>, [<index_config_path>,] <is_recursive>)

Parameters

<index_name> The name of the index that has the files and directories you want to list. <index_config_path> Optional. A specific directory in the ZooKeeper configuration that you want to list. Omit this option to list configuration files and directories in the top-level directory. <is_recursive> Optional. A boolean value that determines whether the function recursively lists files and directories that are present in subdirectories.

Return Type

SETOF text

Examples

List ZooKeeper configuration files and directories only in the top-level directory for the index demo.wikipedia.articles:

=# select * from gptext.config_list('demo.wikipedia.articles', false);
         config_list
-----------------------------
 currency.xml
 mapping-FoldToASCII.txt
 managed-schema
 protwords.txt
 scripts.conf
 synonyms.txt
 managed_schema
 stopwords.txt
 velocity
 admin-extra.html
 aggconfig.xml
 emoticons.txt
 solrconfig.xml
 elevate.xml
 xslt
 mapping-ISOLatin1Accent.txt
 spellings.txt
 lang
(18 rows)

List ZooKeeper configuration files in the ZooKeeper lang subdirectory for demo.wikipedia.articles:

=# select * from gptext.config_list('demo.wikipedia.articles', 'lang', false);
       config_list
--------------------------
 lang/contractions_it.txt
 lang/contractions_ca.txt
 lang/stemdict_nl.txt
 lang/stopwords_hy.txt
 lang/stopwords_no.txt
 lang/stopwords_id.txt
 [...]
(39 rows)

List all configuration files and directories for demo.wikipedia.articles:

=# select * from gptext.config_list('demo.wikipedia.articles', true);
           config_list
----------------------------------
 currency.xml
 mapping-FoldToASCII.txt
 managed-schema
 protwords.txt
 scripts.conf
 synonyms.txt
 managed_schema
 stopwords.txt
 velocity
 velocity/doc.vm
 velocity/suggest.vm
 velocity/hit.vm
 [...]
(86 rows)

gptext.config_upload()

Uploads an index configuration file to ZooKeeper, replacing any existing file of the same name.

Syntax

gptext.config_upload(<index_name>, <local_config_file> [, <index_config_file>])

Parameters

<index_name> The name of the index to configure. <local_config_file> The path and filename of a local file that you wan to upload to ZooKeeper for the index. The function uploads this file to a file the same name in the top-level ZooKeeper directory for the index, unless you include the <index_config_file> option to change the path or filename.

<index_config_file> Optional. The destination path for the file in ZooKeeper. If you omit this parameter, the function uploads the local file to the top-level ZooKeeper directory for the index.

Return Types

boolean

Examples

Upload the local file /home/gpadmin/stopwords.txt to ZooKeeper, overwriting the existing stopwords.txt file for the index demo.wikipedia.articles:

=# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords.txt');
 config_upload
---------------
 t
(1 row)

Upload the local file /home/gpadmin/stopwords_japanese.txt to ZooKeeper, overwriting the file lang/stopwords_ja.txt for the index demo.wikipedia.articles:

# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords_japanese.txt', 'lang/stopwords_ja.txt');
 config_upload
---------------
 t
(1 row)

gptext.get_field_type()

Displays the analyzer chain for a field type defined in the configuration for a specified index.

Syntax

gptext.get_field_type(<index_name>, <field_type>)

Parameters

<index_name>

The name of the index.

<field_type>

The name of a field type defined in the managed-schema configuration file for the index.

Example

=# SELECT gptext.get_field_type('demo.wikipedia.articles', 'text');
                 get_field_type
-------------------------------------------------
 {
  "name": "text",
  "class": "solr.TextField",
  "indexAnalyzer": {
   "tokenizer": {
    "class": "solr.WhitespaceTokenizerFactory"
   },
   "filters": [
    {
     "class": "solr.StopFilterFactory",
     "attributes": [
      {
       "name": "words",
       "value": "stopwords.txt"
      },
      {
       "name": "ignoreCase",
       "value": "true"
      }
     ]
    },
    {
     "class": "solr.WordDelimiterFilterFactory",
     "attributes": [
      {
       "name": "catenateNumbers",
       "value": "1"
      },
      {
       "name": "generateNumberParts",
       "value": "1"
      },
      {
       "name": "splitOnCaseChange",
       "value": "1"
      },
      {
       "name": "generateWordParts",
       "value": "1"
      },
      {
       "name": "catenateAll",
       "value": "0"
      },
      {
       "name": "catenateWords",
       "value": "1"
      }
     ]
    },
    {
     "class": "solr.LowerCaseFilterFactory"
    },
    {
     "class": "solr.KeywordMarkerFilterFactory",
     "attributes": [
      {
       "name": "protected",
       "value": "protwords.txt"
      }
     ]
    },
    {
     "class": "solr.PorterStemFilterFactory"
    }
   ]
  },
  "queryAnalyzer": {
   "tokenizer": {
    "class": "solr.WhitespaceTokenizerFactory"
   },
   "filters": [
    {
     "class": "solr.SynonymFilterFactory",
     "attributes": [
      {
       "name": "expand",
       "value": "true"
      },
      {
       "name": "ignoreCase",
       "value": "true"
      },
      {
       "name": "synonyms",
       "value": "synonyms.txt"
      }
     ]
    },
    {
     "class": "solr.StopFilterFactory",
     "attributes": [
      {
       "name": "words",
       "value": "stopwords.txt"
      },
      {
       "name": "ignoreCase",
       "value": "true"
      }
     ]
    },
    {
     "class": "solr.WordDelimiterFilterFactory",
     "attributes": [
      {
       "name": "catenateNumbers",
       "value": "0"
      },
      {
       "name": "generateNumberParts",
       "value": "1"
      },
      {
       "name": "splitOnCaseChange",
       "value": "1"
      },
      {
       "name": "generateWordParts",
       "value": "1"
      },
      {
       "name": "catenateAll",
       "value": "0"
      },
      {
       "name": "catenateWords",
       "value": "0"
      }
     ]
    },
    {
     "class": "solr.LowerCaseFilterFactory"
    },
    {
     "class": "solr.KeywordMarkerFilterFactory",
     "attributes": [
      {
       "name": "protected",
       "value": "protwords.txt"
      }
     ]
    },
    {
     "class": "solr.PorterStemFilterFactory"
    }
   ]
  },
  "attributes": [
   {
    "name": "autoGeneratePhraseQueries",
    "value": "true"
   },
   {
    "name": "positionIncrementGap",
    "value": "100"
   }
  ]
 }

(1 row)

Remarks

You can use the gptext.list_field_types() function to list the text field types defined in the configuration for an index.

See Customizing VMware Greenplum Text Indexes for information about text analyzer chains.


gptext.list_field_types()

Lists available field types in the managed-schema configuration file for a specified VMware Greenplum Text index.

Syntax

gptext.list_field_types(<index_name>)

Parameters

<index_name>

The name of the index.

Return Type

SETOF text

Example

List the field types defined in the managed-schema configuration file for the demo.wikipedia.articles index.

=# SELECT gptext.list_field_types('demo.wikipedia.articles');
     list_field_types
---------------------------
 ancestor_path
 delimited_payloads_float
 delimited_payloads_int
 delimited_payloads_string
 descendent_path
 lowercase
 phonetic_en
 text
 text_ar
 text_bg
 text_ca
 text_cjk
 text_cz
 text_da
 text_de
 text_el
 text_en
 text_en_splitting
 text_en_splitting_tight
 text_es
 text_eu
 text_fa
 text_fi
 text_fr
 text_ga
 text_general
 text_general_rev
 text_gl
 text_hi
 text_hu
 text_hy
 text_icu
 text_id
 text_intl
 text_intl_prev
 text_it
 text_ja
 text_lv
 text_nl
 text_no
 text_pt
 text_ro
 text_ru
 text_sm
 text_sv
 text_th
 text_tr
 text_ws
 text_zhsmart
(49 rows)

gptext.reload_index()

Reloads Solr configuration files if they have been modified.

Syntax

gptext.reload_index(<index_name>)

Parameters

<index_name>

Optional. The name of the index for which to reload the configuration files.

Return type

boolean

Privileges required

Only the OWNER can execute this function.

Remarks

None.

Example

=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
 reload_index 
--------------
 t
(1 row)

VMware Greenplum Text Cluster Monitoring and Management

These functions provide information about the VMware Greenplum Text cluster operation and status.

gptext.live_nodes() – lists active Solr nodes.

gptext.version() – returns version of VMware Greenplum Text installation.

gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.


gptext.live_nodes()

Lists active Solr nodes and their up or down state.

Syntax

gptext.live_nodes()

Return Type

SETOF gptext.live_nodes_result

This is a composite type with the following columns:

Column Type
host text
port bigint
data_dir text
status text

Example

=# SELECT * FROM gptext.live_nodes();
host  | port  |      data_dir       | status
--------+-------+---------------------+--------
gpdb51 | 18983 | /data/gpdata1/solr0 | u
gpdb51 | 18984 | /data/gpdata2/solr0 | u
(2 rows)

Remarks

The status column can be u (up) or d (down).


gptext.version()

Returns the version of your VMware Greenplum Text installation.

Syntax

SELECT * FROM gptext.version()

Parameters

None.

Return type

text

Privileges required

You do not need any privileges to execute this function.

Example

=# SELECT * FROM gptext.version();
version
--------------------------------
Greenplum Text Analytics 2.1.3
(1 row)

gptext.zookeeper_hosts()

Returns a list of ZooKeeper hosts and ports.

Syntax

gptext.zookeeper_hosts()

Return type

text

Remarks

This function returns a comma-separated list of ZooKeeper nodes in the the format <host-name>:<port>.

Example

=# SELECT * FROM gptext.zookeeper_hosts()
host  | port
--------+------
gpdb51 | 2188
gpdb51 | 2189
gpdb51 | 2190
(3 rows)

High Availability


gptext.add_replica()

Adds a replica of an index shard.

Syntax

gptext.add_replica(<index_name>, <shard_name>[, <node_name>])

Parameters

<index_name>

Name of the index. If the index is for a partitioned database table, this must be the name of the root table. <shard_name>

Name of the shard to replicate. <node_name>

Name of the node where the replica is to be added. Optional. If omitted, SolrCloud chooses the node.

Return type

boolean

Remarks

This function is used by the VMware Greenplum Text management utility gptext-replica add.

The value of the gptext.replication_factor configuration parameter when an index is created determines how many replicas are created for each shard. In a Greenplum system, there are the same number of shards as there are Greenplum segments. The number of replicas created for a new index is the number of segments times the value of the gptext.replication_factor configuration parameter, 2 by default. The replicas are distributed evenly among the live VMware Greenplum Text nodes.

Replicas consume space on the host where they are created, so they are usually only created to replace a replica that has failed or become unavailable or to relocate a replica to another VMware Greenplum Text instance. When adding replicas, you should maintain equal distribution of replicas among the VMware Greenplum Text nodes and avoid placing multiple replicas for the same shard on the same host.

The total number of replicas for an index that can be placed on each VMware Greenplum Text node is set when the index is created. (In Solr, this is the MaxShardsPerNode parameter.) VMware Greenplum Text sets this limit by calculating the number of replicas to create per node and adding an additional factor, specified in the gptext.extension_factor server configuration parameter. This parameter can be set between 0 and 10; the default value is 2. Since the limit is set when the index is created, it is recommended to set the gptext.extension_factor parameter to a higher number to allow new replicas to be created when necessary.

Example

=# SELECT * FROM gptext.add_replica('demo.wikipedia.articles', 'shard1');
 success |                core_name
---------+-----------------------------------------
 t       | demo.wikipedia.articles_shard1_replica3
(1 row)

gptext.delete_replica()

Deletes a named replica from the specified index and shard.

Syntax

gptext.delete_replica(<index_name>, <shard_name>, <replica_name>[, <only_if_down>])

Parameters

<index_name>

Name of the index. <shard_name>

Name of the shard that contains the replica to delete. <replica_name>

Name of the replica to remove. <only_if_down>

Optional. When true, no action is taken if the replica is active. Default is false.

Return type

boolean

Remarks

Use the gptext.index_status() function to find the name of the replica to drop. Names are in the format core_nodeX, where X is a number.

This function is called from the gptext-replica drop management utility.

Examples

  1. Delete the core_node5 replica if it is down.

    =# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5', true);
    ERROR:  Delete replica failed: Attempted to remove replica : demo.wikipedia.articles/shard1/core_node5 with onlyIfDown='true', but state is 'active'.
    
  2. Delete the core_node5 replica even if it is active.

    =# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5');
     success 
    ---------
     t
    (1 row)
    

General Purpose Functions


gptext.count_t()

Counts the number of records in a table.

Syntax

gptext.count_t(<table_name>)

Parameters

<table_name>

Name of the table for which to count records.

Return type

integer

Privileges required

You need SELECT privileges on <table_name> to execute this function.

Example

=# SELECT * FROM gptext.count_t('demo.wikipedia.articles');
count_t
---------
     23
(1 row)
check-circle-line exclamation-circle-line close-line
Scroll to top icon