The following functions are available in Tanzu Greenplum Text.
gptext.create_index() – creates an empty index.
gptext.create_index_external() - creates an index for external documents.
gptext.index() – populates an index.
gptext.index_external() - adds documents to an external index.
gptext.index_external_dir() - adds all documents in a directory in an external document source to a Tanzu Greenplum Text external index.
gptext.commit_index() – finalizes index operations.
gptext.extract_rich_doc() – shows the content extracted from an external document, without actually adding the content to the index.
gptext.recreate_error_table() – recreates the error table that records errors that occur while adding documents to an external index.
gptext.external_login() – log in to an external document store that requires authentication.
gptext.external_logout() – log out of an external document store.
gptext.add_field() – adds a field to an index.
gptext.delete() – deletes documents matching a search query.
gptext.drop_field() – deletes a field from an index.
gptext.drop_index() - deletes an index.
gptext.search() – searches an index.
gptext.search_count() – returns number of documents that match search.
gptext.search_external() - searches a Tanzu Greenplum Text external index.
gptext.gptext_retrieve_field() – extracts a single field from the rf
search result column as text.
gptext.gptext_retrieve_field_int() – extracts a single field from the rf
search result column and converts to an integer.
gptext.gptext_retrieve_field_float() – extracts a single field from the rf
search result column and converts to a float.
gptext.highlight() – returns search result with search term highlighted.
gptext.highlight_external() – applys highlighting to search results from external indexes.
gptext.highlight_instant_content() – returns local search results with search term highlighted.
gptext.faceted_field_search() – search, faceted by fields.
gptext.faceted_query_search() – search, faceted by queries.
gptext.faceted_range_search() – search, faceted by defined ranges.
gptext.enable_terms() – enables term vectors and positions to allow extracting terms and their positions from text
fields.
gptext.ner_terms() - gets terms tagged with Named Entity Recognition (NER) filters from the term vector for a specified field.
gptext.terms() – gets the term vectors for the indexed documents in a Solr index for the specified field.
gptext.cluster_status() - shows status of indexes managed by the Tanzu Greenplum Text cluster.
gptext.index_size() - shows the number of documents indexed and total disk space used for Tanzu Greenplum Text indexes.
gptext.index_status() – shows status of replicas for an index or for all indexes.
gptext.partition_status() - lists partitioned indexes and child partitions.
gptext.index_summary()- shows configuration details, status, and metrics for all of the replicas for a specified index or for all Tanzu Greenplum Text indexes.
gptext.analyzer() - shows the output from each class in the index or query analyzer chain for a given field type and user-supplied input text.
gptext.config_append() - appends the contents of a local file to a ZooKeeper configuration file for an index.
gptext.config_delete() - deletes an index configuration file from ZooKeeper.
gptext.config_get() - displays the contents of a ZooKeeper index configuration file.
gptext.config_list() - lists ZooKeeper configuration files and directories for an index.
gptext.config_upload() - uploads an index configuration file to ZooKeeper.
gptext.get_field_type() - displays the analyzer chain for a field type defined in the configuration for a specified index.
gptext.list_field_types() - lists the text field types defined in the configuration for a specified index.
gptext.reload_index() – reloads Solr configuration files.
gptext.live_nodes() – lists active Solr nodes.
gptext.version() – returns version of Tanzu Greenplum Text installation.
gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.
gptext.add_replica() – Adds a replica of an index shard.
gptext.delete_replica() – Deletes a replica of an index shard.
gptext.count_t() – counts number of rows in a table.
Your privileges to execute the Tanzu Greenplum Text functions depend on your Greenplum Database privileges for the table from which the index is generated. For example, if you have SELECT privileges for a table in the Greenplum database, you have SELECT privileges for an index generated from that table.
Executing index functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Security section of the Tanzu Greenplum Text User's Guide for information about setting privileges.
The Privileges required section for each of the Tanzu Greenplum Text functions specifies the privileges required to execute that function.
The gptext
functions in this section must be executed as SQL queries in the form:
SELECT * FROM gptext.function();
The examples in this document use a Greenplum database named demo
set up as follows:
articles
in the wikipedia
schema.message
in the twitter
schema.See Setting Up the Sample Database for details about these tables.
Indexing functions create, set up, populate, and finalize (commit) Solr indexes.
Creates an empty Solr index.
gptext.create_index(<schema_name>, <table_name>, <id_col_name>,
<def_search_col_name> [, <if_check_id_uniqueness>])
or
gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>,
<id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])
<schema_name>
The name of the schema in the Greenplum database.
<table_name>
The name of the table in the Greenplum database. To include all subpartitions of a partitioned table in the index, specify the name of the root table.
<p_columns>
A text array containing the names of the table columns to index. If <p_columns>
and <p_types>
are omitted, all table columns are indexed.
The columns must be valid columns in the table. The columns identified by the <id_col_name>
and <def_search_col_name>
must be included in the array.
If the <p_columns>
parameter is supplied, the <p_types>
parameter must also be supplied. The sizes of the <p_columns>
and <p_types>
arrays must be the same.
<p_types>
A text array containing the Solr data types of the columns in the <p_columns>
array.
Text types can be mapped to the name of an analyzer chain, for example <text_intl>
, <text_sm>
, or any type defined in the <managed_schema>
. See Map Greenplum Database Data Types to Solr Data Types for equivalent Solr data types for other Greenplum types.
The <p_types>
parameter must be supplied if the <p_columns>
parameter is supplied.
<id_col_name>
The name of a column in <table_name>
that is unique for each row. The column must be of type int4
, int8
, varchar
, text
, or uuid
.
<def_search_col_name>
The name of the default column to search in <table_name>
, if no other column is named in a query.
<if_check_id_uniqueness>
Optional. A Boolean value. The default is true. Set to false to index a table with a non-unique ID field.
boolean
Only the OWNER can execute this function.
A Tanzu Greenplum Text index is a Solr collection.
The name of the index created has the format:
<database_name>.<schema_name>.<table_name>
The contents of the <id_col_name>
column should, in most cases, be a unique ID for each row. It must be of type int4
, int8
, varchar
, or text
.
The index is broken up into sections called shards. By default, Tanzu Greenplum Text indexes have one shard per Greenplum Database segment. You can create an index with fewer shards by setting the Tanzu Greenplum Text configuration parameter gptext.idx_num_shards
to the number of shards before you create the index. See Specifying the Number of Index Shards for more information.
A shard can contain a maximum of 2147483519 documents, so the maximum number of documents a Tanzu Greenplum Text index can contain is 2147483519 times the number of shards.
If the <if_check_id_uniqueness>
argument is true, the default, a document with an ID matching an existing ID cannot be added to the index.
If the <if_check_id_uniqueness>
argument is false, documents with duplicate IDs are allowed to be added to the index. The content of other fields may or may not be the same as existing documents with the same ID. When a query returns multiple documents with the same ID, it is the user's responsibility to anticipate and handle the multiple documents. For example, a table could have a revision
column that is incremented when a new version of a document is added to the index, allowing queries that omit all but the most recent version from search results.
The gptext.terms()
function is not supported with indexes created with the <if_check_id_uniqueness>
argument set to false because term vectors require unique document IDs.
Tanzu Greenplum Text does not support indexes with fewer shards than the number of Greenplum Database segments with the <if_check_id_uniqueness>
argument set to false.
If you specify the name of the root partition for a partitioned table, the Tanzu Greenplum Text index created for the table will contain records for all partitions. The index records for documents added to the index have a __partition
field containing the name of the child partition table. See Searching Partitioned Tables for syntax to search by partitions.
Populate the new index with gptext.index()
.
The number of replicas for each shard is determined when the index is created. It is the value of the gptext.replication_factor
server configuration parameter, 2 by default.
If the gptext.failover_factor
server configuration parameter is set, gptext.create_index()
fails if the ratio of the number of Tanzu Greenplum Text nodes that are up to the total number of Tanzu Greenplum Text nodes is less than the gptext.failover_factor
value (from 0.0 to 1.0). Index shards can only be created on active Tanzu Greenplum Text nodes, so the gptext.failover_factor
parameter prevents overloading the active Tanzu Greenplum Text nodes when too many nodes are down.
Create an index, demo.wikipedia.articles
, with content
as the default search field.
=# SELECT * FROM gptext.create_index('wikipedia', 'articles', 'id', 'content');
Create an index, demo.wikipedia.articles
, with content
as the default search field. Index the id
, content
, and title
fields.
=# SELECT * FROM gptext.create_index('wikipedia', 'articles',
'{"id", "content", "title"}', '{"long", "text", "text"}',
'id', 'content');
Creates an empty Solr index for external documents.
gptext.create_index_external(<index_name>)
<index_name>
The name of the index to create. The name cannot contain periods (.
).
A Tanzu Greenplum Text external index is a Solr index for documents external to Greenplum Database, for example, PDF, Microsoft Word, XML, and HTML files. Unlike regular Tanzu Greenplum Text indexes, external indexes are not associated with a Greenplum Database table, but they can be searched with Tanzu Greenplum Text search functions.
Like regular Tanzu Greenplum Text indexes, an external index by default has one shard per Greenplum Database segment. To create an external index with fewer shards, set the gptext.idx_num_shards
configuration parameter to the desired number of shards before you create the index. See Specifying the Number of Index Shards for more information.
=# SELECT * FROM gptext.create_index_external('gptext-docs');
Populates an index by indexing data in a table.
gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>)
TABLE(<select-statement>)
The document content to be indexed, with data type anytable
. <index_name>
Name of the index that was created with gptext.create_index()
and is to be populated.
SETOF dbid INT, num_docs BIGINT
where dbid
is the dbid
of the segment that the documents were sent to, and num_docs
is the number of documents that were indexed.
You must have the INSERT or UPDATE privilege to execute this function.
<index_name>
must have been created with gptext.create_index()
.
The first argument to gptext.index()
is a table expression specifying rows to add to the index. To write this argument, wrap a SELECT
statement in a TABLE
function to select rows to pass to the gptext.index()
function. For example, TABLE(SELECT * FROM articles)
is a table expression that indexes all columns and rows of the articles table. The gptext.index()
function adds each row of the SELECT
result set as a document in the Tanzu Greenplum Text index.
You can selectively index table columns by changing the SELECT
list in the query and you can filter rows to index with a WHERE
clause.
The SELECT
statement does not have to query the base table that was specified when the Tanzu Greenplum Text index was created, but the results of the query must have the same columns and distribution policy. For example, you can index rows from a temporary table with the same definition as the base table. Or you could index the results from a join query that produces the same columns as the base table. If the results have a different distribution key, or no distribution key, as is the case with a join query, you must specify a compatible distribution policy by adding a SCATTER BY <column>
to the SELECT STATEMENT
. You can specify the column by name or by column number.
The output of the gptext.index()
function includes a two-column table with dbid
(the Greenplum segment ID) and num_docs
(the number of documents added to the index shard for that segment) as the columns.
After adding documents to the index, you must commit the index with gptext.commit_index(<index_name>)
.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
dbid | num_docs
------+----------
3 | 6
2 | 5
(2 rows)
Adds documents stored outside of Greenplum Database to a Tanzu Greenplum Text external index.
gptext.index_external(<url-list>, <index-name>)
<url-list>
A list of URLs for documents to add to the Tanzu Greenplum Text external index. The URLs may be expressed as an array or as a table-valued expression.
<index-name>
The name of the index to which the documents are to be added.
If the documents to add to the Tanzu Greenplum Text external index are in a store that requires authentication, use the gptext.external_login()
function to log in to the document store before you execute gptext.index_external()
.
If the document cannot be retrieved at the given URL, or an error occurs while indexing the document, Tanzu Greenplum Text inserts a row in the gptext.error_table
table. You can use gptext.recreate_error_table()
to create an empty error table before you call gptext.index()
.
The value of the Tanzu Greenplum Text custom server parameter gptext.idx_segment_error_limit
(default 10) is the number of errors that can occur on any one segment before the indexing operation is canceled.
When adding a document to an external index, Tanzu Greenplum Text calculates a 256-bit hash on the contents of the document. The hash is stored as a 64-byte hexadecimal value in the sha256
field. If you later add a document with a URL matching an existing document in the index, the new document is only added to the index if the newly calculated hash differs from the current value in the sha256
field.
This example adds a single PDF document to the index gptext-docs
.
=# SELECT * FROM gptext.index_external(
'{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}',
'gptext-docs');
dbid | num_docs
------+----------
3 | 0
2 | 1
(2 rows)
This example adds multiple HTML documents to the gptext-docs
external index by selecting URLs from a database table. Errors will be logged in the gptext.gptext_errrors
table.
=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
id bigint,
url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES
(1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'),
(2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'),
(3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
(4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
(5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
(6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
(7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
(8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'),
(9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
(10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
(11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
(12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
(13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
(14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
(15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
TABLE(SELECT url FROM gptext_html_docs),
'gptext-docs');
dbid | num_docs
------+----------
3 | 6
2 | 8
(2 rows)
=# SELECT * FROM gptext.error_table;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------
error_time | 2000-01-01 00:25:11.282769
index_name | gptext-docs
sqlcmd |
errmsg | Code: RUNTIME_ERROR, Message: 'http://gptext.docs.pivotal.io/210/topics/ext-indexes.html. '
rawdata | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
rawbytes |
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)
This example indexes a single HTML file in an Amazon S3 bucket named "gptext".
=# SELECT gptext.external_login('s3', 's3://s3-us-west-2.amazonaws.com', 'mys3_auth');
external_login
----------------
t
(1 row)
=# SELECT * FROM gptext.index_external('{s3://gptext/topics/queries.html}', 'gptext-docs');
dbid | num_docs
------+----------
3 | 0
5 | 0
4 | 0
2 | 1
(4 rows)
=# SELECT gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)
=# SELECT gptext.external_logout('s3');
external_logout
-----------------
t
(1 row)
Adds all documents in a directory in an external document source to a Tanzu Greenplum Text external index.
gptext.index_external_dir(<directory_url>, <index_name>)
<directory_url>
The URL for the directory with documents to add to the index.
<index_name>
The name of the Tanzu Greenplum Text external index to which the documents are to be added.
The gptext.index_external_dir()
function adds the documents in a directory and its subdirectories to a Tanzu Greenplum Text external index.
Log in to the document source using the gptext.external_login()
function before you call the gptext.index_external_dir()
function.
If you specify a file instead of a directory, an error is added to the gptext.error_table
table.
The ID for each file added to the index is the URL for the file in the external document source.
The Apache Tika library discovers the content_type
for each file.
The user who logs in to the external document source must have read permissions on the directory. The gptext.index_external_dir()
function adds to the index only those documents and subdirectories that the user has permission to read.
This example adds documents from a directory in an hdfs store to the Tanzu Greenplum Text webdocs
external index.
#= SELECT * FROM gptext.external_login('hdfs', 'hdfs://myhadoop:9000', 'myhadoop');
external_login
----------------
t
(1 row)
=# SELECT * FROM gptext.index_external_dir('hdfs://myhadoop:9000/gptext_web/public/300/', 'webdocs');
num_docs
----------
37
(1 row)
=# SELECT * FROM gptext.commit_index('webdocs');
commit_index
--------------
t
(1 row)
=# SELECT * FROM gptext.external_logout('hdfs');
external_logout
-----------------
t
(1 row)
Finishes an index operation. The results of an indexing operation are not available until this function is called for the index.
gptext.commit_index(<index_name>)
<index_name>
The name of the index to commit. If the table is partitioned this must be the name of the root table.
boolean
You must have the INSERT, UPDATE, or DELETE privilege to execute this function.
Must be called after gptext.index() and gptext.delete().
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
Retrieves metadata and content from an external document but does not add the content to the Tanzu Greenplum Text external index.
gptext.extract_rich_doc(<index-name>, <document-url>)
<index-name>
The name of the external index to test. Metadata and content are extracted from the document using the current Solr configuration files for this index.
<document-url>
The URL for the document to extract.
Column | Type |
---|---|
stream_name | text |
title | text |
author | text |
keywords | text |
created | text |
modified | text |
content | text |
The gptext.extract_rich_doc()
function can be used while testing or troubleshooting to verify the metadata and content extracted from external documents. The external document is not added to the Tanzu Greenplum Text external index. For example, you can use this function to test that changes you make to the solrconfig.xml
configuration file for the index have the desired effect.
If the document is in a document store that requires authentication, you must first log in using gptext.external_login()
.
The stream_name
column is the same as the <document-url>
function parameter. If the document is added to the index, this will be the unique document ID.
The title
, author
, keywords
, created
, and modified
columns contain metadata dependent on the document type Apache Tika detects.
The content
column contains the extracted document content.
View the metadata and content that would be extracted from an HTML document.
=# SELECT * FROM gptext.extract_rich_doc('gptext-docs', 'http://gptext.docs.pivotal.io/latest/topics/intro.html');
View the text extracted from an image file when the Tesseract optical character recognition (OCR) engine has been installed on the Greenplum Database cluster.
=# SELECT * FROM gptext.extract_rich_doc('gptext-docs', 'http://gptext.docs.pivotal.io/latest/graphics/ocrtest.png');
See Indexing Text Embedded in Images for information about setting up OCR for external image files and documents with embedded images.
Drops and recreates the gptext.error_table
database table.
gptext.recreate_error_table()
Boolean
If an error occurs while accessing a document to add to a Tanzu Greenplum Text external index, Tanzu Greenplum Text adds a record to the gptext.error_table
table. See gptext.error_table
for a description of this table. Users should not drop or modify the table.
Rows added to the gptext.error_table
table remain until you use the gptext.recreate_error_table()
function to create a new empty table.
If you attempt to execute gptext.recreate_error_table()
while it is in use for an indexing operation, a warning is raised and the function returns false without recreating the table.
=# SELECT gptext.recreate_error_table();
recreate_error_table
----------------------
t
(1 row)
To add documents from a document source that requires authentication, such as Hadoop, an ftp server, or Amazon S3 or other S3-compatible storage, log in before adding documents to a Tanzu Greenplum Text index and log out when done.
Logs in to an external document source before adding documents from the source to a Tanzu Greenplum Text external index.
gptext.external_login(<type>, <url>, <config-name>)
<type>
Identifies the type of the external document source. The valid types are 'ftp'
, 'hdfs'
, and 's3'
. The type is not case-sensitive.
<url>
The URL of the external document source.
<config-name>
The name of the configuration uploaded with the gptext-external upload
utility command.
You can log in to only one external document source at a time.
Use the gptext-external list
command to list the configurations that have been uploaded.
Log in to an hdfs file system using the myhdfs
configuration.
=# SELECT * FROM gptext.external_login('HDFS', 'hdfs://198.51.100.23:19000', 'myhdfs');
Log in to a ftp server using the myftp
configuration.
=# SELECT gptext.external_login('ftp', 'ftp://198.51.100.23', 'myftp');
The S3 connection <url>
has the format <protocol>://<s3-endpoint>[/<region>][/]
.
Amazon S3 — The <protocol>
is s3
. If the <s3-endpoint>
begins with s3-
or s3.
and is followed by a region code, for example s3-us-east-1.amazonaws.com
or s3.ap-southeast-1.amazonaws.com
, the /<region>
part of the URL is not required. An endpoint like s3.dualstack.us-east-1.amazonaws.com
, however, must include the region code at the end: s3://s3.dualstack.us-east-1.amazonaws.com/us-east-1
. See Amazon AWS Regions and Endpoints for a list of valid S3 endpoints by region.
Example: Log in to Amazon S3 using the mys3
configuration.
=# SELECT gptext.external_login('s3', 's3://s3.us-east-2.amazonaws.com', 'mys3');
Other S3-compatible storage — <protocol>
identifies the login protocol. For example, when you access MinIO, the <protocol>
may be http
or https
. The <s3-endpoint>
is an endpoint that identifies the server. The <region>/
part of the URL is optional, and depends on the configuration of the S3 storage server.
Example: Log in to MinIO using the myminio
configuration.
=# SELECT gptext.external_login('s3', 'http://my.minio.server:9000', 'myminio');
Logs out of an external document source after adding documents from the source to a Tanzu Greenplum Text external index.
gptext.external_logout(<type>)
<type>
Identifies the type of the external document source. The supported types are 'ftp'
, 'hdfs'
, and 's3'
. The type is not case-sensitive.
You can log in to only one external document source at a time. To index documents from another source, you must first call gptext.external_logout()
and then log in to the new source with gptext.external_login()
.
Log out of an hdfs file system.
=# SELECT * FROM gptext.external_logout('HDFS');
Log out of an ftp server.
=# SELECT gptext.external_logout('ftp');
Log out of S3.
=# SELECT gptext.external_logout('s3');
You can change an index by adding or dropping fields, reverting an index to its previous state, or deleting the index.
Adds a field to your schema if the field was added to the database after the index was created.
gptext.add_field(<index_name>, <field_name>[, <is_default_search_col> [, <if_enable_terms>]])
<index_name>
The name of the index to which you want to add the field. If the table is partitioned this must be the name of the root table.
<field_name>
The name of the field to be indexed.
<is_default_search_col>
Optional. Boolean value. Is this to become the default search column (field)? <if_enable_terms>
Optional. Boolean value. Enable terms support on this field when added to the Tanzu Greenplum Text index.
SETOF boolean
Only the OWNER can execute this function.
Call this function for each field you add.
Before and after you add one or more fields, reload the Solr configuration files with gptext.reload_index()
. The initial reload_index()
call is required because of Solr 4.0 behavior and may not be required in subsequent versions.
After you add one or more fields, you should reindex the table and commit the index with gptext.commit_index()
.
Adds the field external_links
to the index, then recreates, repopulates, and commits the index.
=# ALTER TABLE wikipedia.articles ADD external_links text;
ALTER TABLE
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.add_field('demo.wikipedia.articles', 'external_links', false, false);
INFO: Add field: external_links for index: demo.wikipedia.articles
add_field
-----------
t
(1 row)
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
Deletes all documents that match the search query.
gptext.delete(<index_name>, <query>)
<index_name>
The name of the index. <query>
Documents matching this query will be deleted. To delete all documents use the query'*'
or '*:*'
.
boolean
You must have the DELETE privilege to execute this function.
After a successful delete, commit the index using gptext.commit_index(<index_name>)
.
Delete all documents containing the word "unverified" in the default search field:
=# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'unverified');
delete
--------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
Delete all documents from the index:
=# SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');
delete
--------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
Removes a field from your schema.
gptext.drop_field(<index_name>, <field_name>)
<index_name>
The name of the index from which to drop the field. If the table is partitioned this must be the name of the root table.
<field_name>
The name of the field to drop.
boolean
Only the OWNER can execute this function.
Call this function for each field you drop.
Before and after dropping one or more fields, you must reload the Solr configuration files with gptext.reload_index()
, then commit the index with gptext.commit_index()
.
The column __partition
in indexes for partitioned database tables cannot be dropped.
The initial reload_index()
is required by Solr 4.0 behavior and may not be necessary in subsequent versions.
Drops the field external_links
from the index.
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.drop_field('demo.wikipedia.articles', 'external_links');
INFO: Drop field: external_links for index: demo.wikipedia.articles
drop_field
------------
t
(1 row)
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
Removes an index.
gptext.drop_index(<index_name>)
<index_name>
The name of the index to drop. If the database table is partitioned, this must be the name of the root table.
boolean
Only the OWNER can execute this function.
A dropped index cannot be recovered.
=# SELECT * FROM gptext.drop_index('demo.wikipedia.articles');
drop_index
------------
t
(1 row)
Search functions enable querying an index.
When using the search functions, you can change the query parser used by Solr at query time. A different query parser may be required, depending on the nature of the query. See Using Advanced Querying Options for a list of the query parsers Tanzu Greenplum Text supports.
To change the query parser at query time, use the defType
Solr option with the gptext.search()
function.
To change the query parser for any search function at query time, use the Solr localParams syntax, replacing the <query>
term with '{!type=edismax}<query>'
.
With the Tanzu Greenplum Text Universal Query Parser, you can use features from any of the other supported query parsers in a single query. To use the Universal Query Parser, replace the <query>
term with '{!gptextqp}<query>'
. See Using the Universal Query Parser for information and examples.
Searches an index.
gptext.search(TABLE(<select-statement>), <index_name>, <search_query>, <filter_queries>[, <options>])
TABLE(<select-statement>)
A table-valued expression. This is an anytable
pseudo-type, specified using the format:
TABLE(<select_statement>)
The Greenplum Database query planner processes the <select_statement>
to estimate the number of result rows, but the query is not executed by the body of the search function. You can use the expression TABLE(SELECT 1 SCATTER BY 1)
as a simple, syntactically correct value for this argument.
<index_name>
The name of the index to search. If the database table is partitioned you can specify the name of a sub-partition table to search. <search_query>
Text value containing a Solr text search query. <filter_queries>
A text array of filter queries, if any. If none, set this parameter to null
. <options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.
SETOF gptext.search_scored_result
This is a composite type with the following columns:
Column | Type |
---|---|
id | text |
score | double precision |
hs (conditional) | gptext hstore |
rf (conditional) | text |
The id
column is returned as text, even if the <id_col>
specified in the gptext.create_index()
function is an integer type. If you order results by id
or join search results with the original table on id
, you must cast the returned id
column to the correct integer type in your query. For example, the following search query casts the id
returned by the search query to an INT8 type to join with the numeric id
column in the wikipedia.articles
table and to sort the results numerically. However, the id
column in the results is a text value and is therefore displayed left-justified.
SELECT s.id, s.score, a.title
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.wikipedia.articles', '*:*', null) s
WHERE s.id::INT8 = a.id
ORDER BY s.id::INT8;
id | score | title
----------+-------+---------------------------
25784 | 1 | Renewable energy
27743 | 1 | Solar energy
54838 | 1 | Biogas
55017 | 1 | Fusion power
65499 | 1 | Soil salinity
113728 | 1 | Geothermal energy
213555 | 1 | Solar updraft tower
533423 | 1 | Solar water heating
608623 | 1 | Ethanol fuel
855056 | 1 | Forward osmosis
2008322 | 1 | Vehicle-to-grid
2120798 | 1 | Lithium economy
2988035 | 1 | Vortex engine
4711003 | 1 | Osmotic power
7906908 | 1 | Biomass
13021878 | 1 | Geothermal power
13690575 | 1 | Solar power
14090587 | 1 | Low-carbon power
14205946 | 1 | Algae fuel
18965585 | 1 | Pressure-retarded osmosis
22391303 | 1 | Liquid nitrogen engine
26787212 | 1 | Reverse electrodialysis
53716476 | 1 | Seaweed fuel
(23 rows)
If the <options>
parameter is included in gptext.search()
, the result includes the offsets
column. This column contains key-value pairs, where the key is the column name and the value is a comma-separated list of offsets to locations where the search term occurs. This data is used by the gptext.highlight()
function to add highlighting tags to the column data. If highlighting is not enabled with the 'hl=true'
option, the offsets
column is NULL
.
If the fl
option is included in the <options>
parameter to specify additional fields to add to the result, the rf
column contains the additional fields in a formatted text value. The gptext.gptext_retrieve_field()
function can be used to extract a single field value from the rf
column. There are variants of the gptext.gptext_retrieve_field()
function to retrieve integer and float values from the rf
column value.
You must have the SELECT privilege to execute this function.
Solr queries allow the following optional refinements, specified as an ampersand-delimited list in the options
parameter.
defType
The name of the query parser to use for this query.
Example: defType=edismax
rows
The maximum number of rows to return per segment. If omitted, all rows are returned.
Example: rows=100
returns 100 rows per segment or all rows if there are fewer than 100. sort
Sorts on a field or score in ascending or descending order.
Examples:
sort=score desc
(default if no sort defined)sort=date_time asc
sort=date_time asc score desc
sorts on date_time
ascending, then on score
descendingstart
The number of the first record to return.
Examples:
start=0
default: returned records start with record 0start=25
returned records start with record 25hl
Enable highlighting.
Example: hl=true
hl.fl
Comma-separated list of field names to consider when highlighting.
Examples:
hl.fl=message_text
hl.fl=title,content
fl
Comma-separated list of fields to include in search results. The fields must have been set to stored=true
in the managed-schema for the index.
Example: fl=title,refs
The output includes a table with columns id
(the ID named in gptext.create_index()) and score
(the tf-idf
score). A column named offsets
is included if highlighting is specified in the options
parameter. A column named rf
is included when a list of additional fields to include is specified in the <options>
parameter.
To change the query parser at query time, specify the defType
option in the options parameter list. For example, setting the options parameter to 'rows=100&defType=edismax'
limits the output to 100 rows per segment and will change the query parser to edismax
.
The TABLE
query is planned and affects the estimate for gptext.search()
, but does not execute. For example, if your query includes
gptext.search(TABLE(SELECT * FROM t), ...)
the query planner estimates the number of results as the number of rows in t
. This can cause the query planner to ignore the use of an index scan. Use a query like TABLE(SELECT 1 SCATTER BY 1)
to avoid this issue.
If you do not specify options, gptext.search()
returns all rows.
The options separator has changed from comma to ampersand (&) in order to support highlighting. If you do not use highlighting, you can revert to using the comma separator by setting the gptext.search_param_separator
to ','
.
The Solr option rows
specifies the maximum number of rows to return per segment. For example, if you have four segments,'rows=100'
returns at most a total of 400 rows. To limit the number of rows returned for an entire query, set a LIMIT
in the SQL query. For example, the following query returns at most 20 rows for the query:
=# SELECT t.id, q.score, t.message_text FROM twitter.message t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
'*:*', null, null) q
WHERE t.id=q.id::int8 LIMIT '20';
The gptext hstore
type is a limited form of the Postgres hstore
type. It only has the hstore
input and output functions implemented, as gptext_hstore_in
and gptext_hstore_out
.
Runs a Tanzu Greenplum Text query that looks for Wikipedia articles that contain the term "optimization", and joins the results to the original Greenplum Database articles
table:
=# SELECT a.id, a.title, q.score
FROM wikipedia.articles a,
gptext.search(TABLE(SELECT * FROM wikipedia.articles),
'demo.wikipedia.articles', 'optimization', null) q
WHERE a.id = q.id::int8 ORDER BY score DESC;
id | title | score
----------+---------------------------+------------
213555 | Solar updraft tower | 1.5528862
18965585 | Pressure-retarded osmosis | 0.89540845
7906908 | Biomass | 0.8692636
25784 | Renewable energy | 0.7473389
533423 | Solar water heating | 0.7186527
608623 | Ethanol fuel | 0.6943706
27743 | Solar energy | 0.6943706
2008322 | Vehicle-to-grid | 0.6352812
55017 | Fusion power | 0.6347449
14205946 | Algae fuel | 0.6347449
13690575 | Solar power | 0.58286035
(11 rows)
- Returns 5 rows from each segment with the text "iphone" highlighted in the `message_text` column. This example requires that you have enabled terms on the `message_text` field in the `demo.twitter.message` table. See the example in the [`gptext.enable_terms()`](#topic19) reference.
``` sql
=# SELECT t.id,
gptext.highlight(t.message_text, 'message_text', s.hs) message
FROM twitter.message t,
gptext.search(TABLE(SELECT 1 SCATTER BY 1),
'demo.twitter.message','{!gptextqp}iphone', null,
'rows=5&hl=true&hl.fl=message_text' ) s
WHERE t.id = s.id::int8;
id | message
----------+---------------------------------------------------------------------------
------------------------------------------
19714120 | @ayee_Eddy2011 I love pancakes too! #<em>iPhone</em> #app
19284329 | #nowplaying on my <em>iPhone</em>: Daft Punk - "Digital Love"
19416451 | I'm in love with my new <em>iPhone</em>(:
20257190 | Love my #<em>iphone</em> - only problem now? I want an #Ipad!
20759274 | Dropped frutopia on
: My phone... #ciaowaterdamage I hate <em>iPhones</em>.
20473459 | i love my iphone4 but I'm excited to see what the iphone5 has to offer #ga
dgets #<em>iphone</em> #apple #technology
19424811 | I hate
: <em>iPhones</em>:
20663075 | RT @indigoFKNvanity: I hate the auto correct on <em>iPhones</em> !!!!!!!!!
20350436 | I absolutely love how fast this phone works. Love the <em>iPhone</em>.
20042822 | @KDMC23 ohhhh!!! I hate <em>Iphone</em> Talk!
(10 rows)
Returns the number of documents that match the search query.
gptext.search_count(<index_name>, <search_query>, <filter_queries>, <options>)
<index_name>
The name of the index. <search_query>
The search query. <filter_queries>
A comma-delimited array of filter queries, if any. If none, set this parameter to null
.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.
bigint
You must have the SELECT privilege to execute this function.
=# SELECT * FROM gptext.search_count('demo.wikipedia.articles', 'bubble', null);
count
-------
3
(1 row)
Searches a Tanzu Greenplum Text external index.
gptext.search_external(<table-exp>, <index_name>, <search_query>,
<filter_queries>[, <options>])
<table>
A table-valued expression. Because external indexes are not associated with a database table, this parameter is ignored. An expression like the following is sufficient:
TABLE(SELECT 1 SCATTER BY 1)
<index_name>
The name of the Tanzu Greenplum Text external index to search.
<search_query>
Text value containing a Solr text search query.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null
.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.
SETOF gptext.search_external_result
This type has the following columns:
Column | Type
---------------+------------------ id | text title | text subject | text description | text comments | text author | text keywords | text category | text resourcename | text url | text content_type | text last_modified | text links | text sha256 | text content | text score | double precision meta | text
The last column, meta
is present only if the optional <options>
argument is included in the search.
When you add an external document to the index, Apache Tika extracts a core set of metadata from the document, the columns listed in the Return type section. If any of these core metadata values are not present or do not exist in the document type, the value of the column in the result row is null.
If the <options>
argument is supplied, the results contain an additional text column named meta
. The meta
column contains additional document-type-specific metadata. You can use the gptext.gptext_retrieve_field()
function and its variants to extract individual metadata values by name from the meta
column.
If the <options>
argument contains the fl=<list>
Solr option, Solr returns values only for the columns included in <list>
and the id
, score
, and meta
columns. Other columns in the result set will have null values. It is more efficient to filter out columns in Solr than to retrieve all columns from Solr and then choose a subset of columns in the SQL SELECT
statement.
Finds HTML documents containing the term "facet".
=# \x on
=# SELECT id, title
FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
'facet', '{content_type:*html*}');
-[ RECORD 1 ]--------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
title | GPText Function Reference |
| Tanzu Greenplum Text Docs
-[ RECORD 2 ]--------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/queries.html
title | Querying GPText Indexes |
| Tanzu Greenplum Text Docs
-[ RECORD 3 ]--------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/guc_ref.html
title | GPText Configuration Parameters |
| Tanzu Greenplum Text Docs
Although just two columns are in this result set, the data Greenplum Database receives from Solr includes all core metadata fields, including the content
field, which contains the full text of the document. The next example shows how to limit the data transferred from Solr to Greenplum Database with the Solr options argument.
This example lists the id
, title
, sha256
, and score
columns from the core metadata and extracts meta_creation_date
from the additional metadata supplied for PDF documents in the gptext-docs
external index. The fl=title,sha256
Solr option prevents Solr from transferring unneeded fields from the index to Greenplum Database. (The id
and score
columns are always transferred.)
=# \x on
=# SELECT id, title, sha256, score,
gptext.gptext_retrieve_field(meta, 'meta_creation_date') created
FROM gptext.search_external(TABLE(SELECT 1 SCATTER by 1), 'gptext-docs', '*:*',
'{content_type:*pdf*}', 'fl=title,sha256');
-[ RECORD 1 ]-------------------------------------------------------------
id | http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf
title | Tanzu Greenplum Text 2.1.3 Documentation | Tanzu Greenplum Text Docs
sha256 | 2E063DF5037B9ACC6E180681AE6838077BC5F7A362B4A1E67D9D8FF3E4DD7F3D
score | 1
created | 2017-09-22T17:22:54Z,2017-09-22T17:22:54Z
-[ RECORD 2 ]-------------------------------------------------------------
id | http://gpdb.docs.pivotal.io/latest/pdf/GPDB510Docs.pdf
title | Version 5.1.0
sha256 | AF0B71D032C99A6BE817E1FA2FB774EB7B4D47D75A755ABF54F4F60FEBB92FF7
score | 1
created | 2017-10-31T19:12:41Z,2017-10-31T19:12:41Z
Retrieves a single field from the rf
or meta
search result column as a text value.
gptext.gptext_retrieve_field(rf | meta, <field_name>)
rf | meta
The name of the column in which Tanzu Greenplum Text returns fields. This is rf
for search results from regular Tanzu Greenplum Text indexes and meta
for search results from Tanzu Greenplum Text external indexes.
<field_name>
The name of the field to retrieve.
The fl=<field_list>
Solr search option is added to the <options>
parameter of the gptext.search()
function to request additional stored fields. The additional fields are returned in the results in a column named rf
(meta
for external indexes). This column value has a format like the following:
column_value { name: "_version" value: "1544714234398507008" } column_value { name: "revision" value: "9.70" } column_value { name: "author" value: "jdough" }
The gptext.gptext_retrieve_field()
function extracts the value for a single specified field and returns it as a text value. If there is no field with the specified name in the rf
column, it returns NULL.
Storing additional fields in an index requires editing managed-schema
to specify the fields that should be stored. See Storing Additional Fields in an Index for instructions.
Retrieves a single field from the rf
or meta
search result column as an integer value.
gptext.gptext_retrieve_field_int(rf | meta, <field_name>)
rf | meta
The name of the column containing fields to be retrieved. For regular Tanzu Greenplum Text indexes, it is rf
. For Tanzu Greenplum Text external indexes it is meta
.
<field_name>
The name of the integer field to retrieve.
The gptext.gptext_retrieve_field_int()
function is the same as the gptext.gptext_retrieve_field()
function, except that the extracted field value is converted to an integer value.
Retrieves a single field from the search result column as a float value.
gptext.gptext_retrieve_field_float(rf | meta, <field_name>)
rf | meta
The name of the column containing fields to be retrieved. For regular Tanzu Greenplum Text indexes, it is rf
. For Tanzu Greenplum Text external indexes it is meta
.
<field_name>
The name of the float field to retrieve.
The gptext.gptext_retrieve_field_float()
function is the same as the gptext.gptext_retrieve_field()
function, except that the extracted field value is converted to a float value.
Highlights terms by inserting markup tags into data.
gptext.highlight(<column_data>, <column_name>, <offsets>)
<column_data>
The text data from the table which is to be tagged with highlighting tags.
<column_name>
The name of the corresponding column from the table. <offsets>
A gptext hstore
value that contains key-value pairs that indicate the locations of the text to highlight within the column data. See Remarks for information about the gptext hstore
data type.
To use highlighting, full term vectors, postings with term vectors, or stored
fields must be enabled before creating the index.
gptext.enable_terms()
for each field where you want to enable terms, then index or re-index with gptext.index()
.gptext.index()
.managed-schema
as described in Storing Additional Fields in an Index, then index or re-index with gptext.index()
.The offsets
parameter is a gptext hstore
, where each key is a column name and the value is a comma-separated list of offsets into the column data. This hstore
is constructed by gptext.search() with highlighting enabled in the offsets
parameter.
Following is an example of the offsets
hstore
content:
"field1"=>"0:5,9:14","field2"=>"13:20"
gptext.highlight()
will insert two sets of tags into the field1
data and one set into the field2
data at the indicated offsets.
The gptext hstore
type is a limited form of the Postgres hstore
type. It has only the hstore
input and output functions implemented, as gptext_hstore_in
and gptext_hstore_out
.
The highlight tags are defined by the gptext.hl_pre_tag
and gptext.hl_post_tag
server configuration parameters. Their default values are <em>
and </em>
, respectively.
Highlights terms in search results from external indexes by inserting markup tags.
gptext.highlight_external(<table_exp>, <index>, <search_query>, <filter_queries>[, <options>])
<table_exp>
A table expression, ignored for external indexes. An expression such as TABLE(SELECT 1 SCATTER BY 1)
is sufficient.
<index>
Name of the index containing data to highlight. <search_query>
Text value containing a Solr text search query.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null
.
<options
An optional ampersand-delimited list of Solr query parameters. See Solr options
.
The gptext.highlight_external()
function searches a Tanzu Greenplum Text external index and encloses the search terms in markup tags in the returned results.
Search for and highlight the terms "zookeeper" and "solr" in HTML documents.
=# SELECT id, content FROM gptext.highlight_external(TABLE(SELECT 1 SCATTER BY 1),
'gptext-docs','{!gptextqp}zookeeper AND solr', '{content_type:*html*}',
'rows=2');
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/administering.html
content | includes security considerations, monitoring <em>Solr</em> index statistics, managing and monitoring <em>ZooKeeper</em>
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/performance.html
content | problems can result from resource contention between the Greenplum Database, <em>Solr</em>, and <em>ZooKeeper</em> clusters
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/ha.html
content | /topics/utility_ref.html GPText Management Utilities
| rect /210/topics/type_ref.html GPText and <em>Solr</em>
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/indexes.html
content | /topics/utility_ref.html GPText Management Utilities
| rect /210/topics/type_ref.html GPText and <em>Solr</em>
Highlights terms in local search results by inserting markup tags into data.
gptext.highlight_instant_content(<table_query>, <index_name>, <highlight_query>)
<table_query>
A table SELECT
query that returns two columns: the first column is an <id>, the second column is the text <content> to be highlighted.
<index_name>
Name of the index containing data to highlight. <highlight_query>
Text value containing a Solr text search query.
The gptext.highlight_instant_content()
function provides a more disk efficient highlighting option. When you invoke this function, Greenplum Text sends the original text to Solr before it applies the highlight.
The function returns records with the following columns:
id
- The first column returned by the <table_query>
.highlighted_content
- The result of applying the <highlight_query>
on the original content.You can use the returned id
to join the results with other tables.
Search for and highlight the term "solr".
=# SELECT t.id, highlighted_content, date FROM
gptext.highlight_instant_content(TABLE(SELECT t.id, content FROM
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.test', 'solr',
null, 'rows=10') s JOIN hli_test t ON s.id::bigint=t.id),
'demo.public.test', 'solr') h JOIN hli_test t ON h.id::bigint=t.id;
Faceting breaks up a search result into multiple categories, showing counts for each.
The faceted_field_search()
function breaks search results into field name categories.
gptext.faceted_field_search(<index_name>, <query>, <filter_queries>, <facet_fields>,
<facet_limit>, <minimum>, <options>)
<index_name>
The name of the index. <query>
Query statement. Use *:*
to query for all results. <filter_queries>
A text array of filter queries, if any. If none, set this parameter to null
. <facet_fields>
An array of field names to facet. Use Greenplum Database array notation. <facet_limit>
Maximum number of results to be returned for each aggregation (facet). <minimum>
Minimum number of results required before an aggregation (facet) will be returned. Enter 0 to return all facets.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.
SETOF gptext.facet_field_result
This is a composite type with the following columns:
Column | Type |
---|---|
field_name | text |
field_value | text |
value_count | bigint |
You must have the SELECT privilege to execute this function.
None.
Facet all tweets on spam
and truncated
fields.
=# SELECT *
FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null,
'{spam, truncated}', 2, 0);
field_name | field_value | value_count
------------+-------------+-------------
spam | true | 1730
truncated | false | 1705
truncated | true | 25
(3 rows)
Facet on author_id
, no limit, with a minimum of five tweets, all tweets. Selects five authors with at least two tweets.
=# SELECT *
FROM gptext.faceted_field_search('demo.twitter.message', '*:*', null,
'{author_id}', 5, 2);
field_name | field_value | value_count
------------+-------------+-------------
author_id | 102185050 | 9
author_id | 202305785 | 2
author_id | 64111799 | 2
author_id | 45326213 | 2
author_id | 195035308 | 2
(5 rows)
The faceted_query_search()
function breaks search results into categories defined by queries that you provide.
gptext.faceted_query_search(<index_name>, <query>, <filter_queries>,
<facet_queries>, <options>)
<index_name>
The name of the index.
<query>
Query statement. Use *:*
to query for all results.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null
.
<facet_queries>
Type: text[]. Required. An array of facet queries.
<options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.
SETOF gptext.facet_query_result
This is a composite type with the following columns:
Column | Type |
---|---|
query_name | text |
value_count | bigint |
You must have the SELECT privilege to execute this function.
None.
This example uses Solr queries to divide twitter authors into three classes based on number of followers.
=# SELECT *
FROM gptext.faceted_query_search('demo.twitter.message', '*:*', null,
'{author_followers_count:[0 TO 5], author_followers_count:[6 TO 10],
author_followers_count:[11 TO *]}');
query_name | value_count
----------------------------------+-------------
author_followers_count:[0 TO 5] | 39
author_followers_count:[11 TO *] | 1632
author_followers_count:[6 TO 10] | 36
(3 rows)
The faceted_range_search()
function breaks search results into range categories over a numeric or date field, with ranges defined by the <range_start>
, <range_end>
, and <range_gap>
arguments.
gptext.faceted_range_search(<index_name>, <query>, <filter_queries>,
<field_name>, <range_start>, <range_end>, <range_gap>, <options>)
<index_name>
The name of the index.
<query>
Query statement. Use *:*
to query for all results.
<filter_queries>
A text array of filter queries, if any. If none, set this parameter to null
.
<field_name>
The name of the field on which to facet.
<range_start>
Beginning of the range.
<range_end>
End of the range.
<range_gap>
Size of range increment, a text value. <options>
An optional ampersand-delimited list of Solr query parameters. See Solr options.
SETOF gptext.facet_range_result
This is a composite type with the following columns:
Column | Type |
---|---|
field_name | text |
range_value | text |
value_count | bigint |
You must have the SELECT privilege to execute this function.
Facet on date range from midnight August 1, 2011 to midnight November 1, 2011, with a 7-day gap.
=# SELECT *
FROM gptext.faceted_range_search('demo.twitter.message', '*:*', null,
'created_at', '2011-08-01T00:00:00Z',
'2011-11-01T00:00:00Z', '+7DAY');
field_name | range_value | value_count
------------+----------------------+-------------
created_at | 2011-08-01T00:00:00Z | 0
created_at | 2011-08-08T00:00:00Z | 0
created_at | 2011-08-15T00:00:00Z | 0
created_at | 2011-08-22T00:00:00Z | 52
created_at | 2011-08-29T00:00:00Z | 189
created_at | 2011-09-05T00:00:00Z | 545
created_at | 2011-09-12T00:00:00Z | 0
created_at | 2011-09-19T00:00:00Z | 109
created_at | 2011-09-26T00:00:00Z | 69
created_at | 2011-10-03T00:00:00Z | 59
created_at | 2011-10-10T00:00:00Z | 206
created_at | 2011-10-17T00:00:00Z | 147
created_at | 2011-10-24T00:00:00Z | 112
created_at | 2011-10-31T00:00:00Z | 94
(14 rows)
The optional terms component saves terms output by the Solr index analyzer chain to a vector in the index. Enabling terms is requried to use parts-of-speech and named entity recognition.
Enables term vectors and positions to allow extracting terms and their positions from fields of data type text
.
gptext.enable_terms(<index_name>, <field_name>)
<index_name>
The name of the index for which you want to enable terms.
<field_name>
The name of the field for which you want to enable terms.
boolean
Only the OWNER can execute this function.
Solr can mark terms and their positions in documents when indexing. This capability is deactivated by default. Use gptext.enable_terms()
to activate the capability.
Call gptext.enable_terms()
for each field where you want to enable terms.
After calling this function, you must index or re-index with gptext.index().
=# SELECT * FROM gptext.enable_terms('demo.twitter.message', 'message_text');
WARNING: Enable terms for field: message_text of index: demo.twitter.message successfully. Reindex data needed.
enable_terms
--------------
t
(1 row)
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
dbid | num_docs
------+----------
3 | 947
2 | 1020
(2 rows)
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
commit_index
--------------
t
(1 row)
Gets NER-tagged terms for a text field configured for NER (Named Entity Recognition) from documents in a Solr index that match a query.
gptext.ner_terms(TABLE(<select-statement>), <index_name>, <field_name>,
<search_query>, <filter_queries>[, <options>])
gptext.ner_terms(<index_name>, <field_name>, <search_query>,
<filter_queries>[, <options>])
TABLE(<select-statement>)
This parameter is ignored and can be omitted.
A table expression that specifies a SELECT
statement. Specify in the format:
TABLE(SELECT * FROM <src_table>;)
<index_name>
The name of the index to query for NER terms. <field_name>
The name of the NER-enabled text field to query for NER terms.
<search_query>
A query that matches documents to include in the results. <filter_queries>
A comma-delimited array of filter queries, if any. If none, set this parameter to null
. <options>
An optional, comma-delimited list of Solr query parameters.
SETOF gptext.ner_term_info
This is a composite type with the following columns:
Column | Type |
---|---|
id | text |
term | text |
entity_type | text |
frequency | integer |
You must have the SELECT privilege to execute this function.
The gptext.ner_terms()
function displays the NER-tagged terms in a field's terms vector for documents that match a query. The <fieldType>
of the text field must be configured in the managed-schema
for the index with the OpenNLP tokenizers and filters. See Using Named Entity Recognition with GPText for NER configuration instructions.
NER-tagged terms have the form _ner_<entity-type>_<token-text>
. Terms that do not have the _ner_
prefix are not included in the output of this function. Use the gptext.terms()
function to retrieve all terms from matched documents, including NER-tagged terms.
The output of the gptext.ner_terms()
function contains one row for each tagged entity in documents that match the search. The term
column contains the term with the _ner_<entity-type>-
prefix removed. The entity_type
column contains the type of the entity, and the frequency
column is the number of times the term appears in the document.
Terms with the same entity type, such as person, organization, and location, that occur in consecutive positions in the document are concatenated to create a compound term in the output. For example, if the terms _ner_organization_federal
, _ner_organization_reserve
, and _ner_organization_board
appear in consecutive positions in the document, the compound term federal reserve board
appears in the output. When you perform a search on tagged terms and include highlighting, the compound term is highlighted, rather than each of the consecutive terms.
Get the NER-tagged terms for the content
field from the document with id 842613485.
=# SELECT * FROM gptext.ner_terms('demo.public.news_demo', 'content', 'id:842613485', null);
id | term | entity_type | frequency
-----------+--------------------------------+--------------+-----------
842613485 | alan murray | person | 1
842613485 | citicorp information services | organization | 1
842613485 | morgan stanley | organization | 1
842613485 | stephen roach | person | 1
842613485 | mr. murray | person | 1
842613485 | alan | person | 1
842613485 | mr. | person | 1
842613485 | murray | person | 2
842613485 | roach | person | 1
842613485 | stephen | person | 1
842613485 | board | organization | 1
842613485 | citicorp | organization | 1
842613485 | consumer | organization | 1
842613485 | fed | organization | 1
842613485 | federal | organization | 1
842613485 | information | organization | 1
842613485 | morgan | organization | 1
842613485 | reserve | organization | 1
842613485 | services | organization | 1
842613485 | stanley | organization | 1
842613485 | federal reserve board | organization | 1
(21 rows)
Gets the term vectors for the specified field from documents in a Solr index. You can use gptext.terms()
to create terms tables.
gptext.terms(TABLE(<select-statement>), <index_name>, <field_name>,
<search_query>, <filter_queries>[, <options>])
gptext.terms(<index_name>, <field_name>, <search_query>, <filter_queries>[, options])
TABLE(<select-statement>)
A table expression that specifies a SELECT
statement. Specify in the format:
TABLE(SELECT 1 SCATTER BY 1)
This parameter is ignored and can be omitted.
<index_name>
The name of the index to query for terms. The index must have unique document IDs. <field_name>
The name of the field to query for terms.
<search_query>
A query that the document must match. <filter_queries>
A comma-delimited array of filter queries, if any. If none, set this parameter to null
. <options>
An optional, comma-delimited list of Solr query parameters.
SETOF gptext.term_info
This is a composite type with the following columns:
Column | Type |
---|---|
id | text |
term | text |
positions | integer[] |
You must have the SELECT privilege to execute this function.
The gptext.terms()
function cannot be used with indexes created with if_check_id_uniqueness
set to false.
To enable using gptext.terms()
, execute the Tanzu Greenplum Text function gptext.enable_terms()
, then reindex with gptext.index()
.
If the field has been tagged with Named Entity Recognition (NER) or Parts-of-Speech (POS) filters, the tagged terms appear in the gptext.terms()
output in raw format, for example _ner_person_david
. Use the gptext.ner_terms()
function to view only NER-tagged terms.
This example creates a terms table from the output of the gptext.terms()
function.
=# CREATE TABLE twitter.terms AS
SELECT * FROM gptext.terms(
'demo.twitter.message', 'message_text', 'iphone', null)
DISTRIBUTED BY (id);
SELECT 5385
These functions provide statistics and status for indexes managed by the Tanzu Greenplum Text cluster.
Shows the status of indexes managed by the Tanzu Greenplum Text cluster.
gptext.cluster_status()
SETOF gptext.cluster_status_result
This is a composite type with the following columns:
Column | Type |
---|---|
index_name | text |
max_shards_per_node | integer |
router | text |
replication_factor | integer |
auto_add_replicas | boolean |
znode_version | integer |
config_name | text |
partitioned | boolean |
=# SELECT * FROM gptext.cluster_status();
index_name | max_shards_per_node | router | replication_factor | auto_add_
replicas | znode_version | config_name | partitioned
-------------------------+---------------------+----------+--------------------+----------
---------+---------------+-------------------------+-------------
demo.twitter.message | 4 | implicit | 2 | f
| 8 | demo.twitter.message | t
demo.wikipedia.articles | 4 | implicit | 2 | f
| 8 | demo.wikipedia.articles | f
(2 rows)
Shows the number of documents indexed and total disk space used for Tanzu Greenplum Text indexes.
gptext.index_size([<index_name>])
<index_name>
The name of the index. Optional. Returns sizes for all indexes if no index is specified.
SETOF gptext.index_size_result
This is a composite type with the following columns:
Column | Type |
---|---|
index_name | text |
num_docs | bigint |
size_in_bytes | bigint |
=# SELECT * FROM gptext.index_size();
index_name | num_docs | size_in_bytes
-------------------------+----------+---------------
demo.wikipedia.articles | 23 | 500515
demo.twitter.message | 1730 | 767118
gptext-docs | 16 | 618231
(3 rows)
=# SELECT * FROM gptext.index_size('demo.wikipedia.articles');
index_name | num_docs | size_in_bytes
-------------------------+----------+---------------
demo.wikipedia.articles | 23 | 500515
(1 row)
Shows status of replicas for a specified index or for all indexes.
gptext.index_status([<index_name>])
<index_name>
The name of the index. Optional. Returns status for all indexes if no index is specified.
SETOF gptext.index_status_result
This is a composite type with the following columns:
Column | Type |
---|---|
index_name | text |
shard_name | text |
shard_state | text |
replica_name | text |
replica_state | text |
core | text |
node_name | text |
base_url | text |
is_leader | boolean |
partitioned | boolean |
external_index | boolean |
Show status for a single index.
=# SELECT * FROM gptext.index_status('demo.wikipedia.articles');
index_name | shard_name | shard_state | replica_name | replica_state | core
| node_name | base_url | is_leader | partitioned | external_index
-------------------------+------------+-------------+--------------+---------------+------------------------------
--------------+-----------------+------------------------+-----------+-------------+----------------
demo.wikipedia.articles | shard1 | active | core_node3 | active | demo.wikipedia.articles_shard
1_replica_n1 | sdw2:18983_solr | http://sdw2:18983/solr | t | f | f
demo.wikipedia.articles | shard1 | active | core_node5 | active | demo.wikipedia.articles_shard
1_replica_n2 | sdw1:18983_solr | http://sdw1:18983/solr | f | f | f
demo.wikipedia.articles | shard2 | active | core_node7 | active | demo.wikipedia.articles_shard
2_replica_n4 | sdw1:18984_solr | http://sdw1:18984/solr | f | f | f
demo.wikipedia.articles | shard2 | active | core_node9 | active | demo.wikipedia.articles_shard
2_replica_n6 | sdw2:18984_solr | http://sdw2:18984/solr | t | f | f
demo.wikipedia.articles | shard3 | active | core_node11 | active | demo.wikipedia.articles_shard
3_replica_n8 | sdw2:18983_solr | http://sdw2:18983/solr | t | f | f
demo.wikipedia.articles | shard3 | active | core_node13 | active | demo.wikipedia.articles_shard
3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f | f | f
demo.wikipedia.articles | shard4 | active | core_node15 | active | demo.wikipedia.articles_shard
4_replica_n12 | sdw1:18984_solr | http://sdw1:18984/solr | t | f | f
demo.wikipedia.articles | shard4 | active | core_node16 | active | demo.wikipedia.articles_shard
4_replica_n14 | sdw2:18984_solr | http://sdw2:18984/solr | f | f | f
(8 rows) ```
Show status for all Tanzu Greenplum Text indexes.
=# SELECT * FROM gptext.index_status();
index_name | shard_name | shard_state | replica_name | replica_state | core | node_name | base_url | is_leader | partitioned | external_index
-------------------------+------------+-------------+--------------+---------------+--------------------------------------------+-----------------+------------------------+-----------+-------------+----------------
demo.public.news_demo | shard1 | active | core_node3 | active | demo.public.news_demo_shard1_replica_n1 | sdw2:18984_solr | http://sdw2:18984/solr | t | f | f
demo.public.news_demo | shard1 | active | core_node5 | active | demo.public.news_demo_shard1_replica_n2 | sdw1:18984_solr | http://sdw1:18984/solr | f | f | f
demo.public.news_demo | shard2 | active | core_node7 | active | demo.public.news_demo_shard2_replica_n4 | sdw2:18983_solr | http://sdw2:18983/solr | t | f | f
demo.public.news_demo | shard2 | active | core_node8 | active | demo.public.news_demo_shard2_replica_n6 | sdw1:18983_solr | http://sdw1:18983/solr | f | f | f
demo.twitter.message | shard1 | active | core_node3 | active | demo.twitter.message_shard1_replica_n1 | sdw2:18984_solr | http://sdw2:18984/solr | t | t | f
demo.twitter.message | shard1 | active | core_node5 | active | demo.twitter.message_shard1_replica_n2 | sdw1:18983_solr | http://sdw1:18983/solr | f | t | f
demo.twitter.message | shard2 | active | core_node7 | active | demo.twitter.message_shard2_replica_n4 | sdw2:18983_solr | http://sdw2:18983/solr | f | t | f
demo.twitter.message | shard2 | active | core_node9 | active | demo.twitter.message_shard2_replica_n6 | sdw1:18984_solr | http://sdw1:18984/solr | t | t | f
demo.twitter.message | shard3 | active | core_node11 | active | demo.twitter.message_shard3_replica_n8 | sdw2:18984_solr | http://sdw2:18984/solr | t | t | f
demo.twitter.message | shard3 | active | core_node13 | active | demo.twitter.message_shard3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f | t | f
demo.twitter.message | shard4 | active | core_node15 | active | demo.twitter.message_shard4_replica_n12 | sdw2:18983_solr | http://sdw2:18983/solr | f | t | f
demo.twitter.message | shard4 | active | core_node16 | active | demo.twitter.message_shard4_replica_n14 | sdw1:18984_solr | http://sdw1:18984/solr | t | t | f
demo.wikipedia.articles | shard1 | active | core_node3 | active | demo.wikipedia.articles_shard1_replica_n1 | sdw2:18983_solr | http://sdw2:18983/solr | t | f | f
demo.wikipedia.articles | shard1 | active | core_node5 | active | demo.wikipedia.articles_shard1_replica_n2 | sdw1:18983_solr | http://sdw1:18983/solr | f | f | f
demo.wikipedia.articles | shard2 | active | core_node7 | active | demo.wikipedia.articles_shard2_replica_n4 | sdw1:18984_solr | http://sdw1:18984/solr | f | f | f
demo.wikipedia.articles | shard2 | active | core_node9 | active | demo.wikipedia.articles_shard2_replica_n6 | sdw2:18984_solr | http://sdw2:18984/solr | t | f | f
demo.wikipedia.articles | shard3 | active | core_node11 | active | demo.wikipedia.articles_shard3_replica_n8 | sdw2:18983_solr | http://sdw2:18983/solr | t | f | f
demo.wikipedia.articles | shard3 | active | core_node13 | active | demo.wikipedia.articles_shard3_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f | f | f
demo.wikipedia.articles | shard4 | active | core_node15 | active | demo.wikipedia.articles_shard4_replica_n12 | sdw1:18984_solr | http://sdw1:18984/solr | t | f | f
demo.wikipedia.articles | shard4 | active | core_node16 | active | demo.wikipedia.articles_shard4_replica_n14 | sdw2:18984_solr | http://sdw2:18984/solr | f | f | f
(20 rows)
Lists indexes on partitioned tables or child partition names in the current Greenplum database.
gptext.partition_status([<index_name>])
<index_name>
Optional. Returns partition status for all indexes if no index is specified.
SETOF gptext.partition_status_result
This is a composite type with the following columns:
Column | Type |
---|---|
partition_name | text |
inherits_name | text |
level | integer |
cons | text |
List partition status for an index.
=# SELECT partition_name, inherits_name, level
FROM gptext.partition_status('demo.twitter.message');
partition_name | inherits_name | level
------------------------------+----------------------+-------
demo.twitter.message_1_prt_1 | demo.twitter.message | 1
demo.twitter.message_1_prt_2 | demo.twitter.message | 1
demo.twitter.message_1_prt_3 | demo.twitter.message | 1
demo.twitter.message_1_prt_4 | demo.twitter.message | 1
(4 rows)
The gptext.partition_status()
function can only list the index partitions for tables in the current Greenplum database.
Shows replica status and core statistics for a specified index or for all indexes.
gptext.index_summary([<index_name>])
<index_name>
Optional. Returns information for all indexes if no index is specified.
SETOF gptext.index_summary_result
This is a composite (row) type with the following columns:
Column | Type |
---|---|
index_name | text |
shard_name | text |
shard_state | text |
replica_name | text |
replica_state | text |
core | text |
node_name | text |
base_url | text |
is_leader | boolean |
partitioned | boolean |
external_index | boolean |
core_name | text |
instance_dir | text |
data_dir | text |
config | text |
schema | text |
start_time | text |
uptime | bigint |
num_docs | integer |
max_docs | integer |
delete_docs | integer |
index_heap_usage_bytes | bigint |
version | bigint |
segment_count | integer |
current | boolean |
has_deletions | boolean |
directory | text |
last_modified | text |
size_in_bytes | bigint |
size | text |
The gptext.index_summary()
function can appear to hang while it waits on a lock when SolrCloud is performing recovery. If recovery takes a long time, the function times out. The timeout period is 600 seconds by default.
List the Solr node, number of documents, and size of the leader replica for each shard of an index.
=# SELECT index_name, shard_name, node_name, num_docs, size_in_bytes
FROM gptext.index_summary('demo.twitter.message')
WHERE is_leader;
index_name | shard_name | node_name | num_docs | size_in_bytes
----------------------+------------+-----------------+----------+---------------
demo.twitter.message | shard4 | sdw2:18983_solr | 417 | 295873
demo.twitter.message | shard3 | sdw2:18984_solr | 449 | 302987
demo.twitter.message | shard2 | sdw2:18983_solr | 449 | 311868
demo.twitter.message | shard1 | sdw2:18984_solr | 415 | 282736
(4 rows)
The functions in this section provide configuration information and metrics for Tanzu Greenplum Text indexes and help to configure indexes.
Shows the output from each class in the index or query analyzer chain for a given field type and user-supplied input text.
gptext.analyzer(<index_name>, {'index' | 'query'}, <field_type>, <input>)
<index_name>
The name of the index. {'index' | 'query'}
Show output for the index analysis chain or the query analysis chain. <field_type>
The field type to analyze. <input>
A text string to pass through the analyzer.
Text
=# SELECT gptext.analyzer('demo.wikipedia.articles', 'index', 'text_intl',
'Chopin takes a wayward sidestep into a delicious series of upward flying scales');
analyzer
----------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------
(WorldLexerTokenizer,"{{""Chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""}
,{""upward""},{""flying""},{""scales""}}")
(CJKWidthFilter,"{{""Chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""u
pward""},{""flying""},{""scales""}}")
(LowerCaseFilter,"{{""chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of""},{""
upward""},{""flying""},{""scales""}}")
(WorldLexerBigramFilter,"{{""chopin""},{""takes""},{""a""},{""wayward""},{""sidestep""},{""into""},{""a""},{""delicious""},{""series""},{""of
""},{""upward""},{""flying""},{""scales""}}")
(StopFilter,"{{""chopin""},{""takes""},{},{""wayward""},{""sidestep""},{},{},{""delicious""},{""series""},{},{""upward""},{""flying""},{""sca
les""}}")
(SetKeywordMarkerFilter,"{{""chopin""},{""takes""},{},{""wayward""},{""sidestep""},{},{},{""delicious""},{""series""},{},{""upward""},{""flyi
ng""},{""scales""}}")
(PorterStemFilter,"{{""chopin""},{""take""},{},{""wayward""},{""sidestep""},{},{},{""delici""},{""seri""},{},{""upward""},{""fly""},{""scale"
"}}")
(7 rows)
Each row of the gptext.analyzer()
function result is a single text column containing the name of the analyzer class and a list of the tokens the class produced from the output of the previous class.
You can use this function to compare the output produced by analyzer chains configured for different field types and to test your analyzer configuration.
Appends the contents of a local file to a ZooKeeper index configuration file.
gptext.config_append(<index_name>, <local_config_file> [, <index_config_file>])
<index_name>
The name of the index to configure. <local_config_file>
The path and filename of a local file that you will append to the index configuration file.
<index_config_file>
Optional. The name of the ZooKeeper configuration file to which you will append the local file. If you omit this parameter, the function appends the local file to a file of the same name that resides in the top-level ZooKeeper directory.
boolean
Append the local file /home/gpadmin/stopwords.add
to the top-level ZooKeeper file stopwords.txt
for index demo.wikipedia.articles
:
=# SELECT * FROM gptext.config_append('demo.wikipedia.articles', '/home/gpadmin/stopwords.add', 'stopwords.txt');
config_append
---------------
t
(1 row)
Deletes an index configuration file from ZooKeeper.
gptext.config_delete(<index_name>, <index_config_file>)
<index_name>
The name of the index that has the file to delete. <index_config_file>
The ZooKeeper configuration file to delete. Include the path if the file does not reside at the top-level directory.
boolean
Delete the file named stopwords.add
from the top-level configuration directory for the index demo.wikipedia.articles
:
=# select * from gptext.config_delete('demo.wikipedia.articles', 'stopwords.add');
config_delete
---------------
t
(1 row)
Displays the contents of a ZooKeeper index configuration file.
gptext.config_get(<index_name>, <index_config_file>)
<index_name>
The name of the index that has the file you want to display. <index_config_file>
The ZooKeeper configuration file to display. Include the path if the file does not reside at the top-level ZooKeeper directory for the index.
text
Display the contents of synonyms.txt
for the index demo.wikipedia.article
:
=# select * from gptext.config_get('demo.wikipedia.articles', 'synonyms.txt');
config_get
----------------------------------------------------------------------------
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaa => aaaa
bbb => bbbb1 bbbb2
ccc => cccc1,cccc2
a\=>a => b\=>b
a\,a => b\,b
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
(1 row)
Lists the ZooKeeper configuration files and directories for an index.
gptext.config_list(<index_name>, [<index_config_path>,] <is_recursive>)
<index_name>
The name of the index that has the files and directories you want to list. <index_config_path>
Optional. A specific directory in the ZooKeeper configuration that you want to list. Omit this option to list configuration files and directories in the top-level directory. <is_recursive>
Optional. A boolean value that determines whether the function recursively lists files and directories that are present in subdirectories.
SETOF text
List ZooKeeper configuration files and directories only in the top-level directory for the index demo.wikipedia.articles
:
=# select * from gptext.config_list('demo.wikipedia.articles', false);
config_list
-----------------------------
currency.xml
mapping-FoldToASCII.txt
managed-schema
protwords.txt
scripts.conf
synonyms.txt
managed_schema
stopwords.txt
velocity
admin-extra.html
aggconfig.xml
emoticons.txt
solrconfig.xml
elevate.xml
xslt
mapping-ISOLatin1Accent.txt
spellings.txt
lang
(18 rows)
List ZooKeeper configuration files in the ZooKeeper lang
subdirectory for demo.wikipedia.articles
:
=# select * from gptext.config_list('demo.wikipedia.articles', 'lang', false);
config_list
--------------------------
lang/contractions_it.txt
lang/contractions_ca.txt
lang/stemdict_nl.txt
lang/stopwords_hy.txt
lang/stopwords_no.txt
lang/stopwords_id.txt
[...]
(39 rows)
List all configuration files and directories for demo.wikipedia.articles
:
=# select * from gptext.config_list('demo.wikipedia.articles', true);
config_list
----------------------------------
currency.xml
mapping-FoldToASCII.txt
managed-schema
protwords.txt
scripts.conf
synonyms.txt
managed_schema
stopwords.txt
velocity
velocity/doc.vm
velocity/suggest.vm
velocity/hit.vm
[...]
(86 rows)
Uploads an index configuration file to ZooKeeper, replacing any existing file of the same name.
gptext.config_upload(<index_name>, <local_config_file> [, <index_config_file>])
<index_name>
The name of the index to configure. <local_config_file>
The path and filename of a local file that you wan to upload to ZooKeeper for the index. The function uploads this file to a file the same name in the top-level ZooKeeper directory for the index, unless you include the <index_config_file>
option to change the path or filename.
<index_config_file>
Optional. The destination path for the file in ZooKeeper. If you omit this parameter, the function uploads the local file to the top-level ZooKeeper directory for the index.
boolean
Upload the local file /home/gpadmin/stopwords.txt
to ZooKeeper, overwriting the existing stopwords.txt
file for the index demo.wikipedia.articles
:
=# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords.txt');
config_upload
---------------
t
(1 row)
Upload the local file /home/gpadmin/stopwords_japanese.txt
to ZooKeeper, overwriting the file lang/stopwords_ja.txt
for the index demo.wikipedia.articles
:
# select * from gptext.config_upload('demo.wikipedia.articles', '/home/gpadmin/stopwords_japanese.txt', 'lang/stopwords_ja.txt');
config_upload
---------------
t
(1 row)
Displays the analyzer chain for a field type defined in the configuration for a specified index.
gptext.get_field_type(<index_name>, <field_type>)
<index_name>
The name of the index.
<field_type>
The name of a field type defined in the managed-schema
configuration file for the index.
=# SELECT gptext.get_field_type('demo.wikipedia.articles', 'text');
get_field_type
-------------------------------------------------
{
"name": "text",
"class": "solr.TextField",
"indexAnalyzer": {
"tokenizer": {
"class": "solr.WhitespaceTokenizerFactory"
},
"filters": [
{
"class": "solr.StopFilterFactory",
"attributes": [
{
"name": "words",
"value": "stopwords.txt"
},
{
"name": "ignoreCase",
"value": "true"
}
]
},
{
"class": "solr.WordDelimiterFilterFactory",
"attributes": [
{
"name": "catenateNumbers",
"value": "1"
},
{
"name": "generateNumberParts",
"value": "1"
},
{
"name": "splitOnCaseChange",
"value": "1"
},
{
"name": "generateWordParts",
"value": "1"
},
{
"name": "catenateAll",
"value": "0"
},
{
"name": "catenateWords",
"value": "1"
}
]
},
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.KeywordMarkerFilterFactory",
"attributes": [
{
"name": "protected",
"value": "protwords.txt"
}
]
},
{
"class": "solr.PorterStemFilterFactory"
}
]
},
"queryAnalyzer": {
"tokenizer": {
"class": "solr.WhitespaceTokenizerFactory"
},
"filters": [
{
"class": "solr.SynonymFilterFactory",
"attributes": [
{
"name": "expand",
"value": "true"
},
{
"name": "ignoreCase",
"value": "true"
},
{
"name": "synonyms",
"value": "synonyms.txt"
}
]
},
{
"class": "solr.StopFilterFactory",
"attributes": [
{
"name": "words",
"value": "stopwords.txt"
},
{
"name": "ignoreCase",
"value": "true"
}
]
},
{
"class": "solr.WordDelimiterFilterFactory",
"attributes": [
{
"name": "catenateNumbers",
"value": "0"
},
{
"name": "generateNumberParts",
"value": "1"
},
{
"name": "splitOnCaseChange",
"value": "1"
},
{
"name": "generateWordParts",
"value": "1"
},
{
"name": "catenateAll",
"value": "0"
},
{
"name": "catenateWords",
"value": "0"
}
]
},
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.KeywordMarkerFilterFactory",
"attributes": [
{
"name": "protected",
"value": "protwords.txt"
}
]
},
{
"class": "solr.PorterStemFilterFactory"
}
]
},
"attributes": [
{
"name": "autoGeneratePhraseQueries",
"value": "true"
},
{
"name": "positionIncrementGap",
"value": "100"
}
]
}
(1 row)
You can use the gptext.list_field_types()
function to list the text field types defined in the configuration for an index.
See Customizing Tanzu Greenplum Text Indexes for information about text analyzer chains.
Lists available field types in the managed-schema
configuration file for a specified Tanzu Greenplum Text index.
gptext.list_field_types(<index_name>)
<index_name>
The name of the index.
SETOF text
List the field types defined in the managed-schema
configuration file for the demo.wikipedia.articles
index.
=# SELECT gptext.list_field_types('demo.wikipedia.articles');
list_field_types
---------------------------
ancestor_path
delimited_payloads_float
delimited_payloads_int
delimited_payloads_string
descendent_path
lowercase
phonetic_en
text
text_ar
text_bg
text_ca
text_cjk
text_cz
text_da
text_de
text_el
text_en
text_en_splitting
text_en_splitting_tight
text_es
text_eu
text_fa
text_fi
text_fr
text_ga
text_general
text_general_rev
text_gl
text_hi
text_hu
text_hy
text_icu
text_id
text_intl
text_intl_prev
text_it
text_ja
text_lv
text_nl
text_no
text_pt
text_ro
text_ru
text_sm
text_sv
text_th
text_tr
text_ws
text_zhsmart
(49 rows)
Reloads Solr configuration files if they have been modified.
gptext.reload_index(<index_name>)
<index_name>
Optional. The name of the index for which to reload the configuration files.
boolean
Only the OWNER can execute this function.
None.
=# SELECT * FROM gptext.reload_index('demo.wikipedia.articles');
reload_index
--------------
t
(1 row)
These functions provide information about the Tanzu Greenplum Text cluster operation and status.
gptext.live_nodes() – lists active Solr nodes.
gptext.version() – returns version of Tanzu Greenplum Text installation.
gptext.zookeeper_hosts() – returns a list of the ZooKeeper host names and ports.
Lists active Solr nodes and their up or down state.
gptext.live_nodes()
SETOF gptext.live_nodes_result
This is a composite type with the following columns:
Column | Type |
---|---|
host | text |
port | bigint |
data_dir | text |
status | text |
=# SELECT * FROM gptext.live_nodes();
host | port | data_dir | status
--------+-------+---------------------+--------
gpdb51 | 18983 | /data/gpdata1/solr0 | u
gpdb51 | 18984 | /data/gpdata2/solr0 | u
(2 rows)
The status column can be u
(up) or d
(down).
Returns the version of your Tanzu Greenplum Text installation.
SELECT * FROM gptext.version()
None.
text
You do not need any privileges to execute this function.
=# SELECT * FROM gptext.version();
version
--------------------------------
Greenplum Text Analytics 2.1.3
(1 row)
Returns a list of ZooKeeper hosts and ports.
gptext.zookeeper_hosts()
text
This function returns a comma-separated list of ZooKeeper nodes in the the format <host-name>:<port>
.
=# SELECT * FROM gptext.zookeeper_hosts()
host | port
--------+------
gpdb51 | 2188
gpdb51 | 2189
gpdb51 | 2190
(3 rows)
Adds a replica of an index shard.
gptext.add_replica(<index_name>, <shard_name>[, <node_name>])
<index_name>
Name of the index. If the index is for a partitioned database table, this must be the name of the root table. <shard_name>
Name of the shard to replicate. <node_name>
Name of the node where the replica is to be added. Optional. If omitted, SolrCloud chooses the node.
boolean
This function is used by the Tanzu Greenplum Text management utility gptext-replica add
.
The value of the gptext.replication_factor
configuration parameter when an index is created determines how many replicas are created for each shard. In a Greenplum system, there are the same number of shards as there are Greenplum segments. The number of replicas created for a new index is the number of segments times the value of the gptext.replication_factor
configuration parameter, 2 by default. The replicas are distributed evenly among the live Tanzu Greenplum Text nodes.
Replicas consume space on the host where they are created, so they are usually only created to replace a replica that has failed or become unavailable or to relocate a replica to another Tanzu Greenplum Text instance. When adding replicas, you should maintain equal distribution of replicas among the Tanzu Greenplum Text nodes and avoid placing multiple replicas for the same shard on the same host.
The total number of replicas for an index that can be placed on each Tanzu Greenplum Text node is set when the index is created. (In Solr, this is the MaxShardsPerNode
parameter.) Tanzu Greenplum Text sets this limit by calculating the number of replicas to create per node and adding an additional factor, specified in the gptext.extension_factor
server configuration parameter. This parameter can be set between 0 and 10; the default value is 2. Since the limit is set when the index is created, it is recommended to set the gptext.extension_factor
parameter to a higher number to allow new replicas to be created when necessary.
=# SELECT * FROM gptext.add_replica('demo.wikipedia.articles', 'shard1');
success | core_name
---------+-----------------------------------------
t | demo.wikipedia.articles_shard1_replica3
(1 row)
Deletes a named replica from the specified index and shard.
gptext.delete_replica(<index_name>, <shard_name>, <replica_name>[, <only_if_down>])
<index_name>
Name of the index. <shard_name>
Name of the shard that contains the replica to delete. <replica_name>
Name of the replica to remove. <only_if_down>
Optional. When true, no action is taken if the replica is active. Default is false.
boolean
Use the gptext.index_status()
function to find the name of the replica to drop. Names are in the format core_nodeX
, where X
is a number.
This function is called from the gptext-replica drop
management utility.
Delete the core_node5
replica if it is down.
=# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5', true);
ERROR: Delete replica failed: Attempted to remove replica : demo.wikipedia.articles/shard1/core_node5 with onlyIfDown='true', but state is 'active'.
Delete the core_node5
replica even if it is active.
=# SELECT * FROM gptext.delete_replica('demo.wikipedia.articles', 'shard1', 'core_node5');
success
---------
t
(1 row)
Counts the number of records in a table.
gptext.count_t(<table_name>)
<table_name>
Name of the table for which to count records.
integer
You need SELECT privileges on <table_name>
to execute this function.
=# SELECT * FROM gptext.count_t('demo.wikipedia.articles');
count_t
---------
23
(1 row)