Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create Tanzu Greenplum text indexes and add documents from VMware Greenplum tables to them, and how to maintain and customize indexes for your own applications.
For help indexing and searching documents stored outside of VMware Greenplum, see Working With Tanzu Greenplum text External Indexes.
The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.
Log in to the VMware Greenplum master as the gpadmin user and create the demo database.
$ createdb demo
Open an interactive shell for executing queries in the demo database.
$ psql demo
Create the articles table in the wikipedia schema with the following statements.
CREATE SCHEMA wikipedia;
CREATE TABLE wikipedia.articles (
id int8 primary key,
date_time timestamptz,
title text,
content text,
refs text
) DISTRIBUTED BY (id);
Create the message table in the twitter schema with the following statements.
CREATE SCHEMA twitter;
CREATE TABLE twitter.message (
id bigint,
message_id bigint,
spam boolean,
created_at timestamp without time zone,
source text,
retweeted boolean,
favorited boolean,
truncated boolean,
in_reply_to_screen_name text,
in_reply_to_user_id bigint,
author_id bigint,
author_name text,
author_screen_name text,
author_lang text,
author_url text,
author_description text,
author_listed_count integer,
author_statuses_count integer,
author_followers_count integer,
author_friends_count integer,
author_created_at timestamp without time zone,
author_location text,
author_verified boolean,
message_url text,
message_text text )
DISTRIBUTED BY (id)
PARTITION BY RANGE (created_at)
( START (DATE '2011-08-01') INCLUSIVE
END (DATE '2011-12-01') EXCLUSIVE
EVERY (INTERVAL '1 month'));
CREATE INDEX id_idx ON twitter.message USING btree (id);
Create the store.products table with these statements.
CREATE SCHEMA store;
CREATE TABLE store.products (
id bigint,
title text,
category varchar(32),
brand varchar(32),
price float )
DISTRIBUTED BY (id);
Download test data for the three tables here. Right-click the link, save the file, and then copy it to the gpadmin user's home directory.
Extract the data files with this tar command.
$ tar xvfz gptext-demo-data.tgz
Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.
\COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;
The articles table now contains text from 23 Wikipedia articles.
Load the twitter data into the twitter.message table using the following psql \COPY metacommand.
\COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;
The message table now contains 1730 tweets from August to October, 2011.
Load the product data into the store.products table with the following psql \COPY metacommand.
\COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;
The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.
To work with Tanzu Greenplum text indexes, you must first set up your environment and add the Tanzu Greenplum text schema to the database containing the documents (VMware Greenplum data) you want to index.
To set the environment, log in as the gpadmin user and source the VMware Greenplum and Tanzu Greenplum text environment scripts. The VMware Greenplum environment must be set before you source the Tanzu Greenplum text environment script. For example, if both VMware Greenplum and Tanzu Greenplum text are installed in the /usr/local/ directory, enter these commands:
$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh
With the environment now set, you can access the Tanzu Greenplum text command-line utilities.
Use the gptext-installsql utility to add the Tanzu Greenplum text schema to databases containing data you want to index with Tanzu Greenplum text. You perform this task one time for each database. In this example, the gptext schema is installed into the demo database.
$ gptext-installsql demo
The gptext schema provides user-defined types, tables, views, and functions for Tanzu Greenplum text. This schema is reserved for Tanzu Greenplum text. If you create any new objects in the gptext schema, they will be lost when you reinstall the schema or upgrade Tanzu Greenplum text.
The general steps for creating a Tanzu Greenplum text index and indexing documents are: create the index, optionally customize the index configuration, add documents to the index, and commit the index.
After you complete these steps, you can create and execute a search query or implement machine learning algorithms. Searching Tanzu Greenplum text indexes is described in the Querying Tanzu Greenplum text Indexes topic.
The following steps are completed by executing SQL commands and Tanzu Greenplum text functions in the database. Refer to the Tanzu Greenplum text Function Reference for details about the Tanzu Greenplum text functions described in the following examples.
A Tanzu Greenplum text index is an Apache Solr collection containing documents added from a VMware Greenplum table. There can be one Tanzu Greenplum text index per VMware Greenplum table. Each row in the database table is a document that can be added to the Tanzu Greenplum text index.
If the database table is partitioned, you can create one Tanzu Greenplum text index for all partitions by specifying the root table name when creating the index and adding documents to it. Tanzu Greenplum text provides search semantics that enable searching partitions efficiently.
A Tanzu Greenplum text external index is a Solr index for documents that are located outside of VMware Greenplum. Tanzu Greenplum text provides user-defined functions to create external indexes and add documents to them. See Working with Tanzu Greenplum text External Indexes.
A Tanzu Greenplum text index, by default, has one Solr shard for each VMware Greenplum segment. You can specify fewer shards when you create an index by changing the gptext.idx_num_shards configuration parameter from 0 to the number of shards you want before you create the index. See Specifying the Number of Index Shards for information about using this option.
The maximum number of documents a single shard can contain is 2147483519, so the maximum number of documents you can add to a Tanzu Greenplum text index is 2147483519 times the number of shards.
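The capacity arithmetic above can be expressed as a quick sanity check. This is an illustrative sketch, not a Tanzu Greenplum text API; the 2147483519 figure is the per-shard limit stated in this topic.

```python
# Per-shard document limit stated above (Lucene's cap, 2^31 - 129).
MAX_DOCS_PER_SHARD = 2147483519

def max_index_docs(num_shards: int) -> int:
    """Upper bound on the number of documents a single index can hold."""
    return MAX_DOCS_PER_SHARD * num_shards

# With the default of one shard per primary segment, a four-segment
# cluster yields a four-shard index holding roughly 8.6 billion documents.
print(max_index_docs(4))  # 8589934076
```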
The gptext.create_index() function creates a new Tanzu Greenplum text index. This function has two signatures:
gptext.create_index(<schema_name>, <table_name>, <id_col_name>,
<def_search_col_name> [, <if_check_id_uniqueness>])
or
gptext.create_index(<schema_name>, <table_name>, <p_columns>, <p_types>,
<id_col_name>, <def_search_col_name> [, <if_check_id_uniqueness>])
The <schema_name> and <table_name> arguments specify the database table that contains the source documents.
The <id_col_name> argument is the name of the table column that contains a unique identifier for each row. The <id_col_name> column can be of type int4, int8, varchar, text, or uuid.
The <def_search_col_name> argument is the name of the table column that contains the content you want to search by default. For example, if you want to index and search just the content column, you can use the first signature and specify the content column name in the <def_search_col_name> argument.
The final, optional argument, <if_check_id_uniqueness>, is a Boolean argument. When true, the default, attempting to add a document with an ID that already exists in the index generates an error. If you set the argument to false, you can add documents with the same ID, but when you search the index all documents with the same ID are returned.
The gptext.terms() function requires unique IDs and cannot be used with indexes created with <if_check_id_uniqueness> set to false.
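For example, an index that tolerates duplicate IDs could be created by passing false as the final argument. This is a sketch based on the first gptext.create_index() signature described above:

```sql
-- Allow documents with duplicate IDs in this index.
-- Note: gptext.terms() cannot be used with such an index.
SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text', false);
```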
The following command creates an index for the twitter.message table, with the id column as the unique ID field and the message_text column as the default search column:
=# SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text');
To verify that the demo.twitter.message index was created, call gptext.index_status():
=# SELECT * FROM gptext.index_status('demo.twitter.message');
index_name | shard_name | shard_state | replica_name | replica_state | core
| node_name | base_url | is_leader | partitioned | external_index
----------------------+------------+-------------+--------------+---------------+---------------------------------
--------+-----------------+------------------------+-----------+-------------+----------------
demo.twitter.message | shard1 | active | core_node3 | active | demo.twitter.message_shard1_repl
ica_n1 | sdw2:18984_solr | http://sdw2:18984/solr | t | t | f
demo.twitter.message | shard1 | active | core_node5 | active | demo.twitter.message_shard1_repl
ica_n2 | sdw1:18983_solr | http://sdw1:18983/solr | f | t | f
demo.twitter.message | shard2 | active | core_node7 | active | demo.twitter.message_shard2_repl
ica_n4 | sdw2:18983_solr | http://sdw2:18983/solr | f | t | f
demo.twitter.message | shard2 | active | core_node9 | active | demo.twitter.message_shard2_repl
ica_n6 | sdw1:18984_solr | http://sdw1:18984/solr | t | t | f
demo.twitter.message | shard3 | active | core_node11 | active | demo.twitter.message_shard3_repl
ica_n8 | sdw2:18984_solr | http://sdw2:18984/solr | t | t | f
demo.twitter.message | shard3 | active | core_node13 | active | demo.twitter.message_shard3_repl
ica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f | t | f
demo.twitter.message | shard4 | active | core_node15 | active | demo.twitter.message_shard4_repl
ica_n12 | sdw2:18983_solr | http://sdw2:18983/solr | f | t | f
demo.twitter.message | shard4 | active | core_node16 | active | demo.twitter.message_shard4_repl
ica_n14 | sdw1:18984_solr | http://sdw1:18984/solr | t | t | f
(8 rows)
This example was executed on a VMware Greenplum cluster with four primary segments. Four shards were created, one for each segment, and each shard has two replicas.
You can also run the gptext-state -D command-line utility to verify that the index was created. See the gptext-state reference for details.
The Tanzu Greenplum text index for the demo.twitter.message table is configured, by default, to index all columns in the twitter.message database table. You can write search queries that contain criteria using any column in the table.
If you want to index and search a subset of the table columns, you can use the second gptext.create_index() signature, specifying the columns to index in the <p_columns> argument and the data types of those columns in the <p_types> argument. The <p_columns> and <p_types> arguments are text arrays. The id column name and default search column name must be included in the arrays.
Use the second gptext.create_index() signature to create an index for the wikipedia.articles table. This index will allow you to search on the title, content, and refs columns. Note that the id column and default search column are still specified in separate arguments following the <p_columns> and <p_types> arrays.
=# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{id, title, content, refs}',
'{long, text_intl, text_intl, text_intl}', 'id', 'content', true);
INFO: Created index demo.wikipedia.articles
create_index
--------------
t
(1 row)
Because the date_time column was omitted from the <p_columns> and <p_types> arrays, it will not be possible to search the wikipedia.articles index on date with the Tanzu Greenplum text search functions.
Creating a Tanzu Greenplum text index generates a set of configuration files for the index. Before you add documents to the index, you can customize the configuration files to change the way data is indexed and stored. You can customize an index later, after you have added documents to it, but you must then reindex the data to take advantage of your customizations.
One common customization is to remap data types for some database columns. In the managed-schema configuration file for an index, Tanzu Greenplum text maps the data types for each field from the VMware Greenplum type to an equivalent Solr data type. Tanzu Greenplum text applies default mappings (see Tanzu Greenplum text and Solr Data Type Mappings), but your index may be more effective if you use a different mapping for some fields.
The demo.twitter.message table, for example, has a message_text text column that contains tweets. By default, Tanzu Greenplum text maps text columns to the Solr text_intl (international text) type. The Tanzu Greenplum text text_sm (social media text) type is a better mapping for a text column that contains social media idioms such as emoticons.
Follow these steps to remap the message_text field to the text_sm type.
Use the gptext-config utility to edit the managed-schema file for the demo.twitter.message index.
$ gptext-config edit -i demo.twitter.message -f managed-schema
The managed-schema file loads in a text editor (normally vi).
Find the <field> element for the message_text field.
<field name="message_text" stored="false" type="text_intl" indexed="true"/>
Change the type attribute from text_intl to text_sm.
<field name="message_text" stored="false" type="text_sm" indexed="true"/>
Save the file and exit the editor.
There are many other ways to customize a Tanzu Greenplum text index. For example, you can omit fields from the index by changing the indexed attribute of the <field> element to false, store the contents of a field in the index by changing the stored attribute to true, or use gptext-config to edit the stopwords.txt file to specify additional words to ignore when indexing.
See Customizing Tanzu Greenplum text Indexes to learn how data type mapping determines how Solr analyzes and indexes field contents and for more ways to customize Tanzu Greenplum text indexes.
To populate the index, use the table function gptext.index(), which has the following syntax:
SELECT * FROM gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>);
To index all rows in the twitter.message table, execute this command:
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
dbid | num_docs
------+----------
2 | 892
3 | 838
(2 rows)
This command indexes the rows in the wikipedia.articles table.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
dbid | num_docs
------+----------
3 | 11
2 | 12
(2 rows)
The results of this command show that 23 documents from two segments were added to the index.
The first argument of the gptext.index() function is a table expression. TABLE(SELECT * FROM wikipedia.articles) creates a table expression from the articles table, using the table function TABLE.
You can choose the data to index or update by changing the inner select list in the query to select the rows you want to index. When adding new documents to an existing index, for example, specify a WHERE clause in the gptext.index() call to choose only the new rows to index.
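For instance, to index only recently added rows, the inner query can be restricted with a WHERE clause. This sketch uses the twitter.message table's created_at column and an illustrative cutoff date:

```sql
-- Index only rows created on or after November 1, 2011;
-- earlier rows are assumed to be in the index already.
SELECT * FROM gptext.index(
    TABLE(SELECT * FROM twitter.message
          WHERE created_at >= '2011-11-01'),
    'demo.twitter.message');

-- Make the new documents visible to searches.
SELECT * FROM gptext.commit_index('demo.twitter.message');
```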
The inner SELECT statement could also be a query on a different table with the same structure, or a result set constructed with an arbitrarily complex join, provided the columns specified in the gptext.create_index() function are present in the results. If you index data from a source other than the table used to create the index, be sure the distribution key for the result set matches the distribution key of the base table. The VMware Greenplum SELECT statement has a SCATTER BY clause that you can use to specify the distribution key for the results of a query. See Specifying a distribution key with SCATTER BY for more about the distribution policy and Tanzu Greenplum text indexes.
After you create and populate an index, you commit the index using gptext.commit_index(<index_name>).
This example commits the documents added to the indexes in the previous example.
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
commit_index
--------------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
commit_index
--------------
t
(1 row)
The gptext.commit_index() function commits any new data added to or deleted from the index since the last commit.
Tanzu Greenplum text provides command-line utilities and functions you can use to perform the Tanzu Greenplum text management tasks described in the following sections.
You can modify your indexing behavior globally by using the gptext-config utility to edit a set of index configuration files. The files you can edit with gptext-config are:
solrconfig.xml – Contains most of the parameters for configuring Solr itself (see http://wiki.apache.org/solr/SolrConfigXml).
managed-schema – Defines the analyzer chains that Solr uses for various different types of search fields (see Text Analyzer Chains).
stopwords.txt – Lists words you want to eliminate from the final index.
protwords.txt – Lists protected words that you do not want to be modified by the analyzer chain. For example, iPhone.
synonyms.txt – Lists words that you want replaced by synonyms in the analyzer chain.
elevate.xml – Moves specific words to the top of your final index.
emoticons.txt – Defines emoticons for the text_sm social media analyzer chain (see The emoticons.txt file).
You can also use gptext-config to move files.
The function gptext.optimize_index(<index_name>, <max_segments>) merges all segments into a small number of segments (<max_segments>) for increased efficiency.
Example:
=# SELECT * FROM gptext.optimize_index('demo.wikipedia.articles', 10);
optimize_index
----------------
t
(1 row)
The value of the gptext.idx_num_shards Tanzu Greenplum text configuration parameter determines how many shards are created for a Tanzu Greenplum text index, as well as the default Solr router used for assigning documents to index shards. The parameter can have values between 0 and 1024. When gptext.idx_num_shards is not set, or is set to 0, Tanzu Greenplum text creates one shard for each VMware Greenplum primary segment and uses the implicit Solr router when creating new indexes. You can specify fewer shards for an index, and you may find that having fewer shards uses resources more efficiently, without affecting performance significantly. When the value of gptext.idx_num_shards is an integer greater than 0, the index has the specified number of shards and Tanzu Greenplum text uses the compositeId Solr router when creating new indexes.
Note: If you set the if_check_id_uniqueness argument of the gptext.create_index() function to false, the gptext.idx_num_shards parameter is ignored and the index will have one shard per VMware Greenplum primary segment. See Allowing Duplicate IDs.
See Changing Tanzu Greenplum text Server Configuration Parameters for instructions to set Tanzu Greenplum text configuration parameters.
The gptext-shard utility (from Tanzu Greenplum text 3.5.0) allows you to add or delete a group of shards in an index that has an implicit router. Increasing the number of shards enables you to overcome the document number limitation of an index, which is 2147483519 times shard_number (where shard_number is equal to the number of VMware Greenplum segments in the cluster).
Increasing shard number
The gptext-shard command can be run multiple times, each time increasing the shard count by the number of VMware Greenplum segments. To increase the shards in an index, use:
$ gptext-shard batch-add -i <index_name> --force
where:
batch-add adds a number of shards equal to the number of VMware Greenplum segments.
--index|-i <index_name> (required) identifies the index where the shards will be added.
--force|-f (optional) skips the confirmation steps.
For example, the shard configuration in a 16-segment VMware Greenplum cluster with 16 shards and an index named "index1" would be similar to:
seg0 -> index1_shard0
seg1 -> index1_shard1
...
seg15 -> index1_shard15
After running gptext-shard batch-add -i index1, the configuration is similar to:
seg0 -> index1_shard0, index1_shard16
seg1 -> index1_shard1, index1_shard17
...
seg15 -> index1_shard15, index1_shard31
Running the same command again would produce the following configuration:
seg0 -> index1_shard0, index1_shard16, index1_shard32
seg1 -> index1_shard1, index1_shard17, index1_shard33
...
seg15 -> index1_shard15, index1_shard31, index1_shard47
The newly added shards will be active and all new index data will populate the new shards. New data cannot be added to the older shards, but users can delete old data using DELETE queries.
Note: The maximum recommended number of total shards in a VMware Greenplum cluster is 500.
Deleting shards
To delete a group of shards in an index, use batch-del:
$ gptext-shard batch-del -i <index_name> --force
where:
batch-del deletes the oldest empty shards in the <index_name> index. The number of shards deleted matches the number of segments in the VMware Greenplum cluster.
--index|-i <index_name> (required) identifies the index from which the shards will be deleted.
--force|-f (optional) skips the confirmation steps. The --force|-f flag does not delete non-empty shards.
The gptext-shard batch-del command always deletes the oldest shards.
To delete the most recently added group of shards (used in cases where the increase was accidental), use batch-del --del-latest:
$ gptext-shard batch-del -i <index_name> --del-latest --del-non-empty --force
where:
--index|-i <index_name> (required) specifies the index from which the shards will be deleted.
--del-latest deletes the latest batch of shards; this option can be used to revert accidental changes.
--del-non-empty (optional) deletes the shards even if they are not empty.
--force|-f (optional) skips the confirmation steps.
Note: You cannot delete the last group of shards that are active for indexing.
To check whether a shard is empty before deleting it, log in to your database from your local client and use the psql tool to run:
SELECT index_name, shard_name, num_docs FROM gptext.index_summary('demo.public.t1');
The shard still contains data when the returned value of num_docs is not zero.
The final, optional argument of the gptext.create_index() function, if_check_id_uniqueness, specifies whether the index disallows duplicate IDs. It is true by default, which means the index does not allow duplicate IDs. If you set it to false, the index can have multiple documents with the same ID and a search can return more than one document for an ID.
Solr has two methods of assigning documents to index shards, the implicit router and the compositeId router. With either router, Solr can determine the shard a document belongs to given the document's ID. Only the default, implicit router supports duplicate IDs. The compositeId router is used only if you specify a non-zero value for gptext.idx_num_shards.
The gptext.terms() function requires unique document IDs and cannot be used with indexes created with if_check_id_uniqueness set to false.
The first parameter of gptext.index() is a table expression, such as TABLE(SELECT * FROM wikipedia.articles). The query in this parameter must have the same distribution policy as the table you are indexing so that documents added to the index are associated with the correct VMware Greenplum segments. Some queries, however, have no distribution policy, or they have a different distribution policy. This could happen if the query is a join of two or more tables or a query on an intermediate (staging) table that is distributed differently than the base table for the index.
To specify a distribution policy for a query result set, the VMware Greenplum SELECT statement has a SCATTER BY clause:
TABLE(SELECT * FROM wikipedia.articles SCATTER BY <distrib_id>)
where <distrib_id> is the name or number of the column used to distribute the base table for the index.
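Putting this together, a gptext.index() call over a staging table could look like the following sketch. The articles_staging table name is illustrative; id is the distribution key of the base table:

```sql
-- Redistribute the result set on id, the distribution key of the
-- base table, before handing it to gptext.index().
SELECT * FROM gptext.index(
    TABLE(SELECT * FROM wikipedia.articles_staging SCATTER BY id),
    'demo.wikipedia.articles');
```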
You can delete from an index using a query with the function gptext.delete(<index_name>, <query>). This deletes from the index all documents that match the search query. To delete all documents, use the query '*:*'.
After a successful deletion, execute gptext.commit_index(<index_name>) to commit the change.
This example deletes all documents containing "toxin" in the default search field.
=# SELECT * FROM gptext.delete('demo.wikipedia.articles', 'toxin');
delete
--------
t
(1 row)
SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
This example deletes all documents from the index:
SELECT * FROM gptext.delete('demo.wikipedia.articles', '*:*');
Be sure to commit changes to the index after deleting documents.
SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
You can completely remove an index with the gptext.drop_index(<index_name>) function.
Example:
SELECT * FROM gptext.drop_index('demo.wikipedia.articles');
You can add a field to an existing index using the gptext.add_field() function. For example, you can add a field to the index after a column is added to the underlying database table, or you can add a field to index a column that was not specified when the index was created.
Tanzu Greenplum text maps the VMware Greenplum field type to an equivalent Solr data type automatically. See Tanzu Greenplum text and Solr Data Type Mappings for a table of data type mappings.
CREATE TABLE myarticles (
id int8 primary key,
date_time timestamptz,
title text,
content text,
refs text
) DISTRIBUTED BY (id);
SELECT * FROM gptext.create_index('wikipedia', 'myarticles', 'id', 'content', true);
... populate the index ...
SELECT * FROM gptext.commit_index('demo.wikipedia.myarticles');
ALTER TABLE myarticles ADD notes text;
SELECT * FROM gptext.add_field('demo.wikipedia.myarticles', 'notes', false, false);
SELECT * FROM gptext.reload_index('demo.wikipedia.myarticles');
Adding a field to a Tanzu Greenplum text index requires the base table to be available. If you drop the table after creating the index, you cannot add fields to the index.
You can drop a field from an existing index with the gptext.drop_field() function. After you have dropped fields, call gptext.reload_index() to reload the index.
Example:
SELECT * FROM gptext.drop_field('demo.wikipedia.myarticles', 'notes');
SELECT * FROM gptext.reload_index('demo.wikipedia.myarticles');
You can list all indexes in the Tanzu Greenplum text cluster using the gptext-state command-line utility. For example:
$ gptext-state -D
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Execute VMware Greenplum Text state ...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Check zookeeper cluster state ...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Check VMware Greenplum Text cluster status...
20170822:10:11:23:029752 gptext-state:gpsne:gpadmin-[INFO]:-Current VMware Greenplum Text Version: 2.1.2
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-All nodes are up and running.
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:-Index state details.
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:------------------------------------------------
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:- database index name state
20170822:10:11:24:029752 gptext-state:gpsne:gpadmin-[INFO]:- wikipedia demo.wikipedia.articles Green
20170822:10:11:28:029752 gptext-state:gpsne:gpadmin-[INFO]:-Done.
Solr can store the contents of columns in the index so that the results of a search on the index can include the column contents. This makes it unnecessary to join the search query results with the original table. You can even store the contents of database columns that are not indexed and return that content with search results. Tanzu Greenplum text returns the additional field content in a buffer added to the search results. Individual fields can be retrieved from this buffer using the gptext.gptext_retrieve_field(), gptext.gptext_retrieve_field_int(), and gptext.gptext_retrieve_field_float() functions.
One design pattern is to store content for all of a table's columns in the Tanzu Greenplum text index so that the database table can then be truncated or dropped. Additional documents can be added to the Tanzu Greenplum text index later by inserting them into the truncated table, or into a temporary table with the same structure, and then adding them to the index with the gptext.index() function.
To enable storing content in a Tanzu Greenplum text index, you must edit the managed-schema file for the index. The <field> element for each field has a stored attribute, which defaults to false, except for the unique id field.
Follow these steps to configure the demo.wikipedia.articles index to store content for the title, content, and refs columns.
Log in to the master as gpadmin and use gptext-config to edit the managed-schema file.
$ gptext-config edit -i demo.wikipedia.articles -f managed-schema
Find the <field> elements for the columns you want to store in the index. Note that <field> elements with names beginning with an underscore are internal fields and should not be modified. The "title", "content", and "refs" fields in this example are indexed, but not stored.
<field name="__temp_field" type="intl_text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" stored="true" type="long" indexed="true"/>
<field name="__pk" stored="true" indexed="true" type="long"/>
<field name="title" stored="false" type="text" indexed="true"/>
<field name="content" stored="false" type="text" indexed="true"/>
<field name="refs" stored="false" type="text" indexed="true"/>
For each field you want to store in the index, change the stored attribute from "false" to "true".
<field name="title" stored="true" type="text" indexed="true"/>
<field name="content" stored="true" type="text" indexed="true"/>
<field name="refs" stored="true" type="text" indexed="true"/>
Save the file and, if any documents were already added to the index, reindex the table.
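With the fields stored, a search can return the column contents directly instead of requiring a join against the base table. The following is a hedged sketch; it assumes the gptext.search() result includes an rf buffer column and that gptext.gptext_retrieve_field() extracts named fields from it, as described in the Tanzu Greenplum text function reference (verify the exact signature against your release):

```sql
-- Return the stored title of each article matching 'solution',
-- extracted from the rf buffer of stored field content.
SELECT s.id,
       gptext.gptext_retrieve_field(s.rf, 'title') AS title,
       s.score
FROM gptext.search(
         TABLE(SELECT 1 SCATTER BY 1),
         'demo.wikipedia.articles',
         'solution',
         null) s;
```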
See Retrieving Stored Field Content for information about retrieving the stored content with Tanzu Greenplum text query results.
For more about the contents of the managed-schema file and additional ways to customize Tanzu Greenplum text indexes, see Customizing Tanzu Greenplum text Indexes.
To enable postings with term vectors in a Tanzu Greenplum text index, you must edit the managed-schema file for the index. The <field> element for each field must include the storeOffsetsWithPositions and termVectors attributes, and these attribute values must be set to true.
Follow these steps to configure the demo.wikipedia.articles index to enable postings with term vectors for the content column.
Log in to the master as gpadmin and use gptext-config to edit the managed-schema file.
$ gptext-config edit -i demo.wikipedia.articles -f managed-schema
Locate the <field> element for the "content" column. (Note that <field> elements with names beginning with an underscore are internal fields and should not be modified.) The "content" field in this example is indexed.
...
<field name="content" stored="false" type="text" indexed="true"/>
...
Add the storeOffsetsWithPositions and termVectors attributes to the <field> element and set their values to "true".
<field name="content" stored="true" type="text" indexed="true" storeOffsetsWithPositions="true" termVectors="true"/>
Save the file and, if any documents were already added to the index, reindex the table.
For more about the contents of the managed-schema file and additional ways to customize Tanzu Greenplum text indexes, see Customizing Tanzu Greenplum text Indexes.
Tanzu Greenplum text supports the gptext_json virtual field type for the VMware Greenplum column types json and jsonb. Users can run advanced searches on JSON documents using different JSON paths.
The type automatically creates new fields from the JSON input and enables a schemaless index. The field type must be explicitly specified as the data type during index creation. The gptext_json field type contains no data; every child field is mapped into a Solr field with the related data type and data. Users can index the whole json or jsonb column, or part of the JSON path, by explicitly specifying child fields. Any errors are logged in json_error_table.
Use the following sample data for the examples in the Create a json index, Add a field with a json type, and Dropping a json field topics:
CREATE DATABASE gptext_test;
\! gptext-installsql gptext_test
\c gptext_test
CREATE SCHEMA wikipedia;
CREATE TABLE wikipedia.articles (
id int8 primary key,
date_time timestamptz,
title text,
content text,
json1 jsonb,
json2 jsonb
) DISTRIBUTED BY (id);
insert into wikipedia.articles values(
1,
'2017-08-28 20:14:00+00',
'Forward osmosis',
'The driving force for this separation is an [[osmotic pressure]] gradient, such that a ""draw"" [[solution]] of high [[concentration]] (relative to that of the feed solution), is used to induce a net flow of water through the membrane into the draw solution, thus effectively separating the feed water from its solutes. In contrast, the reverse osmosis process uses hydraulic [[pressure]] as the driving force for separation, which serves to counteract the osmotic pressure gradient that would otherwise favor water flux from the permeate to the feed. Hence significantly more energy is required for reverse osmosis compared to forward osmosis',
'{"id": 110, "pages": 111, "title": "Today is Monday", "content": "hello world...", "description":{"name": "nested json object", "pages": 36, "content": "Today is Monday"}}',
'{"id": 27, "brand": "Ollieroo", "price": 129.99, "title": "Ollieroo 4 in 1 Aluminum Rolling Cosmetic Makeup Train Cases Trolley"}'
);
insert into wikipedia.articles values(
2,
'2017-08-28 05:00:00+00',
'Biomass',
'It has become popular among coal power stations, which switch from coal to biomass in order to convert to renewable energy generation without wasting existing generating plant and infrastructure. Biomass most often refers to plants or plant-based materials that are not used for food or feed, and are specifically called',
'{"id": 111, "pages": 115, "title": "Happy weekend!", "content": "I went out to play...", "description":{"name": "nested json object2", "pages": 37, "content": "Happy weekend"}}',
null
);
insert into wikipedia.articles values(
3,
'2017-08-29 06:00:00+00',
'Json array test',
' And we want to index nest.name , after it be flattened, the result json will not have nest.name filed. ',
'{"id": 112, "pages": 1125, "title": "Happy json array!", "content": "Need json arry works...", "description":{"name": "nested json object3", "pages": 37, "content": "Happy weekend"}, "type": ["ddd", "eee", "aaa bbb ddd"]}',
'{"id": 127, "brand": "VMware Greenplum", "price": 11129.99, "title": "jupiter 4 in 1 Aluminum Rolling Cosmetic Makeup Train Cases Trolley", "type": ["aaa", "bbb", "aaa"]}'
);
Define a json/jsonb field as gptext_json
when creating the index with gptext.create_index
. If you add a VMware Greenplum column of json/jsonb type to an index without a defined field type, it takes the default field type text_intl
and GP Text treats it as a raw text index.
If you do not want to index the entire json document in Solr, you can explicitly define specific child fields with corresponding field types.
Create an index with type gptext_json
:
select * from gptext.create_index('wikipedia', 'articles', '{id, date_time, title, json1}', '{long, tdate, text_intl, gptext_json}', 'id', 'title');
Create an index with type gptext_json
, and specify only child json fields:
select * from gptext.create_index('wikipedia', 'articles', '{id, date_time, title, json1, json2.id, json2.brand}', '{long, tdate, text_intl, gptext_json, long, text_intl}', 'id', 'title');
This gptext.create_index
query creates the following fields:
<field name="__pk" type="long" indexed="true" stored="true"/>
<field name="__temp_field" type="text" multiValued="true" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="date_time" type="tdate" indexed="true" stored="false"/>
<field name="id" type="long" indexed="true" stored="true"/>
<field name="json1" type="gptext_json" indexed="true" stored="false"/>
<field name="json1.content" type="text_intls"/>
<field name="json1.description.content" type="text_intls"/>
<field name="json1.description.name" type="text_intls"/>
<field name="json1.description.pages" type="longs"/>
<field name="json1.id" type="longs"/>
<field name="json1.pages" type="longs"/>
<field name="json1.title" type="text_intls"/>
<field name="json2.brand" type="text_intl" multiValued="true" indexed="true" stored="false"/>
<field name="json2.id" type="long" multiValued="true" indexed="true" stored="false"/>
<field name="title" type="text_intl" indexed="true" stored="false"/>
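Field definitions like these are only populated once documents are indexed. A minimal sketch, assuming the index created above, that loads the sample rows and commits them so the json child fields become searchable:

```sql
-- Index the sample rows from the base table, then commit so the
-- json child fields (json1.title, json1.description.content, ...) are searchable
SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'gptext_test.wikipedia.articles');
SELECT * FROM gptext.commit_index('gptext_test.wikipedia.articles');
```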
Use gptext.add_field
to explicitly define the type when adding a field to an index.
This example adds the field json2
with the type gptext_json
:
select * from gptext.add_field('gptext_test.wikipedia.articles', 'json2', 'gptext_json');
This example adds only the child field json2.id
to Solr:
select * from gptext.add_field('gptext_test.wikipedia.articles', 'json2.id', 'long');
This example attempts to add the child field json1.description
with type gptext_json
:
select * from gptext.add_field('gptext_test.wikipedia.articles', 'json1.description', 'gptext_json');
The command fails with an error similar to:
WARNING: Don't support child field of json 'json1.description' to be defined as gptext_json!
ERROR: Add field operation failed.
The error occurs because a child field cannot be assigned a virtual type such as gptext_json
.
Users can drop a gptext_json
field, or a child field of a json/jsonb VMware Greenplum column. It is important to note that:
- If you drop a gptext_json field, all related child fields are also deleted.
- If you drop a child field of a gptext_json field, when you index new data that includes that child field, the deleted field is re-added to Solr.

Review the following examples:
This example drops the field json1
which is defined as gptext_json
:
select * from gptext.drop_field('gptext_test.wikipedia.articles', 'json1');
This example drops the field json1.content
, where json1
is defined as gptext_json
:
select * from gptext.drop_field('gptext_test.wikipedia.articles', 'json1.content');
Use the corresponding search field to search json data, just as you search other fields. For example:
select *
from wikipedia.articles a,
gptext.search(table(select 1 scatter by 1), 'gptext_test.wikipedia.articles',
'{!gptextqp} json1.title:weekend', null, 'rows=5') q
where q.id::int8 = a.id;
or
select *
from wikipedia.articles a,
gptext.search(table(select 1 scatter by 1), 'gptext_test.wikipedia.articles',
'{!gptextqp} json1.description.content:happy', null, 'rows=5') q
where q.id::int8 = a.id;
For documents stored within the database, Tanzu Greenplum text supports many more document types in addition to raw text, including PDF, Microsoft Word, XML, and images. For a full list of supported document types, see Apache Tika Supported Document Formats.
You can extract the textual information from these rich documents and make it searchable. For example, you can store many PDF documents in a table, index the documents with Tanzu Greenplum text, and then perform a text search that returns the IDs of the relevant documents. The next section provides an example that does exactly this.
The following example downloads two PDF files and inserts them into a table, indexes the table, and then searches the index to return the relevant document id:
# Create table
\c demo
CREATE SCHEMA documents;
CREATE TABLE documents.samples (
id BIGSERIAL primary key,
t text,
c bytea
) DISTRIBUTED BY (id);
# Download the PDF and insert it into the table
create extension plpythonu;
CREATE OR REPLACE FUNCTION download(url text) RETURNS bytea AS $$
import urllib
return urllib.urlopen(url).read()
$$ LANGUAGE plpythonu;
INSERT INTO documents.samples (t, c) VALUES ('sample_one', download('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'));
INSERT INTO documents.samples (t, c) VALUES ('sample_two', download('https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf'));
# Create index
SELECT * FROM gptext.create_index('documents', 'samples', '{id, t, c}', '{long, text_intl, gptext_rich_doc}', 'id', 't', true);
SELECT * FROM gptext.index(TABLE(SELECT * FROM documents.samples), 'demo.documents.samples');
SELECT * FROM gptext.commit_index('demo.documents.samples');
SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.documents.samples', 'c.content:dummy', null);
SELECT * FROM gptext.drop_index('demo.documents.samples');
Creating a Tanzu Greenplum text index for a partitioned VMware Greenplum table using gptext.create_index()
is the same as creating an index for a non-partitioned table. Supply the name of the root partition to create a single Tanzu Greenplum text index containing all partitions.
Tanzu Greenplum text recognizes a partitioned table and adds a __partition
field to the index. Then when you add documents to the index, Tanzu Greenplum text saves the child partition table name in the __partition
field. You can use the __partition
field to create Tanzu Greenplum text queries that search and filter by partition.
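As a minimal sketch of filtering by partition, the following query restricts a search to one child partition by passing a filter query on the __partition field (the search term and partition name are illustrative, and assume the demo.twitter.message index has been populated and committed):

```sql
-- Search only documents indexed from the August 2011 child partition
SELECT q.id, q.score
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.twitter.message',
     'iphone', '{__partition:message_1_prt_1}') q;
```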
Unlike VMware Greenplum, which manages child partitions as separate database tables, Tanzu Greenplum text does not create a separate Solr collection for each database partition because the larger number of Solr cores could adversely affect the capacity and performance of the Solr cluster.
The demo.twitter.message
table created in the Setting Up the Sample Database section is a partitioned table. See Searching Partitioned Tables for examples of searching partitions.
You can add new partitions to, and drop partitions from, VMware Greenplum partitioned tables. If you have created a Tanzu Greenplum text index on a partitioned table, when you add or drop partitions in the base database table, you must perform a parallel Tanzu Greenplum text index operation.
When a new partition is added, you can index the partition once its data is in place. Select rows directly from the newly added child partition table to index the data. First, use the gptext.partition_status()
function to find the names of the child partition tables.
=# SELECT * FROM gptext.partition_status('demo.twitter.message');
           partition_name           |    inherits_name     | level |  cons
------------------------------------+----------------------+-------+---------------------------------
 demo.twitter.message_1_prt_1       | demo.twitter.message |     1 | ((created_at >= '2011-08-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-09-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_2       | demo.twitter.message |     1 | ((created_at >= '2011-09-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-10-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_3       | demo.twitter.message |     1 | ((created_at >= '2011-10-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-11-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_4       | demo.twitter.message |     1 | ((created_at >= '2011-11-01 00:00:00'::timestamp without time zone) AND (created_at < '2011-12-01 00:00:00'::timestamp without time zone))
 demo.twitter.message_1_prt_dec2011 | demo.twitter.message |     1 | ((created_at >= '2011-12-01 00:00:00'::timestamp without time zone) AND (created_at < '2012-01-01 00:00:00'::timestamp without time zone))
(5 rows)
In the example above, a new partition with the name twitter.message_1_prt_dec2011
was added to the demo.twitter.message
table. The following statements add the data from the new partition to the Tanzu Greenplum text index and commit the changes.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message_1_prt_dec2011), 'demo.twitter.message');
dbid | num_docs
------+----------
3 | 109
2 | 128
(2 rows)
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
commit_index
--------------
t
(1 row)
The name of the new child partition table (excluding the database and schema names) is saved in the __partition
field in the index.
When a partition is dropped from a partitioned table, you can delete the partition's documents from the Tanzu Greenplum text index by specifying the partition name in the <search>
argument of the gptext.delete()
function. Be sure to commit the index after deleting the documents.
=# SELECT * FROM gptext.delete('demo.twitter.message', '__partition:message_1_prt_dec2011');
delete
--------
t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
commit_index
--------------
t
(1 row)