Working With VMware Greenplum Text External Indexes

A VMware Greenplum Text external index is an Apache Solr index you create in VMware Greenplum to index and search documents that reside outside of VMware Greenplum. External documents can be of many types, for example, PDF, Microsoft Word, XML, or HTML. Solr recognizes document types automatically, using code included from the Apache Tika project.

Note: Read about how Solr and Tika (the "Solr Cell" framework) extract and index document text and metadata at Uploading Data with Solr Cell using Apache Tika. See a list of supported document types at Supported Document Formats.

VMware Greenplum Text supports indexing external documents that are accessible by URL with an HTTP GET request. VMware Greenplum Text also supports indexing external documents stored in Hadoop, provides functions and utilities to specify required Hadoop configuration and authentication information.

To add external documents to an index, you supply VMware Greenplum Text with a list of URLs in an array or as a SQL SELECT statement. The URL becomes the unique id field in the Solr index.

How VMware Greenplum Text External Indexes Differ From Regular VMware Greenplum Text Indexes

External indexes exist entirely in Solr—there is no associated database table in VMware Greenplum. Because of this, the index name does not follow the database.schema.table pattern required for regular VMware Greenplum Text indexes. You can choose any name for an external index, but it must not contain periods. You can access a VMware Greenplum Text external index from any database in the VMware Greenplum system that has the VMware Greenplum Text schema installed.

VMware Greenplum Text provides the following alternate functions for working with external indexes:

gptext.create_index_external() – create an external index.
gptext.index_external() – add documents to the external index.
gptext.search_external() – search an external index.
gptext.highlight_external() – returns fragments of documents with matching search terms highlighted with markup tags.

The distribution policy for a regular VMware Greenplum Text index is the same as the underlying VMware Greenplum table, so that segments manage the same VMware Greenplum Text table data as the Solr index shard. A VMware Greenplum Text external index also has one shard per segment, but the documents are distributed among the segments using Solr compositeId routing, which allows Solr to choose the shard for a document. See Shards and Indexing Data in SolrCloud.

A regular VMware Greenplum Text index only indexes and stores the database table columns you specify. A VMware Greenplum Text external index stores and indexes the textual content of the file, as well as metadata fields that are members of the document type.

When an external document is added to the index, the content of the document is saved in the content field. The content field is stored in the index but it is not indexed.

VMware Greenplum Text copies the following fields to the text field, the default search field, which is indexed but not stored.

title
author
description
keywords
content
content_type
resourcename
url

To search the document content, therefore, search the text field, but to retrieve or highlight document contents, use the content field.

The following common metadata fields are indexed and stored:

title
subject
description
comments
author
keywords
category
resourcename
url
content_type
last_modified
links

A dynamic field named meta_* is also indexed and stored. This is a multi-valued field where Solr stores document-type-specific metadata. In search results, this field is returned as a JSON-formatted columnValue string. You can extract individual metadata by name using the gptext.gptext_retrieve_field() function.

Search results for external indexes include all fields saved with the documents, including all metadata fields. You can use the Solr field list option (fl=<field-list>) to limit the fields returned. You can also use SELECT <field-list> FROM gptext.search_external() to limit the fields returned, but it is more efficient to filter out the fields in Solr with the fl option than in the database session.

Authenticating with an External Document Source

The information in this section is applicable only to external document sources that require authentication.

If the external document source that you want to index requires authentication, you must provide the authentication configuration to VMware Greenplum Text. You must also use VMware Greenplum Text functions to explicitly log in to the external document source before indexing, and log out of the source after indexing completes.

Note: Authenticating is not required for searching an external document.

Uploading a Configuration to ZooKeeper

Before you use VMware Greenplum Text to index an external document source that requires authentication, you must upload configuration information to ZooKeeper. Use the gptext-external upload command to upload this information:

gptext-external upload -t <type> -p <config_dir> -c <config_name>

This table describes the options to the gptext-external upload command:

Option	Description
<type>	The type of the external document source. The supported <type>s are `ftp`, `hdfs`, and `s3`.
<config_dir>	The path to a directory that contains the configuration files. The configuration information that you provide in this directory will depend on the external document source <type>.
<config_name>	The name that you assign to the configuration information. You will provide this name when you log in to the external document source.

Note: Retain a local copy of <config_dir>. Should you need to update the configuration, you must edit a local copy of the file(s) and re-upload.

Configuring and Uploading FTP Authentication

You can add documents from an FTP server that requires authentication to a VMware Greenplum Text external index. To authenticate with the server create a configuration directory and add a file to it named login.txt. Add three lines to the login.txt file:

The name of the user to log in to the FTP server.
The password for the FTP user, in clear text.
The maximum number of FTP connections allowed from VMware Greenplum Text.

To upload configuration information for an authenticated FTP server:

Create a directory for the authentication configuration.
```
$ mkdir ftp_config
```
Create the login.txt file.
```
$ touch ftp_config/login.txt
```
Add the FTP user name, password, and the maximum number of FTP connections to create to the login.txt file on separate lines. For example:
```
$ echo "bill" > ftp_config/login.txt
$ echo "changeme" >> ftp_config/login.txt
$ echo "10" >> ftp_config/login.txt
```
Upload the configuration directory to ZooKeeper using the gptext-external upload command.
```
$ gptext-external upload -t ftp -p ./ftp_config -c ftp_bill_auth
```
This command maps the login.txt file in the ftp_conf/ directory to the name ftp_bill_auth.

The password is base64-encoded when stored in ZooKeeper. To protect the password delete the login.txt file after you have uploaded the configuration to ZooKeeper.

Configuring and Uploading Hadoop Authentication

When you access a Hadoop external document source, <config_dir> must include the following configuration files for <type> hdfs:

The core-site.xml and hdfs-site.xml configuration files from the Hadoop server.
A file named user.txt. This file contains a single line identifying the Hadoop username to use for authentication. If Kerberos is enabled in the Hadoop cluster, the username in user.txt must identify the Kerberos principal for the user.
If the Hadoop cluster is secured with Kerberos, also include the user's keytab file and the krb5.conf file for the Kerberos realm.

For example, to upload configuration information for a Hadoop external document store:

Create a directory for the authentication configuration files. For example:
```
$ mkdir hdfs_conf
```
Copy the core-site.xml and hdfs-site.xml configuration files from the Hadoop server to the configuration directory. The location of these files will differ for different Hadoop distributions. For example:
```
$ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/core-site.xml hdfs_conf/
$ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/hdfs-site.xml hdfs_conf/
```
Construct the user.txt file. For example, if the Hadoop username is bill:
```
$ touch hdfs_conf/user.txt
$ echo "bill" > hdfs_conf/user.txt
```
Upload the Hadoop authentication configuration files for user bill to ZooKeeper. For example:
```
$ gptext-external upload -t hdfs -p ./hdfs_conf -c hdfs_bill_auth
```
This command maps the configuration information you provided in the hdfs_conf/ directory to the name hdfs_bill_auth.

Configuring and Uploading S3 Authentication

You can add documents stored as objects in an S3 bucket to a VMware Greenplum Text external index. To authenticate with Amazon S3 or other S3-compatible storage, you upload the credentials for an account with access to the S3 bucket to ZooKeeper.

To upload authentication credentials for S3:

Create a directory for the authentication configuration.
```
$ mkdir s3_conf
```
Create the credential file.
```
$ touch s3_conf/credential
```
Add your account's S3 access key and secret key to the credential file on separate lines. There must be exactly two lines in the file. For example:
```
$ echo "<my-access-key>" > s3_conf/credential
$ echo "<my-secret-key>" >> s3_conf/credential
```

Upload the configuration directory you created to ZooKeeper using the gptext-external upload command.

$ gptext-external upload -t 's3' -p /home/gpadmin/s3_conf -c s3_auth
20180619:17:44:21:006505 gptext-external:mdw:gpadmin-[INFO]:-Execute GPText config.
20180619:17:44:21:006505 gptext-external:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20180619:17:44:22:006505 gptext-external:mdw:gpadmin-[INFO]:-Upload '/home/gpadmin/s3_conf' success.
20180619:17:44:22:006505 gptext-external:mdw:gpadmin-[INFO]:-Done.

Logging In to the External Document Source

Prior to indexing, you must explictly log in to an external document source that requires authentication. Use the gptext.external_login() function for this purpose:

gptext.external_login('<type>', '<type>://<url>', '<config_name>')

The table below describes the arguments to the gptext.external_login() function:

Option	Description
<type>	The type of the external document source. Valid values are `ftp`, `hdfs`, and `s3`.
<type>://<url>	The URL of the external document source <type>.
<config_name>	The <config_name> you provided when you uploaded the authentication configuration with `gptext-external upload`.

When you invoke the gptext.external_login() function, VMware Greenplum Text logs you in to the external document source as the user or account identified in the configuration directory you provided in the <config_name> configuration.

Note: You can log in to only one VMware Greenplum Text external document source for one external type at a time. You must explicitly log out before you can log in to another external document source of the same type.

Logging In to Hadoop

For example, to log in to a Hadoop document source using the hdfs_bill_auth authentication configuration you uploaded in the prior section:

SELECT * FROM gptext.external_login('hdfs', 'hdfs://<namenode_host_or_ip>:<hdfs_port>', 'hdfs_bill_auth');

Logging In to an FTP Server

The command is similar to log in to an ftp server.

=# SELECT * FROM gptext.external_login('ftp', 'ftp://<ftpserver_host_or_ip:<ftp_port>'), 'ftp_bill_auth');

It is not necessary to include :<ftp_port> in the URL if the server uses the default ftp port 21.

Logging In to S3 Storage

The connection URL for S3 storage has this format:

<protocol>://<s3-endpoint>[/<region>][/]

The values that you specify in the connection URL will depend on the S3 storage server to which you want to log in.

Logging In to Amazon S3

When you are accessing Amazon S3, the <protocol> is s3. The <s3-endpoint> is an Amazon S3 endpoint. If the endpoint starts with s3. or s3- and is followed by a region code—for example, s3-us-west-2.amazonaws.com or s3.us-east-1.amazonaws.com—the /<region> part of the URL is optional, and VMware Greenplum Text determines the region from the endpoint. The connection URL for an endpoint such as s3.dualstack.us-east-1.amazonaws.com, however, must include the /<region>, for example s3://s3.dualstack.us-east-1.amazonaws.com/us-east-1.

=# SELECT * FROM gptext.external_login('s3', 's3://s3.us-west-1.amazonaws.com/', 's3_auth');
external_login
----------------
t
(1 row)

Logging In to Other S3-Compatible Storage

When you access other S3-compatible storage, <protocol> identifies the login protocol. For example, when you access MinIO, the <protocol> may specify http or https. The <s3-endpoint> is an endpoint that identifies the server. The <region>/ part of the URL is optional, and depends on the configuration of the S3 storage server. Example log in command for MinIO:

=# SELECT * FROM gptext.external_login('s3', 'https://my.minio.server:9000', 's3_auth');
external_login
----------------
t
(1 row)

Logging Out of the External Document Source

To log out of an external document source, use the gptext.external_logout('<type>') function. For example, to log out of the Hadoop cluster that you are currently logged in to:

SELECT * FROM gptext.external_logout('hdfs');

Troubleshooting Authenticated Document Stores

If you run into problems logging in to or accessing documents in an authenticated Hadoop external document store, refer to Troubleshooting Hadoop Connection Problems.

Creating External Indexes

Use the gptext.create_index_external() function to create an external index.

This example creates an external index named gptext-docs.

=# SELECT * FROM gptext.create_index_external('gptext-docs');

An external index does not have a corresponding VMware Greenplum table, so the index name does not follow the database.schema.table pattern required for regular VMware Greenplum Text indexes. The only restriction is that the name for an external table must not contain periods.

By default, the external index is created with one shard for each VMware Greenplum segment. You can specify fewer shards by setting the VMware Greenplum Text gptext.idx_num_shards configuration parameter to the number of shards you want before you create the index. See Specifying the Number of Index Shards for more information about this option.

Adding Documents to an External Index

To add external documents to an external index, supply a list of URLs where Solr can retrieve the document to the gptext.index_external() function. URLs may be specified either in an array or as a SQL result set.

A hash of the URL is the document's ID in the index. If a URL has already been added to the index, the file is not reindexed. If you add two identical files retrieved from different URLs, both files are added to the index.

This example adds a single PDF document, specified in an array, to the gptext-docs index.

=# SELECT * FROM gptext.index_external(
      '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}', 
      'gptext-docs');
 dbid | num_docs
------+----------
    3 |        0
    2 |        1
(2 rows)

=# SELECT * FROM gptext.commit_index('gptext-docs');
 commit_index
--------------
 t
(1 row)

This example adds several HTML documents by selecting URLs from a database table.

=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
      id bigint,
      url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES 
    (1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'), 
    (2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'), 
    (3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
    (4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
    (5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
    (6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
    (7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
    (8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'), 
    (9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
    (10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
    (11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
    (12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
    (13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
    (14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
    (15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
      TABLE(SELECT url FROM gptext_html_docs), 
      'gptext-docs');
dbid | num_docs
------+----------
   3 |        6
   2 |        8
(2 rows)
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)

To add documents from an external document source that requires authentication, such as hdfs or an ftp server, log in to the external system with the gptext.external_login() function before you add the documents. With an authenticated document source, you can add all documents in a directory, using the gptext.external_index_dir() function. See the gptext.external_index_dir() function reference for an example.

Searching VMware Greenplum Text External Indexes

You can search VMware Greenplum Text external indexes with the standard gptext.search() function or with the gptext.search_external() function. The difference is that the gptext.search() function returns just the id, score, hs, and rf columns and the gptext.search_external() function by default also includes all of the content and metadata stored in the external index. You can use the Solr fl (field list) option with either function to set the actual fields that are included in the results.

Searching with gptext.search()

This simple gptext.search() example searches for "Solr" in the title field of the gptext-docs external index.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
                            id                             |   score   | hs | rf
-----------------------------------------------------------+-----------+----+----
 http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 |    |
(1 row)

To see the title of the document that matched the search, you must request the field with a fl option.

=# SELECT * FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 
'title:Solr', null, 'fl=title');
id                             |   score   | hs |
         rf
-----------------------------------------------------------+-----------+----+----------------
--------------------------------------------------------------------------------------------
http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 |    | {"columnValue":
[{"name":"title","value":"VMware Greenplum Text and Solr Data Type Mappings |\n    VMware Greenplum Text Docs"}]}
(1 row)

The title field specified in the field list of the Solr options argument is returned in the rf column in a JSON document. If you want to return the title in its own result column, you can use the gptext.gptext_retrieve_field() function to extract the text from the JSON document. The expanded display (\x on) psql option in the following examples makes the results easier to read.

=# \x on
Expanded display is on.
demo=# SELECT id, score, gptext.gptext_retrieve_field(rf, 'title') title
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs','title:Solr', 
     null, 'fl=title');
-[ RECORD 1 ]----------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/type_ref.html
score | 0.9745732
title | GPText and Solr Data Type Mappings |
      |     VMware Greenplum Text Docs

Searching with gptext.search_external()

The gptext.search_external() function, by default, returns a standard set of metadata fields and the content of the document. Depending on the content type of the document, gptext.search_external() returns additional metadata as a JSON document in the meta column.

The following example search returns all fields stored in the gptext-docs index for the document with the word "Installing" in the title field. The content and meta column values in the example results are truncated.

=# SELECT * FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 
        'title:Installing', null, null);
-[ RECORD 1 ]------------------------------------------------------------------------------------
id            | http://gptext.docs.pivotal.io/latest/topics/installing.html
title         | Installing VMware Greenplum Text |
              |     VMware Greenplum Text Docs
subject       | 
description   | 
comments      | 
author        | 
keywords      | 
category      | 
resourcename  | 
url           | 
content_type  | text/html; charset=UTF-8
last_modified | 
links         | 
sha256        | F1182EE7D993CB494CAB8480DA47EA2F82DE8F7DCCC4E76745B6FA5FD7E73FC8
content       | ... 
score         | 1.4449482G
meta          | {"columnValue":[{"name":"meta_a","value":"..."},{"name":"meta_content_encoding","value":"UTF-8"},
{"name":"meta_dc_title","value":"Installing GPText |\n    VMware Greenplum Text Docs"},
{"name":"meta_div","value":"..."},{"name":"meta_form","value":"application/x-www-form-urlencoded,get,/search"},
...

You usually only want a subset of the fields in the index. You can specify the fields you want in the SELECT clause or by adding the fl Solr option in the options argument of the gptext.search_external() function. Even if you list the desired fields in the SELECT clause, specifying a field list in the options argument is more efficient because it reduces the amount of data Solr transfers to VMware Greenplum.

This example searches for HTML documents that have the word “Indexes” in the title field. A filter query chooses documents with “html” in the content_type field. The field list in the options argument contains just the title field. The id, score, and meta fields are always included in search results.

=# SELECT id, title, score FROM gptext.search_external(
    TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:indexes',
    '{content_type:*html*}', 'fl=title');
                               id                                |                 title                  |   score
-----------------------------------------------------------------+----------------------------------------+-----------
 http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html    | Working With VMware Greenplum Text External Indexes | | 1.1593812
                                                                 :     VMware Greenplum Text Docs
 http://gptext.docs.pivotal.io/latest/topics/managed-schema.html | Customizing VMware Greenplum Text Indexes |           | 1.1191859
                                                                 :     VMware Greenplum Text Docs
 http://gptext.docs.pivotal.io/latest/topics/indexes.html        | Working With VMware Greenplum Text Indexes |          | 0.8013617
                                                                 :     VMware Greenplum Text Docs
 http://gptext.docs.pivotal.io/latest/topics/queries.html        | Querying VMware Greenplum Text Indexes |              | 0.8013617
                                                                 :     VMware Greenplum Text Docs
(4 rows)

Highlighting External Index Search Results

Solr highlighting includes fragments of documents that match a search query in the search results, with the query terms highlighted with markup tags. Fragments are also called snippets or passages.

Highlighting with VMware Greenplum Text external indexes is a different process than highlighting with regular VMware Greenplum Text indexes. Because the text and all metadata of external documents are stored in an external index, the markup tags can be applied in Solr before returning search results to VMware Greenplum. With regular indexes, highlighting can be performed for stored fields and for fields with full term vectors and postings with term vectors enabled, and the search results must be joined with the database table so that the gptext.highlight() function can insert the markup tags into the text. You can, however, configure a regular VMware Greenplum Text index so that you store the fields in the index and perform highlighting in Solr. This requires editing the index's solrconfig.xml and managed-schema configuration files. See Highlighting Terms in Stored Fields for steps to enable this configuration.

Solr highlighting is performed by a search handler called a HighlightComponent, configured in the managed-schema configuration file. Solr provides highlighters that work somewhat differently and have different configurable options. VMware Greenplum Text uses the Unified Highlighter by default. See Highlighting at the Apache Solr website to learn more about Solr highlighting and the Unified Highlighter.

You can enable highlighting for VMware Greenplum Text external indexes in the Solr options argument of a gptext.search() query. Using this method, the highlighted text is returned in a result column named hs, which contains a JSON-formatted array of highlighted fragments. You can access the fragments using the gptext.gptext_retrieve_field() function.

In addition, VMware Greenplum Text provides the gptext.highlight_external() function, which unpacks highlighted fragments in the search results into separate columns in the VMware Greenplum search result set.

First, let's look at the results of a search query with highlighting enabled using the Solr options argument in the gptext.search() function. This statement searches the gptext-docs external index for documents containing the term "apache". The Solr options are:

hl=true – enables highlighting.
hl.fl=content title – the content field will be highlighted.
rows=1 – return just one document per segment.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
     'apache', '{content_type:*html*}', 'hl=true&hl.fl=content&rows=1')
-[ RECORD 1 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/ha.html
score | 0.4548784
hs    | {"columnValue":[{"name":"content","value":"Refer to the \u003cem\u003eApache\u003c/em\u003e 
SolrCloud documentation for help using the SolrCloud Dashboard. \n\n\n   "}]}
rf    | 
-[ RECORD 2 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
score | 0.05978464
hs    | {"columnValue":[{"name":"content","value":"Remarks \n\n\n When you add an external document 
to the index, \u003cem\u003eApache\u003c/em\u003e Tika extracts a core set of metadata from the 
document, the columns listed in the   Return type   section. "}]}
rf    | 
-[ RECORD 3 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
score | 1.2426406
hs    | {"columnValue":[{"name":"content","value":"Solr recognizes document types automatically, 
using code included from the  \u003cem\u003eApache\u003c/em\u003e Tika  project. \n\n\n "}]}
rf    | 
-[ RECORD 4 ]--------------------------------------------------------------------------------
id    | http://gptext.docs.pivotal.io/latest/topics/administering.html
score | 0.8155949
hs    | {"columnValue":[{"name":"content","value":"ZooKeeper Administration \n\n\n 
\u003cem\u003eApache\u003c/em\u003e ZooKeeper enables coordination between the 
\u003cem\u003eApache\u003c/em\u003e Solr and VMware Greenplum Text distributed processes through a shared 
namespace that resembles a file system. "}]}
rf    |

In this example the hs column has a single fragment from each of the returned documents. You can use the hl.snippets and hl.fragsize Solr options to set, respectively, the maximum number of fragments to return and the approximate number of characters in each fragment. Other options you can use to control how the Unified Highlighter chooses fragments are hl.bs.type and hl.maxAnalyzedChars. The hl.bs.type option specifies how the highlighter breaks the text into fragments. The default is SENTENCE. Other valid choices are SEPARATOR, SENTENCE, WORD, CHARACTER, LINE, or WHOLE. The hl.maxAnalyzedChars option, default 51200, is the maximum number of characters to analyze for highlighting.

See Highlighting in the Solr documentation for tables of options you can set and their default values.

For an external index, Solr returns the highlighted fragments in a columnValue array in the hs result column. You can use the gptext.gptext_retrieve_field() function in the SELECT list to extract the fragments from the array.

=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'content') AS content 
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
      'apache', '{content_type:*html*}', 
      'hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=75&rows=1');
-[ RECORD 1 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
score   | 0.05978464
content | Remarks 
        | 
        | 
        |  When you add an external document to the index, <em>Apache</em> Tika extracts a core 
set of metadata from the document, the columns listed in the   Return type   section. 
-[ RECORD 2 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
score   | 1.2426406
content | Solr recognizes document types automatically, using code included from the  <em>Apache</em> 
Tika  project. 
        | 
        | 
        |  ,See  Highlighting  at the <em>Apache</em> Solr website to learn more about Solr highlighting 
and the Unified Highlighter. 
        | 
        | 
        |  ,This statement searches the  gptext-docs  external index for documents containing the term 
“<em>apache</em>”. 
-[ RECORD 3 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/administering.html
score   | 0.8155949
content | ZooKeeper Administration 
        | 
        | 
        |  <em>Apache</em> ZooKeeper enables coordination between the <em>Apache</em> Solr and VMware 
Greennplum Text distributed processes through a shared namespace that resembles a file system. 
-[ RECORD 4 ]--------------------------------------------------------------------------------
id      | http://gptext.docs.pivotal.io/latest/topics/ha.html
score   | 0.4548784
content | Refer to the <em>Apache</em> SolrCloud documentation for help using the SolrCloud Dashboard. 
        |

VMware Greenplum Text found four documents (1 for each segment) containing the string "apache" and extracted from each document a fragment of the content field containing the term.

Indexing Text Embedded in Images

Using optical character recognition (OCR), you can add text extracted from images to VMware Greenplum Text external indexes.

When the open source Tesseract OCR Engine software is installed on the hosts in the VMware Greenplum cluster, Apache Tika calls Tesseract to extract text from image files (for example, GIF, TIFF, JPG, or PNG files) and for images embedded in documents such as Word documents. The extracted text is added to the index with the document.

Running OCR on images adds significant overhead and time to VMware Greenplum Text indexing operations. You can prevent Tika from running OCR by uninstalling Tesseract or by configuring indexes to not run Tesseract. See Deactivating Tesseract with a Tika Configuration File for the steps to configure VMware Greenplum Text indexes to exclude OCR.

Install Tesseract OCR Engine on All VMware Greenplum Hosts

You can install Tesseract OCR by compiling and installing the source code or by installing a Tesseract package with yum.

To compile and install Tesseract from source, follow the compilation instructions on the Tesseract OCR GitHub wiki.
To install with a package manager, follow installation instructions on the Tesseract OCR GitHub wiki.

Be sure to install Tesseract on every host in the VMware Greenplum cluster.

Indexing Extracted Image Content

You can test the Tesseract OCR engine at the command line with an image file containing embedded text.

$ tesseract <image-file> ocr-out
$ cat ocr-out.txt

Any text Tesseract recognizes is saved in the ocr-out.txt file.

To test that Tesseract is called when adding documents to a VMware Greenplum Text external index, you can use the gptext.extract_rich_doc() function. This VMware Greenplum Text function returns the content Apache Tika extracts from a document but does not add it to the index. You need a VMware Greenplum Text external index and a URL for an image file containing text for Tesseract to extract. If the image file is in an authenticated ftp, hdfs, or s3 document store, make sure you first authenticate with gptext.external_login().

This example creates an external index and calls gptext.extract_rich_doc() to extract text from an image file with an HTML URL.

=# SELECT * FROM gptext.create_index_external('ocr-test');
INFO:  Created index ocr-test
 create_index_external
-----------------------
 t
(1 row)
=# SELECT * FROM gptext.extract_rich_doc('ocr-test', 'http://gptext.docs.pivotal.io/300/ocrtest.png');
stream_name                              | title | author | keywords | created | modified |           content
-----------------------------------------------------------------------+-------+--------+----------+---------+----------+------------------------------
http://docs-gptext-develop-staging.cfapps.io/300/graphics/ocrtest.png |       |        |          |         |          |
... (empty lines omitted)
                                                                                          : The quick brown fox
                                                                                          : jumps over the lazy dog.
                                                                                          :
                                                                                          : The five boxing wizards jump
                                                                                          : quickly.
... (empty lines omitted)
(1 row)

When Tesseract is installed, Apache Tika automatically runs Tesseract when a document URL references an image file or when an image is embedded in another document type. Text extracted from images is included with documents whenever you call the VMware Greenplum Text gptext.index_external() or gptext.index_external_dir() functions.

Deactivating OCR With a Tika Configuration File

If you do not need to index text embedded in images for a VMware Greenplum Text index, follow these steps to exclude Tesseract OCR from the document indexing process.

On the master host, create a tika.xml file with the following content.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

Copy the tika.xml file to each VMware Greenplum host. This gpscp example copies the tika.xml file to the /home/gpadmin directory on every host listed in the hostlist file.
```
$ gpscp -f hostlist tika.xml =:/home/gpadmin
```

Modify the solrconfig.xml configuration file for each external index that does not require OCR.

$ gptext-config edit -i <external-index-name> -f solrconfig.xml

Search for the following /update/extract request handler, and insert a <str> element containing the path to the tika.xml file as shown here.

<requestHandler name="/update/extract" startup="lazy" class="com.emc.solr.handler.extraction.SHA256CheckExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">meta_</str>
    <str name="captureAttr">true</str>
  </lst>
  <str name="tika.config">/home/gpadmin/tika.xml</str>
</requestHandler>