A Tanzu Greenplum text external index is an Apache Solr index you create in VMware Greenplum to index and search documents that reside outside of VMware Greenplum. External documents can be of many types, for example, PDF, Microsoft Word, XML, or HTML. Solr recognizes document types automatically, using code included from the Apache Tika project.
Note: Read about how Solr and Tika (the "Solr Cell" framework) extract and index document text and metadata at Uploading Data with Solr Cell using Apache Tika. See a list of supported document types at Supported Document Formats.
Tanzu Greenplum text supports indexing external documents that are accessible by URL with an HTTP GET request. Tanzu Greenplum text also supports indexing external documents stored in Hadoop, provides functions and utilities to specify required Hadoop configuration and authentication information.
To add external documents to an index, you supply Tanzu Greenplum text with a list of URLs in an array or as a SQL SELECT
statement. The URL becomes the unique id field in the Solr index.
External indexes exist entirely in Solr—there is no associated database table in VMware Greenplum. Because of this, the index name does not follow the database.schema.table
pattern required for regular Tanzu Greenplum text indexes. You can choose any name for an external index, but it must not contain periods. You can access a Tanzu Greenplum text external index from any database in the VMware Greenplum system that has the Tanzu Greenplum text schema installed.
Tanzu Greenplum text provides the following alternate functions for working with external indexes:
gptext.create_index_external()
– create an external index.gptext.index_external()
– add documents to the external index.gptext.search_external()
– search an external index.gptext.highlight_external()
– returns fragments of documents with matching search terms highlighted with markup tags.The distribution policy for a regular Tanzu Greenplum text index is the same as the underlying VMware Greenplum table, so that segments manage the same Tanzu Greenplum text table data as the Solr index shard. A Tanzu Greenplum text external index also has one shard per segment, but the documents are distributed among the segments using Solr compositeId routing, which allows Solr to choose the shard for a document. See Shards and Indexing Data in SolrCloud.
A regular Tanzu Greenplum text index only indexes and stores the database table columns you specify. A Tanzu Greenplum text external index stores and indexes the textual content of the file, as well as metadata fields that are members of the document type.
When an external document is added to the index, the content of the document is saved in the content
field. The content
field is stored in the index but it is not indexed.
Tanzu Greenplum text copies the following fields to the text
field, the default search field, which is indexed but not stored.
title
author
description
keywords
content
content_type
resourcename
url
To search the document content, therefore, search the text
field, but to retrieve or highlight document contents, use the content
field.
The following common metadata fields are indexed and stored:
title
subject
description
comments
author
keywords
category
resourcename
url
content_type
last_modified
links
A dynamic field named meta_*
is also indexed and stored. This is a multi-valued field where Solr stores document-type-specific metadata. In search results, this field is returned as a JSON-formatted columnValue
string. You can extract individual metadata by name using the gptext.gptext_retrieve_field()
function.
Search results for external indexes include all fields saved with the documents, including all metadata fields. You can use the Solr field list option (fl=<field-list>
) to limit the fields returned. You can also use SELECT <field-list> FROM gptext.search_external()
to limit the fields returned, but it is more efficient to filter out the fields in Solr with the fl
option than in the database session.
The information in this section is applicable only to external document sources that require authentication.
If the external document source that you want to index requires authentication, you must provide the authentication configuration to Tanzu Greenplum text. You must also use Tanzu Greenplum text functions to explicitly log in to the external document source before indexing, and log out of the source after indexing completes.
NoteAuthenticating is not required for searching an external document.
Before you use Tanzu Greenplum text to index an external document source that requires authentication, you must upload configuration information to ZooKeeper. Use the gptext-external upload
command to upload this information:
gptext-external upload -t <type> -p <config_dir> -c <config_name>
This table describes the options to the gptext-external upload
command:
Option | Description |
---|---|
<type> | The type of the external document source. The supported <type>s are ftp , hdfs , and s3 . |
<config_dir> | The path to a directory that contains the configuration files. The configuration information that you provide in this directory will depend on the external document source <type>. |
<config_name> | The name that you assign to the configuration information. You will provide this name when you log in to the external document source. |
NoteRetain a local copy of <config_dir>. Should you need to update the configuration, you must edit a local copy of the file(s) and re-upload.
You can add documents from an FTP server that requires authentication to a Tanzu Greenplum text external index. To authenticate with the server create a configuration directory and add a file to it named login.txt
. Add three lines to the login.txt
file:
To upload configuration information for an authenticated FTP server:
Create a directory for the authentication configuration.
$ mkdir ftp_config
Create the login.txt
file.
$ touch ftp_config/login.txt
Add the FTP user name, password, and the maximum number of FTP connections to create to the login.txt
file on separate lines. For example:
$ echo "bill" > ftp_config/login.txt
$ echo "changeme" >> ftp_config/login.txt
$ echo "10" >> ftp_config/login.txt
Upload the configuration directory to ZooKeeper using the gptext-external upload
command.
$ gptext-external upload -t ftp -p ./ftp_config -c ftp_bill_auth
This command maps the login.txt
file in the ftp_conf/
directory to the name ftp_bill_auth
.
The password is base64-encoded when stored in ZooKeeper. To protect the password delete the login.txt
file after you have uploaded the configuration to ZooKeeper.
When you access a Hadoop external document source, <config_dir> must include the following configuration files for <type> hdfs
:
core-site.xml
and hdfs-site.xml
configuration files from the Hadoop server.user.txt
. This file contains a single line identifying the Hadoop username to use for authentication. If Kerberos is enabled in the Hadoop cluster, the username in user.txt
must identify the Kerberos principal for the user.keytab
file and the krb5.conf
file for the Kerberos realm.For example, to upload configuration information for a Hadoop external document store:
Create a directory for the authentication configuration files. For example:
$ mkdir hdfs_conf
Copy the core-site.xml
and hdfs-site.xml
configuration files from the Hadoop server to the configuration directory. The location of these files will differ for different Hadoop distributions. For example:
$ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/core-site.xml hdfs_conf/
$ scp hdfsuser@hdfsnamenode:/etc/hadoop/conf/hdfs-site.xml hdfs_conf/
Construct the user.txt
file. For example, if the Hadoop username is bill
:
$ touch hdfs_conf/user.txt
$ echo "bill" > hdfs_conf/user.txt
Upload the Hadoop authentication configuration files for user bill
to ZooKeeper. For example:
$ gptext-external upload -t hdfs -p ./hdfs_conf -c hdfs_bill_auth
This command maps the configuration information you provided in the hdfs_conf/
directory to the name hdfs_bill_auth
.
You can add documents stored as objects in an S3 bucket to a Tanzu Greenplum text external index. To authenticate with Amazon S3 or other S3-compatible storage, you upload the credentials for an account with access to the S3 bucket to ZooKeeper.
To upload authentication credentials for S3:
Create a directory for the authentication configuration.
$ mkdir s3_conf
Create the credential
file.
$ touch s3_conf/credential
Add your account's S3 access key and secret key to the credential
file on separate lines. There must be exactly two lines in the file. For example:
$ echo "<my-access-key>" > s3_conf/credential
$ echo "<my-secret-key>" >> s3_conf/credential
Upload the configuration directory you created to ZooKeeper using the gptext-external upload
command.
$ gptext-external upload -t 's3' -p /home/gpadmin/s3_conf -c s3_auth
20180619:17:44:21:006505 gptext-external:mdw:gpadmin-[INFO]:-Execute GPText config.
20180619:17:44:21:006505 gptext-external:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20180619:17:44:22:006505 gptext-external:mdw:gpadmin-[INFO]:-Upload '/home/gpadmin/s3_conf' success.
20180619:17:44:22:006505 gptext-external:mdw:gpadmin-[INFO]:-Done.
Prior to indexing, you must explictly log in to an external document source that requires authentication. Use the gptext.external_login()
function for this purpose:
gptext.external_login('<type>', '<type>://<url>', '<config_name>')
The table below describes the arguments to the gptext.external_login()
function:
Option | Description |
---|---|
<type> | The type of the external document source. Valid values are ftp , hdfs , and s3 . |
<type>://<url> | The URL of the external document source <type>. |
<config_name> | The <config_name> you provided when you uploaded the authentication configuration with gptext-external upload . |
When you invoke the gptext.external_login()
function, Tanzu Greenplum text logs you in to the external document source as the user or account identified in the configuration directory you provided in the <config_name> configuration.
NoteYou can log in to only one Tanzu Greenplum text external document source for one external type at a time. You must explicitly log out before you can log in to another external document source of the same type.
For example, to log in to a Hadoop document source using the hdfs_bill_auth
authentication configuration you uploaded in the prior section:
SELECT * FROM gptext.external_login('hdfs', 'hdfs://<namenode_host_or_ip>:<hdfs_port>', 'hdfs_bill_auth');
The command is similar to log in to an ftp server.
=# SELECT * FROM gptext.external_login('ftp', 'ftp://<ftpserver_host_or_ip:<ftp_port>'), 'ftp_bill_auth');
It is not necessary to include :<ftp_port>
in the URL if the server uses the default ftp port 21.
The connection URL for S3 storage has this format:
<protocol>://<s3-endpoint>[/<region>][/]
The values that you specify in the connection URL will depend on the S3 storage server to which you want to log in.
When you are accessing Amazon S3, the <protocol>
is s3
. The <s3-endpoint>
is an Amazon S3 endpoint. If the endpoint starts with s3.
or s3-
and is followed by a region code—for example, s3-us-west-2.amazonaws.com
or s3.us-east-1.amazonaws.com
—the /<region>
part of the URL is optional, and Tanzu Greenplum text determines the region from the endpoint. The connection URL for an endpoint such as s3.dualstack.us-east-1.amazonaws.com
, however, must include the /<region>
, for example s3://s3.dualstack.us-east-1.amazonaws.com/us-east-1
.
=# SELECT * FROM gptext.external_login('s3', 's3://s3.us-west-1.amazonaws.com/', 's3_auth');
external_login
----------------
t
(1 row)
When you access other S3-compatible storage, <protocol>
identifies the login protocol. For example, when you access MinIO, the <protocol>
may specify http
or https
. The <s3-endpoint>
is an endpoint that identifies the server. The <region>/
part of the URL is optional, and depends on the configuration of the S3 storage server. Example log in command for MinIO:
=# SELECT * FROM gptext.external_login('s3', 'https://my.minio.server:9000', 's3_auth');
external_login
----------------
t
(1 row)
To log out of an external document source, use the gptext.external_logout('<type>')
function. For example, to log out of the Hadoop cluster that you are currently logged in to:
SELECT * FROM gptext.external_logout('hdfs');
If you run into problems logging in to or accessing documents in an authenticated Hadoop external document store, refer to Troubleshooting Hadoop Connection Problems.
Use the gptext.create_index_external()
function to create an external index.
This example creates an external index named gptext-docs
.
=# SELECT * FROM gptext.create_index_external('gptext-docs');
An external index does not have a corresponding VMware Greenplum table, so the index name does not follow the database.schema.table
pattern required for regular Tanzu Greenplum text indexes. The only restriction is that the name for an external table must not contain periods.
By default, the external index is created with one shard for each VMware Greenplum segment. You can specify fewer shards by setting the Tanzu Greenplum text gptext.idx_num_shards
configuration parameter to the number of shards you want before you create the index. See Specifying the Number of Index Shards for more information about this option.
To add external documents to an external index, supply a list of URLs where Solr can retrieve the document to the gptext.index_external()
function. URLs may be specified either in an array or as a SQL result set.
A hash of the URL is the document's ID in the index. If a URL has already been added to the index, the file is not reindexed. If you add two identical files retrieved from different URLs, both files are added to the index.
This example adds a single PDF document, specified in an array, to the gptext-docs
index.
=# SELECT * FROM gptext.index_external(
'{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf}',
'gptext-docs');
dbid | num_docs
------+----------
3 | 0
2 | 1
(2 rows)
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)
This example adds several HTML documents by selecting URLs from a database table.
=# DROP TABLE IF EXISTS gptext_html_docs;
=# CREATE TABLE gptext_html_docs (
id bigint,
url text)
DISTRIBUTED BY (id);
CREATE TABLE
=# INSERT INTO gptext_html_docs VALUES
(1, 'http://gptext.docs.pivotal.io/latest/topics/administering.html'),
(2, 'http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html'),
(3, 'http://gptext.docs.pivotal.io/latest/topics/function_ref.html'),
(4, 'http://gptext.docs.pivotal.io/latest/topics/guc_ref.html'),
(5, 'http://gptext.docs.pivotal.io/latest/topics/ha.html'),
(6, 'http://gptext.docs.pivotal.io/latest/topics/index.html'),
(7, 'http://gptext.docs.pivotal.io/latest/topics/indexes.html'),
(8, 'http://gptext.docs.pivotal.io/latest/topics/intro.html'),
(9, 'http://gptext.docs.pivotal.io/latest/topics/managed-schema.html'),
(10, 'http://gptext.docs.pivotal.io/latest/topics/performance.html'),
(11, 'http://gptext.docs.pivotal.io/latest/topics/queries.html'),
(12, 'http://gptext.docs.pivotal.io/latest/topics/type_ref.html'),
(13, 'http://gptext.docs.pivotal.io/latest/topics/upgrading.html'),
(14, 'http://gptext.docs.pivotal.io/latest/topics/utility_ref.html'),
(15, 'http://gptext.docs.pivotal.io/latest/topics/installing.html');
INSERT 0 15
=# SELECT * FROM gptext.index_external(
TABLE(SELECT url FROM gptext_html_docs),
'gptext-docs');
dbid | num_docs
------+----------
3 | 6
2 | 8
(2 rows)
=# SELECT * FROM gptext.commit_index('gptext-docs');
commit_index
--------------
t
(1 row)
To add documents from an external document source that requires authentication, such as hdfs or an ftp server, log in to the external system with the gptext.external_login()
function before you add the documents. With an authenticated document source, you can add all documents in a directory, using the gptext.external_index_dir()
function. See the gptext.external_index_dir()
function reference for an example.
You can search Tanzu Greenplum text external indexes with the standard gptext.search()
function or with the gptext.search_external()
function. The difference is that the gptext.search()
function returns just the id
, score
, hs
, and rf
columns and the gptext.search_external()
function by default also includes all of the content and metadata stored in the external index. You can use the Solr fl
(field list) option with either function to set the actual fields that are included in the results.
This simple gptext.search()
example searches for "Solr" in the title
field of the gptext-docs
external index.
=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
id | score | hs | rf
-----------------------------------------------------------+-----------+----+----
http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 | |
(1 row)
To see the title of the document that matched the search, you must request the field with a fl
option.
=# SELECT * FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
'title:Solr', null, 'fl=title');
id | score | hs |
rf
-----------------------------------------------------------+-----------+----+----------------
--------------------------------------------------------------------------------------------
http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.9745732 | | {"columnValue":
[{"name":"title","value":"VMware Greenplum Text and Solr Data Type Mappings |\n VMware Greenplum Text Docs"}]}
(1 row)
The title
field specified in the field list of the Solr options argument is returned in the rf
column in a JSON document. If you want to return the title in its own result column, you can use the gptext.gptext_retrieve_field()
function to extract the text from the JSON document. The expanded display (\x on
) psql option in the following examples makes the results easier to read.
=# \x on
Expanded display is on.
demo=# SELECT id, score, gptext.gptext_retrieve_field(rf, 'title') title
FROM gptext.search (TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs','title:Solr',
null, 'fl=title');
-[ RECORD 1 ]----------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/type_ref.html
score | 0.9745732
title | GPText and Solr Data Type Mappings |
| VMware Greenplum Text Docs
The gptext.search_external()
function, by default, returns a standard set of metadata fields and the content of the document. Depending on the content type of the document, gptext.search_external()
returns additional metadata as a JSON document in the meta
column.
The following example search returns all fields stored in the gptext-docs
index for the document with the word "Installing" in the title field. The content
and meta
column values in the example results are truncated.
=# SELECT * FROM gptext.search_external(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
'title:Installing', null, null);
-[ RECORD 1 ]------------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/installing.html
title | Installing VMware Greenplum Text |
| VMware Greenplum Text Docs
subject |
description |
comments |
author |
keywords |
category |
resourcename |
url |
content_type | text/html; charset=UTF-8
last_modified |
links |
sha256 | F1182EE7D993CB494CAB8480DA47EA2F82DE8F7DCCC4E76745B6FA5FD7E73FC8
content | ...
score | 1.4449482G
meta | {"columnValue":[{"name":"meta_a","value":"..."},{"name":"meta_content_encoding","value":"UTF-8"},
{"name":"meta_dc_title","value":"Installing GPText |\n VMware Greenplum Text Docs"},
{"name":"meta_div","value":"..."},{"name":"meta_form","value":"application/x-www-form-urlencoded,get,/search"},
...
You usually only want a subset of the fields in the index. You can specify the fields you want in the SELECT
clause or by adding the fl
Solr option in the options
argument of the gptext.search_external()
function. Even if you list the desired fields in the SELECT
clause, specifying a field list in the options argument is more efficient because it reduces the amount of data Solr transfers to VMware Greenplum.
This example searches for HTML documents that have the word “Indexes” in the title
field. A filter query chooses documents with “html” in the content_type
field. The field list in the options
argument contains just the title
field. The id
, score
, and meta
fields are always included in search results.
=# SELECT id, title, score FROM gptext.search_external(
TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:indexes',
'{content_type:*html*}', 'fl=title');
id | title | score
-----------------------------------------------------------------+----------------------------------------+-----------
http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html | Working With VMware Greenplum Text External Indexes | | 1.1593812
: VMware Greenplum Text Docs
http://gptext.docs.pivotal.io/latest/topics/managed-schema.html | Customizing VMware Greenplum Text Indexes | | 1.1191859
: VMware Greenplum Text Docs
http://gptext.docs.pivotal.io/latest/topics/indexes.html | Working With VMware Greenplum Text Indexes | | 0.8013617
: VMware Greenplum Text Docs
http://gptext.docs.pivotal.io/latest/topics/queries.html | Querying VMware Greenplum Text Indexes | | 0.8013617
: VMware Greenplum Text Docs
(4 rows)
Solr highlighting includes fragments of documents that match a search query in the search results, with the query terms highlighted with markup tags. Fragments are also called snippets or passages.
Highlighting with Tanzu Greenplum text external indexes is a different process than highlighting with regular Tanzu Greenplum text indexes. Because the text and all metadata of external documents are stored in an external index, the markup tags can be applied in Solr before returning search results to VMware Greenplum. With regular indexes, highlighting can be performed for stored fields and for fields with full term vectors and postings with term vectors enabled, and the search results must be joined with the database table so that the gptext.highlight()
function can insert the markup tags into the text. You can, however, configure a regular Tanzu Greenplum text index so that you store the fields in the index and perform highlighting in Solr. This requires editing the index's solrconfig.xml
and managed-schema
configuration files. See Highlighting Terms in Stored Fields for steps to enable this configuration.
Solr highlighting is performed by a search handler called a HighlightComponent
, configured in the managed-schema
configuration file. Solr provides highlighters that work somewhat differently and have different configurable options. Tanzu Greenplum text uses the Unified Highlighter by default. See Highlighting at the Apache Solr website to learn more about Solr highlighting and the Unified Highlighter.
You can enable highlighting for Tanzu Greenplum text external indexes in the Solr options argument of a gptext.search()
query. Using this method, the highlighted text is returned in a result column named hs
, which contains a JSON-formatted array of highlighted fragments. You can access the fragments using the gptext.gptext_retrieve_field()
function.
In addition, Tanzu Greenplum text provides the gptext.highlight_external()
function, which unpacks highlighted fragments in the search results into separate columns in the VMware Greenplum search result set.
First, let's look at the results of a search query with highlighting enabled using the Solr options argument in the gptext.search()
function. This statement searches the gptext-docs
external index for documents containing the term "apache". The Solr options are:
hl=true
– enables highlighting.hl.fl=content title
– the content
field will be highlighted.rows=1
– return just one document per segment.=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
'apache', '{content_type:*html*}', 'hl=true&hl.fl=content&rows=1')
-[ RECORD 1 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/ha.html
score | 0.4548784
hs | {"columnValue":[{"name":"content","value":"Refer to the \u003cem\u003eApache\u003c/em\u003e
SolrCloud documentation for help using the SolrCloud Dashboard. \n\n\n "}]}
rf |
-[ RECORD 2 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
score | 0.05978464
hs | {"columnValue":[{"name":"content","value":"Remarks \n\n\n When you add an external document
to the index, \u003cem\u003eApache\u003c/em\u003e Tika extracts a core set of metadata from the
document, the columns listed in the Return type section. "}]}
rf |
-[ RECORD 3 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
score | 1.2426406
hs | {"columnValue":[{"name":"content","value":"Solr recognizes document types automatically,
using code included from the \u003cem\u003eApache\u003c/em\u003e Tika project. \n\n\n "}]}
rf |
-[ RECORD 4 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/administering.html
score | 0.8155949
hs | {"columnValue":[{"name":"content","value":"ZooKeeper Administration \n\n\n
\u003cem\u003eApache\u003c/em\u003e ZooKeeper enables coordination between the
\u003cem\u003eApache\u003c/em\u003e Solr and VMware Greenplum Text distributed processes through a shared
namespace that resembles a file system. "}]}
rf |
In this example the hs
column has a single fragment from each of the returned documents. You can use the hl.snippets
and hl.fragsize
Solr options to set, respectively, the maximum number of fragments to return and the approximate number of characters in each fragment. Other options you can use to control how the Unified Highlighter chooses fragments are hl.bs.type
and hl.maxAnalyzedChars
. The hl.bs.type
option specifies how the highlighter breaks the text into fragments. The default is SENTENCE
. Other valid choices are SEPARATOR
, SENTENCE
, WORD
, CHARACTER
, LINE
, or WHOLE
. The hl.maxAnalyzedChars
option, default 51200, is the maximum number of characters to analyze for highlighting.
See Highlighting in the Solr documentation for tables of options you can set and their default values.
For an external index, Solr returns the highlighted fragments in a columnValue
array in the hs
result column. You can use the gptext.gptext_retrieve_field()
function in the SELECT
list to extract the fragments from the array.
=# SELECT id, score, gptext.gptext_retrieve_field(hs, 'content') AS content
FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs',
'apache', '{content_type:*html*}',
'hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=75&rows=1');
-[ RECORD 1 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/function_ref.html
score | 0.05978464
content | Remarks
|
|
| When you add an external document to the index, <em>Apache</em> Tika extracts a core
set of metadata from the document, the columns listed in the Return type section.
-[ RECORD 2 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html
score | 1.2426406
content | Solr recognizes document types automatically, using code included from the <em>Apache</em>
Tika project.
|
|
| ,See Highlighting at the <em>Apache</em> Solr website to learn more about Solr highlighting
and the Unified Highlighter.
|
|
| ,This statement searches the gptext-docs external index for documents containing the term
“<em>apache</em>”.
-[ RECORD 3 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/administering.html
score | 0.8155949
content | ZooKeeper Administration
|
|
| <em>Apache</em> ZooKeeper enables coordination between the <em>Apache</em> Solr and VMware
Greennplum Text distributed processes through a shared namespace that resembles a file system.
-[ RECORD 4 ]--------------------------------------------------------------------------------
id | http://gptext.docs.pivotal.io/latest/topics/ha.html
score | 0.4548784
content | Refer to the <em>Apache</em> SolrCloud documentation for help using the SolrCloud Dashboard.
|
Tanzu Greenplum text found four documents (1 for each segment) containing the string "apache" and extracted from each document a fragment of the content
field containing the term.
Using optical character recognition (OCR), you can add text extracted from images to Tanzu Greenplum text external indexes.
When the open source Tesseract OCR Engine software is installed on the hosts in the VMware Greenplum cluster, Apache Tika calls Tesseract to extract text from image files (for example, GIF, TIFF, JPG, or PNG files) and for images embedded in documents such as Word documents. The extracted text is added to the index with the document.
Running OCR on images adds significant overhead and time to Tanzu Greenplum text indexing operations. You can prevent Tika from running OCR by uninstalling Tesseract or by configuring indexes to not run Tesseract. See Deactivating Tesseract with a Tika Configuration File for the steps to configure Tanzu Greenplum text indexes to exclude OCR.
You can install Tesseract OCR by compiling and installing the source code or by installing a Tesseract package with yum
.
To compile and install Tesseract from source, follow the compilation instructions on the Tesseract OCR GitHub wiki.
To install with a package manager, follow installation instructions on the Tesseract OCR GitHub wiki.
Be sure to install Tesseract on every host in the VMware Greenplum cluster.
You can test the Tesseract OCR engine at the command line with an image file containing embedded text.
$ tesseract <image-file> ocr-out
$ cat ocr-out.txt
Any text Tesseract recognizes is saved in the ocr-out.txt file.
To test that Tesseract is called when adding documents to a Tanzu Greenplum text external index, you can use the gptext.extract_rich_doc()
function. This Tanzu Greenplum text function returns the content Apache Tika extracts from a document but does not add it to the index. You need a Tanzu Greenplum text external index and a URL for an image file containing text for Tesseract to extract. If the image file is in an authenticated ftp, hdfs, or s3 document store, make sure you first authenticate with gptext.external_login()
.
This example creates an external index and calls gptext.extract_rich_doc()
to extract text from an image file with an HTML URL.
=# SELECT * FROM gptext.create_index_external('ocr-test');
INFO: Created index ocr-test
create_index_external
-----------------------
t
(1 row)
=# SELECT * FROM gptext.extract_rich_doc('ocr-test', 'http://gptext.docs.pivotal.io/300/ocrtest.png');
stream_name | title | author | keywords | created | modified | content
-----------------------------------------------------------------------+-------+--------+----------+---------+----------+------------------------------
http://docs-gptext-develop-staging.cfapps.io/300/graphics/ocrtest.png | | | | | |
... (empty lines omitted)
: The quick brown fox
: jumps over the lazy dog.
:
: The five boxing wizards jump
: quickly.
... (empty lines omitted)
(1 row)
When Tesseract is installed, Apache Tika automatically runs Tesseract when a document URL references an image file or when an image is embedded in another document type. Text extracted from images is included with documents whenever you call the Tanzu Greenplum text gptext.index_external()
or gptext.index_external_dir()
functions.
If you do not need to index text embedded in images for a Tanzu Greenplum text index, follow these steps to exclude Tesseract OCR from the document indexing process.
On the master host, create a tika.xml
file with the following content.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
</parser>
</parsers>
</properties>
Copy the tika.xml
file to each VMware Greenplum host. This gpscp
example copies the tika.xml
file to the /home/gpadmin
directory on every host listed in the hostlist
file.
$ gpscp -f hostlist tika.xml =:/home/gpadmin
Modify the solrconfig.xml
configuration file for each external index that does not require OCR.
$ gptext-config edit -i <external-index-name> -f solrconfig.xml
Search for the following /update/extract
request handler, and insert a <str>
element containing the path to the tika.xml
file as shown here.
<requestHandler name="/update/extract" startup="lazy" class="com.emc.solr.handler.extraction.SHA256CheckExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">meta_</str>
<str name="captureAttr">true</str>
</lst>
<str name="tika.config">/home/gpadmin/tika.xml</str>
</requestHandler>