This topic describes how VMware GemFire integrates with VMware GemFire Search.
VMware GemFire Search is a search engine that provides indexing and searching capabilities when used with VMware GemFire. GemFire Search is built on Apache Lucene® version 9, a widely used Java full-text search engine. GemFire Search query and index definitions use the Lucene name in their syntax and APIs.
This topic requires you to have some familiarity with Apache Lucene’s indexing and search capabilities. For more information about Apache Lucene, see the Apache Lucene website.
VMware GemFire Search integration:
For more details, see the GemFire Search Javadocs for the classes and interfaces that implement Apache Lucene indexes and searches, including the following:
LuceneService
LuceneSerializer
LuceneIndexFactory
LuceneQuery
LuceneQueryFactory
LuceneQueryProvider
LuceneResultStruct
Minimum JDK version required: JDK 11
Join queries between regions are not supported.
Queries on multiple indexes within one region are not supported. You can create multiple indexes in a region, but each query must only use one index.
GemFire Search indexes are stored in on-heap memory only.
GemFire Search queries from within transactions are not supported. On an attempt to query from within a transaction, a LuceneQueryException is thrown, issuing an error message on the client (accessor) similar to the following:
Exception in thread "main" org.apache.geode.cache.lucene.LuceneQueryException:
Lucene Query cannot be executed within a transaction
...
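If an application must run a query while a transaction is open, one approach is to suspend the transaction, run the query, and then resume it. The sketch below assumes an existing cache and a previously created LuceneQuery; suspend and resume are part of GemFire's CacheTransactionManager API.

```java
// Sketch: run a Lucene query outside the current transaction by suspending it.
// Assumes `cache` is the Cache and `query` is a previously created LuceneQuery.
CacheTransactionManager txMgr = cache.getCacheTransactionManager();
TransactionId txId = txMgr.suspend();   // detach the transaction from this thread
try {
    PageableLuceneQueryResults<String, Object> results = query.findPages();
    // ... consume results ...
} finally {
    txMgr.resume(txId);                 // reattach before continuing transactional work
}
```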
GemFire Search does not allow mixed-type fields: a field's data cannot consist of both numeric and String values. For example, suppose an index on the field SSN has the following entries:

- Object_1 has String SSN = "1111"
- Object_2 has Integer SSN = 1111
- Object_3 has Float SSN = 1111.0

During indexing, an exception is thrown, because numeric values cannot be mixed with String values in the same field.
The “@” symbol must be escaped. GemFire Search uses the “@” character as a minimum-should-match operator. When querying for an email address using the KeywordAnalyzer, you must escape the “@” character. For example: "field1:john\@example.com".
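Escaping can be done by hand or with a small utility. The helper below is hypothetical (not part of the GemFire or Lucene APIs); it backslash-escapes '@' along with other characters that the Lucene query parser treats as operators.

```java
// Hypothetical helper (not part of the GemFire API): backslash-escapes
// characters that the Lucene query parser treats as operators, including '@'.
public class LuceneQueryEscaper {
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&/@";

    public static String escape(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');   // prefix the special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("john@example.com")); // john\@example.com
    }
}
```

The escaped term can then be embedded in a query string such as "field1:" plus the escaped value.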
The order of server creation with respect to index and region creation is important. The cluster configuration service cannot work if servers are created after index creation, but before region creation, because GemFire Search indexes are propagated to the cluster configuration after region creation. To start servers at multiple points within the start-up process, use this ordering:
An invalidate operation on a region entry does not invalidate a corresponding GemFire Search index entry. A query on a GemFire Search index that contains values that have been invalidated can return results that no longer exist. Therefore, do not combine entry invalidation with queries on GemFire Search indexes.
GemFire Search indexes are not supported for regions that have eviction configured with a local destroy action. Eviction can be configured with overflow to disk, but only the region data is overflowed to disk, not the GemFire Search index. On an attempt to create a region that has a GemFire Search index and eviction configured with local destroy, an UnsupportedOperationException is thrown, issuing an error message similar to the following:
[error 2017/05/02 16:12:32.461 PDT <main> tid=0x1]
java.lang.UnsupportedOperationException:
Exception in thread "main" java.lang.UnsupportedOperationException:
Lucene indexes on regions with eviction and action local destroy are not supported
...
Backups should be made for regions with GemFire Search indexes only when there are no puts, updates, or deletes in progress. A backup taken during active writes might capture an inconsistency between the region data and the GemFire Search index. Both the region operation and the associated index operation cause disk writes, but these writes are not done atomically. Therefore, if a backup is taken between the persisted write to a region and the resulting persisted write to the GemFire Search index, the backup represents inconsistent data in the region and the GemFire Search index.
You can install VMware GemFire Search on a GemFire Server or on a GemFire client.
To install VMware GemFire Search on a GemFire server:
Click I agree to Terms and Conditions. Click the HTTPS Download icon next to VMware GemFire Search. This downloads the VMware GemFire Search .gfm file.
Do one of the following:
Set the GEMFIRE_EXTENSIONS_REPOSITORY_PATH environment variable to the VMware GemFire Search extension path. For example, if your vmware-gemfire-search-VERSION.gfm file is located in /gemfire-extensions, run the following command:
export GEMFIRE_EXTENSIONS_REPOSITORY_PATH=/gemfire-extensions
Copy the downloaded file to the extensions directory of your GemFire installation. By default, this is the vmware-gemfire-XXX/extensions directory.
Start the locator:
gfsh>start locator --name locator1
Start the server:
gfsh>start server --name server1
You can use Maven to install VMware GemFire Search on a GemFire client.
To use VMware GemFire Search on a GemFire client, you must add the appropriate dependencies.
Add the following to the pom.xml file:
<repositories>
<repository>
<id>gemfire-release-repo</id>
<name>GemFire Release Repository</name>
<url>https://packages.broadcom.com/artifactory/gemfire/</url>
</repository>
</repositories>
To access the artifacts, add an entry to your .m2/settings.xml file:
<settings>
<servers>
<server>
<id>gemfire-release-repo</id>
<username>EXAMPLE-USERNAME</username>
<password>MY-PASSWORD</password>
</server>
</servers>
</settings>
Where:

- EXAMPLE-USERNAME is your support.broadcom.com user name.
- MY-PASSWORD is the Access Token you copied in step 3 in Prerequisites.

Add the dependencies to the project by adding the following to your pom.xml file:
<dependencies>
<dependency>
<groupId>com.vmware.gemfire</groupId>
<artifactId>gemfire-search</artifactId>
<version>1.1.0</version>
</dependency>
</dependencies>
Add the following to the build.gradle file:
repositories {
maven {
credentials {
username "$gemfireRepoUsername"
password "$gemfireRepoPassword"
}
url = uri("https://packages.broadcom.com/artifactory/gemfire/")
}
}
Add the following to the local .gradle/gradle.properties file or to the project gradle.properties file:
gemfireRepoUsername=EXAMPLE-USERNAME
gemfireRepoPassword=MY-PASSWORD
Where:

- EXAMPLE-USERNAME is your support.broadcom.com user name.
- MY-PASSWORD is the Access Token you copied in step 3 in Prerequisites.

Add the dependencies to the project by adding the following to your build.gradle file:
dependencies {
implementation "com.vmware.gemfire:gemfire-search:1.1.0"
}
To upgrade from VMware GemFire Search v1.0 to v1.1, you must destroy any existing Lucene indexes before upgrading, then recreate the indexes after upgrading.
1. Before upgrading, use gfsh to destroy the Lucene indexes. See Destroying an Index.
2. After upgrading, use gfsh to recreate the indexes. See Creating an Index.

To upgrade to VMware GemFire 10.1 from VMware GemFire 9.x by restarting the cluster with its previously generated cluster configuration, you must first use gfsh to destroy the Lucene indexes, then recreate the indexes after upgrading. Specifically:

1. Before upgrading, use gfsh to destroy the Lucene indexes. See Destroying an Index.
2. After upgrading, use gfsh to recreate the indexes. See Creating an Index.

Note: Rolling upgrades from previous GemFire versions to GemFire 10.1 with GemFire Search deployed are not supported.
When upgrading to VMware GemFire 10.1 from VMware GemFire 10.0, you do not need to destroy or recreate the indexes after upgrading.
You can interact with GemFire Search indexes through a Java API, through the gfsh command-line utility, or by means of the cache.xml configuration file.
When you create a GemFire Search index, you must provide three pieces of information: the name of the index, the name of the region to be indexed, and the names of the fields to be indexed. You must specify at least one field to be indexed.
If the object value for the entries in the region is a primitive type value without a field name, use __REGION_VALUE_FIELD to specify the field to be indexed. __REGION_VALUE_FIELD serves as the field name for entry values of all primitive types, including String, Long, Integer, Float, and Double.
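For instance, to index a region whose values are plain String objects, the field argument would be __REGION_VALUE_FIELD itself (the index and region names here are only illustrations):

```
gfsh>create lucene index --name=pageIndex --region=/pages --field=__REGION_VALUE_FIELD
```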
Each field has a corresponding analyzer to extract terms from text. When no analyzer is specified, org.apache.lucene.analysis.standard.StandardAnalyzer is used.
The index has an associated serializer that renders the indexed object as a GemFire Search document made up of searchable fields. The default serializer is a simple one that handles top-level fields, but does not render collections or nested objects.

VMware GemFire supplies a built-in serializer, FlatFormatSerializer, that handles collections and nested objects. For more information about GemFire Search indexes for nested objects, see Using FlatFormatSerializer to Index Fields Within Nested Objects.

As a third alternative, you can create your own serializer, which must implement the LuceneSerializer interface.
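As a sketch of what a custom serializer might look like (the Book value class and its getTitle() accessor are illustrative assumptions, not part of the GemFire API), a minimal implementation could index a single field:

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.geode.cache.lucene.LuceneIndex;
import org.apache.geode.cache.lucene.LuceneSerializer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// Illustrative serializer that indexes only the "title" field of a hypothetical Book value.
public class TitleOnlySerializer implements LuceneSerializer<Book> {
    @Override
    public Collection<Document> toDocuments(LuceneIndex index, Book value) {
        Document doc = new Document();
        doc.add(new TextField("title", value.getTitle(), Field.Store.NO));
        return Collections.singletonList(doc);
    }
}
```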
In gfsh, use the create lucene index command to create GemFire Search indexes.
Example 1: This example creates a GemFire Search index with two fields. The default analyzer and the default serializer are used.
gfsh>create lucene index --name=indexName --region=/orders --field=customer,tags
Example 2: This example creates a GemFire Search index with two fields and specifies an analyzer for the second field; the DEFAULT keyword assigns the default analyzer to the first field. The default serializer is used.
gfsh>create lucene index --name=indexName --region=/orders
--field=customer,tags --analyzer=DEFAULT,org.apache.lucene.analysis.bg.BulgarianAnalyzer
For this example, the XML configuration file below specifies a GemFire Search index with three fields and three analyzers:
<cache
xmlns="http://geode.apache.org/schema/cache"
xmlns:lucene="http://geode.apache.org/schema/lucene"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://geode.apache.org/schema/cache
http://geode.apache.org/schema/cache/cache-1.0.xsd
http://geode.apache.org/schema/lucene
http://geode.apache.org/schema/lucene/lucene-1.0.xsd"
version="1.1">
<region name="region" refid="PARTITION">
<lucene:index name="myIndex">
<lucene:field name="a"
analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
<lucene:field name="b"
analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
<lucene:field name="c"
analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
<lucene:field name="d" />
</lucene:index>
</region>
</cache>
GemFire Search provides Lucene indexing and search capabilities over data stored in GemFire regions. The Lucene indexes themselves are managed and stored in GemFire regions. Lucene indexing can generate numerous “tombstones” that consume memory until they can be collected and removed. The rate of tombstone generation depends on the use case; the rate is higher in write-heavy indexed regions.
“Tombstones” are created by GemFire when consistency checking is enabled for a region. GemFire members do not immediately remove entries from the region when an application destroys the entry. Instead, the member retains the entry with its current version stamp for a period of time to detect possible conflicts with operations that have occurred. The retained entry is referred to as a tombstone.
To manage the memory consumed by tombstones, GemFire Search provides configuration settings for the GemFire AsyncEventQueue that is used as the queuing mechanism to update the Lucene indexes. These settings control how region events are batched and processed when applied to a Lucene index. Applications can change these settings from their defaults based on the use case.
The following configurable settings (Java system properties) help to manage tombstones:
- gemfire.search.batch-size: The maximum number of events in a single batch. Default value: 10000. Decreasing this value increases the rate of tombstone creation.
- gemfire.search.batch-time-interval: The maximum amount of time, in milliseconds, allowed for building a batch before it is delivered. Default value: 1000. Decreasing this value increases the rate of tombstone creation.
- gemfire.search.dispatcher-threads: The number of threads that process batches. Default value: 1. Increasing this value increases the rate of tombstone creation.
When either the gemfire.search.batch-size or the gemfire.search.batch-time-interval is reached during the creation of a batch, the batch is delivered to one of the gemfire.search.dispatcher-threads for processing.
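Because these are standard Java system properties, one way to set them is with gfsh's --J option at server startup (the values shown here are illustrative, not recommendations):

```
gfsh>start server --name=server1 --J=-Dgemfire.search.batch-size=5000 --J=-Dgemfire.search.batch-time-interval=500 --J=-Dgemfire.search.dispatcher-threads=2
```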
Several statistics can help decide how to tune these properties.
- CachePerfStats puts for the data region (for example, RegionStats-partition-<region_name>): The rate of put operations per second.
- CachePerfStats tombstones for the files region of the index and data region (for example, RegionStats-partition-<index_name>#_<region_name>.files): The number of tombstones. GemFire Search commits cause entries from previous commits to be destroyed; these destroys generate the tombstones, which help maintain region consistency across members.
- LuceneIndexStatistics commits for the index and data region (for example, <index_name>-/<region_name>): The number of GemFire Search commits. One or more commits occur for each batch processed by the LuceneEventListener. Each commit causes approximately 35 destroys and tombstones.
- AsyncEventQueueStatistics eventQueueSize for the index and data region (for example, asyncEventQueueStats-<index_name>#_<region_name>): The size of the AsyncEventQueue's queue of events to be processed. Batch processing must be fast enough to keep up with the put rate so that the queue does not grow continuously.
VMware GemFire supplies a built-in serializer, org.apache.geode.cache.lucene.FlatFormatSerializer. This serializer renders collections and nested objects as searchable fields, which you can access using the syntax fieldnameAtLevel1.fieldnameAtLevel2 for both indexing and querying.
For example, in the following data model, the Customer object contains both a collection of Person objects and an array of Page objects. The Person object also contains a Page object.
public class Customer implements Serializable {
private String name;
private Collection<String> phoneNumbers;
private Collection<Person> contacts;
private Page[] myHomePages;
......
}
public class Person implements Serializable {
private String name;
private String email;
private int revenue;
private String address;
private String[] phoneNumbers;
private Page homepage;
.......
}
public class Page implements Serializable {
private int id; // search integer in int format
private String title;
private String content;
......
}
The FlatFormatSerializer creates one document for each parent object, adding an indexed field for each data field in a nested object, identified by its qualified name. Similarly, collections are flattened and treated as tokens in a single field.
For example, the FlatFormatSerializer could convert a Customer object, with the structure described above, into a document containing fields such as name, contacts.name, and contacts.homepage.title, based on the indexed fields specified at index creation. Each segment is a field name, not a field type, because a class (such as Customer) can have more than one field of the same type (such as Person).
The serializer creates and indexes the fields that you specify when you request index creation.
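In the Java API, the index could be created along the following lines (a sketch, assuming a LuceneService obtained from the cache via LuceneServiceProvider):

```java
// Sketch: create the index with the built-in FlatFormatSerializer.
LuceneService luceneService = LuceneServiceProvider.get(cache);
luceneService.createIndexFactory()
    .setLuceneSerializer(new FlatFormatSerializer())
    .addField("name")
    .addField("contacts.name")
    .addField("contacts.email")
    .addField("contacts.address")
    .addField("contacts.homepage.title")
    .create("customerIndex", "Customer");
```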
In gfsh, use the create lucene index command, specifying the FlatFormatSerializer by its fully qualified name, org.apache.geode.cache.lucene.FlatFormatSerializer:
gfsh>create lucene index --name=customerIndex --region=Customer
--field=name,contacts.name,contacts.email,contacts.address,contacts.homepage.title
--serializer=org.apache.geode.cache.lucene.FlatFormatSerializer
The syntax for querying a nested field is the same as for querying a top-level field, with the addition of the qualifying parent field name, as in contacts.name:Jones77*. The qualified name distinguishes which “name” field is intended when there can be more than one “name” field at different hierarchical levels in the object.
Example Java query:
LuceneQuery query = luceneService.createLuceneQueryFactory()
.create("customerIndex", "Customer", "contacts.name:Jones77*", "name");
PageableLuceneQueryResults<K,Object> results = query.findPages();
Example gfsh query:
gfsh>search lucene --name=customerIndex --region=Customer
--queryString="contacts.name:Jones77*"
--defaultField=name
LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
.create(indexName, regionName, "name:John AND zipcode:97006", defaultField);
Collection<Person> results = query.findValues();
gfsh>search lucene --name=indexName --region=/orders --queryString="Jones*"
--defaultField=customer
For more information, see the gfsh search lucene command reference page.
A region-destroy operation does not cause the destruction of any GemFire Search indexes. You must destroy any GemFire Search indexes prior to destroying the associated region.
luceneService.destroyIndex(indexName, regionName);
An attempt to destroy a region before destroying its associated GemFire Search index will result in an error message similar to the following:
Region /orders cannot be destroyed because it defines Lucene index(es)
[/ordersIndex]. Destroy all Lucene indexes before destroying the region.
gfsh>destroy lucene index --name=indexName --region=/orders
For more information, see the gfsh destroy lucene index command reference page.
Destroy the GemFire Search index.
Recreate a new GemFire Search index.
The gfsh describe lucene index command displays details about a specified index. For more information about this command, see the gfsh describe lucene index command reference page.
The gfsh list lucene index command displays the list of GemFire Search indexes created for all members. For more information about this command, see the gfsh list lucene index command reference page.