Each VMware Greenplum Text/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx
parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.
Other performance problems can result from resource contention between the VMware Greenplum, Solr, and ZooKeeper clusters.
This topic discusses VMware Greenplum Text use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.
Indexing documents consumes data in Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit()
user-defined function.
You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml
file for the index:
$ gptext-config edit -f solrconfig.xml -i <db>.<schema>.<index-name>
The <autoSoftCommit>
element is a child of the <updateHandler>
element. Edit the <maxDocs>
and <maxTime>
values to reduce the time between automatic commits. For example, the following settings perform an autocommit every 100,000 documents or 10 minutes.
<autoSoftCommit>
<maxDocs>100000</maxDocs>
<maxTime>600000</maxTime>
</autoSoftCommit>
Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size
configuration parameter to reduce the size of the indexing buffer.
See Changing VMware Greenplum Text Server Configuration Parameters for instructions to change configuration parameter values.
A VMware Greenplum Text node is a Solr instance managed by VMware Greenplum Text. The Solr nodes can be deployed on the VMware Greenplum cluster hosts or on separate hosts accessible to the VMware Greenplum cluster. The number of nodes is configured during VMware Greenplum Text installation.
The maximum recommended number of Solr nodes you can deploy is the number of VMware Greenplum primary segments. However, the best practice recommendation is to deploy fewer nodes with more memory rather than to divide the memory available to VMware Greenplum Text among the maximum number of nodes. Use the JAVA_OPTS
installation parameter to set memory size for the nodes.
A single Solr node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most VMware Greenplum Text installations, a single Solr node per host is sufficient.
If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a Solr node loses capacity and performance. Therefore, no node should have more than 32GB of memory.
When you determine the number of Solr nodes to deploy and how much memory to allocate to them, it is useful to estimate how much memory Solr will need to create and query your indexes. This is important because if you do not allow enough memory, Solr nodes may crash with out of memory errors. This section helps to estimate the resources Solr will need on each host based on the number and characteristics of the documents you index and how many queries you plan to execute concurrently. When you know how much memory you need per host for your indexes, you can decide how many Solr nodes to deploy.
You can estimate the total memory required per host using the following formula.
TotalMem = TermDictIndexMem + DocMem + IndexBufMem + (PerQueryMem * ConcurrentQueries) + CacheMem
The terms in this formula are defined as follows:
TermDictIndexMem Memory for the Solr term dictionary index.
DocMem Memory for each document's metadata.
DocNum * 1
IndexBufMem Memory for the replica index buffer.
100 * 1024 * 1024 * R
PerQueryMem Memory for each running query.
8 * 1024 * 1024 * Rows / 10000
rows
parameter in the gptext.search()
function. Use an estimate of 8MB for 10000 query results.ConcurrentQueries Number of concurrent queries.
CacheMem Memory for the Solr cache.
256 * PostingLen * 8
A 4-host cluster has 2 VMware Greenplum Text indexes.
gptext.search("index_name", "search_word", "filters", "rows=250000")
.The memory units in the these calculations are bytes unless qualified with a unit specifier.
TermDictIndex = (20000000 * 20 + 150000000 * 5) * 0.1 = 115000000
DocMem = 10000000 + 100000000 = 110000000
The initial memory used on each host after Solr starts up is (115000000 + 110000000) = 215MB / 4 = 54MB
.
100 * 1024 * 1024 * 16 * 2 = 3200MB
The maximum additional memory used during a VMware Greenplum Text index operation is 3200MB.
PerQueryMem = 8 * 1024 * 1024 * 250000 / 1000 = 200MB
ConcurrentQueries = 10
CacheMem = 0
The maximum additional memory used for VMware Greenplum Text searches is 200 * 10 = 2000MB
. This memory is consumed on each Solr node.
The total memory needed per Solr node depends on how many nodes are deployed on each host.
54MB + 3200MB + (2000MB * SolrNodeNum)
One Solr node deployed on each host needs (54MB + 3200MB + 2000MB * 1) = 5.3GB
memory per node.
Two Solr nodes on each host need (54MB + 3200MB + 2000MB * 2) / 2 = 3.6GB
memory per node.
Both of these plans could use a JVM size of 8GB or 16GB.
With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded VMware Greenplum Text system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.
The gptext.terms()
function retrieves terms vectors from documents that match a query. An out of memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms()
query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS
) and concurrent queries.
If you experience out of memory errors with gptext.terms()
you can set a lower value for the term_batch_size
VMware Greenplum Text configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size
set to 500. Lowering the value may prevent out of memory errors, but performance of terms queries can be affected.
See VMware Greenplum Text Configuration Parameters for help setting VMware Greenplum Text configuration parameters.
Good VMware Greenplum Text performance requires fast responses for ZooKeeper requests. Under heavy load, ZooKeeper can become slow or pause during JVM garbage collections. This can cause SolrCloud timeout issues and affects VMware Greenplum Text query performance, so ensuring good ZooKeeper performance is crucial for the VMware Greenplum Text cluster.
Suggestions for optimizing ZooKeeper performance: