This topic describes GemFire cluster sizing.

Overview

Sizing a GemFire deployment involves calculation as well as experimentation and testing. Experimentation and testing are required to arrive at reasonably accurate values for the key sizing parameters that work well in practice, and they use representative data and workload, starting at a very small scale.

Experimentation and testing are required because memory overhead can vary greatly with the data and workload. This variation makes it impractical to calculate the overhead precisely, as it is a function of too many variables, many of which stem from the Java Virtual Machine (JVM) and its memory management mechanism.

Resource Considerations

Memory is the primary means of data storage in GemFire and is the first resource to consider for sizing purposes.

Horizontal scaling to satisfy memory requirements also scales out all the other hardware resources: CPU, network, and disk. Because of this, once the memory requirements are satisfied and an adequate cluster size is determined, often only small adjustments are needed to cover the other required resources and complete the sizing process.

Typically, memory drives horizontal scaling, but any of the hardware resources can. In addition to hardware resources, software resources should be considered. The most important software resources are file descriptors, mostly for sockets in use, and threads (processes).

Experimentation and Testing

To size a GemFire cluster:

  1. Deploy a small representative data set and workload in a small cluster.
  2. Tune the cluster to the desired performance.
  3. Scale out the cluster while ensuring that key performance metrics stay within the desired SLA.

Testing at full scale is ideal if you have sufficient resources available for testing. If you do not, you can scale out multiple times, a few nodes at a time, to produce data points for projecting resource usage at full scale. This is an iterative process that involves analysis and tuning at each step, and GemFire statistics can assist greatly in that analysis.

For large scale deployments that involve large data volumes, the general guideline is to scale vertically as much as possible to fit as much data as possible in a single GemFire instance. This helps minimize the size of the cluster. The limit to vertical scaling may depend on the desired SLA around node failure.

Requirements and Assumptions

To maximize the accuracy of your GemFire cluster sizing, and to minimize unexpected situations in your production environment, VMware recommends that you first run tests to characterize memory and other resource usage under a representative workload.

Requirements:

  • A subset of representative data. The more closely the test data matches the real data, the more accurate the results.
  • A subset of the workload that matches the production workload as closely as possible.
  • Hardware resources for testing, ideally of the same category as would be used in production: the same CPU, memory, network, and disk resources per node. At a minimum, you must be able to run three GemFire data nodes to start, then add a few more nodes to validate the scalability. In addition, you must have enough hosts running GemFire clients to create an adequate workload.
  • Familiarity with key GemFire concepts and features, such as partitioned regions, serialization, etc.

You should follow the documented best practices, such as the JVM GC configuration (CMS and ParNew), and use the currently supported platforms.

Architectural and Design Considerations

Before a sizing effort can start, the overarching architectural decisions have to be made, such as which GemFire regions to use for different types of data or what redundancy level to use. The results of sizing can inform architectural and design decisions for which multiple options are possible.

Serialization

Serialization can make a significant difference in the per-entry data overhead in memory, and subsequently in the overall memory requirements.

GemFire’s PDX serialization keeps data in a usable serialized form, which allows most operations on data entries without deserializing them, resulting in both space and performance benefits. These qualities make PDX serialization the recommended serialization approach for most use cases.

DataSerializable is another GemFire serialization mechanism that is more space efficient than either PDX or Java Serializable. However, unlike PDX, it requires deserialization on any kind of access.
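
For illustration, the following sketch shows one way to enable PDX serialization on a cache. It assumes a Geode-based GemFire release (the org.apache.geode packages; older releases use com.gemstone.gemfire instead), and the com.example package pattern is a placeholder for your own domain classes:

  import org.apache.geode.cache.Cache;
  import org.apache.geode.cache.CacheFactory;
  import org.apache.geode.pdx.ReflectionBasedAutoSerializer;

  public class PdxCacheBootstrap {
      public static void main(String[] args) {
          // Auto-serialize domain classes under the placeholder com.example package as PDX,
          // so no hand-written toData/fromData methods are needed.
          Cache cache = new CacheFactory()
              .setPdxSerializer(new ReflectionBasedAutoSerializer("com.example\\..*"))
              // Keep values in serialized (PDX) form on reads, so most operations
              // can work on entries without deserializing them.
              .setPdxReadSerialized(true)
              .create();
          // ... create regions and load data, then close the cache on shutdown.
      }
  }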

Per-entry Memory Overhead

The following factors can have a significant impact on the per-entry memory overhead for data, as well as on performance:

  • Choice of GemFire region type: Different regions have different per-entry overheads. This overhead is documented below and is included in the sizing spreadsheet.
  • Choice of the serialization mechanism: GemFire offers multiple serialization options, as well as the ability to have values stored serialized. As mentioned above, GemFire PDX serialization is the generally recommended serialization mechanism due to its space and performance benefits.
  • Choice of Keys: Smaller and simpler keys are more efficient in terms of both space and performance.
  • Use of indexes: Indexing incurs a per-entry overhead, as documented in Memory Requirements for Cached Data.

For more detailed information and guidelines, see Memory Requirements for Cached Data.

If the data value objects are small but numerous, the per-entry overhead can add up to a significant memory requirement. You can reduce this overhead by grouping multiple data values into a single entry or by using containment relationships. For example, you can have your Order objects contain their line items instead of keeping a separate OrderLineItems region. If this option is available, using it may yield performance improvements in addition to space savings.
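
As a sketch of the containment approach (the class and field names below are hypothetical), line items live inside the order value, so the cluster pays the per-entry overhead once per order rather than once per line item:

  import java.io.Serializable;
  import java.math.BigDecimal;
  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical domain classes illustrating containment: no separate OrderLineItems region.
  public class Order implements Serializable {
      private String orderId;
      private final List<OrderLineItem> lineItems = new ArrayList<>();

      public void addLineItem(OrderLineItem item) {
          lineItems.add(item);
      }
  }

  class OrderLineItem implements Serializable {
      private String sku;
      private int quantity;
      private BigDecimal price;
  }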

Partitioned Region Scalability

GemFire partitioned regions scale out by rebalancing their data buckets (partitions) to distribute the data evenly across all available nodes in a cluster. When new nodes are added to the cluster, rebalancing causes some buckets to move from the old to the new nodes such that the data is evenly balanced across all the nodes. For this to work effectively, with the end result being a well-balanced cluster, there should be at least one order of magnitude more buckets than data nodes for each partitioned region.

Typically, increasing the number of buckets improves data distribution. However, because the number of buckets cannot be changed dynamically or without downtime, the projected horizontal scale-out must be taken into account when determining the optimal number of buckets. Otherwise, as the system scales out over time, the data may become less evenly distributed. In the extreme case, when the number of nodes exceeds the number of buckets, adding new nodes has no effect, and the ability to scale out is lost.
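
For example, the sketch below fixes the bucket count when the partitioned region is created (the region name is a placeholder, and 113, the GemFire default, is shown as an illustrative value; choose a number roughly an order of magnitude larger than the projected maximum node count):

  import org.apache.geode.cache.Cache;
  import org.apache.geode.cache.PartitionAttributesFactory;
  import org.apache.geode.cache.Region;
  import org.apache.geode.cache.RegionShortcut;

  public class BucketCountExample {
      public static Region<String, Object> createOrdersRegion(Cache cache) {
          // The bucket count cannot be changed later without downtime, so size it for
          // the projected scale-out, not just the initial cluster.
          PartitionAttributesFactory<String, Object> paf = new PartitionAttributesFactory<>();
          paf.setTotalNumBuckets(113); // illustrative; the GemFire default
          return cache.<String, Object>createRegionFactory(RegionShortcut.PARTITION)
              .setPartitionAttributes(paf.create())
              .create("orders"); // placeholder region name
      }
  }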

Related to this is the choice of data partitioning scheme, the goal of which is to yield even data and workload distribution in the cluster. If there is a problem with the partitioning scheme, the data, and likely the workload, will not be evenly balanced, and scalability will be lost.

Redundancy

Typically, the choice of redundancy is driven by data size and by whether or not the data can be retrieved from some other backing store besides GemFire. Other considerations might also be a factor in the decision.

For example, GemFire can be deployed in an active/active configuration in two data centers such that each can take on the entire load, but does so only when necessitated by a failure. Typically, such deployments keep four live copies of the data at any time, two in each datacenter. If two nodes failed in a single datacenter, the other datacenter would take over the entire workload until those two nodes were restored, avoiding data loss in the first datacenter. You could instead set redundancy to two, for a total of three copies of the data. This provides high availability even in the case of a single node failure, and avoids paying the price of rebalancing when a single node fails: instead of rebalancing, the failed node is restarted while two copies of the data still exist.
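
A minimal sketch of that alternative, using the same illustrative region style as above: setting redundant-copies to two keeps one primary plus two redundant copies of each bucket.

  import org.apache.geode.cache.PartitionAttributes;
  import org.apache.geode.cache.PartitionAttributesFactory;

  public class RedundancyExample {
      public static PartitionAttributes<String, Object> threeCopiesOfData() {
          // redundant-copies=2: one primary and two redundant copies, so a single node
          // failure still leaves two live copies and the node can simply be restarted.
          return new PartitionAttributesFactory<String, Object>()
              .setRedundantCopies(2)
              .create();
      }
  }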

Relationship Between Horizontal and Vertical Scale

For deployments that can grow very large, you should allow for the growth by taking advantage of not just horizontal scalability, but also the ability to store as much data as possible in a single node. GemFire has been deployed in clusters of over 100 nodes; however, smaller clusters are easier to manage. So, as a general rule, you should store as much data as possible in a single node while keeping the amount of data that must move to re-establish the redundancy SLA after a single node failure at a comfortable level. GemFire has been used with heaps of well over 64GB in size.

NUMA Considerations

You should understand the Non-Uniform Memory Architecture (NUMA) memory boundaries when deciding on the JVM size and, in virtualized deployments, the VM size.

Most modern CPUs implement this kind of architecture: memory is divided across the CPUs so that memory directly connected to each CPU's bus is accessed very quickly, whereas accesses by that same CPU to memory connected to the other CPUs pay a significant wait-state penalty. For example, consider a system with four CPUs of eight cores each, a NUMA architecture that assigns each CPU its own portion of memory, and 256GB of total memory. In this case, each NUMA node is 64GB. Growing a JVM beyond 64GB on this machine induces wait-states whenever a CPU must cross NUMA node boundaries to access memory within the heap. For this reason, you should size GemFire JVMs to fit within a single NUMA node to optimize performance.

GemFire Queues

If any GemFire queueing capabilities are used, such as for WAN distribution, client subscription, or asynchronous event listeners, you should evaluate the queues’ capacity in the context of the desired SLA. For example, for how long should gateway or client subscription queues be able to keep queueing events when the connection is lost? Given that, how large should the queues be able to grow? An effective way to learn the answers to these kinds of questions is to watch the queues’ growth during sizing experiments, using GemFire statistics. For more information about this, see Step 3: Vertical Sizing below.

For WAN distribution, you should evaluate the distribution volume requirements and ensure adequate network bandwidth sizing. If sites connected by the WAN gateway may be down for extended periods of time, such as days or weeks, you must overflow the gateway queues to disk and ensure that you have sufficient disk space for them. If disk space for the queues runs short, you may need to shut off the gateway senders to prevent running out of space.
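
As an illustrative sketch (the sender id, remote distributed-system id, disk store name, and queue memory value below are placeholders), a gateway sender can persist its queue to a disk store sized for the expected outage window:

  import org.apache.geode.cache.Cache;
  import org.apache.geode.cache.wan.GatewaySender;
  import org.apache.geode.cache.wan.GatewaySenderFactory;

  public class GatewayQueueExample {
      public static GatewaySender createSender(Cache cache) {
          GatewaySenderFactory factory = cache.createGatewaySenderFactory();
          factory.setPersistenceEnabled(true);           // persist/overflow the queue to disk
          factory.setDiskStoreName("wanQueueDiskStore");  // placeholder disk store name
          factory.setMaximumQueueMemory(512);             // MB held in heap before overflowing
          // Placeholder sender id and remote distributed-system id; attach the sender to
          // regions with RegionFactory.addGatewaySenderId("siteBSender").
          return factory.create("siteBSender", 2);
      }
  }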

Sizing Process

To size a GemFire cluster:

  1. Domain object sizing: Produce an entry size estimate for all the domain objects. Use this together with the number of entries to estimate the total memory requirements.

  2. Estimating total memory and system requirements: Based on the data sizes, estimate the total memory and system requirements using the sizing spreadsheet, which takes into account GemFire region overhead. This does not account for other overhead, but provides a starting point.

  3. Vertical sizing: Use the results of the previous step as the starting point in configuring a three-node cluster. Vertical sizing determines the “building block” – the sizing, configuration, and workload for a single node – by experimentation.

  4. Scale-out validation: Iteratively test and adjust the single “building block” node from the previous step to verify near-linear scalability and performance.

  5. Projection to full scale: Use the results of scale-out validation to arrive at the sizing configuration that meets your desired capacity and SLA.

The following sections provide details about each step.

Step 1: Domain object sizing

Before you can make any other estimates, you must estimate the size of the domain objects to be stored in the cluster.

An effective way to size a domain object is to run a single-instance GemFire test with GemFire statistics enabled. In this test, store each domain object to be sized in a dedicated partitioned region. The test loads a number of instances of each domain object, making sure they all stay in memory, with no overflow. After running the test, load the statistics file into VSD and examine the dataStoreBytesInUse and dataStoreEntryCount partitioned region statistics for each region. Dividing the value of dataStoreBytesInUse by the value of dataStoreEntryCount gives an estimate of the average value size that is about as accurate as possible.

Note: This estimate does not include the key size and entry overhead.
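
A minimal sketch of such a test follows, assuming a Geode-based GemFire release and using a hypothetical Order domain object defined inline. After the run, load the archive into VSD and divide dataStoreBytesInUse by dataStoreEntryCount for the /orders region.

  import org.apache.geode.cache.Cache;
  import org.apache.geode.cache.CacheFactory;
  import org.apache.geode.cache.Region;
  import org.apache.geode.cache.RegionShortcut;

  public class DomainObjectSizingTest {
      // Hypothetical domain object; replace it with your real class.
      static class Order implements java.io.Serializable {
          String orderId;
          java.math.BigDecimal total;
          Order(int i) { orderId = "order-" + i; total = java.math.BigDecimal.valueOf(i); }
      }

      public static void main(String[] args) throws InterruptedException {
          // Enable statistics sampling and write them to an archive that VSD can load.
          Cache cache = new CacheFactory()
              .set("statistic-sampling-enabled", "true")
              .set("statistic-archive-file", "sizing-orders.gfs")
              .create();

          // One dedicated partitioned region per domain object keeps its statistics isolated.
          Region<String, Order> orders = cache.<String, Order>createRegionFactory(RegionShortcut.PARTITION)
              .create("orders");

          // Load enough instances to average out per-object variation; keep them all in memory.
          for (int i = 0; i < 100_000; i++) {
              orders.put("order-" + i, new Order(i));
          }

          // Allow at least one statistics sample to be written before shutting down.
          Thread.sleep(5_000);
          cache.close();
      }
  }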

Another way to size domain objects is to use a heap histogram. In this approach, run a separate test for each domain object; this simplifies the process of determining which classes are associated with data entries, based on the number of entries in memory.

Step 2: Estimating total memory and system requirements

You can use the System Sizing Worksheet to approximate your total memory and system requirements. The System Sizing Worksheet takes into account all the GemFire region-related per-entry overhead and the desired memory headroom.

The spreadsheet formulas are rough approximations that serve to inform a high-level estimate, as they do not account for any other overhead such as buffers, threads, queues, application workload, etc. Additionally, the results obtained from the spreadsheet do not have any performance context. For this reason, the steps in Step 3: Vertical Sizing use the results for memory allocation per server obtained from the spreadsheet as the starting point for the vertical sizing process.
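
As a purely illustrative back-of-the-envelope check of the kind of estimate the spreadsheet produces (every number below is an assumption, not a measurement, and the 50% headroom is applied to the whole heap for simplicity):

  public class RoughMemoryEstimate {
      public static void main(String[] args) {
          // Assumed inputs; replace them with your own measurements and spreadsheet values.
          long entries          = 10_000_000L; // entries in the region
          long avgValueBytes    = 1_024L;      // from dataStoreBytesInUse / dataStoreEntryCount
          long avgKeyBytes      = 64L;
          long perEntryOverhead = 150L;        // illustrative per-entry region overhead
          int copies            = 2;           // one primary plus redundant-copies=1
          double headroom       = 0.5;         // 50% heap headroom

          long dataBytes  = entries * (avgValueBytes + avgKeyBytes + perEntryOverhead) * copies;
          long totalBytes = (long) (dataBytes / (1.0 - headroom));
          System.out.printf("data ~%.1f GB, heap with headroom ~%.1f GB%n",
              dataBytes / 1e9, totalBytes / 1e9);
      }
  }

With these assumed numbers, the data alone comes to roughly 25 GB and the heap with headroom to roughly 50 GB across the cluster; the spreadsheet refines this kind of estimate with the actual region overheads.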

Step 3: Vertical Sizing

Use vertical sizing to determine what fraction of the total requirements for storage and workload can be satisfied with a single data node, and with what resources. This represents a “building block” (a unit of scale) and includes both the size of the resources and the workload capacity. It also includes the complete configuration of the building block (system, VM if present, JVM, and GemFire).

For example, a result of this step for a simple read-only application might be that a single data node with a JVM sized to 64G can store 40G of data and support a workload of 5000 get operations per second within the required latency SLA, without exhausting any resources. You should capture all the key performance indicators for the application, and verify that they meet the desired SLA. A complete output of the vertical sizing step includes all the relevant details, such as hardware resources per node, peak capacity, performance at peak capacity, and which resource becomes the bottleneck at that capacity.

This approach uses experimentation to determine the optimal values for all relevant configuration settings, including the system, VM if virtualization is used, JVM, and GemFire configuration to be used.

To run experiments and tests, you must have a cluster of three data nodes and a locator, as well as additional hosts to run clients that generate the application workload. Three data nodes are required to fully exercise the partitioning of data in partitioned regions across multiple nodes in the presence of data redundancy. As a starting point, size the data nodes based on the estimates obtained from the sizing spreadsheet completed in Step 2: Estimating total memory and system requirements.

Typically, the following configuration is used to begin:

  • A heap headroom of 50% of the old generation
  • CMSInitiatingOccupancyFraction is set to 65%
  • The young generation is sized to 10% of the total heap
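
For example, for a hypothetical 64GB data-node heap, the GC-related parts of this starting configuration translate into JVM options along these lines (the heap and young generation sizes are illustrative):

 -Xms64g -Xmx64g -Xmn6g
 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
 -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+CMSClassUnloadingEnabled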

GemFire logging and statistics should be enabled for all test runs. The logs should be routinely checked for problems, and the statistics analyzed for problems, resource verification, and performance. The application test logic should also collect performance metrics, and it must collect any relevant latency metrics.

If WAN distribution is needed, you should set up an identical twin cluster and configure the WAN distribution between the two clusters. You should also size WAN capacity.

Test runs should exercise a representative application workload, with duration long enough to incur multiple GC cycles, so that stable resource usage can be confirmed. If any GemFire queues are used, run tests to determine adequate queue sizes that meet the SLA. If disk storage is used, determine adequate disk store size and configuration per node as part of this exercise.
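
If overflow or persistence is used, a named disk store gives you a concrete artifact to size during these runs. A minimal sketch, with a placeholder directory, name, and oplog size:

  import java.io.File;
  import org.apache.geode.cache.Cache;
  import org.apache.geode.cache.DiskStore;

  public class DiskStoreExample {
      public static DiskStore createStore(Cache cache) {
          // Size the directory for the overflow/persistence volume observed in the test runs.
          return cache.createDiskStoreFactory()
              .setDiskDirs(new File[] { new File("/data/gemfire/disk") })
              .setMaxOplogSize(1024) // MB per oplog file; illustrative value
              .create("sizingDiskStore");
      }
  }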

After each test run, examine the latency metrics collected by the application. Use VSD to examine the statistics and correlate the resource usage with latencies and throughput observed. You should examine the following:

  • Memory (heap and non-heap) and GC
  • CPU
  • System load
  • Network
  • File descriptors
  • Threads
  • Queue statistics

For information about VSD and these statistics, see Quick Guide to Useful Statistics.

One of the objectives of vertical sizing is to determine the headroom required to accomplish the desired performance. It might take several test iterations to tune the headroom so that it is no more and no less than needed. Much more headroom than needed can amount to a significant waste of resources, while less headroom can cause higher GC activity and CPU usage and hurt performance.

Locator Sizing

Locator JVM sizing may be necessary when the JMX Manager runs in the locator JVM and JMX is used for monitoring. An effective approach is to start with a locator heap of 0.5G and monitor it during the scale-out.
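
For example, a sketch of pinning the locator heap when starting it with gfsh (the locator name is a placeholder):

 start locator --name=locator1 --J=-Xms512m --J=-Xmx512m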

Notes on GC

The most important goal of GC tuning is to avoid full GCs, because the pauses they cause can make a GemFire data node unresponsive and, as a result, cause it to be expelled from the cluster. A completely full permanent generation also triggers a full GC, so size the permanent generation appropriately to avoid this. Additionally, you can instruct the JVM to collect the permanent generation space along with CMS GC by using the following option:

 -XX:+CMSClassUnloadingEnabled 

You can tune GC for two of the following three:

  • Latency
  • Throughput
  • Memory footprint

Heap headroom is important because, with GemFire, memory footprint is sacrificed in order to accomplish the latency and throughput goals.

Long minor GC pauses can be shortened by reducing the young generation size, which will likely increase the frequency of minor collections. Additionally, for very large heaps, for example those of 64G and above, the old generation's impact on minor GC pauses may be reduced by using the following GC settings:

 -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768 

Step 4: Scale-out Validation

During this step, you scale out the initial three-node cluster at least twice, adding at least a few nodes each time. Scale out the client hosts accordingly so that you can create an adequate workload at each step, increasing the workload in proportion to the scale-out.

There is no definitive rule about how much to increase the cluster size, or in what increments. Typically, this determination is dictated by available hardware resources.

The goal of this step is to validate the “building block” configuration and capacity at a scale larger than the initial deployment. This allows you to project the capacity to full scale with confidence. You may need to tune the configuration at various points. For example, when you add more nodes to the cluster, more socket connections, buffers, and threads are in use on each node, resulting in higher memory usage per node (both heap and non-heap), as well as more file descriptors in use.

If you use JMX for monitoring, watch the heap usage of the locator running the JMX Manager.

Step 5: Projection to Full Scale

After you have completed Step 4: Scale-out Validation, you can determine the total cluster size. You know the storage and workload capacity of a single node and that you can scale horizontally to meet the full requirements. Additionally, you have already tuned the cluster configuration to meet the demands of the application workload.

Sizing Quick Reference

General recommendations to use as a starting point in capacity planning and sizing:

Data Node Heap Size
  • Up to 32GB: smaller data volumes (up to a few hundred GB); very low latency required
  • 64GB+: larger data volumes (500GB+)

CPU Cores per Data Node
  • 2 to 4: development; smaller heaps
  • 6 to 8: production; performance/system testing; larger heaps

Network Bandwidth
  • 1GbE: development
  • High bandwidth (e.g. 10GbE): production; performance/system testing

Disk Storage
  • DAS or SAN: always
  • NAS: do not use; performance and resilience issues

  • Memory/CPU relationship: mind the NUMA boundary
  • Virtualization: Do not oversubscribe resources (memory, CPU, storage). Run a single GemFire data node JVM per VM.