This topic provides information about and suggestions for diagnosing and resolving VMware Tanzu GemFire system problems.
Invocation of a locator with gfsh fails with an error like this:
Starting a GemFire Locator in C:\devel\gfcache\locator\locator
The Locator process terminated unexpectedly with exit status 1.
Please refer to the log file in C:\devel\gfcache\locator\locator for full details.
Exception in thread "main" java.lang.RuntimeException: An IO error occurred while
starting a Locator in C:\devel\gfcache\locator\locator on 192.0.2.0[10999]: Network is
unreachable; port (10999) is not available on 192.0.2.0.
...
This indicates a mismatch somewhere in the address and port pairs used for locator startup and configuration. The address you use for locator startup must match the address you list for the locator in the gemfire.properties locators specification. Every member of this cluster, including the locator itself, must have the complete locators specification in gemfire.properties.
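The matching rule above can be checked mechanically. The following is a minimal sketch, not part of Tanzu GemFire: the class and method names are hypothetical, and it assumes the locators value uses the host[port] form shown in the error message. It compares strings literally, so a host name and its IP address are treated as different entries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks that a startup address appears in a locators value. */
final class LocatorSpecCheck {

    /** Splits a value such as "hostA[10999],hostB[10999]" into "host:port" strings. */
    static List<String> parse(String locators) {
        List<String> result = new ArrayList<>();
        for (String entry : locators.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int open = trimmed.indexOf('[');
            int close = trimmed.indexOf(']');
            if (open <= 0 || close <= open + 1) {
                throw new IllegalArgumentException("Expected host[port], got: " + trimmed);
            }
            String host = trimmed.substring(0, open);
            int port = Integer.parseInt(trimmed.substring(open + 1, close));
            result.add(host + ":" + port);
        }
        return result;
    }

    /** Returns true when the address and port used at startup are listed in the locators value. */
    static boolean isListed(String locators, String startupHost, int startupPort) {
        return parse(locators).contains(startupHost + ":" + startupPort);
    }

    public static void main(String[] args) {
        String locators = "192.0.2.0[10999]"; // address and port taken from the error above
        System.out.println(isListed(locators, "192.0.2.0", 10999));
        System.out.println(isListed(locators, "192.0.2.0", 10334)); // a port that is not listed
    }
}
```

Running the same check against the locators value of every member is one way to spot a member whose specification is incomplete.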
Response:
If the process tries to start and then silently disappears, on Windows this indicates a memory problem.
Response:
On a Windows host, decrease the maximum JVM heap size. This property is specified on the gfsh command line:
gfsh>start server --name=server_name --max-heap=1024m
For details, see JVM Memory Settings and System Performance.
If this does not work, try rebooting.
Response: Check these possible causes.
Make sure gemfire.properties has the correct IP address for the locator.
gemfire version command.
Response:
Either the process cannot find the configuration file or, if it is an application, it may be doing programmatic configuration.
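Tanzu GemFire looks for a gemfire.properties file in the current working directory, the home directory, and the class path, in that order. The sketch below walks the same order to show which file a process started from a given directory would likely pick up. It is an illustration only, not the product's own lookup code, and the class and method names are hypothetical.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: reports the first gemfire.properties file found in the search order. */
final class PropertiesFileLocator {

    /** Returns the first existing file named fileName, checking directories in the order given. */
    static Optional<Path> firstExisting(List<Path> directories, String fileName) {
        for (Path dir : directories) {
            Path candidate = dir.resolve(fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String name = "gemfire.properties";
        List<Path> dirs = Arrays.asList(
                Paths.get(System.getProperty("user.dir")),   // current working directory
                Paths.get(System.getProperty("user.home"))); // home directory
        Optional<Path> found = firstExisting(dirs, name);
        if (found.isPresent()) {
            System.out.println("Found on disk: " + found.get());
        } else {
            // Last location in the search order: the class path (prints null if absent).
            URL onClassPath = PropertiesFileLocator.class.getClassLoader().getResource(name);
            System.out.println("On class path: " + onClassPath);
        }
    }
}
```

Run it from the same working directory and with the same class path as the member process, otherwise the result does not reflect what the member sees.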
Response:
Check that the gemfire.properties file is in the right directory.
Make sure the process is not picking up settings from another gemfire.properties file earlier in the search path. Tanzu GemFire looks for a gemfire.properties file in the current working directory, the home directory, and the class path, in that order.
gemfire.properties file. See your application’s customer support group for configuration changes.
System member startup fails with an error like one of these:
Exception in thread "main" org.apache.geode.cache.CacheXmlException:
While reading Cache XML file:/C:/gemfire/client_cache.xml.
Error while parsing XML, caused by org.xml.sax.SAXParseException:
Document root element "client-cache", must match DOCTYPE root "cache".
Exception in thread "main" org.apache.geode.cache.CacheXmlException:
While reading Cache XML file:/C:/gemfire/cache.xml.
Error while parsing XML, caused by org.xml.sax.SAXParseException:
Document root element "cache", must match DOCTYPE root "client-cache".
Tanzu GemFire declarative cache creation uses one of two root element pairs: cache or client-cache. The name must be the same in both places.
Response:
Update the cache.xml file so it has the proper XML namespace and schema definition.
For peers and servers:
<?xml version="1.0" encoding="UTF-8"?>
<cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd"
    version="1.0">
  ...
</cache>
For clients:
<?xml version="1.0" encoding="UTF-8"?>
<client-cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd"
    version="1.0">
  ...
</client-cache>
An empty cache can be a normal condition. Some applications start with an empty cache and populate it programmatically, but others are designed to bulk load data during initialization.
Response:
If your application should start with a full cache but it comes up empty, check these possible causes:
gemfire.properties. If they match, the process may not be reading gemfire.properties. See Member process does not read settings from the gemfire.properties file.
Client calls to keySetOnServer and containsKeyOnServer can return incomplete or inconsistent results if your server regions are not configured as partitioned or replicated regions.
A non-partitioned, non-replicate server region may not hold all data for the distributed region, so these methods would operate on a partial view of the data set.
In addition, the client methods use the least loaded server for each method call, so two calls may go to different servers. If the servers do not have a consistent view in their local data sets, responses to client requests will vary.
The consistent view is only guaranteed by configuring the server regions with partitioned or replicate data-policy settings. Non-server members of the server system can use any allowable configuration as they are not available to take client requests.
The following server region configurations give inconsistent results. These configurations allow different data on different servers. There is no additional messaging on the servers, so no union of keys across servers or checking other servers for the key in question.
These configurations provide consistent results:
Response: Use a partitioned or replicate data-policy for your server regions. This is the only way to provide a consistent view to clients of your server data set. See Region Data Storage and Distribution Options.
In partitioned regions that are persisted to disk, if you have any members offline, the partitioned region will still be available but may have some buckets represented only in offline disk stores. In this case, methods that access the bucket entries return a PartitionOfflineException, similar to this:
org.apache.geode.cache.persistence.PartitionOfflineException:
Region /__PR/_B__root_partitioned__region_7 has persistent data that is no
longer online stored at these locations:
[/192.0.2.1:/export/straw3/users/jpearson/bugfix_Apr10/testCL/hostB/backupDirectory
created at timestamp 1270834766733 version 0]
Response: Bring the missing member online, if possible. This restores the buckets to memory and you can work with them again. If the missing member cannot be brought back online, or the disk stores for the member are corrupt, you may need to revoke the member, which will allow the system to create the buckets in new members and resume operations with the entries. See Handling Missing Disk Stores.
Check these possible causes.
Operating without a log file can be a normal condition, so the process does not log a warning.
Response:
gemfire.properties. If not, logging defaults to standard output, and on Windows it may not be visible at all.
gemfire.properties. See Member process does not read settings from the gemfire.properties file.
An application gets an OutOfMemoryError if it needs more object memory than the process is able to give. The messages include java.lang.OutOfMemoryError.
Response:
The process may be hitting its virtual address space limits. The virtual address space has to be large enough to accommodate the heap, code, data, and dynamic link libraries (DLLs).
You may need to lower the thread stack size. The default thread stack size is large: 512 KB on SPARC and 256 KB on Intel for 1.3 and 1.4 32-bit JVMs, 1 MB with the 64-bit SPARC 1.4 JVM, and 128 KB for 1.2 JVMs. If you have thousands of threads, you might be wasting a significant amount of stack space. If this is your problem, the error may be this:
OutOfMemoryError: unable to create new native thread
The minimum setting in 1.3 and 1.4 is 64 KB, and in 1.2 is 32 KB. You can change the stack size using the -Xss flag, like this: -Xss64k
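To see why thousands of threads matter, it helps to work through the arithmetic. The sketch below is an illustration only: the class name and the figure of 4000 threads are hypothetical, and the stack sizes are the 256 KB default and 64 KB minimum quoted above. It also prints the live thread count of the current JVM, using the standard ThreadMXBean, so you can substitute a real number.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper: estimates address space reserved for thread stacks. */
final class StackBudget {

    /** Stack reservation in megabytes for threadCount threads of stackSizeKb kilobytes each. */
    static long reservedStackMegabytes(int threadCount, long stackSizeKb) {
        return (threadCount * stackSizeKb) / 1024;
    }

    public static void main(String[] args) {
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.println("Live threads in this JVM: " + liveThreads);
        // 4000 is an assumed thread count chosen for illustration.
        System.out.println("4000 threads at 256 KB: " + reservedStackMegabytes(4000, 256) + " MB");
        System.out.println("4000 threads at 64 KB: " + reservedStackMegabytes(4000, 64) + " MB");
    }
}
```

With these assumed numbers, the reservation drops from 1000 MB to 250 MB, which is the kind of saving that can bring a process back under its virtual address space limit.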
You can also control memory use by setting entry limits for the regions.
The org.apache.geode.cache.PartitionedRegionDistributionException appears when Tanzu GemFire fails after many attempts to complete a distributed operation. This exception indicates that no data store member can be found to perform a destroy, invalidate, or get operation.
Response:
The org.apache.geode.cache.PartitionedRegionStorageException appears when Tanzu GemFire cannot create a new entry. This exception arises from a lack of storage space for put and create operations or for get operations with a loader. PartitionedRegionStorageException often indicates data loss or impending data loss.
The text string indicates the cause of the exception, as in these examples:
Unable to allocate sufficient stores for a bucket in the partitioned region....
Ran out of retries attempting to allocate a bucket in the partitioned region....
Response:
If an application crashes without any exception, this may be caused by an object memory problem. The process is probably hitting its virtual address space limits. For details, see OutOfMemoryError.
Response: Control memory use by setting entry limits for the regions.
If a distributed message does not get a response within a specified time, the sender raises an alert to signal that something might be wrong with the system member that has not responded. The alert is logged in the sender’s log as a warning.
A timeout alert can be considered normal.
Response:
A client or server produces a SocketTimeoutException when it stops waiting for a response from the other side of the connection and closes the socket. This exception typically happens on the handshake or when establishing a callback connection.
Response:
Increase the default socket timeout setting for the member. This timeout is set separately for the client Pool. For a client/server configuration, adjust the “read-timeout” value as described in <pool> or use the org.apache.geode.cache.client.PoolFactory.setReadTimeout method.
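The mechanism behind the exception can be reproduced with plain java.net sockets. The sketch below does not use the Tanzu GemFire Pool API; it is a stand-alone illustration, with hypothetical class and method names, of a reader that gives up when the other side stays silent for longer than the configured read timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo: a read timeout against a peer that never replies. */
final class ReadTimeoutDemo {

    /** Connects to a local peer that never sends data; returns true if the read timed out. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentPeer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // how long read() waits for data
            try {
                client.getInputStream().read();
                return false; // the peer answered or closed the connection
            } catch (SocketTimeoutException expected) {
                return true; // no data arrived within the timeout
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Read timed out: " + readTimesOut(200));
    }
}
```

A larger timeout makes the reader more tolerant of a slow peer at the cost of taking longer to notice one that has really stopped responding.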
A cluster member’s Cache and DistributedSystem are forcibly closed by the system membership coordinator if the member becomes sick or too slow to respond to heartbeat requests. When this happens, listeners receive a RegionDestroyed notification with an opcode of FORCED_DISCONNECT. The Tanzu GemFire log file for the member shows a ForcedDisconnectException with the message:
This member has been forced out of the cluster because it did not respond
within member-timeout milliseconds
Response:
To minimize the chances of this happening, you can increase the DistributedSystem property member-timeout. Take care, however, as this setting also controls the length of time required to notice a network failure. It should not be set too high.
Suspect a network problem or a problem in the configuration of transport for memory and discovery.
Response:
This situation can leave your caches in an inconsistent state. In networking circles, this kind of network outage is called the “split brain problem.”
Response:
Also see Understanding and Recovering from Network Outages.
Suspect a problem with the network, or the locator.
Response:
Check the health of your system members. Search the logs for this string:
Uncaught exception
An uncaught exception means a severe error, often an OutOfMemoryError. See OutOfMemoryError.
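If you have many member logs to check, the search can be scripted. This is a minimal sketch with hypothetical names; the default file name server.log is an assumption, so pass the real log path as the first argument. It assumes the log is UTF-8 text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists log lines that contain a marker string. */
final class LogScan {

    /** Returns "lineNumber: text" for every line that contains the marker. */
    static List<String> matchingLines(List<String> lines, String marker) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(marker)) {
                hits.add((i + 1) + ": " + lines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // "server.log" is a placeholder; supply the member's actual log file.
        Path log = Paths.get(args.length > 0 ? args[0] : "server.log");
        List<String> lines = Files.readAllLines(log, StandardCharsets.UTF_8);
        matchingLines(lines, "Uncaught exception").forEach(System.out::println);
    }
}
```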
Check your network monitoring tools to see whether the network is down or flooded.
gemfire.properties files.
gemfire.properties files to change this locator’s IP address in the locators attribute.
This problem can occur in systems with a great number of distributed-no-ack operations. That is, the presence of many no-ack operations can cause ack operations to take a long time to complete.
Response:
For information about alleviating this problem, see Slow distributed-ack Messages.
Slow system performance is sometimes caused by a buffer size that is too small for the objects being distributed.
Response:
If you are experiencing slow performance and are sending large objects (multiple megabytes), try increasing the socket buffer size settings in your system. For more information, see Socket Communication.
Attempting to run performance measurements for Tanzu GemFire on Windows can produce this error message:
cannot get Windows performance data. RegQueryValueEx returned 5
This error can occur because incorrect information is returned when a Win32 application calls the ANSI version of RegQueryValueEx Win32 API with HKEY_PERFORMANCE_DATA. This error is described in Microsoft KB article ID 226371 at http://support.microsoft.com/kb/226371/en-us.
Response:
To successfully acquire Windows performance data, you need to verify that you have the proper registry key access permissions in the system registry. In particular, make sure that Perflib in the following registry path is readable (KEY_READ access) by the Tanzu GemFire process:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib
An example of reasonable security on the performance data would be to grant administrators KEY_ALL_ACCESS access and interactive users KEY_READ access. This particular configuration would prevent non-administrator remote users from querying performance data.
See http://support.microsoft.com/kb/310426 and http://support.microsoft.com/kb/146906 for instructions about how to ensure that Tanzu GemFire processes have access to the registry keys associated with performance.
If your Java applications suddenly start to use 100% CPU, you may be experiencing the leap second bug. This bug is found in the Linux kernel and can severely affect Java programs. In particular, you may notice that method invocations using Thread.sleep(n), where n is a small number, actually sleep for a much longer period of time than requested. To verify that you are experiencing this bug, check the host’s dmesg output for the following message:
[10703552.860274] Clock: inserting leap second 23:59:60 UTC
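The Thread.sleep symptom can also be observed from inside a JVM. The sketch below, with hypothetical names, requests a short sleep several times and prints how long each call actually took. A few milliseconds of overshoot is ordinary scheduling noise, and this check alone does not prove the leap second bug is the cause.

```java
/** Hypothetical helper: compares requested and actual Thread.sleep durations. */
final class SleepOvershoot {

    /** Sleeps for the requested time and returns how long the call took, in milliseconds. */
    static long measureSleepMillis(long requestedMillis) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(requestedMillis);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long requested = 5; // a small value, as in the symptom described above
        for (int i = 0; i < 10; i++) {
            long actual = measureSleepMillis(requested);
            System.out.println("requested " + requested + " ms, slept " + actual + " ms");
        }
    }
}
```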
To fix this problem, issue the following commands on your affected Linux machines:
prompt> /etc/init.d/ntp stop
prompt> date -s "$(date)"
See the following web site for more information:
http://blog.wpkg.org/2012/07/01/java-leap-second-bug-30-june-1-july-2012-fix/