Diagnosing System Problems

This section provides possible causes and suggested responses for system problems.

Locator does not start

Invocation of a locator with gfsh fails with an error like this:

Starting a GemFire Locator in C:\devel\gfcache\locator\locator
The Locator process terminated unexpectedly with exit status 1. Please refer to the log
        file in C:\devel\gfcache\locator\locator for full details.
Exception in thread "main" java.lang.RuntimeException: An IO error occurred while
        starting a Locator in C:\devel\gfcache\locator\locator on 192.0.2.0[10999]: Network is
        unreachable; port (10999) is not available on 192.0.2.0.
at
org.apache.geode.distributed.LocatorLauncher.start(LocatorLauncher.java:622)
at
org.apache.geode.distributed.LocatorLauncher.run(LocatorLauncher.java:513)
at
org.apache.geode.distributed.LocatorLauncher.main(LocatorLauncher.java:188)
Caused by: java.net.BindException: Network is unreachable; port (10999) is not available on
        192.0.2.0.
at
org.apache.geode.distributed.AbstractLauncher.assertPortAvailable(AbstractLauncher.java:136)
at
org.apache.geode.distributed.LocatorLauncher.start(LocatorLauncher.java:596)
...

This indicates a mismatch somewhere in the address, port pairs used for locator startup and configuration. The address you use for locator startup must match the address you list for the locator in the gemfire.properties locators specification. Every member of this cluster, including the locator itself, must have the complete locators specification in the gemfire.properties.

Response:

Check that your locators specification includes the address you are using to start your locator.
If you use a bind address, you must use numeric addresses for the locator specification. The bind address will not resolve to the machine’s default address.
If you are using a 64-bit Linux system, check whether your system is experiencing the leap second bug. See Java applications on 64-bit platforms hang or use 100% CPU for more information.

Application or cache server process does not start

If the process tries to start and then silently disappears, on Windows this indicates a memory problem.

Response:

On a Windows host, decrease the maximum JVM heap size. This property is specified on the gfsh command line:
```
gfsh>start server --name=server_name --max-heap=1024m
```
For details, see JVM Memory Settings and System Performance.
If this doesn’t work, try rebooting.

Application or cache server does not join the cluster

Response: Check these possible causes.

Network problem—the most common cause. First, try to ping the other hosts.
Firewall problems. If members of your distributed Tanzu GemFire system are located outside the LAN, check whether the firewall is blocking communication. Tanzu GemFire is a network-centric distributed system, so if you have a firewall running on your machine, it could cause connection problems. For example, your connections may fail if your firewall places restrictions on inbound or outbound permissions for Java-based sockets. You may need to modify your firewall configuration to permit traffic to Java applications running on your machine. The specific configuration depends on the firewall you are using.
Wrong multicast port when using multicast for membership. Check the gemfire.properties file of this application or cache server to see that the mcast-port is configured correctly. If you are running multiple clusters at your site, each cluster must use a unique multicast port.
Can not connect to locator (when using TCP for discovery).
- Check that the locators attribute in this process’s gemfire.properties has the correct IP address for the locator.
- Check that the locator process is running. If not, see instructions for related problem, Data distribution has stopped, although member processes are running.
- Bind address set incorrectly on a multi-homed host. When you specify the bind address, use the IP address rather than the host name. Sometimes multiple network adapters are configured with the same hostname. See Topology and Communication General Concepts for more information about using bind addresses.
Wrong version of Tanzu GemFire . A version mismatch can cause the process to hang or crash. Check the software version with the gemfire version command.

Member process seems to hang

Response:

During initialization—For persistent regions, the member may be waiting for another member with more recent data to start and load from its disk stores. See Disk Storage. Wait for the initialization to finish or time out. The process could be busy—some caches have millions of entries, and they can take a long time to load. Look for this especially with cache servers, because their regions are typically replicas and therefore store all the entries in the region. Applications, on the other hand, typically store just a subset of the entries. For partitioned regions, if the initialization eventually times out and produces an exception, the system architect needs to repartition the data.
For a running process—Investigate whether another member is initializing. Under some optional cluster configurations, a process can be required to wait for a response from other processes before it proceeds.

Member process does not read settings from the gemfire.properties file

Either the process can’t find the configuration file or, if it is an application, it may be doing programmatic configuration.

Response:

Check that the gemfire.properties file is in the right directory.
Make sure the process is not picking up settings from another gemfire.properties file earlier in the search path. Tanzu GemFire looks for a gemfire.properties file in the current working directory, the home directory, and the CLASSPATH, in that order.
For an application, check the documentation to see whether it does programmatic configuration. If so, the properties that are set programmatically cannot be reset in a gemfire.properties file. See your application’s customer support group for configuration changes.

Cache creation fails - must match schema definition root

System member startup fails with an error like one of these:

Exception in thread "main" org.apache.geode.cache.CacheXmlException:
While reading Cache XML file:/C:/gemfire/client_cache.xml.
Error while parsing XML, caused by org.xml.sax.SAXParseException:
Document root element "client-cache", must match DOCTYPE root "cache".

Exception in thread "main" org.apache.geode.cache.CacheXmlException:
While reading Cache XML file:/C:/gemfire/cache.xml.
Error while parsing XML, caused by org.xml.sax.SAXParseException:
Document root element "cache", must match DOCTYPE root "client-cache".

Tanzu GemFire declarative cache creation uses one of two root element pairs: cache or client-cache. The name must be the same in both places.

Response:

Modify your cache.xml file so it has the proper XML namespace and schema definition.

For peers and servers:

<?xml version="1.0" encoding="UTF-8"?>
<cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd"
    version="1.0”>
...
</cache>

For clients:

<?xml version="1.0" encoding="UTF-8"?>
<client-cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd"
    version="1.0">
...
</client-cache>

Cache is not configured properly

An empty cache can be a normal condition. Some applications start with an empty cache and populate it programmatically, but others are designed to bulk load data during initialization.

Response:

If your application should start with a full cache but it comes up empty, check these possible causes:

No regions—If the cache has no regions, the process isn’t reading the cache configuration file. Check that the name and location of the cache configuration file match those configured in the cache-xml-file attribute in gemfire.properties. If they match, the process may not be reading gemfire.properties. See Member process does not read settings from the gemfire.properties file.
Regions without data—If the cache starts with regions, but no data, this process may not have joined the correct cluster. Check the log file for messages that indicate other members. If you don’t see any, the process may be running alone in its own cluster. In a process that is clearly part of the correct cluster, regions without data may indicate an implementation design error.

Unexpected results for keySetOnServer and containsKeyOnServer

Client calls to keySetOnServer and containsKeyOnServer can return incomplete or inconsistent results if your server regions are not configured as partitioned or replicated regions.

A non-partitioned, non-replicate server region may not hold all data for the distributed region, so these methods would operate on a partial view of the data set.

In addition, the client methods use the least loaded server for each method call, so may use different servers for two calls. If the servers do not have a consistent view in their local data set, responses to client requests will vary.

The consistent view is only guaranteed by configuring the server regions with partitioned or replicate data-policy settings. Non-server members of the server system can use any allowable configuration as they are not available to take client requests.

The following server region configurations give inconsistent results. These configurations allow different data on different servers. There is no additional messaging on the servers, so no union of keys across servers or checking other servers for the key in question.

Normal
Mix (replicated, normal, empty) for a single distributed region. Inconsistent results depending on which server the client sends the request to

These configurations provide consistent results:

Partitioned server region
Replicated server region
Empty server region: keySetOnServer returns the empty set and containsKeyOnServer returns false

Response: Use a partitioned or replicate data-policy for your server regions. This is the only way to provide a consistent view to clients of your server data set. See Region Data Storage and Distribution Options.

Data operation returns PartitionOfflineException

In partitioned regions that are persisted to disk, if you have any members offline, the partitioned region will still be available but may have some buckets represented only in offline disk stores. In this case, methods that access the bucket entries return a PartitionOfflineException, similar to this:

org.apache.geode.cache.persistence.PartitionOfflineException:
Region /__PR/_B__root_partitioned__region_7 has persistent data that is no
longer online stored at these locations:
[/192.0.2.1:/export/straw3/users/jpearson/bugfix_Apr10/testCL/hostB/backupDirectory 
created at timestamp 1270834766733 version 0]

Response: Bring the missing member online, if possible. This restores the buckets to memory and you can work with them again. If the missing member cannot be brought back online, or the disk stores for the member are corrupt, you may need to revoke the member, which will allow the system to create the buckets in new members and resume operations with the entries. See Handling Missing Disk Stores.

Entries are not being evicted or expired as expected

Check these possible causes.

Transactions—Entries that are due to be expired may remain in the cache if they are involved in a transaction. Further, transactions never time out, so if a transaction hangs, the entries involved in the transaction will remain stuck in the cache. If you have a process with a hung transaction, you may need to end the process to remove the transaction. In your application programming, do not leave transactions open ended. Program all transactions to end with a commit or a rollback.
Partitioned regions—For performance reasons, eviction and expiration behave differently in partitioned regions and can cause entries to be removed before you expect. See Eviction and Expiration.

Cannot find the log file

Operating without a log file can be a normal condition, so the process does not log a warning.

Response:

Check whether the log-file attribute is configured in gemfire.properties. If not, logging defaults to standard output, and on Windows it may not be visible at all.
If log-file is configured correctly, the process may not be reading gemfire.properties. See Member process does not read settings from the gemfire.properties file.

OutOfMemoryError

An application gets an OutOfMemoryError if it needs more object memory than the process is able to give. The messages include java.lang.OutOfMemoryError.

Response:

The process may be hitting its virtual address space limits. The virtual address space has to be large enough to accommodate the heap, code, data, and dynamic link libraries (DLLs).

If your application is out of memory frequently, you may want to profile it to determine the cause.
If you suspect your heap size is set too low, you can increase direct memory by resetting the maximum heap size, using -Xmx. For details, see JVM Memory Settings and System Performance.
You may need to lower the thread stack size. The default thread stack size is quite large: 512kb on Sparc and 256kb on Intel for 1.3 and 1.4 32-bit JVMs, 1mb with the 64-bit Sparc 1.4 JVM; and 128k for 1.2 JVMs. If you have thousands of threads then you might be wasting a significant amount of stack space. If this is your problem, the error may be this:
```
OutOfMemoryError: unable to create new native thread
```
The minimum setting in 1.3 and 1.4 is 64kb, and in 1.2 is 32kb. You can change the stack size using the -Xss flag, like this: -Xss64k
You can also control memory use by setting entry limits for the regions.

PartitionedRegionDistributionException

The org.apache.geode.cache.PartitionedRegionDistributionException appears when Tanzu GemFire fails after many attempts to complete a distributed operation. This exception indicates that no data store member can be found to perform a destroy, invalidate, or get operation.

Response:

Check the network for traffic congestion or a broken connection to a member.
Look at the overall installation for problems, such as operations at the application level set to a higher priority than the Tanzu GemFire processes.
If you keep seeing PartitionedRegionDistributionException, you should evaluate whether you need to start more members.

PartitionedRegionStorageException

The org.apache.geode.cache.PartitionedRegionStorageException appears when Tanzu GemFire can’t create a new entry. This exception arises from a lack of storage space for put and create operations or for get operations with a loader. PartitionedRegionStorageException often indicates data loss or impending data loss.

The text string indicates the cause of the exception, as in these examples:

Unable to allocate sufficient stores for a bucket in the partitioned region....

Ran out of retries attempting to allocate a bucket in the partitioned region....

Response:

Check the network for traffic congestion or a broken connection to a member.
Look at the overall installation for problems, such as operations at the application level set to a higher priority than the Tanzu GemFire processes.
If you keep seeing PartitionedRegionStorageException, you should evaluate whether you need to start more members.

Application crashes without producing an exception

If an application crashes without any exception, this may be caused by an object memory problem. The process is probably hitting its virtual address space limits. For details, see OutOfMemoryError.

Response: Control memory use by setting entry limits for the regions.

Timeout alert

If a distributed message does not get a response within a specified time, it sends an alert to signal that something might be wrong with the system member that hasn’t responded. The alert is logged in the sender’s log as a warning.

A timeout alert can be considered normal.

Response:

If you’re seeing a lot of timeouts and you haven’t seen them before, check whether your network is flooded.
If you see these alerts constantly during normal operation, consider raising the ack-wait-threshold above the default 15 seconds.

Member produces SocketTimeoutException

A client and server produces a SocketTimeoutException when it stops waiting for a response from the other side of the connection and closes the socket. This exception typically happens on the handshake or when establishing a callback connection.

Response:

Increase the default socket timeout setting for the member. This timeout is set separately for the client Pool. For a client/server configuration, adjust the “read-timeout” value as described in <pool> or use the org.apache.geode.cache.client.PoolFactory.setReadTimeout method.

Member logs ForcedDisconnectException, Cache and DistributedSystem forcibly closed

A cluster member’s Cache and DistributedSystem are forcibly closed by the system membership coordinator if it becomes sick or too slow to respond to heartbeat requests. When this happens, listeners receive RegionDestroyed notification with an opcode of FORCED_DISCONNECT. The Tanzu GemFire log file for the member shows a ForcedDisconnectException with the message

This member has been forced out of the cluster because it did not respond
within member-timeout milliseconds

Response:

To minimize the chances of this happening, you can increase the DistributedSystem property member-timeout. Take care, however, as this setting also controls the length of time required to notice a network failure. It should not be set too high.

Members cannot see each other

Suspect a network problem or a problem in the configuration of transport for memory and discovery.

Response:

Check your network monitoring tools to see whether the network is down or flooded.
If you are using multi-homed hosts, make sure a bind address is set and consistent for all system members. For details about using bind addresses, see Topology and Communication General Concepts.
Check that all the applications and cache servers are using the same locator address.

One part of the cluster cannot see another part

This situation can leave your caches in an inconsistent state. In networking circles, this kind of network outage is called the “split brain problem.”

Response:

Restart all the processes to ensure data consistency.
Going forward, set up network monitoring tools to detect these kinds of outages quickly.
Enable network partition detection.

Also see Understanding and Recovering from Network Outages.

Data distribution has stopped, although member processes are running

Suspect a problem with the network, the locator, or the multicast configuration, depending on the transport your cluster is using.

Response:

Check the health of your system members. Search the logs for this string:
```
Uncaught exception
```
An uncaught exception means a severe error, often an OutOfMemoryError. See OutOfMemoryError.
Check your network monitoring tools to see whether the network is down or flooded.
If you are using multicast, check whether the existing configuration is no long appropriate for the current network traffic.
Check whether the locators have stopped. For a list of the locators in use, check the locators property in one of the application gemfire.properties files.
- Restart the locator processes on the same hosts, if possible. The cluster begins normal operation, and data distribution restarts automatically.
- If a locator must be moved to another host or a different IP address, complete these steps:
  1. Shut down all the members of the cluster in the usual order.
  2. Restart the locator process in its new location.
  3. Edit all the gemfire.properties files to change this locator’s IP address in the locators attribute.
  4. Restart the applications and cache servers in the usual order.
Create a watchdog daemon or service on each locator host to restart the locator process when it stops

Distributed-ack operations take a very long time to complete

This problem can occur in systems with a great number of distributed-no-ack operations. That is, the presence of many no-ack operations can cause ack operation to take a long time to complete.

Response:

For information on alleviating this problem, see Slow distributed-ack Messages.

Slow system performance

Slow system performance is sometimes caused by a buffer size that is too small for the objects being distributed.

Response:

If you are experiencing slow performance and are sending large objects (multiple megabytes), try increasing the socket buffer size settings in your system. For more information, see Socket Communication.

Can’t get Windows performance data

Attempting to run performance measurements for Tanzu GemFire on Windows can produce this error message:

Can't get Windows performance data. RegQueryValueEx returned 5

This error can occur because incorrect information is returned when a Win32 application calls the ANSI version of RegQueryValueEx Win32 API with HKEY_PERFORMANCE_DATA. This error is described in Microsoft KB article ID 226371 at http://support.microsoft.com/kb/226371/en-us.

Response:

To successfully acquire Windows performance data, you need to verify that you have the proper registry key access permissions in the system registry. In particular, make sure that Perflib in the following registry path is readable (KEY_READ access) by the Tanzu GemFire process:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Perflib

An example of reasonable security on the performance data would be to grant administrators KEY_ALL_ACCESS access and interactive users KEY_READ access. This particular configuration would prevent non-administrator remote users from querying performance data.

See http://support.microsoft.com/kb/310426 and http://support.microsoft.com/kb/146906 for instructions about how to ensure that Tanzu GemFire processes have access to the registry keys associated with performance.

Java applications on 64-bit platforms hang or use 100% CPU

If your Java applications suddenly start to use 100% CPU, you may be experiencing the leap second bug. This bug is found in the Linux kernel and can severely affect Java programs. In particular, you may notice that method invocations using Thread.sleep(n) where n is a small number will actually sleep for much longer period of time than defined by the method. To verify that you are experiencing this bug, check the host’s dmesg output for the following message:

[10703552.860274] Clock: inserting leap second 23:59:60 UTC

To fix this problem, issue the following commands on your affected Linux machines:

prompt> /etc/init.d/ntp stop
prompt> date -s "$(date)"

See the following web site for more information:

http://blog.wpkg.org/2012/07/01/java-leap-second-bug-30-june-1-july-2012-fix/