A rolling upgrade eliminates system downtime by keeping your existing distributed system running while you upgrade one member at a time. Each upgraded member can communicate with other members that are still running the earlier version of GemFire, so servers can respond to client requests even as the upgrade is underway. Interdependent data members can be stopped and started without mutually blocking, a problem that can occur when multiple data members are stopped at the same time.
Versions
Rolling upgrade requires that the older and newer versions of Tanzu GemFire be mutually compatible, which usually means that they share the same major version number. Therefore, you can perform a rolling upgrade from an earlier 9.x version up to the most recent 9.15 release.
Components
Rolling upgrades apply to the peer members or cache servers within a distributed system. Under some circumstances, rolling upgrades can also be applied within individual sites of multi-site (WAN) deployments.
Redundancy
All partitioned regions in your system must have full redundancy. Check the redundancy state of all your regions before you begin the rolling upgrade and before stopping any members. See Checking Redundancy in Partitioned Regions for details.
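As a sketch of this check from gfsh (the status redundancy command is available only in later 9.x releases; on earlier releases use show metrics instead, and note that the region name /myRegion is illustrative):

```
gfsh>status redundancy
gfsh>show metrics --region=/myRegion --categories=partition
```

Confirm that no partitioned region reports buckets without redundancy before stopping any member.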
If a rolling upgrade is not possible for your system, follow the Off-Line Upgrade procedure.
Do not create or destroy regions
When you perform a rolling upgrade, your online cluster will have a mix of members running different versions of GemFire. During this time period, do not execute region operations such as region creation or region destruction.
Region rebalancing affects the restart process
If you have startup-recovery-delay deactivated (set to -1) for your partitioned region, you will need to perform a rebalance on your region after you restart each member. If rebalance occurs automatically, as it will if startup-recovery-delay is enabled (set to a value other than -1), make sure that the rebalance completes before you stop the next server. If you have startup-recovery-delay enabled and set to a high number, you may need to wait extra time until the region has recovered redundancy, because rebalance must complete before new servers are restarted. The partitioned region attribute startup-recovery-delay is described in Configure Member Join Redundancy Recovery for a Partitioned Region.
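For reference, a cache.xml fragment that deactivates automatic rebalancing for a partitioned region might look like this (the region name and redundancy level are illustrative):

```
<region name="myPartitionedRegion">
  <region-attributes>
    <!-- startup-recovery-delay="-1" deactivates automatic rebalance on member join -->
    <partition-attributes redundant-copies="1" startup-recovery-delay="-1"/>
  </region-attributes>
</region>
```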
Checking component versions while upgrading
During a rolling upgrade, you can check the current GemFire version of all members in the cluster by looking at the server or locator logs.
When an upgraded member reconnects to the distributed system, it logs all the members it can see, as well as the GemFire version of those members. For example, an upgraded locator detects GemFire members running the older version (in this case GFE 9.0.0, the version being upgraded from):
[info 2013/06/03 10:03:29.206 PDT frodo <vm_1_thr_1_frodo> tid=0x1a]
DistributionManager frodo(locator1:21869:locator)<v16>:28242 started
on frodo[15001]. There were 2 other DMs. others:
[frodo(server2:21617)<v4>:14973( version:GFE 9.0.0 ),
frodo(server1:21069)<v1>:60929( version:GFE 9.0.0 )] (locator)
After some members have been upgraded, non-upgraded members will log the following message when they receive a new membership view:
Membership: received new view [frodo(locator1:20786)<v0>:32240|4]
[frodo(locator1:20786)<v0>:32240/51878,
frodo(server1:21069)<v1>:60929/46949,
frodo(server2:21617)<v4>( version:UNKNOWN[ordinal=23] ):14973/33919]
Non-upgraded members identify members that have been upgraded to the next version with version: UNKNOWN.
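The version strings in these membership-view messages can be pulled out of a log with standard text tools. A minimal sketch, using a sample excerpt in place of a real log file (in practice, point grep at your member's log, e.g. the locator's working directory):

```shell
# Extract the distinct member version strings seen in a membership view.
# The heredoc below stands in for a real locator/server log file.
grep -oE 'version:[A-Za-z0-9 .]+' <<'EOF' | sort -u
[frodo(server2:21617)<v4>:14973( version:GFE 9.0.0 ),
 frodo(server1:21069)<v1>:60929( version:GFE 9.0.0 )]
frodo(server2:21617)<v4>( version:UNKNOWN[ordinal=23] ):14973/33919]
EOF
```

Running this against each member's log during the upgrade gives a quick inventory of which versions are still present in the cluster.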
Cluster configuration affects save and restore
The way in which your cluster configuration was created determines which commands you use to save and restore that cluster configuration during the upgrade procedure. If the cluster was configured with gfsh commands, relying on the underlying cluster configuration service, the configuration can be saved in one central location, then applied to all newly-upgraded members. See Exporting and Importing Cluster Configurations. If the cluster was configured with XML and properties files, you must save and restore those files on each member individually.

Begin by installing the new version of the software alongside the older version of the software on all hosts. You will need both versions of the software during the upgrade procedure. See Installing Pivotal GemFire.
Upgrade locators first, then data members, then clients.
On the machine hosting the first locator you wish to upgrade, open a terminal console.
Start a gfsh prompt, using the version from your current GemFire installation, and connect to the currently running locator. For example:
gfsh>connect --locator=locator_hostname_or_ip_address[port]
Use gfsh commands to characterize your current installation so you can compare your post-upgrade system to the current one. For example, use the list members command to view locators and data members:
Name | Id
-------- | ------------------------------------------------
locator1 | 172.16.71.1(locator1:26510:locator)<ec><v0>:1024
locator2 | 172.16.71.1(locator2:26511:locator)<ec><v1>:1025
server1 | 172.16.71.1(server1:26514)<v2>:1026
server2 | 172.16.71.1(server2:26518)<v3>:1027
Save your cluster configuration. If you used the cluster configuration service, use the export cluster-configuration command. You only need to do this once, as the newly-upgraded locator will propagate the configuration to newly-upgraded members as they come online. If your configuration is file-based, copy cache.xml, gemfire.properties, and any other relevant configuration files to a well-known location. You must repeat this step for each member you upgrade.

Stop the locator. For example:
gfsh>stop locator --name=locator1
Stopping Locator running in /Users/username/sandbox/locator
on 172.16.71.1[10334] as locator...
Process ID: 96686
Log File: /Users/username/sandbox/locator/locator.log
....
No longer connected to 172.16.71.1[1099].
Start gfsh from the new GemFire installation. Verify that you are running the newer version:
gfsh>version
Start a locator and import the saved configuration. If you are using the cluster configuration service, use the same name and directory as the older version you stopped, and the new locator will access the old locator’s cluster configuration without having to import it in a separate step:
gfsh>start locator --name=locator1 --enable-cluster-configuration=true \
--dir=/data/locator1
Otherwise, use the gfsh import cluster-configuration command or explicitly import .xml and .properties files, as appropriate.
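For example, if the configuration was previously exported to a zip file (the path shown is illustrative):

```
gfsh>import cluster-configuration --zip-file-name=/data/config/cluster-config.zip
```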
The new locator should reconnect to the same members as the older locator. Use list members to verify:
gfsh>list members
Name | Id
-------- | ----------------------------------------------------
locator1 | 172.16.71.1(locator1:26752:locator)<ec><v17>:1024(version:UNKNOWN[ordinal=65])
locator2 | 172.16.71.1(locator2:26511:locator)<ec><v1>:1025
server1 | 172.16.71.1(server1:26514)<v2>:1026
server2 | 172.16.71.1(server2:26518)<v3>:1027
Upgrade the remaining locators by stopping and restarting them. When you have completed that step, the system gives a more coherent view of version numbers:
gfsh>list members
Name | Id
-------- | ----------------------------------------------------
locator1 | 172.16.71.1(locator1:26752:locator)<ec><v17>:1024
locator2 | 172.16.71.1(locator2:26808:locator)<ec><v30>:1025
server1 | 172.16.71.1(server1:26514)<v2>:1026(version:GFE 9.0)
server2 | 172.16.71.1(server2:26518)<v3>:1027(version:GFE 9.0)
The server entries show that the servers are still running an older version of GemFire, in this case (version:GFE 9.0).
After you have upgraded all of the system’s locators, upgrade the servers.
Upgrade each server, one at a time, by stopping it and restarting it. Restart the server with the same command-line options with which it was originally started in the previous installation. For example:
gfsh>stop server --name=server1
Stopping Cache Server running in /Users/share/server1
on 172.16.71.1[52139] as server1...
gfsh>start server --name=server1 --use-cluster-configuration=true \
--server-port=0 --dir=/data/server1
Starting a Geode Server in /Users/share/server1...
Use the list members command to verify that the server is now running the new version of GemFire:
gfsh>list members
Name | Id
-------- | ----------------------------------------------------
locator1 | 172.16.71.1(locator1:26752:locator)<ec><v17>:1024
locator2 | 172.16.71.1(locator2:26808:locator)<ec><v30>:1025
server1 | 172.16.71.1(server1:26835)<v32>:1026
server2 | 172.16.71.1(server2:26518)<v3>:1027(version:GFE 9.0)
Restore data to the data member. If automatic rebalancing is enabled (partitioned region attribute startup-recovery-delay is set to a value other than -1), data restoration starts automatically. If automatic rebalancing is deactivated (startup-recovery-delay=-1), you must initiate data restoration by issuing the gfsh rebalance command.
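A minimal example (the region name is illustrative; omit --include-region to rebalance all partitioned regions):

```
gfsh>rebalance --include-region=/myPartitionedRegion
```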
Wait until data has been restored to the newly-started server before upgrading the next server. You can repeat the gfsh show metrics command with the --member option or the --region option to verify that the data member is hosting data and that the amount of data it is hosting has stabilized.
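For example (the member and region names are illustrative):

```
gfsh>show metrics --member=server1
gfsh>show metrics --region=/myPartitionedRegion
```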
Shut down, restart, and rebalance servers until all data members are running the new version of GemFire.
Upgrade Tanzu GemFire clients, following the guidelines described in Upgrading Clients.