Here are several configuration options for VMware Tanzu Application Service for VMs (TAS for VMs) that can help ensure successful upgrades. In addition to following the Upgrade Preparation Checklist, review this topic to better understand how to prepare for TAS for VMs upgrades.
Caution: If you are upgrading from v6.0.5 or v6.0.6 to a later patch version, upgrade TAS for VMs and apply those changes before you upgrade TAS for VMs [Windows]. Failure to upgrade TAS for VMs first results in TCP route outages on TAS for VMs [Windows].
For more information, see Downtime for TCP Routes When Upgrading IST, TASW, and Services Tiles before TAS in TPCF in the Broadcom Support Knowledge Base.
The max_in_flight variable limits how many instances of a component can restart simultaneously during updates or upgrades. Increasing the value of max_in_flight can make updates run faster, but setting it too high risks overloading VMs and causing failure. For guidance on setting max_in_flight values, see Basic Advice.
Values for max_in_flight can be any integer between 1 and 100, or a percentage of the total number of instances. For example, a max_in_flight value of 20% in a deployment with 10 Diego Cell instances means that no more than two Diego Cell instances restart at once.
The max_in_flight variable is a system-wide value with optional component-specific overrides. You can override the default value for individual jobs using an API endpoint.
Use the max_in_flight API endpoint to configure the maximum number of component instances that can start at a given time. This endpoint overrides product defaults. You can specify values as a percentage or an integer. Use the string default as the max_in_flight value to force the component to use the deployment's default value.
The following example lists three JOB_GUIDs. These three GUIDs are examples of the three different types of values you can use to configure max_in_flight. The endpoint only requires one GUID.
curl "https://EXAMPLE.com/api/v0/staged/products/PRODUCT-TYPE1-GUID/max_in_flight" \ -X PUT \ -H "Authorization: Bearer UAA_ACCESS_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "max_in_flight": { "JOB_1_GUID": 1, "JOB_2_GUID": "20%", "JOB_3_GUID": "default" } }'
To upgrade TAS for VMs, BOSH must drain all Diego Cell VMs that host app instances. BOSH manages this process by upgrading a batch of Diego Cells at a time.
The number of Diego Cells that undergo an upgrade simultaneously (either in a state of shutting down or coming back online) is controlled by the max_in_flight value of the Diego Cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego Cell job instances, then the maximum number of Diego Cells that BOSH can upgrade at a single time is 2.
When BOSH triggers an upgrade, each Diego Cell undergoing the upgrade enters "evacuation" mode. In evacuation mode, the Diego Cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process. For more information, see How Diego Balances App Processes.
The evacuating Diego Cells continue to interact with the Diego system as replacements come online. The Diego Cell undergoing upgrade waits until either its app instance replacements run successfully or the evacuation process times out before shutting down the original local instances. This "evacuation timeout" defaults to 10 minutes.
If Diego Cell evacuation exceeds this timeout, then the Diego Cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.
A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.
If too many app instances are starting concurrently, then the load of these starts on the rest of the system can cause other apps that are already running to crash and be rescheduled. These events can result in a cascading failure.
To prevent this issue, TAS for VMs provides two throttle configurations:
The maximum number of in-flight Diego Cell instances.
The maximum number of starting containers.
The values of the throttle configurations depend on the version of TAS for VMs that you have deployed and whether you have overridden the default values.
The following list describes the existing defaults for determining the override values in your deployment:
Starting container count maximum: 200
Starting container count overridable: Yes
Maximum in-flight Diego Cell instances: 4% of total instances
Maximum in-flight Diego Cell instances overridable: Yes
Set the max_in_flight variable low enough that the remaining component instances are not overloaded by typical use. If component instances are overloaded during updates, upgrades, or typical use, users might experience downtime.
Some more precise guidelines include:
Increase the count of Diego Cells before an upgrade, and then set max_in_flight to the number of Diego Cells that you added. For example, if you have 200 Diego Cells, adding 20 Diego Cells and then setting max_in_flight to 20 allows the Diego Cells to be upgraded in 10 batches of 20 Diego Cells each, whereas the default settings might require 25 batches of 8 Diego Cells each. If each batch takes 10 minutes, using this process might save you over 2.5 hours of upgrade time. Ensure that you scale the number of Diego Cells back down after the upgrade. For one way to apply this override through the Ops Manager API, see the sketch after this list.
Quorum-based components are components with odd-numbered instance counts in the manifest. For quorum-based components such as etcd and Diego BBS, set max_in_flight to 1. This preserves quorum and prevents a split-brain scenario from occurring as jobs restart. For more information about split-brain scenarios, see Split-brain (computing) on Wikipedia.
For other components, set max_in_flight to the number of instances that you can afford to have down at any one time. The best values for your deployment vary based on your capacity planning. In a highly redundant deployment, you can set the number to a higher value to allow updates to run faster. However, if your components are at high utilization, you can keep the number low to prevent downtime.
Setting max_in_flight to a value greater than or equal to the number of instances you have running can result in downtime for that component.
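As an illustration of the first guideline above, the following sketch reuses the max_in_flight endpoint shown earlier to raise the Diego Cell override before an upgrade and reset it afterward. The DIEGO-CELL-JOB-GUID placeholder is hypothetical; look up the actual GUID of the Diego Cell job first.

# Before the upgrade: allow up to 20 Diego Cells in flight at once (sketch)
curl "https://EXAMPLE.com/api/v0/staged/products/PRODUCT-TYPE1-GUID/max_in_flight" \
  -X PUT \
  -H "Authorization: Bearer UAA_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "max_in_flight": { "DIEGO-CELL-JOB-GUID": 20 } }'

# After the upgrade: return the Diego Cell job to the deployment default
curl "https://EXAMPLE.com/api/v0/staged/products/PRODUCT-TYPE1-GUID/max_in_flight" \
  -X PUT \
  -H "Authorization: Bearer UAA_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "max_in_flight": { "DIEGO-CELL-JOB-GUID": "default" } }'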
If your TAS for VMs deployment uses an internal MySQL database cluster, run the mysql-diag tool to validate that your cluster is healthy before you proceed with any TAS upgrade. An unhealthy cluster can cause other TAS components to fail during an upgrade.
To run mysql-diag, follow the instructions in Running mysql-diag.
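For orientation only, the following is a minimal sketch of reaching the tool from the BOSH CLI. The deployment name, the mysql instance group name, and the mysql-diag invocation are assumptions that can vary by TAS version; treat Running mysql-diag as the authoritative procedure.

# List deployments to find your TAS deployment name (typically cf-<GUID>)
bosh deployments

# SSH to a MySQL node; the instance group name may differ in your deployment
bosh -d cf-EXAMPLE-GUID ssh mysql/0

# On the MySQL VM, run the diagnostic tool (invocation assumed; see Running mysql-diag)
sudo mysql-diag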
If your cluster appears unhealthy, follow the instructions in Recovering from MySQL cluster downtime to restore your cluster to a healthy state before upgrading TAS.
You can use the Diego Auctioneer job to configure the maximum number of app instances starting at a given time. This prevents Diego from scheduling too much new work for your platform to handle concurrently. A lower default can prevent server overload during cold start, which can be important if your infrastructure is not sized for a large number of concurrent cold starts.
The Diego Auctioneer schedules only a fixed number of app instances to start concurrently. This limit applies across the deployment, regardless of the number of Diego Cells. For example, if you set the limit to five starting instances, it does not matter whether you have one Diego Cell with ten instances or five Diego Cells with two instances each. The auctioneer does not allow more than five instances to start at the same time.
If you are using a cloud-based IaaS, rather than a smaller on-premises solution, VMware recommends setting a larger default. By default, the maximum number of started instances is 200.
To configure the maximum number of started instances in the Settings tab of the TAS for VMs tile:
Log in to Tanzu Operations Manager.
Click the TAS for VMs tile.
Select App Containers.
In the Max-in-flight container starts field, enter the maximum number of started instances.
Click Save.
There are critical factors to consider when you evaluate the type of file storage to use in your TAS for VMs deployment. The TAS for VMs blobstore relies on the file storage system to read and write resources, app packages, and droplets.
For more information, see Blobstore in Cloud Controller.
During an upgrade, file storage with insufficient IOPS numbers can negatively impact the performance and stability of your TAS for VMs deployment.
If disk processing time is longer than the evacuation timeout for Diego Cells, then Diego Cells and app instances can take too long to start up, resulting in a cascading failure.
However, the minimum required IOPS depends upon a number of deployment-specific factors and configuration choices. Use this section as a guide when deciding on the file storage configuration for your deployment.
When you deploy TAS for VMs, you can select internal file storage or external file storage, either network-accessible or IaaS-provided, as an option in the TAS for VMs tile.
Selecting internal storage causes TAS for VMs to deploy a dedicated VM that uses either NFS or WebDAV for file storage. Selecting external storage allows you to configure file storage provided in a network-accessible location or by an IaaS, such as Amazon S3, Google Cloud Storage, or Azure Storage.
Whenever possible, VMware recommends using external file storage.
As a best-effort calculation, estimate the total number of bits needed to move during a system upgrade to determine the IOPS performance required of your file storage.
Number of Diego Cells: As a first calculation, determine the number of Diego Cells that your deployment currently uses. To view the number of Diego Cell instances currently running in your deployment, see the Resource Config pane of the TAS for VMs tile, or use the BOSH CLI sketch after this list. If you expect to scale up the number of instances, use the anticipated scaled number.
If your deployment uses more than 20 Diego Cells, avoid using internal file storage. Instead, always select external or IaaS-provided file storage.
Maximum In-Flight Load and Container Starts for Diego Cells: Operators can limit the number of containers and Diego Cell instances that Diego starts concurrently. If operators impose no limits, your file storage might experience exceptionally heavy load during an upgrade.
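As a complement to the Resource Config pane, the following sketch counts the Diego Cell instances in a TAS for VMs deployment from the BOSH CLI. The deployment name and the diego_cell instance group name are assumptions; adjust them to match your environment.

# Count Diego Cell instances in the TAS deployment (sketch)
bosh -d cf-EXAMPLE-GUID instances | grep -c "diego_cell/"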