Here are several configuration options for VMware Tanzu Application Service for VMs (TAS for VMs) that can help ensure successful upgrades. In addition to following the Upgrade Preparation Checklist, review this topic to better understand how to prepare for TAS for VMs upgrades.
Breaking Change: The Service Mesh feature was removed in TAS for VMs v2.11. You must deactivate the Service Mesh feature before upgrading to TAS for VMs v2.11.
You can upgrade to TAS for VMs v2.13.9 from the following patch versions and later:
For more information, see Jump Upgrading to TAS for VMs v2.13.
The max_in_flight variable limits how many instances of a component can restart simultaneously during updates or upgrades. Increasing the value of max_in_flight can make updates run faster, but setting it too high risks overloading VMs and causing failure. For guidance on setting max_in_flight values, see Basic Advice.
Values for max_in_flight can be any integer between 1 and 100, or a percentage of the total number of instances. For example, a max_in_flight value of 20% in a deployment with 10 Diego Cell instances would allow no more than two Diego Cell instances to restart at once.
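The percentage-to-count arithmetic can be sketched as follows. Note that this uses simple integer division, which floors the result; BOSH's exact rounding behavior may differ for percentages that do not divide evenly.

```shell
# 20% of 10 Diego Cell instances -> at most 2 cells restarting at once.
INSTANCES=10
MAX_IN_FLIGHT_PCT=20
echo $(( INSTANCES * MAX_IN_FLIGHT_PCT / 100 ))   # prints 2
```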
The max_in_flight variable is a system-wide value with optional component-specific overrides. You can override the default value for individual jobs using an API endpoint.
Use the max_in_flight API endpoint to configure the maximum number of component instances that can start at a given time. This endpoint overrides product defaults. You can specify values as a percentage or an integer. Use the string default as the max_in_flight value to force the component to use the deployment's default value.
The following example lists three JOB_GUIDs. These three GUIDs illustrate the three different types of values you can use to configure max_in_flight. The endpoint requires only one GUID.
curl "https://EXAMPLE.com/api/v0/staged/products/PRODUCT-TYPE1-GUID/max_in_flight" \
  -X PUT \
  -H "Authorization: Bearer UAA_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "max_in_flight": {
      "JOB_1_GUID": 1,
      "JOB_2_GUID": "20%",
      "JOB_3_GUID": "default"
    }
  }'
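Before sending the request above, it can help to confirm that the JSON body is well formed. The following minimal sketch uses the same placeholder GUIDs as the curl example; it validates the payload locally and does not contact Ops Manager.

```shell
# Same placeholder GUIDs as the curl example; substitute real job GUIDs.
PAYLOAD='{"max_in_flight": {"JOB_1_GUID": 1, "JOB_2_GUID": "20%", "JOB_3_GUID": "default"}}'

# python3 -m json.tool exits non-zero if the JSON is malformed.
echo "${PAYLOAD}" | python3 -m json.tool > /dev/null && echo "payload OK"
```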
To upgrade TAS for VMs, BOSH must drain all Diego Cell VMs that host app instances. BOSH manages this process by upgrading a batch of Diego Cells at a time.
The number of Diego Cells that undergo upgrade simultaneously (either shutting down or coming back online) is controlled by the max_in_flight value of the Diego Cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego Cell job instances, then BOSH can upgrade a maximum of two Diego Cells at a time.
When BOSH triggers an upgrade, each Diego Cell undergoing upgrade enters “evacuation” mode. Evacuation mode means that the Diego Cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process. For more information, see How Diego Balances App Processes.
The evacuating Diego Cells continue to interact with the Diego system as replacements come online. A Diego Cell undergoing upgrade waits to shut down its original local instances until either the replacement app instances are running successfully or the evacuation process times out. This “evacuation timeout” defaults to 10 minutes.
If Diego Cell evacuation exceeds this timeout, then the Diego Cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.
A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.
If too many app instances are starting concurrently, then the load of these starts on the rest of the system can cause other apps that are already running to crash and be rescheduled. These events can result in a cascading failure.
To prevent this issue, TAS for VMs provides two throttle configurations:
The maximum number of in-flight Diego Cell instances
The maximum number of starting containers
The values of the above throttle configurations depend on the version of TAS for VMs that you have deployed and whether you have overridden the default values.
The following list describes the existing defaults for determining the override values in your deployment:
Starting container count maximum: 200
Starting container count overridable: Yes
Maximum in-flight Diego Cell instances: 4% of total instances
Maximum in-flight Diego Cell instances overridable: Yes
Set the max_in_flight variable low enough that the remaining component instances are not overloaded by typical use. If component instances are overloaded during updates, upgrades, or typical use, users may experience downtime.
Some more precise guidelines include:
For jobs with high resource usage, set max_in_flight to a low value. For example, a low max_in_flight value for Diego Cells allows the non-migrating Diego Cells to pick up the work of the Diego Cells that stop and restart during migration. If resource usage is already close to 100%, scale up your jobs before making any updates.
Quorum-based components are components deployed with an odd number of instances so that a majority of instances can maintain quorum. For quorum-based components such as etcd and Diego BBS, set max_in_flight to 1. This preserves quorum and prevents a split-brain scenario from occurring as jobs restart. For more information about split-brain scenarios, see Split-brain (computing) on Wikipedia.
For other components, set max_in_flight to the number of instances that you can afford to have down at any one time. The best values for your deployment vary based on your capacity planning. In a highly redundant deployment, you can set the number to a higher value to allow updates to run faster. However, if your components are at high utilization, keep the number low to prevent downtime.
Setting max_in_flight to a value greater than or equal to the number of instances you have running may reduce functionality.
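The guideline above can be sketched as a small helper that clamps a requested max_in_flight so that at least one instance of the component always stays up. This is an illustrative helper, not a TAS for VMs or BOSH tool.

```shell
# Clamp a requested max_in_flight below the total instance count so at
# least one instance always remains running.
clamp_max_in_flight() {
  local requested=$1 instances=$2
  if (( requested >= instances )); then
    echo $(( instances - 1 ))
  else
    echo "${requested}"
  fi
}

clamp_max_in_flight 5 4    # prints 3: never take down all 4 instances
clamp_max_in_flight 2 10   # prints 2: the request is already safe
```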
If your TAS for VMs deployment uses an internal MySQL database cluster, run the mysql-diag tool to validate that your cluster is healthy before you proceed with any TAS for VMs upgrade. An unhealthy cluster can trigger other TAS for VMs components to fail during an upgrade.
To run mysql-diag, follow the instructions in Running mysql-diag.
If your cluster appears unhealthy, follow the instructions in Recovering from MySQL cluster downtime to restore your cluster to a healthy state before upgrading TAS for VMs.
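As a sketch, the health check is typically run from a MySQL VM reached through bosh ssh. The deployment name and instance group below are hypothetical, and the assumption that mysql-diag is on the VM's PATH is environment-specific, so the command is only echoed here rather than executed.

```shell
# Hypothetical deployment and instance names; adjust for your environment.
DEPLOYMENT=cf-abc123
CMD="bosh -d ${DEPLOYMENT} ssh mysql/0 -c 'sudo mysql-diag'"
echo "Would run: ${CMD}"
```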
This section describes how to use the Diego Auctioneer job to configure the maximum number of app instances starting at a given time. This prevents Diego from scheduling too much new work for your platform to handle concurrently. A lower default can prevent server overload during cold start, which may be important if your infrastructure is not sized for a large number of concurrent cold starts.
The Diego Auctioneer only schedules a fixed number of app instances to start concurrently. This limit applies to both single and multiple Diego Cells. For example, if you set the limit to five starting instances, it does not matter if you have one Diego Cell with ten instances or five Diego Cells with two instances each. The auctioneer does not allow more than five instances to start at the same time.
If you are using a cloud-based IaaS rather than a smaller on-premises solution, VMware recommends setting a larger default. By default, the maximum number of started instances is 200.
To configure the maximum number of started instances in the Settings tab of the TAS for VMs tile:
Log in to Ops Manager.
Click the TAS for VMs tile.
Select App Containers.
In the Max-in-flight container starts field, enter the maximum number of started instances.
Click Save.
This section describes critical factors to consider when evaluating the type of file storage to use in your TAS for VMs deployment. The TAS for VMs blobstore relies on the file storage system to read and write resources, app packages, and droplets. For more information, see Blobstore in Cloud Controller.
During an upgrade, file storage with insufficient IOPS numbers can negatively impact the performance and stability of your TAS for VMs deployment.
If disk processing time takes longer than the evacuation timeout for Diego Cells, then Diego Cells and app instances may take too long to start up, resulting in a cascading failure.
However, the minimum required IOPS depends upon a number of deployment-specific factors and configuration choices. Use this section as a guide when deciding on the file storage configuration for your deployment.
When you deploy TAS for VMs, you can select internal file storage or external file storage, either network-accessible or IaaS-provided, as an option in the TAS for VMs tile.
Selecting internal storage causes TAS for VMs to deploy a dedicated VM that uses either NFS or WebDAV for file storage. Selecting external storage allows you to configure file storage provided in a network-accessible location or by an IaaS, such as Amazon S3, Google Cloud Storage, or Azure Storage.
Whenever possible, VMware recommends using external file storage.
As a best-effort calculation, estimate the total number of bits needed to move during a system upgrade to determine how IOPS-performant your file storage needs to be.
Number of Diego Cells: As a first calculation, determine the number of Diego Cells that your deployment currently uses. To view the number of Diego Cell instances currently running in your deployment, see the Resource Config pane of the TAS for VMs tile. If you expect to scale up the number of instances, use the anticipated scaled number.
Important: If your deployment uses more than 20 Diego Cells, avoid using internal file storage. Instead, always select external or IaaS-provided file storage.
Maximum In-Flight Load and Container Starts for Diego Cells: Operators can limit the number of containers and Diego Cell instances that Diego starts concurrently. If operators impose no limits, your file storage may experience exceptionally heavy load during an upgrade.
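Pulling the factors above together, a rough back-of-the-envelope estimate of upgrade load can be sketched as follows. All numbers are illustrative assumptions, not measurements: CONTAINERS_PER_CELL in particular depends entirely on your workload, and the defaults match the override list earlier in this topic (4% in-flight cells, 200 container starts).

```shell
# Rough, illustrative estimate of peak concurrent app-instance starts
# during an upgrade. All values are assumptions for this sketch.
CELLS=20
MAX_IN_FLIGHT_PCT=4          # default: 4% of total instances
CONTAINERS_PER_CELL=50       # assumed average app instances per cell
STARTING_CONTAINER_MAX=200   # default starting-container throttle

# Integer division floors the result; at least one cell is assumed
# to be in flight at any time.
cells_in_flight=$(( CELLS * MAX_IN_FLIGHT_PCT / 100 ))
if [ "$cells_in_flight" -lt 1 ]; then cells_in_flight=1; fi

# Instances needing replacement, capped by the starting-container throttle.
restarting=$(( cells_in_flight * CONTAINERS_PER_CELL ))
if [ "$restarting" -gt "$STARTING_CONTAINER_MAX" ]; then
  restarting=$STARTING_CONTAINER_MAX
fi

echo "Up to ${restarting} app instances may be starting at once."
```

Feeding your own cell count, in-flight settings, and average instance density into an estimate like this gives a sense of how much concurrent blobstore read traffic your file storage must sustain.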