Here are several configuration options for VMware Tanzu Application Service for VMs (TAS for VMs) that can help ensure successful upgrades. In addition to following the Upgrade Preparation Checklist, review this topic to better understand how to prepare for TAS for VMs upgrades.

Limit component instance restarts

The max_in_flight variable limits how many instances of a component can restart simultaneously during updates or upgrades. Increasing the value of max_in_flight can make updates run faster, but setting it too high risks overloading VMs and causing failure. For guidance on setting max_in_flight values, see Basic Advice.

Values for max_in_flight can be any integer between 1 and 100, or a percentage of the total number of instances. For example, a max_in_flight value of 20% in a deployment with 10 Diego Cell instances allows no more than two Diego Cell instances to restart at once.
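The following sketch shows how a max_in_flight value translates into an instance count. The function name resolve_max_in_flight is hypothetical. The rounding rules (percentages round down, the result is never less than 1 or more than the total) are assumptions made for illustration and are not confirmed platform behavior.

```shell
#!/bin/sh
# Hypothetical helper: resolve a max_in_flight value to an instance count.
# Usage: resolve_max_in_flight VALUE TOTAL_INSTANCES
# Assumption: percentages round down, and the result is clamped to 1..TOTAL.
resolve_max_in_flight() (
  value=$1
  total=$2
  case $value in
    *%)
      pct=${value%?}                      # strip the trailing percent sign
      count=$(( total * pct / 100 ))      # integer division rounds down
      ;;
    *)
      count=$value
      ;;
  esac
  if [ "$count" -lt 1 ]; then count=1; fi
  if [ "$count" -gt "$total" ]; then count=$total; fi
  echo "$count"
)

resolve_max_in_flight 20% 10   # prints 2
resolve_max_in_flight 4% 200   # prints 8
```

Under these assumptions, a percentage that works out to less than one instance still allows one instance to restart at a time.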

Set max_in_flight

The max_in_flight variable is a system-wide value with optional component-specific overrides. You can override the default value for individual jobs using an API endpoint.

Use the max_in_flight API endpoint

Use the max_in_flight API endpoint to configure the maximum number of component instances that can start at a given time. This endpoint overrides product defaults. You can specify values as a percentage or an integer.

Use the string default as the max_in_flight value to force the component to use the deployment’s default value.

The following example sets max_in_flight for three jobs, one for each type of value you can use: an integer, a percentage, and the string default. The endpoint requires only one job GUID.

    curl "https://EXAMPLE.com/api/v0/staged/products/PRODUCT-TYPE1-GUID/max_in_flight" \
        -X PUT \
        -H "Authorization: Bearer UAA_ACCESS_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{
              "max_in_flight": {
                "JOB_1_GUID": 1,
                "JOB_2_GUID": "20%",
                "JOB_3_GUID": "default"
              }
            }'
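Before sending a request, it can help to check that each value has one of the three accepted forms. The following is a minimal sketch of such a check. The function name is_valid_max_in_flight is hypothetical, and limiting percentages to the range 1% to 100% is an assumption rather than documented behavior.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a max_in_flight value.
# Accepts: an integer from 1 to 100, a percentage such as 20%, or "default".
# Assumption: percentages are also limited to the range 1 to 100.
is_valid_max_in_flight() (
  value=$1
  case $value in
    default) exit 0 ;;
    *%) number=${value%?} ;;   # strip the trailing percent sign
    *) number=$value ;;
  esac
  # Reject empty strings and anything that contains a non-digit character.
  case $number in
    ''|*[!0-9]*) exit 1 ;;
  esac
  [ "$number" -ge 1 ] && [ "$number" -le 100 ]
)

# Report which candidate values pass the check.
for v in 1 20% default 0 abc; do
  if is_valid_max_in_flight "$v"; then
    echo "$v: accepted"
  else
    echo "$v: rejected"
  fi
done
```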

Guidance for Diego Cells

To upgrade TAS for VMs, BOSH must drain all Diego Cell VMs that host app instances. BOSH manages this process by upgrading one batch of Diego Cells at a time.

The number of Diego Cells that undergo an upgrade simultaneously (either in a state of shutting down or coming back online) is controlled by the max_in_flight value of the Diego Cell job. For example, if max_in_flight is set to 10% and your deployment has 20 Diego Cell job instances, then the maximum number of Diego Cells that BOSH can upgrade at a single time is 2.

When BOSH triggers an upgrade, each Diego Cell undergoing upgrade enters “evacuation” mode. In evacuation mode, the Diego Cell stops accepting new work and signals the rest of the Diego system to schedule replacements for its app instances. This scheduling is managed by the Diego auctioneer process. For more information, see How Diego Balances App Processes.

The evacuating Diego Cells continue to interact with the Diego system as replacements come online. The Diego Cell undergoing upgrade waits until either its app instance replacements run successfully or the evacuation process times out before shutting down the original local instances. This “evacuation timeout” defaults to 10 minutes.

If Diego Cell evacuation exceeds this timeout, then the Diego Cell stops its app instances and shuts down. The Diego system continues to re-emit start requests for the app replacements.

Prevent overload

A potential issue arises if too many app instance replacements are slow to start or do not start successfully at all.

If too many app instances are starting concurrently, then the load of these starts on the rest of the system can cause other apps that are already running to crash and be rescheduled. These events can result in a cascading failure.

To prevent this issue, TAS for VMs provides two throttle configurations:

  • The maximum number of in-flight Diego Cell instances.

  • The maximum number of starting containers.

The values of the throttle configurations depend on the version of TAS for VMs that you have deployed and whether you have overridden the default values.

The following list describes the default values and whether you can override them in your deployment:

  • Starting container count maximum: 200

  • Starting container count overridable: Yes

  • Maximum in-flight Diego Cell instances: 4% of total instances

  • Maximum in-flight Diego Cell instances overridable: Yes

Best practices

Set the max_in_flight variable low enough that the remaining component instances are not overloaded by typical use. If component instances are overloaded during updates, upgrades, or typical use, users might experience downtime.

Some more precise guidelines include:

  • Increase the count of Diego Cells before an upgrade, and then set max_in_flight to the number of Diego Cells that you added. For example, if you have 200 Diego Cells, adding 20 Diego Cells and then setting max_in_flight to 20 allows the resulting 220 Diego Cells to be upgraded in 11 batches of 20 Diego Cells each. With the default setting of 4%, the original 200 Diego Cells might require 25 batches of 8 Diego Cells each. If each batch takes 10 minutes, using this process might save you more than two hours of upgrade time. Ensure that you scale the number of Diego Cells down after the upgrade.

  • Quorum-based components are components with odd-numbered settings in the manifest. For quorum-based components such as etcd and Diego BBS, set max_in_flight to 1. This preserves quorum and prevents a split-brain scenario from occurring as jobs restart. For more information about split-brain scenarios, see Split-brain (computing) on Wikipedia.

  • For other components, set max_in_flight to the number of instances that you can afford to have down at any one time. The best values for your deployment vary based on your capacity planning. In a highly redundant deployment, you can set the number to a higher value to allow updates to run faster. However, if your components are at high utilization, you can keep the number low to prevent downtime.

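  • Avoid setting max_in_flight to a value greater than or equal to the number of instances you have running. Doing so allows every instance of the component to restart at the same time, which can reduce capacity or cause downtime.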
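The batch arithmetic in the Diego Cell guideline above can be sketched as follows. The function names ceil_div and upgrade_minutes are hypothetical, and the model assumes that every batch takes the same fixed time, which is a simplification. Note that counting the 20 added Diego Cells gives 220 cells in total, which works out to 11 batches of 20.

```shell
#!/bin/sh
# ceil_div A B: integer ceiling of A / B.
ceil_div() (
  echo $(( ($1 + $2 - 1) / $2 ))
)

# upgrade_minutes CELLS BATCH_SIZE MINUTES_PER_BATCH
# Assumption: batches run one after another and each takes the same time.
upgrade_minutes() (
  batches=$(ceil_div "$1" "$2")
  echo $(( batches * $3 ))
)

upgrade_minutes 200 8 10    # 25 batches at the default 4%: prints 250
upgrade_minutes 220 20 10   # 11 batches after adding 20 cells: prints 110
```

With these example figures, the estimated difference is 140 minutes. Actual batch times vary with app count and evacuation behavior, so treat the result as a rough planning number.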

Set a maximum number of starting containers

You can use the Diego Auctioneer job to configure the maximum number of app instances starting at a given time. This prevents Diego from scheduling too much new work for your platform to handle concurrently. A lower default can prevent server overload during cold start, which can be important if your infrastructure is not sized for a large number of concurrent cold starts.

The Diego Auctioneer only schedules a fixed number of app instances to start concurrently. This limit applies to both single and multiple Diego Cells. For example, if you set the limit to five starting instances, it does not matter if you have one Diego Cell with ten instances or five Diego Cells with two instances each. The auctioneer does not allow more than five instances to start at the same time.

If you are using a cloud-based IaaS, rather than a smaller on-premises solution, VMware recommends setting a larger value. By default, the maximum number of starting instances is 200.

To configure the maximum number of starting instances in the Settings tab of the TAS for VMs tile:

  1. Log in to Tanzu Operations Manager.

  2. Click the TAS for VMs tile.

  3. Select App Containers.

  4. In the Max-in-flight container starts field, enter the maximum number of starting instances.

  5. Click Save.

Configure file storage

There are critical factors to consider when you evaluate the type of file storage to use in your TAS for VMs deployment. The TAS for VMs blobstore relies on the file storage system to read and write resources, app packages, and droplets.

For more information, see Blobstore in Cloud Controller.

During an upgrade, file storage with insufficient IOPS numbers can negatively impact the performance and stability of your TAS for VMs deployment.

If disk processing time is longer than the evacuation timeout for Diego Cells, then Diego Cells and app instances can take too long to start up, resulting in a cascading failure.

However, the minimum required IOPS depends upon a number of deployment-specific factors and configuration choices. Use this section as a guide when deciding on the file storage configuration for your deployment.

Select internal or external file storage

When you deploy TAS for VMs, you can select internal file storage or external file storage, either network-accessible or IaaS-provided, as an option in the TAS for VMs tile.

Selecting internal storage causes TAS for VMs to deploy a dedicated VM that uses either NFS or WebDAV for file storage. Selecting external storage allows you to configure file storage provided in a network-accessible location or by an IaaS, such as Amazon S3, Google Cloud Storage, or Azure Storage.

Whenever possible, VMware recommends using external file storage.

Calculate disk load requirements

As a best-effort calculation, estimate the total number of bits that must move during a system upgrade to determine the IOPS performance required of your file storage. Consider the following factors:

  • Number of Diego Cells: As a first calculation, determine the number of Diego Cells that your deployment currently uses. To view the number of Diego Cell instances currently running in your deployment, see the Resource Config pane of the TAS for VMs tile. If you expect to scale up the number of instances, use the anticipated scaled number.

    If your deployment uses more than 20 Diego Cells, avoid using internal file storage. Instead, always select external or IaaS-provided file storage.

  • Maximum In-Flight Load and Container Starts for Diego Cells: Operators can limit the number of containers and Diego Cell instances that Diego starts concurrently. If operators impose no limits, your file storage might experience exceptionally heavy load during an upgrade.
