This topic describes the components used to ensure high availability in VMware Tanzu Application Service for VMs (TAS for VMs), vertical and horizontal scaling, and the infrastructure required to support scaling component VMs for high availability.
A system with high availability provides higher than typical uptime through redundancy of apps and component VMs. You can create the redundancy required for high availability in several ways, such as running VMs in multiple availability zones and using external blob storage solutions.
The sections in this topic provide guidance on configuring your TAS for VMs deployment for high availability.
This section describes how you can use availability zones, external load balancers, and external blob storage to ensure high availability for your deployment.
Availability Zones (AZs) are isolated locations within a region where public cloud providers host data centers.
You can assign and scale components in multiple AZs to help maintain high availability through redundancy. To configure sufficient redundancy, deploy TAS for VMs across three or more AZs and assign multiple component instances to different AZs.
Always use an odd number of AZs. This ensures that your deployment remains available as long as more than half of the AZs are available.
For example, a deployment with three AZs stays available when one AZ is unavailable. A deployment with five AZs stays available when two AZs are unavailable.
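The majority rule above can be expressed as a small calculation. This is an illustrative sketch, assuming components that require a majority quorum across AZs; the function name is hypothetical:

```python
# Illustrative sketch: majority-quorum fault tolerance for a given AZ count.
# With n AZs, clustered components that require a majority stay available
# as long as more than half of the AZs are reachable.

def az_fault_tolerance(n_azs: int) -> int:
    """Return how many AZs can fail while a majority of AZs remains."""
    return (n_azs - 1) // 2

# Three AZs tolerate one AZ outage; five AZs tolerate two.
print(az_fault_tolerance(3))  # 1
print(az_fault_tolerance(5))  # 2
# An even count adds cost without adding tolerance:
print(az_fault_tolerance(4))  # 1
```

Note that four AZs tolerate no more failures than three, which is why odd counts are recommended.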
External load balancers distribute traffic coming from the internet to your internal network.
To ensure high availability for production environments, use a highly available, customer-provided external load balancing solution that does the following:
For lab and test environments, the use-haproxy.yml ops file enables HAProxy for your foundation.
For more information, see Using Your Own Load Balancer.
Blobs are large binary files, such as PDFs or images. To store blobs for high availability, use external storage such as Amazon S3 or an S3-compatible service.
You can also store blobs internally using WebDAV or NFS. These components run as single instances and you cannot scale them. For these deployments, use the high availability features of your IaaS to immediately recover your WebDAV or NFS server VM if it fails. Contact Support if you need assistance.
The singleton compilation components do not affect platform availability.
You can scale platform capacity in the following ways:
Vertical scaling: Add memory and disk to each VM.
Horizontal scaling: Add more VMs that run instances of TAS for VMs components.
The type of apps you host on TAS for VMs determines whether you scale vertically or horizontally.
For more information about scaling applications and maintaining app uptime, see the following topics:
Scaling vertically means adding memory and disk to your component VMs.
To scale vertically, allocate and maintain the following:
Scaling horizontally means increasing the number of instances of VMs that run a functional component of a system.
You can horizontally scale most TAS for VMs component VMs to multiple instances for high availability.
You should also distribute the instances of components across different AZs to minimize downtime during ongoing operation, product updates, and platform upgrades. For more information about using AZs, see Availability Zones.
For more information regarding rolling app deployments, see Scaling TAS for VMs.
The table below provides the instance counts that VMware recommends for a high-availability deployment and the minimum instances for a functional deployment:
| VMware Tanzu Application Service for VMs (TAS for VMs) Job | Recommended Instance Number for HA | Minimum Instance Number | Notes |
| --- | --- | --- | --- |
| Diego Cell | ≥ 3 | 1 | The optimal balance between CPU and memory sizing and instance count depends on the performance characteristics of the apps that run on Diego Cells. Scaling vertically with larger Diego Cells creates larger points of failure: more apps go down when a single Diego Cell fails. On the other hand, scaling horizontally decreases the speed at which the system re-balances apps. Re-balancing 100 Diego Cells takes longer and demands more processing overhead than re-balancing 20 Diego Cells. |
| Diego Brain | ≥ 2 | 1 | For high availability, use at least one instance per AZ, or at least two instances if you have only one AZ. |
| Diego BBS | ≥ 2 | 1 | For high availability in a multi-AZ deployment, use at least one instance per AZ. For high availability in a single-AZ deployment, scale Diego BBS to at least two instances. |
| MySQL Server | 3 | 1 | If you use an external database in your deployment, you can set the MySQL Server instance count to 0. |
| MySQL Proxy | 2 | 1 | If you use an external database in your deployment, you can set the MySQL Proxy instance count to 0. |
| NATS | ≥ 2 | 1 | In a high-availability deployment, you might run a single NATS instance if your deployment lacks the resources to deploy two stable NATS servers. Components that use NATS are resilient to message failures, and the BOSH Resurrector recovers the NATS VM quickly if it becomes unresponsive. |
| Cloud Controller | ≥ 2 | 1 | Scale the Cloud Controller to accommodate the number of requests to the API and the number of apps in the system. |
| Clock Global | ≥ 2 | 1 | For a high-availability deployment, scale the Clock Global job to a value greater than 1 or to the number of AZs you have. |
| Router | ≥ 2 | 1 | Scale the Gorouter to accommodate the number of incoming requests. Additional instances increase available bandwidth. In general, this load is much less than the load on Diego Cells. |
| Doppler Server | ≥ 2 | 1 | Deploying additional Doppler servers splits traffic across them. For a high-availability deployment, VMware recommends at least two instances per AZ. |
| Loggregator TrafficController | ≥ 2 | 1 | Deploying additional Loggregator TrafficController instances allows you to direct traffic to them in a round-robin manner. For a high-availability deployment, VMware recommends at least two instances per AZ. |
| Log Cache | ≥ 3 | 1 | Deploying additional Log Cache instances increases the total storage. Data is sharded by app ID, so if logs and metrics for an app are sharded to an unavailable instance, they remain unavailable regardless of the number of instances or AZs. Data is not re-balanced unless you change the number of desired instances in Ops Manager and apply changes. |
| CredHub | ≥ 3 | 2 | CredHub is a scalable component. For high availability, use at least one instance per AZ, or at least three instances if you have only one AZ. |
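The Diego Cell trade-off in the table above can be made concrete with a rough capacity estimate. This is an illustrative sketch: the function names, headroom factor, and workload numbers are hypothetical, not a VMware sizing formula:

```python
# Illustrative sketch: the vertical-vs-horizontal trade-off for Diego Cells.
# Given total app memory and a candidate cell size, estimate how many cells
# are needed and what fraction of the workload a single cell failure affects.

import math

def cells_needed(total_app_gb: float, cell_gb: float, headroom: float = 0.2) -> int:
    """Cells required to hold the workload plus failover headroom."""
    return math.ceil(total_app_gb * (1 + headroom) / cell_gb)

def blast_radius(cell_count: int) -> float:
    """Approximate fraction of apps lost when one cell fails."""
    return 1 / cell_count

for cell_gb in (32, 128):
    n = cells_needed(total_app_gb=1000, cell_gb=cell_gb)
    print(f"{cell_gb} GB cells: {n} cells, "
          f"~{blast_radius(n):.1%} of apps affected per cell failure")
```

For a hypothetical 1,000 GB workload, 32 GB cells yield many small failure domains, while 128 GB cells yield fewer cells, each of whose failure affects a larger share of apps.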
The ability to scale component VMs is important for high availability. To scale component VMs, you must ensure that the surrounding infrastructure of your deployment supports VM scaling.
This section describes the infrastructure required to support scaling component VMs for high availability.
The BOSH Resurrector increases TAS for VMs availability in the following ways:
The BOSH Resurrector continuously monitors the status of all VMs in a TAS for VMs deployment, as well as the BOSH Agent on each VM. If either the VM or its BOSH Agent fails, the Resurrector re-creates the VM on another active host. To enable the BOSH Resurrector, see the Enable BOSH Resurrector section of the Using the BOSH Resurrector topic.
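The monitor-and-recreate pattern the Resurrector follows can be sketched as a periodic scan over agent heartbeats. This is NOT BOSH's implementation; the functions, VM names, and timeout are hypothetical:

```python
# Illustrative sketch: poll agent heartbeats and flag any VM whose agent
# has stopped responding, so it can be re-created on another active host.

import time

HEARTBEAT_TIMEOUT = 60  # hypothetical: seconds without a heartbeat = "down"

def unresponsive(last_heartbeat: float, now: float) -> bool:
    return now - last_heartbeat > HEARTBEAT_TIMEOUT

def scan(heartbeats: dict[str, float], now: float) -> list[str]:
    """Return the VMs whose agent has missed its heartbeat window."""
    return [vm for vm, seen in heartbeats.items() if unresponsive(seen, now)]

now = time.time()
heartbeats = {"diego_cell/0": now - 5, "diego_cell/1": now - 300}
for vm in scan(heartbeats, now):
    print(f"re-creating {vm} on another active host")  # flags diego_cell/1
```

The real Resurrector runs inside the BOSH Health Monitor and acts through the BOSH Director; this sketch only shows the detection step.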
Each IaaS has different ways of limiting resource consumption for scaling VMs. Consult with your IaaS administrator to ensure additional VMs and related resources, like IPs and storage, are available to scale.
For more information about configuring your resource pools according to the requirements of your deployment, see the Ops Manager configuration topic for your IaaS.
For information about configuring resource pools for Amazon Web Services, see Amazon EC2 FAQs in the Amazon documentation.
For information about configuring resource pools for OpenStack, see Manage projects and users in the OpenStack documentation.
For information about configuring resource pools for vSphere, see the Resource Config Pane section of the Configuring BOSH Director on vSphere topic.
For database services deployed outside TAS for VMs, use the high availability features included with your infrastructure. Also, configure backup and restore where possible. For more information about scaling internal database components, see Scaling TAS for VMs.
Note: Data services may have single points of failure depending on their configuration.
If you need assistance, contact Support.