How the Elastic DRS Algorithm Works

The Elastic DRS algorithm monitors resource utilization in a cluster over time. After allowing for spikes and randomness in the utilization, it makes a recommendation to scale out or scale in a cluster and generates an alert. This alert is processed immediately by provisioning a new host or removing a host from the cluster.

The algorithm runs every 5 minutes and uses the following parameters:

The minimum and maximum number of hosts that the algorithm can scale the cluster in or out to.
Thresholds for CPU, memory, and storage utilization that optimize host allocation for cost or performance. These thresholds, listed on the Manage Elasticity in SDDC Clusters page, are predefined for each DRS policy type and cannot be altered by the user.

Scale-out Recommendation

A scale-out recommendation is generated when any of CPU, memory, or storage utilization remains consistently above thresholds. For example, if storage utilization goes above the high threshold but memory and CPU utilization remain below their respective thresholds, a scale-out recommendation is generated. A vCenter event is posted to indicate the start, completion, or failure of scaling out on the cluster.
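The "any resource" rule can be sketched as follows. This is a minimal illustration, not the actual Elastic DRS implementation: the `Sample` type, the `should_scale_out` function, and the threshold values are hypothetical, and "consistently above" is modeled here as every sample in a window exceeding the threshold, which is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One utilization reading, in percent (hypothetical structure)."""
    cpu: float
    memory: float
    storage: float


# Hypothetical high thresholds in percent. The real values depend on the
# selected policy and are not stated in this section.
HIGH = Sample(cpu=90.0, memory=80.0, storage=70.0)


def should_scale_out(window: Sequence[Sample], high: Sample = HIGH) -> bool:
    """Return True if ANY single resource stays above its high threshold
    for every sample in the window (our stand-in for "consistently")."""
    if not window:
        return False
    return any(
        all(getattr(sample, res) > getattr(high, res) for sample in window)
        for res in ("cpu", "memory", "storage")
    )
```

With these assumed thresholds, a window in which only storage stays high is enough to produce a scale-out recommendation, which matches the storage example above.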

For CPU and memory recommendations, a scale-out action is initiated only if you have sufficient subscription coverage for the relevant region and instance type. For storage recommendations, the scale-out action is initiated regardless of available subscriptions. However, if you do not have sufficient subscriptions to cover the new host or hosts, you must purchase the additional subscriptions within 48 hours. Otherwise, you lose access to your SDDC and workloads until you purchase sufficient subscriptions.
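The subscription rules above can be summarized as a small decision function. This is a sketch under stated assumptions: `plan_scale_out` and its return shape are hypothetical, and coverage is simplified to one subscription per host, ignoring the region and instance type matching that the real check involves.

```python
from datetime import datetime, timedelta
from typing import Optional, TypedDict

# Grace period for buying subscriptions after a storage-driven scale-out.
PURCHASE_GRACE = timedelta(hours=48)


class ScaleOutPlan(TypedDict):
    add_host: bool
    purchase_deadline: Optional[datetime]


def plan_scale_out(trigger: str, hosts_after: int, subscriptions: int,
                   now: datetime) -> ScaleOutPlan:
    """Decide whether a host is added and whether a purchase deadline applies.

    Assumption: coverage means subscriptions >= hosts after the scale-out.
    """
    covered = subscriptions >= hosts_after
    if trigger in ("cpu", "memory"):
        # CPU and memory scale-outs proceed only with sufficient coverage.
        return {"add_host": covered, "purchase_deadline": None}
    if trigger == "storage":
        # Storage scale-outs always proceed; uncovered hosts start the clock.
        deadline = None if covered else now + PURCHASE_GRACE
        return {"add_host": True, "purchase_deadline": deadline}
    raise ValueError(f"unknown trigger: {trigger!r}")
```

For example, a storage-driven scale-out to 5 hosts with only 4 subscriptions still adds the host, but returns a deadline 48 hours after `now`.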

To simplify planning for possible scale-out events, Broadcom recommends purchasing 1 additional subscription for every 26 hosts in a single availability zone SDDC, and 2 additional subscriptions for every stretched cluster SDDC.
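As a rough planning aid, the recommendation can be expressed as arithmetic. The function name is hypothetical, and rounding up for partial groups of 26 hosts is an assumption; the recommendation does not say how to treat host counts that are not a multiple of 26.

```python
import math


def recommended_extra_subscriptions(host_count: int, stretched: bool) -> int:
    """Estimate the spare subscriptions to hold for scale-out events."""
    if stretched:
        # Flat recommendation for a stretched cluster SDDC.
        return 2
    # Assumption: any partial group of 26 hosts counts as a full group.
    return math.ceil(host_count / 26)
```

Under that rounding assumption, a single availability zone SDDC with 30 hosts would hold 2 spare subscriptions.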

Scale-in Recommendation

A scale-in recommendation is generated when CPU, memory, and storage utilization all remain consistently below thresholds. The scale-in recommendation is not acted upon if the number of hosts in the cluster is at the minimum specified value. A vCenter event is posted to indicate the start, completion, or failure of the scaling in operation on the cluster.
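In contrast to scale-out, scale-in requires all three resources to stay low. The following sketch separates generating the recommendation from acting on it. The function names and threshold values are hypothetical, and "consistently below" is modeled as every sample in a window falling under the threshold, which is an assumption.

```python
from typing import Mapping, Sequence

RESOURCES = ("cpu", "memory", "storage")

# Hypothetical low thresholds in percent. The real values depend on the
# selected policy and are not stated in this section.
LOW_THRESHOLDS = {"cpu": 50.0, "memory": 50.0, "storage": 20.0}


def scale_in_recommended(window: Sequence[Mapping[str, float]],
                         low: Mapping[str, float] = LOW_THRESHOLDS) -> bool:
    """True only if ALL resources are below their low threshold in every sample."""
    if not window:
        return False
    return all(sample[res] < low[res] for sample in window for res in RESOURCES)


def scale_in_acted_upon(window: Sequence[Mapping[str, float]],
                        host_count: int, min_hosts: int) -> bool:
    """A recommendation is not acted upon when the cluster is at its minimum size."""
    return scale_in_recommended(window) and host_count > min_hosts
```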

Note:

Whenever you reduce cluster size, storage latency increases due to process overhead introduced by host removal. The duration of this overhead varies with the amount of data involved. It can take as little as an hour, though an extreme case could require more than 48 hours. While cluster-size reduction (scale-in) is underway, workload VMs supported by the affected clusters can experience significant increases in storage latency.

Scaling Stretched Clusters

When Elastic DRS generates a scale-in or scale-out event for a stretched (multiple availability zone) cluster, hosts are removed or added in both availability zones.

If a host fails in any stretched cluster, Elastic DRS attempts to replace it in its original availability zone. If it is unable to do that because of a full or partial availability zone failure, Elastic DRS scales out the cluster in the remaining availability zone. It adds non-billable hosts in the remaining availability zone until the cluster reaches its original host count. This scale-out workflow depends on available capacity and is not guaranteed. When the failed availability zone is restored, Elastic DRS scales in the cluster to remove the extra hosts, restoring the original host count.

For more information about how Elastic DRS handles scaling and zone failure scenarios, see the VMware Cloud Tech Zone article VMware Cloud on AWS: Stretched Clusters.

Time Delays Between Two Recommendations

The algorithm includes a safety check that prevents it from processing events generated in rapid succession and gives the cluster time to settle after the changes made by the last processed event. The following time intervals between events are enforced:

A 30-minute delay between two successive scale-out events.
A three-hour delay before processing a scale-in event after scaling out the cluster.
A four-hour delay between two successive scale-in events (unless the cluster has the Rapid Scaling policy).
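The three intervals above can be sketched as a single eligibility check. The `may_process` function and its parameters are hypothetical. This section does not state which interval applies between scale-in events under the Rapid Scaling policy, so the sketch simply skips the four-hour rule in that case. It also assumes that no delay applies to a scale-out that follows a scale-in, because none is listed.

```python
from datetime import datetime, timedelta
from typing import Optional

OUT_AFTER_OUT = timedelta(minutes=30)  # between two scale-out events
IN_AFTER_OUT = timedelta(hours=3)      # scale-in after a scale-out
IN_AFTER_IN = timedelta(hours=4)       # between two scale-in events


def may_process(event: str, now: datetime,
                last_scale_out: Optional[datetime],
                last_scale_in: Optional[datetime],
                rapid_scaling: bool = False) -> bool:
    """Return True if enough time has passed to process the event."""
    if event == "scale-out":
        return last_scale_out is None or now - last_scale_out >= OUT_AFTER_OUT
    if event == "scale-in":
        if last_scale_out is not None and now - last_scale_out < IN_AFTER_OUT:
            return False
        # Assumption: Rapid Scaling bypasses the four-hour rule entirely.
        if (not rapid_scaling and last_scale_in is not None
                and now - last_scale_in < IN_AFTER_IN):
            return False
        return True
    raise ValueError(f"unknown event: {event!r}")
```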

Interactions of Recommendations with Other Operations

The following operations might interact with Elastic DRS recommendations:

User-initiated addition or removal of hosts.
Normally, you would not need to manually add or remove hosts from a cluster with Elastic DRS enabled. You can still perform these operations, but an Elastic DRS recommendation might revert them at some point.

If a user-initiated add or remove host operation is in progress, the current recommendation by the Elastic DRS algorithm is ignored. After the user-initiated operation completes, the algorithm may recommend a scale-in or scale-out operation based on the changes in resource utilization and the currently selected policy.

If you start an add or remove host operation while an Elastic DRS recommendation is being applied, the add or remove host operation fails with an error indicating a concurrent update exception.
Planned Maintenance Operation
A planned maintenance operation means a particular host needs to be replaced by a new host. While a planned maintenance operation is in progress, current recommendations by the Elastic DRS algorithm are ignored. After the planned maintenance completes, the algorithm runs again and fresh recommendations are applied. If a planned maintenance event is initiated on a cluster while an Elastic DRS recommendation is being applied to that cluster, the planned maintenance task is queued. After the Elastic DRS recommendation task completes, the planned maintenance task starts.
Auto-remediation
During auto-remediation, a failed host is replaced by a new host, and its host tags are applied to the replacement host. While auto-remediation is in progress, the current recommendations by the Elastic DRS algorithm are ignored. After auto-remediation completes, the algorithm runs again and fresh recommendations are applied. If an auto-remediation event is initiated for a cluster while an Elastic DRS recommendation is being applied to that cluster, the auto-remediation task is queued. After the Elastic DRS recommendation task completes, the auto-remediation task starts.
SDDC maintenance window
If an SDDC is undergoing maintenance or is scheduled to undergo planned maintenance in the next 6 hours, Elastic DRS recommendations are ignored.
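The maintenance window rule can be sketched as a time comparison. The function and parameter names are hypothetical, and treating a start time exactly 6 hours away as inside the window is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

LOOKAHEAD = timedelta(hours=6)


def recommendations_ignored(now: datetime,
                            maintenance_start: Optional[datetime],
                            maintenance_end: Optional[datetime]) -> bool:
    """True if maintenance is in progress or starts within the next 6 hours."""
    if maintenance_start is None or maintenance_end is None:
        return False  # no maintenance scheduled
    if now >= maintenance_end:
        return False  # maintenance already finished
    # Covers both "in progress" (start <= now) and "starts within 6 hours".
    return maintenance_start <= now + LOOKAHEAD
```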