The Elastic DRS algorithm monitors resource utilization in a cluster over time. After allowing for spikes and randomness in the utilization, it makes a recommendation to scale out or scale in a cluster and generates an alert. This alert is processed immediately by provisioning a new host or removing a host from the cluster.
- Minimum and maximum number of hosts the algorithm should scale up or down to.
- Thresholds for CPU, memory and storage utilization such that host allocation is optimized for cost or performance. These thresholds, which we list on the Select Elastic DRS Policy page, are predefined for each DRS policy type and cannot be altered by user.
A scale-out recommendation is generated when any of CPU, memory, or storage utilization remains consistently above thresholds. For example, if storage utilization goes above the high threshold but memory and CPU utilization remain below their respective thresholds, a scale-out recommendation is generated. A vCenter Server event is posted to indicate the start, completion, or failure of scaling out on the cluster.
A scale-in recommendation is generated when CPU, memory, and storage utilization all remain consistently below thresholds. The scale-in recommendation is not acted upon if the number of hosts in the cluster is at the minimum specified value. A vCenter Server event is posted to indicate the start, completion, or failure of the scaling in operation on the cluster.
Whenever you reduce cluster size, storage latency increases due to process overhead introduced by host removal. The duration of this overhead varies with the amount of data involved. It can take as little as an hour, though an extreme case could require more than 48 hours. While cluster-size reduction (scale-in) is underway, workload VMs supported by the affected clusters can experience significant increases in storage latency.
Scaling Multiple Availability Zone Clusters
When Elastic DRS generates a scale-in or scale-out event for a mutiple availability zone cluster, hosts are removed or added in both availability zones.
If a host fails in a multiple availability zone cluster, Auto-Scaler attempts to replace it in its original availability zone. If it is unable to do this because of a full or partial availability zone failure, Elastic DRS scales out the cluster in the remaining availability zone. It adds non-billable hosts in the remaining availability zone until the cluster reaches its original host count. This scale out is dependent on available capacity and is not guaranteed. When the failed availability zone is restored, Elastic DRS scales in the cluster to remove the extra hosts.
Time Delays Between Two Recommendations
A safety check is included in the algorithm to avoid processing frequently generated events and to provide some time to the cluster to cool off with changes due to last event processed. The following time intervals between events are enforced:
- A 30 minute delay between two successive scale-out events.
- A three hour delay to process a scale-in event after scaling out the cluster.
Interactions of Recommendations with Other Operations
The following operations might interact with Elastic DRS recommendations:
- User-initiated addition or removal of hosts.
Normally, you would not need to manually add or remove hosts from a cluster with Elastic DRS enabled. You can still perform these operations, but an Elastic DRS recommendation might revert them at some point.
If a user-initiated add or remove host operation is in progress, the current recommendation by the Elastic DRS algorithm is ignored. After the user-initiated operation completes, the algorithm may recommend a scale-in or scale-out operation based on the changes in the resource utilization and current selected policy.
If you start an add or remove host operation while an Elastic DRS recommendation is being applied, the add or remove host operation fails with an error indicating a concurrent update exception.
- Planned Maintenance Operation
A planned maintenance operation means a particular host needs to be replaced by a new host. While a planned maintenance operation is in progress, current recommendations by the Elastic DRS algorithm are ignored. After the planned maintenance completes, the algorithm runs again and fresh recommendations are applied. If a planned maintenance event is initiated on a cluster while an Elastic DRS recommendation is being applied to that cluster, the planned maintenance task is queued. After the Elastic DRS recommendation task completes, the planned maintenance task starts.
During auto-remediation, a failed host is replaced by a new host, and its host tags are applied to the replacement host. While auto-remediation is in progress, the current recommendations by the Elastic DRS algorithm are ignored. After auto-remediation completes, the algorithm runs again and fresh recommendations are applied. If an auto-remediation event is initiated for a cluster while an Elastic DRS recommendation is being applied to that cluster, the auto-remediation task is queued. After the Elastic DRS recommendation task completes, the auto-remediation task starts.
- SDDC maintenance window
If an SDDC is undergoing maintenance or is scheduled to undergo planned maintenance in the next 6 hours, EDRS recommendations are ignored.