Elastic DRS uses an algorithm to maintain an optimal number of provisioned hosts to keep cluster utilization high while maintaining desired CPU, memory, and storage performance.
The algorithm takes the following parameters into account:
- Minimum and maximum number of hosts the algorithm should scale up or down to.
- Thresholds for CPU, memory, and storage utilization such that host allocation is optimized for cost or performance. These thresholds are predefined for each policy type and cannot be altered by the user.
The algorithm runs every 5 minutes and monitors resource utilization over a period of time. Taking spikes and randomness in the utilization into account, the algorithm decides whether to scale out or scale in a cluster, and generates an alert when it does. This alert is processed immediately by provisioning a new host or removing a host from the cluster.
A scale-out recommendation is generated when any of CPU, memory, or storage utilization remains consistently above thresholds. For example, if storage utilization goes above 75% but memory and CPU utilization remain below their respective thresholds, a scale-out recommendation is generated. A vCenter Server event is posted to indicate the start, completion, or failure of scaling out on the cluster.
A scale-in recommendation is generated when CPU, memory, and storage utilization all remain consistently below thresholds. The scale-in recommendation is not acted upon if the number of hosts in the cluster is at the minimum specified value. A vCenter Server event is posted to indicate the start, completion, or failure of the scaling in operation on the cluster.
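The two recommendation rules above can be sketched as a small decision function. This is only an illustration, not the product's implementation: the threshold values, the `Thresholds` and `recommend` names, and the assumption that the utilization figures have already been smoothed over the monitoring period are all hypothetical. Suppressing scale-out at the maximum host count is likewise an assumption drawn from the minimum and maximum host parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    high: float  # above this, the resource counts toward scale-out
    low: float   # below this, the resource counts toward scale-in


# Hypothetical values for illustration only. The real thresholds are
# predefined per policy type and are not user-configurable.
POLICY = {
    "cpu": Thresholds(high=90.0, low=50.0),
    "memory": Thresholds(high=80.0, low=50.0),
    "storage": Thresholds(high=75.0, low=40.0),
}


def recommend(utilization, host_count, min_hosts, max_hosts, policy=POLICY):
    """Return 'scale-out', 'scale-in', or None.

    `utilization` maps resource name to a percentage that is assumed to
    already reflect sustained usage rather than a momentary spike.
    """
    # Scale out when ANY resource is above its high threshold.
    if any(utilization[r] > policy[r].high for r in policy):
        return "scale-out" if host_count < max_hosts else None
    # Scale in only when ALL resources are below their low thresholds,
    # and never below the configured minimum number of hosts.
    if all(utilization[r] < policy[r].low for r in policy):
        return "scale-in" if host_count > min_hosts else None
    return None
```

With these illustrative thresholds, high storage utilization alone is enough to produce a scale-out result, while a cluster already at its minimum size never produces a scale-in result.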
Whenever you reduce cluster size, storage latency increases due to process overhead introduced by host removal. The duration of this overhead varies with the amount of data involved. It can take as little as an hour, though an extreme case could require more than 24 hours. While cluster-size reduction (scale-in) is underway, workload VMs supported by the affected clusters can experience significant increases in storage latency.
Time Delays Between Two Recommendations
The algorithm includes a safety check that prevents it from processing events generated in rapid succession and gives the cluster time to settle after the changes made by the last processed event. The following time intervals between events are enforced:
- 30-minute delay between two successive scale-out events.
- 3-hour delay to process a scale-in event after scaling out the cluster.
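The two delays can be expressed as a simple check against the time of the last scale-out event. The function and constant names below are hypothetical, and the sketch assumes that only the most recent scale-out needs to be tracked.

```python
from datetime import datetime, timedelta

SCALE_OUT_GAP = timedelta(minutes=30)            # between successive scale-outs
SCALE_IN_AFTER_SCALE_OUT_GAP = timedelta(hours=3)  # scale-in after a scale-out


def may_process(action, now, last_scale_out):
    """Return True if `action` is allowed at time `now`.

    `last_scale_out` is the time of the most recent scale-out event,
    or None if the cluster has not been scaled out yet.
    """
    if last_scale_out is None:
        return True
    elapsed = now - last_scale_out
    if action == "scale-out":
        return elapsed >= SCALE_OUT_GAP
    if action == "scale-in":
        return elapsed >= SCALE_IN_AFTER_SCALE_OUT_GAP
    raise ValueError(f"unknown action: {action!r}")
```

For example, a scale-in event raised two hours after a scale-out would be held back, while the same event raised three hours after it would be processed.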
Interactions of Recommendations with Other Operations
The following operations might interact with Elastic DRS recommendations:
- User-initiated addition or removal of hosts.
Normally, you would not need to manually add or remove hosts from a cluster with Elastic DRS enabled. You can still perform these operations, but an Elastic DRS recommendation might revert them at some point.
If a user-initiated add or remove host operation is in progress, the current recommendation by the Elastic DRS algorithm is ignored. After the user-initiated operation completes, the algorithm may recommend a scale-in or scale-out operation based on the changes in resource utilization and the currently selected policy.
If you start an add or remove host operation while an Elastic DRS recommendation is being applied, the operation fails with an error indicating a concurrent update exception.
- Planned maintenance operation
A planned maintenance operation means a particular host needs to be replaced by a new host. While a planned maintenance operation is in progress, current recommendations by the Elastic DRS algorithm are ignored. After the planned maintenance completes, fresh recommendations will be applied. If a planned maintenance event is received while an Elastic DRS recommendation is being applied for that cluster, the planned maintenance task will be queued. After the Elastic DRS recommendation task completes, the planned maintenance task starts.
- Auto-remediation
As a result of auto-remediation, a failed host is replaced by a new host. While auto-remediation is in progress, current recommendations by the Elastic DRS algorithm are ignored. After the auto-remediation operation completes, fresh recommendations will be applied. If an auto-remediation event is received while an Elastic DRS recommendation is being applied to that cluster, the auto-remediation task is queued. After the Elastic DRS recommendation task completes, the auto-remediation task starts.
- SDDC maintenance window
If an SDDC is undergoing maintenance or is scheduled to undergo maintenance in the next 6 hours, Elastic DRS recommendations are ignored.
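The interaction rules in this list can be summarized in a small state model. Everything here is a hypothetical sketch for illustration: the `Cluster` class, its attributes, and `ConcurrentUpdateError` are invented names, not part of any VMware API, and the model tracks only one operation at a time.

```python
from collections import deque
from datetime import datetime, timedelta

MAINTENANCE_LOOKAHEAD = timedelta(hours=6)


class ConcurrentUpdateError(Exception):
    """Hypothetical stand-in for the concurrent update exception."""


class Cluster:
    def __init__(self):
        self.applying_recommendation = False  # a recommendation task is running
        self.busy_with = None  # "user", "planned-maintenance", "auto-remediation"
        self.queued = deque()  # maintenance tasks waiting for the recommendation

    def recommendation_is_ignored(self, now, maintenance_start=None):
        """`maintenance_start` is the start time of the current or next
        SDDC maintenance window, or None if none is scheduled."""
        if self.busy_with is not None:
            return True
        if maintenance_start is not None:
            # Covers both "in progress" and "starts within 6 hours".
            return maintenance_start - now <= MAINTENANCE_LOOKAHEAD
        return False

    def start_operation(self, kind):
        if self.applying_recommendation:
            if kind == "user":
                # User-initiated host changes fail outright.
                raise ConcurrentUpdateError("recommendation is being applied")
            # Planned maintenance and auto-remediation wait their turn.
            self.queued.append(kind)
            return "queued"
        self.busy_with = kind
        return "started"

    def finish_recommendation(self):
        """Mark the recommendation done and start the next queued task."""
        self.applying_recommendation = False
        if self.queued:
            self.busy_with = self.queued.popleft()
        return self.busy_with
```

The asymmetry is the point of the sketch: while a recommendation is being applied, a user-initiated operation is rejected, whereas maintenance and remediation tasks are deferred until the recommendation task completes.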