You can manage the Elastic DRS policy for each SDDC cluster to optimize cluster scaling to meet your workloads' needs.
For any policy, scale-out is triggered when a cluster reaches the high threshold for any resource. Regardless of the policy you choose, the storage scale-out threshold cannot be set to greater than 80%. Scale-in is triggered only after all of the low thresholds have been reached. See How the Elastic DRS Algorithm Works for more information about Elastic DRS (EDRS) scale-out and scale-in logic. The VMware Cloud on AWS console has an Elasticity tab that displays a grid with a row showing the EDRS policy currently applied to each cluster. To view or edit policy details, expand a row.
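The asymmetry between the two triggers (any high threshold for scale-out, all low thresholds for scale-in) can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the actual EDRS algorithm: the names `Thresholds`, `BEST_PERFORMANCE`, and `decide` are hypothetical, the inclusive comparisons are an assumption, and the sketch ignores cluster size limits and the predictive checks some policies apply before removing a host. The threshold values are those listed for the Optimize for Best Performance policy.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Thresholds:
    """High and low thresholds in percent; None stands for N/A."""
    high: Optional[float]
    low: Optional[float]


# Values from the Optimize for Best Performance policy table.
BEST_PERFORMANCE = {
    "cpu": Thresholds(high=90, low=50),
    "memory": Thresholds(high=80, low=50),
    "storage": Thresholds(high=80, low=20),
}


def decide(utilization: Dict[str, float], policy: Dict[str, Thresholds]) -> str:
    """Return "scale-out", "scale-in", or "none" for one utilization sample."""
    # Scale-out: a single resource at its high threshold is enough.
    for resource, t in policy.items():
        if t.high is not None and utilization[resource] >= t.high:
            return "scale-out"
    # Scale-in: every resource needs a low threshold and must have reached it.
    if all(
        t.low is not None and utilization[resource] <= t.low
        for resource, t in policy.items()
    ):
        return "scale-in"
    return "none"


print(decide({"cpu": 95, "memory": 40, "storage": 30}, BEST_PERFORMANCE))
print(decide({"cpu": 40, "memory": 40, "storage": 15}, BEST_PERFORMANCE))
```

In this sketch, a policy whose low thresholds are all N/A never returns "scale-in", because the scale-in branch requires a low threshold for every resource.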
- vSAN Component and Storage Utilization
-
Elastic DRS monitors vSAN component and storage utilization and sends warning notifications when utilization becomes high.
For the baseline policy, we notify you when vSAN component utilization exceeds 75% or storage utilization exceeds 70%. If vSAN component utilization exceeds 85% or storage utilization exceeds 80%, we also send you a "host addition" notification and auto-remediate the cluster by adding a host to maintain optimal SDDC health. If you do not have an available subscription for the instance type and region, you must purchase one within 48 hours. If a host is added beyond available subscriptions and you do not purchase sufficient subscription coverage within 48 hours, you will lose access to your SDDC and workloads until you purchase sufficient subscriptions. After 48 hours of excess usage, the SDDC will be isolated, and soon after that, any hosts or SDDCs not covered by a subscription will be deleted to bring your usage in line with your existing subscription purchases.
For all other policies, we send a warning notification when vSAN component or resource utilization is within 10% of the resource threshold.
Read VMware Knowledge Base article 74695 to learn more about vSAN component utilization.
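As a rough illustration of the two baseline notification tiers described above, the following Python sketch maps a pair of utilization readings to the actions the text mentions. The function name `baseline_actions` and the action labels are hypothetical, and the strict comparisons follow the word "exceeds" in the description; the service's real evaluation may differ.

```python
from typing import List


def baseline_actions(component_pct: float, storage_pct: float) -> List[str]:
    """Return the baseline-policy actions for the given utilization readings."""
    actions = []
    # Warning tier: component utilization above 75% or storage above 70%.
    if component_pct > 75 or storage_pct > 70:
        actions.append("warning")
    # Remediation tier: component above 85% or storage above 80% adds a host.
    if component_pct > 85 or storage_pct > 80:
        actions.append("add-host")
    return actions


print(baseline_actions(76, 50))
print(baseline_actions(50, 82))
```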
- Elastic DRS Baseline
-
This is the default policy for a new cluster. It cannot be disabled and always applies to a cluster. If another policy is selected, it might add hosts faster than the baseline policy would, or enable automatic cluster scale-in operations (host removal), but the baseline policy always determines the upper threshold. All policies add hosts after storage utilization reaches 80% or vSAN component utilization reaches 85%. All policies will breach the maximum host count if needed to maintain vSAN slack space. If an AWS Availability Zone failure occurs in a multi-AZ SDDC, hosts added to remediate the outage are considered replacements and are not billable. This policy has the following thresholds:
| Resource | High Threshold | Low Threshold |
| --- | --- | --- |
| CPU | N/A | N/A |
| Memory | N/A | N/A |
| Storage | 80% utilization | N/A |
- Optimize for Best Performance
-
This policy adds hosts as needed to maintain performance and removes them only when resource consumption is significantly reduced. It does not remove hosts if it determines that the removal would degrade performance and force a near-term scale-out. It has the following thresholds:
| Resource | High Threshold | Low Threshold |
| --- | --- | --- |
| CPU | 90% utilization | 50% utilization |
| Memory | 80% utilization | 50% utilization |
| Storage | 80% utilization | 20% utilization |
- Optimize for Lowest Cost
-
When scaling in, this policy removes hosts quickly to maintain baseline performance while keeping host counts to a practical minimum. It removes a host only if it anticipates that storage utilization after the removal will not trigger a scale-out in the near term. It has the following thresholds:
| Resource | High Threshold | Low Threshold |
| --- | --- | --- |
| CPU | 90% utilization | 60% utilization |
| Memory | 80% utilization | 60% utilization |
| Storage | 80% utilization | 40% utilization |
- Rapid Scaling
-
This policy adds multiple hosts at a time when needed for memory or CPU as long as sufficient subscriptions are available for the hosts. By default, hosts are added four at a time. You can specify a larger scale-out increment (8 or 12) if you need faster scaling for disaster recovery, Virtual Desktop Infrastructure (VDI), and similar use cases. As with any EDRS policy, scale-out time increases with increment size. When the increment is large (12 hosts), it can take up to 40 minutes to complete in some configurations.
Rapid scaling does not apply to storage. EDRS adds hosts incrementally when needed for storage.
When scaling in, this policy removes hosts rapidly, maintaining baseline performance while keeping host count to a practical minimum. It does not remove hosts if it anticipates that doing so would degrade performance and force a near-term scale-out. Scale-in stops when the cluster reaches the minimum host count or the number of hosts in the scale-out increment has been removed. This policy has the following thresholds:
| Resource | High Threshold | Low Threshold |
| --- | --- | --- |
| CPU | 80% utilization | 50% utilization |
| Memory | 80% utilization | 50% utilization |
| Storage | 80% utilization | 40% utilization |
- Custom Managed EDRS Policy
-
This policy allows you to configure policy parameters independently to ensure performance standards while optimizing the cost. You can set the high and low thresholds for all resources.
Scale-out is based on resource type (CPU, memory, and storage). Scale-out is triggered when any high threshold is reached, but scale-in requires every resource type to reach its low threshold. Scale-out can be disabled for CPU and memory but not for storage. Scale-in can be disabled on the cluster.
By default, this policy adds multiple hosts in parallel when needed for memory or CPU, and adds hosts one at a time when needed for storage. As with any EDRS policy, scale-out time increases with increment size. When the increment is large (12 hosts), it can take up to 40 minutes to complete in some configurations. This policy has the following threshold ranges:
| Resource | High Threshold Range | Low Threshold Range |
| --- | --- | --- |
| CPU | 60%-95% utilization when enabled | 5%-60% utilization |
| Memory | 60%-95% utilization when enabled | 5%-60% utilization |
| Storage | 70%-80% utilization | 5%-40% utilization |
Note: As a best practice, the gap between high and low thresholds should not be less than 15 percentage points. Larger gaps are less likely to force add/remove events for hosts. When setting scale-in thresholds, keep in mind that the data evacuation required when removing a host can temporarily impose additional load on the remaining hosts. Also take care when configuring a Custom Managed EDRS Policy for a cluster whose size approaches the 6-host SLA threshold for moving between FTT=1 and FTT=2. Any change in FTT forces background rebuilds of data, which impose additional load on the cluster.
- Minimum cluster size
- The smallest host count EDRS will scale in to regardless of resource utilization. When minimum cluster size is reached, EDRS can no longer perform a scale-in operation. You can still remove hosts manually as long as storage utilization remains below the minimum threshold and cluster size doesn't fall below minimum requirements (generally two hosts for a conventional cluster and six for a stretched cluster).
- Maximum cluster size
- The largest host count EDRS will scale out to regardless of resource utilization. Once maximum cluster size is reached, EDRS can no longer perform a scale-out operation for CPU or memory consumption, but can continue to add hosts for storage. You can always add hosts manually as long as cluster size doesn't exceed the maximum allowed for your organization.
- Scale Increment
- (Custom and Rapid Scaling policies only) The number of hosts added during a scale-out event or removed during a scale-in event for CPU and memory. The scale increment for storage is always a single host (1). In a conventional cluster (single AZ) the Custom Managed EDRS Policy supports increments of 1-6. In a stretched cluster, it supports even-numbered increments in the range 2-12.
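The Custom Managed EDRS Policy constraints described above (threshold ranges, the 15-point best-practice gap, and the supported scale increments) can be checked before a policy is submitted. The following Python sketch is one way to do that; it is not part of any VMware API. The names `RANGES` and `validate_custom_policy` are hypothetical, and representing a disabled CPU or memory scale-out as `None` is an assumption made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Allowed (min, max) percentages from the Custom Managed EDRS Policy table.
RANGES = {
    "cpu": {"high": (60, 95), "low": (5, 60)},
    "memory": {"high": (60, 95), "low": (5, 60)},
    "storage": {"high": (70, 80), "low": (5, 40)},
}


def validate_custom_policy(
    thresholds: Dict[str, Tuple[Optional[float], float]],
    increment: int,
    stretched: bool = False,
) -> Tuple[List[str], List[str]]:
    """Check (high, low) pairs per resource; return (errors, warnings)."""
    errors, warnings = [], []
    for resource, limits in RANGES.items():
        high, low = thresholds[resource]
        if high is None:
            # Scale-out can be disabled for CPU and memory, not for storage.
            if resource == "storage":
                errors.append("storage scale-out cannot be disabled")
        else:
            lo, hi = limits["high"]
            if not lo <= high <= hi:
                errors.append(f"{resource} high threshold outside {lo}%-{hi}%")
        lo, hi = limits["low"]
        if not lo <= low <= hi:
            errors.append(f"{resource} low threshold outside {lo}%-{hi}%")
        # Best practice only, so a narrow gap is a warning rather than an error.
        if high is not None and high - low < 15:
            warnings.append(f"{resource} gap is under 15 percentage points")
    # Single-AZ clusters: 1-6. Stretched clusters: even increments from 2 to 12.
    allowed = range(2, 13, 2) if stretched else range(1, 7)
    if increment not in allowed:
        errors.append(f"scale increment {increment} is not supported")
    return errors, warnings


policy = {"cpu": (90, 50), "memory": (80, 50), "storage": (80, 20)}
print(validate_custom_policy(policy, increment=4))
print(validate_custom_policy(policy, increment=3, stretched=True))
```

Treating the 15-point gap as a warning keeps the sketch aligned with the text, which presents it as a best practice rather than a hard limit.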
Procedure
What to do next
All EDRS policy changes are logged in the SDDC Activity Log.