The VMware Cloud on AWS autoscaler service monitors the health of your SDDC infrastructure, detects incipient and actual failures, and automatically remediates the infrastructure by replacing hosts before or after a failure occurs.

AWS infrastructure is reliable, but failures are inevitable even in the most reliable infrastructure. The AWS Well-Architected Framework reliability pillar discusses design principles for reliability in the cloud. VMware Cloud on AWS extends these principles by abstracting the underlying infrastructure and leveraging the predictive failure analysis capabilities of vCenter and ESXi to provide reactive remediation when failures occur and predictive remediation that can prevent failures from affecting workloads.

Most of the auto-remediation process happens in the background and is carried out without affecting existing workloads. Auto-remediation monitors the health of the system and can quickly add hardware to an SDDC when necessary: when a fault occurs or a health issue is detected, it inserts a new host into your cluster and evacuates workload VMs from the failed or failing hardware. In addition, because all VMware Cloud on AWS SDDCs use VMware vSAN and vSphere HA, workloads affected by host failures are automatically relocated and restarted.
Note: You are never billed for extra hosts used for auto-remediation or planned maintenance.

Auto-Remediation High-Level Architecture

Auto-remediation architecture includes components supplied by both AWS and VMware.
  • AWS sends VMware host-level information, notably AWS Planned Maintenance events. The autoscaler service receives these notifications and automatically remediates any issues within the SDDC.
  • A monitoring service at the SDDC level receives notifications from the underlying VMware Cloud on AWS components.
See the VMware Cloud Tech Zone article Feature Brief: Auto Remediation for more.
The autoscaler service receives messages from the SDDC monitoring service and from AWS, and performs appropriate remediation actions on the SDDC.
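The flow of events into the autoscaler can be pictured as a simple dispatcher that routes messages from the two feeds described above. The sketch below is purely illustrative; the service's real interfaces are internal to VMware, and the class, field, and event names here are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str   # hypothetical labels: "aws" or "sddc-monitor"
    kind: str     # e.g. "planned-maintenance", "host-failure"
    host_id: str


class Autoscaler:
    """Illustrative dispatcher: routes events from the AWS feed and the
    SDDC monitoring service to a broad remediation action."""

    def handle(self, event: Event) -> str:
        if event.source == "aws" and event.kind == "planned-maintenance":
            # AWS has flagged the underlying instance: replace pre-emptively.
            return f"pre-emptive replacement of {event.host_id}"
        if event.source == "sddc-monitor":
            # A host-level fault was detected inside the SDDC: reactive path.
            return f"reactive remediation of {event.host_id}"
        return "no action"
```

For example, `Autoscaler().handle(Event("aws", "planned-maintenance", "esx-01"))` would take the pre-emptive path, while a monitoring-service fault report would take the reactive path described in the next section.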

Reactive Remediation

Reactive auto-remediation monitors hardware and software faults and attempts to remediate problems in several ways. Auto-remediation is an internal process and is constantly evolving. VMware Cloud on AWS users have no access to the workflow or its configuration, but to help you understand it better, here's a high-level overview of the steps currently involved.

1: Monitor
VMware Cloud on AWS continuously monitors the health of every host in your SDDC. When a failure is detected, an event is sent to auto-remediation.
2: Wait for transient events
Some detected failures are temporary; for example, the monitoring system may be unable to reach a host because of a transient connectivity issue. Auto-remediation waits five minutes to determine whether the problem is temporary. If it is, auto-remediation returns without taking any action.
3: Add a host
If the error does not resolve after five minutes, auto-remediation begins adding a host to the SDDC. Pre-emptively adding a host in this way ensures that the host is available if required. Note that you are not billed for this host until it replaces a faulty host in your SDDC.
4: Determine failure type and take action
Hosts can fail for different reasons, and different failures require different actions. For example, a vSAN disk failure on a host that is still connected to vCenter Server can be remediated through a soft reboot, whereas a host that has suffered a PSOD requires a hard reboot.
5: Check host health
The next step is to check whether the remediation action has fixed the host. If the failed host is healthy after the soft or hard reboot, auto-remediation avoids further disruption to the SDDC: it performs any other necessary actions and removes the new host that was added pre-emptively in Step 3.
6: Replace host
If the failed host cannot be revived, the autoscaler removes the failed host and replaces it with the host that was added in Step 3. vSphere HA and vSAN are triggered, and compute policy tags are attached to the new host.
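The six steps above can be sketched as a single control flow. This is a simplified, hypothetical model (the real workflow is internal and evolving), but it shows the ordering: wait out transients, pre-add a spare, attempt a reboot, and swap hosts only if the reboot fails. The `cluster`, `is_healthy`, and `reboot` names are assumptions made for the sketch:

```python
import time

TRANSIENT_WAIT_S = 5 * 60  # the five-minute transient window (Step 2)


def remediate(host, cluster, is_healthy, reboot,
              wait_s=TRANSIENT_WAIT_S, poll_s=10):
    """Hypothetical sketch of Steps 1-6. `is_healthy(host)` probes the
    host, `reboot(host)` performs the soft or hard reboot chosen for the
    failure type, and `cluster` manages host membership."""
    # Step 2: wait to see whether the fault is transient.
    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline:
        if is_healthy(host):
            return "transient - no action"
        time.sleep(poll_s)
    # Step 3: pre-emptively add a spare host (not billed unless used).
    spare = cluster.add_host()
    # Step 4: the remediation action depends on the failure type.
    reboot(host)
    # Step 5: if the reboot fixed the host, back the spare out again.
    if is_healthy(host):
        cluster.remove_host(spare)
        return "host recovered - spare removed"
    # Step 6: otherwise replace the failed host with the spare;
    # compute policy tags are attached to the replacement.
    cluster.copy_tags(source=host, target=spare)
    cluster.remove_host(host)
    return "host replaced"
```

Note how the spare host added in Step 3 is removed again on the recovery path, matching the billing note above: the spare only counts if it actually replaces a faulty host.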

Pre-emptive Remediation

In addition to reactive remediation, the autoscaler monitors several independent feeds in an attempt to spot failures before they manifest. If the service determines a host is likely to encounter a hardware failure, a non-disruptive preemptive planned maintenance event is triggered. It is still possible that the host will fail before the planned maintenance is completed, but by preemptively initiating host replacement, the impact is minimized. During planned maintenance:
  1. A new host is added to the cluster. Tags are copied to this new host from the host to be replaced.
    Note: Customers using Compute Policies for licensing purposes may need to account for one additional host. The tag copying process can result in a tag being briefly applied to both hosts, and if DRS triggers during this period, it can cause VMs to run on both hosts.
  2. The failing host is placed into maintenance mode with a full data evacuation. This non-disruptively moves any VMs and vSAN data to other hosts within the cluster.
  3. The evacuated host is removed from the cluster.
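Under the same hypothetical model used earlier, the planned-maintenance sequence above is: add, copy tags, evacuate, remove. The function and method names below are illustrative assumptions, not a real VMware API:

```python
def planned_maintenance(cluster, failing_host):
    """Illustrative sketch of the three planned-maintenance steps above."""
    # Step 1: add a replacement host and copy compute-policy tags to it.
    # (Both hosts briefly carry the tags - see the licensing note above.)
    new_host = cluster.add_host()
    cluster.copy_tags(source=failing_host, target=new_host)
    # Step 2: maintenance mode with a full data evacuation moves VMs and
    # vSAN data to other hosts in the cluster non-disruptively.
    cluster.enter_maintenance_mode(failing_host, vsan_mode="full-evacuation")
    # Step 3: remove the evacuated host from the cluster.
    cluster.remove_host(failing_host)
    return new_host
```

The ordering matters: the replacement is added and tagged before the evacuation begins, so cluster capacity never dips below its pre-maintenance level.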

Autoscaler Events

When the autoscaler service receives a failure event, it determines the failure type and then takes appropriate action. The SDDC activity log includes any autoscaler activities, but does not show the failure event that triggered the activity.

vCenter events
  • An event is triggered to check the host connection state.
  • An event is triggered when the ESXi host is disconnected or not responding.
DAS events
  • vSphere HA events: An event is created when there is no communication with the master node or the HA agent (FDM) is down.
  • When a host goes down, the HA system reports a host failure.
vSAN events
  • When there is a disk failure on a host.
  • When a vSAN host is disconnected.
EDRS events (non-failure)
Upgrade: Disable EDRS. Maintenance activities frequently require an extra host, which is added as part of the maintenance event. EDRS is disabled for the duration of any planned maintenance to prevent these activities from triggering scale-in or scale-out events.
AWS events
  • Planned maintenance events. Notification from AWS that an instance health issue has been detected and the instance should be evacuated.
  • Personal Health Dashboard (PHD). An event stream that provides insight into various hardware components and allows VMware to spot hardware failures preemptively.
  • System status check. Monitors the health of the AWS systems the instance relies upon. This check reports issues that only AWS can fix. In many cases, these issues are transient, and no action is required.
  • Instance status check. Monitors the software and network configuration of each instance, checking availability by issuing periodic ARP requests to the NIC. In addition to reporting on instance availability at the EC2 layer, instance status checks monitor the underlying hardware utilization and report networking issues, memory exhaustion, corrupt file systems, kernel errors, and so on. Unlike system status checks, instance status checks require VMware interaction to resolve.
SDDC events
  • vCenter host health
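The event families above could be modeled as a lookup from (source, event type) to the broad way the autoscaler treats them. This is a hypothetical sketch of the taxonomy described in this section, not VMware's actual event model; all key names are assumptions:

```python
# Hypothetical mapping of the event families listed above to the broad
# handling category the autoscaler applies.
EVENT_HANDLING = {
    ("vcenter", "host-disconnected"): "reactive",
    ("das", "master-unreachable"): "reactive",
    ("das", "host-failure"): "reactive",
    ("vsan", "disk-failure"): "reactive",
    ("vsan", "host-disconnected"): "reactive",
    ("edrs", "upgrade"): "suspend-scaling",            # non-failure event
    ("aws", "planned-maintenance"): "pre-emptive",
    ("aws", "phd"): "pre-emptive",
    ("aws", "system-status-check"): "often-transient",  # usually AWS-side
    ("aws", "instance-status-check"): "reactive",       # needs VMware action
}


def handling_for(source: str, kind: str) -> str:
    """Return the broad handling category for an event, or 'unknown'."""
    return EVENT_HANDLING.get((source, kind), "unknown")
```

For instance, `handling_for("aws", "planned-maintenance")` maps to the pre-emptive path, while the vCenter, DAS, and vSAN failure events all feed the reactive workflow described earlier.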