VMware constantly monitors customer SDDC environments through automation and a team of Site Reliability Engineers (SRE). The following describes processes that VMware automates to ensure the health of SDDCs.
- Orphaned VM(s) Auto-Remediation
- If you use "No data redundancy/VMs w/ FTT=0" as a storage policy, you might experience data loss if there is a failure or if the VM becomes unresponsive. If a failure happens and a VM or VMs become orphaned, VMware performs a cleanup action. You will receive an email notification when this happens.
- vCenter Sessions (Connections) Maxed Out
- If many sessions are created and not cleared, vCenter Server might become inaccessible. Typically this is caused by automation that creates a large number of sessions without closing them. This condition generates an automated alert, and VMware restarts vCenter Server. You will receive an email notification when this happens.
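One way for automation to avoid leaking sessions is to tie every login to a guaranteed logout. The following is a minimal sketch, not a VMware-provided implementation. It assumes the vSphere Automation REST API session endpoint (`POST` and `DELETE` on `/api/session`, with the `vmware-api-session-id` header); confirm these against the API reference for your vCenter Server version. The names `vcenter_session` and `_http` are hypothetical helpers introduced for illustration.

```python
import base64
import json
import urllib.request
from contextlib import contextmanager


def _http(method, url, headers):
    """Hypothetical helper: send one request and decode a JSON body, if any."""
    req = urllib.request.Request(url, method=method, headers=headers)
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


@contextmanager
def vcenter_session(host, user, password, http=_http):
    """Create a session and always delete it, even if the caller's code fails.

    Assumption: POST /api/session returns the session ID as a JSON string and
    DELETE /api/session ends the session identified by the header.
    """
    url = f"https://{host}/api/session"
    basic = base64.b64encode(f"{user}:{password}".encode()).decode()
    session_id = http("POST", url, {"Authorization": f"Basic {basic}"})
    try:
        yield session_id
    finally:
        # Explicit logout so the session does not linger on vCenter Server.
        http("DELETE", url, {"vmware-api-session-id": session_id})
```

A script would wrap its work in `with vcenter_session(host, user, password) as sid:` and reuse `sid` for all calls inside the block, rather than logging in once per operation.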
- vCenter Server Reboot
- A number of different issues might require a reboot of vCenter Server. Some issues might require an immediate reboot for remediation, while others might allow for continued usage with a reboot required in the near future. In the latter case, you will receive an email notification alerting you that a restart will occur in the next 24 hours. After a reboot, ongoing tasks and application connections might need to restart.
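Because application connections might need to restart after a reboot, client automation can be written to reconnect on its own. The sketch below shows a generic retry loop with exponential backoff. The function name `connect_with_backoff`, the `connect` callable, and the delay values are all illustrative assumptions, not values prescribed by VMware.

```python
import time


def connect_with_backoff(connect, attempts=6, base_delay=5.0,
                         max_delay=120.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are exhausted.

    connect: hypothetical zero-argument callable that opens a vCenter
    connection and raises OSError (for example ConnectionError or
    TimeoutError) while vCenter Server is still restarting.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```

Capping the delay keeps the client from waiting excessively long between attempts, while the doubling avoids hammering vCenter Server as it comes back up.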
- Expired vCenter CA Certificate Removal
- Some product integrations install CA certificates on vCenter. If a CA certificate has expired, it could result in host add failures. Expired CA certificates will be removed.
- Management Plane (NSX Manager) Restart
- A number of different issues might require a restart of NSX Manager. Some issues might require an immediate reboot for remediation, while others might allow for continued usage with a reboot required in the near future. For the short time while NSX Manager is in the process of restarting, you will not be able to access the SDDC Networking and Security UI. You will not receive an email notification for NSX Manager restart events.
- NSX Edge Failover
- If our monitoring system detects that the active NSX Edge is close to becoming unhealthy, we schedule an NSX Edge failover at off-peak hours. This scheduled failover is a proactive measure to avoid possible disruption from a failover happening at peak hours. If the active NSX Edge develops a problem before the scheduled failover, it fails over automatically. You will receive an email notification if we schedule an NSX Edge failover.
- Single Host SDDC Failure
- The Single Host SDDC starter configuration has no SLA and is appropriate for proof-of-concept or test and development use cases. VMware does not perform any remediation in the event of a Single Host SDDC failure. You will receive an email notification if a Single Host SDDC failure occurs.
- SDDC Backups
- We back up every SDDC daily at 09:00 UTC (0900Z) and before any planned maintenance activity.
- What we back up: vCenter Server, vSAN configuration, and NSX. We do not back up customer data and workload VMs.
- Backup retention: Backups are kept for a maximum of 28 days, with a maximum of 56 backups retained. Backups are stored encrypted in Amazon S3 within the SDDC's region and are deleted when the SDDC is deleted. You cannot recover a deleted SDDC from backup.
- Recovery of management components is governed by your SLA. VMware will decide whether to recover from backup or repair.
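To illustrate how the two retention limits interact, the following sketch keeps only backups that are at most 28 days old and, of those, only the 56 most recent. This is one plausible reading of the limits stated above, not a description of VMware's internal pruning logic; the function name `backups_to_keep` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=28)   # maximum backup age, from the retention policy
MAX_COUNT = 56                 # maximum number of backups, from the policy


def backups_to_keep(timestamps, now):
    """Return the backup timestamps that satisfy both limits, newest first.

    Assumption: both limits apply together, so a backup is dropped if it is
    older than MAX_AGE or if MAX_COUNT newer backups already exist.
    """
    fresh = [t for t in timestamps if now - t <= MAX_AGE]
    fresh.sort(reverse=True)
    return fresh[:MAX_COUNT]


# Example input: one backup per day at 09:00 UTC for 40 days.
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
daily = [start + timedelta(days=i) for i in range(40)]
kept = backups_to_keep(daily, now=daily[-1] + timedelta(hours=1))
```

With one daily backup, the age limit is the binding constraint. The count limit only matters when extra backups, such as those taken before planned maintenance, push the total past 56 within the 28-day window.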
- Datastore availability
- If a vSphere host loses access to an NFS datastore (all paths down) for more than 320 seconds, vSphere HA powers off all VMs on that host that had data stored on the impacted datastore. HA then attempts to restart each VM on a host that has a healthy connection to the datastore.
- SDDC Health
- If a host is blocked from entering maintenance mode because a running VM cannot be relocated due to partial NFS datastore availability, VMware operations will power off the offending VM. VMware will attempt to recover any impacted workload, but affected VMs remain powered off until storage access is restored and you power them back on.