VMware constantly monitors customer SDDC environments through automation and a team of Site Reliability Engineers (SRE). The following describes processes that VMware automates to ensure the health of SDDCs.
VM Operations
- Orphaned VM(s) Auto-Remediation
- If you use "No data redundancy/VMs w/ FTT=0" as a storage policy, you might experience data loss if there is a failure or if the VM becomes unresponsive. If a failure happens and one or more VMs become orphaned, VMware performs a cleanup action. You will receive an email notification when this happens.
vCenter Operations
- vCenter Sessions (Connections) Maxed Out
- If many sessions are created and not cleared, vCenter might become inaccessible. Typically this is caused by automation that creates a large number of sessions without closing them. This condition generates an automated alert, and VMware restarts vCenter. You will receive an email notification when this happens.
- vCenter Reboot
- A number of different issues might require a reboot of vCenter. Some issues might require an immediate reboot for remediation, while others might allow for continued usage with a reboot required in the near future. In the latter case, you will receive an email notification alerting you that a restart will occur in the next 24 hours. After a reboot, ongoing tasks and application connections might need to restart.
- Expired vCenter CA Certificate Removal
- Some product integrations install CA certificates on vCenter. An expired CA certificate can cause host additions to fail, so VMware removes expired CA certificates.
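The session exhaustion described under "vCenter Sessions (Connections) Maxed Out" can usually be avoided on the client side by making sure automation always closes the session it opens. The following is a minimal sketch of that pattern, not VMware's tooling. `FakeSessionServer`, `session`, and `run_jobs` are hypothetical names that stand in for whatever SDK or API client your automation uses, and the session limit of 2 is an arbitrary value chosen for illustration.

```python
from contextlib import contextmanager


class FakeSessionServer:
    """Hypothetical stand-in for a server with a finite session pool."""

    def __init__(self, max_sessions):
        self.max_sessions = max_sessions
        self.active = set()
        self._next_id = 0

    def login(self):
        # Refuse new sessions once the pool is full, as an exhausted server would.
        if len(self.active) >= self.max_sessions:
            raise RuntimeError("session limit reached")
        self._next_id += 1
        self.active.add(self._next_id)
        return self._next_id

    def logout(self, session_id):
        self.active.discard(session_id)


@contextmanager
def session(server):
    """Open a session and guarantee logout, even if the job raises."""
    session_id = server.login()
    try:
        yield session_id
    finally:
        server.logout(session_id)


def run_jobs(server, jobs):
    """Run each job in its own short-lived session."""
    results = []
    for job in jobs:
        with session(server) as session_id:
            results.append(job(session_id))
    return results


if __name__ == "__main__":
    server = FakeSessionServer(max_sessions=2)
    run_jobs(server, [lambda sid: sid for _ in range(5)])
    print("sessions still open:", len(server.active))
```

Because each job logs out in a `finally` clause, five jobs complete against a pool of two sessions and nothing is left open afterwards. Without the logout, the third login in this sketch would fail.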
NSX Operations
- Management Plane (NSX Manager) Restart
- A number of different issues might require a restart of NSX Manager. Some issues might require an immediate reboot for remediation, while others might allow for continued usage with a reboot required in the near future. For the short time while NSX Manager is in the process of restarting, you will not be able to access the SDDC Networking and Security UI. You will not receive an email notification for NSX Manager restart events.
- NSX Edge Failover
- If our monitoring system detects that the active NSX Edge is close to becoming unhealthy, we schedule an NSX Edge failover at off-peak hours. This scheduled failover is a proactive measure to avoid the disruption of a failover happening at peak hours. If there is a problem with the active NSX Edge before the scheduled failover, it fails over automatically. You will receive an email notification if we schedule an NSX Edge failover.
SDDC Operations
- SDDC Backups
- We back up every SDDC daily at 0900Z as well as prior to any planned maintenance activity.
- What we back up: vCenter, vSAN configuration, and NSX. We do not back up customer data and workload VMs.
- Backup retention: Maximum age of 28 days and maximum of 56 backups. Backups are stored encrypted in S3 within the SDDC's region and are deleted when the SDDC is deleted. You cannot recover a deleted SDDC from backup.
- Recovery of management components is governed by your SLA. VMware will decide whether to recover from backup or repair.
- Removal of Hybrid Linked Mode Links to Clear Stale Links
- If you remove an on-premises vCenter that is linked to your cloud SDDC using Hybrid Linked Mode without first unlinking it, a stale link remains in the Hybrid Linked Mode configuration. Stale links prevent SRE from using the vSphere Client to log in to the vCenter instances in the cloud SDDC. If this problem occurs, VMware removes all Hybrid Linked Mode links from the SDDC. You will receive an email notification, and you will need to relink any on-premises vCenter instances that are still in use.
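The backup retention rule under "SDDC Backups" (maximum age of 28 days, maximum of 56 backups) can be read as two filters applied together. The sketch below illustrates that reading only. The function name `backups_to_keep`, the choice to keep the newest backups when the count limit is exceeded, and the treatment of a backup that is exactly 28 days old as still retained are all assumptions, not documented behavior.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=28)   # from the retention rule above
MAX_COUNT = 56                 # from the retention rule above


def backups_to_keep(timestamps, now):
    """Return retained backup timestamps, newest first.

    Assumption: backups older than MAX_AGE are dropped first, then the
    newest MAX_COUNT of the remainder are kept.
    """
    fresh = [t for t in timestamps if now - t <= MAX_AGE]
    fresh.sort(reverse=True)
    return fresh[:MAX_COUNT]


if __name__ == "__main__":
    now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
    # Forty daily backups taken at 0900Z, counting back from today.
    daily = [
        datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc) - timedelta(days=i)
        for i in range(40)
    ]
    kept = backups_to_keep(daily, now)
    print(len(kept), "backups retained; oldest:", kept[-1].isoformat())
```

With only the daily backup, the age limit is what removes backups. The count limit matters only when additional backups, such as those taken before planned maintenance, push the total within 28 days above 56.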
NFS Datastores
- Datastore availability
- If a vSphere host loses access to an NFS datastore (all paths down) for more than 320 seconds, vSphere HA powers off all VMs on that host that have data stored on the impacted datastore. HA then attempts to restart each VM on a host that has a healthy connection to the datastore.
- SDDC Health
- If a host is blocked from entering maintenance mode because a running VM cannot be relocated due to partial NFS datastore availability, VMware operations will power off the offending VM. VMware will attempt to recover any impacted workloads, but affected VMs remain powered off until storage access is restored and you power them back on.
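The datastore availability rule above can be expressed as a small decision function: once a host has been in an all-paths-down state for more than 320 seconds, every VM on that host with data on the affected datastore is a candidate for power-off and restart elsewhere. This is an illustrative model only and does not call any vSphere API. The function name `vms_to_power_off` and the dictionary layout are assumptions made for the example.

```python
APD_THRESHOLD_SECONDS = 320  # from the datastore availability rule above


def vms_to_power_off(apd_seconds, vm_datastores, datastore):
    """Return the VMs on one host that HA would power off.

    apd_seconds:   how long the host has had all paths down to `datastore`
    vm_datastores: hypothetical mapping of VM name -> set of datastore names
                   holding that VM's data
    """
    if apd_seconds <= APD_THRESHOLD_SECONDS:
        return []  # threshold not exceeded; no action yet
    return sorted(
        vm for vm, stores in vm_datastores.items() if datastore in stores
    )


if __name__ == "__main__":
    host_vms = {
        "app-01": {"nfs-ds-1"},
        "db-01": {"nfs-ds-1", "vsanDatastore"},
        "web-01": {"vsanDatastore"},
    }
    print(vms_to_power_off(300, host_vms, "nfs-ds-1"))
    print(vms_to_power_off(321, host_vms, "nfs-ds-1"))
```

In this model, VMs that store data only on other datastores, such as `web-01`, are never selected, and nothing is selected until the threshold is exceeded.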