VMware Cloud on AWS employs multiple monitoring solutions to ensure the health of the SDDC and of the SaaS console used to manage SDDCs. The tools used for monitoring and logging evolve continuously. Monitoring is most often used for alerting: for example, we use monitoring systems to proactively alert developers when the system is not operating normally and customers could be affected. Logging records state transformations within an application. When things go wrong, logs establish which change in state caused the error, so they are primarily useful for forensics and root cause analysis.
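The distinction between monitoring and logging can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration only: the `Service` class, the `check_alert` and `root_cause` helpers, the 5% error-rate threshold, and the service name are assumptions, not part of any VMware Cloud on AWS tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical service with a current state and an append-only transition log."""
    name: str
    state: str = "healthy"
    log: list = field(default_factory=list)

    def transition(self, new_state: str, reason: str) -> None:
        # Logging: record every state transformation together with its cause.
        self.log.append({"from": self.state, "to": new_state, "reason": reason})
        self.state = new_state


def check_alert(error_rate: float, threshold: float = 0.05) -> bool:
    """Monitoring: True when the metric shows the system is not operating normally.

    The 5% threshold is an assumed value for this sketch.
    """
    return error_rate > threshold


def root_cause(service: Service, bad_state: str):
    """Forensics: return the first logged transition that entered bad_state, or None."""
    for entry in service.log:
        if entry["to"] == bad_state:
            return entry
    return None


svc = Service("console-api")
svc.transition("degraded", "config push")
svc.transition("failed", "connection pool exhausted")

alert_fired = check_alert(error_rate=0.12)  # monitoring decides whether to alert
cause = root_cause(svc, "failed")           # logs explain which change caused the error
```

In this sketch the alert only says that something is wrong right now, while the transition log preserves the sequence of state changes needed to explain why.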
VMware Cloud on AWS Monitoring and Logging encompasses SDDC (VMware Cloud infrastructure) and SaaS (VMware Cloud Console and cloud-native apps) Monitoring and Logging.
SDDC Monitoring and Logging is performed by the Site Reliability Engineering (SRE) team. The SRE charter is to provide the horizontal platforms, operational processes, and operational response needed to maintain the desired VMC on AWS customer experience with minimum toil. The SRE team actively monitors and logs VMware Cloud infrastructure 24 hours a day, 7 days a week, 365 days a year, across multiple regions. It works closely with service owners on various teams, and with AWS, to remediate issues.
Key SRE Pillars:
The SRE team has adopted a multilayer SDDC monitoring model, which has resulted in a successful service health monitoring and automation system. This system enables detection of service infrastructure issues, automated remediation, and escalation.
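The flow of detection, automated remediation, and escalation can be sketched in Python as follows. This is an illustrative assumption about how such a loop might be structured, not the SRE team's actual system: the `handle_issue` function, its `max_attempts` parameter, and the `host` example are hypothetical, and the individual monitoring layers are not modeled.

```python
from typing import Callable, Optional


def handle_issue(
    is_healthy: Callable[[], bool],
    remediate: Optional[Callable[[], None]],
    max_attempts: int = 2,
) -> str:
    """Run one check cycle and return 'healthy', 'remediated', or 'escalated'."""
    # Detection: nothing to do when the health check passes.
    if is_healthy():
        return "healthy"
    # Automated remediation: retry a known fix a bounded number of times.
    if remediate is not None:
        for _ in range(max_attempts):
            remediate()
            if is_healthy():
                return "remediated"
    # Escalation: no automated fix exists, or the fix did not restore health.
    return "escalated"


# Hypothetical host whose monitoring agent has stopped.
host = {"agent_running": False}


def agent_ok() -> bool:
    return host["agent_running"]


def restart_agent() -> None:
    host["agent_running"] = True


outcome_fixed = handle_issue(agent_ok, restart_agent)
outcome_escalated = handle_issue(lambda: False, None)
```

Bounding the number of remediation attempts keeps a failing automated fix from looping indefinitely, so an issue that automation cannot resolve reaches a human responder.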
VMware Cloud on AWS also has a SaaS Monitoring and Logging model in place to provide a better customer experience, even though the customers' primary focus is SDDC monitoring. The SaaS model involves collecting logs and metrics from PoP, Cluster, Nodes, and Services.