VMware Cloud on AWS employs multiple monitoring solutions to ensure the health of the SDDC and of the SaaS console used to manage SDDCs. The tools used for monitoring and logging evolve continuously. Monitoring is most often used for alerting: for example, monitoring systems proactively alert developers when the system is not operating normally and customers could be impacted. Logging records state transformations within an application. When things go wrong, logs establish which change in state caused the error, so they are primarily useful for forensics and root cause analysis.
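As a generic illustration of logging state transformations, the following minimal Python sketch logs every change of state so that the sequence leading to a failure can be reconstructed afterwards. The `Job` class, its state names, and the logger name are hypothetical and are not part of VMware Cloud on AWS.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("example_service")


class Job:
    """Toy object whose state changes are recorded in the log."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.state = "PENDING"

    def transition(self, new_state):
        # Record both the old and the new state so the log alone
        # shows which change preceded an error.
        old_state = self.state
        self.state = new_state
        logger.info("job=%s state %s -> %s", self.job_id, old_state, new_state)
```

Reading such a log backwards from a failure shows the last state change before the error, which is the forensic use described above.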

VMware Cloud on AWS Monitoring and Logging encompasses SDDC (VMware Cloud infrastructure) and SaaS (VMware Cloud Console and cloud-native apps) Monitoring and Logging.

SDDC Monitoring and Logging

Site Reliability Engineer Team (SRE)

SDDC Monitoring and Logging is performed by the Site Reliability Engineer (SRE) team. The SRE charter is to provide the horizontal platforms, operational processes, and operational response needed to maintain the desired VMC on AWS customer experience with minimum toil. The SRE team actively monitors and logs VMware Cloud infrastructure 24/7, 365 days a year, across multiple regions. The team works closely with service owners on various teams, and with AWS, to remediate issues.

Key SRE Pillars:

  • Monitoring, Service Health (Availability and Reliability)
  • Automation
  • Reporting and Analytics
  • Service Response and Operational Support

Service Health Monitoring and Automation

The SRE team has adopted a multilayer SDDC monitoring model, which has resulted in a successful service health monitoring and automation system. This system enables detection of service infrastructure issues, automated remediation, and escalation.

SDDC Monitoring and Automation Flow

  1. Various agents on an EC2 instance (the PoP) in each SDDC collect logging, telemetry, performance data, and service events.
  2. The data is sent to various SRE monitoring services for processing and alerting. 
  3. Alerts are filtered, deduplicated, correlated, and enhanced.
  4. Based on configuration, some alerts are sent to the remediation and troubleshooting service (RTS) to initiate automated workflows, some are sent directly to Jira Service Desk for Service Watch response, and specific hardware events are sent for hardware remediation.
  5. If auto-remediation fails, an alert is sent to Service Desk to engage a service owner.
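The flow above can be sketched in a few lines of Python. This is an illustrative model only, not the SRE team's implementation: the `Alert` fields, the `AUTO_REMEDIABLE` set, the destination labels, and the `remediate` callback are all assumptions made for the example. It treats step 3 as including suppression of duplicate alerts, and the enhancement step is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; the real alert schema is not documented here.
    sddc_id: str
    source: str    # e.g. "telemetry", "logging", "hardware"
    name: str
    severity: str  # "info", "warning", or "critical"


def filter_alerts(alerts):
    """Drop informational alerts that need no response."""
    return [a for a in alerts if a.severity != "info"]


def deduplicate(alerts):
    """Keep the first occurrence of each identical alert, preserving order."""
    seen = set()
    unique = []
    for alert in alerts:
        if alert not in seen:
            seen.add(alert)
            unique.append(alert)
    return unique


def correlate(alerts):
    """Group alerts by SDDC so related alerts can be reviewed together."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.sddc_id].append(alert)
    return dict(groups)


# Assumed configuration: alert names that have an automated workflow.
AUTO_REMEDIABLE = {"service_down"}


def route(alert):
    """Choose a destination based on configuration (step 4)."""
    if alert.source == "hardware":
        return "hardware_remediation"
    if alert.name in AUTO_REMEDIABLE:
        return "rts"
    return "service_desk"


def handle(alert, remediate):
    """Route an alert; escalate to Service Desk if auto-remediation fails (step 5)."""
    destination = route(alert)
    if destination == "rts" and not remediate(alert):
        return "service_desk"
    return destination
```

Here `remediate` stands in for an automated workflow and returns `True` on success. For example, `handle(alert, lambda a: False)` models a failed workflow, so an alert that would otherwise go to RTS ends up at Service Desk.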

 

SaaS Monitoring and Logging

VMware Cloud on AWS also has a SaaS Monitoring and Logging model in place to provide a better customer experience, even though customers' primary focus is SDDC monitoring. The SaaS model involves collecting logs and metrics from the PoP, clusters, nodes, and services.
