When deciding which metrics to observe, adopt a user-centric approach that works backwards from application owners. The goal is to collect the minimum number of data points needed to implement observability as efficiently as possible. Choosing more metrics than necessary can lead to alert fatigue and draw attention away from the statistics that matter. Conversely, selecting too few metrics is counterproductive, because it leads to a lack of visibility and an inability to examine significant behaviours.
This section outlines key considerations when building an observability plan for your VMware Cloud infrastructure. Start thinking about your observability plan while you are in the pilot or pre-production stage of your cloud journey. Consider the following high-level guidelines:
Shift towards a culture centred on service level objectives (SLOs), so that you observe your services based on critical end-user experience rather than system metrics. Ensure VMware Cloud monitoring and event metrics and thresholds are aligned to the Service Level Requirements and SLOs that are documented in Service Level Agreements with the service consumers (i.e., the lines of business, or LOBs).
Define and select the appropriate infrastructure and application metrics to create service level indicators (SLIs) that help you achieve better system observability.
Formally establish all key thresholds and metrics and review them regularly. Document the review process, align the reviews fully with service level requirements, and make sure they support business commitments. Maintain a documented, well-understood view of the bidirectional impact of VMware Cloud, and plan for new KPIs and metrics that drive further efficiencies and a better user experience.
Review your existing processes and tools for monitoring and event management and how they could adapt to VMware Cloud, e.g., predictive analytics, guided troubleshooting, and root cause analysis, as well as policy-based, automated remediation capabilities. This helps protect you proactively against degradation of performance and capacity.
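To make the relationship between SLIs and SLOs in the guidelines above concrete, the following is a minimal sketch of how an availability SLI could be compared against an SLO target. The names `SloReport` and `evaluate_slo`, and the request counts and target in the usage note, are illustrative assumptions and do not come from any VMware product or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # fraction of good events actually observed
    error_budget: float     # allowed fraction of bad events (1 - target)
    budget_consumed: float  # share of the error budget used (1.0 = exhausted)
    slo_met: bool           # True when the SLI is at or above the target


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compare a ratio-based SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    sli = good_events / total_events
    error_budget = 1.0 - target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed, sli >= target)
```

For example, `evaluate_slo(9990, 10000, 0.995)` describes a service where 9,990 of 10,000 requests succeeded against a 99.5% target: the SLO is met and roughly one fifth of the error budget has been consumed. Tracking budget consumption rather than raw system metrics is one way to keep alerting tied to end-user experience.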
Workload Health
Applications running in workloads on VMware Cloud need to be instrumented consistently to emit metrics, logs, and traces so that the signals can be correlated to identify the root cause of any issue. These issues could relate to inaccessibility, operating system (OS) instability, application misconfiguration, or any number of other possibilities.
A well-designed system builds in the right amount of observability starting in its development phase. Don't wait until an application is in production before you start to observe it. This includes setting up monitoring, alerting, and logging so that you can act on the behaviour of your system.
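As an illustration of correlating signals, the sketch below tags a log line, a metric update, and a trace-like timing with one shared identifier, using only the Python standard library. The names `handle_request`, `request_count`, and the `trace_id` field are hypothetical, and a real deployment would more likely rely on an instrumentation library than on hand-rolled code like this.

```python
import json
import logging
import time
import uuid
from collections import Counter

# Hypothetical in-process metric store, keyed by (operation, status).
request_count = Counter()


def handle_request(operation, work, logger):
    """Run `work` and emit a correlated metric, log line, and timing."""
    trace_id = uuid.uuid4().hex  # shared ID that ties the signals together
    start = time.perf_counter()
    status = "ok"
    try:
        work()
    except Exception:
        # Sketch only: a real service would usually re-raise after recording.
        status = "error"
    duration_ms = (time.perf_counter() - start) * 1000

    request_count[(operation, status)] += 1  # metric
    record = {
        "trace_id": trace_id,
        "operation": operation,
        "status": status,
        "duration_ms": round(duration_ms, 3),  # trace-like timing
    }
    logger.info(json.dumps(record))  # structured log line
    return record
```

Because every signal carries the same `trace_id`, a slow or failed request found in the metrics can be looked up directly in the logs, which is the kind of correlation that root cause analysis depends on.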
Questions to consider when choosing instrumentation for VMware Cloud observability:
What tooling will be used to monitor VMware Cloud and manage related events? Is this a new tool or will something existing be adapted?
Do you require a system that supports multiple clouds, including on-premises environments?
Is there an egress cost for sending data to the observability system?
Should the system provide support for multiple regions?
Should the system scale out on-demand for capacity?
Should the system support multi-tenancy with separation of teams?
Should the system include AI-powered intelligence to facilitate AIOps as you evolve your observability practice?
Does the system need to support third-party integrations such as PagerDuty, ServiceDesk, DataDog, Slack and VictorOps?
Does the system need to provide immutability, so that data, logs, and metrics cannot be modified, deleted, or manipulated? Access control is required.
Does the system support scraping metrics from modern apps, or does it require an agent to be installed?
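To illustrate the scraping option in the last question, here is a minimal sketch of an application that exposes its own metrics over HTTP for a collector to pull, with no agent installed. It assumes a Prometheus-style text exposition format, which the text above does not specify. The metric name, the `/metrics` path, and the `serve` helper are illustrative choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics: name -> (type, help text, current value).
METRICS = {
    "app_requests_total": ("counter", "Total requests handled.", 42.0),
}


def render_metrics(metrics):
    """Render metrics in a Prometheus-style text format (assumed format)."""
    lines = []
    for name, (kind, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Block and serve the scrape endpoint; call this from your entry point."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

With a pull model like this, the observability system decides when and how often to collect, whereas an agent-based model pushes data from each workload. Which one fits better depends on your answers to the egress cost and multi-region questions above.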