groups:
##### BEGIN FOUNDATION ALERTING RULES #####
- name: BOSHDirectorHealth
  rules:
  - alert: BOSHDirectorStatus
    expr: 'increase(bosh_sli_failures_total{scrape_instance_group="bosh-health-exporter"}[20m]) > 0'
    for: 20m
    annotations:
      summary: "A BOSH Director is down"
      description: |
        Losing the BOSH Director does not significantly impact the experience of Tanzu Application Service end users. However, this issue means a loss of resiliency for BOSH-managed VMs.

        Troubleshooting Steps:
        SSH into the `bosh-health-exporter` VM in the "Healthwatch Exporter" deployment, and view the logs to find out why the BOSH Director is failing.
- name: CertificateExpiration
  rules:
  - alert: ExpiringCertificate
    # 2592000 seconds = 30 days
    expr: "ssl_certificate_expiry_seconds < 2592000"
    for: 5m
    annotations:
      summary: "A certificate is expiring"
      description: |
        At least one certificate ({{ $labels.display_name }}) on your foundation is going to expire within 30 days.
- name: OpsManagerHealth
  rules:
  - alert: OpsManagerStatus
    expr: 'probe_success{instance=""} <= 0'
    for: 10m
    annotations:
      summary: "The Ops Manager health check failed"
      description: |
        Issues with Ops Manager health should have no direct end-user impact. However, they can affect an operator's ability to perform an upgrade or to rescale the Tanzu Application Service platform when necessary.
##### END FOUNDATION ALERTING RULES #####
##### BEGIN TKG-i ALERTING RULES #####
- name: KubernetesClusterMasterNodes
  rules:
  - alert: KubernetesClusterMasterNodeHealth
    expr: 'avg by (cluster) (min by (cluster, instance) (
      label_replace(etcd_server_has_leader{}, "instance", "$1", "instance", "(.*):.*")
      or
      label_replace(system_healthy{origin="system_metrics_agent", exported_job="master"}, "instance", "$1", "instance", "(.*):.*")
      ) ) < .35'
    for: 10m
    annotations:
      summary: "One or more Tanzu Kubernetes Grid Integrated Edition clusters have been running with unhealthy master nodes for at least 10 minutes."
      description: |
        This might affect the operator's ability to administer changes to the clusters they oversee.

        Troubleshooting Steps:
        Identify which clusters are impacted using the Kubernetes Cluster Overview dashboard, then view the Cluster Detail dashboard by clicking on the cluster's name in the `Clusters That Might Need Attention` panel to see if the API server, scheduler, controller manager, and etcd are up. Identify the corresponding BOSH deployment and investigate the logs for the failing jobs on the master VMs.
- name: KubernetesSLITests
  rules:
  - alert: KubernetesSLITests
    expr: 'increase(tkgi_sli_task_failures_total[2m]) > 0'
    for: 10m
    annotations:
      summary: "The Tanzu Kubernetes Grid Integrated Edition SLI test ({{ $labels.display_name }}) has been failing for at least 10 minutes."
      description: |
        Tanzu Kubernetes Grid Integrated Edition SLI tests run every minute by default. This setting is configurable in Ops Manager and may have been changed. These tests are intended to give Platform Operators confidence that Application Developers can successfully interact with and manage applications in their clusters.

        Note: the SLI tests report a failure if any task (e.g. `login`, `clusters`) takes more than 1 minute to complete.

        Troubleshooting Steps:
        If a failure occurs, run the failed Kubernetes CLI command in a terminal to see why it is failing.
##### END TKG-i ALERTING RULES #####
##### BEGIN HEALTHWATCH ALERTING RULES #####
- name: HealthwatchTKGISLOs
  rules:
  - alert: HealthwatchTKGIFunctionalExporter
    expr: 'service_up{service="pks-sli-exporter"} < 1'
    for: 10m
    annotations:
      summary: "The Healthwatch Tanzu Kubernetes Grid Integrated Edition Functional Exporter is down"
      description: |
        The Healthwatch Tanzu Kubernetes Grid Integrated Edition Functional Exporter has been down for 10 minutes.
  - alert: HealthwatchTKGISystemMetricsExporter
    expr: 'service_up{service="pks-exporter"} < 1'
    for: 10m
    annotations:
      summary: "The Healthwatch Tanzu Kubernetes Grid Integrated Edition System Metrics Exporter is down"
      description: |
        The Healthwatch Tanzu Kubernetes Grid Integrated Edition System Metrics Exporter has been down for 10 minutes.
##### END HEALTHWATCH ALERTING RULES #####
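#
# Sketch (not part of the rules): one way to sanity-check the
# ExpiringCertificate threshold is a promtool unit test kept in a separate
# file and run with `promtool test rules <test-file>`. The block is left
# commented out so this rules file stays valid. Assumptions: the label value
# `example-cert` and the sample values are hypothetical, and the expression is
# inlined so the test does not depend on this file's name or location.
#
# evaluation_interval: 1m
# tests:
#   - interval: 1m
#     input_series:
#       # A certificate with 1 day (86400 s) left, constant across samples.
#       - series: 'ssl_certificate_expiry_seconds{display_name="example-cert"}'
#         values: '86400+0x10'
#       # A certificate with 60 days (5184000 s) left; it should not match.
#       - series: 'ssl_certificate_expiry_seconds{display_name="healthy-cert"}'
#         values: '5184000+0x10'
#     promql_expr_test:
#       - expr: ssl_certificate_expiry_seconds < 2592000
#         eval_time: 5m
#         exp_samples:
#           - labels: 'ssl_certificate_expiry_seconds{display_name="example-cert"}'
#             value: 86400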