This topic describes how to troubleshoot problems and known issues that may arise when deploying or operating Healthwatch™ for VMware Tanzu® (Healthwatch), Healthwatch Exporter for VMware Tanzu® Application Service™ (TAS for VMs), and Healthwatch Exporter for VMware Tanzu® Kubernetes Grid™ Integrated Edition (TKGI).



Accessing VM UIs for Troubleshooting

The sections below describe how to access the user interfaces (UIs) of the Prometheus and Alertmanager VMs for troubleshooting.


Access the Prometheus UI

The Prometheus UI allows you to view the state of the Prometheus instance that the Healthwatch tile deploys, including currently firing alerts and the health status of scrape targets. Because the Prometheus UI is not secure, the Healthwatch tile does not include it. However, you can access the Prometheus UI to troubleshoot the Prometheus instance.

To access the Prometheus UI:

  1. Run:

    bosh deployments
    

    This command returns a list of all BOSH deployments that are currently running.

  2. Record the name of your Healthwatch deployment.

  3. Run:

    bosh -d DEPLOYMENT-NAME ssh tsdb/0 --opts='-L 9090:localhost:9090'
    

    Where DEPLOYMENT-NAME is the name of your Healthwatch deployment that you recorded in the previous step.
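
    For example, with a hypothetical deployment name of healthwatch-abc123, the command looks like this:

    bosh -d healthwatch-abc123 ssh tsdb/0 --opts='-L 9090:localhost:9090'
    

    This opens an SSH tunnel that forwards port 9090 on your workstation to port 9090 on the first VM in the Prometheus instance.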

  4. Navigate to the Ops Manager Installation Dashboard.

  5. Click the Healthwatch tile.

  6. Select the Credentials tab.

  7. In the Tsdb Client Mtls row, click Link to Credential.

  8. Record the certificate and private key for Tsdb Client Mtls.

  9. Add the certificate and private key for Tsdb Client Mtls that you recorded in the previous step to the keystore for your operating system.

    • To store the Tsdb Client Mtls certificate and key on macOS:

      1. Create a cert.pem file containing the Tsdb Client Mtls certificate.
      2. Create a cert.key file containing the Tsdb Client Mtls private key.
      3. Change the access permissions on the certificate and private key files to 0600.

        For example:

        chmod 0600 ~/Downloads/cert.key
        chmod 0600 ~/Downloads/cert.pem
        
      4. To import the Tsdb Client Mtls private key into the macOS keychain:

        security import KEY-PATH -k ~/Library/Keychains/login.keychain-db
        

        Where KEY-PATH is the path to the cert.key file. For example, ~/Downloads/cert.key.

      5. To import the Tsdb Client Mtls certificate into the macOS keychain:

        1. To open the macOS Keychain Access app:
          1. Press Command-Space, then type keychain.
          2. Select the Keychain Access app in the displayed list.
        2. Select File -> Import Items.
        3. Select the Tsdb Client Mtls certificate cert.pem file.
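
      Alternatively, as a terminal-only sketch of the same import, assuming both files are saved in ~/Downloads:

        # Restrict permissions on the downloaded credential files
        chmod 0600 ~/Downloads/cert.pem ~/Downloads/cert.key
        # Import the private key, then the certificate, into the login keychain
        security import ~/Downloads/cert.key -k ~/Library/Keychains/login.keychain-db
        security import ~/Downloads/cert.pem -k ~/Library/Keychains/login.keychain-db
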
  10. In a web browser, navigate to https://localhost:9090. If your browser prompts you to specify which certificate to use for mTLS, select the certificate that you added to your operating system keystore.

    • On macOS:

      1. Open a browser to https://localhost:9090.
      2. If you are challenged by a security warning, select Advanced -> Proceed Anyway. Alternatively, type the letters thisisunsafe into the webpage.
      3. In the displayed dialog, select the Tsdb Client Mtls certificate you added to the macOS keychain and click OK.
      4. When prompted, provide the keychain access password, and select Always Allow.

    The Prometheus UI should display in your browser.


Access the Alertmanager UI

The Alertmanager UI allows you to view which alerts are currently firing. Because the Alertmanager UI is not secure, the Healthwatch tile does not include it. However, you can access the Alertmanager UI to troubleshoot or silence alerts.

To access the Alertmanager UI:

  1. Run:

    bosh deployments
    

    This command returns a list of all BOSH deployments that are currently running.

  2. Record the name of your Healthwatch deployment.

  3. Run:

    bosh -d DEPLOYMENT-NAME ssh tsdb/0 --opts='-L 8080:localhost:10401'
    

    Where DEPLOYMENT-NAME is the name of your Healthwatch deployment that you recorded in the previous step.

  4. In a web browser, navigate to localhost:8080. The Alertmanager UI appears.
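
    If the UI does not appear, you can first confirm that the tunnel and the Alertmanager API respond. A minimal check, assuming the SSH tunnel from step 3 is still open and the Alertmanager v2 API is enabled:

    curl -s http://localhost:8080/api/v2/status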



Troubleshooting Known Issues

The sections below describe how to troubleshoot known issues in Healthwatch and Healthwatch Exporter for TKGI.

“Unable to Render Templates” Error When Installing or Upgrading

When installing or upgrading to Healthwatch v2.2, you see the following error:

 - Unable to render templates for job 'opsman-cert-expiration-exporter'. Errors are: - Error filling in template 'bpm.yml.erb' (line 9: Can't find property '["opsman_access_credentials.uaa_client_secret"]')

This error occurs if you upgraded from Ops Manager v2.3 or earlier to Ops Manager v2.4 through v2.7. To resolve this issue:

  1. SSH into the Ops Manager VM by following the procedure in the Ops Manager documentation.

  2. Change the user to root.

  3. Open the Rails console by running:

    > cd /home/tempest-web/tempest/web; RAILS_ENV='production' TEMPEST_INFRASTRUCTURE='DEPLOYMENT-IAAS' TEMPEST_WEB_DIR='/home/tempest-web' SECRET_KEY_BASE='1234' DATA_ROOT='/var/tempest' LOG_DIR='/var/log/opsmanager' su tempest-web --command 'bundle exec rails console'
    

    Where DEPLOYMENT-IAAS is either google, aws, azure, vsphere, or openstack, depending on the IaaS of your Ops Manager deployment.

  4. Set the decryption passphrase by running:

    irb(main):001:0> EncryptionKey.instance.passphrase = 'DECRYPTION-PASSPHRASE'
    

    Where DECRYPTION-PASSPHRASE is the decryption passphrase you want to set.

  5. Update the UAA restricted view access client secret by running:

    irb(main):001:0> Uaa::UaaConfig.instance.update_attributes(restricted_view_api_access_client_secret: SecureRandom.hex)
    
  6. Exit the Rails console and restart the tempest-web service by running:

    irb(main):001:0> exit
    > service tempest-web restart
    

This issue is fixed in Ops Manager v2.8 and later.


Smoke Tests Errand Fails When Deploying Healthwatch

When you deploy Healthwatch, the Smoke Tests errand fails with the following error message:

 querying for grafana up should be greater than 0 

The Smoke Tests errand fails because the Prometheus instance fails to scrape metrics from the Grafana instance. Potential causes of this failure include:

  • There is a network issue between the Prometheus instance and Grafana instance.

  • The Grafana instance uses a certificate that does not match the certificate authority (CA) you configured in the Grafana pane in the Healthwatch tile. This could occur because the CA you configured in the Grafana pane is either a self-signed certificate or a different CA from the one that generated the certificate. As a result, the Prometheus instance does not trust the certificate that the Grafana instance uses. For more information about configuring a CA for the Grafana instance, see (Optional) Configure Grafana in Configuring Healthwatch.

To find out why the Prometheus instance fails to scrape metrics from the Grafana instance:

  1. Log in to one of the VMs in the Prometheus instance by following the procedure in the Ops Manager documentation.

  2. View information about the Grafana instance scrape target by running:

    curl -vk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq '.data.activeTargets[] | select(.scrapePool == "grafana")'
    

    The lastError field in the command output describes why the Prometheus instance failed to scrape the Grafana instance.
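
    To print only the health status and error for that target, you can extend the jq filter. For example:

    curl -sk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq '.data.activeTargets[] | select(.scrapePool == "grafana") | {health, lastError}'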


TKGI Metric Exporter VM Fails to Connect to the BOSH Director

When the TKGI metric exporter VM attempts to connect to the BOSH Director, you see the following error:

 ERROR [context.UaaContext [ForkJoinPool-1-worker-3]] javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
 ERROR [ingress.TokenCallCredentials [ForkJoinPool-1-worker-3]] Caught error retrieving UAA token: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
 INFO  [ingress.EventStreamObserver [ForkJoinPool-1-worker-3]] io.grpc.StatusRuntimeException: UNAUTHENTICATED

This error appears when the TKGI metric exporter VM cannot verify that the certificate chain of the UAA server for the BOSH Director is valid. To allow the TKGI metric exporter VM to connect to the BOSH Director, you must correct any certificate chain errors.

To check for certificate chain errors in the UAA server for the BOSH Director:

  1. Log in to the TKGI metric exporter VM by following the procedure in the Ops Manager documentation.

  2. View the certificate that the UAA server uses by running:

    openssl s_client -connect 10.0.0.5:8443
    

    Where 10.0.0.5:8443 is the IP address and UAA port of the BOSH Director in this example; substitute the values for your deployment.
  3. Save the certificate as a cert.pem file.
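
    For example, one way to capture the presented certificate directly into cert.pem:

    # Extract the certificate from the TLS handshake and save it in PEM format
    openssl s_client -connect 10.0.0.5:8443 </dev/null 2>/dev/null | openssl x509 -outform PEM > cert.pem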

  4. Run:

    openssl verify cert.pem
    

    If the command returns an OK message, the certificate is trusted and has a valid certificate chain. If the command returns any other message, see the OpenSSL documentation to troubleshoot.
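
    If the UAA certificate was signed by an internal CA, openssl cannot chain it to the system trust store, and the verification fails even though the chain itself may be valid. In that case, you can verify against the signing CA directly; a sketch, assuming you saved that CA certificate as ca.pem:

    # Verify the certificate chain against a locally saved CA certificate
    openssl verify -CAfile ca.pem cert.pem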


BOSH Health Metrics Cause Errors When Two Healthwatch Exporter Tiles Are Installed

When you install both Healthwatch Exporter for TAS for VMs and Healthwatch Exporter for TKGI on the same Ops Manager foundation, the BOSH Director Status panel in the BOSH Director Health dashboard in the Grafana UI shows “Not Running”, and your BOSH Director deployment returns the following error:

 Director responded with non-successful status code '401' response '{"code":600000,"description":"Require one of the scopes: bosh.admin, bosh.750587e9-eae5-494f-99c4-5ca429b13959.admin, bosh.teams.p-healthwatch2-pas-exporter-b3a337d7ec4cca94f166.admin"}'

This occurs because both Healthwatch Exporter tiles deploy a BOSH health metric exporter VM, and both BOSH health metric exporter VMs are named bosh-health-exporter. This causes the two sets of metrics to conflict with each other.

To address this, you must scale the BOSH health metric exporter VM down to zero instances in one of the Healthwatch Exporter tiles.

To scale the BOSH health metric exporter VM down to zero instances in one of the Healthwatch Exporter tiles:

  1. Navigate to the Ops Manager Installation Dashboard.

  2. Click the Healthwatch Exporter for Tanzu Kubernetes Grid - Integrated tile or Healthwatch Exporter for Tanzu Application Service tile.

  3. Select Resource Config.

  4. In the Bosh Health Exporter row, select 0 from the Instances dropdown.

  5. Click Save.

  6. Return to the Ops Manager Installation Dashboard.

  7. Click Review Pending Changes.

  8. Click Apply Changes.
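
If you manage foundation configuration with the om CLI, you can script the same change. The sketch below assumes om is already targeted and authenticated; the product name and job name are assumptions that vary by tile and version, so confirm them with om staged-products and om staged-config before applying. Create a resource-config.yml file:

    product-name: p-healthwatch2-pas-exporter
    resource-config:
      bosh-health-exporter:
        instances: 0

Then apply it:

    om configure-product --config resource-config.yml
    om apply-changes --product-name p-healthwatch2-pas-exporter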


Healthwatch Exporter for TKGI Does Not Clean Up TKGI Service Accounts

If you run SLI tests for TKGI through Healthwatch Exporter for TKGI, and you do not have an OpenID Connect (OIDC) provider for your Kubernetes clusters configured for TKGI, the TKGI SLI exporter VM does not automatically clean up the service accounts that it creates while running the TKGI SLI test suite.

To fix this issue, either upgrade to Healthwatch v2.2.1 or configure an OIDC provider as the identity provider for your Kubernetes clusters in the TKGI tile. This cleans up the service accounts that the TKGI SLI exporter VM creates in future TKGI SLI tests, but does not clean up existing service accounts from previous TKGI SLI tests. For more information about configuring an OIDC provider in TKGI, see the TKGI documentation.

VMware recommends that you manually delete existing service accounts from previous TKGI SLI tests if running the tkgi get-credentials command returns an error similar to the following example:

Error: Status: 500; ErrorMessage: nil; Description: Create Binding: Timed out waiting for secrets; ResponseError: nil

Manually deleting service accounts also deletes the secrets and ClusterRoleBindings associated with those service accounts.

To manually delete a service account:

  1. In a terminal window, run:

    kubectl delete serviceaccount -n NAMESPACE SERVICE-ACCOUNT
    

    Where:

    • NAMESPACE is the namespace that contains the service account you want to delete.
    • SERVICE-ACCOUNT is the service account you want to delete.
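
    For example, to find leftover service accounts and then delete one (the namespace and account name below are hypothetical):

    # List service accounts in every namespace to identify leftovers from SLI tests
    kubectl get serviceaccounts --all-namespaces
    # Hypothetical example: delete one leftover account in the default namespace
    kubectl delete serviceaccount -n default healthwatch-sli-test-account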


BBR Backup Snapshots Fill Disk Space on Prometheus VMs

In Healthwatch v2.2.0, the backup scripts for Prometheus VMs do not clean up the intermediary snapshots created by BOSH Backup and Restore (BBR). This results in the disk space on Prometheus VMs filling up.

To fix this issue, either upgrade to Healthwatch v2.2.1 or manually clean up the snapshots. To manually clean up the snapshots:

  1. Log in to the Prometheus VM you want to clean up by following the procedure in the Ops Manager documentation.

  2. Run:

    sudo -i
    
  3. Empty the snapshots folder for the Prometheus VM by running:

    rm -rf /var/vcap/store/prometheus/snapshots/*
    
  4. Change into the snapshots folder by running:

    cd /var/vcap/store/prometheus/snapshots
    
  5. Verify that the /var/vcap/store/prometheus/snapshots directory is empty by running:

    ls
    



Troubleshooting Missing Router Metrics

If you are using Pivotal Application Service (PAS) v2.7, you must activate the Enable system metrics checkbox in the System Logging pane of the PAS tile. Otherwise, PAS v2.7 does not forward system metrics to the Loggregator Firehose, so Healthwatch Exporter for TAS for VMs cannot collect metrics for the VMs that PAS v2.7 deploys. As a result, the Router dashboard in the Grafana UI shows no metrics.

To allow PAS v2.7 to forward system metrics to the Loggregator Firehose:

  1. Navigate to the Ops Manager Installation Dashboard.

  2. Click the Pivotal Application Service tile.

  3. Select System Logging.

  4. Activate the Enable system metrics checkbox.

  5. Return to the Ops Manager Installation Dashboard.

  6. Click Review Pending Changes.

  7. Click Apply Changes.



Troubleshooting Missing TKGI Cluster Metrics

The sections below describe how to troubleshoot missing TKGI cluster metrics in the Grafana UI.

To find out why the Prometheus instance fails to scrape metrics from your TKGI clusters, see Diagnose Prometheus Scrape Job Failure below. Other potential causes of missing metrics include a known issue in TKGI v1.10.0 and v1.10.1 and missing DNS entries for your TKGI clusters, described in the subsequent sections.


Diagnose Prometheus Scrape Job Failure

When the Kubernetes Nodes dashboard in the Grafana UI does not show metrics data, the Prometheus instance in the Healthwatch tile has failed to scrape metrics from on-demand Kubernetes clusters created through the TKGI API.

To find out why the Prometheus instance fails to scrape metrics from your TKGI clusters:

  1. Log in to one of the VMs in the Prometheus instance by following the procedure in the Ops Manager documentation.

  2. View information about the Prometheus instance scrape targets by running:

    curl -vk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq .
    
  3. Find the scrape jobs for your TKGI clusters. The lastError field describes why the Prometheus instance failed to scrape your TKGI clusters.
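
    For example, to list only the targets that are currently failing, along with their scrape pool and error:

    curl -sk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq '.data.activeTargets[] | select(.health != "up") | {scrapePool, lastError}'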


No Data on Kubernetes Nodes Dashboard for TKGI v1.10

If you are using TKGI v1.10.0 or v1.10.1, the Kubernetes Nodes dashboard in the Grafana UI might not show data for individual pods. This is due to a known issue in Kubernetes v1.19.6 and earlier and Kubernetes v1.20.1 and earlier.

To fix this issue, upgrade to TKGI v1.10.2 or later. For more information about upgrading to TKGI v1.10.2 or later, see the TKGI documentation.


Configure DNS for Your TKGI Clusters

When TKGI cluster discovery fails, you see the following error:

2020-05-20 19:24:02 ERROR k8s.K8sClient [parallel-1] Failed to make request
java.net.UnknownHostException: CLUSTER-NAME.ENVIRONMENT-DOMAIN

Where:

  • CLUSTER-NAME is the name of your TKGI cluster.
  • ENVIRONMENT-DOMAIN is the domain of your TKGI foundation.

This occurs because the TKGI API cannot access your TKGI clusters from the Internet. To resolve this issue, you must configure a DNS entry for the control plane of each of your TKGI clusters in the console for your IaaS.

To configure DNS entries for the control planes of your TKGI clusters:

  1. Find the IP addresses and hostnames of the control plane of each of your TKGI clusters. For more information, see the TKGI documentation.

  2. Record the Kubernetes Master IP(s) and Kubernetes Master Host from the output you viewed in the previous step. For more information, see the TKGI documentation.

  3. In a web browser, log in to the user console for your IaaS.

  4. For each TKGI cluster, find the public IP address of the VM that has an internal IP address matching the Kubernetes Master IP(s) you recorded in a previous step. For more information, see the documentation for your IaaS.

  5. For each TKGI cluster, create an A record in your DNS server that points to the public IP address of the control plane of the TKGI cluster that you recorded in the previous step. For more information, see the documentation for your IaaS:

    • AWS: For more information about configuring a DNS entry in the Amazon VPC console, see the AWS documentation.
    • Azure: For more information about configuring an A record in Azure DNS, see the Azure documentation.
    • GCP: For more information about adding an A record to Cloud DNS, see the GCP documentation.
    • OpenStack: For more information about configuring a DNS entry in the OpenStack internal DNS, see the OpenStack documentation.
    • vSphere: For more information about configuring a DNS entry in the vCenter Server Appliance, see the vSphere documentation.
  6. Wait for your DNS server to update.
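
    To confirm that the record resolves, you can query it from a terminal window. CLUSTER-NAME and ENVIRONMENT-DOMAIN are the same placeholders used in the error message above:

    # A successful lookup returns the public IP address you configured
    dig +short CLUSTER-NAME.ENVIRONMENT-DOMAIN
    nslookup CLUSTER-NAME.ENVIRONMENT-DOMAIN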



Troubleshooting Healthwatch Exporter Tiles Using Grafana UI Dashboards

By default, the Grafana UI includes dashboards for Healthwatch Exporter tiles under the Healthwatch folder.


Viewing Healthwatch Exporter Tile Metrics

The Healthwatch - SLOs dashboard in the Grafana UI displays a row for each metric exporter VM you select from the corresponding metric exporter instance dropdown at the top of the page. Each row contains four panels:

  • Up: The current health of the Prometheus endpoint on the metric exporter VM. A value of 1 indicates that the Prometheus endpoint is healthy. A value of 0 or missing data indicates that either the Prometheus endpoint is unresponsive or the Prometheus instance failed to scrape the Prometheus endpoint. For more information, see the Prometheus documentation.

  • Exporter SLO: The percentage of time that the Healthwatch Exporter tile was up and running over the selected time period.

  • Error Budget Remaining: How many minutes are left in the error budget before exceeding the selected Uptime SLO Target over the selected time period.

  • Minutes of Downtime: How many minutes the Healthwatch Exporter tiles were down during the selected time period.
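
To check the value behind the Up panel directly, you can query the Prometheus HTTP API from a VM in the Prometheus instance, reusing the certificate paths shown earlier in this topic:

    # Query the current value of the up metric for every scrape target
    curl -sk 'https://localhost:9090/api/v1/query' --data-urlencode 'query=up' --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key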


Troubleshooting Healthwatch Exporter for TAS for VMs

The Healthwatch - Exporter Troubleshooting dashboard in the Grafana UI displays metrics that allow you to monitor the performance of each Healthwatch Exporter for TAS for VMs tile installed on your Ops Manager foundations. You can use these metrics to troubleshoot when you see inconsistent graphs for a particular metric type, or if a Healthwatch Exporter for TAS for VMs tile is not behaving as expected.

These dashboards contain the following panels:

  • Exporter Info: A listing of the healthwatch_pasExporter_status metric, showing runtime information for Healthwatch Exporter for TAS for VMs.

  • Exporter JVM Memory: A graph of the jvm_memory_bytes_used, jvm_memory_bytes_committed, and jvm_memory_bytes_init metrics, showing the number of used, committed, and initial bytes in a given Java virtual machine (JVM) memory area over the selected time period. You can use this graph to check for memory leaks.

  • Ephemeral Disk Usage: A gauge of the system_disk_ephemeral_percent metric, showing the percentage of the ephemeral disk used. You can use this gauge to determine whether the disk is reaching capacity.

  • Rate of Garbage Collection: A graph of the jvm_gc_collection_seconds_sum metric, showing the rate of JVM garbage collection over the selected time period. You can use this graph to determine whether the JVM garbage collection is functional.

  • Rate of Envelope Ingress: A graph of the healthwatch_pasExporter_ingress_envelopes metric, showing the rate of Loggregator envelope ingress over the selected time period. You can use this graph to check for spikes in the number of Loggregator envelopes that the metric exporter VMs receive.

  • CPU Usage: A graph of the cpu_usage_user metric, showing the percentage of CPU used over the selected time period. You can use this graph to determine whether the amount of CPU used by Healthwatch Exporter for TAS for VMs is reaching capacity.

  • Exporter VM Threads: A graph of the jvm_threads_current and jvm_threads_peak metrics, showing the current and peak thread counts of a given JVM over the selected time period. You can use this graph to check whether Healthwatch Exporter for TAS for VMs is leaking threads.
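
The panels above are backed by standard PromQL queries over these metrics. For example, an equivalent of the Rate of Garbage Collection panel's query, run against the Prometheus HTTP API from a VM in the Prometheus instance (a sketch; the 5-minute window is an assumption):

    # Per-second rate of time spent in JVM garbage collection over the last 5 minutes
    curl -sk 'https://localhost:9090/api/v1/query' --data-urlencode 'query=rate(jvm_gc_collection_seconds_sum[5m])' --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key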
