This topic describes how to troubleshoot problems and known issues that may arise when deploying or operating Healthwatch™ for VMware Tanzu® (Healthwatch), Healthwatch Exporter for VMware Tanzu® Application Service™ (TAS for VMs), and Healthwatch Exporter for VMware Tanzu® Kubernetes Grid™ Integrated Edition (TKGI).
The sections below describe how to access the user interfaces (UIs) of the Prometheus and Alertmanager VMs for troubleshooting.
The Prometheus UI allows you to view the state of the VMs in the Prometheus instance that the Healthwatch tile deploys, including currently firing alerts and the health status of scrape targets. Because the Prometheus UI is not secure, the Healthwatch tile does not include it. However, you can access the Prometheus UI to troubleshoot the Prometheus instance.
To access the Prometheus UI:
Run:
bosh deployments
This command returns a list of all BOSH deployments that are currently running.
Record the name of your Healthwatch deployment.
Run:
bosh -d DEPLOYMENT-NAME ssh tsdb/0 --opts='-L 9090:localhost:9090'
Where DEPLOYMENT-NAME is the name of your Healthwatch deployment that you recorded in the previous step.
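For example, if your deployment is named healthwatch-2-2-abc123 (a hypothetical name; substitute the name you recorded), the command looks like this:
bosh -d healthwatch-2-2-abc123 ssh tsdb/0 --opts='-L 9090:localhost:9090'
This opens an SSH session to the tsdb/0 VM and forwards local port 9090 to the Prometheus port on that VM. Leave the session open while you use the Prometheus UI.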
Navigate to the Ops Manager Installation Dashboard.
Click the Healthwatch tile.
Select the Credentials tab.
In the Tsdb Client Mtls row, click Link to Credential.
Record the certificate and private key for Tsdb Client Mtls.
Add the certificate and private key for Tsdb Client Mtls that you recorded in the previous step to the keystore for your operating system.
To store the Tsdb Client Mtls certificate and key on macOS:
Save the Tsdb Client Mtls certificate to a cert.pem file.
Save the Tsdb Client Mtls private key to a cert.key file.
Change the access permissions on the certificate and private key files to 0600. For example:
chmod 0600 ~/Downloads/cert.key
chmod 0600 ~/Downloads/cert.pem
To import the Tsdb Client Mtls private key into the macOS keychain:
security import KEY-PATH -k ~/Library/Keychains/login.keychain-db
Where KEY-PATH is the path to the cert.key file. For example, ~/Downloads/cert.key.
To import the Tsdb Client Mtls certificate into the macOS keychain:
Add the cert.pem file to your keychain. For example, double-click the cert.pem file in Finder or drag it into the Keychain Access app.
In a web browser, navigate to localhost:9090. If your browser prompts you to specify which certificate to use for mTLS, select the certificate you added to your operating system keystore.
On macOS:
Navigate to https://localhost:9090.
If your browser displays a certificate warning, type thisisunsafe into the webpage to bypass it.
The Prometheus UI appears in your browser.
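As a quick command-line check that the tunnel and mTLS certificate work, you can also query the Prometheus HTTP API directly through the forwarded port. This is a sketch that assumes the cert.pem and cert.key files from the earlier steps are in ~/Downloads; adjust the paths to wherever you saved them:
curl -k https://localhost:9090/api/v1/query --data-urlencode 'query=up' --cert ~/Downloads/cert.pem --key ~/Downloads/cert.key
A JSON response containing "status":"success" indicates that Prometheus accepted the client certificate over the tunnel.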
The Alertmanager UI allows you to view which alerts are currently firing. Because the Alertmanager UI is not secure, the Healthwatch tile does not include it. However, you can access the Alertmanager UI to troubleshoot or silence alerts.
To access the Alertmanager UI:
Run:
bosh deployments
This command returns a list of all BOSH deployments that are currently running.
Record the name of your Healthwatch deployment.
Run:
bosh -d DEPLOYMENT-NAME ssh tsdb/0 --opts='-L 8080:localhost:10401'
Where DEPLOYMENT-NAME is the name of your Healthwatch deployment that you recorded in the previous step.
In a web browser, navigate to localhost:8080. The Alertmanager UI appears.
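If you need to silence an alert from the command line instead of the UI, you can create a silence through the Alertmanager HTTP API over the same tunnel. The following is a minimal sketch; the alertname value ExampleAlert, the timestamps, and the comment are placeholders to replace with your own values:
curl -X POST http://localhost:8080/api/v2/silences -H 'Content-Type: application/json' -d '{"matchers": [{"name": "alertname", "value": "ExampleAlert", "isRegex": false}], "startsAt": "2023-01-01T00:00:00Z", "endsAt": "2023-01-01T02:00:00Z", "createdBy": "operator", "comment": "Silencing during planned maintenance"}'
The response includes the ID of the new silence, which you can later expire from the Alertmanager UI.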
The sections below describe how to troubleshoot known issues in Healthwatch and Healthwatch Exporter for TKGI.
When installing or upgrading to Healthwatch v2.2, you see the following error:
- Unable to render templates for job 'opsman-cert-expiration-exporter'. Errors are: - Error filling in template 'bpm.yml.erb' (line 9: Can't find property '["opsman_access_credentials.uaa_client_secret"]')
This error occurs if you upgraded from Ops Manager v2.3 or earlier to Ops Manager v2.4 through v2.7. To resolve this issue:
SSH into the Ops Manager VM by following the procedure in the Ops Manager documentation.
Change the user to root.
Open the Rails console by running:
> cd /home/tempest-web/tempest/web; RAILS_ENV='production' TEMPEST_INFRASTRUCTURE='DEPLOYMENT-IAAS' TEMPEST_WEB_DIR='/home/tempest-web' SECRET_KEY_BASE='1234' DATA_ROOT='/var/tempest' LOG_DIR='/var/log/opsmanager' su tempest-web --command 'bundle exec rails console'
Where DEPLOYMENT-IAAS is google, aws, azure, vsphere, or openstack, depending on the IaaS of your Ops Manager deployment.
Set the decryption passphrase by running:
irb(main):001:0> EncryptionKey.instance.passphrase = 'DECRYPTION-PASSPHRASE'
Where DECRYPTION-PASSPHRASE is your Ops Manager decryption passphrase.
Update the UAA restricted view access client secret by running:
irb(main):001:0> Uaa::UaaConfig.instance.update_attributes(restricted_view_api_access_client_secret: SecureRandom.hex)
Exit the Rails console and restart the tempest-web service by running:
irb(main):001:0> exit
> service tempest-web restart
This issue is fixed in Ops Manager v2.8 and later.
When you deploy Healthwatch, the Smoke Tests errand fails with the following error message:
querying for grafana up should be greater than 0
The Smoke Tests errand fails because the Prometheus instance fails to scrape metrics from the Grafana instance. Potential causes of this failure include:
There is a network issue between the Prometheus instance and Grafana instance.
The Grafana instance uses a certificate that does not match the certificate authority (CA) you configured in the Grafana pane of the Healthwatch tile. This can occur if the CA you configured in the Grafana pane is a self-signed certificate or is a different CA from the one that generated the Grafana certificate. As a result, the Prometheus instance does not trust the certificate that the Grafana instance uses. For more information about configuring a CA for the Grafana instance, see (Optional) Configure Grafana in Configuring Healthwatch.
To find out why the Prometheus instance fails to scrape metrics from the Grafana instance:
Log in to one of the VMs in the Prometheus instance by following the procedure in the Ops Manager documentation.
View information about the Grafana instance scrape target by running:
curl -vk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq '.data.activeTargets[] | select(.scrapePool == "grafana")'
The lastError field in the command output describes why the Prometheus instance failed to scrape the Grafana instance.
When the TKGI metric exporter VM attempts to connect to the BOSH Director, you see the following error:
ERROR [context.UaaContext [ForkJoinPool-1-worker-3]] javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
ERROR [ingress.TokenCallCredentials [ForkJoinPool-1-worker-3]] Caught error retrieving UAA token: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
INFO [ingress.EventStreamObserver [ForkJoinPool-1-worker-3]] io.grpc.StatusRuntimeException: UNAUTHENTICATED
This error appears when the TKGI metric exporter VM cannot verify that the certificate chain of the UAA server for the BOSH Director is valid. To allow the TKGI metric exporter VM to connect to the BOSH Director, you must correct any certificate chain errors.
To check for certificate chain errors in the UAA server for the BOSH Director:
Log in to the TKGI metric exporter VM by following the procedure in the Ops Manager documentation.
View the certificate that the UAA server uses by running:
openssl s_client -connect 10.0.0.5:8443
Save the certificate as a cert.pem file.
Run:
openssl verify cert.pem
If the command returns an OK message, the certificate is trusted and has a valid certificate chain. If the command returns any other message, see the OpenSSL documentation to troubleshoot.
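If you prefer to capture the certificate and verify it against a specific CA in one pass, the following sketch can help. The Director address reuses the example above, and the bosh-ca.pem filename is an assumption; substitute the CA certificate you trust for your BOSH Director:
openssl s_client -connect 10.0.0.5:8443 -showcerts </dev/null 2>/dev/null | openssl x509 -outform PEM > cert.pem
openssl verify -CAfile bosh-ca.pem cert.pem
The first command extracts the certificate that the UAA server presents on port 8443, and the second verifies it against the explicit CA file instead of the system trust store.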
When you install both Healthwatch Exporter for TAS for VMs and Healthwatch Exporter for TKGI on the same Ops Manager foundation, the BOSH Director Status panel in the BOSH Director Health dashboard in the Grafana UI shows “Not Running”, and your BOSH Director deployment returns the following error:
Director responded with non-successful status code '401' response '{"code":600000,"description":"Require one of the scopes: bosh.admin, bosh.750587e9-eae5-494f-99c4-5ca429b13959.admin, bosh.teams.p-healthwatch2-pas-exporter-b3a337d7ec4cca94f166.admin"}'
This occurs because both Healthwatch Exporter tiles deploy a BOSH health metric exporter VM, and both BOSH health metric exporter VMs are named bosh-health-exporter. This causes the two sets of metrics to conflict with each other.
To address this, you must scale the BOSH health metric exporter VM down to zero instances in one of the Healthwatch Exporter tiles.
To scale the BOSH health metric exporter VM down to zero instances in one of the Healthwatch Exporter tiles:
Navigate to the Ops Manager Installation Dashboard.
Click the Healthwatch Exporter for Tanzu Kubernetes Grid - Integrated tile or Healthwatch Exporter for Tanzu Application Service tile.
Select Resource Config.
In the Bosh Health Exporter row, select 0 from the Instances dropdown.
Click Save.
Return to the Ops Manager Installation Dashboard.
Click Review Pending Changes.
Click Apply Changes.
If you run SLI tests for TKGI through Healthwatch Exporter for TKGI, and you do not have an OpenID Connect (OIDC) provider for your Kubernetes clusters configured for TKGI, the TKGI SLI exporter VM does not automatically clean up the service accounts that it creates while running the TKGI SLI test suite.
To fix this issue, either upgrade to Healthwatch v2.2.1 or configure an OIDC provider as the identity provider for your Kubernetes clusters in the TKGI tile. This cleans up the service accounts that the TKGI SLI exporter VM creates in future TKGI SLI tests, but does not clean up existing service accounts from previous TKGI SLI tests. For more information about configuring an OIDC provider in TKGI, see the TKGI documentation.
VMware recommends that you manually delete existing service accounts from previous TKGI SLI tests if running the tkgi get-credentials command returns an error similar to the following example:
Error: Status: 500; ErrorMessage: nil; Description: Create Binding: Timed out waiting for secrets; ResponseError: nil
Manually deleting service accounts also deletes the secrets and ClusterRoleBindings associated with those service accounts.
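To see which service accounts are left over before deleting them, you can list service accounts across all namespaces and inspect the results. This is a generic kubectl sketch; Healthwatch does not document a fixed naming pattern for the SLI test accounts, so identify them by name and creation time:
kubectl get serviceaccounts --all-namespaces --sort-by=.metadata.creationTimestamp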
To manually delete a service account:
In a terminal window, run:
kubectl delete serviceaccount -n NAMESPACE SERVICE-ACCOUNT
Where:
NAMESPACE is the namespace that contains the service account you want to delete.
SERVICE-ACCOUNT is the service account you want to delete.
In Healthwatch v2.2.0, the backup scripts for Prometheus VMs do not clean up the intermediary snapshots created by BOSH Backup and Restore (BBR). This results in the disk space on Prometheus VMs filling up.
To fix this issue, either upgrade to Healthwatch v2.2.1 or manually clean up the snapshots. To manually clean up the snapshots:
Log in to the Prometheus VM you want to clean up by following the procedure in the Ops Manager documentation.
Run:
sudo -i
Empty the snapshots folder for the Prometheus VM by running:
rm -rf /var/vcap/store/prometheus/snapshots/*
Change into the snapshots folder by running:
cd /var/vcap/store/prometheus/snapshots
Verify that the /var/vcap/store/prometheus/snapshots directory is empty by running:
ls
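To confirm that the cleanup freed disk space, you can compare disk usage before and after removing the snapshots. These are standard Linux commands, not Healthwatch-specific tooling:
du -sh /var/vcap/store/prometheus/snapshots
df -h /var/vcap/store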
If you are using Pivotal Application Service (PAS) v2.7, you must activate the Enable system metrics checkbox in the System Logging pane of the PAS tile. Otherwise, PAS v2.7 does not forward system metrics to the Loggregator Firehose, so Healthwatch Exporter for TAS for VMs cannot collect metrics for the VMs that PAS v2.7 deploys. As a result, the Router dashboard in the Grafana UI shows no metrics.
To allow PAS v2.7 to forward system metrics to the Loggregator Firehose:
Navigate to the Ops Manager Installation Dashboard.
Click the Pivotal Application Service tile.
Select System Logging.
Activate the Enable system metrics checkbox.
Return to the Ops Manager Installation Dashboard.
Click Review Pending Changes.
Click Apply Changes.
The sections below describe how to troubleshoot missing TKGI cluster metrics in the Grafana UI.
To find out why the Prometheus instance fails to scrape metrics from your TKGI clusters, see Diagnose Prometheus Scrape Job Failure below.
Potential causes of this failure include:
You are using TKGI v1.10.0 or v1.10.1. For more information, see No Data on Kubernetes Nodes Dashboard for TKGI v1.10 below.
You are using TKGI v1.12. For more information, see No Data on Kubernetes Nodes Dashboard for TKGI v1.12 in Healthwatch v2.2 Release Notes.
You are using TKGI to monitor Windows clusters. For more information, see No Data on Kubernetes Nodes Dashboard for Windows Clusters in Healthwatch v2.2 Release Notes.
TKGI cluster discovery failed, so the Prometheus instance in the Healthwatch tile could not detect the clusters or create scrape jobs for them. For more information, see Configure DNS for Your TKGI Cluster below.
When the Kubernetes Nodes dashboard in the Grafana UI does not show metrics data, the Prometheus instance in the Healthwatch tile has failed to scrape metrics from on-demand Kubernetes clusters created through the TKGI API.
To find out why the Prometheus instance fails to scrape metrics from your TKGI clusters:
Log in to one of the VMs in the Prometheus instance by following the procedure in the Ops Manager documentation.
View information about the Prometheus instance scrape targets by running:
curl -vk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq .
Find the scrape jobs for your TKGI clusters. The lastError field describes why the Prometheus instance failed to scrape your TKGI clusters. You can narrow the output to only the relevant fields by extending the jq filter, as shown below.
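The following sketch selects only the job name, health, and last error for each active target. The field names come from the Prometheus targets API; the exact filter is illustrative:
curl -sk https://localhost:9090/api/v1/targets --cacert /var/vcap/jobs/prometheus/config/certs/prometheus_ca.pem --cert /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.pem --key /var/vcap/jobs/prometheus/config/certs/prometheus_certificate.key | /var/vcap/packages/prometheus_backup_jq/bin/jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'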
If you are using TKGI v1.10.0 or v1.10.1, the Kubernetes Nodes dashboard in the Grafana UI might not show data for individual pods. This is due to a known issue in Kubernetes v1.19.6 and earlier and Kubernetes v1.20.1 and earlier.
To fix this issue, upgrade to TKGI v1.10.2 or later. For more information about upgrading to TKGI v1.10.2 or later, see the TKGI documentation.
When TKGI cluster discovery fails, you see the following error:
2020-05-20 19:24:02 ERROR k8s.K8sClient [parallel-1] Failed to make request
java.net.UnknownHostException: CLUSTER-NAME.ENVIRONMENT-DOMAIN
Where:
CLUSTER-NAME is the name of your TKGI cluster.
ENVIRONMENT-DOMAIN is the domain of your TKGI foundation.
This occurs because the TKGI API cannot access your TKGI clusters from the Internet. To resolve this issue, you must configure a DNS entry for the control plane of each of your TKGI clusters in the console for your IaaS.
To configure DNS entries for the control planes of your TKGI clusters:
Find the IP addresses and hostnames of the control plane of each of your TKGI clusters. For more information, see the TKGI documentation.
Record the Kubernetes Master IP(s) and Kubernetes Master Host from the output you viewed in the previous step. For more information, see the TKGI documentation.
In a web browser, log in to the user console for your IaaS.
For each TKGI cluster, find the public IP address of the VM that has an internal IP address matching the Kubernetes Master IP(s) you recorded in a previous step. For more information, see the documentation for your IaaS.
For each TKGI cluster, create an A record in your DNS server that points to the public IP address of the control plane of the TKGI cluster that you recorded in the previous step. For more information, see the documentation for your IaaS.
Wait for your DNS server to update.
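After the DNS record propagates, you can verify that the cluster hostname resolves to the public IP address you configured. CLUSTER-NAME and ENVIRONMENT-DOMAIN are the same placeholders used in the error message above:
dig +short CLUSTER-NAME.ENVIRONMENT-DOMAIN
If the command returns the public IP address you created the A record for, TKGI cluster discovery should succeed on the next scrape.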
By default, the Grafana UI includes dashboards for Healthwatch Exporter tiles under the Healthwatch folder.
The Healthwatch - SLOs dashboard in the Grafana UI displays a row for each metric exporter VM you select from the corresponding metric exporter instance dropdown at the top of the page. Each row contains four panels:
Up: The current health of the Prometheus endpoint on the metric exporter VM. A value of 1 indicates that the Prometheus endpoint is healthy. A value of 0 or missing data indicates that either the Prometheus endpoint is unresponsive or the Prometheus instance failed to scrape the Prometheus endpoint. For more information, see the Prometheus documentation.
Exporter SLO: The percentage of time that the Healthwatch Exporter tile was up and running over the selected time period.
Error Budget Remaining: How many minutes are left in the error budget before exceeding the selected Uptime SLO Target over the selected time period. For an example calculation, see below.
Minutes of Downtime: How many minutes the Healthwatch Exporter tiles were down during the selected time period.
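As an illustrative calculation (the 99.9% target and 30-day window are assumptions; use the Uptime SLO Target and time period you selected): a 30-day window contains 43,200 minutes, so a 99.9% target leaves an error budget of 0.1% of 43,200 minutes, or about 43 minutes. Minutes of Downtime counts against that budget, and Error Budget Remaining shows how much of it is left.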
The Healthwatch - Exporter Troubleshooting dashboard in the Grafana UI displays metrics that allow you to monitor the performance of each Healthwatch Exporter for TAS for VMs tile installed on your Ops Manager foundations. You can use these metrics to troubleshoot when you see inconsistent graphs for a particular metric type, or if a Healthwatch Exporter for TAS for VMs tile is not behaving as expected.
This dashboard contains the following panels:
Exporter Info: A listing of the healthwatch_pasExporter_status metric, showing runtime information for Healthwatch Exporter for TAS for VMs.
Exporter JVM Memory: A graph of the jvm_memory_bytes_used, jvm_memory_bytes_committed, and jvm_memory_bytes_init metrics, showing the number of used, committed, and initial bytes in a given Java virtual machine (JVM) memory area over the selected time period. You can use this graph to check for memory leaks.
Ephemeral Disk Usage: A gauge of the system_disk_ephemeral_percent metric, showing the percentage of the ephemeral disk used. You can use this gauge to determine whether the disk is reaching capacity.
Rate of Garbage Collection: A graph of the jvm_gc_collection_seconds_sum metric, showing the rate of JVM garbage collection over the selected time period. You can use this graph to determine whether the JVM garbage collection is functional.
Rate of Envelope Ingress: A graph of the healthwatch_pasExporter_ingress_envelopes metric, showing the rate of Loggregator envelope ingress over the selected time period. You can use this graph to check for spikes in the number of Loggregator envelopes that the metric exporter VMs receive.
CPU Usage: A graph of the cpu_usage_user metric, showing the percentage of CPU used over the selected time period. You can use this graph to determine whether the amount of CPU used by Healthwatch Exporter for TAS for VMs is reaching capacity.
Exporter VM Threads: A graph of the jvm_threads_current and jvm_threads_peak metrics, showing the current and peak thread counts of a given JVM over the selected time period. You can use this graph to check whether Healthwatch Exporter for TAS for VMs is leaking threads.
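If you want to inspect any of these metrics outside the Grafana UI, you can query them against the Prometheus API over the SSH tunnel to the Prometheus VM described earlier in this topic. This sketch assumes that tunnel and the client certificate files from that procedure; the 5-minute rate window is also an assumption:
curl -k https://localhost:9090/api/v1/query --data-urlencode 'query=rate(jvm_gc_collection_seconds_sum[5m])' --cert ~/Downloads/cert.pem --key ~/Downloads/cert.key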