Configuring TKGI Cluster Discovery

This topic describes how to configure VMware Tanzu® Kubernetes Grid™ Integrated Edition (TKGI) cluster discovery in Healthwatch™ for VMware Tanzu® (Healthwatch).

Overview of TKGI Cluster Discovery

In the TKGI Cluster Discovery pane of the Healthwatch tile, you configure the Prometheus instance in the Healthwatch tile to detect on-demand Kubernetes clusters created through the TKGI API and create scrape jobs for them. You only need to configure this pane if you have Ops Manager foundations with TKGI installed.

The Prometheus instance detects and scrapes TKGI clusters by connecting to the Kubernetes API through the TKGI API using a UAA client. To allow this, you must configure the Healthwatch tile, the Prometheus instance in the Healthwatch tile, the UAA client that the Prometheus instance uses to connect to the TKGI API, and the TKGI tile.

To configure TKGI cluster discovery:

Configure the TKGI Cluster Discovery pane in the Healthwatch tile. For more information, see Configure TKGI Cluster Discovery in Healthwatch below.
Configure TKGI to allow the Prometheus instance to scrape metrics from TKGI clusters. For more information, see Configure TKGI below.

If TKGI cluster discovery fails after you have completed both parts of the procedure in this topic, see Troubleshooting TKGI Cluster Discovery Failure below.

Note: To collect additional BOSH system metrics related to TKGI and view them in the Grafana UI, you must install and configure the Healthwatch Exporter for TKGI on your Ops Manager foundations with TKGI installed. To install the Healthwatch Exporter for TKGI tile, see Installing a Tile Manually. To configure the Healthwatch Exporter for TKGI tile, see Configuring Healthwatch Exporter for TKGI.

Configure TKGI Cluster Discovery in Healthwatch

In the TKGI Cluster Discovery pane of the Healthwatch tile, you configure TKGI cluster discovery, including the UAA client that the Prometheus instance uses to connect to the Kubernetes API through the TKGI API.

To configure the TKGI Cluster Discovery pane:

Navigate to the Ops Manager Installation Dashboard.
Click the Healthwatch tile.
Select TKGI Cluster Discovery.
Under TKGI cluster discovery, select one of the following options:
- On: This option allows TKGI cluster discovery and reveals the configuration fields described in the steps below. TKGI cluster discovery is allowed by default when TKGI is installed on your Ops Manager foundation.
- Off: This option disallows TKGI cluster discovery.
For Discovery interval, enter in seconds how frequently you want the Prometheus instance to detect and scrape TKGI clusters. The minimum value is 60.
(Optional) To allow the Prometheus instance to communicate with the TKGI API over TLS, configure one of the following options:
- To configure the Prometheus instance to use a self-signed CA certificate or a certificate that is signed by a self-signed CA certificate when communicating with the TKGI API over TLS, provide the certificate for the CA in CA certificate for TLS. If you provide a self-signed CA certificate, it must be for the same CA that signs the certificate in the TKGI API. If the Prometheus instance uses certificates signed by a trusted third-party CA or the Skip TLS certificate verification checkbox is activated, do not configure this field.
- If you do not provide a self-signed CA certificate or a certificate that is signed by a self-signed CA certificate, you can activate the Skip TLS certificate verification checkbox. When this checkbox is activated, the Prometheus instance does not verify the identity of the TKGI API. This checkbox is deactivated by default. VMware does not recommend skipping TLS certificate verification in a production environment.
Click Save.
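
The constraints in the steps above can be checked before you enter values in the pane. The following Python snippet is a minimal sketch, not part of Healthwatch: the `DiscoverySettings` class, its field names, and the `validate` function are hypothetical stand-ins for the fields in the TKGI Cluster Discovery pane. It encodes only the two rules stated in this topic: the discovery interval has a minimum of 60 seconds, and the CA certificate field should not be configured when the Skip TLS certificate verification checkbox is activated.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DISCOVERY_INTERVAL_SECONDS = 60  # minimum value stated in this topic


@dataclass
class DiscoverySettings:
    """Hypothetical stand-in for the fields in the TKGI Cluster Discovery pane."""
    enabled: bool
    discovery_interval_seconds: int
    ca_certificate_pem: Optional[str] = None
    skip_tls_verification: bool = False


def validate(settings: DiscoverySettings) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems: List[str] = []
    if not settings.enabled:
        # The remaining fields only apply when discovery is set to On.
        return problems
    if settings.discovery_interval_seconds < MIN_DISCOVERY_INTERVAL_SECONDS:
        problems.append("Discovery interval must be at least 60 seconds.")
    if settings.ca_certificate_pem and settings.skip_tls_verification:
        problems.append(
            "Provide a CA certificate or skip TLS verification, not both."
        )
    return problems
```

For example, `validate(DiscoverySettings(enabled=True, discovery_interval_seconds=30))` reports the interval problem, while a setting with an interval of 60 and only a CA certificate reports none.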

Configure TKGI

After you configure TKGI cluster discovery in the Healthwatch tile, you must configure TKGI to allow the Prometheus instance to scrape metrics from TKGI clusters.

To configure TKGI:

Return to the Ops Manager Installation Dashboard.
Click the Tanzu Kubernetes Grid Integrated Edition tile.
Select Host Monitoring.
Under Enable Telegraf Outputs?, select Yes.
Activate the Include etcd metrics checkbox to allow TKGI to send etcd server and debugging metrics to Healthwatch.
Activate the Include Kubernetes Controller Manager metrics checkbox to allow TKGI to send Kubernetes Controller Manager metrics to Healthwatch.
If you are using TKGI v1.14.2 or later, activate the Include Kubernetes Scheduler metrics checkbox to allow TKGI to send Kubernetes Scheduler metrics to Healthwatch.
For Setup Telegraf Outputs, provide the following TOML configuration file:
```
[[outputs.prometheus_client]]
  listen = ":10200"
  metric_version = 2
```
You must use 10200 as the listening port to allow the Prometheus instance to scrape Telegraf metrics from your TKGI clusters. For more information about creating a configuration file in TKGI, see the TKGI documentation.
Note: If you are configuring TKGI v1.12 or earlier, remove metric_version = 2 from the TOML configuration file.
Click Save.
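
If you manage several foundations on different TKGI versions, the version rule for the Telegraf output configuration can be expressed in a few lines. The following Python snippet is an illustrative sketch only: the function name `telegraf_output_toml` and the `(major, minor)` tuple used to represent a TKGI version are assumptions made for this example. It renders the TOML shown above and omits `metric_version = 2` for TKGI v1.12 or earlier.

```python
from typing import Tuple

LISTEN_PORT = 10200  # the listening port this topic requires


def telegraf_output_toml(tkgi_version: Tuple[int, int]) -> str:
    """Render the Telegraf output config for a (major, minor) TKGI version."""
    lines = [
        "[[outputs.prometheus_client]]",
        f'  listen = ":{LISTEN_PORT}"',
    ]
    # TKGI v1.12 or earlier must not include metric_version.
    if tkgi_version > (1, 12):
        lines.append("  metric_version = 2")
    return "\n".join(lines) + "\n"
```

Calling `telegraf_output_toml((1, 14))` produces the three-line configuration shown above, and `telegraf_output_toml((1, 12))` produces the same configuration without the `metric_version` line.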

For each plan you want to monitor:

Select the plan you want to monitor, for example, Plan 2.

For (Optional) Add-ons - Use with caution, enter the following YAML snippet to create the roles required to allow the Prometheus instance to scrape metrics from your TKGI clusters:

---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: healthwatch
    rules:
    - resources:
        - pods/proxy
        - pods
        - nodes
        - nodes/proxy
        - namespace/pods
        - endpoints
        - services
      verbs:
        - get
        - watch
        - list
      apiGroups:
        - ""
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: healthwatch
    roleRef:
      apiGroup: ""
      kind: ClusterRole
      name: healthwatch
    subjects:
    - apiGroup: ""
      kind: User
      name: healthwatch

If (Optional) Add-ons - Use with caution already contains other API resource definitions, append the above YAML snippet to the end of the existing resource definitions, followed by a newline character.

Click Save.
Select Errands.
Ensure that the Upgrade all clusters errand is set to run. Running this errand configures your TKGI clusters with the roles you created in the (Optional) Add-ons - Use with caution field of each plan you monitor.
Click Save.
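
The rule for adding the roles to an (Optional) Add-ons - Use with caution field that already contains resource definitions (append the snippet to the end, followed by a newline character) can be sketched as a small text operation. The following Python snippet is a hypothetical helper, not a TKGI or Healthwatch tool. It assumes the snippet you append begins with its own `---` document separator, as the YAML in this topic does, and it does not parse or validate the YAML.

```python
def append_addon(existing: str, snippet: str) -> str:
    """Append snippet after the existing definitions, ending with one newline."""
    snippet = snippet.strip("\n")
    if not existing.strip():
        # The field is empty, so the snippet is the whole content.
        return snippet + "\n"
    # Drop trailing blank lines from the existing text, then add the snippet.
    return existing.rstrip("\n") + "\n" + snippet + "\n"
```

For example, passing an existing definition and the `healthwatch` role snippet returns the existing definition first, then the snippet, then a single trailing newline. You can paste the result into the field for each plan you monitor.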

Troubleshooting TKGI Cluster Discovery Failure

TKGI cluster discovery can fail if the Prometheus instance fails to scrape metrics from your TKGI clusters. To troubleshoot TKGI cluster discovery failure, see Troubleshooting Missing TKGI Cluster Metrics in Troubleshooting Healthwatch.