Getting Started with Kubernetes in Tanzu CloudHealth

Overview

Container technology provides organizations with several benefits:

Containers enable lightweight packaging of business logic that is portable across multiple platforms.
While a small team manages the underlying cluster, multiple software teams can use the cluster as a shared resource, thereby increasing efficiencies. Teams are able to deliver new features faster and with less friction.
Launching a new container takes milliseconds, allowing containers to be scheduled based on need.

However, as organizations use container technology, two significant governance challenges emerge.

Rightsizing: Multiple teams deploy new software on the clusters. As these teams scale, it becomes harder to manage the provisioning of the right hardware configurations (CPU and memory) for software workloads that change over time.
Resource Allocation: The rapidly changing workload requirements make it difficult to track how the shared resources are being consumed by various teams over time. As a result, organizations face challenges in attributing usage and cost to specific teams and departments, making it harder to track budgets and define governance.

Tanzu CloudHealth provides long-term, trended visibility into container resource utilization by service and team. The module helps you discover which services are consuming the most resources and identify opportunities for optimization.

Using the module, you can:

understand cluster resource utilization by services and teams
evaluate if you have the right mix of resources in your clusters to support your workloads
identify areas of waste or underutilization to reduce spend

Getting Started with Kubernetes in Tanzu CloudHealth

Deploy a lightweight container called a Collector in each cluster in your environment so that the Collector can gather metadata from your container environment.

If you are using Amazon ECS as your orchestration solution, see Getting Started with Amazon ECS in Tanzu CloudHealth.

There are two ways to get started with Kubernetes in Tanzu CloudHealth:

1. Set up the Helm Chart to automatically deploy the Tanzu CloudHealth Collector in each cluster in your environment.
2. Deploy the Tanzu CloudHealth Collector to each cluster using the deployment file.

Option 1 - Set up the Helm Chart to Automatically Deploy the Tanzu CloudHealth Collector

Use the Helm Chart to deploy a lightweight container called the Tanzu CloudHealth Collector agent into each Kubernetes cluster in your environment. The Collector gathers metadata from your environment to generate reports.

What Data Does the Collector Gather

Tanzu CloudHealth gathers two categories of data through the Collector:

Node-level hardware resources available in terms of Memory and CPU.
Workloads running in the cluster and their resource allocation, measured in terms of Memory and CPU.

Prerequisites

Kubernetes version 1.12 or later
Administrator privileges for deploying the Tanzu CloudHealth Collector in your cluster.
Helm 3.0+

Installing the Helm Chart

Locate your Tanzu CloudHealth API token: Navigate to Cluster > Add Cluster > Kubernetes (via Helm).
The Tanzu CloudHealth API token is listed on the line with $ export CHT_API_TOKEN=.

Install the Helm chart with the release name cloudhealth-collector:

$ helm repo add cloudhealth https://cloudhealth.github.io/helm/
$ helm install cloudhealth-collector --set apiToken=<CloudHealth API Token>,clusterName=<Cluster Name> cloudhealth/cloudhealth-collector

These commands deploy the Tanzu CloudHealth Collector on the Kubernetes cluster in the default configuration. To view the parameters that can be configured during installation, visit the Helm Chart GitHub page.

Results: The Helm Chart is installed and deploys the Tanzu CloudHealth Collector to the new cluster configured in your environment.
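
If you manage several clusters, the install step can be wrapped in a small shell function and repeated for each cluster. The sketch below is one possible approach and is not part of the official instructions. The function name install_collector, the DRY_RUN switch, and the choice to reuse the kube context name as the cluster name are all assumptions made for illustration. It assumes the cloudhealth Helm repository has already been added, and it uses Helm's --kube-context flag to select the target cluster.

```shell
# Sketch: install the Collector into the cluster behind one kube context.
# Assumptions (not from the steps above): the context name is reused as the
# clusterName value, and DRY_RUN=1 prints the command instead of running it.
install_collector() {
  ctx=$1      # kube context of the target cluster
  token=$2    # Tanzu CloudHealth API token
  if [ -z "$ctx" ] || [ -z "$token" ]; then
    echo "usage: install_collector <kube-context> <api-token>" >&2
    return 1
  fi
  # Build the same helm install command shown above, plus --kube-context.
  set -- helm install cloudhealth-collector \
    --kube-context "$ctx" \
    --set "apiToken=$token,clusterName=$ctx" \
    cloudhealth/cloudhealth-collector
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # review the command without touching the cluster
  else
    "$@"
  fi
}
```

With DRY_RUN set to 1, calling install_collector staging "$CHT_API_TOKEN" prints the command for a context named staging (a hypothetical name) so that you can review it before running it for real.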

Option 2 - Deploy the Tanzu CloudHealth Collector to each cluster using the deployment file

Configure a collector for each Kubernetes cluster. Next, deploy the collectors so that Tanzu CloudHealth can start gathering metrics on your container environment.

In the Tanzu CloudHealth Platform, from the left menu, select Setup > Containers > Clusters > Add Cluster.
In the dialog box that appears, enter a friendly cluster Name. Tanzu CloudHealth uses this name to customize the installation instructions. The name also appears in Tanzu CloudHealth interactive reports.
Click Add Cluster.
Follow the instructions that appear in the pop-up to deploy the Collector in your cluster. The instructions contain cluster-specific details.

When you run the commands, do not place an equal sign between a variable and its value. Reference variables as $VARIABLENAME, not %VARIABLENAME%.

You can return to the Collector deploy instructions at any time. The cluster created will be displayed on the page at Setup > Containers > Clusters.

Results: The Tanzu CloudHealth Collector is deployed to the cluster.

Collector Status

The collector starts collecting metrics from the cluster as soon as it is deployed, but it does not backfill historical information. The status of the cluster changes to Healthy after Tanzu CloudHealth starts receiving data from the collector.

It can take up to 24 hours for meaningful visualizations to appear in the Tanzu CloudHealth platform after the collector has been deployed.

On the Setup > Containers > Clusters page, clusters can have one of three statuses:

Unknown: The Collector has either never sent data to Tanzu CloudHealth, or has not sent data to Tanzu CloudHealth in more than two days.
Healthy: The Collector is successfully deployed on the Kubernetes container cluster and has successfully pushed data to Tanzu CloudHealth in the last 15 minutes.
Unhealthy: The Collector has contacted Tanzu CloudHealth within the last two days, but not in the last 15 minutes.
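
The three statuses above amount to a simple rule based on how long ago the Collector last sent data. The sketch below restates that rule as a shell function to make the thresholds explicit. It is an illustration only, not how the platform computes the status, and the function name cluster_status and its input convention are assumptions.

```shell
# Sketch of the status rules above.
# Input: seconds since the Collector last sent data, or an empty string if
# the Collector has never sent data (hypothetical input convention).
cluster_status() {
  age=$1
  if [ -z "$age" ] || [ "$age" -gt 172800 ]; then
    echo "Unknown"      # never sent data, or silent for more than two days
  elif [ "$age" -le 900 ]; then
    echo "Healthy"      # data received within the last 15 minutes
  else
    echo "Unhealthy"    # seen within two days, but not in the last 15 minutes
  fi
}
```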

You can confirm that the Collector is collecting metrics through the Metrics column:

Active:
Inactive:

(Optional) Group and Distribute Cluster Costs

Use the Tanzu CloudHealth platform to organize your container assets using Perspectives. The goal of this organization is to map specific container tasks to the container assets where those tasks are run.

Gather the hardware supporting your clusters into one or more Perspective Groups. Include dedicated compute and storage resources as well as shared resources.
Gather containerized workloads into meaningful Perspective Groups. It is not necessary for this Group to belong to the same Perspective that contains the previous Group of hardware resources.
Define one or more container cost distribution rules to allocate the cost of the hardware Groups to workload Groups. Define multiple rules for more fine-grained control over the allocation.

See Configure Container Infrastructure for Cost Analysis for more details about grouping and distributing cluster costs.

Troubleshoot Kubernetes Configuration Issues

Verify That Collectors are Working Correctly

Verify Configuration

kubectl get --namespace cloudhealth pods
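
The command above lists each pod with a STATUS column. As a minimal sketch, assuming a working Collector pod reports the standard Kubernetes status Running, the following helper prints the names of pods in any other state. The function name pods_not_running is hypothetical. It expects the pod listing on standard input without the header row, which kubectl provides through the --no-headers flag.

```shell
# Sketch: print the names of pods whose STATUS column is not "Running".
# Expects `kubectl get pods --no-headers` output on stdin, where the
# columns are NAME, READY, STATUS, RESTARTS, AGE.
pods_not_running() {
  awk '$3 != "Running" { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get --namespace cloudhealth pods --no-headers | pods_not_running
```

No output suggests that every pod in the namespace is running.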

Inspect Collector Logs

Identify the name of the pod.

kubectl get --namespace cloudhealth pods

Append the name of the pod to the following command to get logs for that pod. Kubernetes generates the pod name at runtime.
```
kubectl logs --namespace cloudhealth <pod-name>
```

Verify Metrics Collection

Ensure that the Collector is running the latest version and is able to collect metrics. On the Containers Cluster page, confirm that the Metrics column shows Healthy.

If the Metrics column displays an Unhealthy status, update the Collector manually or using Helm:

Helm: $ helm upgrade cloudhealth-collector cloudhealth/cloudhealth-collector
Manual update: Delete and reinstall the existing Collector. For more information, select a cluster > Collector Deployment Instructions and follow the instructions in Updating the Collector Manually.

Troubleshoot the Kubernetes collector agent

Inspect the Collector Agent Logs

Identify the name of the pod.

kubectl get --namespace cloudhealth pods

Append the name of the pod to the following command to get logs for that pod. Kubernetes generates the pod name at runtime.
```
kubectl logs --namespace cloudhealth <pod-name>
```

Example Output:


  CHT Containers Collector Environment

  CHT_API_TOKEN: ****

  CHT_CLUSTER_NAME: testCluster

  JAVA_OPTS:
    -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
    -XX:MaxRAMFraction=1 -XX:+ExitOnOutOfMemoryError -Xms10M -Xmx891M

  CHT_INTERVAL: 900

=========================================================================

CHT Containers Collector : version DIRTY starting
I, [2021-02-01T23:19:39.211985 #11]  INFO -- : loaded K8S config from  with master @ https://kubernetes.default.svc/ with ca certificate /var/run/secrets/kubernetes.io/serviceaccount/ca.crt with client_cert_file  with client key file  with trust_certs false with trust store file  with proxy username
D, [2021-02-01T23:19:39.732649 #11] DEBUG -- : Ensuring cache directory is present: /tmp/cache
D, [2021-02-01T23:19:39.793997 #11] DEBUG -- : Fetching state...
D, [2021-02-01T23:19:39.798196 #11] DEBUG -- : Connecting to URL: https://kubernetes.default.svc/api/v1/nodes
D, [2021-02-01T23:19:40.588497 #11] DEBUG -- : Connecting to URL: https://kubernetes.default.svc/api/v1/pods
D, [2021-02-01T23:19:40.610918 #11] DEBUG -- : Connecting to URL: https://kubernetes.default.svc/api/v1/services
D, [2021-02-01T23:19:40.618938 #11] DEBUG -- : Posting state...
D, [2021-02-01T23:19:40.622422 #11] DEBUG -- : Posting state from 2021-02-01 23:19:39 +0000: /tmp/cache/kubernetes_nodes_1612221579 (size: 1685)
E, [2021-02-01T23:19:40.703980 #11] ERROR -- : Not Found [404]: Failed to post cluster state to http://10.108.1.248:9292/v1/containers/kubernetes/state?auth_token=API_TOKEN_REDACTED&cluster_id=testCluster&sample_time=1612221579.0. Error: Could not find: http://127.0.0.1:8500/v1/kv/customer_container/blobs/

From the example logs above you can derive the following:

The Collector fetches state from three Kubernetes API endpoints to collect data on nodes, pods, and services. This data is stored in three separate cache files located in /tmp/cache.
Next, the Collector makes three requests to the Tanzu CloudHealth collection endpoint to post the nodes, pods, and services data from those cache files. In the first post, it attempts to send the node data from the cache file /tmp/cache/kubernetes_nodes_1612221579, whose size is shown in (size: 1685). If this value is 0, Kubernetes returned no data and you need to investigate issues with the cluster.
The final ERROR line indicates a problem with the request to the Tanzu CloudHealth collection endpoint.
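
The two signals described above, ERROR lines and posts with a size of 0, can be counted with a short filter. This is a minimal sketch that assumes the log format matches the example output, where error lines start with "E, [" and posts appear as "Posting state from ... (size: N)". The function name check_collector_log is hypothetical.

```shell
# Sketch: summarize a Collector log read from stdin.
# Counts ERROR lines and "Posting state" lines that report (size: 0).
check_collector_log() {
  awk '
    /^E, \[/ { errors++ }
    /Posting state from/ && /\(size: 0\)/ { empty++ }
    END { printf "errors=%d empty_posts=%d\n", errors, empty }
  '
}

# Usage (requires access to the cluster):
#   kubectl logs --namespace cloudhealth <pod-name> | check_collector_log
```

A nonzero empty_posts count points toward the cluster side, while a nonzero errors count with normal post sizes points toward the connection to the collection endpoint.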

Validate Kubernetes Cluster

To validate the Kubernetes cluster, run the following commands:

Verify if the pods are running: kubectl get pods --all-namespaces -o wide | grep cloudhealth
Manually send a request to the Kubernetes API to ensure that it is reachable. For instance, request https://kubernetes.default.svc/api/v1/nodes.

Validate Collector Agent Connectivity

To validate connectivity between the collector agent and the Tanzu CloudHealth collection endpoint, run the following commands:

Use nmap or netcat to check that port 443 is reachable:
- nmap -p 443 api.cloudhealthtech.com
- nc -zv api.cloudhealthtech.com 443
Run CURL commands against the collection endpoint manually:
- curl -v -X GET https://containers-api.edge.cloudhealthtech.com/api/v1/health to request the collection health endpoint.
  The expected response: {"status":"healthy","time":"Fri, 29 Jan 2021 22:48:10 GMT"}
- curl --header "Content-Type: application/json" --request POST "https://containers-api.edge.cloudhealthtech.com/v1/containers/kubernetes/state?cluster_id=INSERT_CLUSTER_ID_HERE&auth_token=INSERT_AUTHENTICATION_TOKEN_HERE" to mock the request made by the collector agent (without any Kubernetes data cache payload). The URL is quoted so that the shell does not interpret the & character.
  Replace the cluster_id and auth_token values as necessary.
  The expected response (since we sent no payload): {"messages": "Required request body is missing"}.
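
The health check above can be scripted by testing the response body for the healthy status. This is a minimal sketch that assumes the response has the shape shown in the expected response above. The function name is_healthy is hypothetical.

```shell
# Sketch: succeed (exit 0) if the JSON read from stdin reports a healthy status.
# Tolerates optional whitespace around the colon.
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Usage (requires network access to the collection endpoint):
#   curl -s https://containers-api.edge.cloudhealthtech.com/api/v1/health | is_healthy \
#     && echo "collection endpoint reachable and healthy"
```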

How to Avoid Common Problems

Ensure that the cluster original address:

Can be resolved (if it is a DNS name) within the cluster.
Is accessible from within the cluster by reviewing firewalls and other networking restrictions.

Ensure that the Collector:

Is scheduled successfully and that there are sufficient resources in the cluster.
Is not stopped due to an out-of-memory (OOM) error. If the Collector is stopped for this reason, increase the memory limit on the container.
Is running the latest version.
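
To check for the OOM case, note that Kubernetes records the reason OOMKilled in a container's last terminated state. The sketch below is one way to look for it, under the assumption that the Collector runs in the cloudhealth namespace used in the commands above. The function name oom_killed_pods is hypothetical. It reads tab-separated lines containing a pod name and its last termination reasons.

```shell
# Sketch: print pods whose last container termination reason was OOMKilled.
# Expects lines of "<pod-name><TAB><reasons>" on stdin.
oom_killed_pods() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage (requires access to the cluster); the jsonpath template prints one
# "<pod-name><TAB><reasons>" line per pod:
#   kubectl get pods --namespace cloudhealth \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```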