Prometheus is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when a condition is observed to be true. Alertmanager handles alerts generated by Prometheus and routes them to their receiving endpoints. Deploy the TKG Extension for Prometheus to collect and view metrics for Tanzu Kubernetes clusters.

Extension Prerequisites

This topic describes how to deploy the TKG Extension v1.3.1 for Prometheus with Alertmanager for cluster monitoring. Adhere to the following requirements before deploying the extension: in addition to the general extension prerequisites, Prometheus monitoring requires a default persistent storage class. You can create a cluster with a default persistent storage class, or specify one in the Prometheus configuration file when deploying the extension. See Review Persistent Storage Requirements for TKG Extensions.

Deploy the Prometheus Extension

The TKG Extension for Prometheus installs several containers. For more information, see https://prometheus.io/.
| Container | Resource Type | Replicas | Description |
|---|---|---|---|
| prometheus-alertmanager | Deployment | 1 | Handles alerts sent by client applications such as the Prometheus server. |
| prometheus-cadvisor | DaemonSet | 5 | Analyzes and exposes resource usage and performance data from running containers. |
| prometheus-kube-state-metrics | Deployment | 1 | Monitors node status and capacity, replica-set compliance, pod, job, and cronjob status, and resource requests and limits. |
| prometheus-node-exporter | DaemonSet | 5 | Exporter for hardware and OS metrics exposed by kernels. |
| prometheus-pushgateway | Deployment | 1 | Service that allows you to push metrics from jobs that cannot be scraped. |
| prometheus-server | Deployment | 1 | Provides core functionality, including scraping, rule processing, and alerting. |
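The Pushgateway accepts metrics over a plain HTTP API: a client renders metrics in the Prometheus text exposition format and PUTs them under a job grouping. The following sketch shows how a batch job might push a metric; the in-cluster service name `prometheus-pushgateway` and port 9091 are assumptions for illustration, not values taken from this extension's manifests.

```python
import urllib.request


def exposition_payload(name: str, value: float, help_text: str = "") -> str:
    """Render one gauge metric in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def push_metric(gateway: str, job: str, name: str, value: float) -> None:
    """PUT the metric to the Pushgateway, grouped under the given job name."""
    req = urllib.request.Request(
        f"http://{gateway}/metrics/job/{job}",
        data=exposition_payload(name, value).encode("utf-8"),
        method="PUT",
    )
    req.add_header("Content-Type", "text/plain; version=0.0.4")
    urllib.request.urlopen(req)


# Example call from inside the cluster (service name and port are assumptions):
# push_metric("prometheus-pushgateway:9091", "nightly-backup",
#             "backup_duration_seconds", 42.5)
```

Prometheus then scrapes the Pushgateway itself, so metrics pushed this way appear alongside normally scraped targets.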
The extension is configured to pull the containers from the VMware public registry at https://projects.registry.vmware.com/. If you are using a private registry, change the endpoint URL in the data values and extension configurations to match. See Configure the Prometheus Extension.
  1. Verify that you have completed each of the extension prerequisites. See Extension Prerequisites.
  2. Change directory to the Prometheus extension.
    cd /tkg-extensions-v1.3.1+vmware.1/extensions/monitoring/prometheus
  3. Create the tanzu-system-monitoring namespace and Prometheus service account and role objects.
    kubectl apply -f namespace-role.yaml
  4. Create a Prometheus data values file.

    The example data values file provides the minimum configuration.

    cp prometheus-data-values.yaml.example prometheus-data-values.yaml
  5. Configure the Prometheus extension by updating prometheus-data-values.yaml. See Configure the Prometheus Extension for a description of the fields and options.
    If the cluster is not provisioned with a default persistent storage class, specify one in the data values file. Also, make sure the cluster has sufficient storage for the persistent volume claims.
    monitoring:
      prometheus_server:
        image:
          repository: projects.registry.vmware.com/tkg/prometheus
        pvc:
          storage_class: vwt-storage-policy
          storage: "8Gi"
      alertmanager:
        image:
          repository: projects.registry.vmware.com/tkg/prometheus
        pvc:
          storage_class: vwt-storage-policy
          storage: "8Gi"
    ...      
    
  6. Create the Prometheus secret using the prometheus-data-values file.
    kubectl create secret generic prometheus-data-values --from-file=values.yaml=prometheus-data-values.yaml -n tanzu-system-monitoring
    

    The prometheus-data-values secret is created in the tanzu-system-monitoring namespace. Verify using kubectl get secrets -n tanzu-system-monitoring.

  7. Deploy the Prometheus extension.
    kubectl apply -f prometheus-extension.yaml

    On success the Prometheus app is created: app.kappctrl.k14s.io/prometheus created.

  8. Check the status of the Prometheus app.
    kubectl get app prometheus -n tanzu-system-monitoring
    The status should change from Reconciling to Reconcile succeeded. If the status is Reconcile failed, see Troubleshooting.
  9. View detailed information on the Prometheus app.
    kubectl get app prometheus -n tanzu-system-monitoring -o yaml
  10. Verify Prometheus DaemonSets.
    kubectl get daemonsets -n tanzu-system-monitoring
  11. Verify Prometheus Deployments.
    kubectl get deployments -n tanzu-system-monitoring

Troubleshoot Prometheus Deployment

If the deployment or reconciliation fails, run kubectl get pods -A to view the status of the pods. Under normal conditions the pods are Running. A status of ImagePullBackOff or ErrImagePull means the container image could not be pulled from the registry; check the registry URL in the data values and the extension YAML files and make sure they are accurate. A status of CrashLoopBackOff means the container is starting and repeatedly crashing; check the container logs.

Check the container logs, where XXXXX is the unique pod suffix reported by kubectl get pods -A:
kubectl logs pod/prometheus-alertmanager-XXXXX -c prometheus-alertmanager -n tanzu-system-monitoring
kubectl logs pod/prometheus-server-XXXXX -c prometheus-server -n tanzu-system-monitoring

Update the Prometheus Extension

Update the configuration for a Prometheus extension that is deployed to a Tanzu Kubernetes cluster.

  1. Get Prometheus data values from the secret.
    kubectl get secret prometheus-data-values -n tanzu-system-monitoring -o 'go-template={{ index .data "values.yaml" }}' | base64 -d > prometheus-data-values.yaml
    
  2. Update the Prometheus data values secret.
    kubectl create secret generic prometheus-data-values --from-file=values.yaml=prometheus-data-values.yaml -n tanzu-system-monitoring -o yaml --dry-run | kubectl replace -f-
    
    The Prometheus extension will be reconciled with the updated data values.
    Note: By default, kapp-controller syncs apps every 5 minutes, so the update should take effect in 5 minutes or less. If you want the update to take effect sooner, change syncPeriod in prometheus-extension.yaml to a lower value and reapply the Prometheus extension using kubectl apply -f prometheus-extension.yaml.
  3. Check the status of the extension.
    kubectl get app prometheus -n tanzu-system-monitoring

    The status should change to Reconcile Succeeded once Prometheus is updated.

  4. View detailed status and troubleshoot.
    kubectl get app prometheus -n tanzu-system-monitoring -o yaml
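The syncPeriod mentioned in the note to step 2 is a field on the kapp-controller App resource defined in prometheus-extension.yaml. A sketch of the relevant fragment follows; only the fields shown here matter for the sync interval, and the rest of the spec in the shipped file is unchanged:

```yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: prometheus
  namespace: tanzu-system-monitoring
spec:
  syncPeriod: 1m   # the shipped extension uses 5m; lower it to pick up data-values changes sooner
```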

Delete the Prometheus Extension

Delete the Prometheus extension from a Tanzu Kubernetes cluster.
Note: Complete the steps in order. Do not delete the namespace, service account, and role objects before the Prometheus app is fully deleted. Doing so can lead to system errors.
Caution: Both Prometheus and Grafana use the same namespace. Deleting the namespace is destructive for any extension deployed there. If Grafana is deployed, do not delete the namespace before deleting Grafana.
  1. Change directory to the Prometheus extension.
    cd /extensions/monitoring/prometheus/
  2. Delete the Prometheus app.
    kubectl delete app prometheus -n tanzu-system-monitoring

    Expected result: app.kappctrl.k14s.io "prometheus" deleted.

  3. Verify that the Prometheus app is deleted.
    kubectl get app prometheus -n tanzu-system-monitoring

    Expected result: apps.kappctrl.k14s.io "prometheus" not found.

  4. Delete the tanzu-system-monitoring namespace and the Prometheus service account and role objects.
    Warning: Do not perform this step if Grafana is deployed.
    kubectl delete -f namespace-role.yaml
  5. If you want to redeploy Prometheus, remove the secret prometheus-data-values.
    kubectl delete secret prometheus-data-values -n tanzu-system-monitoring

    Expected result: secret "prometheus-data-values" deleted.

Upgrade the Prometheus Extension

If you have an existing Prometheus extension deployed, you can upgrade it to the latest version.
  1. Export the Prometheus configmap and save it as a backup.
    kubectl get configmap prometheus -n tanzu-system-monitoring -o 'go-template={{ index .data "prometheus.yaml" }}' > prometheus-configmap.yaml
    
  2. Delete the existing Prometheus deployment. See Delete the Prometheus Extension.
  3. Deploy the Prometheus extension. See Deploy the Prometheus Extension.

Configure the Prometheus Extension

The Prometheus configuration is set in /extensions/monitoring/prometheus/prometheus-data-values.yaml.
Table 1. Prometheus Configuration Parameters
| Parameter | Description | Type | Default |
|---|---|---|---|
| monitoring.namespace | Namespace where Prometheus will be deployed. | string | tanzu-system-monitoring |
| monitoring.create_namespace | Whether to create the namespace specified by monitoring.namespace. | boolean | false |
| monitoring.prometheus_server.config.prometheus_yaml | Kubernetes cluster monitoring configuration passed to Prometheus. | yaml file | prometheus.yaml |
| monitoring.prometheus_server.config.alerting_rules_yaml | Detailed alert rules defined in Prometheus. | yaml file | alerting_rules.yaml |
| monitoring.prometheus_server.config.recording_rules_yaml | Detailed recording rules defined in Prometheus. | yaml file | recording_rules.yaml |
| monitoring.prometheus_server.service.type | Type of service to expose Prometheus. Supported value: ClusterIP. | string | ClusterIP |
| monitoring.prometheus_server.enable_alerts.kubernetes_api | Enable SLO alerting for the Kubernetes API in Prometheus. | boolean | true |
| monitoring.prometheus_server.sc.aws_type | AWS type defined for the storage class on AWS. | string | gp2 |
| monitoring.prometheus_server.sc.aws_fsType | AWS file system type defined for the storage class on AWS. | string | ext4 |
| monitoring.prometheus_server.sc.allowVolumeExpansion | Whether volume expansion is allowed for the storage class on AWS. | boolean | true |
| monitoring.prometheus_server.pvc.annotations | Storage class annotations. | map | {} |
| monitoring.prometheus_server.pvc.storage_class | Storage class to use for the persistent volume claim. By default this is null and the default provisioner is used. | string | null |
| monitoring.prometheus_server.pvc.accessMode | Access mode for the persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany. | string | ReadWriteOnce |
| monitoring.prometheus_server.pvc.storage | Storage size for the persistent volume claim. | string | 8Gi |
| monitoring.prometheus_server.deployment.replicas | Number of Prometheus server replicas. | integer | 1 |
| monitoring.prometheus_server.image.repository | Location of the repository with the Prometheus image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
| monitoring.prometheus_server.image.name | Name of the Prometheus image. | string | prometheus |
| monitoring.prometheus_server.image.tag | Prometheus image tag. This value may need to be updated if you are upgrading the version. | string | v2.17.1_vmware.1 |
| monitoring.prometheus_server.image.pullPolicy | Prometheus image pull policy. | string | IfNotPresent |
| monitoring.alertmanager.config.slack_demo | Slack notification configuration for Alertmanager. | string | See below |
| monitoring.alertmanager.config.email_receiver | Email notification configuration for Alertmanager. | string | See below |
| monitoring.alertmanager.service.type | Type of service to expose Alertmanager. Supported value: ClusterIP. | string | ClusterIP |
| monitoring.alertmanager.image.repository | Location of the repository with the Alertmanager image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
| monitoring.alertmanager.image.name | Name of the Alertmanager image. | string | alertmanager |
| monitoring.alertmanager.image.tag | Alertmanager image tag. This value may need to be updated if you are upgrading the version. | string | v0.20.0_vmware.1 |
| monitoring.alertmanager.image.pullPolicy | Alertmanager image pull policy. | string | IfNotPresent |
| monitoring.alertmanager.pvc.annotations | Storage class annotations. | map | {} |
| monitoring.alertmanager.pvc.storage_class | Storage class to use for the persistent volume claim. By default this is null and the default provisioner is used. | string | null |
| monitoring.alertmanager.pvc.accessMode | Access mode for the persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany. | string | ReadWriteOnce |
| monitoring.alertmanager.pvc.storage | Storage size for the persistent volume claim. | string | 2Gi |
| monitoring.alertmanager.deployment.replicas | Number of Alertmanager replicas. | integer | 1 |
| monitoring.kube_state_metrics.image.repository | Repository containing the kube-state-metrics image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
| monitoring.kube_state_metrics.image.name | Name of the kube-state-metrics image. | string | kube-state-metrics |
| monitoring.kube_state_metrics.image.tag | kube-state-metrics image tag. This value may need to be updated if you are upgrading the version. | string | v1.9.5_vmware.1 |
| monitoring.kube_state_metrics.image.pullPolicy | kube-state-metrics image pull policy. | string | IfNotPresent |
| monitoring.kube_state_metrics.deployment.replicas | Number of kube-state-metrics replicas. | integer | 1 |
| monitoring.node_exporter.image.repository | Repository containing the node-exporter image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
| monitoring.node_exporter.image.name | Name of the node-exporter image. | string | node-exporter |
| monitoring.node_exporter.image.tag | node-exporter image tag. This value may need to be updated if you are upgrading the version. | string | v0.18.1_vmware.1 |
| monitoring.node_exporter.image.pullPolicy | node-exporter image pull policy. | string | IfNotPresent |
| monitoring.node_exporter.hostNetwork | If set to true, the pod can use the network namespace and network resources of the node. | boolean | false |
| monitoring.node_exporter.deployment.replicas | Number of node-exporter replicas. | integer | 1 |
| monitoring.pushgateway.image.repository | Repository containing the pushgateway image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
| monitoring.pushgateway.image.name | Name of the pushgateway image. | string | pushgateway |
| monitoring.pushgateway.image.tag | pushgateway image tag. This value may need to be updated if you are upgrading the version. | string | v1.2.0_vmware.1 |
| monitoring.pushgateway.image.pullPolicy | pushgateway image pull policy. | string | IfNotPresent |
| monitoring.pushgateway.deployment.replicas | Number of pushgateway replicas. | integer | 1 |
| monitoring.cadvisor.image.repository | Repository containing the cadvisor image. The default is the public VMware registry. Change this value if you are using a private repository (e.g., air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
| monitoring.cadvisor.image.name | Name of the cadvisor image. | string | cadvisor |
| monitoring.cadvisor.image.tag | cadvisor image tag. This value may need to be updated if you are upgrading the version. | string | v0.36.0_vmware.1 |
| monitoring.cadvisor.image.pullPolicy | cadvisor image pull policy. | string | IfNotPresent |
| monitoring.cadvisor.deployment.replicas | Number of cadvisor replicas. | integer | 1 |
| monitoring.ingress.enabled | Enable or disable ingress for Prometheus and Alertmanager. To use ingress, set this field to true and deploy Contour. To access Prometheus, update your local /etc/hosts with an entry that maps prometheus.system.tanzu to a worker node IP address. | boolean | false |
| monitoring.ingress.virtual_host_fqdn | Hostname for accessing Prometheus and Alertmanager. | string | prometheus.system.tanzu |
| monitoring.ingress.prometheus_prefix | Path prefix for Prometheus. | string | / |
| monitoring.ingress.alertmanager_prefix | Path prefix for Alertmanager. | string | /alertmanager/ |
| monitoring.ingress.tlsCertificate.tls.crt | Optional certificate for ingress if you want to use your own TLS certificate. A self-signed certificate is generated by default. | string | Generated cert |
| monitoring.ingress.tlsCertificate.tls.key | Optional certificate private key for ingress if you want to use your own TLS certificate. | string | Generated cert key |

The default for monitoring.alertmanager.config.slack_demo:

    slack_demo:
      name: slack_demo
      slack_configs:
      - api_url: https://hooks.slack.com
        channel: '#alertmanager-test'

The default for monitoring.alertmanager.config.email_receiver:

    email_receiver:
      name: email-receiver
      email_configs:
      - to: demo@tanzu.com
        send_resolved: false
        from: from-email@tanzu.com
        smarthost: smtp.example.com:25
        require_tls: false
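For example, to expose Prometheus through Contour ingress, the ingress-related fields above might be set in prometheus-data-values.yaml as follows. This is a sketch using the default values listed in Table 1; enabling ingress also requires Contour to be deployed in the cluster:

```yaml
monitoring:
  ingress:
    enabled: true
    virtual_host_fqdn: prometheus.system.tanzu
    prometheus_prefix: /
    alertmanager_prefix: /alertmanager/
```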
Table 2. Configurable Fields for the Prometheus Server Configmap
| Parameter | Description | Type | Default |
|---|---|---|---|
| evaluation_interval | Frequency at which to evaluate rules. | duration | 1m |
| scrape_interval | Frequency at which to scrape targets. | duration | 1m |
| scrape_timeout | How long until a scrape request times out. | duration | 10s |
| rule_files | Specifies a list of globs; rules and alerts are read from all matching files. | yaml file | |
| scrape_configs | A list of scrape configurations. | list | |
| job_name | The job name assigned to scraped metrics by default. | string | |
| kubernetes_sd_configs | List of Kubernetes service discovery configurations. | list | |
| relabel_configs | List of target relabel configurations. | list | |
| action | Action to perform based on regex matching. | string | |
| regex | Regular expression against which the extracted value is matched. | string | |
| source_labels | The source labels select values from existing labels. | string | |
| scheme | Configures the protocol scheme used for requests. | string | |
| tls_config | Configures the scrape request's TLS settings. | string | |
| ca_file | CA certificate to validate the API server certificate with. | filename | |
| insecure_skip_verify | Disable validation of the server certificate. | boolean | |
| bearer_token_file | Optional bearer token file authentication information. | filename | |
| replacement | Replacement value against which a regex replace is performed if the regular expression matches. | string | |
| target_label | Label to which the resulting value is written in a replace action. | string | |
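The fields in Table 2 combine in prometheus.yaml roughly as follows. This sketch shows one scrape_configs entry that discovers pods and keeps only those annotated prometheus.io/scrape: 'true'; the job name is illustrative, not taken from the shipped configmap:

```yaml
scrape_configs:
- job_name: kubernetes-pods
  scrape_interval: 1m
  scrape_timeout: 10s
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    regex: (.+)
    target_label: __metrics_path__
```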
Table 3. Configurable Fields for the Alertmanager Configmap
| Parameter | Description | Type | Default |
|---|---|---|---|
| resolve_timeout | The default value used by Alertmanager if the alert does not include EndsAt. | duration | 5m |
| smtp_smarthost | The SMTP host through which emails are sent. | string | |
| slack_api_url | The Slack webhook URL. | string | global.slack_api_url |
| pagerduty_url | The PagerDuty URL to send API requests to. | string | global.pagerduty_url |
| templates | Files from which custom notification template definitions are read. | file path | |
| group_by | Group the alerts by label. | string | |
| group_interval | Time to wait before sending a notification about new alerts that are added to a group. | duration | 5m |
| group_wait | How long to initially wait before sending a notification for a group of alerts. | duration | 30s |
| repeat_interval | How long to wait before re-sending a notification that has already been sent successfully for an alert. | duration | 4h |
| receivers | A list of notification receivers. | list | |
| severity | Severity of the incident. | string | |
| channel | The channel or user to send notifications to. | string | |
| html | The HTML body of the email notification. | string | |
| text | The text body of the email notification. | string | |
| send_resolved | Whether or not to notify about resolved alerts. | boolean | |
| email_configs | Configurations for email integration. | list | |
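As a sketch of how the Table 3 fields fit together in the Alertmanager configmap, the following routes all alerts to the slack_demo receiver shown earlier; the webhook URL is the placeholder from the data values example, not a working endpoint:

```yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack_demo
receivers:
- name: slack_demo
  slack_configs:
  - api_url: https://hooks.slack.com
    channel: '#alertmanager-test'
    send_resolved: true
```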
Annotations on pods allow fine-grained control of the scraping process. These annotations must be part of the pod metadata; they have no effect if set on other objects such as Services or DaemonSets.
Table 4. Prometheus Pod Annotations
| Pod Annotation | Description |
|---|---|
| prometheus.io/scrape | The default configuration scrapes all pods; if set to false, this annotation excludes the pod from the scraping process. |
| prometheus.io/path | If the metrics path is not /metrics, define it with this annotation. |
| prometheus.io/port | Scrape the pod on the indicated port instead of the pod's declared ports (the default is a port-free target if none are declared). |
The following DaemonSet manifest instructs Prometheus to scrape all of its pods on port 9102.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: weave
  labels:
    app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9102'
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: gcr.io/google-containers/fluentd-elasticsearch:1.20