Use this reference when configuring additional parameters of the Prometheus addon through the Custom Resources (CRs) tab.

Configurable parameters

Parameter

Description

Type

Default value

Note

prometheus.deployment.replicas

Number of Prometheus replicas.

integer

1

prometheus.deployment.containers.args

Prometheus container arguments. You can configure this parameter to change the retention time. For information about configuring Prometheus storage parameters, see the Prometheus documentation. Note: Longer retention times require more storage capacity. If you increase the retention time significantly, you might also need to increase the persistent volume claim size.

list

- --storage.tsdb.retention.time=42d

- --config.file=/etc/config/prometheus.yml

- --storage.tsdb.path=/data

- --web.console.libraries=/etc/prometheus/console_libraries2

- --web.console.templates=/etc/prometheus/consoles

- --web.enable-lifecycle

A customized value replaces the whole argument list, so make sure that the customized list contains all of these arguments.

prometheus.deployment.containers.resources

Prometheus container resource requests and limits.

map

{}

prometheus.deployment.podAnnotations

Pod annotations for the Prometheus deployment.

map

{}

prometheus.deployment.podLabels

Pod labels for the Prometheus deployment.

map

{}

prometheus.deployment.configMapReload.containers.args

Configmap-reload container arguments.

list

prometheus.deployment.configMapReload.containers.resources

Configmap-reload container resource requests and limits.

map

{}

prometheus.service.type

Type of service to expose Prometheus. 

Enum["ClusterIP","NodePort","LoadBalancer"]

ClusterIP

Immutable

prometheus.service.port

Prometheus service port.

Integer

80

Immutable

prometheus.service.targetPort

Prometheus service target port.

Integer

9090

Immutable

prometheus.service.labels

Prometheus service labels.

map

{}

prometheus.service.annotations

Prometheus service annotations.

map

{}

prometheus.pvc.annotations

PVC annotations.

map

{}

prometheus.pvc.storageClassName

Storage class to use for the persistent volume claim. The default storage class is used if it is not set.

string

Immutable, formatted on UI

prometheus.pvc.accessMode

Define access mode for persistent volume claim.

Enum["ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany"]

ReadWriteOnce

Immutable, formatted on UI

prometheus.pvc.storage

Define storage size for persistent volume claim.

string

150Gi

Immutable, formatted on UI

prometheus.config.prometheus_yml

For information about the global Prometheus configuration, see the Prometheus documentation.

YAML file

prometheus.yaml

prometheus.config.alerting_rules_yml

For information about the Prometheus alerting rules, see the Prometheus documentation.

YAML file

alerting_rules.yaml

prometheus.config.recording_rules_yml

For information about the Prometheus recording rules, see the Prometheus documentation.

YAML file

recording_rules.yaml

prometheus.config.alerts_yml

Additional Prometheus alerting rules are configured here.

YAML file

alerts_yml.yaml

prometheus.config.rules_yml

Additional Prometheus recording rules are configured here.

YAML file

rules_yml.yaml

alertmanager.deployment.replicas

Number of Alertmanager replicas.

Integer

1

alertmanager.deployment.containers.resources

Alertmanager container resource requests and limits.

map

{}

alertmanager.deployment.podAnnotations

Pod annotations for the Alertmanager deployment.

map

{}

alertmanager.deployment.podLabels

Pod labels for the Alertmanager deployment.

map

{}

alertmanager.service.type

Type of service to expose Alertmanager.

Enum["ClusterIP"]

ClusterIP

Immutable

alertmanager.service.port

Alertmanager service port.

Integer

80

Immutable

alertmanager.service.targetPort

Alertmanager service target port.

Integer

9093

Immutable

alertmanager.service.labels

Alertmanager service labels.

map

{}

alertmanager.service.annotations

Alertmanager service annotations.

map

{}

alertmanager.pvc.annotations

Alertmanager PVC annotations.

map

{}

alertmanager.pvc.storageClassName

Storage class to use for the persistent volume claim. The default provisioner is used if it is not set.

string

Immutable

alertmanager.pvc.accessMode

Define access mode for persistent volume claim.

Enum["ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany"]

ReadWriteOnce

Immutable

alertmanager.pvc.storage

Define storage size for persistent volume claim.

string

2Gi

Immutable

alertmanager.config.alertmanager_yml

For information about the global YAML configuration for Alertmanager, see the Prometheus documentation.

YAML file

alertmanager_yml

kube_state_metrics.deployment.replicas

Number of kube-state-metrics replicas.

integer

1

kube_state_metrics.deployment.containers.resources

kube-state-metrics container resource requests and limits.

map

{}

kube_state_metrics.deployment.podAnnotations

Pod annotations for the kube-state-metrics deployment.

map

{}

kube_state_metrics.deployment.podLabels

Pod labels for the kube-state-metrics deployment.

map

{}

kube_state_metrics.service.type

Type of service to expose kube-state-metrics.

Enum["ClusterIP"]

ClusterIP

Immutable

kube_state_metrics.service.port

kube-state-metrics service port.

Integer

80

Immutable

kube_state_metrics.service.targetPort

kube-state-metrics service target port.

Integer

8080

Immutable

kube_state_metrics.service.telemetryPort

kube-state-metrics service telemetry port.

Integer

81

Immutable

kube_state_metrics.service.telemetryTargetPort

kube-state-metrics service target telemetry port.

Integer

8081

Immutable

kube_state_metrics.service.labels

kube-state-metrics service labels.

map

{}

kube_state_metrics.service.annotations

kube-state-metrics service annotations.

map

{}

node_exporter.daemonset.replicas

Number of node-exporter replicas.

Integer

1

node_exporter.daemonset.containers.resources

node-exporter container resource requests and limits.

map

{}

node_exporter.daemonset.hostNetwork

Host networking requested for this pod.

boolean

false

node_exporter.daemonset.podAnnotations

Pod annotations for the node-exporter DaemonSet.

map

{}

node_exporter.daemonset.podLabels

Pod labels for the node-exporter DaemonSet.

map

{}

node_exporter.service.type

Type of service to expose node-exporter.

Enum["ClusterIP"]

ClusterIP

Immutable

node_exporter.service.port

node-exporter service port.

Integer

9100

Immutable

node_exporter.service.targetPort

node-exporter service target port.

Integer

9100

Immutable

node_exporter.service.labels

node-exporter service labels.

map

{}

node_exporter.service.annotations

node-exporter service annotations.

map

{}

pushgateway.deployment.replicas

Number of pushgateway replicas.

Integer

1

pushgateway.deployment.containers.resources

pushgateway container resource requests and limits.

map

{}

pushgateway.deployment.podAnnotations

Pod annotations for the pushgateway deployment.

map

{}

pushgateway.deployment.podLabels

Pod labels for the pushgateway deployment.

map

{}

pushgateway.service.type

Type of service to expose pushgateway.

Enum["ClusterIP"]

ClusterIP

Immutable

pushgateway.service.port

pushgateway service port.

Integer

9091

Immutable

pushgateway.service.targetPort

pushgateway service target port.

Integer

9091

Immutable

pushgateway.service.labels

pushgateway service labels.

map

{}

pushgateway.service.annotations

pushgateway service annotations.

map

{}

cadvisor.daemonset.replicas

Number of cadvisor replicas.

Integer

1

cadvisor.daemonset.containers.resources

cadvisor container resource requests and limits.

map

{}

cadvisor.daemonset.podAnnotations

Pod annotations for the cadvisor DaemonSet.

map

{}

cadvisor.daemonset.podLabels

Pod labels for the cadvisor DaemonSet.

map

{}

ingress.enabled

Enable or disable ingress for Prometheus and Alertmanager.

boolean

false

Immutable. Depends on the cert-manager addon and the Contour ingress controller.

ingress.virtual_host_fqdn

Hostname for accessing Prometheus and Alertmanager.

string

prometheus.system.tanzu

Immutable

ingress.prometheus_prefix

Path prefix for Prometheus.

string

/

Immutable

ingress.alertmanager_prefix

Path prefix for Alertmanager.

string

/alertmanager/

Immutable

ingress.prometheusServicePort

Prometheus service port to proxy traffic to.

Integer

80

Immutable

ingress.alertmanagerServicePort

Alertmanager service port to proxy traffic to.

Integer

80

Immutable

ingress.tlsCertificate.tls.crt

Optional certificate for ingress if you want to use your own TLS certificate. A self-signed certificate is generated by default.

string

Generated cert

tls.crt is a single key that contains a dot, not a nested crt field under tls.

ingress.tlsCertificate.tls.key

Optional certificate private key for ingress if you want to use your own TLS certificate.

string

Generated cert key

tls.key is a single key that contains a dot, not a nested key field under tls.

ingress.tlsCertificate.ca.crt

Optional CA certificate.

string

CA certificate

ca.crt is a single key that contains a dot, not a nested crt field under ca.
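
The dotted parameter names in the table map onto nested keys in the values.yaml of the addon CR. The following fragment is a minimal sketch of that mapping, not a complete configuration. The retention time, the resource figures, and the certificate placeholders are illustrative assumptions, the requests/limits layout follows the standard Kubernetes container resources structure, and the nesting of the ingress block is inferred from the parameter names rather than taken from a shipped sample.

```yaml
prometheus:
  deployment:
    containers:
      # A customized args list replaces the default list, so repeat every
      # default argument and change only the value you need.
      args:
        - --storage.tsdb.retention.time=15d   # assumed value; the default is 42d
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries2
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
      # Illustrative requests and limits; the default is an empty map ({}).
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 2Gi
ingress:
  enabled: true   # depends on the cert-manager addon and Contour
  virtual_host_fqdn: prometheus.system.tanzu
  tlsCertificate:
    # Flat keys that contain a dot, not "tls:" with nested "crt:" and "key:".
    tls.crt: |
      <PEM-encoded certificate>
    tls.key: |
      <PEM-encoded private key>
    ca.crt: |
      <PEM-encoded CA certificate>
```

In the CR, a fragment like this belongs under spec.config.stringData, as the content of values.yaml.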

The following is a sample Prometheus addon CR:

metadata:
  name: prometheus
spec:
  clusterRef:
    name: wc0
    namespace: wc0
  name: prometheus
  namespace: wc0
  config:
    stringData:
      values.yaml: |
        prometheus:
          deployment:
            replicas: 1
            containers:
              args:
                - --storage.tsdb.retention.time=5d
                - --config.file=/etc/config/prometheus.yml
                - --storage.tsdb.path=/data
                - --web.console.libraries=/etc/prometheus/console_libraries2
                - --web.console.templates=/etc/prometheus/consoles
                - --web.enable-lifecycle
          service:
            type: NodePort
            port: 80
            targetPort: 9090
          pvc:
            accessMode: ReadWriteOnce
            storage: 150Gi
          config:
            prometheus_yml: |
              global:
                evaluation_interval: 1m
                scrape_interval: 1m
                scrape_timeout: 10s
              rule_files:
              - /etc/config/alerting_rules.yml
              - /etc/config/recording_rules.yml
              - /etc/config/alerts
              - /etc/config/rules
              scrape_configs:
              - job_name: 'prometheus'
                scrape_interval: 5s
                static_configs:
                - targets: ['localhost:9090']
              - job_name: 'kube-state-metrics'
                static_configs:
                - targets: ['prometheus-kube-state-metrics.tanzu-system-monitoring.svc.cluster.local:8080']
              - job_name: 'node-exporter'
                static_configs:
                - targets: ['prometheus-node-exporter.tanzu-system-monitoring.svc.cluster.local:9100']
              - job_name: 'kubernetes-pods'
                kubernetes_sd_configs:
                - role: pod
                relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  action: replace
                  target_label: __metrics_path__
                  regex: (.+)
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: $1:$2
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: kubernetes_namespace
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name
                - source_labels: [__meta_kubernetes_pod_node_name]
                  action: replace
                  target_label: node
              - job_name: kubernetes-nodes-cadvisor
                kubernetes_sd_configs:
                - role: node
                relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - replacement: kubernetes.default.svc:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
                  source_labels:
                  - __meta_kubernetes_node_name
                  target_label: __metrics_path__
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              - job_name: kubernetes-apiservers
                kubernetes_sd_configs:
                - role: endpoints
                relabel_configs:
                - action: keep
                  regex: default;kubernetes;https
                  source_labels:
                  - __meta_kubernetes_namespace
                  - __meta_kubernetes_service_name
                  - __meta_kubernetes_endpoint_port_name
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              alerting:
                alertmanagers:
                - scheme: http
                  static_configs:
                  - targets:
                    - alertmanager.tanzu-system-monitoring.svc:80
                - kubernetes_sd_configs:
                    - role: pod
                  relabel_configs:
                  - source_labels: [__meta_kubernetes_namespace]
                    regex: default
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_label_app]
                    regex: prometheus
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_label_component]
                    regex: alertmanager
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
                    regex: .*
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_container_port_number]
                    regex:
                    action: drop
            alerting_rules_yml: |
              {}
            recording_rules_yml: |
              groups:
                - name: vmw-telco-namespace-cpu-rules
                  interval: 1m
                  rules:
                  - record: tkg_namespace_cpu_usage_seconds
                    expr: sum by (namespace) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_cpu_throttled_seconds
                    expr: sum by (namespace) (((rate(container_cpu_cfs_throttled_seconds_total[5m])) ) > 0 or  kube_pod_info < bool 0)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_cpu_request_core
                    expr: sum by (namespace) (kube_pod_container_resource_requests_cpu_cores)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_cpu_limits_core
                    expr: sum by (namespace) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-namespace-mem-rules
                  interval: 1m
                  rules:
                  - record: tkg_namespace_mem_usage_mb
                    expr: sum by (namespace) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_mem_rss_mb
                    expr: sum by (namespace) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_mem_workingset_mb
                    expr: sum by (namespace) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_mem_request_mb
                    expr: sum by (namespace) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_mem_limit_mb
                    expr: sum by (namespace) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024) )> 0 or kube_pod_info < bool 0)
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-namespace-network-rules
                  interval: 1m
                  rules:
                  - record: tkg_namespace_network_tx_bytes
                    expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_rx_bytes
                    expr: sum by (namespace) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_tx_packets
                    expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_rx_packets
                    expr: sum by (namespace) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_tx_drop_packets
                    expr: sum by (namespace) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_rx_drop_packets
                    expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_tx_errors
                    expr: sum by (namespace) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_rx_errors
                    expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_total_bytes
                    expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_total_packets
                    expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_total_drop_packets
                    expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_network_total_errors
                    expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-namespace-storage-rules
                  interval: 1m
                  rules:
                  - record: tkg_namespace_storage_pvc_bound
                    expr: sum by (namespace) ((kube_persistentvolumeclaim_status_phase{phase="Bound"}) > 0 or kube_pod_info < bool 0)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_storage_pvc_count
                    expr: sum by (namespace) ((kube_pod_spec_volumes_persistentvolumeclaims_info)> 0 or kube_pod_info < bool 0)
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-namespace-other-rules
                  interval: 1m
                  rules:
                  - record: tkg_namespace_pods_qty_count
                    expr: sum by (namespace) (kube_pod_info)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_pods_reboot_5m_count
                    expr: sum by (namespace) (changes(kube_pod_status_ready{condition="true"}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_namespace_pods_broken_count
                    expr: sum by (namespace) (kube_pod_status_ready{condition="false"})
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-pod-cpu-rules
                  interval: 1m
                  rules:
                  - record: tkg_pod_cpu_usage_seconds
                    expr: sum by (pod) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m])) * 100
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_cpu_request_core
                    expr: sum by (pod) (kube_pod_container_resource_requests_cpu_cores)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_cpu_limit_core
                    expr: sum by (pod) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_cpu_throttled_seconds
                    expr: sum by (pod) (((rate(container_cpu_cfs_throttled_seconds_total[5m])) ) > 0 or  kube_pod_info < bool 0)
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-pod-mem-rules
                  interval: 1m
                  rules:
                  - record: tkg_pod_mem_usage_mb
                    expr: sum by (pod) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_mem_rss_mb
                    expr: sum by (pod) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_mem_workingset_mb
                    expr: sum by (pod) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_mem_request_mb
                    expr: sum by (pod) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_mem_limit_mb
                    expr: sum by (pod) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024) )> 0 or kube_pod_info < bool 0)
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-pod-network-rules
                  interval: 1m
                  rules:
                  - record: tkg_pod_network_tx_bytes
                    expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_rx_bytes
                    expr: sum by (pod) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_tx_packets
                    expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_rx_packets
                    expr: sum by (pod) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_tx_dropped_packets
                    expr: sum by (pod) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_rx_dropped_packets
                    expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_tx_errors
                    expr: sum by (pod) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_rx_errors
                    expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_total_bytes
                    expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_total_packets
                    expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_total_drop_packets
                    expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_network_total_errors
                    expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-pod-other-rules
                  interval: 1m
                  rules:
                  - record: tkg_pod_health_container_restarts_1hr_count
                    expr: sum by (pod) (increase(kube_pod_container_status_restarts_total[1h]))
                    labels:
                      job: kubernetes-nodes-cadvisor
                  - record: tkg_pod_health_unhealthy_count
                    expr: min_over_time(sum by (pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m])
                    labels:
                      job: kubernetes-nodes-cadvisor
                - name: vmw-telco-node-cpu-rules
                  interval: 1m
                  rules:
                  - record: tkg_node_cpu_capacity_core
                    expr: sum by (node) (kube_node_status_capacity_cpu_cores)
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_cpu_allocate_core
                    expr: sum by (node) (kube_node_status_allocatable_cpu_cores)
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_cpu_usage_seconds
                    expr: (label_replace(sum by (instance) (rate(container_cpu_usage_seconds_total[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_cpu_throttled_seconds
                    expr: sum by (instance) (rate(container_cpu_cfs_throttled_seconds_total[5m]))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_cpu_request_core
                    expr: sum by (node) (kube_pod_container_resource_requests_cpu_cores)
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_cpu_limits_core
                    expr: sum by (node) (kube_pod_container_resource_limits_cpu_cores)
                    labels:
                      job: kubernetes-service-endpoints
                - name: vmw-telco-node-mem-rules
                  interval: 1m
                  rules:
                  - record: tkg_node_mem_capacity_mb
                    expr: sum by (node) (kube_node_status_capacity_memory_bytes / (1024*1024))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_allocate_mb
                    expr: sum by (node) (kube_node_status_allocatable_memory_bytes / (1024*1024))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_request_mb
                    expr: sum by (node) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_limits_mb
                    expr: sum by (node) (kube_pod_container_resource_limits_memory_bytes) / (1024*1024)
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_available_mb
                    expr: sum by (node) ((node_memory_MemAvailable_bytes / (1024*1024) ))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_free_mb
                    expr: sum by (node) ((node_memory_MemFree_bytes / (1024*1024) ))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_usage_mb
                    expr: (label_replace(sum by (instance) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_mem_free_pc
                    expr: sum ((node_memory_MemFree_bytes{job="kubernetes-pods"} / node_memory_MemTotal_bytes) *100) by (node)
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_oom_kill
                    expr: sum by(node) (node_vmstat_oom_kill)
                    labels:
                      job: kubernetes-service-endpoints
                - name: vmw-telco-node-network-rules
                  interval: 1m
                  rules:
                  - record: tkg_node_network_tx_bytes
                    expr: (label_replace(sum by (instance) (rate(container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_rx_bytes
                    expr: (label_replace(sum by (instance) (rate(container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_tx_packets
                    expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_rx_packets
                    expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_tx_dropped_packets
                    expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_rx_dropped_packets
                    expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_tx_errors
                    expr: (label_replace(sum by (instance) (rate(container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_rx_errors
                    expr: (label_replace(sum by (instance) (rate(container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_total_bytes
                    expr: label_replace((sum by (instance) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_total_packets
                    expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_total_drop_packets
                    expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_network_total_errors
                    expr: label_replace((sum by (instance) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                    labels:
                      job: kubernetes-service-endpoints
                - name: vmw-telco-node-other-rules
                  interval: 1m
                  rules:
                  - record: tkg_node_status_mempressure_count
                    expr: sum by (node) (kube_node_status_condition{condition="MemoryPressure",status="true"})
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_status_diskpressure_count
                    expr: sum by (node) (kube_node_status_condition{condition="DiskPressure",status="true"})
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_status_pidpressure_count
                    expr: sum by (node) (kube_node_status_condition{condition="PIDPressure",status="true"})
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_status_networkunavailable_count
                    expr: sum by (node) (kube_node_status_condition{condition="NetworkUnavailable",status="true"})
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_status_etcdb_bytes
                    expr: (label_replace(etcd_db_total_size_in_bytes, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")) )
                    labels:
                      job: kubernetes-service-endpoints
                  - record: tkg_node_status_apiserver_request_total
                    expr: sum((label_replace(apiserver_request_total, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")) )) by (node)
                    labels:
                      job: kubernetes-service-endpoints
        ingress:
          enabled: false
          virtual_host_fqdn: prometheus.system.tanzu
          prometheus_prefix: /
          alertmanager_prefix: /alertmanager/
          prometheusServicePort: 80
          alertmanagerServicePort: 80
        alertmanager:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 80
            targetPort: 9093
          pvc:
            accessMode: ReadWriteOnce
            storage: 2Gi
          config:
            alertmanager_yml: |
              global: {}
              receivers:
              - name: default-receiver
              templates:
              - '/etc/alertmanager/templates/*.tmpl'
              route:
                group_interval: 5m
                group_wait: 10s
                receiver: default-receiver
                repeat_interval: 3h
        kube_state_metrics:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 80
            targetPort: 8080
            telemetryPort: 81
            telemetryTargetPort: 8081
        node_exporter:
          daemonset:
            hostNetwork: false
            updatestrategy: RollingUpdate
          service:
            type: ClusterIP
            port: 9100
            targetPort: 9100
        pushgateway:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 9091
            targetPort: 9091
        cadvisor:
          daemonset:
            updatestrategy: RollingUpdate

In this sample CR:

  • The TSDB retention time in the prometheus.deployment.containers.args parameter is changed from the default 42 days to 5 days.

  • Some recording rules are added to prometheus.config.recording_rules_yml. Customize them or add more as needed.

  • The prometheus.service.type is changed to NodePort so that Prometheus can be integrated with external components (for example, vROPS or Grafana). See Prometheus service type.

Prometheus service type

By default, Prometheus is deployed with a service type of ClusterIP, which means it is not reachable from outside the cluster.

Three options are available for prometheus.service.type:

  • ClusterIP – the default configuration. The Prometheus service can be accessed only from within the workload cluster. The service can also be exposed through ingress, but this depends on the ingress controller and some additional manual configuration.

  • NodePort (recommended) – exposes the Prometheus service on a node port. TCA does not support specifying the node port number. Kubernetes allocates a random port in the high range between 30000 and 32767. To find the allocated port after configuration, view the service on the TCA cluster with the command kubectl get svc -n tanzu-system-monitoring prometheus-server. In the following output, prometheus-server is exposed on node port 32020, so external components can integrate with Prometheus at the URL http://<cluster-endpoint-ip>:32020.

    capv@cp0-control-plane-kz5k6 [ ~ ]$ kubectl get svc -n tanzu-system-monitoring prometheus-server
    NAME              TYPE     CLUSTER-IP   EXTERNAL-IP PORT(S)      AGE
    prometheus-server NodePort 100.65.8.127 <none>      80:32020/TCP 25s

  • LoadBalancer – uses the load balancer provider on Kubernetes to expose the service. VMware recommends the Avi load balancer, which is deployed by the load-balancer-and-ingress-service addon. Other load balancer providers can be used but are not supported by VMware. TCA does not support specifying a static VIP for the Prometheus service. A VIP from the default VIP pool is allocated to the service, and external components can then integrate with Prometheus at the URL http://<prometheus-VIP>.
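
For the NodePort option, the integration URL can be derived from the PORT(S) column of the kubectl output. The following shell sketch illustrates one way to do that. The helper names node_port_from_ports and prometheus_url and the address 192.0.2.10 are placeholders invented for this example, not part of TCA or kubectl, and the parsing assumes a PORT(S) value in the NodePort form port:nodePort/protocol.

```shell
#!/usr/bin/env bash
# Sketch only: build the external Prometheus URL for a NodePort service.

# Extract the node port from a PORT(S) value such as "80:32020/TCP".
# Assumes the NodePort form "<port>:<nodePort>/<protocol>".
node_port_from_ports() {
  local ports="$1"
  local rest="${ports#*:}"      # strip the service port and colon, e.g. "80:"
  printf '%s\n' "${rest%%/*}"   # strip the protocol suffix, e.g. "/TCP"
}

# Build the URL that an external component would use.
# $1 = cluster endpoint IP (placeholder value below), $2 = node port
prometheus_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}

# On a live cluster, the node port could be read directly instead of parsed:
#   kubectl get svc -n tanzu-system-monitoring prometheus-server \
#     -o jsonpath='{.spec.ports[0].nodePort}'

port="$(node_port_from_ports "80:32020/TCP")"
prometheus_url "192.0.2.10" "$port"
```

Replace 192.0.2.10 with the actual cluster endpoint IP. For the LoadBalancer option no port lookup is needed, because the service is reached on its service port at the allocated VIP.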