Use this reference when configuring additional parameters of the Prometheus addon through the Custom Resources (CRs) tab.

Configurable parameters

| Parameter | Description | Type | Default value | Note |
|---|---|---|---|---|
| prometheus.deployment.replicas | Number of Prometheus replicas. | integer | 1 | |
| prometheus.deployment.containers.args | Prometheus container arguments. Configure this parameter to change the retention time. For information about configuring Prometheus storage parameters, see the Prometheus documentation. Note: longer retention times require more storage capacity; it might be necessary to increase the persistent volume claim size if you significantly increase the retention time. | list | --storage.tsdb.retention.time=42d --config.file=/etc/config/prometheus.yml --storage.tsdb.path=/data --web.console.libraries=/etc/prometheus/console_libraries2 --web.console.templates=/etc/prometheus/consoles --web.enable-lifecycle | The supplied list replaces the entire default argument list, so a customized list must contain all of these arguments. |
| prometheus.deployment.containers.resources | Prometheus container resource requests and limits. | map | {} | |
| prometheus.deployment.podAnnotations | Prometheus deployment pod annotations. | map | {} | |
| prometheus.deployment.podLabels | Prometheus deployment pod labels. | map | {} | |
| prometheus.deployment.configMapReload.containers.args | configmap-reload container arguments. | list | | |
| prometheus.deployment.configMapReload.containers.resources | configmap-reload container resource requests and limits. | map | {} | |
| prometheus.service.type | Type of service used to expose Prometheus. | Enum["ClusterIP","NodePort","LoadBalancer"] | ClusterIP | Immutable |
| prometheus.service.port | Prometheus service port. | integer | 80 | Immutable |
| prometheus.service.targetPort | Prometheus service target port. | integer | 9090 | Immutable |
| prometheus.service.labels | Prometheus service labels. | map | {} | |
| prometheus.service.annotations | Prometheus service annotations. | map | {} | |
| prometheus.pvc.annotations | PVC annotations. | map | {} | |
| prometheus.pvc.storageClassName | Storage class to use for the persistent volume claim. If not set, the default storage class is used. | string | | Immutable, formatted on UI |
| prometheus.pvc.accessMode | Access mode for the persistent volume claim. | Enum["ReadWriteOnce","ReadOnlyMany","ReadWriteMany"] | ReadWriteOnce | Immutable, formatted on UI |
| prometheus.pvc.storage | Storage size for the persistent volume claim. | string | 150Gi | Immutable, formatted on UI |
| prometheus.config.prometheus_yml | Global Prometheus configuration. For details, see the Prometheus documentation. | YAML file | prometheus.yaml | |
| prometheus.config.alerting_rules_yml | Prometheus alerting rules. For details, see the Prometheus documentation. | YAML file | alerting_rules.yaml | |
| prometheus.config.recording_rules_yml | Prometheus recording rules. For details, see the Prometheus documentation. | YAML file | recording_rules.yaml | |
| prometheus.config.alerts_yml | Additional Prometheus alerting rules. | YAML file | alerts_yml.yaml | |
| prometheus.config.rules_yml | Additional Prometheus recording rules. | YAML file | rules_yml.yaml | |
| alertmanager.deployment.replicas | Number of Alertmanager replicas. | integer | 1 | |
| alertmanager.deployment.containers.resources | Alertmanager container resource requests and limits. | map | {} | |
| alertmanager.deployment.podAnnotations | Alertmanager deployment pod annotations. | map | {} | |
| alertmanager.deployment.podLabels | Alertmanager deployment pod labels. | map | {} | |
| alertmanager.service.type | Type of service used to expose Alertmanager. | Enum["ClusterIP"] | ClusterIP | Immutable |
| alertmanager.service.port | Alertmanager service port. | integer | 80 | Immutable |
| alertmanager.service.targetPort | Alertmanager service target port. | integer | 9093 | Immutable |
| alertmanager.service.labels | Alertmanager service labels. | map | {} | |
| alertmanager.service.annotations | Alertmanager service annotations. | map | {} | |
| alertmanager.pvc.annotations | Alertmanager PVC annotations. | map | {} | |
| alertmanager.pvc.storageClassName | Storage class to use for the persistent volume claim. If not set, the default provisioner is used. | string | | Immutable |
| alertmanager.pvc.accessMode | Access mode for the persistent volume claim. | Enum["ReadWriteOnce","ReadOnlyMany","ReadWriteMany"] | ReadWriteOnce | Immutable |
| alertmanager.pvc.storage | Storage size for the persistent volume claim. | string | 2Gi | Immutable |
| alertmanager.config.alertmanager_yml | Global YAML configuration for Alertmanager. For details, see the Prometheus documentation. | YAML file | alertmanager_yml | |
| kube_state_metrics.deployment.replicas | Number of kube-state-metrics replicas. | integer | 1 | |
| kube_state_metrics.deployment.containers.resources | kube-state-metrics container resource requests and limits. | map | {} | |
| kube_state_metrics.deployment.podAnnotations | kube-state-metrics deployment pod annotations. | map | {} | |
| kube_state_metrics.deployment.podLabels | kube-state-metrics deployment pod labels. | map | {} | |
| kube_state_metrics.service.type | Type of service used to expose kube-state-metrics. | Enum["ClusterIP"] | ClusterIP | Immutable |
| kube_state_metrics.service.port | kube-state-metrics service port. | integer | 80 | Immutable |
| kube_state_metrics.service.targetPort | kube-state-metrics service target port. | integer | 8080 | Immutable |
| kube_state_metrics.service.telemetryPort | kube-state-metrics service telemetry port. | integer | 81 | Immutable |
| kube_state_metrics.service.telemetryTargetPort | kube-state-metrics service telemetry target port. | integer | 8081 | Immutable |
| kube_state_metrics.service.labels | kube-state-metrics service labels. | map | {} | |
| kube_state_metrics.service.annotations | kube-state-metrics service annotations. | map | {} | |
| node_exporter.daemonset.replicas | Number of node-exporter replicas. | integer | 1 | |
| node_exporter.daemonset.containers.resources | node-exporter container resource requests and limits. | map | {} | |
| node_exporter.daemonset.hostNetwork | Whether host networking is requested for the pod. | boolean | false | |
| node_exporter.daemonset.podAnnotations | node-exporter daemonset pod annotations. | map | {} | |
| node_exporter.daemonset.podLabels | node-exporter daemonset pod labels. | map | {} | |
| node_exporter.service.type | Type of service used to expose node-exporter. | Enum["ClusterIP"] | ClusterIP | Immutable |
| node_exporter.service.port | node-exporter service port. | integer | 9100 | Immutable |
| node_exporter.service.targetPort | node-exporter service target port. | integer | 9100 | Immutable |
| node_exporter.service.labels | node-exporter service labels. | map | {} | |
| node_exporter.service.annotations | node-exporter service annotations. | map | {} | |
| pushgateway.deployment.replicas | Number of pushgateway replicas. | integer | 1 | |
| pushgateway.deployment.containers.resources | pushgateway container resource requests and limits. | map | {} | |
| pushgateway.deployment.podAnnotations | pushgateway deployment pod annotations. | map | {} | |
| pushgateway.deployment.podLabels | pushgateway deployment pod labels. | map | {} | |
| pushgateway.service.type | Type of service used to expose pushgateway. | Enum["ClusterIP"] | ClusterIP | Immutable |
| pushgateway.service.port | pushgateway service port. | integer | 9091 | Immutable |
| pushgateway.service.targetPort | pushgateway service target port. | integer | 9091 | Immutable |
| pushgateway.service.labels | pushgateway service labels. | map | {} | |
| pushgateway.service.annotations | pushgateway service annotations. | map | {} | |
| cadvisor.daemonset.replicas | Number of cAdvisor replicas. | integer | 1 | |
| cadvisor.daemonset.containers.resources | cAdvisor container resource requests and limits. | map | {} | |
| cadvisor.daemonset.podAnnotations | cAdvisor daemonset pod annotations. | map | {} | |
| cadvisor.daemonset.podLabels | cAdvisor daemonset pod labels. | map | {} | |
| ingress.enabled | Enable or disable ingress for Prometheus and Alertmanager. | boolean | false | Immutable; depends on the cert-manager addon and the Contour ingress controller |
| ingress.virtual_host_fqdn | Hostname for accessing Prometheus and Alertmanager. | string | prometheus.system.tanzu | Immutable |
| ingress.prometheus_prefix | Path prefix for Prometheus. | string | / | Immutable |
| ingress.alertmanager_prefix | Path prefix for Alertmanager. | string | /alertmanager/ | Immutable |
| ingress.prometheusServicePort | Prometheus service port to proxy traffic to. | integer | 80 | Immutable |
| ingress.alertmanagerServicePort | Alertmanager service port to proxy traffic to. | integer | 80 | Immutable |
| ingress.tlsCertificate.tls.crt | Optional certificate for ingress, if you want to use your own TLS certificate. By default, a self-signed certificate is generated. | string | Generated cert | tls.crt is a key, not nested. |
| ingress.tlsCertificate.tls.key | Optional certificate private key for ingress, if you want to use your own TLS certificate. | string | Generated cert key | tls.key is a key, not nested. |
| ingress.tlsCertificate.ca.crt | Optional CA certificate. | string | CA certificate | ca.crt is a key, not nested. |
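For example, the resources parameters take a standard Kubernetes requests/limits map. A minimal sketch of setting prometheus.deployment.containers.resources in values.yaml (the sizes below are illustrative assumptions, not defaults; the actual default is an empty map):

```yaml
prometheus:
  deployment:
    containers:
      resources:
        # Illustrative sizes only; tune to your cluster and retention time.
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 2Gi
```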

A sample Prometheus addon CR is:

metadata:
  name: prometheus
spec:
  clusterRef:
    name: wc0
    namespace: wc0
  name: prometheus
  namespace: wc0
  config:
    stringData:
      values.yaml: |
        prometheus:
          deployment:
            replicas: 1
            containers:
              args:
              - --storage.tsdb.retention.time=5d
              - --config.file=/etc/config/prometheus.yml
              - --storage.tsdb.path=/data
              - --web.console.libraries=/etc/prometheus/console_libraries2
              - --web.console.templates=/etc/prometheus/consoles
              - --web.enable-lifecycle
          service:
            type: NodePort
            port: 80
            targetPort: 9090
          pvc:
            accessMode: ReadWriteOnce
            storage: 150Gi
          config:
            prometheus_yml: |
              global:
                evaluation_interval: 1m
                scrape_interval: 1m
                scrape_timeout: 10s
              rule_files:
              - /etc/config/alerting_rules.yml
              - /etc/config/recording_rules.yml
              - /etc/config/alerts
              - /etc/config/rules
              scrape_configs:
              - job_name: 'prometheus'
                scrape_interval: 5s
                static_configs:
                - targets: ['localhost:9090']
              - job_name: 'kube-state-metrics'
                static_configs:
                - targets: ['prometheus-kube-state-metrics.tanzu-system-monitoring.svc.cluster.local:8080']
              - job_name: 'node-exporter'
                static_configs:
                - targets: ['prometheus-node-exporter.tanzu-system-monitoring.svc.cluster.local:9100']
              - job_name: 'kubernetes-pods'
                kubernetes_sd_configs:
                - role: pod
                relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  action: replace
                  target_label: __metrics_path__
                  regex: (.+)
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: $1:$2
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: kubernetes_namespace
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name
                - source_labels: [__meta_kubernetes_pod_node_name]
                  action: replace
                  target_label: node
              - job_name: kubernetes-nodes-cadvisor
                kubernetes_sd_configs:
                - role: node
                relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - replacement: kubernetes.default.svc:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
                  source_labels:
                  - __meta_kubernetes_node_name
                  target_label: __metrics_path__
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              - job_name: kubernetes-apiservers
                kubernetes_sd_configs:
                - role: endpoints
                relabel_configs:
                - action: keep
                  regex: default;kubernetes;https
                  source_labels:
                  - __meta_kubernetes_namespace
                  - __meta_kubernetes_service_name
                  - __meta_kubernetes_endpoint_port_name
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              alerting:
                alertmanagers:
                - scheme: http
                  static_configs:
                  - targets:
                    - alertmanager.tanzu-system-monitoring.svc:80
                - kubernetes_sd_configs:
                  - role: pod
                  relabel_configs:
                  - source_labels: [__meta_kubernetes_namespace]
                    regex: default
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_label_app]
                    regex: prometheus
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_label_component]
                    regex: alertmanager
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
                    regex: .*
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_container_port_number]
                    regex:
                    action: drop
            alerting_rules_yml: |
              {}
            recording_rules_yml: |
              groups:
              - name: vmw-telco-namespace-cpu-rules
                interval: 1m
                rules:
                - record: tkg_namespace_cpu_usage_seconds
                  expr: sum by (namespace) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_cpu_throttled_seconds
                  expr: sum by (namespace) (((rate(container_cpu_cfs_throttled_seconds_total[5m]))) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_cpu_request_core
                  expr: sum by (namespace) (kube_pod_container_resource_requests_cpu_cores)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_cpu_limits_core
                  expr: sum by (namespace) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-mem-rules
                interval: 1m
                rules:
                - record: tkg_namespace_mem_usage_mb
                  expr: sum by (namespace) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_rss_mb
                  expr: sum by (namespace) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_workingset_mb
                  expr: sum by (namespace) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_request_mb
                  expr: sum by (namespace) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_limit_mb
                  expr: sum by (namespace) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024)) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-network-rules
                interval: 1m
                rules:
                - record: tkg_namespace_network_tx_bytes
                  expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_bytes
                  expr: sum by (namespace) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_tx_packets
                  expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_packets
                  expr: sum by (namespace) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_tx_drop_packets
                  expr: sum by (namespace) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_drop_packets
                  expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_tx_errors
                  expr: sum by (namespace) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_errors
                  expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_bytes
                  expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_packets
                  expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_drop_packets
                  expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_errors
                  expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-storage-rules
                interval: 1m
                rules:
                - record: tkg_namespace_storage_pvc_bound
                  expr: sum by (namespace) ((kube_persistentvolumeclaim_status_phase{phase="Bound"}) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_storage_pvc_count
                  expr: sum by (namespace) ((kube_pod_spec_volumes_persistentvolumeclaims_info) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-other-rules
                interval: 1m
                rules:
                - record: tkg_namespace_pods_qty_count
                  expr: sum by (namespace) (kube_pod_info)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_pods_reboot_5m_count
                  expr: sum by (namespace) (changes(kube_pod_status_ready{condition="true"}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_pods_broken_count
                  expr: sum by (namespace) (kube_pod_status_ready{condition="false"})
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-cpu-rules
                interval: 1m
                rules:
                - record: tkg_pod_cpu_usage_seconds
                  expr: sum by (pod) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m])) * 100
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_cpu_request_core
                  expr: sum by (pod) (kube_pod_container_resource_requests_cpu_cores)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_cpu_limit_core
                  expr: sum by (pod) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_cpu_throttled_seconds
                  expr: sum by (pod) (((rate(container_cpu_cfs_throttled_seconds_total[5m]))) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-mem-rules
                interval: 1m
                rules:
                - record: tkg_pod_mem_usage_mb
                  expr: sum by (pod) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_rss_mb
                  expr: sum by (pod) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_workingset_mb
                  expr: sum by (pod) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_request_mb
                  expr: sum by (pod) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_limit_mb
                  expr: sum by (pod) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024)) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-network-rules
                interval: 1m
                rules:
                - record: tkg_pod_network_tx_bytes
                  expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_bytes
                  expr: sum by (pod) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_tx_packets
                  expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_packets
                  expr: sum by (pod) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_tx_dropped_packets
                  expr: sum by (pod) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_dropped_packets
                  expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_tx_errors
                  expr: sum by (pod) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_errors
                  expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_bytes
                  expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_packets
                  expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_drop_packets
                  expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_errors
                  expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-other-rules
                interval: 1m
                rules:
                - record: tkg_pod_health_container_restarts_1hr_count
                  expr: sum by (pod) (increase(kube_pod_container_status_restarts_total[1h]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_health_unhealthy_count
                  expr: min_over_time(sum by (pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m])
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-node-cpu-rules
                interval: 1m
                rules:
                - record: tkg_node_cpu_capacity_core
                  expr: sum by (node) (kube_node_status_capacity_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_allocate_core
                  expr: sum by (node) (kube_node_status_allocatable_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_usage_seconds
                  expr: (label_replace(sum by (instance) (rate(container_cpu_usage_seconds_total[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_throttled_seconds
                  expr: sum by (instance) (rate(container_cpu_cfs_throttled_seconds_total[5m]))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_request_core
                  expr: sum by (node) (kube_pod_container_resource_requests_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_limits_core
                  expr: sum by (node) (kube_pod_container_resource_limits_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
              - name: vmw-telco-node-mem-rules
                interval: 1m
                rules:
                - record: tkg_node_mem_capacity_mb
                  expr: sum by (node) (kube_node_status_capacity_memory_bytes / (1024*1024))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_allocate_mb
                  expr: sum by (node) (kube_node_status_allocatable_memory_bytes / (1024*1024))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_request_mb
                  expr: sum by (node) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_limits_mb
                  expr: sum by (node) (kube_pod_container_resource_limits_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_available_mb
                  expr: sum by (node) ((node_memory_MemAvailable_bytes / (1024*1024)))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_free_mb
                  expr: sum by (node) ((node_memory_MemFree_bytes / (1024*1024)))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_usage_mb
                  expr: (label_replace(sum by (instance) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_free_pc
                  expr: sum ((node_memory_MemFree_bytes{job="kubernetes-pods"} / node_memory_MemTotal_bytes) * 100) by (node)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_oom_kill
                  expr: sum by (node) (node_vmstat_oom_kill)
                  labels:
                    job: kubernetes-service-endpoints
              - name: vmw-telco-node-network-rules
                interval: 1m
                rules:
                - record: tkg_node_network_tx_bytes
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_bytes
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_tx_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_tx_dropped_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_dropped_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_tx_errors
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_errors
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_bytes
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_packets
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_drop_packets
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_errors
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
              - name: vmw-telco-node-other-rules
                interval: 1m
                rules:
                - record: tkg_node_status_mempressure_count
                  expr: sum by (node) (kube_node_status_condition{condition="MemoryPressure",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_diskpressure_count
                  expr: sum by (node) (kube_node_status_condition{condition="DiskPressure",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_pidpressure_count
                  expr: sum by (node) (kube_node_status_condition{condition="PIDPressure",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_networkunavailable_count
                  expr: sum by (node) (kube_node_status_condition{condition="NetworkUnavailable",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_etcdb_bytes
                  expr: (label_replace(etcd_db_total_size_in_bytes, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_apiserver_request_total
                  expr: sum((label_replace(apiserver_request_total, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")))) by (node)
                  labels:
                    job: kubernetes-service-endpoints
        ingress:
          enabled: false
          virtual_host_fqdn: prometheus.system.tanzu
          prometheus_prefix: /
          alertmanager_prefix: /alertmanager/
          prometheusServicePort: 80
          alertmanagerServicePort: 80
        alertmanager:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 80
            targetPort: 9093
          pvc:
            accessMode: ReadWriteOnce
            storage: 2Gi
          config:
            alertmanager_yml: |
              global: {}
              receivers:
              - name: default-receiver
              templates:
              - '/etc/alertmanager/templates/*.tmpl'
              route:
                group_interval: 5m
                group_wait: 10s
                receiver: default-receiver
                repeat_interval: 3h
        kube_state_metrics:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 80
            targetPort: 8080
            telemetryPort: 81
            telemetryTargetPort: 8081
        node_exporter:
          daemonset:
            hostNetwork: false
            updatestrategy: RollingUpdate
          service:
            type: ClusterIP
            port: 9100
            targetPort: 9100
        pushgateway:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 9091
            targetPort: 9091
        cadvisor:
          daemonset:
            updatestrategy: RollingUpdate

In this sample CR:

  • The TSDB retention time in the prometheus.deployment.containers.args parameter is changed from the default 42 days to 5 days.

  • Some recording rules are added to prometheus.config.recording_rules_yml. Customize them or add more as needed.

  • The prometheus.service.type is changed to NodePort so that Prometheus can be integrated with external components (for example, vROps or Grafana). See Prometheus service type.
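In the same way, additional alerting rules can be supplied through prometheus.config.alerts_yml. A minimal sketch follows; the group name, alert name, threshold, and duration are hypothetical, while the tkg_node_status_mempressure_count metric is one of the recording rules defined in the sample CR:

```yaml
prometheus:
  config:
    alerts_yml: |
      groups:
      - name: custom-alerts                  # hypothetical group name
        rules:
        - alert: NodeMemoryPressure          # hypothetical alert; tune expr/for to your needs
          expr: tkg_node_status_mempressure_count > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} is reporting MemoryPressure"
```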

Prometheus service type

By default, Prometheus is deployed with a service type of ClusterIP, which means the service is not reachable from outside the cluster.

There are three options available for the prometheus.service.type:

  • ClusterIP – the default configuration; the Prometheus service can be accessed only from within the workload cluster. The service can also be exposed through ingress, but this depends on an ingress controller and additional manual configuration.

  • NodePort (recommended) – exposes the Prometheus service on a port of each node. TCA does not support specifying the node port; Kubernetes allocates a random port from the high range (30000–32767). To determine the allocated port after configuration, view the service configuration on the cluster with kubectl get svc -n tanzu-system-monitoring prometheus-server. In the following output, prometheus-server is exposed on node port 32020, so external components can integrate with Prometheus at http://<cluster-endpoint-ip>:32020:

    capv@cp0-control-plane-kz5k6 [ ~ ]$ kubectl get svc -n tanzu-system-monitoring prometheus-server
    NAME                TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
    prometheus-server   NodePort   100.65.8.127   <node>        80:32020/TCP   25s
  • LoadBalancer – uses a load balancer provider on Kubernetes to expose the service. VMware recommends the Avi load balancer, which is deployed by the load-balancer-and-ingress-service addon. Other load balancer providers can be used but are not supported by VMware. TCA does not support specifying a static VIP for the Prometheus service; a VIP is allocated from the default VIP pool, and external components can then integrate with Prometheus at http://<prometheus-VIP>.
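Whichever option is chosen, the change itself is a small values.yaml fragment. A minimal sketch (ports as in the sample CR; substitute NodePort or LoadBalancer as appropriate):

```yaml
prometheus:
  service:
    type: NodePort    # one of ClusterIP, NodePort, LoadBalancer
    port: 80
    targetPort: 9090
```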