Use this reference when configuring additional parameters of the Prometheus addon via the Custom Resources (CRs) tab.
Configurable parameters
| Parameter | Description | Type | Default value | Note |
|---|---|---|---|---|
| prometheus.deployment.replicas | Number of Prometheus replicas. | integer | 1 | |
| prometheus.deployment.containers.args | Prometheus container arguments. You can configure this parameter to change the retention time. For information about configuring Prometheus storage parameters, see the Prometheus documentation. Note: Longer retention times require more storage capacity. It might be necessary to increase the persistent volume claim size if you are significantly increasing the retention time. | list | - --storage.tsdb.retention.time=42d<br>- --config.file=/etc/config/prometheus.yml<br>- --storage.tsdb.path=/data<br>- --web.console.libraries=/etc/prometheus/console_libraries2<br>- --web.console.templates=/etc/prometheus/consoles<br>- --web.enable-lifecycle | Prometheus replaces the whole argument list; make sure the customized argument list contains all of these arguments. |
| prometheus.deployment.containers.resources | Prometheus container resource requests and limits. | map | {} | See the example following this table. |
| prometheus.deployment.podAnnotations | The Prometheus deployment's pod annotations. | map | {} | |
| prometheus.deployment.podLabels | The Prometheus deployment's pod labels. | map | {} | |
| prometheus.deployment.configMapReload.containers.args | Configmap-reload container arguments. | list | | |
| prometheus.deployment.configMapReload.containers.resources | Configmap-reload container resource requests and limits. | map | {} | |
| prometheus.service.type | Type of service to expose Prometheus. | Enum["ClusterIP", "NodePort", "LoadBalancer"] | ClusterIP | Immutable |
| prometheus.service.port | Prometheus service port. | integer | 80 | Immutable |
| prometheus.service.targetPort | Prometheus service target port. | integer | 9090 | Immutable |
| prometheus.service.labels | Prometheus service labels. | map | {} | |
| prometheus.service.annotations | Prometheus service annotations. | map | {} | |
| prometheus.pvc.annotations | PVC annotations. | map | {} | |
| prometheus.pvc.storageClassName | Storage class to use for the persistent volume claim. The default storage class is used if it is not set. | string | | Immutable, formatted on UI |
| prometheus.pvc.accessMode | Defines the access mode for the persistent volume claim. | Enum["ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany"] | ReadWriteOnce | Immutable, formatted on UI |
| prometheus.pvc.storage | Defines the storage size for the persistent volume claim. | string | 150Gi | Immutable, formatted on UI |
| prometheus.config.prometheus_yml | For information about the global Prometheus configuration, see the Prometheus documentation. | YAML file | prometheus.yaml | |
| prometheus.config.alerting_rules_yml | For information about the Prometheus alerting rules, see the Prometheus documentation. | YAML file | alerting_rules.yaml | |
| prometheus.config.recording_rules_yml | For information about the Prometheus recording rules, see the Prometheus documentation. | YAML file | recording_rules.yaml | |
| prometheus.config.alerts_yml | Additional Prometheus alerting rules are configured here. | YAML file | alerts_yml.yaml | |
| prometheus.config.rules_yml | Additional Prometheus recording rules are configured here. | YAML file | rules_yml.yaml | |
| alertmanager.deployment.replicas | Number of Alertmanager replicas. | integer | 1 | |
| alertmanager.deployment.containers.resources | Alertmanager container resource requests and limits. | map | {} | |
| alertmanager.deployment.podAnnotations | The Alertmanager deployment's pod annotations. | map | {} | |
| alertmanager.deployment.podLabels | The Alertmanager deployment's pod labels. | map | {} | |
| alertmanager.service.type | Type of service to expose Alertmanager. | Enum["ClusterIP"] | ClusterIP | Immutable |
| alertmanager.service.port | Alertmanager service port. | integer | 80 | Immutable |
| alertmanager.service.targetPort | Alertmanager service target port. | integer | 9093 | Immutable |
| alertmanager.service.labels | Alertmanager service labels. | map | {} | |
| alertmanager.service.annotations | Alertmanager service annotations. | map | {} | |
| alertmanager.pvc.annotations | Alertmanager PVC annotations. | map | {} | |
| alertmanager.pvc.storageClassName | Storage class to use for the persistent volume claim. The default provisioner is used if it is not set. | string | | Immutable |
| alertmanager.pvc.accessMode | Defines the access mode for the persistent volume claim. | Enum["ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany"] | ReadWriteOnce | Immutable |
| alertmanager.pvc.storage | Defines the storage size for the persistent volume claim. | string | 2Gi | Immutable |
| alertmanager.config.alertmanager_yml | For information about the global YAML configuration for Alertmanager, see the Prometheus documentation. | YAML file | alertmanager_yml | |
| kube_state_metrics.deployment.replicas | Number of kube-state-metrics replicas. | integer | 1 | |
| kube_state_metrics.deployment.containers.resources | kube-state-metrics container resource requests and limits. | map | {} | |
| kube_state_metrics.deployment.podAnnotations | The kube-state-metrics deployment's pod annotations. | map | {} | |
| kube_state_metrics.deployment.podLabels | The kube-state-metrics deployment's pod labels. | map | {} | |
| kube_state_metrics.service.type | Type of service to expose kube-state-metrics. | Enum["ClusterIP"] | ClusterIP | Immutable |
| kube_state_metrics.service.port | kube-state-metrics service port. | integer | 80 | Immutable |
| kube_state_metrics.service.targetPort | kube-state-metrics service target port. | integer | 8080 | Immutable |
| kube_state_metrics.service.telemetryPort | kube-state-metrics service telemetry port. | integer | 81 | Immutable |
| kube_state_metrics.service.telemetryTargetPort | kube-state-metrics service telemetry target port. | integer | 8081 | Immutable |
| kube_state_metrics.service.labels | kube-state-metrics service labels. | map | {} | |
| kube_state_metrics.service.annotations | kube-state-metrics service annotations. | map | {} | |
| node_exporter.daemonset.replicas | Number of node-exporter replicas. | integer | 1 | |
| node_exporter.daemonset.containers.resources | node-exporter container resource requests and limits. | map | {} | |
| node_exporter.daemonset.hostNetwork | Host networking requested for this pod. | boolean | false | |
| node_exporter.daemonset.podAnnotations | The node-exporter daemon set's pod annotations. | map | {} | |
| node_exporter.daemonset.podLabels | The node-exporter daemon set's pod labels. | map | {} | |
| node_exporter.service.type | Type of service to expose node-exporter. | Enum["ClusterIP"] | ClusterIP | Immutable |
| node_exporter.service.port | node-exporter service port. | integer | 9100 | Immutable |
| node_exporter.service.targetPort | node-exporter service target port. | integer | 9100 | Immutable |
| node_exporter.service.labels | node-exporter service labels. | map | {} | |
| node_exporter.service.annotations | node-exporter service annotations. | map | {} | |
| pushgateway.deployment.replicas | Number of pushgateway replicas. | integer | 1 | |
| pushgateway.deployment.containers.resources | pushgateway container resource requests and limits. | map | {} | |
| pushgateway.deployment.podAnnotations | The pushgateway deployment's pod annotations. | map | {} | |
| pushgateway.deployment.podLabels | The pushgateway deployment's pod labels. | map | {} | |
| pushgateway.service.type | Type of service to expose pushgateway. | Enum["ClusterIP"] | ClusterIP | Immutable |
| pushgateway.service.port | pushgateway service port. | integer | 9091 | Immutable |
| pushgateway.service.targetPort | pushgateway service target port. | integer | 9091 | Immutable |
| pushgateway.service.labels | pushgateway service labels. | map | {} | |
| pushgateway.service.annotations | pushgateway service annotations. | map | {} | |
| cadvisor.daemonset.replicas | Number of cadvisor replicas. | integer | 1 | |
| cadvisor.daemonset.containers.resources | cadvisor container resource requests and limits. | map | {} | |
| cadvisor.daemonset.podAnnotations | The cadvisor daemon set's pod annotations. | map | {} | |
| cadvisor.daemonset.podLabels | The cadvisor daemon set's pod labels. | map | {} | |
| ingress.enabled | Enable/disable ingress for Prometheus and Alertmanager. | boolean | false | Immutable; depends on the cert-manager addon and the Contour ingress controller |
| ingress.virtual_host_fqdn | Hostname for accessing Prometheus and Alertmanager. | string | prometheus.system.tanzu | Immutable |
| ingress.prometheus_prefix | Path prefix for Prometheus. | string | / | Immutable |
| ingress.alertmanager_prefix | Path prefix for Alertmanager. | string | /alertmanager/ | Immutable |
| ingress.prometheusServicePort | Prometheus service port to proxy traffic to. | integer | 80 | Immutable |
| ingress.alertmanagerServicePort | Alertmanager service port to proxy traffic to. | integer | 80 | Immutable |
| ingress.tlsCertificate.tls.crt | Optional certificate for ingress if you want to use your own TLS certificate. A self-signed certificate is generated by default. | string | Generated cert | tls.crt is a key and not nested. See the example following this table. |
| ingress.tlsCertificate.tls.key | Optional certificate private key for ingress if you want to use your own TLS certificate. | string | Generated cert key | tls.key is a key and not nested. |
| ingress.tlsCertificate.ca.crt | Optional CA certificate. | string | CA certificate | ca.crt is a key and not nested. |
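The resources parameters (for example, prometheus.deployment.containers.resources) are empty maps by default. Assuming they follow the standard Kubernetes requests/limits layout, a minimal sketch of an override in values.yaml could look like the following; the sizes shown are illustrative, not recommended values.

```yaml
prometheus:
  deployment:
    containers:
      resources:
        requests:
          cpu: 500m      # illustrative request
          memory: 1Gi
        limits:
          cpu: "1"       # illustrative limit
          memory: 2Gi
```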
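For the ingress TLS parameters, the notes above indicate that tls.crt, tls.key, and ca.crt are literal key names and not nested maps. A hedged sketch of how such an override might be written in values.yaml, with placeholder PEM content:

```yaml
ingress:
  enabled: true
  virtual_host_fqdn: prometheus.system.tanzu
  tlsCertificate:
    tls.crt: |            # literal key name, not a nested map
      -----BEGIN CERTIFICATE-----
      <your certificate>
      -----END CERTIFICATE-----
    tls.key: |            # literal key name, not a nested map
      -----BEGIN PRIVATE KEY-----
      <your private key>
      -----END PRIVATE KEY-----
```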
The following is a sample Prometheus addon CR:
metadata:
name: prometheus
spec:
clusterRef:
name: wc0
namespace: wc0
name: prometheus
namespace: wc0
config:
stringData:
values.yaml: |
prometheus:
deployment:
replicas: 1
containers:
args:
- --storage.tsdb.retention.time=5d
- --config.file=/etc/config/prometheus.yml
- --storage.tsdb.path=/data
- --web.console.libraries=/etc/prometheus/console_libraries2
- --web.console.templates=/etc/prometheus/consoles
- --web.enable-lifecycle
service:
type: NodePort
port: 80
targetPort: 9090
pvc:
accessMode: ReadWriteOnce
storage: 150Gi
config:
prometheus_yml: |
global:
evaluation_interval: 1m
scrape_interval: 1m
scrape_timeout: 10s
rule_files:
- /etc/config/alerting_rules.yml
- /etc/config/recording_rules.yml
- /etc/config/alerts
- /etc/config/rules
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
static_configs:
- targets: ['localhost:9090']
- job_name: 'kube-state-metrics'
static_configs:
- targets: ['prometheus-kube-state-metrics.tanzu-system-monitoring.svc.cluster.local:8080']
- job_name: 'node-exporter'
static_configs:
- targets: ['prometheus-node-exporter.tanzu-system-monitoring.svc.cluster.local:9100']
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: node
- job_name: kubernetes-nodes-cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
- job_name: kubernetes-apiservers
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
alerting:
alertmanagers:
- scheme: http
static_configs:
- targets:
- alertmanager.tanzu-system-monitoring.svc:80
- kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
regex: default
action: keep
- source_labels: [__meta_kubernetes_pod_label_app]
regex: prometheus
action: keep
- source_labels: [__meta_kubernetes_pod_label_component]
regex: alertmanager
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
regex: .*
action: keep
- source_labels: [__meta_kubernetes_pod_container_port_number]
regex:
action: drop
alerting_rules_yml: |
{}
recording_rules_yml: |
groups:
- name: vmw-telco-namespace-cpu-rules
interval: 1m
rules:
- record: tkg_namespace_cpu_usage_seconds
expr: sum by (namespace) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_cpu_throttled_seconds
expr: sum by (namespace) (((rate(container_cpu_cfs_throttled_seconds_total[5m])) ) > 0 or kube_pod_info < bool 0)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_cpu_request_core
expr: sum by (namespace) (kube_pod_container_resource_requests_cpu_cores)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_cpu_limits_core
expr: sum by (namespace) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-namespace-mem-rules
interval: 1m
rules:
- record: tkg_namespace_mem_usage_mb
expr: sum by (namespace) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_mem_rss_mb
expr: sum by (namespace) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_mem_workingset_mb
expr: sum by (namespace) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_mem_request_mb
expr: sum by (namespace) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_mem_limit_mb
expr: sum by (namespace) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024) )> 0 or kube_pod_info < bool 0)
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-namespace-network-rules
interval: 1m
rules:
- record: tkg_namespace_network_tx_bytes
expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_rx_bytes
expr: sum by (namespace) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_tx_packets
expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_rx_packets
expr: sum by (namespace) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_tx_drop_packets
expr: sum by (namespace) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_rx_drop_packets
expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_tx_errors
expr: sum by (namespace) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_rx_errors
expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_total_bytes
expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_total_packets
expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_total_drop_packets
expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_network_total_errors
expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-namespace-storage-rules
interval: 1m
rules:
- record: tkg_namespace_storage_pvc_bound
expr: sum by (namespace) ((kube_persistentvolumeclaim_status_phase{phase="Bound"}) > 0 or kube_pod_info < bool 0)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_storage_pvc_count
expr: sum by (namespace) ((kube_pod_spec_volumes_persistentvolumeclaims_info)> 0 or kube_pod_info < bool 0)
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-namespace-other-rules
interval: 1m
rules:
- record: tkg_namespace_pods_qty_count
expr: sum by (namespace) (kube_pod_info)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_pods_reboot_5m_count
expr: sum by (namespace) (changes(kube_pod_status_ready{condition="true"}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_namespace_pods_broken_count
expr: sum by (namespace) (kube_pod_status_ready{condition="false"})
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-pod-cpu-rules
interval: 1m
rules:
- record: tkg_pod_cpu_usage_seconds
expr: sum by (pod) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m])) * 100
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_cpu_request_core
expr: sum by (pod) (kube_pod_container_resource_requests_cpu_cores)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_cpu_limit_core
expr: sum by (pod) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_cpu_throttled_seconds
expr: sum by (pod) (((rate(container_cpu_cfs_throttled_seconds_total[5m])) ) > 0 or kube_pod_info < bool 0)
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-pod-mem-rules
interval: 1m
rules:
- record: tkg_pod_mem_usage_mb
expr: sum by (pod) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_mem_rss_mb
expr: sum by (pod) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_mem_workingset_mb
expr: sum by (pod) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_mem_request_mb
expr: sum by (pod) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_mem_limit_mb
expr: sum by (pod) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024) )> 0 or kube_pod_info < bool 0)
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-pod-network-rules
interval: 1m
rules:
- record: tkg_pod_network_tx_bytes
expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_rx_bytes
expr: sum by (pod) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_tx_packets
expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_rx_packets
expr: sum by (pod) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_tx_dropped_packets
expr: sum by (pod) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_rx_dropped_packets
expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_tx_errors
expr: sum by (pod) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_rx_errors
expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_total_bytes
expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_total_packets
expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_total_drop_packets
expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_network_total_errors
expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-pod-other-rules
interval: 1m
rules:
- record: tkg_pod_health_container_restarts_1hr_count
expr: sum by (pod) (increase(kube_pod_container_status_restarts_total[1h]))
labels:
job: kubernetes-nodes-cadvisor
- record: tkg_pod_health_unhealthy_count
expr: min_over_time(sum by (pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m])
labels:
job: kubernetes-nodes-cadvisor
- name: vmw-telco-node-cpu-rules
interval: 1m
rules:
- record: tkg_node_cpu_capacity_core
expr: sum by (node) (kube_node_status_capacity_cpu_cores)
labels:
job: kubernetes-service-endpoints
- record: tkg_node_cpu_allocate_core
expr: sum by (node) (kube_node_status_allocatable_cpu_cores)
labels:
job: kubernetes-service-endpoints
- record: tkg_node_cpu_usage_seconds
expr: (label_replace(sum by (instance) (rate(container_cpu_usage_seconds_total[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_cpu_throttled_seconds
expr: sum by (instance) (rate(container_cpu_cfs_throttled_seconds_total[5m]))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_cpu_request_core
expr: sum by (node) (kube_pod_container_resource_requests_cpu_cores)
labels:
job: kubernetes-service-endpoints
- record: tkg_node_cpu_limits_core
expr: sum by (node) (kube_pod_container_resource_limits_cpu_cores)
labels:
job: kubernetes-service-endpoints
- name: vmw-telco-node-mem-rules
interval: 1m
rules:
- record: tkg_node_mem_capacity_mb
expr: sum by (node) (kube_node_status_capacity_memory_bytes / (1024*1024))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_allocate_mb
expr: sum by (node) (kube_node_status_allocatable_memory_bytes / (1024*1024))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_request_mb
expr: sum by (node) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_limits_mb
expr: sum by (node) (kube_pod_container_resource_limits_memory_bytes) / (1024*1024)
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_available_mb
expr: sum by (node) ((node_memory_MemAvailable_bytes / (1024*1024) ))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_free_mb
expr: sum by (node) ((node_memory_MemFree_bytes / (1024*1024) ))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_usage_mb
expr: (label_replace(sum by (instance) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_mem_free_pc
expr: sum ((node_memory_MemFree_bytes{job="kubernetes-pods"} / node_memory_MemTotal_bytes) *100) by (node)
labels:
job: kubernetes-service-endpoints
- record: tkg_node_oom_kill
expr: sum by(node) (node_vmstat_oom_kill)
labels:
job: kubernetes-service-endpoints
- name: vmw-telco-node-network-rules
interval: 1m
rules:
- record: tkg_node_network_tx_bytes
expr: (label_replace(sum by (instance) (rate(container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_rx_bytes
expr: (label_replace(sum by (instance) (rate(container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_tx_packets
expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_rx_packets
expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_tx_dropped_packets
expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_rx_dropped_packets
expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_tx_errors
expr: (label_replace(sum by (instance) (rate(container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_rx_errors
expr: (label_replace(sum by (instance) (rate(container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_total_bytes
expr: label_replace((sum by (instance) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_total_packets
expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_total_drop_packets
expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
labels:
job: kubernetes-service-endpoints
- record: tkg_node_network_total_errors
expr: label_replace((sum by (instance) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
labels:
job: kubernetes-service-endpoints
- name: vmw-telco-node-other-rules
interval: 1m
rules:
- record: tkg_node_status_mempressure_count
expr: sum by (node) (kube_node_status_condition{condition="MemoryPressure",status="true"})
labels:
job: kubernetes-service-endpoints
- record: tkg_node_status_diskpressure_count
expr: sum by (node) (kube_node_status_condition{condition="DiskPressure",status="true"})
labels:
job: kubernetes-service-endpoints
- record: tkg_node_status_pidpressure_count
expr: sum by (node) (kube_node_status_condition{condition="PIDPressure",status="true"})
labels:
job: kubernetes-service-endpoints
- record: tkg_node_status_networkunavailable_count
expr: sum by (node) (kube_node_status_condition{condition="NetworkUnavailable",status="true"})
labels:
job: kubernetes-service-endpoints
- record: tkg_node_status_etcdb_bytes
expr: (label_replace(etcd_db_total_size_in_bytes, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")) )
labels:
job: kubernetes-service-endpoints
- record: tkg_node_status_apiserver_request_total
expr: sum((label_replace(apiserver_request_total, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")) )) by (node)
labels:
job: kubernetes-service-endpoints
ingress:
enabled: false
virtual_host_fqdn: prometheus.system.tanzu
prometheus_prefix: /
alertmanager_prefix: /alertmanager/
prometheusServicePort: 80
alertmanagerServicePort: 80
alertmanager:
deployment:
replicas: 1
service:
type: ClusterIP
port: 80
targetPort: 9093
pvc:
accessMode: ReadWriteOnce
storage: 2Gi
config:
alertmanager_yml: |
global: {}
receivers:
- name: default-receiver
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_interval: 5m
group_wait: 10s
receiver: default-receiver
repeat_interval: 3h
kube_state_metrics:
deployment:
replicas: 1
service:
type: ClusterIP
port: 80
targetPort: 8080
telemetryPort: 81
telemetryTargetPort: 8081
node_exporter:
daemonset:
hostNetwork: false
updatestrategy: RollingUpdate
service:
type: ClusterIP
port: 9100
targetPort: 9100
pushgateway:
deployment:
replicas: 1
service:
type: ClusterIP
port: 9091
targetPort: 9091
cadvisor:
daemonset:
updatestrategy: RollingUpdate
In this sample CR:
- The TSDB retention time in the parameter prometheus.deployment.containers.args is changed to 5 days instead of the default 42 days.
- Some recording rules are added to prometheus.config.recording_rules_yml. Customize them or add more as needed. They can be validated locally before applying the CR; see the sketch after this list.
- The prometheus.service.type is changed to NodePort so that Prometheus can be integrated with external components (for example, vROps or Grafana). See Prometheus service type.
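Before applying a CR with customized rules, the recording and alerting rule content can optionally be validated with promtool, which ships with Prometheus releases. This is a local sanity check, assuming the rule sections have been copied out of the CR into files; the file names below are illustrative.

```sh
# Validate rule files copied from recording_rules_yml / alerting_rules_yml
# (file names are placeholders for content extracted from the CR)
promtool check rules recording_rules.yml alerting_rules.yml
```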
Prometheus service type
By default, Prometheus is deployed with a service type of ClusterIP, which means it is not reachable from outside the cluster.
There are three options available for the prometheus.service.type:
- ClusterIP – with the default configuration, the Prometheus service can only be accessed from within the workload cluster. The service can also be exposed through ingress; however, this depends on the ingress controller and some additional manual configuration.
- NodePort (recommended) – exposes the Prometheus service on a node port. TCA does not support specifying the actual node port; Kubernetes allocates a random node port number (a high-range port between 30000 and 32767). To determine the allocated node port after configuration, view the service configuration from the cluster with the command kubectl get svc -n tanzu-system-monitoring prometheus-server. In the following output, prometheus-server is exposed on node port 32020, so other external components can integrate with Prometheus at http://<cluster-endpoint-ip>:32020.

```
capv@cp0-control-plane-kz5k6 [ ~ ]$ kubectl get svc -n tanzu-system-monitoring prometheus-server
NAME                TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
prometheus-server   NodePort   100.65.8.127   <node>        80:32020/TCP   25s
```
- LoadBalancer – leverages a load balancer provider on Kubernetes to expose the service. VMware recommends the Avi load balancer, which is deployed by the load-balancer-and-ingress-service addon. Other load balancer providers can be used, but they are not supported by VMware. TCA does not support specifying a static VIP for the Prometheus service; a VIP from the default VIP pool is allocated, and other external components can then integrate with Prometheus at http://<prometheus-VIP>. A sketch for reading back the allocated node port or VIP follows this list.
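As a convenience, the allocated node port (for type NodePort) or the allocated VIP (for type LoadBalancer) can be read back from the service object and checked with a simple HTTP query against the Prometheus API. The namespace and service name below match the sample output above; the jsonpath expressions and the curl check are illustrative sketches, not TCA-specific commands.

```sh
# NodePort: print the node port allocated to the prometheus-server service
kubectl get svc -n tanzu-system-monitoring prometheus-server \
  -o jsonpath='{.spec.ports[0].nodePort}'

# LoadBalancer: print the VIP allocated by the load balancer provider
kubectl get svc -n tanzu-system-monitoring prometheus-server \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Quick reachability check against the Prometheus HTTP API
# (replace <endpoint> with <node-ip>:<node-port> or <prometheus-VIP>)
curl "http://<endpoint>/api/v1/query?query=up"
```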