Use this reference when configuring additional parameters of the Prometheus addon via the Custom Resources (CRs) tab.
Configurable parameters

| Parameter | Description | Type | Default value | Note |
|---|---|---|---|---|
| prometheus.deployment.replicas | Number of Prometheus replicas. | integer | 1 | |
| prometheus.deployment.containers.args | Prometheus container arguments. You can configure this parameter to change the retention time. For information about configuring Prometheus storage parameters, see the Prometheus documentation. Note: Longer retention times require more storage capacity. You might need to increase the persistent volume claim size if you significantly increase the retention time. | list | - --storage.tsdb.retention.time=42d - --config.file=/etc/config/prometheus.yml - --storage.tsdb.path=/data - --web.console.libraries=/etc/prometheus/console_libraries2 - --web.console.templates=/etc/prometheus/consoles - --web.enable-lifecycle | Prometheus replaces the whole argument list; make sure the customized argument list contains all of these arguments. |
| prometheus.deployment.containers.resources | Prometheus container resource requests and limits. | map | {} | |
| prometheus.deployment.podAnnotations | The Prometheus deployment's pod annotations. | map | {} | |
| prometheus.deployment.podLabels | The Prometheus deployment's pod labels. | map | {} | |
| prometheus.deployment.configMapReload.containers.args | Configmap-reload container arguments. | list | | |
| prometheus.deployment.configMapReload.containers.resources | Configmap-reload container resource requests and limits. | map | {} | |
| prometheus.service.type | Type of service to expose Prometheus. | Enum["ClusterIP","NodePort","LoadBalancer"] | ClusterIP | Immutable |
| prometheus.service.port | Prometheus service port. | integer | 80 | Immutable |
| prometheus.service.targetPort | Prometheus service target port. | integer | 9090 | Immutable |
| prometheus.service.labels | Prometheus service labels. | map | {} | |
| prometheus.service.annotations | Prometheus service annotations. | map | {} | |
| prometheus.pvc.annotations | PVC annotations. | map | {} | |
| prometheus.pvc.storageClassName | Storage class to use for the persistent volume claim. The default storage class is used if it is not set. | string | | Immutable, formatted on UI |
| prometheus.pvc.accessMode | Access mode for the persistent volume claim. | Enum["ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany"] | ReadWriteOnce | Immutable, formatted on UI |
| prometheus.pvc.storage | Storage size for the persistent volume claim. | string | 150Gi | Immutable, formatted on UI |
| prometheus.config.prometheus_yml | For information about the global Prometheus configuration, see the Prometheus documentation. | YAML file | prometheus.yaml | |
| prometheus.config.alerting_rules_yml | For information about Prometheus alerting rules, see the Prometheus documentation. | YAML file | alerting_rules.yaml | |
| prometheus.config.recording_rules_yml | For information about Prometheus recording rules, see the Prometheus documentation. | YAML file | recording_rules.yaml | |
| prometheus.config.alerts_yml | Additional Prometheus alerting rules are configured here. | YAML file | alerts_yml.yaml | |
| prometheus.config.rules_yml | Additional Prometheus recording rules are configured here. | YAML file | rules_yml.yaml | |
| alertmanager.deployment.replicas | Number of Alertmanager replicas. | integer | 1 | |
| alertmanager.deployment.containers.resources | Alertmanager container resource requests and limits. | map | {} | |
| alertmanager.deployment.podAnnotations | The Alertmanager deployment's pod annotations. | map | {} | |
| alertmanager.deployment.podLabels | The Alertmanager deployment's pod labels. | map | {} | |
| alertmanager.service.type | Type of service to expose Alertmanager. | Enum["ClusterIP"] | ClusterIP | Immutable |
| alertmanager.service.port | Alertmanager service port. | integer | 80 | Immutable |
| alertmanager.service.targetPort | Alertmanager service target port. | integer | 9093 | Immutable |
| alertmanager.service.labels | Alertmanager service labels. | map | {} | |
| alertmanager.service.annotations | Alertmanager service annotations. | map | {} | |
| alertmanager.pvc.annotations | Alertmanager PVC annotations. | map | {} | |
| alertmanager.pvc.storageClassName | Storage class to use for the persistent volume claim. The default provisioner is used if it is not set. | string | | Immutable |
| alertmanager.pvc.accessMode | Access mode for the persistent volume claim. | Enum["ReadWriteOnce", "ReadOnlyMany", "ReadWriteMany"] | ReadWriteOnce | Immutable |
| alertmanager.pvc.storage | Storage size for the persistent volume claim. | string | 2Gi | Immutable |
| alertmanager.config.alertmanager_yml | For information about the global YAML configuration for Alertmanager, see the Prometheus documentation. | YAML file | alertmanager_yml | |
| kube_state_metrics.deployment.replicas | Number of kube-state-metrics replicas. | integer | 1 | |
| kube_state_metrics.deployment.containers.resources | kube-state-metrics container resource requests and limits. | map | {} | |
| kube_state_metrics.deployment.podAnnotations | The kube-state-metrics deployment's pod annotations. | map | {} | |
| kube_state_metrics.deployment.podLabels | The kube-state-metrics deployment's pod labels. | map | {} | |
| kube_state_metrics.service.type | Type of service to expose kube-state-metrics. | Enum["ClusterIP"] | ClusterIP | Immutable |
| kube_state_metrics.service.port | kube-state-metrics service port. | integer | 80 | Immutable |
| kube_state_metrics.service.targetPort | kube-state-metrics service target port. | integer | 8080 | Immutable |
| kube_state_metrics.service.telemetryPort | kube-state-metrics service telemetry port. | integer | 81 | Immutable |
| kube_state_metrics.service.telemetryTargetPort | kube-state-metrics service telemetry target port. | integer | 8081 | Immutable |
| kube_state_metrics.service.labels | kube-state-metrics service labels. | map | {} | |
| kube_state_metrics.service.annotations | kube-state-metrics service annotations. | map | {} | |
| node_exporter.daemonset.replicas | Number of node-exporter replicas. | integer | 1 | |
| node_exporter.daemonset.containers.resources | node-exporter container resource requests and limits. | map | {} | |
| node_exporter.daemonset.hostNetwork | Host networking requested for this pod. | boolean | false | |
| node_exporter.daemonset.podAnnotations | The node-exporter daemonset's pod annotations. | map | {} | |
| node_exporter.daemonset.podLabels | The node-exporter daemonset's pod labels. | map | {} | |
| node_exporter.service.type | Type of service to expose node-exporter. | Enum["ClusterIP"] | ClusterIP | Immutable |
| node_exporter.service.port | node-exporter service port. | integer | 9100 | Immutable |
| node_exporter.service.targetPort | node-exporter service target port. | integer | 9100 | Immutable |
| node_exporter.service.labels | node-exporter service labels. | map | {} | |
| node_exporter.service.annotations | node-exporter service annotations. | map | {} | |
| pushgateway.deployment.replicas | Number of pushgateway replicas. | integer | 1 | |
| pushgateway.deployment.containers.resources | pushgateway container resource requests and limits. | map | {} | |
| pushgateway.deployment.podAnnotations | The pushgateway deployment's pod annotations. | map | {} | |
| pushgateway.deployment.podLabels | The pushgateway deployment's pod labels. | map | {} | |
| pushgateway.service.type | Type of service to expose pushgateway. | Enum["ClusterIP"] | ClusterIP | Immutable |
| pushgateway.service.port | pushgateway service port. | integer | 9091 | Immutable |
| pushgateway.service.targetPort | pushgateway service target port. | integer | 9091 | Immutable |
| pushgateway.service.labels | pushgateway service labels. | map | {} | |
| pushgateway.service.annotations | pushgateway service annotations. | map | {} | |
| cadvisor.daemonset.replicas | Number of cadvisor replicas. | integer | 1 | |
| cadvisor.daemonset.containers.resources | cadvisor container resource requests and limits. | map | {} | |
| cadvisor.daemonset.podAnnotations | The cadvisor daemonset's pod annotations. | map | {} | |
| cadvisor.daemonset.podLabels | The cadvisor daemonset's pod labels. | map | {} | |
| ingress.enabled | Enable/disable ingress for Prometheus and Alertmanager. | boolean | false | Immutable; depends on the cert-manager addon and the Contour ingress controller |
| ingress.virtual_host_fqdn | Hostname for accessing Prometheus and Alertmanager. | string | prometheus.system.tanzu | Immutable |
| ingress.prometheus_prefix | Path prefix for Prometheus. | string | / | Immutable |
| ingress.alertmanager_prefix | Path prefix for Alertmanager. | string | /alertmanager/ | Immutable |
| ingress.prometheusServicePort | Prometheus service port to proxy traffic to. | integer | 80 | Immutable |
| ingress.alertmanagerServicePort | Alertmanager service port to proxy traffic to. | integer | 80 | Immutable |
| ingress.tlsCertificate.tls.crt | Optional certificate for ingress if you want to use your own TLS certificate. A self-signed certificate is generated by default. | string | Generated cert | tls.crt is a key and not nested. |
| ingress.tlsCertificate.tls.key | Optional certificate private key for ingress if you want to use your own TLS certificate. | string | Generated cert key | tls.key is a key and not nested. |
| ingress.tlsCertificate.ca.crt | Optional CA certificate. | string | CA certificate | ca.crt is a key and not nested. |
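
For example, the following values.yaml fragment is a sketch of how some of the map-type parameters above can be filled in. The resource and annotation values are illustrative only, not defaults:

```yaml
prometheus:
  deployment:
    containers:
      resources:                 # prometheus.deployment.containers.resources, default {}
        requests:
          cpu: 500m              # illustrative value
          memory: 1Gi            # illustrative value
        limits:
          cpu: "1"
          memory: 2Gi
    podAnnotations:              # prometheus.deployment.podAnnotations, default {}
      example.com/owner: "monitoring"   # hypothetical annotation key and value
  pvc:
    storage: 150Gi               # default PVC size
```
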
A sample Prometheus addon CR is:
```yaml
metadata:
  name: prometheus
spec:
  clusterRef:
    name: wc0
    namespace: wc0
  name: prometheus
  namespace: wc0
  config:
    stringData:
      values.yaml: |
        prometheus:
          deployment:
            replicas: 1
            containers:
              args:
              - --storage.tsdb.retention.time=5d
              - --config.file=/etc/config/prometheus.yml
              - --storage.tsdb.path=/data
              - --web.console.libraries=/etc/prometheus/console_libraries2
              - --web.console.templates=/etc/prometheus/consoles
              - --web.enable-lifecycle
          service:
            type: NodePort
            port: 80
            targetPort: 9090
          pvc:
            accessMode: ReadWriteOnce
            storage: 150Gi
          config:
            prometheus_yml: |
              global:
                evaluation_interval: 1m
                scrape_interval: 1m
                scrape_timeout: 10s
              rule_files:
              - /etc/config/alerting_rules.yml
              - /etc/config/recording_rules.yml
              - /etc/config/alerts
              - /etc/config/rules
              scrape_configs:
              - job_name: 'prometheus'
                scrape_interval: 5s
                static_configs:
                - targets: ['localhost:9090']
              - job_name: 'kube-state-metrics'
                static_configs:
                - targets: ['prometheus-kube-state-metrics.tanzu-system-monitoring.svc.cluster.local:8080']
              - job_name: 'node-exporter'
                static_configs:
                - targets: ['prometheus-node-exporter.tanzu-system-monitoring.svc.cluster.local:9100']
              - job_name: 'kubernetes-pods'
                kubernetes_sd_configs:
                - role: pod
                relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
                  action: replace
                  target_label: __metrics_path__
                  regex: (.+)
                - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
                  action: replace
                  regex: ([^:]+)(?::\d+)?;(\d+)
                  replacement: $1:$2
                  target_label: __address__
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - source_labels: [__meta_kubernetes_namespace]
                  action: replace
                  target_label: kubernetes_namespace
                - source_labels: [__meta_kubernetes_pod_name]
                  action: replace
                  target_label: kubernetes_pod_name
                - source_labels: [__meta_kubernetes_pod_node_name]
                  action: replace
                  target_label: node
              - job_name: kubernetes-nodes-cadvisor
                kubernetes_sd_configs:
                - role: node
                relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                - replacement: kubernetes.default.svc:443
                  target_label: __address__
                - regex: (.+)
                  replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
                  source_labels:
                  - __meta_kubernetes_node_name
                  target_label: __metrics_path__
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              - job_name: kubernetes-apiservers
                kubernetes_sd_configs:
                - role: endpoints
                relabel_configs:
                - action: keep
                  regex: default;kubernetes;https
                  source_labels:
                  - __meta_kubernetes_namespace
                  - __meta_kubernetes_service_name
                  - __meta_kubernetes_endpoint_port_name
                scheme: https
                tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: true
                bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              alerting:
                alertmanagers:
                - scheme: http
                  static_configs:
                  - targets:
                    - alertmanager.tanzu-system-monitoring.svc:80
                - kubernetes_sd_configs:
                  - role: pod
                  relabel_configs:
                  - source_labels: [__meta_kubernetes_namespace]
                    regex: default
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_label_app]
                    regex: prometheus
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_label_component]
                    regex: alertmanager
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
                    regex: .*
                    action: keep
                  - source_labels: [__meta_kubernetes_pod_container_port_number]
                    regex:
                    action: drop
            alerting_rules_yml: |
              {}
            recording_rules_yml: |
              groups:
              - name: vmw-telco-namespace-cpu-rules
                interval: 1m
                rules:
                - record: tkg_namespace_cpu_usage_seconds
                  expr: sum by (namespace) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_cpu_throttled_seconds
                  expr: sum by (namespace) (((rate(container_cpu_cfs_throttled_seconds_total[5m])) ) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_cpu_request_core
                  expr: sum by (namespace) (kube_pod_container_resource_requests_cpu_cores)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_cpu_limits_core
                  expr: sum by (namespace) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-mem-rules
                interval: 1m
                rules:
                - record: tkg_namespace_mem_usage_mb
                  expr: sum by (namespace) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_rss_mb
                  expr: sum by (namespace) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_workingset_mb
                  expr: sum by (namespace) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_request_mb
                  expr: sum by (namespace) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_mem_limit_mb
                  expr: sum by (namespace) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024) )> 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-network-rules
                interval: 1m
                rules:
                - record: tkg_namespace_network_tx_bytes
                  expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_bytes
                  expr: sum by (namespace) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_tx_packets
                  expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_packets
                  expr: sum by (namespace) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_tx_drop_packets
                  expr: sum by (namespace) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_drop_packets
                  expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_tx_errors
                  expr: sum by (namespace) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_rx_errors
                  expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_bytes
                  expr: sum by (namespace) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_packets
                  expr: sum by (namespace) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_drop_packets
                  expr: sum by (namespace) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_network_total_errors
                  expr: sum by (namespace) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-storage-rules
                interval: 1m
                rules:
                - record: tkg_namespace_storage_pvc_bound
                  expr: sum by (namespace) ((kube_persistentvolumeclaim_status_phase{phase="Bound"}) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_storage_pvc_count
                  expr: sum by (namespace) ((kube_pod_spec_volumes_persistentvolumeclaims_info)> 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-namespace-other-rules
                interval: 1m
                rules:
                - record: tkg_namespace_pods_qty_count
                  expr: sum by (namespace) (kube_pod_info)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_pods_reboot_5m_count
                  expr: sum by (namespace) (changes(kube_pod_status_ready{condition="true"}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_namespace_pods_broken_count
                  expr: sum by (namespace) (kube_pod_status_ready{condition="false"})
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-cpu-rules
                interval: 1m
                rules:
                - record: tkg_pod_cpu_usage_seconds
                  expr: sum by (pod) (rate (container_cpu_usage_seconds_total{container!~"POD",pod!="",image!=""}[5m])) * 100
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_cpu_request_core
                  expr: sum by (pod) (kube_pod_container_resource_requests_cpu_cores)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_cpu_limit_core
                  expr: sum by (pod) (kube_pod_container_resource_limits_cpu_cores > 0.0 or kube_pod_info < bool 0.1)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_cpu_throttled_seconds
                  expr: sum by (pod) (((rate(container_cpu_cfs_throttled_seconds_total[5m])) ) > 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-mem-rules
                interval: 1m
                rules:
                - record: tkg_pod_mem_usage_mb
                  expr: sum by (pod) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_rss_mb
                  expr: sum by (pod) (container_memory_rss{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_workingset_mb
                  expr: sum by (pod) (container_memory_working_set_bytes{container!~"POD",container!=""}) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_request_mb
                  expr: sum by (pod) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_mem_limit_mb
                  expr: sum by (pod) ((kube_pod_container_resource_limits_memory_bytes / (1024*1024) )> 0 or kube_pod_info < bool 0)
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-network-rules
                interval: 1m
                rules:
                - record: tkg_pod_network_tx_bytes
                  expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_bytes
                  expr: sum by (pod) (rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_tx_packets
                  expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_packets
                  expr: sum by (pod) (rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_tx_dropped_packets
                  expr: sum by (pod) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_dropped_packets
                  expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_tx_errors
                  expr: sum by (pod) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_rx_errors
                  expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_bytes
                  expr: sum by (pod) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_packets
                  expr: sum by (pod) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_drop_packets
                  expr: sum by (pod) (rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_network_total_errors
                  expr: sum by (pod) (rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]))
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-pod-other-rules
                interval: 1m
                rules:
                - record: tkg_pod_health_container_restarts_1hr_count
                  expr: sum by (pod) (increase(kube_pod_container_status_restarts_total[1h]))
                  labels:
                    job: kubernetes-nodes-cadvisor
                - record: tkg_pod_health_unhealthy_count
                  expr: min_over_time(sum by (pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m])
                  labels:
                    job: kubernetes-nodes-cadvisor
              - name: vmw-telco-node-cpu-rules
                interval: 1m
                rules:
                - record: tkg_node_cpu_capacity_core
                  expr: sum by (node) (kube_node_status_capacity_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_allocate_core
                  expr: sum by (node) (kube_node_status_allocatable_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_usage_seconds
                  expr: (label_replace(sum by (instance) (rate(container_cpu_usage_seconds_total[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_throttled_seconds
                  expr: sum by (instance) (rate(container_cpu_cfs_throttled_seconds_total[5m]))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_request_core
                  expr: sum by (node) (kube_pod_container_resource_requests_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_cpu_limits_core
                  expr: sum by (node) (kube_pod_container_resource_limits_cpu_cores)
                  labels:
                    job: kubernetes-service-endpoints
              - name: vmw-telco-node-mem-rules
                interval: 1m
                rules:
                - record: tkg_node_mem_capacity_mb
                  expr: sum by (node) (kube_node_status_capacity_memory_bytes / (1024*1024))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_allocate_mb
                  expr: sum by (node) (kube_node_status_allocatable_memory_bytes / (1024*1024))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_request_mb
                  expr: sum by (node) (kube_pod_container_resource_requests_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_limits_mb
                  expr: sum by (node) (kube_pod_container_resource_limits_memory_bytes) / (1024*1024)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_available_mb
                  expr: sum by (node) ((node_memory_MemAvailable_bytes / (1024*1024) ))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_free_mb
                  expr: sum by (node) ((node_memory_MemFree_bytes / (1024*1024) ))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_usage_mb
                  expr: (label_replace(sum by (instance) (container_memory_usage_bytes{container!~"POD",container!=""}) / (1024*1024), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_mem_free_pc
                  expr: sum ((node_memory_MemFree_bytes{job="kubernetes-pods"} / node_memory_MemTotal_bytes) *100) by (node)
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_oom_kill
                  expr: sum by(node) (node_vmstat_oom_kill)
                  labels:
                    job: kubernetes-service-endpoints
              - name: vmw-telco-node-network-rules
                interval: 1m
                rules:
                - record: tkg_node_network_tx_bytes
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_bytes
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_tx_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_tx_dropped_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_dropped_packets
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_tx_errors
                  expr: (label_replace(sum by (instance) (rate(container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_rx_errors
                  expr: (label_replace(sum by (instance) (rate(container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m])), "node", "$1", "instance", "(.*)"))
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_bytes
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_bytes_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_bytes_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_packets
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_drop_packets
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_packets_dropped_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_network_total_errors
                  expr: label_replace((sum by (instance) (rate (container_network_transmit_errors_total{container!~"POD",pod!="",image!=""}[5m]) + rate (container_network_receive_errors_total{container!~"POD",pod!="",image!=""}[5m]))), "node", "$1", "instance", "(.*)")
                  labels:
                    job: kubernetes-service-endpoints
              - name: vmw-telco-node-other-rules
                interval: 1m
                rules:
                - record: tkg_node_status_mempressure_count
                  expr: sum by (node) (kube_node_status_condition{condition="MemoryPressure",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_diskpressure_count
                  expr: sum by (node) (kube_node_status_condition{condition="DiskPressure",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_pidpressure_count
                  expr: sum by (node) (kube_node_status_condition{condition="PIDPressure",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_networkunavailable_count
                  expr: sum by (node) (kube_node_status_condition{condition="NetworkUnavailable",status="true"})
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_etcdb_bytes
                  expr: (label_replace(etcd_db_total_size_in_bytes, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")) )
                  labels:
                    job: kubernetes-service-endpoints
                - record: tkg_node_status_apiserver_request_total
                  expr: sum((label_replace(apiserver_request_total, "instance", "$1", "instance", "(.+):(\\d+)")) * on (instance) group_left (node) (avg by (instance, node) (label_replace ((kube_pod_info), "instance", "$1", "host_ip", "(.*)")) )) by (node)
                  labels:
                    job: kubernetes-service-endpoints
        ingress:
          enabled: false
          virtual_host_fqdn: prometheus.system.tanzu
          prometheus_prefix: /
          alertmanager_prefix: /alertmanager/
          prometheusServicePort: 80
          alertmanagerServicePort: 80
        alertmanager:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 80
            targetPort: 9093
          pvc:
            accessMode: ReadWriteOnce
            storage: 2Gi
          config:
            alertmanager_yml: |
              global: {}
              receivers:
              - name: default-receiver
              templates:
              - '/etc/alertmanager/templates/*.tmpl'
              route:
                group_interval: 5m
                group_wait: 10s
                receiver: default-receiver
                repeat_interval: 3h
        kube_state_metrics:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 80
            targetPort: 8080
            telemetryPort: 81
            telemetryTargetPort: 8081
        node_exporter:
          daemonset:
            hostNetwork: false
            updatestrategy: RollingUpdate
          service:
            type: ClusterIP
            port: 9100
            targetPort: 9100
        pushgateway:
          deployment:
            replicas: 1
          service:
            type: ClusterIP
            port: 9091
            targetPort: 9091
        cadvisor:
          daemonset:
            updatestrategy: RollingUpdate
```
In this sample CR:

- The TSDB retention time in the prometheus.deployment.containers.args parameter is changed to 5 days instead of the default 42 days. A standalone fragment with only this change follows this list.
- Some recording rules are added to prometheus.config.recording_rules_yml. Customize them or add more as needed.
- The prometheus.service.type is changed to NodePort so that Prometheus can be integrated with external components such as vROps or Grafana. See Prometheus service type.
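
For example, if you only want to change the retention time, the relevant part of values.yaml is the container argument list. This fragment is a sketch based on the defaults in the table above; because Prometheus replaces the whole argument list, the override must repeat every default argument, not only the retention flag:

```yaml
prometheus:
  deployment:
    containers:
      args:
      - --storage.tsdb.retention.time=5d   # changed from the default 42d
      - --config.file=/etc/config/prometheus.yml
      - --storage.tsdb.path=/data
      - --web.console.libraries=/etc/prometheus/console_libraries2
      - --web.console.templates=/etc/prometheus/consoles
      - --web.enable-lifecycle
```
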
Prometheus service type
By default, Prometheus is deployed with a service type of ClusterIP, which means it is not exposed outside the cluster.
There are three options available for prometheus.service.type:
- ClusterIP – With the default configuration, the Prometheus service can be accessed only from within the workload cluster. The service can also be exposed through an ingress, but this depends on the ingress controller and requires additional manual configuration.
- NodePort (recommended) – Exposes the Prometheus service on a node port. TCA does not support specifying the node port; Kubernetes allocates a random port from the high range (between 30000 and 32767). To determine the allocated node port after configuration, view the service configuration on the workload cluster with the command:

  ```
  kubectl get svc -n tanzu-system-monitoring prometheus-server
  ```

  As shown in the following output, prometheus-server is exposed on node port 32020, so other external components can integrate with Prometheus at the URL http://<cluster-endpoint-ip>:32020.

  ```
  capv@cp0-control-plane-kz5k6 [ ~ ]$ kubectl get svc -n tanzu-system-monitoring prometheus-server
  NAME                TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
  prometheus-server   NodePort   100.65.8.127   <none>        80:32020/TCP   25s
  ```

- LoadBalancer – Uses a load balancer provider on Kubernetes to expose the service. VMware recommends the Avi load balancer, which is deployed by the load-balancer-and-ingress-service addon; other load balancer providers can be used, but they are not supported by VMware. TCA does not support specifying a static VIP for the Prometheus service; a VIP is allocated from the default VIP pool, and other external components can then integrate with Prometheus at the URL http://<prometheus-VIP>. A minimal values.yaml fragment for this option follows this list.
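
The following is a minimal sketch of the values.yaml fragment for the LoadBalancer option. It assumes a load balancer provider (for example, the Avi load balancer deployed by the load-balancer-and-ingress-service addon) is already available in the cluster; the port values simply repeat the defaults from the parameter table:

```yaml
prometheus:
  service:
    # Expose Prometheus through the cluster's load balancer provider.
    # The VIP is allocated from the default VIP pool and cannot be set statically.
    type: LoadBalancer
    port: 80          # service port (default)
    targetPort: 9090  # Prometheus container port (default)
```

After the VIP is allocated, external components can reach Prometheus at http://<prometheus-VIP>, which serves the Prometheus web UI and HTTP API.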