This topic provides reference information for the Prometheus package.
About Prometheus and Alertmanager
Prometheus (https://prometheus.io/) is a systems and service monitoring system. Prometheus collects metrics from configured targets at given intervals, evaluates rule expressions, and displays the results. Alertmanager is used to trigger alerts when certain conditions are observed to have been met.
Install the Prometheus package:
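TKG packages are installed with the tanzu CLI or, because packages are delivered through kapp-controller, declaratively with a PackageInstall resource. The sketch below is illustrative only, not the documented procedure: the namespace, service account, Secret name, and version string are placeholders for your environment.

```yaml
# Hypothetical declarative install of the Prometheus package.
# Verify the real version with: tanzu package available list prometheus.tanzu.vmware.com
apiVersion: packaging.carvel.dev/v1alpha1
kind: PackageInstall
metadata:
  name: prometheus
  namespace: my-packages               # placeholder namespace
spec:
  serviceAccountName: my-packages-sa   # placeholder service account with install permissions
  packageRef:
    refName: prometheus.tanzu.vmware.com
    versionSelection:
      constraints: 2.27.0+vmware.1-tkg.1   # placeholder version
  values:
  - secretRef:
      name: prometheus-data-values     # Secret wrapping prometheus-data-values.yaml
```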
Prometheus Components
The Prometheus package installs the containers listed in the table below on a TKG cluster. The package pulls the containers from the VMware public registry specified in the package repository.
Container | Resource Type | Replicas | Description |
---|---|---|---|
prometheus-alertmanager | Deployment | 1 | Handles alerts sent by client applications such as the Prometheus server. |
prometheus-cadvisor | DaemonSet | 5 | Analyzes and exposes resource usage and performance data from running containers. |
prometheus-kube-state-metrics | Deployment | 1 | Monitors node status and capacity, replica-set compliance, pod, job, and cronjob status, and resource requests and limits. |
prometheus-node-exporter | DaemonSet | 5 | Exporter for hardware and OS metrics exposed by the kernel. |
prometheus-pushgateway | Deployment | 1 | Service that allows you to push metrics from jobs that cannot be scraped. |
prometheus-server | Deployment | 1 | Provides core functionality, including scraping, rule processing, and alerting. |
Prometheus Data Values
Below is an example prometheus-data-values.yaml file.

Note the following:

- Ingress is enabled (ingress: enabled: true).
- Ingress is configured for URLs ending in /alertmanager/ (alertmanager_prefix:) and / (prometheus_prefix:).
- The Prometheus FQDN is prometheus.system.tanzu (virtual_host_fqdn:).
- Provide your own custom certificate in the ingress section (tls.crt, tls.key, ca.crt).
- The PVC for Alertmanager is 2 GiB. For storageClassName, provide the default storage policy.
- The PVC for Prometheus is 20 GiB. For storageClassName, provide the vSphere storage policy.
```yaml
namespace: prometheus-monitoring
alertmanager:
  config:
    alertmanager_yml: |
      global: {}
      receivers:
      - name: default-receiver
      templates:
      - '/etc/alertmanager/templates/*.tmpl'
      route:
        group_interval: 5m
        group_wait: 10s
        receiver: default-receiver
        repeat_interval: 3h
  deployment:
    replicas: 1
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    updateStrategy: Recreate
  pvc:
    accessMode: ReadWriteOnce
    storage: 2Gi
    storageClassName: default
  service:
    port: 80
    targetPort: 9093
    type: ClusterIP
ingress:
  alertmanager_prefix: /alertmanager/
  alertmanagerServicePort: 80
  enabled: true
  prometheus_prefix: /
  prometheusServicePort: 80
  tlsCertificate:
    ca.crt: |
      -----BEGIN CERTIFICATE-----
      MIIFczCCA1ugAwIBAgIQTYJITQ3SZ4BBS9UzXfJIuTANBgkqhkiG9w0BAQsFADBM
      ...
      w0oGuTTBfxSMKs767N3G1q5tz0mwFpIqIQtXUSmaJ+9p7IkpWcThLnyYYo1IpWm/
      ZHtjzZMQVA==
      -----END CERTIFICATE-----
    tls.crt: |
      -----BEGIN CERTIFICATE-----
      MIIHxTCCBa2gAwIBAgITIgAAAAQnSpH7QfxTKAAAAAAABDANBgkqhkiG9w0BAQsF
      ...
      YYsIjp7/f+Pk1DjzWx8JIAbzItKLucDreAmmDXqk+DrBP9LYqtmjB0n7nSErgK8G
      sA3kGCJdOkI0kgF10gsinaouG2jVlwNOsw==
      -----END CERTIFICATE-----
    tls.key: |
      -----BEGIN PRIVATE KEY-----
      MIIJRAIBADANBgkqhkiG9w0BAQEFAASCCS4wggkqAgEAAoICAQDOGHT8I12KyQGS
      ...
      l1NzswracGQIzo03zk/X3Z6P2YOea4BkZ0Iwh34wOHJnTkfEeSx6y+oSFMcFRthT
      yfFCZUk/sVCc/C1a4VigczXftUGiRrTR
      -----END PRIVATE KEY-----
  virtual_host_fqdn: prometheus.system.tanzu
kube_state_metrics:
  deployment:
    replicas: 1
  service:
    port: 80
    targetPort: 8080
    telemetryPort: 81
    telemetryTargetPort: 8081
    type: ClusterIP
node_exporter:
  daemonset:
    hostNetwork: false
    updatestrategy: RollingUpdate
  service:
    port: 9100
    targetPort: 9100
    type: ClusterIP
prometheus:
  pspNames: "vmware-system-restricted"
  config:
    alerting_rules_yml: |
      {}
    alerts_yml: |
      {}
    prometheus_yml: |
      global:
        evaluation_interval: 1m
        scrape_interval: 1m
        scrape_timeout: 10s
      rule_files:
      - /etc/config/alerting_rules.yml
      - /etc/config/recording_rules.yml
      - /etc/config/alerts
      - /etc/config/rules
      scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
        - targets: ['localhost:9090']
      - job_name: 'kube-state-metrics'
        static_configs:
        - targets: ['prometheus-kube-state-metrics.prometheus.svc.cluster.local:8080']
      - job_name: 'node-exporter'
        static_configs:
        - targets: ['prometheus-node-exporter.prometheus.svc.cluster.local:9100']
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
      - job_name: kubernetes-nodes-cadvisor
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - regex: (.+)
          replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
          source_labels:
          - __meta_kubernetes_node_name
          target_label: __metrics_path__
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      - job_name: kubernetes-apiservers
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - action: keep
          regex: default;kubernetes;https
          source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      alerting:
        alertmanagers:
        - scheme: http
          static_configs:
          - targets:
            - alertmanager.prometheus.svc:80
        - kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - source_labels: [__meta_kubernetes_namespace]
            regex: default
            action: keep
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: prometheus
            action: keep
          - source_labels: [__meta_kubernetes_pod_label_component]
            regex: alertmanager
            action: keep
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
            regex: .*
            action: keep
          - source_labels: [__meta_kubernetes_pod_container_port_number]
            regex:
            action: drop
    recording_rules_yml: |
      groups:
      - name: kube-apiserver.rules
        interval: 3m
        rules:
        - expr: |2
            (
              (
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[1d]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1d])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[1d]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[1d]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[1d]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[1d]))
          labels:
            verb: read
          record: apiserver_request:burnrate1d
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[1h]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[1h])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[1h]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[1h]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[1h]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[1h]))
          labels:
            verb: read
          record: apiserver_request:burnrate1h
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[2h]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[2h])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[2h]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[2h]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[2h]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[2h]))
          labels:
            verb: read
          record: apiserver_request:burnrate2h
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[30m]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30m])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[30m]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[30m]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[30m]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[30m]))
          labels:
            verb: read
          record: apiserver_request:burnrate30m
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[3d]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[3d])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[3d]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[3d]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[3d]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[3d]))
          labels:
            verb: read
          record: apiserver_request:burnrate3d
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[5m])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[5m]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[5m]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[5m]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))
          labels:
            verb: read
          record: apiserver_request:burnrate5m
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[6h]))
                -
                (
                  (sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[6h])) or vector(0))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[6h]))
                  +
                  sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[6h]))
                )
              )
              +
              # errors
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET",code=~"5.."}[6h]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[6h]))
          labels:
            verb: read
          record: apiserver_request:burnrate6h
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[1d]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1d]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
          labels:
            verb: write
          record: apiserver_request:burnrate1d
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[1h]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
          labels:
            verb: write
          record: apiserver_request:burnrate1h
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[2h]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[2h]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
          labels:
            verb: write
          record: apiserver_request:burnrate2h
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[30m]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30m]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
          labels:
            verb: write
          record: apiserver_request:burnrate30m
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[3d]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[3d]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
          labels:
            verb: write
          record: apiserver_request:burnrate3d
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[5m]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[5m]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
          labels:
            verb: write
          record: apiserver_request:burnrate5m
        - expr: |2
            (
              (
                # too slow
                sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
                -
                sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",le="1"}[6h]))
              )
              +
              sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[6h]))
            )
            /
            sum(rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
          labels:
            verb: write
          record: apiserver_request:burnrate6h
        - expr: |
            sum by (code,resource) (rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))
          labels:
            verb: read
          record: code_resource:apiserver_request_total:rate5m
        - expr: |
            sum by (code,resource) (rate(apiserver_request_total{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
          labels:
            verb: write
          record: code_resource:apiserver_request_total:rate5m
        - expr: |
            histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET"}[5m]))) > 0
          labels:
            quantile: "0.99"
            verb: read
          record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
        - expr: |
            histogram_quantile(0.99, sum by (le, resource) (rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0
          labels:
            quantile: "0.99"
            verb: write
          record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
        - expr: |2
            sum(rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod)
            /
            sum(rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod)
          record: cluster:apiserver_request_duration_seconds:mean5m
        - expr: |
            histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
          labels:
            quantile: "0.99"
          record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
        - expr: |
            histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
          labels:
            quantile: "0.9"
          record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
        - expr: |
            histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) without(instance, pod))
          labels:
            quantile: "0.5"
          record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
      - interval: 3m
        name: kube-apiserver-availability.rules
        rules:
        - expr: |2
            1 - (
              (
                # write too slow
                sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
                -
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
              )
              +
              (
                # read too slow
                sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d]))
                -
                (
                  (sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) or vector(0))
                  +
                  sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
                  +
                  sum(increase(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
                )
              )
              +
              # errors
              sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
            )
            /
            sum(code:apiserver_request_total:increase30d)
          labels:
            verb: all
          record: apiserver_request:availability30d
        - expr: |2
            1 - (
              sum(increase(apiserver_request_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET"}[30d]))
              -
              (
                # too slow
                (sum(increase(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope=~"resource|",le="0.1"}[30d])) or vector(0))
                +
                sum(increase(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="namespace",le="0.5"}[30d]))
                +
                sum(increase(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers",verb=~"LIST|GET",scope="cluster",le="5"}[30d]))
              )
              +
              # errors
              sum(code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
            )
            /
            sum(code:apiserver_request_total:increase30d{verb="read"})
          labels:
            verb: read
          record: apiserver_request:availability30d
        - expr: |2
            1 - (
              (
                # too slow
                sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d]))
                -
                sum(increase(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE",le="1"}[30d]))
              )
              +
              # errors
              sum(code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
            )
            /
            sum(code:apiserver_request_total:increase30d{verb="write"})
          labels:
            verb: write
          record: apiserver_request:availability30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"2.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"2.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"2.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"2.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"2.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"2.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"3.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"3.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"3.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"3.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"3.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"3.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"4.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"4.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"4.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"4.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"4.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"4.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="LIST",code=~"5.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="GET",code=~"5.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="POST",code=~"5.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PUT",code=~"5.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="PATCH",code=~"5.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code, verb) (increase(apiserver_request_total{job="kubernetes-apiservers",verb="DELETE",code=~"5.."}[30d]))
          record: code_verb:apiserver_request_total:increase30d
        - expr: |
            sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"})
          labels:
            verb: read
          record: code:apiserver_request_total:increase30d
        - expr: |
            sum by (code) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
          labels:
            verb: write
          record: code:apiserver_request_total:increase30d
    rules_yml: |
      {}
  deployment:
    configmapReload:
      containers:
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://127.0.0.1:9090/-/reload
    containers:
      args:
      - --storage.tsdb.retention.time=42d
      - --config.file=/etc/config/prometheus.yml
      - --storage.tsdb.path=/data
      - --web.console.libraries=/etc/prometheus/console_libraries
      - --web.console.templates=/etc/prometheus/consoles
      - --web.enable-lifecycle
    replicas: 1
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    updateStrategy: Recreate
  pvc:
    accessMode: ReadWriteOnce
    storage: 20Gi
    storageClassName: default
  service:
    port: 80
    targetPort: 9090
    type: ClusterIP
pushgateway:
  deployment:
    replicas: 1
  service:
    port: 9091
    targetPort: 9091
    type: ClusterIP
```
Prometheus Configuration

The Prometheus configuration is set in the prometheus-data-values.yaml file. The following tables list and describe the available parameters.
Parameter | Description | Type | Default |
---|---|---|---|
monitoring.namespace | Namespace where Prometheus will be deployed | string | tanzu-system-monitoring |
monitoring.create_namespace | Flag indicating whether to create the namespace specified by monitoring.namespace | boolean | false |
monitoring.prometheus_server.config.prometheus_yaml | Kubernetes cluster monitoring configuration details to pass to Prometheus | yaml file | prometheus.yaml |
monitoring.prometheus_server.config.alerting_rules_yaml | Detailed alerting rules defined in Prometheus | yaml file | alerting_rules.yaml |
monitoring.prometheus_server.config.recording_rules_yaml | Detailed recording rules defined in Prometheus | yaml file | recording_rules.yaml |
monitoring.prometheus_server.service.type | Type of service used to expose Prometheus. Supported value: ClusterIP | string | ClusterIP |
monitoring.prometheus_server.enable_alerts.kubernetes_api | Enable SLO alerting for the Kubernetes API in Prometheus | boolean | true |
monitoring.prometheus_server.sc.aws_type | AWS type defined for the storageclass on AWS | string | gp2 |
monitoring.prometheus_server.sc.aws_fsType | AWS file system type defined for the storageclass on AWS | string | ext4 |
monitoring.prometheus_server.sc.allowVolumeExpansion | Defines whether volume expansion is allowed for the storageclass on AWS | boolean | true |
monitoring.prometheus_server.pvc.annotations | Storage class annotations | map | {} |
monitoring.prometheus_server.pvc.storage_class | Storage class to use for the persistent volume claim. By default this is null and the default provisioner is used | string | null |
monitoring.prometheus_server.pvc.accessMode | Defines the access mode for the persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany | string | ReadWriteOnce |
monitoring.prometheus_server.pvc.storage | Defines the storage size for the persistent volume claim. | string | 8Gi |
monitoring.prometheus_server.deployment.replicas | Number of prometheus replicas | integer | 1 |
monitoring.prometheus_server.image.repository | Location of the repository with the Prometheus image. The default is the public VMware registry. Change this value if you are using a private repository (for example, an air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.prometheus_server.image.name | Name of the Prometheus image | string | prometheus |
monitoring.prometheus_server.image.tag | Prometheus image tag. You may need to update this value if you are upgrading the version. | string | v2.17.1_vmware.1 |
monitoring.prometheus_server.image.pullPolicy | Prometheus image pull policy | string | IfNotPresent |
monitoring.alertmanager.config.slack_demo | Slack notification configuration for Alertmanager | string | slack_demo: name: slack_demo slack_configs: - api_url: https://hooks.slack.com channel: '#alertmanager-test' |
monitoring.alertmanager.config.email_receiver | Email notification configuration for Alertmanager | string | email_receiver: name: email-receiver email_configs: - to: [email protected] send_resolved: false from: [email protected] smarthost: smtp.example.com:25 require_tls: false |
monitoring.alertmanager.service.type | Type of service used to expose Alertmanager. Supported value: ClusterIP | string | ClusterIP |
monitoring.alertmanager.image.repository | Location of the repository with the Alertmanager image. The default is the public VMware registry. Change this value if you are using a private repository (for example, an air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.alertmanager.image.name | Name of the Alertmanager image | string | alertmanager |
monitoring.alertmanager.image.tag | Alertmanager image tag. You may need to update this value if you are upgrading the version. | string | v0.20.0_vmware.1 |
monitoring.alertmanager.image.pullPolicy | Alertmanager image pull policy | string | IfNotPresent |
monitoring.alertmanager.pvc.annotations | Storage class annotations | map | {} |
monitoring.alertmanager.pvc.storage_class | Storage class to use for the persistent volume claim. By default this is null and the default provisioner is used. | string | null |
monitoring.alertmanager.pvc.accessMode | Defines the access mode for the persistent volume claim. Supported values: ReadWriteOnce, ReadOnlyMany, ReadWriteMany | string | ReadWriteOnce |
monitoring.alertmanager.pvc.storage | Defines the storage size for the persistent volume claim. | string | 2Gi |
monitoring.alertmanager.deployment.replicas | Number of alertmanager replicas | integer | 1 |
monitoring.kube_state_metrics.image.repository | Repository containing the kube-state-metrics image. The default is the public VMware registry. Change this value if you are using a private repository (for example, an air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.kube_state_metrics.image.name | Name of the kube-state-metrics image | string | kube-state-metrics |
monitoring.kube_state_metrics.image.tag | kube-state-metrics image tag. You may need to update this value if you are upgrading the version. | string | v1.9.5_vmware.1 |
monitoring.kube_state_metrics.image.pullPolicy | kube-state-metrics image pull policy | string | IfNotPresent |
monitoring.kube_state_metrics.deployment.replicas | Number of kube-state-metrics replicas | integer | 1 |
monitoring.node_exporter.image.repository | Repository containing the node-exporter image. The default is the public VMware registry. Change this value if you are using a private repository (for example, an air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.node_exporter.image.name | Name of the node-exporter image | string | node-exporter |
monitoring.node_exporter.image.tag | node-exporter image tag. You may need to update this value if you are upgrading the version. | string | v0.18.1_vmware.1 |
monitoring.node_exporter.image.pullPolicy | node-exporter image pull policy | string | IfNotPresent |
monitoring.node_exporter.hostNetwork | If set to hostNetwork: true, the pod can use the node's network namespace and network resources. | boolean | false |
monitoring.node_exporter.deployment.replicas | Number of node-exporter replicas | integer | 1 |
monitoring.pushgateway.image.repository | Repository containing the pushgateway image. The default is the public VMware registry. Change this value if you are using a private repository (for example, an air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.pushgateway.image.name | Name of the pushgateway image | string | pushgateway |
monitoring.pushgateway.image.tag | pushgateway image tag. You may need to update this value if you are upgrading the version. | string | v1.2.0_vmware.1 |
monitoring.pushgateway.image.pullPolicy | pushgateway image pull policy | string | IfNotPresent |
monitoring.pushgateway.deployment.replicas | Number of pushgateway replicas | integer | 1 |
monitoring.cadvisor.image.repository | Repository containing the cadvisor image. The default is the public VMware registry. Change this value if you are using a private repository (for example, an air-gapped environment). | string | projects.registry.vmware.com/tkg/prometheus |
monitoring.cadvisor.image.name | Name of the cadvisor image | string | cadvisor |
monitoring.cadvisor.image.tag | cadvisor image tag. You may need to update this value if you are upgrading the version. | string | v0.36.0_vmware.1 |
monitoring.cadvisor.image.pullPolicy | cadvisor image pull policy | string | IfNotPresent |
monitoring.cadvisor.deployment.replicas | Number of cadvisor replicas | integer | 1 |
monitoring.ingress.enabled | Enable/disable ingress for prometheus and alertmanager | boolean | false. To use ingress, set this field to true |
monitoring.ingress.virtual_host_fqdn | Hostname for accessing Prometheus and Alertmanager | string | prometheus.system.tanzu |
monitoring.ingress.prometheus_prefix | Path prefix for prometheus | string | / |
monitoring.ingress.alertmanager_prefix | Path prefix for alertmanager | string | /alertmanager/ |
monitoring.ingress.tlsCertificate.tls.crt | Optional certificate for ingress if you want to use your own TLS certificate. A self-signed certificate is generated by default | string | Generated cert |
monitoring.ingress.tlsCertificate.tls.key | Optional certificate private key for ingress if you want to use your own TLS certificate. | string | Generated cert key |
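In practice a data-values file overrides only a handful of these defaults. The fragment below is a minimal sketch that assumes the monitoring.* schema from the table above; the storage class name is a placeholder for your environment.

```yaml
# Minimal illustrative prometheus-data-values.yaml overriding selected defaults.
monitoring:
  create_namespace: true                 # create the target namespace if it does not exist
  namespace: tanzu-system-monitoring
  prometheus_server:
    pvc:
      storage_class: vsphere-default     # placeholder storage class name
      storage: 20Gi                      # larger PVC than the 8Gi default
  alertmanager:
    pvc:
      storage: 2Gi
  ingress:
    enabled: true                        # expose Prometheus and Alertmanager through ingress
    virtual_host_fqdn: prometheus.system.tanzu
```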
The following parameters configure the Prometheus server (set in prometheus_yml):

Parameter | Description | Type | Default |
---|---|---|---|
evaluation_interval | How frequently to evaluate rules | duration | 1m |
scrape_interval | How frequently to scrape targets | duration | 1m |
scrape_timeout | How long until a scrape request times out | duration | 10s |
rule_files | Rule files specifies a list of globs. Rules and alerts are read from all matching files | yaml file | |
scrape_configs | A list of scrape configurations. | list | |
job_name | The job name assigned to scraped metrics by default | string | |
kubernetes_sd_configs | List of Kubernetes service discovery configurations. | list | |
relabel_configs | List of target relabel configurations. | list | |
action | Action to perform based on the regex matching. | string | |
regex | Regular expression against which the extracted value is matched. | string | |
source_labels | The source labels select values from existing labels. | string | |
scheme | Configures the protocol scheme used for requests. | string | |
tls_config | Configures the scrape request's TLS settings. | string | |
ca_file | CA certificate with which to validate the API server certificate. | filename | |
insecure_skip_verify | Disable validation of the server certificate. | boolean | |
bearer_token_file | Optional bearer token file authentication information. | filename | |
replacement | Replacement value against which a regex replace is performed if the regular expression matches. | string | |
target_label | Label to which the resulting value is written in a replace action. | string |
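To see how these scrape parameters combine, here is a small self-contained job (not part of the package defaults; the job name and app label are invented) that discovers pods, keeps only those labeled app=example-app, and copies the pod name into a pod_name label:

```yaml
scrape_configs:
- job_name: example-app                  # invented job name
  scheme: https                          # protocol scheme used for scrape requests
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false          # validate the server certificate
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: pod                            # discover targets from the cluster's pods
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: example-app                   # keep only pods labeled app=example-app
    action: keep
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace                      # write the matched value into target_label
    regex: (.+)
    replacement: $1
    target_label: pod_name
```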
The following parameters configure Alertmanager (set in alertmanager_yml):

Parameter | Description | Type | Default |
---|---|---|---|
resolve_timeout | ResolveTimeout is the default value used by alertmanager if the alert does not include EndsAt | duration | 5m |
smtp_smarthost | The SMTP host used when sending emails. | string | |
slack_api_url | The Slack webhook URL. | string | global.slack_api_url |
pagerduty_url | The pagerduty URL to which API requests are sent. | string | global.pagerduty_url |
templates | Files from which custom notification template definitions are read | file path | |
group_by | Group alerts by label | string | |
group_interval | Sets the time to wait before sending a notification about new alerts that are added to a group | duration | 5m |
group_wait | How long to initially wait to send a notification for a group of alerts | duration | 30s |
repeat_interval | How long to wait before re-sending a notification if it has already been sent successfully for an alert | duration | 4h |
receivers | List of notification receivers. | list | |
severity | Severity of the incident. | string | |
channel | The channel or user to which notifications are sent. | string | |
html | The HTML body of the email notification. | string | |
text | The text body of the email notification. | string | |
send_resolved | Whether to notify about resolved alerts. | boolean | |
email_configs | Configurations for email integration | list |
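Putting these fields together, a minimal alertmanager_yml that routes every alert to a single email receiver might look like the following sketch; the SMTP host and addresses are placeholders:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: smtp.example.com:25   # placeholder SMTP host
  smtp_from: [email protected]      # placeholder sender address
route:
  group_by: ['alertname']               # group alerts that share an alert name
  group_wait: 30s                       # initial wait before the first notification
  group_interval: 5m                    # wait before notifying about new alerts in a group
  repeat_interval: 4h                   # wait before re-sending a successful notification
  receiver: email-receiver
receivers:
- name: email-receiver
  email_configs:
  - to: [email protected]               # placeholder recipient
    send_resolved: false                # do not notify when alerts resolve
    require_tls: false
```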
Annotations on pods allow fine-grained control of the scraping process. These annotations must be part of the pod metadata. They have no effect if set on other objects, such as Services or DaemonSets.
Pod Annotation | Description |
---|---|
prometheus.io/scrape | The default configuration scrapes all pods. If set to false, this annotation excludes the pod from the scraping process. |
prometheus.io/path | If the metrics path is not /metrics, define it with this annotation. |
prometheus.io/port | Scrape the pod on the indicated port instead of the pod's declared ports (defaults to a port-free target if no port is declared). |
The following DaemonSet manifest instructs Prometheus to scrape all of its pods on port 9102.
```yaml
apiVersion: apps/v1  # use apps/v1beta2 before Kubernetes 1.9, extensions/v1beta1 before 1.8
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: weave
  labels:
    app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9102'
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: gcr.io/google-containers/fluentd-elasticsearch:1.20
```