Prometheus Monitoring and Alerting: A Complete Setup Guide
Contents
- 1. Overview
- 2. Prometheus Architecture and Components
- 3. Deploying Prometheus
- 4. Configuring Monitoring Targets
- 5. Exporter Configuration
- 6. The AlertManager Alerting System
- 7. Alerting Rules
- 8. Visualization with Grafana
- 9. The PromQL Query Language
- 10. Enterprise Best Practices
- 11. Common Problems and Solutions
- 12. Summary
- Appendix
1. Overview
1.1 What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit. Originally built at SoundCloud, it is now a graduated project of the CNCF (Cloud Native Computing Foundation).
1.2 Core Features
- Multi-dimensional data model: time series identified by metric name and key/value labels
- Flexible query language: PromQL supports complex queries and aggregations
- No reliance on distributed storage: a single server node is self-contained and efficient
- HTTP pull model: metrics are scraped over HTTP
- Push support: short-lived jobs can push metrics through the Pushgateway
- Rich visualization integrations: first-class Grafana support
- Efficient alerting: alerts are handled by AlertManager
1.3 Use Cases
- Infrastructure monitoring: servers, network, storage
- Application performance monitoring: microservices, APIs, databases
- Business metrics: user counts, order volume, transaction value
- Container monitoring: Docker and Kubernetes clusters
- Custom monitoring: any domain-specific metrics
1.4 End-to-End Monitoring Architecture
┌─────────────────────────────────────────────────────────────┐
│                    Monitoring data flow                      │
└─────────────────────────────────────────────────────────────┘
Applications / servers / middleware
        ↓
Exporters (expose a /metrics endpoint)
        ↓
Prometheus (scrapes and stores the data)
        ↓
├─────────────┬─────────────┐
↓             ↓             ↓
Grafana   AlertManager   HTTP API
(dashboards) (alert handling) (query interface)
              ↓
      ├───────┼───────┐
      ↓       ↓       ↓
    Email  WeChat Work  DingTalk
2. Prometheus Architecture and Components
2.1 Core Components
| Component | Role | Required |
|---|---|---|
| Prometheus Server | Core server; scrapes and stores time-series data | ✅ Required |
| Exporters | Programs that expose metrics | ✅ Required |
| Pushgateway | Receives metrics pushed by short-lived jobs | ❌ Optional |
| AlertManager | Handles alert notifications | ⚠️ Recommended |
| Grafana | Data visualization | ⚠️ Recommended |
| Service Discovery | Automatic target discovery | ❌ Optional |
2.2 Data Model
Prometheus uses a multi-dimensional time-series data model:
<metric_name>{<label_name>=<label_value>, ...} value timestamp
Example:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234 1638360000
- metric_name: the metric's name (e.g. http_requests_total)
- labels: key/value pairs (e.g. method="GET")
- value: the sample value
- timestamp: the sample timestamp
2.3 Metric Types
| Type | Description | Examples |
|---|---|---|
| Counter | Monotonically increasing counter | total requests, total errors |
| Gauge | Value that can go up and down | CPU usage, memory in use |
| Histogram | Bucketed distribution | request latency distribution |
| Summary | Client-side quantile summary | 50th/90th/99th percentile latency |
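As a concrete illustration of these four types, here is a minimal sketch using the official Python client (prometheus-client); the metric names and port are illustrative, not part of this guide's setup:
# minimal demo of the four Prometheus metric types
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random, time

REQUESTS = Counter('demo_requests_total', 'Total requests handled')          # only ever increases
IN_FLIGHT = Gauge('demo_in_flight_requests', 'Requests currently running')   # can go up and down
LATENCY_H = Histogram('demo_request_latency_seconds', 'Latency histogram')   # bucketed distribution
LATENCY_S = Summary('demo_latency_summary_seconds', 'Latency count and sum') # note: the Python client's
                                                                             # Summary exposes count/sum only,
                                                                             # no quantiles

if __name__ == '__main__':
    start_http_server(8000)   # exposes /metrics on :8000
    while True:
        REQUESTS.inc()
        with IN_FLIGHT.track_inprogress():
            d = random.random()
            LATENCY_H.observe(d)
            LATENCY_S.observe(d)
            time.sleep(d)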
3. Deploying Prometheus
3.1 Deploying with Docker
3.1.1 Quick single-node setup
Create the configuration file prometheus.yml:
# Prometheus global configuration
global:
  scrape_interval: 15s       # scrape interval (default 15s)
  evaluation_interval: 15s   # rule evaluation interval
  scrape_timeout: 10s        # scrape timeout
  external_labels:           # external labels attached to every time series
    cluster: 'production'
    region: 'cn-hangzhou'

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

# alerting rule files to load
rule_files:
  - '/etc/prometheus/rules/*.yml'

# scrape target configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          env: 'production'
Create docker-compose.yml:
version: '3.8'
services:
  # Prometheus server
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'   # keep data for 30 days
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'              # enable hot reload
    networks:
      - monitoring

  # Node Exporter - host metrics
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: always
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  # AlertManager - alert handling
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring

  # Grafana - visualization
  grafana:
    image: grafana/grafana:10.2.2
    container_name: grafana
    restart: always
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - monitoring

  # Pushgateway - receives pushed metrics
  pushgateway:
    image: prom/pushgateway:v1.6.2
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"
    networks:
      - monitoring

volumes:
  prometheus-data:
  alertmanager-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge
Start the services:
# create the rules directory
mkdir -p rules
# start everything
docker-compose up -d
# check service status
docker-compose ps
# tail the Prometheus logs
docker-compose logs -f prometheus
# tail all logs
docker-compose logs -f
Service endpoints:
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Grafana: http://localhost:3000 (admin/admin123)
- Node Exporter: http://localhost:9100/metrics
- Pushgateway: http://localhost:9091
3.1.2 Verifying the deployment
# check scrape target status
curl http://localhost:9090/api/v1/targets
# check that the configuration took effect
curl http://localhost:9090/api/v1/status/config
# hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
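The raw JSON from /api/v1/targets is verbose; a small sketch (assuming Prometheus on localhost:9090) that summarizes target health from the same endpoint:
# list scrape targets and their health
import requests

resp = requests.get('http://localhost:9090/api/v1/targets', timeout=5)
resp.raise_for_status()
for t in resp.json()['data']['activeTargets']:
    # 'health' is "up", "down" or "unknown"; 'lastError' explains failures
    print(f"{t['labels'].get('job', ''):<12} {t['scrapeUrl']:<42} "
          f"{t['health']:<8} {t.get('lastError', '')}")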
3.2 Deploying on Kubernetes
3.2.1 Create the namespace
kubectl create namespace monitoring
3.2.2 Create the ConfigMap (Prometheus configuration)
Create prometheus-configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'k8s-cluster'
        environment: 'production'

    alerting:
      alertmanagers:
        - kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - monitoring
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: alertmanager
              action: keep

    rule_files:
      - '/etc/prometheus/rules/*.yml'

    scrape_configs:
      # Prometheus itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      # Kubernetes nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      # Kubernetes pods
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
Apply it:
kubectl apply -f prometheus-configmap.yaml
3.2.3 Create RBAC permissions
Create prometheus-rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
Apply it:
kubectl apply -f prometheus-rbac.yaml
3.2.4 Create the alerting rules ConfigMap
Create prometheus-rules.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  node-alerts.yml: |
    groups:
      - name: node-alerts
        interval: 30s
        rules:
          - alert: NodeDown
            expr: up{job="kubernetes-nodes"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "The node has been offline for more than 1 minute"
          - alert: NodeHighCPU
            expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on node {{ $labels.instance }}"
              description: "CPU usage: {{ $value | humanize }}%"
          - alert: NodeHighMemory
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on node {{ $labels.instance }}"
              description: "Memory usage: {{ $value | humanize }}%"
Apply it:
kubectl apply -f prometheus-rules.yaml
3.2.5 Create the PVC
Create prometheus-pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-pvc
namespace: monitoring
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
  storageClassName: alicloud-disk-ssd  # adjust to your environment
Apply it:
kubectl apply -f prometheus-pvc.yaml
3.2.6 Create the Deployment
Create prometheus-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.48.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- containerPort: 9090
name: http
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: rules
mountPath: /etc/prometheus/rules
- name: data
mountPath: /prometheus
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: prometheus-config
- name: rules
configMap:
name: prometheus-rules
- name: data
persistentVolumeClaim:
claimName: prometheus-pvc
Apply it:
kubectl apply -f prometheus-deployment.yaml
3.2.7 Create the Service
Create prometheus-service.yaml:
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
ports:
- port: 9090
targetPort: 9090
protocol: TCP
name: http
selector:
app: prometheus
Apply it:
kubectl apply -f prometheus-service.yaml
3.2.8 Verifying the deployment
# check pod status
kubectl get pods -n monitoring
# check the Service
kubectl get svc -n monitoring
# view logs
kubectl logs -f deployment/prometheus -n monitoring
# port-forward for local access
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# then open http://localhost:9090
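While the port-forward from the last step is running, a quick programmatic sanity check is possible too; Prometheus exposes /-/healthy and /-/ready as plain HTTP probes (a sketch, assuming the forward maps to localhost:9090):
# probe the liveness and readiness endpoints
import requests

for path in ('/-/healthy', '/-/ready'):
    r = requests.get(f'http://localhost:9090{path}', timeout=5)
    print(path, r.status_code, r.text.strip())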
3.3 Kubernetes Core Concepts Explained
Deploying Prometheus on Kubernetes uses several core concepts. This section explains what they are and how they work.
3.3.1 What is a PVC (PersistentVolumeClaim)?
PVC = PersistentVolumeClaim
A PVC is the Kubernetes API object used to request storage resources: it is how a user "asks the cluster for storage space".
An analogy
Think of a PVC as a rental application:
- PVC: you state how much storage you need and of what type
- Kubernetes: acts as the "agent" that matches your request to suitable storage
- PV (PersistentVolume): the actual storage resource (cloud disk, NFS, local disk, ...)
The Kubernetes storage stack
┌─────────────────────────────────────────────────────────┐
│                Kubernetes storage stack                  │
└─────────────────────────────────────────────────────────┘
1. StorageClass
   ├─ What: the "type/tier" of storage
   └─ Examples: SSD cloud disk, HDD cloud disk, NFS
2. PersistentVolume (PV)
   ├─ What: the actual storage resource
   ├─ Created by an administrator, or dynamically via a StorageClass
   └─ Example: a 100GB SSD cloud disk on Alibaba Cloud
3. PersistentVolumeClaim (PVC)
   ├─ What: the user's storage request
   ├─ Created by the user
   └─ Example: a request for 100GB of SSD storage
4. Pods use PVCs
   └─ A Pod references storage by PVC name; the data is persisted
Example PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc          # name of the PVC
  namespace: monitoring
spec:
  # access modes
  accessModes:
    - ReadWriteOnce             # RWO: read-write by a single node
  # - ReadOnlyMany              # ROX: read-only by many nodes
  # - ReadWriteMany             # RWX: read-write by many nodes
  # requested capacity
  resources:
    requests:
      storage: 100Gi            # request 100GB
  # storage class (optional)
  storageClassName: alicloud-disk-ssd
PVC lifecycle
1. Create the PVC
   ↓
   Status: Pending (waiting to be bound)
2. Automatic binding to a PV
   ↓
   Kubernetes finds or provisions a matching PV
   Status: Bound
3. A Pod uses the PVC
   ↓
   Data is written to the bound PV
4. The Pod is deleted
   ↓
   The PVC and PV remain; no data is lost
5. The PVC is deleted
   ↓
   The PV and underlying storage are handled per the reclaim policy
Key PVC properties
| Property | Description |
|---|---|
| Persistence | PVC and data survive Pod deletion |
| Independence | A PVC's lifecycle is independent of any Pod |
| Dynamic provisioning | PVs can be created automatically (requires StorageClass support) |
| Cross-Pod sharing | Multiple Pods may access it, depending on the access mode |
| Capacity guarantee | At least the requested capacity is available |
Checking PVC status
# list PVCs
kubectl get pvc -n monitoring
# sample output
NAME             STATUS   VOLUME                   CAPACITY   ACCESS MODES   STORAGECLASS        AGE
prometheus-pvc   Bound    pvc-a1b2c3d4-5678-90ab   100Gi      RWO            alicloud-disk-ssd   5d
# inspect a PVC
kubectl describe pvc prometheus-pvc -n monitoring
3.3.2 ConfigMaps and Volume Mounts
ConfigMap = Configuration Map
A ConfigMap stores non-sensitive configuration data (key/value pairs or whole files).
What ConfigMaps are for
1. Separating configuration from images
Without a ConfigMap:
# configuration baked into the image
COPY prometheus.yml /etc/prometheus/
- ❌ Changing the configuration means rebuilding the image
- ❌ Each environment needs its own image
- ❌ Configuration changes require a redeploy
With a ConfigMap:
volumes:
  - name: config
    configMap:
      name: prometheus-config
- ✅ Configuration is managed independently; no image changes needed
- ✅ One image works across environments (dev/test/prod)
- ✅ Configuration updates can take effect dynamically
2. Mapping keys to files
This is the single most important ConfigMap concept:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  # the key becomes the file name
  # the value becomes the file content
  prometheus.yml: |          # ← key = file name
    global:                  # ← value starts here = file content
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
After mounting:
# the container's file system
/etc/prometheus/
└── prometheus.yml     # ← the key became the file name
# the file content is the value
$ cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
3. Multiple files per ConfigMap
One ConfigMap can hold several files:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  # first file
  node-alerts.yml: |
    groups:
      - name: node-alerts
        rules:
          - alert: NodeDown
            expr: up == 0
  # second file
  app-alerts.yml: |
    groups:
      - name: app-alerts
        rules:
          - alert: HighErrorRate
            expr: rate(errors[5m]) > 0.05
  # third file
  db-alerts.yml: |
    groups:
      - name: db-alerts
        rules:
          - alert: MySQLDown
            expr: mysql_up == 0
The resulting file layout after mounting:
/etc/prometheus/rules/
├── node-alerts.yml    # ← key 1 became a file
├── app-alerts.yml     # ← key 2 became a file
└── db-alerts.yml      # ← key 3 became a file
How the mapping works
┌─────────────────────────────────────────────────────────┐
│                  ConfigMap structure                     │
└─────────────────────────────────────────────────────────┘
ConfigMap: prometheus-config
│
├─ data:
│   ├─ key: "prometheus.yml"
│   │   └─ value: "global:\n  scrape_interval: 15s..."
│   │
│   ├─ key: "rules.yml"
│   │   └─ value: "groups:\n  - name: alerts..."
│   │
│   └─ key: "config.json"
│       └─ value: "{\"port\": 8080}"
        ↓ mounted into the Pod (mountPath: /etc/prometheus)
┌─────────────────────────────────────────────────────────┐
│          Pod file system (inside the container)          │
└─────────────────────────────────────────────────────────┘
/etc/prometheus/
│
├─ prometheus.yml          ← the key became the file name
│   └─ content: global:    ← the value became the file content
│              scrape_interval: 15s
│
├─ rules.yml
│   └─ content: groups:
│              - name: alerts
│
└─ config.json
    └─ content: {"port": 8080}
3.3.3 Volumes in Detail
Our Kubernetes Deployment defines three kinds of volumes:
volumes:
  # 1. ConfigMap volume - configuration files
  - name: config
    configMap:
      name: prometheus-config
  # 2. ConfigMap volume - rule files
  - name: rules
    configMap:
      name: prometheus-rules
  # 3. PVC volume - persistent data
  - name: data
    persistentVolumeClaim:
      claimName: prometheus-pvc
Volume comparison
| Volume | Type | Source | Purpose | Persistent | Access |
|---|---|---|---|---|---|
| config | ConfigMap | prometheus-config | holds prometheus.yml | ❌ | read-only |
| rules | ConfigMap | prometheus-rules | holds the alerting rule files | ❌ | read-only |
| data | PVC | prometheus-pvc | holds the time-series data | ✅ | read-write |
A complete mount example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.48.0
          # mount points inside the container
          volumeMounts:
            # mount the configuration file
            - name: config                     # refers to the 'config' volume below
              mountPath: /etc/prometheus       # mounted at this container path
            # mount the rule files
            - name: rules                      # refers to the 'rules' volume below
              mountPath: /etc/prometheus/rules
            # mount the data directory
            - name: data                       # refers to the 'data' volume below
              mountPath: /prometheus
      # volume definitions
      volumes:
        # configuration volume (from a ConfigMap)
        - name: config
          configMap:
            name: prometheus-config
        # rules volume (from a ConfigMap)
        - name: rules
          configMap:
            name: prometheus-rules
        # data volume (from a PVC)
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-pvc
Resulting layout inside the container:
# container file system
/etc/prometheus/               # ← 'config' mount point
├── prometheus.yml             # ← from the prometheus-config ConfigMap
└── rules/                     # ← 'rules' mount point
    ├── node-alerts.yml        # ← from the prometheus-rules ConfigMap
    ├── app-alerts.yml
    └── db-alerts.yml
/prometheus/                   # ← 'data' mount point
├── chunks_head/               # ← Prometheus data (persisted)
├── wal/
└── 01ABCD.../
3.3.4 The Global Impact of Editing a ConfigMap
Key question: if you modify a ConfigMap, are all Pods that reference it affected?
Answer: yes. Once a ConfigMap changes, every Pod that references it is affected.
The automatic update mechanism
1. Edit the ConfigMap
   kubectl edit configmap prometheus-config -n monitoring
   ↓
2. Kubernetes notices the change
   ↓
3. After roughly 30 seconds to 2 minutes
   ↓
4. The mounted files inside every referencing Pod are updated automatically
   ↓
5. Whether the application picks it up depends on the application:
   - some reload automatically (e.g. Nginx)
   - some need an explicit reload (e.g. Prometheus)
   - some only pick it up after a restart
An example of the global impact
Suppose three Pods all reference the same ConfigMap:
# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {
      "log_level": "info",
      "timeout": 30
    }
After editing the ConfigMap:
kubectl edit configmap app-config
# change log_level from "info" to "debug"
Result:
- ✅ Pod1's (development) configuration file is updated
- ✅ Pod2's (testing) configuration file is updated
- ✅ Pod3's (production) configuration file is updated too
- ⚠️ Every Pod that references the ConfigMap is affected!
Verifying it hands-on
# step 1: inspect the current configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
# step 2: look at the file inside the Pod
kubectl exec -it deployment/prometheus -n monitoring -- \
  cat /etc/prometheus/prometheus.yml
# step 3: edit the ConfigMap
kubectl edit configmap prometheus-config -n monitoring
# change something, e.g. scrape_interval: 15s → 30s
# step 4: wait a minute or two, then check the file in the Pod again
kubectl exec -it deployment/prometheus -n monitoring -- \
  cat /etc/prometheus/prometheus.yml
# the file has been updated automatically
# step 5: tell Prometheus to reload its configuration
kubectl exec -it deployment/prometheus -n monitoring -- \
  curl -X POST http://localhost:9090/-/reload
Cases that do NOT auto-update
ConfigMaps consumed in the following ways are not updated automatically:
1. As environment variables
env:
  - name: LOG_LEVEL
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: log_level
❌ Environment variables are set at Pod creation and never change afterwards
✅ Fix: restart the Pod
2. Via subPath
volumeMounts:
  - name: config
    mountPath: /etc/app/config.json
    subPath: config.json   # ← subPath mount
❌ subPath mounts do not receive updates
✅ Fix: mount the whole directory instead of using subPath
Best practices to contain the blast radius
Option 1: separate ConfigMaps per environment
# development
configMap:
  name: app-config-dev
# testing
configMap:
  name: app-config-test
# production
configMap:
  name: app-config-prod
Option 2: immutable ConfigMaps
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {"log_level": "info"}
immutable: true   # ← mark as immutable
Properties:
- ✅ Edits are rejected; you must delete and recreate
- ✅ Prevents accidental changes
- ✅ Better performance (Kubernetes stops watching it)
- ❌ Updating requires recreating the ConfigMap and the Pods
Option 3: versioned ConfigMaps
# create a new version for every update
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v1   # ← version 1
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2   # ← version 2
Pin the version at deploy time:
volumes:
  - name: config
    configMap:
      name: app-config-v2   # switch to the new version
3.3.5 ConfigMap vs PVC at a Glance
| Property | ConfigMap volume | PVC volume |
|---|---|---|
| Purpose | configuration and rule files | business data, database files |
| Persistence | ❌ re-mounted after Pod deletion | ✅ data survives Pod deletion |
| Access | read-only | read-write |
| Size limit | 1MB max | as requested (e.g. 100GB) |
| Updates | auto-synced (1-2 minutes) | written by the application |
| Typical use | app config, environment settings | databases, logs, monitoring data |
| Performance | memory-backed, fast | disk I/O, comparatively slow |
| Sharing | many Pods can share one ConfigMap | depends on the access mode |
| Versioning | can be versioned by name | via snapshots/backups |
3.3.6 Recommended Practices
ConfigMap recommendations
✅ Do:
- Use separate ConfigMaps for dev/test/prod
- Mark production ConfigMaps with immutable: true
- Put important configuration changes through the full release process
- Back up before editing: kubectl get configmap prometheus-config -o yaml > backup.yaml
- Validate configuration syntax first: promtool check config prometheus.yml
❌ Avoid:
- Editing production ConfigMaps directly with kubectl edit
- Sharing one ConfigMap across environments
- Changing configuration without a backup
- Not verifying that the application picked up a change
PVC recommendations
✅ Do:
- Choose a StorageClass that matches the data's importance
- Back up important data regularly (snapshots or backup tooling)
- Monitor storage usage and expand in time
- Use separate PVCs for separate concerns (logs vs data)
- Set a sensible reclaim policy (Retain for production)
❌ Avoid:
- Keeping important data on ephemeral volumes defined inline in the Pod
- Requesting RWX (ReadWriteMany) without considering the access modes
- Deleting a PVC without a backup
- Ignoring storage capacity alerts
3.3.7 Troubleshooting Guide
Inspecting ConfigMaps
# list ConfigMaps
kubectl get configmap -n monitoring
# show a ConfigMap's content
kubectl get configmap prometheus-config -n monitoring -o yaml
# describe a ConfigMap
kubectl describe configmap prometheus-config -n monitoring
Inspecting PVCs
# list PVCs
kubectl get pvc -n monitoring
# describe a PVC
kubectl describe pvc prometheus-pvc -n monitoring
# show the PVs the PVCs are bound to
kubectl get pv
Verifying volume mounts
# open a shell inside the Pod
kubectl exec -it deployment/prometheus -n monitoring -- sh
# check the mounted ConfigMap files
ls -l /etc/prometheus/
cat /etc/prometheus/prometheus.yml
# check the mounted rule files
ls -l /etc/prometheus/rules/
cat /etc/prometheus/rules/node-alerts.yml
# check the PVC-backed data directory
ls -l /prometheus/
du -sh /prometheus/   # data size
df -h /prometheus/    # disk usage
Watching for ConfigMap changes
# watch ConfigMaps in real time
kubectl get configmap -n monitoring --watch
# show events for a ConfigMap
kubectl get events --field-selector involvedObject.name=prometheus-config \
  -n monitoring
4. Configuring Monitoring Targets
4.1 Static Configuration
Suitable when the set of targets is fixed.
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      # several Node Exporters
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          env: 'production'
          datacenter: 'dc1'
      # application services
      - targets:
          - 'app1.example.com:8080'
          - 'app2.example.com:8080'
        labels:
          app: 'microservice'
          env: 'production'
4.2 Service Discovery
4.2.1 Kubernetes service discovery
Annotations control whether a target is scraped:
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    prometheus.io/scrape: "true"    # enable scraping
    prometheus.io/port: "8080"      # metrics port
    prometheus.io/path: "/metrics"  # metrics path
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
4.2.2 File-based service discovery
Create targets.json:
[
{
"targets": ["192.168.1.10:9100", "192.168.1.11:9100"],
"labels": {
"env": "production",
"job": "node"
}
},
{
"targets": ["app1:8080", "app2:8080"],
"labels": {
"env": "production",
"job": "application"
}
}
]
Configure it in prometheus.yml:
scrape_configs:
- job_name: 'file-sd'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
refresh_interval: 30s
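Since Prometheus re-reads these files within refresh_interval, the target list can be regenerated by any tool with no reload required. A hedged sketch of generating it from an inventory (the inventory dict and output path below are assumptions matching the config above):
# generate a file-SD target list
import json

inventory = {
    'node':        ['192.168.1.10:9100', '192.168.1.11:9100'],
    'application': ['app1:8080', 'app2:8080'],
}

groups = [
    {'targets': targets, 'labels': {'env': 'production', 'job': job}}
    for job, targets in inventory.items()
]

with open('/etc/prometheus/targets/targets.json', 'w') as f:
    json.dump(groups, f, indent=2)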
5. Exporter Configuration
5.1 Node Exporter
Exposes hardware and OS metrics for Linux/Unix hosts.
5.1.1 Installing (binary)
# download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
# unpack
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# run it
./node_exporter &
# or install it for systemd
sudo cp node_exporter /usr/local/bin/
Create the systemd unit /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
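The --collector.textfile.directory flag enabled above lets any local process publish its own metrics: node_exporter picks up *.prom files in that directory written in the Prometheus text format. A sketch (the metric name is illustrative); writing to a temp file and renaming keeps node_exporter from reading a half-written file:
# publish a custom metric via the textfile collector
import os, tempfile, time

TEXTFILE_DIR = '/var/lib/node_exporter/textfile_collector'

def write_metric(name: str, value: float, help_text: str) -> None:
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix='.tmp')
    with os.fdopen(fd, 'w') as f:
        f.write(f'# HELP {name} {help_text}\n')
        f.write(f'# TYPE {name} gauge\n')
        f.write(f'{name} {value}\n')
    os.rename(tmp, os.path.join(TEXTFILE_DIR, 'custom.prom'))  # atomic swap

write_metric('custom_backup_last_success_timestamp_seconds', time.time(),
             'Unix time of the last successful backup')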
5.1.2 Common metrics
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU time |
| node_memory_MemTotal_bytes | total memory |
| node_memory_MemAvailable_bytes | available memory |
| node_filesystem_size_bytes | filesystem size |
| node_filesystem_avail_bytes | available space |
| node_network_receive_bytes_total | network bytes received |
| node_network_transmit_bytes_total | network bytes sent |
| node_disk_read_bytes_total | disk bytes read |
| node_disk_written_bytes_total | disk bytes written |
| node_load1 | 1-minute load average |
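These metrics can be combined in PromQL and queried over HTTP. A sketch (assuming Prometheus on localhost:9090) computing memory utilisation per instance with the instant-query API:
# instant query against /api/v1/query
import requests

query = '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
resp = requests.get('http://localhost:9090/api/v1/query',
                    params={'query': query}, timeout=10)
for sample in resp.json()['data']['result']:
    ts, value = sample['value']          # [unix_ts, "value-as-string"]
    print(f"{sample['metric'].get('instance')}: {float(value):.1f}% used")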
5.2 Application Monitoring
5.2.1 Spring Boot applications
Add the dependencies (pom.xml):
<dependencies>
<!-- Actuator -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Micrometer Prometheus -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
</dependencies>
Configuration (application.yml):
management:
endpoints:
web:
exposure:
        include: '*'  # expose all endpoints
endpoint:
health:
show-details: always
prometheus:
enabled: true
metrics:
export:
prometheus:
enabled: true
tags:
application: ${spring.application.name}
environment: production
Custom metrics:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;
@Component
public class CustomMetrics {
private final Counter orderCounter;
private final Timer orderProcessTimer;
public CustomMetrics(MeterRegistry registry) {
        // counter
this.orderCounter = Counter.builder("orders_total")
.description("Total number of orders")
.tag("type", "online")
.register(registry);
        // timer
this.orderProcessTimer = Timer.builder("order_process_duration")
.description("Order processing time")
.register(registry);
}
public void recordOrder() {
orderCounter.increment();
}
public void recordProcessTime(Runnable task) {
orderProcessTimer.record(task);
}
}
Metrics endpoint:
http://localhost:8080/actuator/prometheus
Configure it in Prometheus:
scrape_configs:
- job_name: 'spring-boot-apps'
metrics_path: '/actuator/prometheus'
static_configs:
- targets:
- 'app1:8080'
- 'app2:8080'
5.2.2 A custom Python application
Install the Prometheus client:
pip install prometheus-client
Example code:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# define the metrics
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# business logic
def process_request(method, endpoint):
    # record the request
    status = random.choice(['200', '404', '500'])
    request_counter.labels(method=method, endpoint=endpoint, status=status).inc()
    # record the duration
    with request_duration.labels(method=method, endpoint=endpoint).time():
        time.sleep(random.random())  # simulate processing time
    # update the active connection count
    active_connections.set(random.randint(1, 100))

if __name__ == '__main__':
    # start the HTTP server that exposes /metrics
    start_http_server(8000)
    # simulate traffic
    while True:
        process_request('GET', '/api/users')
        process_request('POST', '/api/orders')
        time.sleep(1)
Once running, the metrics are exposed at:
http://localhost:8000/metrics
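The Histogram above uses the client's generic default buckets; for latency SLOs you would usually pin the bucket boundaries explicitly so percentiles are computed where you care about them. A sketch (the boundaries below are assumptions; tune them to your latency profile):
# histogram with explicit buckets
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),  # seconds; +Inf is added automatically
)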
5.3 Middleware Monitoring
5.3.1 MySQL Exporter
Deploy with Docker:
# docker-compose.yml
mysqld-exporter:
  image: prom/mysqld-exporter:v0.15.1
  container_name: mysqld-exporter
  restart: always
  ports:
    - "9104:9104"
  command:
    - '--mysqld.address=mysql:3306'
    - '--mysqld.username=exporter'
  environment:
    MYSQLD_EXPORTER_PASSWORD: "password"
  networks:
    - monitoring
Note: mysqld_exporter v0.15 dropped the old DATA_SOURCE_NAME environment variable in favour of the --mysqld.* flags plus MYSQLD_EXPORTER_PASSWORD shown here.
Create the MySQL user:
CREATE USER 'exporter'@'%' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
Prometheus configuration:
scrape_configs:
- job_name: 'mysql'
static_configs:
- targets: ['mysqld-exporter:9104']
labels:
instance: 'mysql-prod'
5.3.2 Redis Exporter
redis-exporter:
image: oliver006/redis_exporter:v1.55.0
container_name: redis-exporter
restart: always
ports:
- "9121:9121"
environment:
REDIS_ADDR: "redis:6379"
REDIS_PASSWORD: "your-redis-password"
networks:
- monitoring
Prometheus configuration:
scrape_configs:
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
5.3.3 MongoDB Exporter
mongodb-exporter:
image: percona/mongodb_exporter:0.40
container_name: mongodb-exporter
restart: always
ports:
- "9216:9216"
environment:
MONGODB_URI: "mongodb://exporter:password@mongodb:27017"
networks:
- monitoring
6. The AlertManager Alerting System
6.1 AlertManager Configuration
Create alertmanager.yml:
# global configuration
global:
  resolve_timeout: 5m            # how long before a silent alert is marked resolved
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

# routing
route:
  group_by: ['alertname', 'cluster', 'service']  # grouping keys
  group_wait: 10s          # wait to collect alerts of the same group
  group_interval: 10s      # interval between notifications for a group
  repeat_interval: 1h      # interval before re-sending
  receiver: 'default'      # default receiver
  # sub-routes
  routes:
    # send critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical-team'
      group_wait: 0s
      repeat_interval: 5m
    # database alerts go to the DBAs
    - match_re:
        service: mysql|redis|mongodb
      receiver: 'dba-team'
    # business alerts go to the dev team
    - match:
        team: dev
      receiver: 'dev-team'

# inhibition rules
inhibit_rules:
  # when a node is down, suppress that node's other alerts
  - source_match:
      severity: 'critical'
      alertname: 'NodeDown'
    target_match:
      severity: 'warning'
    equal: ['instance']
  # when a service is unavailable, suppress its high-latency alerts
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service', 'instance']

# receivers
receivers:
  # default receiver
  - name: 'default'
    email_configs:
      - to: 'ops-team@example.com'
        headers:
          Subject: '[Prometheus] {{ .GroupLabels.alertname }}'
        html: '{{ template "email.html" . }}'
    webhook_configs:
      - url: 'http://webhook-server:8080/alert'
        send_resolved: true
  # critical alerts receiver
  - name: 'critical-team'
    email_configs:
      - to: 'critical-alerts@example.com'
        send_resolved: true
    # WeChat Work
    wechat_configs:
      - corp_id: 'your-corp-id'
        api_secret: 'your-api-secret'
        to_party: '1'            # department ID
        agent_id: 'your-agent-id'
        message: '{{ template "wechat.default.message" . }}'
    # DingTalk
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
  # DBA team
  - name: 'dba-team'
    email_configs:
      - to: 'dba@example.com'
  # dev team
  - name: 'dev-team'
    email_configs:
      - to: 'dev-team@example.com'

# notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'
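To exercise this routing without waiting for Prometheus to fire anything, you can push a synthetic alert straight into AlertManager's v2 API. A sketch, assuming AlertManager is reachable on localhost:9093; the labels are illustrative and should match one of your routes:
# post a synthetic alert to AlertManager
import requests
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
alert = [{
    'labels': {'alertname': 'TestAlert', 'severity': 'critical', 'instance': 'test'},
    'annotations': {'summary': 'Synthetic alert for routing test'},
    'startsAt': now.isoformat(),
    'endsAt': (now + timedelta(minutes=5)).isoformat(),
}]
r = requests.post('http://localhost:9093/api/v2/alerts', json=alert, timeout=5)
print(r.status_code)  # 200 means the alert was accepted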
6.2 Custom Notification Templates
Create email.tmpl:
{{ define "email.html" }}
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<style>
body { font-family: Arial, sans-serif; }
.alert { border: 1px solid #ddd; padding: 10px; margin: 10px 0; border-radius: 5px; }
.critical { background-color: #ffebee; border-color: #f44336; }
.warning { background-color: #fff3e0; border-color: #ff9800; }
.info { background-color: #e3f2fd; border-color: #2196f3; }
.label { display: inline-block; padding: 2px 8px; margin: 2px; background-color: #e0e0e0; border-radius: 3px; font-size: 12px; }
</style>
</head>
<body>
<h2>Prometheus Alert Notification</h2>
<p><strong>Time:</strong> {{ .CommonAnnotations.timestamp }}</p>
<p><strong>Alert group:</strong> {{ .GroupLabels.alertname }}</p>
<h3>Firing alerts ({{ .Alerts.Firing | len }})</h3>
{{ range .Alerts.Firing }}
<div class="alert {{ .Labels.severity }}">
<h4>{{ .Labels.alertname }}</h4>
<p><strong>Severity:</strong> <span class="label">{{ .Labels.severity }}</span></p>
<p><strong>Instance:</strong> {{ .Labels.instance }}</p>
<p><strong>Description:</strong> {{ .Annotations.description }}</p>
<p><strong>Started:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
<details>
<summary>All labels</summary>
{{ range .Labels.SortedPairs }}
<span class="label">{{ .Name }}: {{ .Value }}</span>
{{ end }}
</details>
</div>
{{ end }}
{{ if .Alerts.Resolved }}
<h3>Resolved alerts ({{ .Alerts.Resolved | len }})</h3>
{{ range .Alerts.Resolved }}
<div class="alert info">
<h4>✓ {{ .Labels.alertname }} [resolved]</h4>
<p><strong>Instance:</strong> {{ .Labels.instance }}</p>
<p><strong>Resolved at:</strong> {{ .EndsAt.Format "2006-01-02 15:04:05" }}</p>
<p><strong>Duration:</strong> {{ .EndsAt.Sub .StartsAt }}</p>
</div>
{{ end }}
{{ end }}
<hr>
<p style="font-size: 12px; color: #666;">
Sent automatically by Prometheus AlertManager<br>
Details: <a href="http://prometheus.yourdomain.com">Prometheus</a> |
<a href="http://alertmanager.yourdomain.com">AlertManager</a>
</p>
</body>
</html>
{{ end }}
6.3 DingTalk Bot Integration
Create the DingTalk webhook adapter dingtalk-webhook.py:
from flask import Flask, request
import requests

app = Flask(__name__)
DINGTALK_WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"

@app.route('/webhook/dingtalk', methods=['POST'])
def dingtalk_webhook():
    data = request.json
    # build the DingTalk message
    alerts = data.get('alerts', [])
    firing_alerts = [a for a in alerts if a['status'] == 'firing']
    resolved_alerts = [a for a in alerts if a['status'] == 'resolved']
    message = {
        "msgtype": "markdown",
        "markdown": {
            "title": "Prometheus alert",
            "text": build_message(firing_alerts, resolved_alerts)
        },
        "at": {
            "isAtAll": False
        }
    }
    # forward it to DingTalk
    response = requests.post(DINGTALK_WEBHOOK, json=message)
    return {'status': 'ok'}, 200

def build_message(firing, resolved):
    text = "### 🚨 Prometheus Alert Notification\n\n"
    if firing:
        text += f"**Firing: {len(firing)} alert(s)**\n\n"
        for alert in firing:
            labels = alert['labels']
            annotations = alert['annotations']
            text += f"#### [{labels.get('severity', 'unknown')}] {labels.get('alertname')}\n"
            text += f"- **Instance:** {labels.get('instance')}\n"
            text += f"- **Description:** {annotations.get('description')}\n"
            text += f"- **Time:** {alert['startsAt']}\n\n"
    if resolved:
        text += f"\n**Resolved: {len(resolved)} alert(s)**\n\n"
        for alert in resolved:
            labels = alert['labels']
            text += f"- ✅ {labels.get('alertname')} ({labels.get('instance')})\n"
    return text

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
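The adapter can be exercised locally with a hand-built payload before wiring it into AlertManager. A sketch using a minimal subset of the AlertManager webhook format (real payloads carry more fields; the label values here are made up):
# send a fake AlertManager webhook payload to the adapter
import requests

payload = {
    'alerts': [{
        'status': 'firing',
        'labels': {'alertname': 'NodeHighCPU', 'severity': 'warning',
                   'instance': 'node1:9100'},
        'annotations': {'description': 'CPU usage above 80%'},
        'startsAt': '2025-01-01T00:00:00Z',
    }]
}
r = requests.post('http://localhost:8080/webhook/dingtalk', json=payload, timeout=5)
print(r.status_code, r.text)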
7. Alerting Rules
7.1 Infrastructure alerting rules
Create rules/infrastructure-alerts.yml:
groups:
  # node alerts
  - name: node-alerts
    interval: 30s
    rules:
      # node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "The node has been offline for more than 1 minute\n Current state: {{ $value }}"
      # high CPU usage
      - alert: NodeHighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage above 80%\n Current value: {{ $value | humanize }}%"
      # critically high CPU usage
      - alert: NodeCriticalCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 3m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Critically high CPU usage on node {{ $labels.instance }}"
          description: "CPU usage above 95%; the service may become unavailable\n Current value: {{ $value | humanize }}%"
      # high memory usage
      - alert: NodeHighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High memory usage on node {{ $labels.instance }}"
          description: "Memory usage above 85%\n Current value: {{ $value | humanize }}%"
      # low disk space
      - alert: NodeDiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "Low disk space on node {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is above 85% used\n Current value: {{ $value | humanize }}%"
      # critically low disk space
      - alert: NodeDiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 95
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Critically low disk space on node {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is above 95% used; act immediately\n Current value: {{ $value | humanize }}%"
      # high system load
      - alert: NodeHighLoad
        expr: node_load5 / count by (instance) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High system load on node {{ $labels.instance }}"
          description: "5-minute load average exceeds 2x the CPU core count\n Current value: {{ $value | humanize }}"
      # file descriptor pressure
      - alert: NodeHighFileDescriptors
        expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High file descriptor usage on node {{ $labels.instance }}"
          description: "File descriptor usage above 80%\n Current value: {{ $value | humanize }}%"
7.2 Application alerting rules
Create rules/application-alerts.yml:
groups:
  - name: application-alerts
    interval: 30s
    rules:
      # service down
      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
          team: dev
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "The service has been offline for more than 1 minute"
      # high HTTP error rate
      - alert: HighHTTPErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance, service)
          /
          sum(rate(http_requests_total[5m])) by (instance, service)
          * 100 > 5
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "High HTTP error rate on service {{ $labels.service }}"
          description: "5xx error rate above 5%\n Current value: {{ $value | humanize }}%"
      # slow API responses
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance, endpoint)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "Slow responses on API {{ $labels.endpoint }}"
          description: "P95 latency above 1 second\n Current value: {{ $value | humanize }}s"
      # high JVM heap usage
      - alert: HighJVMMemoryUsage
        expr: |
          (jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "High JVM heap usage on service {{ $labels.instance }}"
          description: "Heap usage above 85%\n Current value: {{ $value | humanize }}%"
      # high GC rate
      - alert: HighGCRate
        expr: rate(jvm_gc_collection_seconds_count[5m]) > 10
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "High GC rate on service {{ $labels.instance }}"
          description: "More than 10 GC runs per second\n Current value: {{ $value | humanize }}/s"
      # database connection pool nearly exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          (hikaricp_connections_active / hikaricp_connections_max) * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: dev
        annotations:
          summary: "Database connection pool nearly exhausted on {{ $labels.instance }}"
          description: "Pool usage above 90%\n Current value: {{ $value | humanize }}%"
7.3 Middleware alerting rules
Create rules/middleware-alerts.yml:
groups:
  # MySQL alerts
  - name: mysql-alerts
    interval: 30s
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
          team: dba
        annotations:
          summary: "MySQL {{ $labels.instance }} is down"
          description: "The MySQL instance is offline"
      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "Too many connections on MySQL {{ $labels.instance }}"
          description: "Connection usage above 80%\n Current value: {{ $value | humanize }}%"
      - alert: MySQLSlowQueries
        expr: rate(mysql_global_status_slow_queries[5m]) > 10
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "Too many slow queries on MySQL {{ $labels.instance }}"
          description: "More than 10 slow queries per second\n Current value: {{ $value | humanize }}/s"
  # Redis alerts
  - name: redis-alerts
    interval: 30s
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
          team: dba
        annotations:
          summary: "Redis {{ $labels.instance }} is down"
          description: "The Redis instance is offline"
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "High memory usage on Redis {{ $labels.instance }}"
          description: "Memory usage above 85%\n Current value: {{ $value | humanize }}%"
      - alert: RedisHighConnectedClients
        expr: redis_connected_clients > 1000
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "Too many clients on Redis {{ $labels.instance }}"
          description: "More than 1000 connected clients\n Current value: {{ $value }}"
8. Visualization with Grafana
8.1 Adding the Prometheus Data Source
1. Open Grafana at http://localhost:3000
2. Log in (admin/admin123)
3. In the left menu, go to Configuration > Data Sources
4. Click Add data source
5. Select Prometheus
6. Configure:
   Name: Prometheus
   URL: http://prometheus:9090
   Access: Server (default)
7. Click Save & Test
8.2 Importing Popular Dashboards
8.2.1 Node Exporter Full
1. In the left menu, go to + > Import
2. Enter dashboard ID 1860
3. Click Load
4. Select the Prometheus data source
5. Click Import
8.2.2 Other recommended dashboards
| Dashboard | ID | Description |
|---|---|---|
| Node Exporter Full | 1860 | complete node monitoring |
| Kubernetes Cluster Monitoring | 7249 | K8s cluster monitoring |
| Spring Boot Statistics | 6756 | Spring Boot application monitoring |
| MySQL Overview | 7362 | MySQL monitoring |
| Redis Dashboard | 11835 | Redis monitoring |
| JVM (Micrometer) | 4701 | JVM monitoring |
9. The PromQL Query Language
9.1 Basic Queries
# select a metric
http_requests_total
# filter by label
http_requests_total{method="GET"}
# multiple labels
http_requests_total{method="GET", status="200"}
# regex label match
http_requests_total{status=~"2.."}
# label not equal
http_requests_total{status!="200"}
9.2 Range Queries
# the last 5 minutes of data
http_requests_total[5m]
# the last hour of data
http_requests_total[1h]
9.3 Aggregation
# sum
sum(http_requests_total)
# sum grouped by label
sum(http_requests_total) by (service)
# average
avg(node_cpu_seconds_total)
# maximum
max(node_memory_MemTotal_bytes)
# minimum
min(node_memory_MemAvailable_bytes)
# count
count(up == 1)
9.4 Rate Functions
# per-second rate of increase (for Counters)
rate(http_requests_total[5m])
# instantaneous rate of increase
irate(http_requests_total[5m])
# total increase over a window
increase(http_requests_total[1h])
9.5 Frequently Used Queries
# CPU utilisation
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# memory utilisation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# disk utilisation
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# QPS (requests per second)
sum(rate(http_requests_total[5m]))
# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# availability (last 24 hours)
avg_over_time(up[24h]) * 100
# top 5 pods by CPU
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))
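The same expressions can also be evaluated over a time window with the /api/v1/query_range endpoint, which is what Grafana does under the hood. A sketch (assuming Prometheus on localhost:9090): CPU utilisation per instance for the last hour at 60-second resolution:
# range query against /api/v1/query_range
import time
import requests

query = '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
end = time.time()
resp = requests.get('http://localhost:9090/api/v1/query_range',
                    params={'query': query, 'start': end - 3600,
                            'end': end, 'step': '60s'}, timeout=10)
for series in resp.json()['data']['result']:
    values = series['values']            # list of [unix_ts, "value"] pairs
    print(series['metric'].get('instance'), f'{len(values)} samples')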
10. Enterprise Best Practices
10.1 High Availability
10.1.1 Prometheus federation
# global Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".*"}'
    static_configs:
      - targets:
          - 'prometheus-shard1:9090'
          - 'prometheus-shard2:9090'
          - 'prometheus-shard3:9090'
10.2 Data Retention
# Prometheus startup flags
--storage.tsdb.retention.time=30d    # keep 30 days of data
--storage.tsdb.retention.size=100GB  # or at most 100GB
10.3 Metric Conventions
10.3.1 Naming
- Use lowercase letters and underscores
- Prefix with the application or library name
- Suffix with the unit (optional)
# good names
http_requests_total
http_request_duration_seconds
database_connections_active
# bad names
httpRequests
RequestDuration
DB-Connections
10.3.2 Labels
- Use meaningful label names
- Avoid high-cardinality labels such as user_id or request_id (the sketch below shows why)
- Keep the number of labels moderate
# good labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# avoid high cardinality
http_requests_total{user_id="12345"}   # not recommended
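The reason is that every distinct combination of label values becomes its own time series, and Prometheus must index and store each one. A sketch with the Python client (metric names are illustrative) making the blow-up visible:
# demonstrate label cardinality growth
from prometheus_client import Counter, REGISTRY

good = Counter('demo_requests_total', 'Requests', ['endpoint'])
for _ in range(10_000):
    good.labels(endpoint='/api/users').inc()   # still just one series

bad = Counter('demo_bad_total', 'Requests', ['user_id'])
for i in range(10_000):
    bad.labels(user_id=str(i)).inc()           # 10,000 separate series

# count the samples that would be exposed at /metrics
samples = sum(len(m.samples) for m in REGISTRY.collect())
print('samples exposed:', samples)   # dominated by the 10,000 user_id series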
10.4 Alerting Conventions
10.4.1 Severity levels
| Level | Meaning | Response time | Handling |
|---|---|---|---|
| critical | Severe failure affecting the service | immediately | 24/7 on-call, phone/SMS notification |
| warning | Potential problem, needs attention | within 30 minutes | handled during working hours |
| info | Informational | none | logged only |
10.4.2 Alert design principles
- Actionable: every alert must require human intervention
- Accurate: avoid false positives with sensible thresholds and durations
- Concise: alert messages should be short and clear
- Tiered: handle alerts according to severity
- Avoid alert fatigue: group similar alerts
10.5 Performance Optimization
10.5.1 Recording rules
Pre-compute and store the results of expensive queries:
groups:
  - name: recording-rules
    interval: 30s
    rules:
      - record: instance:node_cpu_utilization:rate5m
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_utilization:ratio
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
Then query the pre-computed metric directly:
# use it like any other metric
instance:node_cpu_utilization:rate5m
11. Common Problems and Solutions
11.1 Prometheus uses too much memory
Causes:
- Too many scrape targets
- High metric cardinality
- Retention window too long
Remedies:
- Scrape fewer targets or increase the scrape interval
- Pre-compute with recording rules
- Shorten the retention window
- Raise the memory limit
# startup flag
--storage.tsdb.retention.time=15d
11.2 Query timeouts
Remedies:
- Optimize the PromQL query
- Use recording rules
- Raise the query timeout; this is a startup flag, not a prometheus.yml setting:
--query.timeout=2m
11.3 Alerting rules not firing
Steps to debug:
1. Check the rule file syntax: promtool check rules rules/*.yml
2. Check the Prometheus configuration: promtool check config prometheus.yml
3. Inspect the Alerts page in the Prometheus UI
4. Check the AlertManager configuration
12. Summary
12.1 Full Architecture Diagram
┌──────────────────────────────────────────────────────────────┐
│                 Monitoring system architecture                │
└──────────────────────────────────────────────────────────────┘
┌─────────────┐   ┌─────────────┐   ┌──────────────┐
│  Services   │   │ Middleware  │   │Infrastructure│
│  /actuator  │   │  Exporters  │   │   Node Exp   │
└──────┬──────┘   └──────┬──────┘   └──────┬───────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │ (metrics)
                         ↓
               ┌─────────────────┐
               │   Prometheus    │
               │ (scrape & store)│
               └────────┬────────┘
                        │
         ┌──────────────┼──────────────┐
         │              │              │
         ↓              ↓              ↓
  ┌────────────┐  ┌──────────┐  ┌──────────┐
  │  Grafana   │  │ AlertMgr │  │ HTTP API │
  │(dashboards)│  │ (alerts) │  │ (queries)│
  └────────────┘  └────┬─────┘  └──────────┘
                       │
            ┌──────────┼──────────┐
            ↓          ↓          ↓
        ┌──────┐ ┌───────────┐ ┌────────┐
        │Email │ │WeChat Work│ │DingTalk│
        └──────┘ └───────────┘ └────────┘
12.2 Setup Checklist
- ✅ Prometheus server deployed
- ✅ Node Exporter installed
- ✅ Application monitoring integrated
- ✅ Middleware exporters configured
- ✅ Service discovery configured
- ✅ Alerting rules written
- ✅ AlertManager deployed
- ✅ Notification channels configured
- ✅ Grafana deployed and configured
- ✅ Dashboards created
- ✅ Data backup strategy in place
- ✅ High-availability plan in place
Appendix
Appendix A: Common Exporters
| Exporter | Monitors | Port | Project |
|---|---|---|---|
| node_exporter | Linux/Unix hosts | 9100 | https://github.com/prometheus/node_exporter |
| mysqld_exporter | MySQL | 9104 | https://github.com/prometheus/mysqld_exporter |
| redis_exporter | Redis | 9121 | https://github.com/oliver006/redis_exporter |
| mongodb_exporter | MongoDB | 9216 | https://github.com/percona/mongodb_exporter |
| postgres_exporter | PostgreSQL | 9187 | https://github.com/prometheus-community/postgres_exporter |
| nginx_exporter | Nginx | 9113 | https://github.com/nginxinc/nginx-prometheus-exporter |
| kafka_exporter | Kafka | 9308 | https://github.com/danielqsj/kafka_exporter |
| blackbox_exporter | HTTP/TCP/ICMP | 9115 | https://github.com/prometheus/blackbox_exporter |
Appendix B: PromQL Cheat Sheet
# selectors
{job="prometheus"}
{job=~"prom.*"}
{job!="prometheus"}
# aggregation
sum()
avg()
max()
min()
count()
topk(5, )
bottomk(5, )
# rates
rate()    # good for alerting
irate()   # good for graphs
# time ranges
[5m]   # 5 minutes
[1h]   # 1 hour
[1d]   # 1 day
# operators
+ - * / %
== != > < >= <=
and or unless
# functions
abs()
ceil()
floor()
round()
histogram_quantile()
Document version: v1.0.0
Last updated: 2025-12-09
Author: Tiger IoT team
Scope: enterprise microservice monitoring systems
Happy monitoring!