Prometheus Monitoring and Alerting: A Complete Setup Guide
Contents
- 1. Overview
- 2. Prometheus Architecture and Components
- 3. Deploying Prometheus
- 4. Configuring Monitoring Targets
- 5. Exporter Configuration
- 6. The AlertManager Alerting System
- 7. Alerting Rules
- 8. Visualization with Grafana
- 9. The PromQL Query Language
- 10. Enterprise Best Practices
- 11. Common Problems and Solutions
- 12. Summary
- Appendix
1. Overview
1.1 What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit. Originally built at SoundCloud, it is now a graduated project of the CNCF (Cloud Native Computing Foundation).
1.2 Core Features
- Multi-dimensional data model: time series identified by metric name and key/value labels
- Flexible query language: PromQL supports complex queries and aggregations
- No reliance on distributed storage: a single server node is self-contained and efficient
- HTTP pull model: metrics are scraped over HTTP
- Push support: short-lived jobs can push metrics through the Pushgateway
- Rich visualization integrations: first-class Grafana support
- Efficient alerting: alerts are handled by AlertManager
1.3 Use Cases
- Infrastructure monitoring: servers, network, storage
- Application performance monitoring: microservices, APIs, databases
- Business metrics: user counts, order volume, transaction value
- Container monitoring: Docker and Kubernetes clusters
- Custom monitoring: any domain-specific metrics
1.4 End-to-End Monitoring Architecture
┌─────────────────────────────────────────────────────────────┐
│                    Monitoring data flow                      │
└─────────────────────────────────────────────────────────────┘
Applications / servers / middleware
        ↓
Exporters (expose a /metrics endpoint)
        ↓
Prometheus (scrapes and stores the data)
        ↓
├─────────────┬─────────────┐
↓             ↓             ↓
Grafana   AlertManager   HTTP API
(dashboards) (alert handling) (query interface)
              ↓
      ├───────┼───────┐
      ↓       ↓       ↓
    Email  WeChat Work  DingTalk
2. Prometheus Architecture and Components
2.1 Core Components
| Component | Role | Required |
|---|---|---|
| Prometheus Server | Core server; scrapes and stores time-series data | ✅ Required |
| Exporters | Programs that expose metrics | ✅ Required |
| Pushgateway | Receives metrics pushed by short-lived jobs | ❌ Optional |
| AlertManager | Handles alert notifications | ⚠️ Recommended |
| Grafana | Data visualization | ⚠️ Recommended |
| Service Discovery | Automatic target discovery | ❌ Optional |
2.2 Data Model
Prometheus uses a multi-dimensional time-series data model:
<metric_name>{<label_name>=<label_value>, ...} value timestamp
Example:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234 1638360000
- metric_name: the metric's name (e.g. http_requests_total)
- labels: key/value pairs (e.g. method="GET")
- value: the sample value
- timestamp: the sample timestamp
2.3 Metric Types
| Type | Description | Examples |
|---|---|---|
| Counter | Monotonically increasing counter | total requests, total errors |
| Gauge | Value that can go up and down | CPU usage, memory in use |
| Histogram | Bucketed distribution | request latency distribution |
| Summary | Client-side quantile summary | 50th/90th/99th percentile latency |
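As a concrete illustration of these four types, here is a minimal sketch using the official Python client (prometheus-client); the metric names and port are illustrative, not part of this guide's setup:
# minimal demo of the four Prometheus metric types
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random, time

REQUESTS = Counter('demo_requests_total', 'Total requests handled')          # only ever increases
IN_FLIGHT = Gauge('demo_in_flight_requests', 'Requests currently running')   # can go up and down
LATENCY_H = Histogram('demo_request_latency_seconds', 'Latency histogram')   # bucketed distribution
LATENCY_S = Summary('demo_latency_summary_seconds', 'Latency count and sum') # note: the Python client's
                                                                             # Summary exposes count/sum only,
                                                                             # no quantiles

if __name__ == '__main__':
    start_http_server(8000)   # exposes /metrics on :8000
    while True:
        REQUESTS.inc()
        with IN_FLIGHT.track_inprogress():
            d = random.random()
            LATENCY_H.observe(d)
            LATENCY_S.observe(d)
            time.sleep(d)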
3. Deploying Prometheus
3.1 Deploying with Docker
3.1.1 Quick single-node setup
Create the configuration file prometheus.yml:
# Prometheus global configuration
global:
  scrape_interval: 15s       # scrape interval (default 15s)
  evaluation_interval: 15s   # rule evaluation interval
  scrape_timeout: 10s        # scrape timeout
  external_labels:           # external labels attached to every time series
    cluster: 'production'
    region: 'cn-hangzhou'

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

# alerting rule files to load
rule_files:
  - '/etc/prometheus/rules/*.yml'

# scrape target configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          env: 'production'
Create docker-compose.yml:
version: '3.8'
services:
  # Prometheus server
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'   # keep data for 30 days
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'              # enable hot reload
    networks:
      - monitoring

  # Node Exporter - host metrics
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: always
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  # AlertManager - alert handling
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring

  # Grafana - visualization
  grafana:
    image: grafana/grafana:10.2.2
    container_name: grafana
    restart: always
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - monitoring

  # Pushgateway - receives pushed metrics
  pushgateway:
    image: prom/pushgateway:v1.6.2
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"
    networks:
      - monitoring

volumes:
  prometheus-data:
  alertmanager-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge
Start the services:
# create the rules directory
mkdir -p rules
# start everything
docker-compose up -d
# check service status
docker-compose ps
# tail the Prometheus logs
docker-compose logs -f prometheus
# tail all logs
docker-compose logs -f
Service endpoints:
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
- Grafana: http://localhost:3000 (admin/admin123)
- Node Exporter: http://localhost:9100/metrics
- Pushgateway: http://localhost:9091
3.1.2 Verifying the deployment
# check scrape target status
curl http://localhost:9090/api/v1/targets
# check that the configuration took effect
curl http://localhost:9090/api/v1/status/config
# hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
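The raw JSON from /api/v1/targets is verbose; a small sketch (assuming Prometheus on localhost:9090) that summarizes target health from the same endpoint:
# list scrape targets and their health
import requests

resp = requests.get('http://localhost:9090/api/v1/targets', timeout=5)
resp.raise_for_status()
for t in resp.json()['data']['activeTargets']:
    # 'health' is "up", "down" or "unknown"; 'lastError' explains failures
    print(f"{t['labels'].get('job', ''):<12} {t['scrapeUrl']:<42} "
          f"{t['health']:<8} {t.get('lastError', '')}")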
3.2 Deploying on Kubernetes
3.2.1 Create the namespace
kubectl create namespace monitoring
3.2.2 Create the ConfigMap (Prometheus configuration)
Create prometheus-configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'k8s-cluster'
        environment: 'production'

    alerting:
      alertmanagers:
        - kubernetes_sd_configs:
            - role: pod
              namespaces:
                names:
                  - monitoring
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: alertmanager
              action: keep

    rule_files:
      - '/etc/prometheus/rules/*.yml'

    scrape_configs:
      # Prometheus itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      # Kubernetes nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      # Kubernetes pods
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
Apply it:
kubectl apply -f prometheus-configmap.yaml
3.2.3 Create RBAC permissions
Create prometheus-rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- extensions
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
Apply it:
kubectl apply -f prometheus-rbac.yaml
3.2.4 Create the alerting rules ConfigMap
Create prometheus-rules.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  node-alerts.yml: |
    groups:
      - name: node-alerts
        interval: 30s
        rules:
          - alert: NodeDown
            expr: up{job="kubernetes-nodes"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.instance }} is down"
              description: "The node has been offline for more than 1 minute"
          - alert: NodeHighCPU
            expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on node {{ $labels.instance }}"
              description: "CPU usage: {{ $value | humanize }}%"
          - alert: NodeHighMemory
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on node {{ $labels.instance }}"
              description: "Memory usage: {{ $value | humanize }}%"
Apply it:
kubectl apply -f prometheus-rules.yaml
3.2.5 Create the PVC
Create prometheus-pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-pvc
namespace: monitoring
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
  storageClassName: alicloud-disk-ssd  # adjust to your environment
Apply it:
kubectl apply -f prometheus-pvc.yaml
3.2.6 Create the Deployment
Create prometheus-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.48.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- containerPort: 9090
name: http
resources:
requests:
cpu: 500m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: rules
mountPath: /etc/prometheus/rules
- name: data
mountPath: /prometheus
livenessProbe:
httpGet:
path: /-/healthy
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /-/ready
port: 9090
initialDelaySeconds: 30
timeoutSeconds: 10
volumes:
- name: config
configMap:
name: prometheus-config
- name: rules
configMap:
name: prometheus-rules
- name: data
persistentVolumeClaim:
claimName: prometheus-pvc
Apply it:
kubectl apply -f prometheus-deployment.yaml
3.2.7 Create the Service
Create prometheus-service.yaml:
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
type: ClusterIP
ports:
- port: 9090
targetPort: 9090
protocol: TCP
name: http
selector:
app: prometheus
Apply it:
kubectl apply -f prometheus-service.yaml
3.2.8 Verifying the deployment
# check pod status
kubectl get pods -n monitoring
# check the Service
kubectl get svc -n monitoring
# view logs
kubectl logs -f deployment/prometheus -n monitoring
# port-forward for local access
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# then open http://localhost:9090
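While the port-forward from the last step is running, a quick programmatic sanity check is possible too; Prometheus exposes /-/healthy and /-/ready as plain HTTP probes (a sketch, assuming the forward maps to localhost:9090):
# probe the liveness and readiness endpoints
import requests

for path in ('/-/healthy', '/-/ready'):
    r = requests.get(f'http://localhost:9090{path}', timeout=5)
    print(path, r.status_code, r.text.strip())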
3.3 Kubernetes Core Concepts Explained
Deploying Prometheus on Kubernetes uses several core concepts. This section explains what they are and how they work.
3.3.1 What is a PVC (PersistentVolumeClaim)?
PVC = PersistentVolumeClaim
A PVC is the Kubernetes API object used to request storage resources: it is how a user "asks the cluster for storage space".
An analogy
Think of a PVC as a rental application:
- PVC: you state how much storage you need and of what type
- Kubernetes: acts as the "agent" that matches your request to suitable storage
- PV (PersistentVolume): the actual storage resource (cloud disk, NFS, local disk, ...)
The Kubernetes storage stack
┌─────────────────────────────────────────────────────────┐
│                Kubernetes storage stack                  │
└─────────────────────────────────────────────────────────┘
1. StorageClass
   ├─ What: the "type/tier" of storage
   └─ Examples: SSD cloud disk, HDD cloud disk, NFS
2. PersistentVolume (PV)
   ├─ What: the actual storage resource
   ├─ Created by an administrator, or dynamically via a StorageClass
   └─ Example: a 100GB SSD cloud disk on Alibaba Cloud
3. PersistentVolumeClaim (PVC)
   ├─ What: the user's storage request
   ├─ Created by the user
   └─ Example: a request for 100GB of SSD storage
4. Pods use PVCs
   └─ A Pod references storage by PVC name; the data is persisted
Example PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc          # name of the PVC
  namespace: monitoring
spec:
  # access modes
  accessModes:
    - ReadWriteOnce             # RWO: read-write by a single node
  # - ReadOnlyMany              # ROX: read-only by many nodes
  # - ReadWriteMany             # RWX: read-write by many nodes
  # requested capacity
  resources:
    requests:
      storage: 100Gi            # request 100GB
  # storage class (optional)
  storageClassName: alicloud-disk-ssd
PVC lifecycle
1. Create the PVC
   ↓
   Status: Pending (waiting to be bound)
2. Automatic binding to a PV
   ↓
   Kubernetes finds or provisions a matching PV
   Status: Bound
3. A Pod uses the PVC
   ↓
   Data is written to the bound PV
4. The Pod is deleted
   ↓
   The PVC and PV remain; no data is lost
5. The PVC is deleted
   ↓
   The PV and underlying storage are handled per the reclaim policy
Key PVC properties
| Property | Description |
|---|---|
| Persistence | PVC and data survive Pod deletion |
| Independence | A PVC's lifecycle is independent of any Pod |
| Dynamic provisioning | PVs can be created automatically (requires StorageClass support) |
| Cross-Pod sharing | Multiple Pods may access it, depending on the access mode |
| Capacity guarantee | At least the requested capacity is available |
Checking PVC status
# list PVCs
kubectl get pvc -n monitoring
# sample output
NAME             STATUS   VOLUME                   CAPACITY   ACCESS MODES   STORAGECLASS        AGE
prometheus-pvc   Bound    pvc-a1b2c3d4-5678-90ab   100Gi      RWO            alicloud-disk-ssd   5d
# inspect a PVC
kubectl describe pvc prometheus-pvc -n monitoring
3.3.2 ConfigMaps and Volume Mounts
ConfigMap = Configuration Map
A ConfigMap stores non-sensitive configuration data (key/value pairs or whole files).
What ConfigMaps are for
1. Separating configuration from images
Without a ConfigMap:
# configuration baked into the image
COPY prometheus.yml /etc/prometheus/
- ❌ Changing the configuration means rebuilding the image
- ❌ Each environment needs its own image
- ❌ Configuration changes require a redeploy
With a ConfigMap:
volumes:
  - name: config
    configMap:
      name: prometheus-config
- ✅ Configuration is managed independently; no image changes needed
- ✅ One image works across environments (dev/test/prod)
- ✅ Configuration updates can take effect dynamically
2. Mapping keys to files
This is the single most important ConfigMap concept:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  # the key becomes the file name
  # the value becomes the file content
  prometheus.yml: |          # ← key = file name
    global:                  # ← value starts here = file content
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
After mounting:
# the container's file system
/etc/prometheus/
└── prometheus.yml     # ← the key became the file name
# the file content is the value
$ cat /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
3. Multiple files per ConfigMap
One ConfigMap can hold several files:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  # first file
  node-alerts.yml: |
    groups:
      - name: node-alerts
        rules:
          - alert: NodeDown
            expr: up == 0
  # second file
  app-alerts.yml: |
    groups:
      - name: app-alerts
        rules:
          - alert: HighErrorRate
            expr: rate(errors[5m]) > 0.05
  # third file
  db-alerts.yml: |
    groups:
      - name: db-alerts
        rules:
          - alert: MySQLDown
            expr: mysql_up == 0
The resulting file layout after mounting:
/etc/prometheus/rules/
├── node-alerts.yml    # ← key 1 became a file
├── app-alerts.yml     # ← key 2 became a file
└── db-alerts.yml      # ← key 3 became a file
How the mapping works
┌─────────────────────────────────────────────────────────┐
│                  ConfigMap structure                     │
└─────────────────────────────────────────────────────────┘
ConfigMap: prometheus-config
│
├─ data:
│   ├─ key: "prometheus.yml"
│   │   └─ value: "global:\n  scrape_interval: 15s..."
│   │
│   ├─ key: "rules.yml"
│   │   └─ value: "groups:\n  - name: alerts..."
│   │
│   └─ key: "config.json"
│       └─ value: "{\"port\": 8080}"
        ↓ mounted into the Pod (mountPath: /etc/prometheus)
┌─────────────────────────────────────────────────────────┐
│          Pod file system (inside the container)          │
└─────────────────────────────────────────────────────────┘
/etc/prometheus/
│
├─ prometheus.yml          ← the key became the file name
│   └─ content: global:    ← the value became the file content
│              scrape_interval: 15s
│
├─ rules.yml
│   └─ content: groups:
│              - name: alerts
│
└─ config.json
    └─ content: {"port": 8080}
3.3.3 Volumes in Detail
Our Kubernetes Deployment defines three kinds of volumes:
volumes:
  # 1. ConfigMap volume - configuration files
  - name: config
    configMap:
      name: prometheus-config
  # 2. ConfigMap volume - rule files
  - name: rules
    configMap:
      name: prometheus-rules
  # 3. PVC volume - persistent data
  - name: data
    persistentVolumeClaim:
      claimName: prometheus-pvc
Volume comparison
| Volume | Type | Source | Purpose | Persistent | Access |
|---|---|---|---|---|---|
| config | ConfigMap | prometheus-config | holds prometheus.yml | ❌ | read-only |
| rules | ConfigMap | prometheus-rules | holds the alerting rule files | ❌ | read-only |
| data | PVC | prometheus-pvc | holds the time-series data | ✅ | read-write |
A complete mount example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.48.0
          # mount points inside the container
          volumeMounts:
            # mount the configuration file
            - name: config                     # refers to the 'config' volume below
              mountPath: /etc/prometheus       # mounted at this container path
            # mount the rule files
            - name: rules                      # refers to the 'rules' volume below
              mountPath: /etc/prometheus/rules
            # mount the data directory
            - name: data                       # refers to the 'data' volume below
              mountPath: /prometheus
      # volume definitions
      volumes:
        # configuration volume (from a ConfigMap)
        - name: config
          configMap:
            name: prometheus-config
        # rules volume (from a ConfigMap)
        - name: rules
          configMap:
            name: prometheus-rules
        # data volume (from a PVC)
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-pvc
Resulting layout inside the container:
# container file system
/etc/prometheus/               # ← 'config' mount point
├── prometheus.yml             # ← from the prometheus-config ConfigMap
└── rules/                     # ← 'rules' mount point
    ├── node-alerts.yml        # ← from the prometheus-rules ConfigMap
    ├── app-alerts.yml
    └── db-alerts.yml
/prometheus/                   # ← 'data' mount point
├── chunks_head/               # ← Prometheus data (persisted)
├── wal/
└── 01ABCD.../
3.3.4 The Global Impact of Editing a ConfigMap
Key question: if you modify a ConfigMap, are all Pods that reference it affected?
Answer: yes. Once a ConfigMap changes, every Pod that references it is affected.
The automatic update mechanism
1. Edit the ConfigMap
   kubectl edit configmap prometheus-config -n monitoring
   ↓
2. Kubernetes notices the change
   ↓
3. After roughly 30 seconds to 2 minutes
   ↓
4. The mounted files inside every referencing Pod are updated automatically
   ↓
5. Whether the application picks it up depends on the application:
   - some reload automatically (e.g. Nginx)
   - some need an explicit reload (e.g. Prometheus)
   - some only pick it up after a restart
An example of the global impact
Suppose three Pods all reference the same ConfigMap:
# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {
      "log_level": "info",
      "timeout": 30
    }
After editing the ConfigMap:
kubectl edit configmap app-config
# change log_level from "info" to "debug"
Result:
- ✅ Pod1's (development) configuration file is updated
- ✅ Pod2's (testing) configuration file is updated
- ✅ Pod3's (production) configuration file is updated too
- ⚠️ Every Pod that references the ConfigMap is affected!
Verifying it hands-on
# step 1: inspect the current configuration
kubectl get configmap prometheus-config -n monitoring -o yaml
# step 2: look at the file inside the Pod
kubectl exec -it deployment/prometheus -n monitoring -- \
  cat /etc/prometheus/prometheus.yml
# step 3: edit the ConfigMap
kubectl edit configmap prometheus-config -n monitoring
# change something, e.g. scrape_interval: 15s → 30s
# step 4: wait a minute or two, then check the file in the Pod again
kubectl exec -it deployment/prometheus -n monitoring -- \
  cat /etc/prometheus/prometheus.yml
# the file has been updated automatically
# step 5: tell Prometheus to reload its configuration
kubectl exec -it deployment/prometheus -n monitoring -- \
  curl -X POST http://localhost:9090/-/reload
Cases that do NOT auto-update
ConfigMaps consumed in the following ways are not updated automatically:
1. As environment variables
env:
  - name: LOG_LEVEL
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: log_level
❌ Environment variables are set at Pod creation and never change afterwards
✅ Fix: restart the Pod
2. Via subPath
volumeMounts:
  - name: config
    mountPath: /etc/app/config.json
    subPath: config.json   # ← subPath mount
❌ subPath mounts do not receive updates
✅ Fix: mount the whole directory instead of using subPath
Best practices to contain the blast radius
Option 1: separate ConfigMaps per environment
# development
configMap:
  name: app-config-dev
# testing
configMap:
  name: app-config-test
# production
configMap:
  name: app-config-prod
Option 2: immutable ConfigMaps
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  config.json: |
    {"log_level": "info"}
immutable: true   # ← mark as immutable
Properties:
- ✅ Edits are rejected; you must delete and recreate
- ✅ Prevents accidental changes
- ✅ Better performance (Kubernetes stops watching it)
- ❌ Updating requires recreating the ConfigMap and the Pods
Option 3: versioned ConfigMaps
# create a new version for every update
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v1   # ← version 1
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-v2   # ← version 2
Pin the version at deploy time:
volumes:
  - name: config
    configMap:
      name: app-config-v2   # switch to the new version
3.3.5 ConfigMap vs PVC at a Glance
| Property | ConfigMap volume | PVC volume |
|---|---|---|
| Purpose | configuration and rule files | business data, database files |
| Persistence | ❌ re-mounted after Pod deletion | ✅ data survives Pod deletion |
| Access | read-only | read-write |
| Size limit | 1MB max | as requested (e.g. 100GB) |
| Updates | auto-synced (1-2 minutes) | written by the application |
| Typical use | app config, environment settings | databases, logs, monitoring data |
| Performance | memory-backed, fast | disk I/O, comparatively slow |
| Sharing | many Pods can share one ConfigMap | depends on the access mode |
| Versioning | can be versioned by name | via snapshots/backups |
3.3.6 Recommended Practices
ConfigMap recommendations
✅ Do:
- Use separate ConfigMaps for dev/test/prod
- Mark production ConfigMaps with immutable: true
- Put important configuration changes through the full release process
- Back up before editing: kubectl get configmap prometheus-config -o yaml > backup.yaml
- Validate configuration syntax first: promtool check config prometheus.yml
❌ Avoid:
- Editing production ConfigMaps directly with kubectl edit
- Sharing one ConfigMap across environments
- Changing configuration without a backup
- Not verifying that the application picked up a change
PVC recommendations
✅ Do:
- Choose a StorageClass that matches the data's importance
- Back up important data regularly (snapshots or backup tooling)
- Monitor storage usage and expand in time
- Use separate PVCs for separate concerns (logs vs data)
- Set a sensible reclaim policy (Retain for production)
❌ Avoid:
- Keeping important data on ephemeral volumes defined inline in the Pod
- Requesting RWX (ReadWriteMany) without considering the access modes
- Deleting a PVC without a backup
- Ignoring storage capacity alerts
3.3.7 Troubleshooting Guide
Inspecting ConfigMaps
# list ConfigMaps
kubectl get configmap -n monitoring
# show a ConfigMap's content
kubectl get configmap prometheus-config -n monitoring -o yaml
# describe a ConfigMap
kubectl describe configmap prometheus-config -n monitoring
Inspecting PVCs
# list PVCs
kubectl get pvc -n monitoring
# describe a PVC
kubectl describe pvc prometheus-pvc -n monitoring
# show the PVs the PVCs are bound to
kubectl get pv
Verifying volume mounts
# open a shell inside the Pod
kubectl exec -it deployment/prometheus -n monitoring -- sh
# check the mounted ConfigMap files
ls -l /etc/prometheus/
cat /etc/prometheus/prometheus.yml
# check the mounted rule files
ls -l /etc/prometheus/rules/
cat /etc/prometheus/rules/node-alerts.yml
# check the PVC-backed data directory
ls -l /prometheus/
du -sh /prometheus/   # data size
df -h /prometheus/    # disk usage
Watching for ConfigMap changes
# watch ConfigMaps in real time
kubectl get configmap -n monitoring --watch
# show events for a ConfigMap
kubectl get events --field-selector involvedObject.name=prometheus-config \
  -n monitoring
4. Configuring Monitoring Targets
4.1 Static Configuration
Suitable when the set of targets is fixed.
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      # several Node Exporters
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          env: 'production'
          datacenter: 'dc1'
      # application services
      - targets:
          - 'app1.example.com:8080'
          - 'app2.example.com:8080'
        labels:
          app: 'microservice'
          env: 'production'
4.2 Service Discovery
4.2.1 Kubernetes service discovery
Annotations control whether a target is scraped:
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    prometheus.io/scrape: "true"    # enable scraping
    prometheus.io/port: "8080"      # metrics port
    prometheus.io/path: "/metrics"  # metrics path
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
4.2.2 File-based service discovery
Create targets.json:
[
{
"targets": ["192.168.1.10:9100", "192.168.1.11:9100"],
"labels": {
"env": "production",
"job": "node"
}
},
{
"targets": ["app1:8080", "app2:8080"],
"labels": {
"env": "production",
"job": "application"
}
}
]
Configure it in prometheus.yml:
scrape_configs:
- job_name: 'file-sd'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
refresh_interval: 30s
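Since Prometheus re-reads these files within refresh_interval, the target list can be regenerated by any tool with no reload required. A hedged sketch of generating it from an inventory (the inventory dict and output path below are assumptions matching the config above):
# generate a file-SD target list
import json

inventory = {
    'node':        ['192.168.1.10:9100', '192.168.1.11:9100'],
    'application': ['app1:8080', 'app2:8080'],
}

groups = [
    {'targets': targets, 'labels': {'env': 'production', 'job': job}}
    for job, targets in inventory.items()
]

with open('/etc/prometheus/targets/targets.json', 'w') as f:
    json.dump(groups, f, indent=2)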
5. Exporter Configuration
5.1 Node Exporter
Exposes hardware and OS metrics for Linux/Unix hosts.
5.1.1 Installing (binary)
# download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
# unpack
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64
# run it
./node_exporter &
# or install it for systemd
sudo cp node_exporter /usr/local/bin/
Create the systemd unit /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start the service:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
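The --collector.textfile.directory flag enabled above lets any local process publish its own metrics: node_exporter picks up *.prom files in that directory written in the Prometheus text format. A sketch (the metric name is illustrative); writing to a temp file and renaming keeps node_exporter from reading a half-written file:
# publish a custom metric via the textfile collector
import os, tempfile, time

TEXTFILE_DIR = '/var/lib/node_exporter/textfile_collector'

def write_metric(name: str, value: float, help_text: str) -> None:
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix='.tmp')
    with os.fdopen(fd, 'w') as f:
        f.write(f'# HELP {name} {help_text}\n')
        f.write(f'# TYPE {name} gauge\n')
        f.write(f'{name} {value}\n')
    os.rename(tmp, os.path.join(TEXTFILE_DIR, 'custom.prom'))  # atomic swap

write_metric('custom_backup_last_success_timestamp_seconds', time.time(),
             'Unix time of the last successful backup')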
5.1.2 Common metrics
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU time |
| node_memory_MemTotal_bytes | total memory |
| node_memory_MemAvailable_bytes | available memory |
| node_filesystem_size_bytes | filesystem size |
| node_filesystem_avail_bytes | available space |
| node_network_receive_bytes_total | network bytes received |
| node_network_transmit_bytes_total | network bytes sent |
| node_disk_read_bytes_total | disk bytes read |
| node_disk_written_bytes_total | disk bytes written |
| node_load1 | 1-minute load average |
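These metrics can be combined in PromQL and queried over HTTP. A sketch (assuming Prometheus on localhost:9090) computing memory utilisation per instance with the instant-query API:
# instant query against /api/v1/query
import requests

query = '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
resp = requests.get('http://localhost:9090/api/v1/query',
                    params={'query': query}, timeout=10)
for sample in resp.json()['data']['result']:
    ts, value = sample['value']          # [unix_ts, "value-as-string"]
    print(f"{sample['metric'].get('instance')}: {float(value):.1f}% used")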
5.2 Application Monitoring
5.2.1 Spring Boot applications
Add the dependencies (pom.xml):
<dependencies>
<!-- Actuator -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Micrometer Prometheus -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
</dependencies>
Configuration (application.yml):
management:
endpoints:
web:
exposure:
        include: '*'  # expose all endpoints
endpoint:
health:
show-details: always
prometheus:
enabled: true
metrics:
export:
prometheus:
enabled: true
tags:
application: ${spring.application.name}
environment: production
Custom metrics:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;
@Component
public class CustomMetrics {
private final Counter orderCounter;
private final Timer orderProcessTimer;
public CustomMetrics(MeterRegistry registry) {
        // counter
this.orderCounter = Counter.builder("orders_total")
.description("Total number of orders")
.tag("type", "online")
.register(registry);
        // timer
this.orderProcessTimer = Timer.builder("order_process_duration")
.description("Order processing time")
.register(registry);
}
public void recordOrder() {
orderCounter.increment();
}
public void recordProcessTime(Runnable task) {
orderProcessTimer.record(task);
}
}
Metrics endpoint:
http://localhost:8080/actuator/prometheus
Configure it in Prometheus:
scrape_configs:
- job_name: 'spring-boot-apps'
metrics_path: '/actuator/prometheus'
static_configs:
- targets:
- 'app1:8080'
- 'app2:8080'
5.2.2 A custom Python application
Install the Prometheus client:
pip install prometheus-client
Example code:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# define the metrics
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# business logic
def process_request(method, endpoint):
    # record the request
    status = random.choice(['200', '404', '500'])
    request_counter.labels(method=method, endpoint=endpoint, status=status).inc()
    # record the duration
    with request_duration.labels(method=method, endpoint=endpoint).time():
        time.sleep(random.random())  # simulate processing time
    # update the active connection count
    active_connections.set(random.randint(1, 100))

if __name__ == '__main__':
    # start the HTTP server that exposes /metrics
    start_http_server(8000)
    # simulate traffic
    while True:
        process_request('GET', '/api/users')
        process_request('POST', '/api/orders')
        time.sleep(1)
Once running, the metrics are exposed at:
http://localhost:8000/metrics
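The Histogram above uses the client's generic default buckets; for latency SLOs you would usually pin the bucket boundaries explicitly so percentiles are computed where you care about them. A sketch (the boundaries below are assumptions; tune them to your latency profile):
# histogram with explicit buckets
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),  # seconds; +Inf is added automatically
)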
5.3 Middleware Monitoring
5.3.1 MySQL Exporter
Deploy with Docker:
# docker-compose.yml
mysqld-exporter:
  image: prom/mysqld-exporter:v0.15.1
  container_name: mysqld-exporter
  restart: always
  ports:
    - "9104:9104"
  command:
    - '--mysqld.address=mysql:3306'
    - '--mysqld.username=exporter'
  environment:
    MYSQLD_EXPORTER_PASSWORD: "password"
  networks:
    - monitoring
Note: mysqld_exporter v0.15 dropped the old DATA_SOURCE_NAME environment variable in favour of the --mysqld.* flags plus MYSQLD_EXPORTER_PASSWORD shown here.
Create the MySQL user:
CREATE USER 'exporter'@'%' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
Prometheus configuration:
scrape_configs:
- job_name: 'mysql'
static_configs:
- targets: ['mysqld-exporter:9104']
labels:
instance: 'mysql-prod'
5.3.2 Redis Exporter
redis-exporter:
image: oliver006/redis_exporter:v1.55.0
container_name: redis-exporter
restart: always
ports:
- "9121:9121"
environment:
REDIS_ADDR: "redis:6379"
REDIS_PASSWORD: "your-redis-password"
networks:
- monitoring
Prometheus configuration:
scrape_configs:
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
5.3.3 MongoDB Exporter
mongodb-exporter:
image: percona/mongodb_exporter:0.40
container_name: mongodb-exporter
restart: always
ports:
- "9216:9216"
environment:
MONGODB_URI: "mongodb://exporter:password@mongodb:27017"
networks:
- monitoring
6. The AlertManager Alerting System
6.1 AlertManager Configuration
Create alertmanager.yml:
# global configuration
global:
  resolve_timeout: 5m            # how long before a silent alert is marked resolved
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

# routing
route:
  group_by: ['alertname', 'cluster', 'service']  # grouping keys
  group_wait: 10s          # wait to collect alerts of the same group
  group_interval: 10s      # interval between notifications for a group
  repeat_interval: 1h      # interval before re-sending
  receiver: 'default'      # default receiver
  # sub-routes
  routes:
    # send critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical-team'
      group_wait: 0s
      repeat_interval: 5m
    # database alerts go to the DBAs
    - match_re:
        service: mysql|redis|mongodb
      receiver: 'dba-team'
    # business alerts go to the dev team
    - match:
        team: dev
      receiver: 'dev-team'

# inhibition rules
inhibit_rules:
  # when a node is down, suppress that node's other alerts
  - source_match:
      severity: 'critical'
      alertname: 'NodeDown'
    target_match:
      severity: 'warning'
    equal: ['instance']
  # when a service is unavailable, suppress its high-latency alerts
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service', 'instance']

# receivers
receivers:
  # default receiver
  - name: 'default'
    email_configs:
      - to: 'ops-team@example.com'
        headers:
          Subject: '[Prometheus] {{ .GroupLabels.alertname }}'
        html: '{{ template "email.html" . }}'
    webhook_configs:
      - url: 'http://webhook-server:8080/alert'
        send_resolved: true
  # critical alerts receiver
  - name: 'critical-team'
    email_configs:
      - to: 'critical-alerts@example.com'
        send_resolved: true
    # WeChat Work
    wechat_configs:
      - corp_id: 'your-corp-id'
        api_secret: 'your-api-secret'
        to_party: '1'            # department ID
        agent_id: 'your-agent-id'
        message: '{{ template "wechat.default.message" . }}'
    # DingTalk
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
  # DBA team
  - name: 'dba-team'
    email_configs:
      - to: 'dba@example.com'
  # dev team
  - name: 'dev-team'
    email_configs:
      - to: 'dev-team@example.com'

# notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'
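To exercise this routing without waiting for Prometheus to fire anything, you can push a synthetic alert straight into AlertManager's v2 API. A sketch, assuming AlertManager is reachable on localhost:9093; the labels are illustrative and should match one of your routes:
# post a synthetic alert to AlertManager
import requests
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
alert = [{
    'labels': {'alertname': 'TestAlert', 'severity': 'critical', 'instance': 'test'},
    'annotations': {'summary': 'Synthetic alert for routing test'},
    'startsAt': now.isoformat(),
    'endsAt': (now + timedelta(minutes=5)).isoformat(),
}]
r = requests.post('http://localhost:9093/api/v2/alerts', json=alert, timeout=5)
print(r.status_code)  # 200 means the alert was accepted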
6.2 Custom Notification Templates
Create email.tmpl:
{{ define "email.html" }}
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<style>
body { font-family: Arial, sans-serif; }
.alert { border: 1px solid #ddd; padding: 10px; margin: 10px 0; border-radius: 5px; }
.critical { background-color: #ffebee; border-color: #f44336; }
.warning { background-color: #fff3e0; border-color: #ff9800; }
.info { background-color: #e3f2fd; border-color: #2196f3; }
.label { display: inline-block; padding: 2px 8px; margin: 2px; background-color: #e0e0e0; border-radius: 3px; font-size: 12px; }
</style>
</head>
<body>
<h2>Prometheus Alert Notification</h2>
<p><strong>Time:</strong> {{ .CommonAnnotations.timestamp }}</p>
<p><strong>Alert group:</strong> {{ .GroupLabels.alertname }}</p>
<h3>Firing alerts ({{ .Alerts.Firing | len }})</h3>
{{ range .Alerts.Firing }}
<div class="alert {{ .Labels.severity }}">
<h4>{{ .Labels.alertname }}</h4>
<p><strong>Severity:</strong> <span class="label">{{ .Labels.severity }}</span></p>
<p><strong>Instance:</strong> {{ .Labels.instance }}</p>
<p><strong>Description:</strong> {{ .Annotations.description }}</p>
<p><strong>Started:</strong> {{ .StartsAt.Format "2006-01-02 15:04:05" }}</p>
<details>
<summary>All labels</summary>
{{ range .Labels.SortedPairs }}
<span class="label">{{ .Name }}: {{ .Value }}</span>
{{ end }}
</details>
</div>
{{ end }}
{{ if .Alerts.Resolved }}
<h3>Resolved alerts ({{ .Alerts.Resolved | len }})</h3>
{{ range .Alerts.Resolved }}
<div class="alert info">
<h4>✓ {{ .Labels.alertname }} [resolved]</h4>
<p><strong>Instance:</strong> {{ .Labels.instance }}</p>
<p><strong>Resolved at:</strong> {{ .EndsAt.Format "2006-01-02 15:04:05" }}</p>
<p><strong>Duration:</strong> {{ .EndsAt.Sub .StartsAt }}</p>
</div>
{{ end }}
{{ end }}
<hr>
<p style="font-size: 12px; color: #666;">
Sent automatically by Prometheus AlertManager<br>
Details: <a href="http://prometheus.yourdomain.com">Prometheus</a> |
<a href="http://alertmanager.yourdomain.com">AlertManager</a>
</p>
</body>
</html>
{{ end }}
6.3 DingTalk Bot Integration
Create the DingTalk webhook adapter dingtalk-webhook.py:
from flask import Flask, request
import requests

app = Flask(__name__)
DINGTALK_WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"

@app.route('/webhook/dingtalk', methods=['POST'])
def dingtalk_webhook():
    data = request.json
    # build the DingTalk message
    alerts = data.get('alerts', [])
    firing_alerts = [a for a in alerts if a['status'] == 'firing']
    resolved_alerts = [a for a in alerts if a['status'] == 'resolved']
    message = {
        "msgtype": "markdown",
        "markdown": {
            "title": "Prometheus alert",
            "text": build_message(firing_alerts, resolved_alerts)
        },
        "at": {
            "isAtAll": False
        }
    }
    # forward it to DingTalk
    response = requests.post(DINGTALK_WEBHOOK, json=message)
    return {'status': 'ok'}, 200

def build_message(firing, resolved):
    text = "### 🚨 Prometheus Alert Notification\n\n"
    if firing:
        text += f"**Firing: {len(firing)} alert(s)**\n\n"
        for alert in firing:
            labels = alert['labels']
            annotations = alert['annotations']
            text += f"#### [{labels.get('severity', 'unknown')}] {labels.get('alertname')}\n"
            text += f"- **Instance:** {labels.get('instance')}\n"
            text += f"- **Description:** {annotations.get('description')}\n"
            text += f"- **Time:** {alert['startsAt']}\n\n"
    if resolved:
        text += f"\n**Resolved: {len(resolved)} alert(s)**\n\n"
        for alert in resolved:
            labels = alert['labels']
            text += f"- ✅ {labels.get('alertname')} ({labels.get('instance')})\n"
    return text

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
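The adapter can be exercised locally with a hand-built payload before wiring it into AlertManager. A sketch using a minimal subset of the AlertManager webhook format (real payloads carry more fields; the label values here are made up):
# send a fake AlertManager webhook payload to the adapter
import requests

payload = {
    'alerts': [{
        'status': 'firing',
        'labels': {'alertname': 'NodeHighCPU', 'severity': 'warning',
                   'instance': 'node1:9100'},
        'annotations': {'description': 'CPU usage above 80%'},
        'startsAt': '2025-01-01T00:00:00Z',
    }]
}
r = requests.post('http://localhost:8080/webhook/dingtalk', json=payload, timeout=5)
print(r.status_code, r.text)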
7. Alerting Rules
7.1 Infrastructure alerting rules
Create rules/infrastructure-alerts.yml:
groups:
  # node alerts
  - name: node-alerts
    interval: 30s
    rules:
      # node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "The node has been offline for more than 1 minute\n Current state: {{ $value }}"
      # high CPU usage
      - alert: NodeHighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage above 80%\n Current value: {{ $value | humanize }}%"
      # critically high CPU usage
      - alert: NodeCriticalCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 3m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Critically high CPU usage on node {{ $labels.instance }}"
          description: "CPU usage above 95%; the service may become unavailable\n Current value: {{ $value | humanize }}%"
      # high memory usage
      - alert: NodeHighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High memory usage on node {{ $labels.instance }}"
          description: "Memory usage above 85%\n Current value: {{ $value | humanize }}%"
      # low disk space
      - alert: NodeDiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "Low disk space on node {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is above 85% used\n Current value: {{ $value | humanize }}%"
      # critically low disk space
      - alert: NodeDiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes)) * 100 > 95
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Critically low disk space on node {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is above 95% used; act immediately\n Current value: {{ $value | humanize }}%"
      # high system load
      - alert: NodeHighLoad
        expr: node_load5 / count by (instance) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High system load on node {{ $labels.instance }}"
          description: "5-minute load average exceeds 2x the CPU core count\n Current value: {{ $value | humanize }}"
      # file descriptor pressure
      - alert: NodeHighFileDescriptors
        expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High file descriptor usage on node {{ $labels.instance }}"
          description: "File descriptor usage above 80%\n Current value: {{ $value | humanize }}%"
7.2 Application alerting rules
Create rules/application-alerts.yml:
groups:
  - name: application-alerts
    interval: 30s
    rules:
      # service down
      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
          team: dev
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "The service has been offline for more than 1 minute"
      # high HTTP error rate
      - alert: HighHTTPErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance, service)
          /
          sum(rate(http_requests_total[5m])) by (instance, service)
          * 100 > 5
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "High HTTP error rate on service {{ $labels.service }}"
          description: "5xx error rate above 5%\n Current value: {{ $value | humanize }}%"
      # slow API responses
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance, endpoint)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "Slow responses on API {{ $labels.endpoint }}"
          description: "P95 latency above 1 second\n Current value: {{ $value | humanize }}s"
      # high JVM heap usage
      - alert: HighJVMMemoryUsage
        expr: |
          (jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "High JVM heap usage on service {{ $labels.instance }}"
          description: "Heap usage above 85%\n Current value: {{ $value | humanize }}%"
      # high GC rate
      - alert: HighGCRate
        expr: rate(jvm_gc_collection_seconds_count[5m]) > 10
        for: 5m
        labels:
          severity: warning
          team: dev
        annotations:
          summary: "High GC rate on service {{ $labels.instance }}"
          description: "More than 10 GC runs per second\n Current value: {{ $value | humanize }}/s"
      # database connection pool nearly exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          (hikaricp_connections_active / hikaricp_connections_max) * 100 > 90
        for: 5m
        labels:
          severity: critical
          team: dev
        annotations:
          summary: "Database connection pool nearly exhausted on {{ $labels.instance }}"
          description: "Pool usage above 90%\n Current value: {{ $value | humanize }}%"
7.3 Middleware alerting rules
Create rules/middleware-alerts.yml:
groups:
  # MySQL alerts
  - name: mysql-alerts
    interval: 30s
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
          team: dba
        annotations:
          summary: "MySQL {{ $labels.instance }} is down"
          description: "The MySQL instance is offline"
      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "Too many connections on MySQL {{ $labels.instance }}"
          description: "Connection usage above 80%\n Current value: {{ $value | humanize }}%"
      - alert: MySQLSlowQueries
        expr: rate(mysql_global_status_slow_queries[5m]) > 10
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "Too many slow queries on MySQL {{ $labels.instance }}"
          description: "More than 10 slow queries per second\n Current value: {{ $value | humanize }}/s"
  # Redis alerts
  - name: redis-alerts
    interval: 30s
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
          team: dba
        annotations:
          summary: "Redis {{ $labels.instance }} is down"
          description: "The Redis instance is offline"
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "High memory usage on Redis {{ $labels.instance }}"
          description: "Memory usage above 85%\n Current value: {{ $value | humanize }}%"
      - alert: RedisHighConnectedClients
        expr: redis_connected_clients > 1000
        for: 5m
        labels:
          severity: warning
          team: dba
        annotations:
          summary: "Too many clients on Redis {{ $labels.instance }}"
          description: "More than 1000 connected clients\n Current value: {{ $value }}"
8. Visualization with Grafana
8.1 Adding the Prometheus Data Source
1. Open Grafana at http://localhost:3000
2. Log in (admin/admin123)
3. In the left menu, go to Configuration > Data Sources
4. Click Add data source
5. Select Prometheus
6. Configure:
   Name: Prometheus
   URL: http://prometheus:9090
   Access: Server (default)
7. Click Save & Test
8.2 Importing Popular Dashboards
8.2.1 Node Exporter Full
1. In the left menu, go to + > Import
2. Enter dashboard ID 1860
3. Click Load
4. Select the Prometheus data source
5. Click Import
8.2.2 Other recommended dashboards
| Dashboard | ID | Description |
|---|---|---|
| Node Exporter Full | 1860 | complete node monitoring |
| Kubernetes Cluster Monitoring | 7249 | K8s cluster monitoring |
| Spring Boot Statistics | 6756 | Spring Boot application monitoring |
| MySQL Overview | 7362 | MySQL monitoring |
| Redis Dashboard | 11835 | Redis monitoring |
| JVM (Micrometer) | 4701 | JVM monitoring |
9. The PromQL Query Language
9.1 Basic Queries
# select a metric
http_requests_total
# filter by label
http_requests_total{method="GET"}
# multiple labels
http_requests_total{method="GET", status="200"}
# regex label match
http_requests_total{status=~"2.."}
# label not equal
http_requests_total{status!="200"}
9.2 Range Queries
# the last 5 minutes of data
http_requests_total[5m]
# the last hour of data
http_requests_total[1h]
9.3 Aggregation
# sum
sum(http_requests_total)
# sum grouped by label
sum(http_requests_total) by (service)
# average
avg(node_cpu_seconds_total)
# maximum
max(node_memory_MemTotal_bytes)
# minimum
min(node_memory_MemAvailable_bytes)
# count
count(up == 1)
9.4 Rate Functions
# per-second rate of increase (for Counters)
rate(http_requests_total[5m])
# instantaneous rate of increase
irate(http_requests_total[5m])
# total increase over a window
increase(http_requests_total[1h])
9.5 Frequently Used Queries
# CPU utilisation
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# memory utilisation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# disk utilisation
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# QPS (requests per second)
sum(rate(http_requests_total[5m]))
# error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# availability (last 24 hours)
avg_over_time(up[24h]) * 100
# top 5 pods by CPU
topk(5, sum(rate(container_cpu_usage_seconds_total[5m])) by (pod))
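The same expressions can also be evaluated over a time window with the /api/v1/query_range endpoint, which is what Grafana does under the hood. A sketch (assuming Prometheus on localhost:9090): CPU utilisation per instance for the last hour at 60-second resolution:
# range query against /api/v1/query_range
import time
import requests

query = '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
end = time.time()
resp = requests.get('http://localhost:9090/api/v1/query_range',
                    params={'query': query, 'start': end - 3600,
                            'end': end, 'step': '60s'}, timeout=10)
for series in resp.json()['data']['result']:
    values = series['values']            # list of [unix_ts, "value"] pairs
    print(series['metric'].get('instance'), f'{len(values)} samples')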
10. Enterprise Best Practices
10.1 High Availability
10.1.1 Prometheus federation
# global Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".*"}'
    static_configs:
      - targets:
          - 'prometheus-shard1:9090'
          - 'prometheus-shard2:9090'
          - 'prometheus-shard3:9090'
10.2 Data Retention
# Prometheus startup flags
--storage.tsdb.retention.time=30d    # keep 30 days of data
--storage.tsdb.retention.size=100GB  # or at most 100GB
10.3 Metric Conventions
10.3.1 Naming
- Use lowercase letters and underscores
- Prefix with the application or library name
- Suffix with the unit (optional)
# good names
http_requests_total
http_request_duration_seconds
database_connections_active
# bad names
httpRequests
RequestDuration
DB-Connections
10.3.2 Labels
- Use meaningful label names
- Avoid high-cardinality labels such as user_id or request_id (the sketch below shows why)
- Keep the number of labels moderate
# good labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# avoid high cardinality
http_requests_total{user_id="12345"}   # not recommended
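The reason is that every distinct combination of label values becomes its own time series, and Prometheus must index and store each one. A sketch with the Python client (metric names are illustrative) making the blow-up visible:
# demonstrate label cardinality growth
from prometheus_client import Counter, REGISTRY

good = Counter('demo_requests_total', 'Requests', ['endpoint'])
for _ in range(10_000):
    good.labels(endpoint='/api/users').inc()   # still just one series

bad = Counter('demo_bad_total', 'Requests', ['user_id'])
for i in range(10_000):
    bad.labels(user_id=str(i)).inc()           # 10,000 separate series

# count the samples that would be exposed at /metrics
samples = sum(len(m.samples) for m in REGISTRY.collect())
print('samples exposed:', samples)   # dominated by the 10,000 user_id series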
10.4 Alerting Conventions
10.4.1 Severity levels
| Level | Meaning | Response time | Handling |
|---|---|---|---|
| critical | Severe failure affecting the service | immediately | 24/7 on-call, phone/SMS notification |
| warning | Potential problem, needs attention | within 30 minutes | handled during working hours |
| info | Informational | none | logged only |
10.4.2 Alert design principles
- Actionable: every alert must require human intervention
- Accurate: avoid false positives with sensible thresholds and durations
- Concise: alert messages should be short and clear
- Tiered: handle alerts according to severity
- Avoid alert fatigue: group similar alerts
10.5 Performance Optimization
10.5.1 Recording rules
Pre-compute and store the results of expensive queries:
groups:
  - name: recording-rules
    interval: 30s
    rules:
      - record: instance:node_cpu_utilization:rate5m
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_utilization:ratio
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
Then query the pre-computed metric directly:
# use it like any other metric
instance:node_cpu_utilization:rate5m
11. Common Problems and Solutions
11.1 Prometheus uses too much memory
Causes:
- Too many scrape targets
- High metric cardinality
- Retention window too long
Remedies:
- Scrape fewer targets or increase the scrape interval
- Pre-compute with recording rules
- Shorten the retention window
- Raise the memory limit
# startup flag
--storage.tsdb.retention.time=15d
11.2 Query timeouts
Remedies:
- Optimize the PromQL query
- Use recording rules
- Raise the query timeout; this is a startup flag, not a prometheus.yml setting:
--query.timeout=2m
11.3 Alerting rules not firing
Steps to debug:
1. Check the rule file syntax: promtool check rules rules/*.yml
2. Check the Prometheus configuration: promtool check config prometheus.yml
3. Inspect the Alerts page in the Prometheus UI
4. Check the AlertManager configuration
12. Summary
12.1 Full Architecture Diagram
┌──────────────────────────────────────────────────────────────┐
│                 Monitoring system architecture                │
└──────────────────────────────────────────────────────────────┘
┌─────────────┐   ┌─────────────┐   ┌──────────────┐
│  Services   │   │ Middleware  │   │Infrastructure│
│  /actuator  │   │  Exporters  │   │   Node Exp   │
└──────┬──────┘   └──────┬──────┘   └──────┬───────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │ (metrics)
                         ↓
               ┌─────────────────┐
               │   Prometheus    │
               │ (scrape & store)│
               └────────┬────────┘
                        │
         ┌──────────────┼──────────────┐
         │              │              │
         ↓              ↓              ↓
  ┌────────────┐  ┌──────────┐  ┌──────────┐
  │  Grafana   │  │ AlertMgr │  │ HTTP API │
  │(dashboards)│  │ (alerts) │  │ (queries)│
  └────────────┘  └────┬─────┘  └──────────┘
                       │
            ┌──────────┼──────────┐
            ↓          ↓          ↓
        ┌──────┐ ┌───────────┐ ┌────────┐
        │Email │ │WeChat Work│ │DingTalk│
        └──────┘ └───────────┘ └────────┘
12.2 Setup Checklist
- ✅ Prometheus server deployed
- ✅ Node Exporter installed
- ✅ Application monitoring integrated
- ✅ Middleware exporters configured
- ✅ Service discovery configured
- ✅ Alerting rules written
- ✅ AlertManager deployed
- ✅ Notification channels configured
- ✅ Grafana deployed and configured
- ✅ Dashboards created
- ✅ Data backup strategy in place
- ✅ High-availability plan in place
Appendix
Appendix A: Common Exporters
| Exporter | Monitors | Port | Project |
|---|---|---|---|
| node_exporter | Linux/Unix hosts | 9100 | https://github.com/prometheus/node_exporter |
| mysqld_exporter | MySQL | 9104 | https://github.com/prometheus/mysqld_exporter |
| redis_exporter | Redis | 9121 | https://github.com/oliver006/redis_exporter |
| mongodb_exporter | MongoDB | 9216 | https://github.com/percona/mongodb_exporter |
| postgres_exporter | PostgreSQL | 9187 | https://github.com/prometheus-community/postgres_exporter |
| nginx_exporter | Nginx | 9113 | https://github.com/nginxinc/nginx-prometheus-exporter |
| kafka_exporter | Kafka | 9308 | https://github.com/danielqsj/kafka_exporter |
| blackbox_exporter | HTTP/TCP/ICMP | 9115 | https://github.com/prometheus/blackbox_exporter |
Appendix B: PromQL Cheat Sheet
# selectors
{job="prometheus"}
{job=~"prom.*"}
{job!="prometheus"}
# aggregation
sum()
avg()
max()
min()
count()
topk(5, )
bottomk(5, )
# rates
rate()    # good for alerting
irate()   # good for graphs
# time ranges
[5m]   # 5 minutes
[1h]   # 1 hour
[1d]   # 1 day
# operators
+ - * / %
== != > < >= <=
and or unless
# functions
abs()
ceil()
floor()
round()
histogram_quantile()
Document version: v1.0.0
Last updated: 2025-12-09
Author: Tiger IoT team
Scope: enterprise microservice monitoring systems
Happy monitoring!