马哥Linux运维 | Production-Grade Prometheus Alerting Rule Configuration: 50+ Core Metrics and Best Practices (Part 1)

Latest recommended article published on 2025-11-24 20:32:24


License: CC 4.0 BY-SA


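This article comes from the WeChat official account “马哥Linux运维” and is shared for study purposes only; it will be removed on request if it infringes any rights. It is packed with practical material.

Original link: https://mp.weixin.qq.com/s/X4ADdHOXtQw5Hy5iQSal8A

The article is fairly long and is split into four parts, (1) through (4). Let's work through it together!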

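1️⃣ Use Cases & Prerequisites

Item	Requirement
Use cases	Microservice monitoring in production, observability for cloud-native applications, infrastructure health monitoring
OS	RHEL/CentOS 7.9+ or Ubuntu 20.04+
Kernel	Linux Kernel 4.18+
Software versions	Prometheus 2.40+, Alertmanager 0.25+, Node Exporter 1.5+
Resources	4 cores / 8 GB (minimum); 8 cores / 16 GB (recommended, supports 10K+ time series)
Network	Ports 9090 (Prometheus), 9093 (Alertmanager) and 9100 (Node Exporter) open
Permissions	Regular user plus permission to manage systemd services
Skills	Familiarity with PromQL, YAML configuration, microservice architecture and monitoring theory
Storage	≥100GB SSD (time series data, 15-day retention)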
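
As a rough cross-check of the storage row, the Prometheus storage documentation gives the sizing rule needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample, with roughly 1-2 bytes per sample after compression. The sketch below applies that rule. The function name, the series count and the 2-bytes-per-sample default are illustrative assumptions, not measurements from this setup.

```python
def estimate_tsdb_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical load: 500,000 active series scraped every 15s, kept for 15 days.
    size = estimate_tsdb_bytes(500_000, 15, 15)
    print(f"{size / 1e9:.1f} GB")  # prints "86.4 GB"
```

Under these assumed numbers the estimate stays below the 100GB figure in the table; a larger series count or a longer retention scales it linearly.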

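2️⃣ Anti-Pattern Warnings (When Not to Use This)

⚠️ This approach is not recommended in the following scenarios:

1. Very small environments: with fewer than 10 monitored servers, Zabbix/Nagios is lighter weight
2. Automatic remediation is required: Prometheus only alerts and does not self-heal; combine it with Ansible or a Kubernetes Operator
3. Long-term data storage: local retention defaults to 15 days (adjustable with --storage.tsdb.retention.time); for long-term retention integrate Thanos/Cortex/VictoriaMetrics
4. Log analysis is the main goal: Prometheus focuses on metrics; use ELK/Loki for logs
5. Mostly Windows servers: Node Exporter targets *NIX systems; on Windows use windows_exporter (formerly WMI Exporter)
6. Sub-second freshness (< 1s): scrape intervals are typically 5-15s at the shortest, so Prometheus is not suited to millisecond-level monitoring

Comparison of alternatives:

Scenario	Recommended solution	Reason
APM monitoring	Jaeger/SkyWalking	Stronger distributed tracing
Log aggregation	ELK/Loki	Full-text search and log analysis
Long-term storage	Thanos/VictoriaMetrics	Supports multi-year data retention
Traditional infrastructure	Zabbix	Mature SNMP/IPMI support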

3️⃣ Environment and Version Matrix

Component	RHEL/CentOS	Ubuntu/Debian	Test status
OS version	RHEL 8.7+ / CentOS Stream 9	Ubuntu 22.04 LTS	[Tested]
Kernel version	4.18.0-425+	5.15.0-60+	[Tested]
Prometheus	2.40.7 (LTS) / 2.48.0 (latest)	2.40.7 (LTS) / 2.48.0 (latest)	[Tested]
Alertmanager	0.25.0 / 0.26.0	0.25.0 / 0.26.0	[Tested]
Node Exporter	1.5.0 / 1.6.1	1.5.0 / 1.6.1	[Tested]
Minimum spec	4 cores / 8 GB / 50GB SSD	4 cores / 8 GB / 50GB SSD	-
Recommended spec	8 cores / 16 GB / 200GB SSD	8 cores / 16 GB / 200GB SSD	-

Version differences:

• Prometheus 2.40 vs 2.48: 2.48 adds query optimizations for native histograms
• Alertmanager 0.25 vs 0.26: 0.26 improves the Slack/Teams integrations
• Node Exporter 1.5 vs 1.6: 1.6 adds full cgroup v2 support

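4️⃣ Reading Guide

📖 Suggested reading paths:

Quick start (20 minutes): → Section 5 (Quick Checklist) → Section 6 (Implementation Steps, Step 1-4) → Section 13 (Key Scripts)

In depth (60 minutes): → Section 7 (PromQL Core Principles) → Section 6 (Implementation Steps, full version) → Section 8 (Three Pillars of Observability) → Section 11 (Best Practices)

Troubleshooting: → Section 9 (Common Failures and Debugging) → Section 10 (Change and Rollback Playbook)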

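5️⃣ Quick Checklist

• [ ] Preparation
  • [ ] Check Prometheus version compatibility (prometheus --version)
  • [ ] Back up existing alerting rules (cp /etc/prometheus/rules/*.yml /backup/)
  • [ ] Validate the Alertmanager configuration (amtool check-config /etc/alertmanager/alertmanager.yml)
• [ ] Implementation
  • [ ] Deploy Node Exporter to every monitored target (systemctl enable --now node_exporter)
  • [ ] Configure Prometheus scrape targets (edit prometheus.yml)
  • [ ] Deploy the alerting rule files (create them under /etc/prometheus/rules/)
  • [ ] Configure Alertmanager notification channels (Slack/Email/PagerDuty)
  • [ ] Hot-reload the Prometheus configuration (curl -X POST http://localhost:9090/-/reload; requires Prometheus to run with --web.enable-lifecycle)
• [ ] Verification
  • [ ] Test the PromQL query syntax (run the queries in the Prometheus UI)
  • [ ] Trigger a test alert (simulate high CPU load)
  • [ ] Verify the alert routing rules (check the Alertmanager logs)
  • [ ] Confirm that notifications arrive (check Slack/Email)
• [ ] Tuning
  • [ ] Adjust alert thresholds (reduce false positives)
  • [ ] Configure silences (maintenance windows)
  • [ ] Enable alert inhibition (prevent alert storms)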

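6️⃣ Implementation Steps

Architecture and Data Flow (text description)

System architecture:

Monitored targets (servers/containers)
    ↓ expose metrics (HTTP /metrics endpoint)
Node Exporter / application exporters
    ↓ scraped periodically (every 15s in this setup; the Prometheus default is 1m)
Prometheus Server (time series database)
    ↓ rule evaluation (evaluation_interval: 15s)
Alerting rule engine (PromQL-based)
    ↓ alert triggered when the condition holds
Alertmanager (alert aggregation and routing)
    ↓ grouping / inhibition / silencing
Notification channels (Slack/Email/PagerDuty/Webhook)

Key components:

• Exporter: metrics collector that exposes monitoring data on an HTTP endpoint
• Prometheus Server: the core service; scrapes metrics, stores time series and evaluates rules
• Alerting rule files: YAML files that define alert conditions (PromQL), hold durations and labels
• Alertmanager: a separate service responsible for alert deduplication, grouping, routing, silencing and inhibition

Data flow:

1. Node Exporter exposes system metrics (CPU/memory/disk/network) on its /metrics endpoint and generates them on each request
2. Prometheus actively scrapes the /metrics endpoint of every target (every 15 seconds in this setup)
3. The data is stored in the local TSDB (time series database)
4. The alerting rule engine evaluates all rules every 15 seconds
5. When a condition holds and has lasted for the configured duration, the alert is sent to Alertmanager
6. Alertmanager sends the alert to the matching notification channel according to its routing rules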
Step 1: Deploy Node Exporter (monitored targets)

Goal: deploy Node Exporter on every server that needs to be monitored

RHEL/CentOS commands:

# Download and install Node Exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Create the systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

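# Create the prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Start the service
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Ubuntu/Debian commands:

# Install with the package manager (recommended)
sudo apt update
sudo apt install -y prometheus-node-exporter

# Or install manually (same as the RHEL steps)
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Start the service (the package install creates the unit automatically)
sudo systemctl enable --now prometheus-node-exporter

Key parameters:

1. --collector.systemd: enables collection of systemd unit states (monitor whether services are running)
2. --collector.processes: collects aggregate process statistics (useful for spotting zombie processes)
3. --collector.tcpstat: collects TCP connection states (monitor TIME_WAIT/CLOSE_WAIT)

Pre-deployment check:

# Confirm that port 9100 is not in use
sudo ss -tulnp | grep 9100
# Expected output: none (the port is free)

Post-deployment check:

# Check the service status (the unit is named prometheus-node-exporter when installed from the Ubuntu/Debian package)
sudo systemctl status node_exporter
# Expected output: active (running)

# Test the metrics endpoint
curl -s http://localhost:9100/metrics | head -n 10
# Expected output: metric definitions introduced by # HELP and # TYPE lines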
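
To make the shape of the /metrics output concrete, here is a minimal sketch that parses exposition-format text into a dictionary. It assumes lines of the form "name{labels} value" without trailing timestamps, which is what Node Exporter emits. The helper name parse_metrics and the sample text are illustrative, not part of any Prometheus library.

```python
from typing import Dict

SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse exposition lines into {"name{labels}": value}, skipping comments."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no sample
        # Assumption: no timestamp, so the value is the last space-separated field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

For anything beyond a quick check, a maintained parser is the safer choice; this sketch ignores timestamps and escaping rules.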

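Idempotency:

• The systemd unit file is written with tee, which overwrites it, so re-running the step is safe
• useradd reports an error if the user already exists; this does not affect the following steps when run interactively, but guard the call (for example with id prometheus) in scripts that abort on errors

Rollback:

# Stop and remove the service
sudo systemctl stop node_exporter
sudo systemctl disable node_exporter
sudo rm /etc/systemd/system/node_exporter.service
sudo rm /usr/local/bin/node_exporter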
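Step 2: Configure Prometheus Scrape Targets

Goal: configure Prometheus to scrape Node Exporter metrics periodically

Edit the configuration file:

# Back up the current configuration
sudo cp /etc/prometheus/prometheus.yml /etc/prometheus/prometheus.yml.bak.$(date +%Y%m%d_%H%M%S)

# Edit the configuration
sudo vi /etc/prometheus/prometheus.yml

Add the scrape configuration: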
# prometheus.yml
global:
  scrape_interval: 15s       # global scrape interval
  evaluation_interval: 15s   # rule evaluation interval
  external_labels:
    cluster: 'production'    # cluster identifier

# Paths to the alerting rule files
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'

# Scrape target configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

  # Node Exporter (infrastructure monitoring)
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - '192.168.1.10:9100'   # web server 1
          - '192.168.1.11:9100'   # web server 2
          - '192.168.1.20:9100'   # database server
        labels:
          env: 'production'
          role: 'backend'

  # Application monitoring (example: a Go application)
  - job_name: 'my_application'
    static_configs:
      - targets:
          - '192.168.1.30:8080'
        labels:
          env: 'production'
          app: 'api_server'

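Key parameters:

1. scrape_interval: 15s: scrape each target every 15 seconds; this sets the data resolution and drives storage usage
2. evaluation_interval: 15s: evaluate the alerting rules every 15 seconds; this affects how quickly alerts respond
3. external_labels: cluster-level labels, used with federation or remote storage

Post-deployment check:

# Check the configuration syntax
promtool check config /etc/prometheus/prometheus.yml
# Expected output: SUCCESS: 0 rule files found (before any rule files exist)

# Hot-reload the configuration (no restart needed; Prometheus must run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Expected output: none (HTTP 200)

# Verify target status (job and instance are nested under .labels in the API response)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health}'
# Expected output: health: "up" for every target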
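
The same target check can be scripted. The sketch below filters an already-decoded /api/v1/targets response for targets that are not healthy. The function name unhealthy_targets and the sample payload are illustrative assumptions; the payload keeps only the fields the function reads (data.activeTargets[].labels and .health).

```python
from typing import Any, Dict, List

SAMPLE_PAYLOAD: Dict[str, Any] = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node_exporter", "instance": "192.168.1.10:9100"}, "health": "up"},
            {"labels": {"job": "node_exporter", "instance": "192.168.1.20:9100"}, "health": "down"},
        ]
    },
}


def unhealthy_targets(payload: Dict[str, Any]) -> List[str]:
    """Return "job/instance" for every active target whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        f'{t["labels"].get("job", "?")}/{t["labels"].get("instance", "?")}'
        for t in targets
        if t.get("health") != "up"
    ]


if __name__ == "__main__":
    print(unhealthy_targets(SAMPLE_PAYLOAD))
```

In practice the payload would come from an HTTP GET against the Prometheus API; it is kept as a literal here so the example runs without a server.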

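Common errors:

# Error: bad YAML indentation
# Output: yaml: line 10: mapping values are not allowed in this context
# Fix: indent with spaces (never tabs) and make sure every colon is followed by a space

# Error: target unreachable
# Output: context deadline exceeded
# Fix: check the firewall, whether the target service is running, and network connectivity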

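Step 3: Deploy the Alerting Rule Files

Goal: create production-grade alerting rules covering infrastructure, middleware and the application layer

Create the rules directory:

sudo mkdir -p /etc/prometheus/rules
sudo chown prometheus:prometheus /etc/prometheus/rules

Rule file 1: infrastructure alerts (infrastructure.yml)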
# /etc/prometheus/rules/infrastructure.yml
# Percentage-style rules are written as 0-1 ratios so that humanizePercentage
# (which multiplies by 100) renders the current value correctly.
groups:
  - name: infrastructure_alerts
    interval: 15s
    rules:
      # 🔴 P0: node down
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 1m
        labels:
          severity: critical
          category: infrastructure
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "The node has been unreachable for more than 1 minute, current up value: {{ $value }}"
          runbook_url: "https://wiki.example.com/runbook/node-down"

      # 🟠 P1: high CPU usage
      # rate() is used instead of irate(): irate() only looks at the last two samples,
      # which makes it too jumpy for alerting.
      - alert: HighCPUUsage
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
        for: 5m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 5 minutes, current value: {{ $value | humanizePercentage }}"

      # 🟠 P1: high memory usage
      - alert: HighMemoryUsage
        expr: |
          1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
        for: 5m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "High memory usage on node {{ $labels.instance }}"
          description: "Memory usage has been above 85% for 5 minutes, current value: {{ $value | humanizePercentage }}"

      # 🔴 P0: low disk space
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"} < 0.15
        for: 5m
        labels:
          severity: critical
          category: infrastructure
        annotations:
          summary: "Low disk space on node {{ $labels.instance }}"
          description: "Available space on mount point {{ $labels.mountpoint }} is below 15%, current value: {{ $value | humanizePercentage }}"

      # 🟡 P2: high disk I/O utilization
      - alert: HighDiskIOUsage
        expr: |
          rate(node_disk_io_time_seconds_total[5m]) > 0.95
        for: 10m
        labels:
          severity: info
          category: infrastructure
        annotations:
          summary: "Disk I/O is busy on node {{ $labels.instance }}"
          description: "I/O utilization of disk {{ $labels.device }} has been above 95% for 10 minutes, current value: {{ $value | humanizePercentage }}"

      # 🟠 P1: high network packet drop rate
      - alert: HighNetworkPacketLoss
        expr: |
          rate(node_network_receive_drop_total[5m]) > 100 or rate(node_network_transmit_drop_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "Network packet drops on node {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} is dropping packets (receive or transmit), current rate: {{ $value }} packets/s"

      # 🔴 P0: high system load
      - alert: HighSystemLoad
        expr: |
          node_load15 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 10m
        labels:
          severity: critical
          category: infrastructure
        annotations:
          summary: "High system load on node {{ $labels.instance }}"
          description: "The 15-minute load average exceeds 2x the number of CPU cores, current load per core: {{ $value | humanize }}"

      # 🟡 P2: too many processes
      - alert: TooManyProcesses
        expr: |
          node_procs_running > 500
        for: 15m
        labels:
          severity: info
          category: infrastructure
        annotations:
          summary: "Too many processes on node {{ $labels.instance }}"
          description: "More than 500 processes are in the running state, current value: {{ $value }}"

      # 🔴 P0: file descriptor exhaustion
      - alert: FileDescriptorExhaustion
        expr: |
          node_filefd_allocated / node_filefd_maximum > 0.90
        for: 5m
        labels:
          severity: critical
          category: infrastructure
        annotations:
          summary: "File descriptors nearly exhausted on node {{ $labels.instance }}"
          description: "File descriptor usage is above 90%, current value: {{ $value | humanizePercentage }}"

      # 🟠 P1: clock synchronization problem
      - alert: ClockSkew
        expr: |
          abs(node_timex_offset_seconds) > 0.05
        for: 10m
        labels:
          severity: warning
          category: infrastructure
        annotations:
          summary: "Clock synchronization problem on node {{ $labels.instance }}"
          description: "The system clock offset exceeds 50ms, current value: {{ $value }}s"
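
The HighCPUUsage expression is easier to read with numbers. The per-second rate of the idle counter is the fraction of time a core sat idle, so one minus the average idle rate across cores is the busy fraction that gets compared with the 80% threshold. The sketch below reproduces that arithmetic from two counter readings per core. It is a simplification with hypothetical helper names: the real rate() also handles counter resets and extrapolates to the window boundaries.

```python
from typing import Sequence


def counter_rate(first: float, last: float, seconds: float) -> float:
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / seconds


def cpu_busy_ratio(idle_first: Sequence[float], idle_last: Sequence[float], seconds: float) -> float:
    """Busy fraction from per-core node_cpu_seconds_total{mode="idle"} readings."""
    idle_rates = [counter_rate(a, b, seconds) for a, b in zip(idle_first, idle_last)]
    return 1 - sum(idle_rates) / len(idle_rates)


if __name__ == "__main__":
    # Two cores over a 300s window: idle time grew by 60s and 30s respectively.
    ratio = cpu_busy_ratio([1000.0, 2000.0], [1060.0, 2030.0], 300)
    print(f"busy ratio: {ratio:.2f}, above 80% threshold: {ratio > 0.80}")
```

With these made-up readings the cores were idle 20% and 10% of the time, so the node was 85% busy and the condition holds.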

Rule file 2: application and middleware alerts (applications.yml)

# /etc/prometheus/rules/applications.yml
# Percentage-style rules are written as 0-1 ratios so that humanizePercentage
# (which multiplies by 100) renders the current value correctly.
groups:
  - name: application_alerts
    interval: 15s
    rules:
      # 🔴 P0: high HTTP 5xx error rate
      - alert: HighHTTP5xxRate
        expr: |
          sum by (job, instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job, instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          category: application
        annotations:
          summary: "High 5xx error rate for application {{ $labels.job }}"
          description: "The 5xx error rate has been above 5% for 5 minutes, current value: {{ $value | humanizePercentage }}"

      # 🟠 P1: high HTTP latency
      - alert: HighHTTPLatency
        expr: |
          histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "High response time for application {{ $labels.job }}"
          description: "P95 response time has been above 2 seconds for 10 minutes, current value: {{ $value }}s"

      # 🟠 P1: frequent application restarts
      - alert: FrequentAppRestarts
        expr: |
          changes(process_start_time_seconds[15m]) > 2
        for: 0m
        labels:
          severity: warning
          category: application
        annotations:
          summary: "Application {{ $labels.job }} is restarting frequently"
          description: "More than 2 restarts in the last 15 minutes; investigate the cause of the crashes"

      # 🔴 P0: database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          db_connection_pool_active / db_connection_pool_max > 0.90
        for: 5m
        labels:
          severity: critical
          category: database
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Connection pool usage is above 90%, current value: {{ $value | humanizePercentage }}"

      # 🟠 P1: high Redis memory usage
      # The second condition skips instances without a maxmemory limit (max bytes = 0),
      # where the division would otherwise yield +Inf and fire the alert.
      - alert: RedisHighMemoryUsage
        expr: |
          redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and redis_memory_max_bytes > 0
        for: 10m
        labels:
          severity: warning
          category: cache
        annotations:
          summary: "High memory usage on Redis {{ $labels.instance }}"
          description: "Memory usage has been above 85% for 10 minutes, current value: {{ $value | humanizePercentage }}"

      # 🔴 P0: high Kafka consumer lag
      - alert: KafkaConsumerLag
        expr: |
          kafka_consumergroup_lag > 10000
        for: 15m
        labels:
          severity: critical
          category: messaging
        annotations:
          summary: "High lag for Kafka consumer group {{ $labels.consumergroup }}"
          description: "Topic {{ $labels.topic }} has a backlog of more than 10000 messages, current value: {{ $value }}"

Rule file 3: Prometheus self-monitoring alerts (prometheus_self.yml)

# /etc/prometheus/rules/prometheus_self.yml
groups:
  - name: prometheus_self_monitoring
    interval: 30s
    rules:
      # 🔴 P0: Prometheus scrape failure
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
          category: monitoring
        annotations:
          summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is unreachable"
          description: "The target has been down for more than 3 minutes; check the service status and network connectivity"

      # 🟠 P1: low Prometheus disk space
      # Matches only when the TSDB directory is its own mount point.
      # The expression is a 0-1 ratio so that humanizePercentage renders it correctly.
      - alert: PrometheusStorageLow
        expr: |
          node_filesystem_avail_bytes{mountpoint=~"/var/lib/prometheus.*"} / node_filesystem_size_bytes < 0.20
        for: 10m
        labels:
          severity: warning
          category: monitoring
        annotations:
          summary: "Prometheus storage is running low"
          description: "Available space in the TSDB storage directory is below 20%, current value: {{ $value | humanizePercentage }}"

      # 🟠 P1: rule evaluation failures
      - alert: PrometheusRuleEvaluationFailures
        expr: |
          rate(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
          category: monitoring
        annotations:
          summary: "Prometheus rule evaluation failures"
          description: "Rule group {{ $labels.rule_group }} is failing to evaluate at {{ $value }} failures/s"

      # 🟡 P2: slow Prometheus scrapes
      # prometheus_target_interval_length_seconds measures the actual time between scrapes.
      - alert: PrometheusSlowScrapes
        expr: |
          prometheus_target_interval_length_seconds{quantile="0.9"} > 60
        for: 15m
        labels:
          severity: info
          category: monitoring
        annotations:
          summary: "Prometheus scrapes are running slow"
          description: "P90 of the actual interval between scrapes exceeds 60 seconds (configured interval: {{ $labels.interval }}), current value: {{ $value }}s"

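Key parameters:

1. for: 5m: hold duration; the condition must stay true for 5 minutes before the alert fires (filters out transient spikes)
2. severity: alert level (critical/warning/info), used by routing and silencing rules
3. humanizePercentage: template function that renders a ratio (for example 0.85) as a percentage (85%)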
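
Because humanizePercentage multiplies its input by 100, it expects a 0-1 ratio, which is why an alert expression paired with it should evaluate to a ratio rather than to a value already scaled to 0-100. The sketch below mimics the formatting as I understand it from the Prometheus template code (value × 100, four significant digits). Treat the exact format string as an assumption; the Python function is only an illustration.

```python
def humanize_percentage(value: float) -> str:
    """Approximate Prometheus' humanizePercentage: value * 100 with 4 significant digits."""
    return "%.4g%%" % (value * 100)


if __name__ == "__main__":
    print(humanize_percentage(0.853))  # a ratio renders as 85.3%
    print(humanize_percentage(85.3))   # a pre-scaled percentage renders as 8530%
```

The second call shows the failure mode: a value that is already a percentage comes out inflated by a factor of 100.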

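Post-deployment check:

# Check the rule file syntax
promtool check rules /etc/prometheus/rules/*.yml
# Expected output: one "SUCCESS: N rules found" line per file (10, 6 and 4 rules for the three files above)

# Hot-reload the rules (Prometheus must run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# List the loaded rule groups
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# Expected output: infrastructure_alerts, application_alerts, prometheus_self_monitoring

# Test a PromQL query (in the Prometheus UI)
# Open http://localhost:9090/graph
# Enter: up{job="node_exporter"}
# Expected output: the up state of every target (1=up, 0=down)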

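[Tested] Verification results:

• The rule files are syntactically valid and load successfully on Prometheus 2.48.0
• The PromQL queries run correctly on RHEL 8.7
• The humanizePercentage alert template function renders correctly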

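Step 4: Configure Alertmanager Notification Channels

Goal: configure alert routing, grouping, inhibition and silencing

Edit the Alertmanager configuration:

# Back up the configuration
sudo cp /etc/alertmanager/alertmanager.yml /etc/alertmanager/alertmanager.yml.bak.$(date +%Y%m%d_%H%M%S)

# Edit the configuration
sudo vi /etc/alertmanager/alertmanager.yml

Production-grade configuration example: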

# /etc/alertmanager/alertmanager.yml
global:
  # Default notification channel (fallback)
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your_password'
  smtp_require_tls: true

# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing rules (core configuration)
route:
  # Default receiver
  receiver: 'default-email'

  # Grouping keys (alerts with the same values are aggregated)
  group_by: ['alertname', 'cluster', 'service']

  # Group wait (how long to wait for other alerts of a new group before the first notification)
  group_wait: 10s

  # Group interval (how long to wait before notifying about new alerts added to an existing group)
  group_interval: 10s

  # Repeat interval (how often to resend a notification for alerts that are still firing)
  repeat_interval: 12h

  # Child routes (route by label to different receivers)
  # Sibling routes are evaluated in order and the first match wins unless continue is set.
  routes:
    # Silence: test-environment alerts
    # Listed first; placed after the severity routes it would never be reached.
    - match:
        env: 'test'
      receiver: 'null'

    # P0 alerts → Slack + PagerDuty
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 10s
      repeat_interval: 5m
      continue: true   # keep matching the routes below

    # P1 alerts → Slack
    - match:
        severity: warning
      receiver: 'warning-alerts'
      group_wait: 30s
      repeat_interval: 1h

    # P2 alerts → Email
    - match:
        severity: info
      receiver: 'info-alerts'
      group_wait: 5m
      repeat_interval: 24h

    # Database alerts → DBA team
    # Only reached by critical alerts (through continue above); warning and info
    # alerts stop at their severity route.
    - match_re:
        category: '(database|cache)'
      receiver: 'dba-team'
      group_by: ['alertname', 'instance']

# Inhibition rules (prevent alert storms)
inhibit_rules:
  # When a node is down, inhibit all other alerts for that instance
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.*'
    equal: ['instance']

  # When CPU load is high, inhibit the process-count alert
  - source_match:
      alertname: 'HighCPUUsage'
    target_match:
      alertname: 'TooManyProcesses'
    equal: ['instance']

# Receiver configuration
receivers:
  # Default: Email
  - name: 'default-email'
    email_configs:
      - to: 'ops-team@example.com'
        headers:
          Subject: '[Prometheus] {{ .GroupLabels.alertname }}'

  # P0: Slack + PagerDuty
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-critical'
        title: '🔴 P0 Alert'
        text: |
          *Alert:* {{ .GroupLabels.alertname }}
          *Cluster:* {{ .GroupLabels.cluster }}
          *Summary:* {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
          *Details:* {{ range .Alerts }}{{ .Annotations.description }}{{ end }}
          *Runbook:* {{ range .Alerts }}{{ .Annotations.runbook_url }}{{ end }}
        send_resolved: true
    pagerduty_configs:
      - service_key: 'your_pagerduty_service_key'
        description: '{{ .GroupLabels.alertname }}'

  # P1: Slack
  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-warning'
        title: '🟠 P1 Alert'
        text: |
          *Alert:* {{ .GroupLabels.alertname }}
          *Summary:* {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
        send_resolved: true

  # P2: Email
  - name: 'info-alerts'
    email_configs:
      - to: 'ops-team@example.com'
        headers:
          Subject: '[Info] {{ .GroupLabels.alertname }}'

  # DBA team
  - name: 'dba-team'
    email_configs:
      - to: 'dba-team@example.com'
        headers:
          Subject: '[Database] {{ .GroupLabels.alertname }}'

  # Black hole (used to silence test-environment alerts)
  - name: 'null'

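Key parameters:

1. group_by: grouping keys; alerts with the same values for these labels are aggregated into one notification (fewer notifications)
2. group_wait: 10s: the first notification for a new group is sent after 10 seconds (time for other alerts of the same group to arrive)
3. inhibit_rules: inhibition rules; while a source alert is firing, matching target alerts are suppressed (prevents redundant notifications)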
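
The effect of group_by can be sketched as bucketing alerts by the values of the grouping labels. The example below is an illustration of that idea only, not Alertmanager code. The function name group_alerts and the sample alerts are hypothetical, and a missing label is treated as an empty value.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # label name -> label value


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels (missing label -> "")."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighCPUUsage", "cluster": "production", "instance": "192.168.1.10:9100"},
        {"alertname": "HighCPUUsage", "cluster": "production", "instance": "192.168.1.11:9100"},
        {"alertname": "NodeDown", "cluster": "production", "instance": "192.168.1.20:9100"},
    ]
    for key, members in group_alerts(alerts, ["alertname", "cluster", "service"]).items():
        print(key, len(members))
```

The two HighCPUUsage alerts share one group and therefore one notification, while NodeDown forms its own group.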

Post-deployment check:

# Check the configuration syntax
amtool check-config /etc/alertmanager/alertmanager.yml
# Expected output: Checking '/etc/alertmanager/alertmanager.yml'  SUCCESS

# Hot-reload the configuration
curl -X POST http://localhost:9093/-/reload

# Show the routing configuration
amtool config routes --alertmanager.url=http://localhost:9093
# Expected output: the routing tree

# Test alert routing (simulate an alert)
amtool alert add test_alert severity=critical instance=test:9100 --alertmanager.url=http://localhost:9093
# Show the alert status
amtool alert --alertmanager.url=http://localhost:9093
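
When reasoning about where a test alert will land, it helps to remember that sibling routes are evaluated in order and the first match wins unless continue is set. A catch-all such as an env=test route therefore has to precede the severity routes to take effect. The sketch below models that for a flat list of routes. It is a simplified illustration with hypothetical names (route_alert and the sample routes) that ignores nested child routes.

```python
import re
from typing import Any, Dict, List


def route_alert(labels: Dict[str, str], routes: List[Dict[str, Any]], default_receiver: str) -> List[str]:
    """Return the receivers a flat list of sibling routes selects for one alert."""
    receivers: List[str] = []
    for route in routes:
        exact_ok = all(labels.get(k) == v for k, v in route.get("match", {}).items())
        # Alertmanager regex matchers are anchored, hence fullmatch.
        regex_ok = all(
            re.fullmatch(pattern, labels.get(k, "")) is not None
            for k, pattern in route.get("match_re", {}).items()
        )
        if exact_ok and regex_ok:
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless continue is set
    return receivers or [default_receiver]


CRITICAL = {"match": {"severity": "critical"}, "receiver": "critical-alerts"}
TEST_ENV = {"match": {"env": "test"}, "receiver": "null"}
DBA = {"match_re": {"category": "(database|cache)"}, "receiver": "dba-team"}

if __name__ == "__main__":
    alert = {"alertname": "test_alert", "severity": "critical", "env": "test"}
    print(route_alert(alert, [CRITICAL, TEST_ENV], "default-email"))  # severity route shadows env=test
    print(route_alert(alert, [TEST_ENV, CRITICAL], "default-email"))  # env=test route wins
```

Adding continue to the critical route lets a critical database alert reach both the critical receiver and the DBA receiver, as the last assertion in a quick test would confirm.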

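Common errors:

# Error: bad YAML indentation
# Output: yaml: line 42: did not find expected key
# Fix: check the indentation levels of receivers and routes

# Error: invalid Slack webhook URL
# Output: Post "https://hooks.slack.com/...": dial tcp: lookup hooks.slack.com: no such host
# Fix: check network connectivity and DNS resolution, and confirm that the webhook URL is correct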

The article continues in Part 2.
