马哥Linux运维 | Production-Grade Prometheus Alerting Rules: 50+ Core Metrics and Best Practices (Part 3)

Latest recommended article published on 2025-11-24 20:32:43

License: CC 4.0 BY-SA

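This article comes from the WeChat official account "马哥Linux运维" and is shared for study purposes only. It will be removed on request if it infringes any rights. It is packed with practical content.

Original link: https://mp.weixin.qq.com/s/X4ADdHOXtQw5Hy5iQSal8A

The article is fairly long, so it is split into four parts: (1), (2), (3), and (4). Let's work through it together.

马哥Linux运维 | Production-Grade Prometheus Alerting Rules: 50+ Core Metrics and Best Practices (Part 1) - CSDN Blog

马哥Linux运维 | Production-Grade Prometheus Alerting Rules: 50+ Core Metrics and Best Practices (Part 2) - CSDN Blog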

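9️⃣ Common Failures and Troubleshooting

Symptom	Diagnostic command	Likely root causes	Quick fix	Permanent fix
Alert does not fire	`curl http://localhost:9090/api/v1/rules`	1. PromQL syntax error 2. `for` duration not yet reached	Check the rule state and run the PromQL manually	Fix the rule syntax and adjust the `for` duration
Alert is not sent	`amtool alert --alertmanager.url=http://localhost:9093`	1. Alertmanager is not running 2. Routing is misconfigured	Start Alertmanager and check its logs	Fix the routing configuration and test the webhook
Metrics are not scraped	`curl http://localhost:9090/api/v1/targets`	1. Target unreachable 2. Blocked by a firewall	Check the target service and test port connectivity	Fix the network issue and update the scrape configuration
Disk space exhausted	`df -h /var/lib/prometheus`	1. Retention period too long 2. High-cardinality labels	Delete old data manually	Tune retention.time and fix the high-cardinality labels
PromQL query times out	`curl -s http://localhost:9090/api/v1/query_range`	1. Query range too large 2. High-cardinality aggregation	Narrow the query range and increase step	Optimize the query and add recording rules
Rule evaluation fails	`journalctl -u prometheus | grep "error evaluating rule"`	1. Division by zero (the result is NaN or Inf) 2. Metric does not exist	Check the PromQL expression	Add `or vector(0)` to handle missing data
High-cardinality alert	`curl -s http://localhost:9090/api/v1/label/__name__/values | jq 'length'`	1. Dynamic labels (such as IP addresses) 2. UUID labels	Disable the high-cardinality exporter	Drop the labels with metric_relabel_configs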

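Debugging approach (systematic troubleshooting)

Diagnostic flow when an alert does not fire:

Step 1: Is the alerting rule loaded?
   ↓ curl http://localhost:9090/api/v1/rules
   ├─ Not loaded → check the configuration file path and the hot-reload setup
   └─ Loaded → Step 2

Step 2: Does the PromQL query return data?
   ↓ Run the query in the Prometheus UI
   ├─ No data → check whether the metric is scraped and whether the labels match
   └─ Data returned → Step 3

Step 3: What state is the alert in?
   ↓ Open the alert detail page
   ├─ Inactive → the PromQL condition is not met
   ├─ Pending → the `for` duration has not been reached
   └─ Firing → Step 4

Step 4: Did Alertmanager receive the alert?
   ↓ amtool alert --alertmanager.url=http://localhost:9093
   ├─ Not received → check connectivity between Prometheus and Alertmanager
   └─ Received → Step 5

Step 5: Is the alert inhibited or silenced?
   ↓ amtool silence query --alertmanager.url=http://localhost:9093
   ├─ Inhibited → check inhibit_rules
   ├─ Silenced → remove the silence
   └─ Neither → Step 6

Step 6: Is the notification channel configured correctly?
   ↓ Check the Alertmanager logs
   └─ Test webhook/SMTP connectivity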
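
Steps 1 to 3 of this flow can be sketched as a small Python helper that reads the JSON returned by `/api/v1/rules` and maps a rule's state to the next action. This is only an illustration: the function name `diagnose_alert`, the hint texts, and the sample payload are hypothetical, and the payload is hand-written in the shape of the API response rather than fetched from a server.

```python
def diagnose_alert(rules_payload: dict, alert_name: str) -> str:
    """Return the next troubleshooting hint for one alerting rule."""
    hints = {
        "inactive": "inactive: the PromQL condition is not met; run the expression in the UI",
        "pending": "pending: the condition holds but the `for` duration has not elapsed",
        "firing": "firing: continue with the Alertmanager checks (steps 4 to 6)",
    }
    for group in rules_payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Skip recording rules and alerting rules with a different name.
            if rule.get("type") != "alerting" or rule.get("name") != alert_name:
                continue
            if rule.get("lastError"):
                return f"error: {rule['lastError']}"
            state = rule.get("state", "unknown")
            return hints.get(state, f"unexpected state: {state}")
    return "not loaded: check the rule file paths and reload the configuration"


# Hand-written sample in the shape of the /api/v1/rules response.
sample = {
    "status": "success",
    "data": {"groups": [{"name": "node", "rules": [
        {"type": "alerting", "name": "NodeDown", "state": "firing"},
        {"type": "alerting", "name": "HighCPUUsage", "state": "pending"},
    ]}]},
}

for name in ("NodeDown", "HighCPUUsage", "DiskSpaceLow"):
    print(name, "->", diagnose_alert(sample, name))
```

In practice the payload would come from the `curl` call in step 1; the helper only decides which branch of the flow applies.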

High-cardinality label investigation:

# Top 10 metrics by series count (the count is printed first so that metric names containing ':' sort correctly)
curl -s http://localhost:9090/api/v1/label/__name__/values | jq -r '.data[]' | while read -r metric; do
  count=$(curl -s -G http://localhost:9090/api/v1/series --data-urlencode "match[]=$metric" | jq '.data | length')
  echo "$count $metric"
done | sort -nr | head -n 10

# Drop high-cardinality data (prometheus.yml)
scrape_configs:
  - job_name: 'example'
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'high_cardinality_metric.*'
        action: drop  # drops the whole metric
      - regex: 'dynamic_label'
        action: labeldrop  # drops only the named label
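
Once the series for a suspicious metric have been fetched, the same analysis can be done offline. The sketch below assumes the input is the `data` array of a `/api/v1/series` response (a list of label sets); the function names and the sample series are hypothetical. It ranks metrics by series count and then counts distinct values per label, which usually points at the label to drop.

```python
from collections import Counter, defaultdict


def top_metrics(series: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Count series per metric name and return the n largest."""
    counts = Counter(s["__name__"] for s in series if "__name__" in s)
    return counts.most_common(n)


def label_cardinality(series: list[dict]) -> dict[str, int]:
    """Count the distinct values seen for every label except __name__."""
    values = defaultdict(set)
    for labels in series:
        for key, value in labels.items():
            if key != "__name__":
                values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


# Hypothetical sample in the shape of the "data" array of /api/v1/series.
series = [
    {"__name__": "http_requests_total", "method": "GET", "client_ip": "10.0.0.1"},
    {"__name__": "http_requests_total", "method": "GET", "client_ip": "10.0.0.2"},
    {"__name__": "http_requests_total", "method": "POST", "client_ip": "10.0.0.3"},
    {"__name__": "up", "job": "node_exporter"},
]

print(top_metrics(series, 2))
print(label_cardinality(series))
```

Here `client_ip` has as many distinct values as there are series, which marks it as the candidate for `labeldrop`.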

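🔟 Change and Rollback Playbook

Canary strategy

Scenario: updating alerting rules

# 1. Validate the new rules in a test environment
promtool check rules /tmp/new_rules.yml

# 2. Test alert routing with amtool
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  --tree \
  alertname=TestAlert severity=critical

# 3. Deploy to a single Prometheus first (canary)
#    The /-/reload endpoint requires Prometheus to run with --web.enable-lifecycle
scp /tmp/new_rules.yml prometheus-01:/etc/prometheus/rules/
ssh prometheus-01 'curl -X POST http://localhost:9090/-/reload'

# 4. Observe for 15 minutes and confirm there are no unexpected alerts

# 5. Roll out to all remaining Prometheus servers
for host in prometheus-{02..05}; do
  scp /tmp/new_rules.yml "$host":/etc/prometheus/rules/
  ssh "$host" 'curl -X POST http://localhost:9090/-/reload'
done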

Health check checklist

# 1. Check that Prometheus is running
systemctl is-active prometheus
# Expected output: active

# 2. Check that every target is reachable
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, instance: .labels.instance, health}'
# Expected output: empty (all targets are up)

# 3. Check the alerting rules for errors
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.lastError != null) | {name, lastError}'
# Expected output: empty (no errors)

# 4. Check the connection to Alertmanager
curl -s http://localhost:9090/api/v1/alertmanagers | jq '.data.activeAlertmanagers[] | {url}'
# Expected output: the URL of every Alertmanager

# 5. Send a test alert
amtool alert add health_check severity=info --end=$(date -d '+1 minute' --rfc-3339=seconds | tr ' ' 'T') --alertmanager.url=http://localhost:9093
# Check whether the notification arrives
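
Check 2 can also be scripted so that it reports the scrape error next to each failing target. The sketch below assumes a payload in the shape of the `/api/v1/targets` response, where `job` and `instance` live under each target's `labels`; the function name and the sample data are hypothetical.

```python
def unhealthy_targets(targets_payload: dict) -> list[dict]:
    """Return job, instance, health, and last error for every target that is not up."""
    result = []
    for target in targets_payload.get("data", {}).get("activeTargets", []):
        if target.get("health") == "up":
            continue
        labels = target.get("labels", {})
        result.append({
            "job": labels.get("job"),
            "instance": labels.get("instance"),
            "health": target.get("health"),
            "lastError": target.get("lastError", ""),
        })
    return result


# Hand-written sample in the shape of the /api/v1/targets response.
targets_sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node_exporter", "instance": "node-1:9100"},
         "health": "up", "lastError": ""},
        {"labels": {"job": "node_exporter", "instance": "node-2:9100"},
         "health": "down", "lastError": "connection refused"},
    ]},
}

for row in unhealthy_targets(targets_sample):
    print(row)
```

An empty result corresponds to the "expected output: empty" condition in the checklist.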

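Rollback conditions and commands

Rollback triggers:

1. The new rules cause an alert storm (more than 100 alerts fired within 5 minutes)
2. A critical alert fails to fire (for example NodeDown)
3. The PromQL query error rate exceeds 5%
4. Prometheus memory usage exceeds 90%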
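
The four triggers can be encoded as a single check that returns the reasons for a rollback, which makes the decision easy to automate or log. This is a minimal sketch: the `RolloutStats` fields and the function name are hypothetical, and how each value is measured (for instance which query yields the error ratio) is left open.

```python
from dataclasses import dataclass


@dataclass
class RolloutStats:
    alerts_last_5m: int           # alerts fired during the last 5 minutes
    critical_alert_missing: bool  # a key alert such as NodeDown failed to fire
    query_error_ratio: float      # failed PromQL queries / all queries, 0.0 to 1.0
    memory_usage_ratio: float     # Prometheus memory usage, 0.0 to 1.0


def rollback_reasons(stats: RolloutStats) -> list[str]:
    """Return every rollback trigger that the observed stats satisfy."""
    reasons = []
    if stats.alerts_last_5m > 100:
        reasons.append("alert storm: more than 100 alerts in 5 minutes")
    if stats.critical_alert_missing:
        reasons.append("critical alert did not fire")
    if stats.query_error_ratio > 0.05:
        reasons.append("PromQL query error rate above 5%")
    if stats.memory_usage_ratio > 0.90:
        reasons.append("Prometheus memory usage above 90%")
    return reasons


healthy = RolloutStats(12, False, 0.01, 0.55)
storm = RolloutStats(240, False, 0.08, 0.70)

print(rollback_reasons(healthy))
print(rollback_reasons(storm))
```

An empty list means the canary can proceed; any entry is a reason to run the rollback steps.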

Rollback steps:

# 1. Restore the backed-up configuration immediately
sudo cp /etc/prometheus/rules/infrastructure.yml.bak /etc/prometheus/rules/infrastructure.yml
sudo cp /etc/alertmanager/alertmanager.yml.bak /etc/alertmanager/alertmanager.yml

# 2. Hot-reload the configuration
curl -X POST http://localhost:9090/-/reload
curl -X POST http://localhost:9093/-/reload

# 3. Expire all silences (so that alerts are not muted after the rollback)
amtool silence expire $(amtool silence query -q --alertmanager.url=http://localhost:9093) --alertmanager.url=http://localhost:9093

# 4. Verify the rollback
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].file' | sort | uniq
# Confirm that the expected rule files are loaded, then compare them with the .bak copies

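Data and configuration backup

Automated backup script:

#!/bin/bash
# File: backup_prometheus.sh
# Purpose: back up the Prometheus configuration and a TSDB snapshot
# Note: the snapshot API requires Prometheus to run with --web.enable-admin-api

set -e

BACKUP_DIR="/backup/prometheus/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

# 1. Back up the configuration files
echo "[1/4] Backing up configuration files..."
cp -r /etc/prometheus "$BACKUP_DIR/config"
cp -r /etc/alertmanager "$BACKUP_DIR/alertmanager_config"

# 2. Create a TSDB snapshot (does not interrupt the running server)
echo "[2/4] Creating TSDB snapshot..."
SNAPSHOT=$(curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')
echo "Snapshot name: $SNAPSHOT"

# 3. Copy the snapshot into the backup directory
echo "[3/4] Copying snapshot data..."
cp -r "/var/lib/prometheus/snapshots/$SNAPSHOT" "$BACKUP_DIR/tsdb_snapshot"

# 4. Remove old snapshots (keep the 3 most recent)
echo "[4/4] Cleaning up old snapshots..."
# clean_tombstones reclaims space from deleted series; it does not remove snapshots
curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
ls -t /var/lib/prometheus/snapshots/ | tail -n +4 | xargs -I {} rm -rf /var/lib/prometheus/snapshots/{}

# 5. Compress the backup (optional)
tar czf "$BACKUP_DIR.tar.gz" -C /backup/prometheus "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

echo "Backup complete: $BACKUP_DIR.tar.gz"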
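
The retention rule in step 4 (keep the three newest snapshots, remove the rest) is easy to get wrong in a shell pipeline, so it can help to see it as a pure function. The sketch below takes `(name, mtime)` pairs and returns the names to delete; the function name and the snapshot names in the sample are made up for illustration and it does not touch the filesystem.

```python
def snapshots_to_delete(snapshots: list[tuple[str, float]], keep: int = 3) -> list[str]:
    """Return the snapshot names to remove, keeping the `keep` newest by mtime."""
    newest_first = sorted(snapshots, key=lambda item: item[1], reverse=True)
    return [name for name, _mtime in newest_first[keep:]]


# Hypothetical snapshot directory listing: (name, modification time in epoch seconds).
listing = [
    ("snap-a", 1_700_000_000.0),
    ("snap-b", 1_700_086_400.0),
    ("snap-c", 1_700_172_800.0),
    ("snap-d", 1_700_259_200.0),
    ("snap-e", 1_700_345_600.0),
]

print(snapshots_to_delete(listing))
```

With five snapshots and `keep=3`, the two oldest entries are selected, which matches what `ls -t | tail -n +4` does in the script.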

Scheduled backup (crontab):

# Run the backup every day at 02:00
0 2 * * * /usr/local/bin/backup_prometheus.sh >> /var/log/prometheus_backup.log 2>&1

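1️⃣1️⃣ Best Practices

1. Alerting rule design principles

Use a `for` duration to prevent flapping:

# ❌ Wrong: no `for` duration, so a momentary spike fires the alert
- alert: HighCPUUsage
  expr: node_cpu_usage > 80

# ✅ Right: fires only after the condition has held for 5 minutes
- alert: HighCPUUsage
  expr: node_cpu_usage > 80
  for: 5m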
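
The effect of `for` can be illustrated with a small simulation of consecutive rule evaluations. This is a simplified model, not Prometheus code: it assumes one boolean result per evaluation at a fixed interval, and the function name `alert_states` is hypothetical. An alert becomes pending when the expression first matches, fires once it has matched continuously for at least the `for` duration, and resets as soon as one evaluation does not match.

```python
def alert_states(matches: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Return the alert state after each evaluation of the rule expression."""
    states = []
    active_since = None  # evaluation time at which the expression started matching
    for index, matched in enumerate(matches):
        now = index * interval_s
        if not matched:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# One short spike, one idle evaluation, then a sustained breach; evaluated every 60s.
samples = [True, False] + [True] * 6

print(alert_states(samples, interval_s=60, for_s=0))    # no `for`: the spike fires at once
print(alert_states(samples, interval_s=60, for_s=300))  # for: 5m
```

With `for_s=300` the initial spike only reaches the pending state, and the alert fires at the last evaluation, five minutes after the sustained breach began.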

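Add a runbook link to every alert:

annotations:
  runbook_url: "https://wiki.example.com/runbook/high-cpu-usage"
  description: |
    CPU usage on node {{ $labels.instance }} is above 80%
    Current value: {{ printf "%.1f" $value }}%

    Troubleshooting steps:
    1. Log in to the node and check top
    2. Look for abnormal processes
    3. Check the application logs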

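2. Avoid high-cardinality labels

High-cardinality label examples (❌ wrong):

# ❌ IP address as a label (every IP becomes its own time series)
http_requests_total{client_ip="192.168.1.100"}

# ❌ UUID as a label
request_duration_seconds{request_id="550e8400-e29b-41d4-a716-446655440000"}

The right approach (✅):

# ✅ Use labels with a bounded set of values
http_requests_total{method="GET",status="200"}

# ✅ Move high-cardinality data to the logging system
# Prometheus records the count: http_requests_total
# Loki records the detailed logs, including request_id and client_ip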
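
A quick way to see why an unbounded label is a problem: the worst-case number of series for one metric is the product of the distinct values of its labels. The sketch below works through that arithmetic; the cardinalities (5 methods, 10 status codes, 10,000 client IPs) are hypothetical figures chosen for illustration.

```python
import math


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


bounded = {"method": 5, "status": 10}
with_client_ip = {"method": 5, "status": 10, "client_ip": 10_000}

print(max_series(bounded))         # 50
print(max_series(with_client_ip))  # 500000
```

Adding the single `client_ip` label multiplies the bound by the number of clients, which is why such data belongs in logs rather than in labels.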

3. Use recording rules to optimize expensive queries

Problem: a complex query makes alert evaluation take too long

# ❌ P95 latency is computed on every evaluation (expensive)
- alert: HighHTTPLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
    ) > 2

Solution: precompute the value with recording rules

# recording_rules.yml
groups:
  - name: http_latency_precompute
    interval: 30s
    rules:
      # Precompute P95 latency
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

# The alerting rule uses the precomputed result directly (fast)
# infrastructure.yml
- alert: HighHTTPLatency
  expr: job:http_request_duration_seconds:p95 > 2
  for: 10m
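
To make clear what the recording rule precomputes, here is a simplified Python model of how a quantile is estimated from cumulative histogram buckets: find the bucket that contains the target rank and interpolate linearly inside it. It is a sketch under stated assumptions, not the Prometheus implementation: it assumes classic `le` buckets sorted by upper bound and ending with `+Inf`, a quantile in the range (0, 1], and a lower bound of 0 for the first bucket. The bucket values are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]  # position of the quantile among all observations
    index = next(i for i, (_upper, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request-duration buckets in seconds: (le, cumulative count).
buckets = [(0.5, 60.0), (1.0, 80.0), (2.0, 90.0), (5.0, 100.0), (math.inf, 100.0)]

p95 = histogram_quantile(0.95, buckets)
print(f"p95 = {p95:.2f}s, alert condition (> 2) met: {p95 > 2}")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries surround the threshold used in the alert.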

4. Configure inhibition to prevent alert storms

# ✅ When a node is down, inhibit all other alerts for that node
inhibit_rules:
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.*'
    equal: ['instance']

  # ✅ When a disk is read-only, inhibit the disk-space alert
  - source_match:
      alertname: 'DiskReadOnly'
    target_match:
      alertname: 'DiskSpaceLow'
    equal: ['instance', 'device']
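
The matching logic behind an inhibit rule can be sketched in a few lines: a target alert is muted when a firing source alert exists and both carry the same values for every label in `equal`. The model below is a simplification of Alertmanager's behavior with a hypothetical function name: it matches sources by exact `alertname`, targets by a fully anchored regex, treats a missing label as an empty value, and skips sources that would otherwise let an alert inhibit itself.

```python
import re


def is_inhibited(target: dict, firing: list[dict], source_alertname: str,
                 target_pattern: str, equal: list[str]) -> bool:
    """Return True if some firing source alert inhibits `target`."""

    def matches_source(alert: dict) -> bool:
        return alert.get("alertname") == source_alertname

    def matches_target(alert: dict) -> bool:
        return re.fullmatch(target_pattern, alert.get("alertname", "")) is not None

    if not matches_target(target):
        return False
    for source in firing:
        if not matches_source(source):
            continue
        # An alert that matches both sides is not inhibited by a source that also does.
        if matches_source(target) and matches_target(source):
            continue
        if all(source.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


firing = [{"alertname": "NodeDown", "instance": "node-1:9100"}]
cpu_same_node = {"alertname": "HighCPUUsage", "instance": "node-1:9100"}
cpu_other_node = {"alertname": "HighCPUUsage", "instance": "node-2:9100"}

print(is_inhibited(cpu_same_node, firing, "NodeDown", ".*", ["instance"]))
print(is_inhibited(cpu_other_node, firing, "NodeDown", ".*", ["instance"]))
print(is_inhibited(firing[0], firing, "NodeDown", ".*", ["instance"]))
```

The `equal` list is what scopes the rule: without `instance` in it, one NodeDown alert would mute alerts on every node.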

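5. Rehearse failure scenarios regularly

Run an alerting drill once per quarter:

# Scenario 1: simulate a node outage
ssh target-node 'sudo systemctl stop node_exporter'
# Expected: the NodeDown alert fires within 1 minute

# Scenario 2: simulate high CPU load
ssh target-node 'stress --cpu 8 --timeout 600s'
# Expected: the HighCPUUsage alert fires within 5 minutes

# Scenario 3: simulate low disk space
ssh target-node 'dd if=/dev/zero of=/tmp/fillfile bs=1G count=50'
# Expected: the DiskSpaceLow alert fires within 5 minutes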

6. Use Alertmanager silences for maintenance windows

# Create a silence (2-hour maintenance window)
amtool silence add \
  alertname=~".*" \
  instance=~"192.168.1.10:9100" \
  --duration=2h \
  --author="ops@example.com" \
  --comment="Server maintenance window" \
  --alertmanager.url=http://localhost:9093

# List all silences
amtool silence query --alertmanager.url=http://localhost:9093

# End a silence early
amtool silence expire <silence_id> --alertmanager.url=http://localhost:9093
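
When silences are created from a deployment tool rather than by hand, the same maintenance window can be expressed as a JSON body. The sketch below only builds that body; it assumes the field names of Alertmanager's v2 silence API (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`), which should be checked against the API specification of the Alertmanager version in use. The function name and the fixed start time are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def maintenance_silence(instance_regex: str, hours: float, author: str,
                        comment: str, start: Optional[datetime] = None) -> dict:
    """Build a silence body that mutes all alerts for the matching instances."""
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": ".*", "isRegex": True},
            {"name": "instance", "value": instance_regex, "isRegex": True},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


window_start = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
body = maintenance_silence("192.168.1.10:9100", 2, "ops@example.com",
                           "Server maintenance window", start=window_start)
print(json.dumps(body, indent=2))
```

Building the body separately from sending it makes the window easy to review and to unit-test before it is posted to Alertmanager.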

7. Use Prometheus federation (large-scale deployments)

Architecture: a central Prometheus aggregates several edge Prometheus servers

# Central Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node_exporter"}'   # pull all node_exporter metrics
        - '{__name__=~"job:.*"}'    # pull all recording rules
    static_configs:
      - targets:
          - 'edge-prometheus-01:9090'
          - 'edge-prometheus-02:9090'
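
Before relying on the federation job, it is useful to request the same URL by hand and check what an edge server returns. The sketch below builds that URL from the selectors in the configuration above, taking care of the URL encoding of `match[]` and the selector braces. The function name `federate_url` is hypothetical, and the host name is taken from the example targets.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base: str, selectors: list[str]) -> str:
    """Build the /federate URL with one match[] parameter per selector."""
    query = urlencode({"match[]": selectors}, doseq=True)
    return f"{base.rstrip('/')}/federate?{query}"


selectors = ['{job="node_exporter"}', '{__name__=~"job:.*"}']
url = federate_url("http://edge-prometheus-01:9090", selectors)
print(url)

# Decoding the query string gives back the original selectors.
print(parse_qs(urlsplit(url).query)["match[]"])
```

The printed URL can be passed to `curl` on the central server to confirm that the edge instance exposes the expected series.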

The remaining content continues in Part 4.
