Grafana in Practice: Configuring Alerting End to End (from Contact Points to Routing)

Originally published 2025-08-30 09:39:50

License: CC 4.0 BY-SA

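This article walks through a complete Grafana alerting setup, step by step:

✅ Add a notification channel (Contact Point)
✅ Create an alert rule (Alert Rule)
✅ Add labels for routing
✅ Set up a notification policy (Notification Policy)

End result: when CPU usage stays above 80% for 5 minutes, Grafana sends an alert to Slack.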

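1. Prerequisites

Grafana is running (http://localhost:3000)
A Prometheus data source is configured
A host monitoring dashboard exists (with the node_cpu_seconds_total metric)
Permission in a Slack workspace to create an Incoming Webhook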

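2. Step 1: Create a Contact Point (configure the Slack notification channel)

2.1 Create a webhook in Slack

Open the Slack app management page
Search for "Incoming Webhooks" or go to https://api.slack.com/apps
Create a new app → Incoming Webhooks → enable
Choose a channel (for example #alerts)

Copy the Webhook URL: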

https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
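Before wiring the webhook into Grafana, it can help to confirm that the URL works on its own. The sketch below is one way to do that with the Python standard library, assuming the webhook accepts a JSON body with a `text` field; the function names are illustrative, and the URL is the placeholder from above, not a working endpoint.

```python
import json
import urllib.request

# Placeholder copied from the article; replace it with the real Webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a minimal Slack message payload."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url: str, text: str = "Webhook test from Python") -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_test_message(WEBHOOK_URL)` with a real URL should post one message to the channel selected when the webhook was created. Nothing is sent until that function is called.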

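2.2 Add the Contact Point in Grafana

Log in to Grafana
Click Alerting (bell icon) in the left menu → Contact points
Click New contact point
Fill in:

Field	Value
Name	`slack-ops`
Type	`Slack`
URL	Paste the Webhook URL copied above
Channel	`#alerts` (optional; webhooks created through a Slack app are bound to the channel chosen at creation, so this override may be ignored)
Send resolved alerts	✅ Enabled, so recoveries are also notified (some Grafana versions expose this as "Disable resolved message", which should then stay unchecked)

Click Test and confirm that the test message arrives in Slack
Click Create to save

✅ A "Notification test sent" message indicates success.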
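The same contact point can also be created without the UI. The sketch below assumes Grafana's alerting provisioning HTTP API is available at `/api/v1/provisioning/contact-points` and that a service account token with sufficient permissions exists; `API_TOKEN` and the helper names are hypothetical placeholders, and the payload fields should be checked against the API reference for the Grafana version in use.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # from the prerequisites
API_TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"  # hypothetical placeholder


def contact_point_payload(name: str, webhook_url: str) -> dict:
    """Describe a Slack contact point that also sends resolved notifications."""
    return {
        "name": name,
        "type": "slack",
        "settings": {"url": webhook_url},
        "disableResolveMessage": False,
    }


def build_create_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that would create the contact point."""
    return urllib.request.Request(
        f"{base_url}/api/v1/provisioning/contact-points",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the result of `build_create_request(...)` to `urllib.request.urlopen` would submit it. Keeping the request construction separate makes the payload easy to inspect first.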

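3. Step 2: Create an Alert Rule (define the trigger condition)

3.1 Open the alert rule creation page

Alerting → Alert rules → Create alert rule

3.2 Configure the alert rule

(1) Basic settings

Field	Value
Rule name	`HighCpuUsage`
Evaluate every	`1m` (evaluate once per minute)
For	`5m` (fire only after the condition has held for 5 minutes)

(2) Query & alert condition

Click + Query to add a query
Data source: Prometheus

Query: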

1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle",job="node"}[5m]))

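Keep A as the Ref ID

✅ This query computes CPU usage as a fraction between 0 and 1: the per-second rate of the idle counter is the fraction of time spent idle, so 1 minus that rate is the busy fraction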
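The arithmetic behind the query can be worked through by hand. The sketch below is a simplified model, not Prometheus's actual `rate()` implementation (which also handles counter resets and extrapolation): it takes two samples of the idle-seconds counter per core, derives the idle fraction, and averages across cores to get a host-level usage figure. The sample numbers are made up for illustration.

```python
def idle_rate(first, last):
    """Per-second increase of an idle-seconds counter between two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage(idle_samples_per_core):
    """Return 1 minus the average idle rate across all cores of one host."""
    rates = [idle_rate(first, last) for first, last in idle_samples_per_core]
    return 1 - sum(rates) / len(rates)


# Two cores observed over a 300 s window (timestamps in seconds, values in idle seconds).
cores = [
    ((0, 1000.0), (300, 1060.0)),  # 60 s idle out of 300 s -> idle rate 0.2
    ((0, 2000.0), (300, 2030.0)),  # 30 s idle out of 300 s -> idle rate 0.1
]

usage = cpu_usage(cores)
print(f"{usage:.2f}")  # 0.85
```

With an average idle rate of 0.15, the host is 85% busy, which is above the 0.8 threshold used in this article.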

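(3) Alert condition

Condition: A
Reducer: avg() (averages each series over the query time range, giving one value per series)
Operator: >
Value: 0.8 (80%)

✅ Meaning: when the averaged CPU usage stays above 80% for 5 minutes, the alert fires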
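How Evaluate every and For interact is easier to see in a small simulation. The sketch below is a simplified model of the Normal → Pending → Firing transitions, assuming one evaluation per interval and that an alert fires once the condition has held for at least the For duration; it is not Grafana's scheduler, and `simulate` is a name chosen here for illustration.

```python
def simulate(samples, threshold=0.8, interval_s=60, for_s=300):
    """Return the alert state after each evaluation of a list of CPU usage samples."""
    states = []
    pending_since = None  # time at which the condition first became true
    for i, value in enumerate(samples):
        now = i * interval_s
        if value > threshold:
            if pending_since is None:
                pending_since = now
            # Fire only once the condition has held for the full For duration.
            state = "Firing" if now - pending_since >= for_s else "Pending"
        else:
            pending_since = None
            state = "Normal"
        states.append(state)
    return states


# One sample per minute: a dip, six minutes above the threshold, then recovery.
states = simulate([0.5, 0.85, 0.9, 0.9, 0.9, 0.9, 0.9, 0.7])
for minute, state in enumerate(states):
    print(minute, state)
```

In this run the rule stays Pending for five evaluations and turns Firing on the sixth breach, five minutes after the first one. A single sample below the threshold resets the timer, which is what filters out short spikes.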

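(4) Labels (used for route matching)

Click Add label and add:

Key: severity
Value: warning

✅ This label is used by the notification policy configured later

(5) Annotations (enrich the alert message)

Click Add annotation and add:

Summary: CPU usage high on {{ $labels.instance }}
Description: CPU usage is {{ $value }} on {{ $labels.instance }}. [Runbook](https://runbooks.example.com/high-cpu)
runbook_url: https://runbooks.example.com/high-cpu

✅ Template variables such as {{ $labels.xxx }} and {{ $value }} are supported. Depending on the Grafana version and rule type, {{ $value }} may render as a longer string listing the values of all expressions rather than a bare number.

(6) Save Rule

Click Save to save the alert rule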

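4. Step 3: Configure a Notification Policy (route the alert)

4.1 Open the notification policies page

Alerting → Notification policies

4.2 Configure the routing rule

A default policy probably already exists; click Edit to modify it, or add a new policy.

Add a new routing rule

Click Add route
Configure the matcher (Matchers):
- Label: severity
- Operator: =
- Value: warning
Contact point: select slack-ops
(Optional) Configure grouping:
- Group by: alertname, instance
- Group wait: 30s (how long to wait before sending the first notification for a new group)
- Group interval: 5m (how long to wait before notifying about new alerts added to an existing group)
- Repeat interval: 1h (how long to wait before re-sending a notification for alerts that are still firing)

✅ Result: every alert with severity=warning is sent to Slack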
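The routing and grouping described above can be sketched as plain label matching. This is a minimal model under stated assumptions: only equality matchers are handled, the first matching route wins, and `default-receiver` is a hypothetical name for the default policy's contact point. Grafana's real policy tree also supports regex matchers, nesting, and "continue matching" behavior that this sketch omits.

```python
from collections import defaultdict

DEFAULT_CONTACT_POINT = "default-receiver"  # hypothetical name for the default policy's contact point

# One route, mirroring the matcher configured in this step.
ROUTES = [
    {"matchers": {"severity": "warning"}, "contact_point": "slack-ops"},
]


def route(labels, routes=ROUTES, default=DEFAULT_CONTACT_POINT):
    """Return the contact point of the first route whose matchers all equal the alert's labels."""
    for candidate in routes:
        if all(labels.get(key) == value for key, value in candidate["matchers"].items()):
            return candidate["contact_point"]
    return default


def group_alerts(alerts, group_by=("alertname", "instance")):
    """Bucket alerts that share the same values for the group-by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighCpuUsage", "instance": "node-1:9100", "severity": "warning"},
    {"alertname": "HighCpuUsage", "instance": "node-2:9100", "severity": "warning"},
]
for labels in alerts:
    print(labels["instance"], "->", route(labels))
print(len(group_alerts(alerts)), "groups")
```

Grouping by alertname and instance puts each host in its own group, so each host produces its own notification. Grouping by alertname alone would merge both hosts into a single message.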

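4.3 Save the policy

Click Save policy

5. Verify the alert flow

5.1 Check the alert state

Go to Alerting → Alert rules
Check the state of the HighCpuUsage rule:
- Healthy: Normal
- Triggered: Firing
- About to trigger: Pending (condition met, but the For duration has not elapsed yet)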
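Rule states can also be checked from a script. The sketch below assumes Grafana exposes a Prometheus-compatible rules endpoint at `/api/prometheus/grafana/api/v1/rules` and that its response follows the Prometheus rules format, where states are reported as `inactive`, `pending`, or `firing`. The mapping to the UI names, the token, and the `SAMPLE` payload are assumptions written by hand for illustration, not captured output.

```python
import urllib.request


def build_rules_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Prometheus-style rules listing."""
    return urllib.request.Request(
        f"{base_url}/api/prometheus/grafana/api/v1/rules",
        headers={"Authorization": f"Bearer {token}"},
    )


# Assumed mapping from Prometheus-style states to the names shown in the Grafana UI.
UI_STATE = {"inactive": "Normal", "pending": "Pending", "firing": "Firing"}


def rule_states(payload: dict) -> dict:
    """Map each rule name to a UI-style state, given a decoded rules response."""
    states = {}
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            raw = rule.get("state", "unknown")
            states[rule["name"]] = UI_STATE.get(raw, raw)
    return states


# Hand-written sample in the assumed response shape.
SAMPLE = {
    "status": "success",
    "data": {"groups": [{"name": "cpu", "rules": [{"name": "HighCpuUsage", "state": "pending"}]}]},
}

print(rule_states(SAMPLE))
```

To use it against a live instance, send the request with `urllib.request.urlopen`, decode the JSON body, and pass the result to `rule_states`.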

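5.2 Manual test (optional)

If the alert cannot be triggered right away, temporarily modify the rule:

Change For to 1m
Change the threshold to 0.1 (10%)
Wait for the alert to fire, then restore the original values

5.3 Check the Slack notification

After the alert fires, Slack should receive a message similar to: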

🚨 FIRING
HighCpuUsage
CPU usage high on node-1

Labels:
- alertname: HighCpuUsage
- instance: node-1:9100
- job: node
- severity: warning

Annotations:
- description: CPU usage is 0.85 on node-1:9100. [Runbook](https://runbooks.example.com/high-cpu)
- runbook_url: https://runbooks.example.com/high-cpu

Value: 0.85
Graph: [View in Grafana](http://grafana:3000/d/...)

✅ Click the link to jump to Grafana for details.

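6. Recovery notification

When CPU usage drops back below 80%, the rule by default stops firing at the next evaluation; the For duration delays entering Firing, not recovery:

The rule returns to Normal and the alert is marked Resolved
Slack receives a recovery notification (because Send resolved alerts is enabled)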

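7. Advanced configuration suggestions

Configuration	Description
Add an Email Contact Point	Notify over two channels for important alerts
Create a Critical route	`severity=critical` → PagerDuty
Set up a Mute Timing	Silence alerts during maintenance windows
Configure Inhibition	Suppress alerts for services on a node when the node itself is down
Use Variables	Monitor instances dynamically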

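8. Best practices summary

Step	Key point
Contact Point	Test the connection and confirm the message arrives
Alert Rule	Use `For` to avoid firing on transient spikes
Labels	Use `severity=warning/critical` to make routing easy
Annotations	Add `summary`, `description`, `runbook_url`
Notification Policy	Group sensibly to avoid alert storms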