Awesome Cheatsheet Monitoring and Alerting: A Prometheus/Grafana Guide

Project repository: awesome-cheatsheet, https://gitcode.com/gh_mirrors/aw/awesome-cheatsheet

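Introduction

Have you ever been alerted to a system failure only hours after it began? Have you struggled with alert storms caused by poorly configured metrics? In modern distributed systems, effective monitoring and alerting is a core capability for keeping services stable. This article walks through deploying, configuring, and operating Prometheus and Grafana (pronounced graf-ah-nuh), a widely used open-source monitoring pair, to help you build an enterprise-grade monitoring and alerting system.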

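After reading this article, you will be able to:

Set up a Prometheus + Grafana monitoring stack from scratch
Query metrics with PromQL (Prometheus Query Language)
Configure precise alert rules and notification channels
Design a highly available monitoring architecture
Build clear, practical business dashboards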

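Monitoring system architecture overview

Traditional vs. cloud-native monitoring

Dimension	Traditional monitoring (e.g. Zabbix)	Cloud-native monitoring (Prometheus)
Data model	Tabular data	Time series with labels
Collection	Agent polled by the server (passive) or agent pushes (active)	Pull (the server scrapes targets)
Storage	Centralized relational database	Local time-series database
Scalability	Mostly vertical scaling	Horizontal scaling via sharding and federation
Dynamic discovery	Limited support	Native support for dynamic environments such as Kubernetes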

Prometheus core components

The Prometheus ecosystem is built from the following core components:

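Prometheus Server: the core component; scrapes, stores, and queries metrics
Exporters: expose system and application metrics in the Prometheus format
Alertmanager: handles alert grouping, inhibition, and routing
Grafana: visualization platform for building monitoring dashboards
Service Discovery: automatically discovers monitoring targets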

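Deploying and configuring Prometheus

Installing Prometheus

Binary installation (Linux)

# Download a release (v2.45.0 here; check the releases page for the current version)
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Extract the archive
tar xvfz prometheus-*.tar.gz
cd prometheus-*/

# Start Prometheus
./prometheus --config.file=prometheus.yml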

Docker deployment

docker run -d \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:v2.45.0

Core configuration file (prometheus.yml)

global:
  scrape_interval: 15s  # global scrape interval
  evaluation_interval: 15s  # rule evaluation interval

rule_files:
  - "alert.rules.yml"  # alert rules file

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']  # Prometheus monitoring itself

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']  # node_exporter target

Startup and verification

Open the Prometheus UI: http://localhost:9090

Check target status: http://localhost:9090/targets
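
The same check can be scripted against the HTTP API endpoint /api/v1/targets, which backs the targets page. The following is a minimal sketch, assuming Prometheus listens on localhost:9090 as above; the helper names fetch_targets and unhealthy_targets are made up for this example.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets document from the Prometheus HTTP API (needs a running server)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(targets_response):
    """Return (job, scrape_url, last_error) for every active target whose health is not "up"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]
```

Calling unhealthy_targets(fetch_targets()) on a healthy setup should return an empty list; anything else points at a target worth inspecting on the targets page.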

Exporters and metric collection

Common exporters

Exporter	Purpose	Default port	Key metrics
node_exporter	Host monitoring	9100	node_cpu_seconds_total, node_memory_MemFree_bytes
cadvisor	Container monitoring	8080	container_cpu_usage_seconds_total
mysqld_exporter	MySQL monitoring	9104	mysql_global_status_threads_connected
blackbox_exporter	Network probing	9115	probe_success, probe_duration_seconds
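
To make "the Prometheus format" concrete, here is a small sketch of what an exporter serves on its /metrics endpoint: plain text with HELP and TYPE comment lines followed by one line per sample. The function render_metric and the metric demo_requests_total are hypothetical and exist only for illustration; real exporters normally use an official client library, and this sketch skips label-value escaping.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are not
    escaped here, so they must not contain quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is stable across runs.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "demo_requests_total",
        "counter",
        "Total requests handled (hypothetical metric).",
        [({"method": "get", "status": "200"}, 1027),
         ({"method": "post", "status": "500"}, 3)],
    ))
```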

Deploying node_exporter

# Download and start node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*/
./node_exporter

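Configuring Prometheus to scrape node_exporter

Edit prometheus.yml so that a node job exists under the existing scrape_configs key. Do not add a second scrape_configs key or a second job with the same name; if the file already defines a node job, as the core configuration above does, just point its target at the node_exporter address:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']  # node_exporter address

After restarting Prometheus, you can query host metrics in the UI, for example used memory in bytes:

node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes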

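The PromQL query language

Metric types

Prometheus has four metric types:

Counter: a monotonically increasing value, such as the total number of requests
Gauge: a value that can go up or down, such as current memory usage
Histogram: samples counted into buckets, such as the distribution of request latency
Summary: similar to a histogram, but exposes precomputed quantiles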
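
Because a counter only increases until the process restarts, functions such as rate() have to treat a drop in value as a reset. The sketch below illustrates that idea in a simplified form; it is not the Prometheus implementation (which also extrapolates to the window boundaries), and the function names are invented for this example.

```python
def counter_increase(samples):
    """Total increase over an ordered list of counter samples.

    A sample lower than its predecessor is treated as a counter reset, so the
    whole new value counts as increase since the restart.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Reset detected: the counter started again from zero.
            total += curr
    return total


def per_second_rate(samples, window_seconds):
    """Average per-second increase over the window (no extrapolation)."""
    return counter_increase(samples) / window_seconds


if __name__ == "__main__":
    # The drop from 20 to 3 is a restart, not a negative increase.
    print(counter_increase([10, 15, 20, 3, 8]))
```

A gauge needs no such handling: its current value is meaningful on its own, which is why gauges are usually graphed directly while counters are wrapped in rate() or increase().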

Basic query examples

1. Current CPU usage (%)

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

2. Top 3 instances by memory usage (%)

topk(3,
  avg by (instance) (
    (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes)
    / node_memory_MemTotal_bytes * 100
  )
)

3. HTTP 4xx error rate (%)

sum(rate(http_requests_total{status=~"4.."}[5m])) 
/ 
sum(rate(http_requests_total[5m])) * 100

Advanced query techniques

1. Day-over-day growth rate

rate(http_requests_total[5m]) 
/ 
rate(http_requests_total[5m] offset 1d) - 1

2. Predicting whether the root filesystem will run out of space within one hour

predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 3600) < 0
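
predict_linear fits a straight line through the samples in the range by least squares and evaluates that line a given number of seconds ahead. The sketch below reproduces the arithmetic under one simplifying assumption: the evaluation time is taken to be the timestamp of the last sample. The name predict_linear_sketch and the sample values are illustrative only.

```python
def predict_linear_sketch(samples, horizon_seconds):
    """Least-squares line through (timestamp, value) samples, evaluated
    horizon_seconds after the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("timestamps must not all be equal")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_seconds) + intercept


if __name__ == "__main__":
    # Free bytes falling by 100 every 10 minutes (made-up numbers).
    free_bytes = [(0, 1000.0), (600, 900.0), (1200, 800.0), (1800, 700.0)]
    print(predict_linear_sketch(free_bytes, 3600))  # still positive after 1 hour
    print(predict_linear_sketch(free_bytes, 7200))  # negative: exhausted within 2 hours
```

A negative prediction is what the `< 0` comparison in the query selects, so the expression returns only the filesystems expected to be full within the horizon.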

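Visualization with Grafana

Installing and configuring Grafana

# Run Grafana with Docker
docker run -d -p 3000:3000 grafana/grafana:10.0.3

Open the Grafana UI: http://localhost:3000 (default credentials: admin/admin; Grafana asks you to change the password on first login)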

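Adding the Prometheus data source

After logging in, open Connections > Data sources in the left-hand menu (Configuration > Data Sources in older Grafana releases)
Click Add data source and choose Prometheus
Set the URL to the Prometheus address (for example http://prometheus:9090; the host name must be resolvable from the Grafana container)
Click Save & Test to verify the connection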
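
The same data source can also be created through Grafana's HTTP API (POST /api/datasources) instead of the UI. The sketch below only builds the request so it can be inspected before sending; it assumes basic authentication with the default admin account and the example addresses used in this article, and build_datasource_request is a hypothetical helper name.

```python
import base64
import json
from urllib.request import Request


def build_datasource_request(grafana_url, prometheus_url, user, password):
    """Build (but do not send) the request that registers Prometheus as a data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # queries go through the Grafana backend
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_datasource_request(
        "http://localhost:3000", "http://prometheus:9090", "admin", "admin"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

Passing the returned object to urllib.request.urlopen sends it to a running Grafana. In a real deployment a service account token is a better choice than the admin password.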

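Importing a community dashboard

Open Dashboards > Import in the left-hand menu
Enter the ID of the Node Exporter Full dashboard: 1860
Select the Prometheus data source you configured
Click Import to finish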

[Figure: Node Exporter dashboard]

Building a custom business dashboard

Create a simple API monitoring dashboard:

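Click New dashboard > Add visualization
Select the Prometheus data source
Enter the query: sum(rate(http_requests_total{status="200"}[5m])) (http_requests_total is a counter, so graph its rate rather than the raw cumulative value)
Set the panel title to "Successful requests per second"
Repeat with queries for the other status codes
Add a Pie Chart panel showing the distribution of status codes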

Alerting configuration

Deploying Alertmanager

docker run -d \
  -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager:v0.25.0

Alert rules (alert.rules.yml)

groups:
- name: host_alerts
  rules:
  - alert: HighCpuUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% (current value: {{ $value }})"
  
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% (current value: {{ $value }})"
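
The `for` clause means an alert does not fire the moment its expression becomes true: it stays pending until the expression has held at every evaluation for the given duration, and it drops back to inactive as soon as the expression stops matching. The following sketch models that lifecycle for a single series; it is a simplification with invented names, not the rule manager's actual code.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval[i] says whether the alert expression matched at the
    i-th evaluation; evaluations are eval_interval_s seconds apart.
    """
    states = []
    active_since = None  # time of the first evaluation in the current true streak
    for i, matched in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not matched:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # 15s evaluation interval, 30s hold duration (shortened from the 5m used above).
    print(alert_states([True, True, False, True, True, True], 15, 30))
```

The single false evaluation in the middle resets the timer, which is how `for` filters out short spikes at the cost of delaying real alerts by the same duration.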

Configuring Alertmanager (alertmanager.yml)

global:
  resolve_timeout: 5m
  # The smtp_* keys are global options; inside email_configs the per-receiver
  # overrides are named without the prefix (from, smarthost, auth_username, ...).
  smtp_from: 'alertmanager@example.com'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'  # placeholder; use a real secret
  smtp_require_tls: true

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 4h
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'admin@example.com'
    send_resolved: true
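
The group_by setting in the route decides which alerts are bundled into one notification: alerts with identical values for the listed labels land in the same group. The sketch below shows that bucketing with plain dictionaries; group_alerts is a hypothetical helper, and timing (group_wait, group_interval) is left out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label counts as an empty value.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighCpuUsage", "instance": "host-a:9100"},
        {"alertname": "HighCpuUsage", "instance": "host-b:9100"},
        {"alertname": "HighMemoryUsage", "instance": "host-a:9100"},
    ]
    for key, members in group_alerts(alerts, ["alertname"]).items():
        print(key, len(members))
```

With group_by: ['alertname'], CPU alerts from many hosts arrive as a single email. Adding instance to the list would produce one notification per host instead.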

Connecting Prometheus to Alertmanager

Edit prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093  # Alertmanager address

Best practices and performance tuning

Automatic target discovery

Configuring service discovery in a Kubernetes environment:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
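
The relabel rule above keeps only pods annotated with prometheus.io/scrape: "true". A keep action joins the values of the source labels with a separator (";" by default), matches the result against the regex anchored at both ends, and drops the target when it does not match. The sketch below imitates that logic; keep_targets is an invented name, and Python's re module stands in for the RE2 syntax Prometheus uses.

```python
import re


def keep_targets(targets, source_labels, regex, separator=";"):
    """Return the targets whose joined source-label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Missing labels contribute an empty string, as in relabeling.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


if __name__ == "__main__":
    annotation = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
    pods = [
        {"pod": "api", annotation: "true"},
        {"pod": "batch", annotation: "false"},
        {"pod": "legacy"},  # no annotation at all
    ]
    print([p["pod"] for p in keep_targets(pods, [annotation], "true")])
```

Because the match is anchored, a value such as "truely" is rejected; only the exact string "true" opts a pod into scraping.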

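Storage tuning

Retention and WAL compression are set with command-line flags, not in prometheus.yml:

./prometheus --config.file=prometheus.yml \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression
# keeps 15 days of data; WAL compression is already on by default in recent releases

Remote write is a top-level section of prometheus.yml:

remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"  # send samples to Thanos Receive for long-term storage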

High-availability deployment

Run two Prometheus replicas that send alerts to a shared Alertmanager, and combine them with Thanos for a global query view and long-term storage.

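Summary

Prometheus and Grafana have become the de facto standard for cloud-native monitoring. Their flexible data model, powerful query language, and rich visualization options cover monitoring needs from simple to complex. This article covered the full path from deployment and configuration to tuning: core component architecture, metric collection, PromQL queries, alerting configuration, and best practices.

Monitoring is key infrastructure for keeping a business running reliably, so keep refining your metrics and alerting policies as requirements change. With the basics covered here, you can move on to more advanced topics such as ServiceMonitor and Thanos.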

Author's note: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.