EMQX Monitoring and Alerting Configuration: Prometheus + Grafana in Practice

Project repository: emqx, "The most scalable open-source MQTT broker for IoT, IIoT, and connected vehicles": https://gitcode.com/gh_mirrors/em/emqx

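1. Pain Points and the Solution

Do dropped IoT device connections go unnoticed until it is too late? Is troubleshooting an EMQX cluster held back by a lack of data? This article walks through a hands-on Prometheus + Grafana setup that gives EMQX a complete monitoring and alerting stack. It covers:

- Exposing EMQX monitoring metrics from scratch
- Prometheus data collection and storage strategy
- Deploying and customizing Grafana dashboards
- Alert rules for key business metrics
- A high-availability monitoring architecture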

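2. Monitoring Architecture Overview

2.1 Architecture Flow Diagram

[Diagram: monitoring architecture flow (Mermaid source not preserved)]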

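2.2 Component Version Compatibility

Component	Minimum version	Recommended version	Notes
EMQX	5.0.0	5.4.0+	Exposes Prometheus metrics natively
Prometheus	2.30.0	2.45.0+	Supports remote write and federation
Grafana	8.0.0	10.2.0+	Can import the official EMQX dashboard template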

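3. EMQX Metrics Configuration

3.1 Enabling Prometheus Metrics

EMQX exposes Prometheus metrics by default. Verify this with:

curl -f "http://127.0.0.1:18083/api/v5/prometheus/stats"

A successful response returns metrics in the Prometheus text format, similar to: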

# HELP emqx_connections_count Number of current connections
# TYPE emqx_connections_count gauge
emqx_connections_count{node="emqx@127.0.0.1"} 120
# HELP emqx_messages_received Total number of received messages
# TYPE emqx_messages_received counter
emqx_messages_received{node="emqx@127.0.0.1"} 56320
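To make the response format concrete, here is a minimal Python sketch that parses this kind of text output into (name, labels, value) tuples. It is a simplified reader written for illustration, not a full implementation of the exposition format: it skips comment lines and ignores optional timestamps. The names `parse_metrics` and `SAMPLE` are hypothetical.

```python
import re

# Matches sample lines such as: emqx_connections_count{node="emqx@127.0.0.1"} 120
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair inside the braces: key="value"
LABEL_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# The sample response shown above, used as input.
SAMPLE = '''# HELP emqx_connections_count Number of current connections
# TYPE emqx_connections_count gauge
emqx_connections_count{node="emqx@127.0.0.1"} 120
# HELP emqx_messages_received Total number of received messages
# TYPE emqx_messages_received counter
emqx_messages_received{node="emqx@127.0.0.1"} 56320
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice the text would come from the `/api/v5/prometheus/stats` endpoint rather than a literal string.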

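3.2 Advanced Configuration (Optional)

Adjust how metrics are exposed in the EMQX configuration file emqx.conf. Key names differ between EMQX 5.x releases, so check them against the configuration reference for your version:

prometheus {
  ## Require basic authentication on the scrape endpoint
  enable_basic_auth = false
  ## Username and password as given in the original article; confirm that
  ## your EMQX version supports these two keys before relying on them
  basic_auth_username = "emqx_monitor"
  basic_auth_password = "secret"

  ## Metric collectors (values are enabled/disabled rather than true/false)
  collectors {
    ## Mnesia database metrics
    mnesia = disabled
    ## VM memory metrics
    vm_memory = disabled
    ## VM statistics
    vm_statistics = disabled
  }

  ## PushGateway settings (optional, for environments Prometheus cannot scrape directly)
  push_gateway {
    enable = false
    url = "http://pushgateway:9091"
    interval = "15s"
    job_name = "${name}/instance/${name}~${host}"
    headers = {
      Authorization = "Bearer some-token"
    }
  }
}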
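If basic authentication is switched on, any client that scrapes the endpoint has to send an `Authorization` header. The following Python sketch shows how such a request could be built with the standard library. The function name `build_scrape_request` is hypothetical, and the username and password are the placeholder values from the configuration above. Which credentials EMQX actually accepts depends on the version, so treat them as assumptions.

```python
import base64
import urllib.request


def build_scrape_request(url, username, password):
    """Build a GET request carrying an HTTP Basic Authorization header."""
    # HTTP Basic auth is "Basic " + base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials taken from the example configuration (assumptions).
request = build_scrape_request(
    "http://127.0.0.1:18083/api/v5/prometheus/stats", "emqx_monitor", "secret"
)
print(request.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen` would perform the scrape. On the Prometheus side, the equivalent is a `basic_auth` block in the scrape job.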

4. Prometheus Configuration

4.1 Installing Prometheus

# Download the v2.45.0 release (Linux amd64)
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

4.2 Configuring Prometheus

Create or edit prometheus.yml:

global:
  scrape_interval: 15s  # Global scrape interval
  evaluation_interval: 15s  # Rule evaluation interval

rule_files:
  - "alert_rules.yml"  # Alert rules file

scrape_configs:
  - job_name: 'emqx'
    metrics_path: '/api/v5/prometheus/stats'
    static_configs:
      - targets: ['127.0.0.1:18083']  # List of EMQX nodes
        labels:
          cluster: 'emqx-cluster-01'  # Cluster identifier

  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
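For clusters with many nodes, the static target list can instead be generated and loaded through Prometheus file-based service discovery (`file_sd_configs`), which reads target groups from a JSON or YAML file. The Python sketch below produces that JSON structure. The host names and the helper names `build_file_sd` and `render_file_sd` are hypothetical.

```python
import json


def build_file_sd(nodes, cluster, port=18083):
    """Return one file_sd target group: a list of host:port targets plus labels."""
    return [
        {
            "targets": [f"{host}:{port}" for host in nodes],
            "labels": {"cluster": cluster},
        }
    ]


def render_file_sd(nodes, cluster):
    """Serialize the target groups as JSON, ready to be written to a file."""
    return json.dumps(build_file_sd(nodes, cluster), indent=2)


# Hypothetical node names, for illustration only.
print(render_file_sd(["emqx-node1", "emqx-node2", "emqx-node3"], "emqx-cluster-01"))
```

The emqx job would then reference the generated file with a `file_sd_configs` entry in place of `static_configs`, and Prometheus picks up changes to the file without a restart.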

4.3 Starting Prometheus

./prometheus --config.file=prometheus.yml --storage.tsdb.path=data/ --web.enable-lifecycle
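The `--web.enable-lifecycle` flag enables the HTTP lifecycle endpoints, so configuration changes can be applied with a POST to `/-/reload` instead of a restart. Here is a small Python sketch of that call, assuming Prometheus listens on localhost:9090. The function names are hypothetical.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to reload its configuration."""
    url = f"{base_url.rstrip('/')}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request; returns True when Prometheus answers 200."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as response:
        return response.status == 200


request = build_reload_request()
print(request.get_method(), request.full_url)
```

Calling `reload_prometheus()` after editing prometheus.yml or the rules file applies the change. If the new configuration is invalid, Prometheus rejects the reload and keeps running with the old one.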

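5. Grafana Visualization

5.1 Installing and Starting Grafana

# Ubuntu/Debian
sudo apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_10.2.0_amd64.deb
sudo dpkg -i grafana-enterprise_10.2.0_amd64.deb
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

5.2 Importing the Official EMQX Dashboard

1. Log in to Grafana (default address: http://localhost:3000, username/password: admin/admin).
2. Go to Dashboards > Import.
3. Enter the official dashboard ID 17446, or import the JSON file from the EMQX source tree:
- Path: apps/emqx_prometheus/grafana_template/EMQ_Dashboard.json
4. Select the Prometheus data source and finish the import. If no Prometheus data source exists yet, add one first, pointing at http://localhost:9090.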

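5.3 Key Dashboards

EMQX provides three core dashboards:

- EMQ overview (ID: 17446): overall cluster status
  - Client connection count over time
  - Message throughput statistics
  - Distribution of topics and subscriptions
- ErlangVM: node virtual machine monitoring
  - Process count and memory usage
  - I/O scheduling and garbage collection
  - Network distribution statistics
- EMQ: detailed metrics
  - Connection statistics by protocol type
  - Rule engine processing performance
  - Data bridge forwarding status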

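6. Alert Rules

6.1 Creating Prometheus Alert Rules

Create the alert_rules.yml file:

groups:
- name: emqx_alerts
  rules:
  # Client connection count alert
  # Note: emqx_max_connections is the metric name given in the original article.
  # Confirm it appears in your /stats output; emqx_connections_max, if present,
  # is a historical peak rather than a configured limit.
  - alert: HighConnectionCount
    expr: sum(emqx_connections_count) / sum(emqx_max_connections) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "EMQX connection count is high"
      description: "Connections are at {{ $value | humanizePercentage }} of the maximum ({{ $value }})"

  # Node down alert
  - alert: NodeDown
    expr: up{job="emqx"} == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "EMQX node is down"
      description: "Node {{ $labels.instance }} has been down for more than 30 seconds"

  # Dropped message alert
  - alert: MessageDroppedRate
    expr: rate(emqx_messages_dropped[5m]) > 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Message drop rate is high"
      description: "Messages dropped over the last 5 minutes: {{ $value | humanize }}/s"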
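To show what these expressions compute, the following Python sketch reproduces the arithmetic behind the connection-ratio rule and a simplified version of `rate()`. The helper names and sample numbers are made up for illustration. Unlike the PromQL `rate()` function, `simple_rate` uses only the first and last sample and does not handle counter resets or extrapolation.

```python
def connection_ratio(current_by_node, limit_by_node):
    """Mirror sum(current) / sum(limit) across all nodes."""
    return sum(current_by_node.values()) / sum(limit_by_node.values())


def simple_rate(samples):
    """Approximate per-second increase from (timestamp_seconds, counter_value) pairs.

    Simplification: first and last sample only, no counter-reset handling.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical readings from a two-node cluster.
ratio = connection_ratio({"node1": 900, "node2": 800}, {"node1": 1000, "node2": 1000})
print(ratio > 0.8)  # the condition of the HighConnectionCount rule

# Hypothetical dropped-message counter sampled over a 5-minute window.
drops_per_second = simple_rate([(0, 100), (150, 1900), (300, 3700)])
print(drops_per_second > 10)  # the condition of the MessageDroppedRate rule
```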

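6.2 Configuring Grafana Alert Notifications

1. Go to Alerting > Contact points (Alerting > Notification channels under the legacy alerting of Grafana 8).
2. Create a new contact point (New channel in legacy alerting) and configure the notification method:

Notification type	Key settings	Example
Email	SMTP server address and credentials	SMTP server: smtp.example.com:587
DingTalk	Robot webhook URL	https://oapi.dingtalk.com/robot/send?access_token=XXX
Slack	Channel name and API token	Channel: #emqx-alerts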
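Before wiring a webhook into Grafana, it can help to confirm that the webhook itself accepts messages. The Python sketch below builds a plain text message for a DingTalk custom robot. The function name is hypothetical and the token is the placeholder from the table. Robots configured with keyword or signature security may need additional fields, which this sketch does not cover.

```python
import json
import urllib.request


def build_dingtalk_request(webhook_url, text):
    """Build a POST request carrying a DingTalk text message as JSON."""
    body = json.dumps({"msgtype": "text", "text": {"content": text}}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder webhook URL from the table above (not a working token).
request = build_dingtalk_request(
    "https://oapi.dingtalk.com/robot/send?access_token=XXX",
    "EMQX test alert",
)
print(request.get_method(), request.data.decode("utf-8"))
```

Sending the request with `urllib.request.urlopen(request)` delivers the message once a real token is in place.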

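6.3 Alert Thresholds for Key Business Metrics

Metric	Threshold	Duration	Severity
Connection failure rate	> 10/s	2 minutes	Warning
Message latency	p95 > 500 ms	5 minutes	Warning
Node CPU usage	> 80%	10 minutes	Critical
Rule engine errors	> 5/min	1 minute	Emergency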
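The duration column corresponds to the `for` clause of a Prometheus rule: the condition must hold at every evaluation for that long before the alert fires, which filters out short spikes. A minimal Python sketch of that behavior follows, using a hypothetical helper `first_firing_time` and made-up samples.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert first fires, or None.

    samples: (timestamp_seconds, value) pairs in time order, one per evaluation.
    The alert is pending from the first breach and fires once the breach has
    lasted for_seconds without interruption.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            # Any evaluation below the threshold resets the pending state.
            pending_since = None
    return None


# Connection failures per second, evaluated once a minute (made-up numbers).
failures = [(0, 12), (60, 12), (120, 5), (180, 12), (240, 12), (300, 12)]
print(first_firing_time(failures, threshold=10, for_seconds=120))
```

In this example the dip at 120 seconds resets the pending state, so the two-minute requirement is only met at the 300-second evaluation.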

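7. High-Availability Monitoring Architecture

7.1 Prometheus Federation

[Diagram: Prometheus federation topology (Mermaid source not preserved)]

Configure the federating (global) Prometheus server to pull EMQX series from the regional servers: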

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="emqx"}'
    static_configs:
      - targets:
        - 'region1-prometheus:9090'
        - 'region2-prometheus:9090'
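Under this configuration, the global server issues HTTP GET requests against each regional server's `/federate` endpoint, passing the selectors as repeated `match[]` parameters. The Python sketch below builds that URL, which is handy for checking by hand what a regional server would return. The function name `federate_url` is hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build the /federate URL with one match[] parameter per selector."""
    # doseq=True expands the list into repeated match[]=... parameters.
    query = urlencode({"match[]": matchers}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"


print(federate_url("http://region1-prometheus:9090", ['{job="emqx"}']))
```

Opening the printed URL in a browser or with curl should list the current EMQX series held by that regional server.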

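7.2 Persisting Monitoring Data

Configure Prometheus remote write to InfluxDB (the /api/v1/prom/write endpoint used here belongs to the InfluxDB 1.x API):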

remote_write:
  - url: "http://influxdb:8086/api/v1/prom/write?db=prometheus"
    basic_auth:
      username: "influxdb-user"
      password: "influxdb-pass"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'emqx_.*'
        action: keep
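The `write_relabel_configs` entry acts as a filter: with `action: keep`, only series whose `__name__` fully matches the regex are forwarded. The Python sketch below imitates that decision for a list of metric names. It uses Python's `re` module with `fullmatch` to mirror the anchoring Prometheus applies. Prometheus itself uses RE2 syntax, which behaves the same for a simple pattern like this one. The function name is hypothetical.

```python
import re


def keep_by_name(metric_names, regex="emqx_.*"):
    """Return the names that a 'keep' relabel rule on __name__ would let through."""
    pattern = re.compile(regex)
    # Relabel regexes are anchored at both ends, hence fullmatch rather than search.
    return [name for name in metric_names if pattern.fullmatch(name)]


names = ["emqx_connections_count", "up", "go_goroutines", "my_emqx_metric"]
print(keep_by_name(names))
```

Because of the anchoring, a name such as `my_emqx_metric` is dropped even though it contains `emqx_`.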

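8. Troubleshooting

8.1 Metric Scraping Fails

Check the EMQX metrics endpoint:

curl -s -o /dev/null -w "%{http_code}\n" "http://127.0.0.1:18083/api/v5/prometheus/stats"
# Expected output: 200

Validate the Prometheus configuration:

./promtool check config prometheus.yml

Check the Prometheus target status:
- Open http://localhost:9090/targets
- Check the health of the "emqx" job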
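The same target information is available as JSON from the Prometheus HTTP API at `/api/v1/targets`, which is convenient for scripted checks. The Python sketch below filters a parsed response for targets of the emqx job that are not healthy. The function name and the sample payload are illustrative; the payload contains only the fields the function reads.

```python
def unhealthy_targets(payload, job="emqx"):
    """Return (instance, last_error) pairs for targets of the job that are not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (target["labels"].get("instance", "?"), target.get("lastError", ""))
        for target in targets
        if target.get("labels", {}).get("job") == job and target.get("health") != "up"
    ]


# Trimmed-down example of an /api/v1/targets response (made-up values).
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "emqx", "instance": "127.0.0.1:18083"},
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "health": "up", "lastError": ""},
        ]
    },
}

print(unhealthy_targets(sample_payload))
```

The `lastError` text usually points at the cause, for example a refused connection or a non-200 response from the metrics path.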

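8.2 Grafana Panels Show No Data

Check the data source configuration:
- Go to Connections > Data sources (Configuration > Data Sources in older Grafana versions)
- Click "Save & test" to verify the Prometheus connection
Verify the query expression:
```
emqx_connections_count{job="emqx"}
```
Check the time range:
- Make sure the Grafana time range covers the period that has data (default: Last 6 hours)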
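To tell a Grafana problem from a missing-data problem, the same expression can be run directly against the Prometheus HTTP API (`/api/v1/query`). The Python sketch below builds the query URL and extracts values from an instant-vector response. The function names and the sample payload are hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query = urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def vector_values(payload):
    """Return (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        return []
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        return []
    # Each result carries "value": [timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in data.get("result", [])]


print(instant_query_url("http://localhost:9090", 'emqx_connections_count{job="emqx"}'))

# Made-up response in the shape returned for an instant vector.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "emqx_connections_count", "job": "emqx"},
             "value": [1700000000.0, "120"]}
        ],
    },
}
print(vector_values(sample_response))
```

An empty result list here means Prometheus has no matching series, so the issue lies with scraping or the expression rather than with Grafana.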

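9. Best Practices and Optimization

9.1 Performance Recommendations

Metric collection:
- Disable VM-level collectors (vm_memory, vm_statistics) unless they are needed
- Filter high-cardinality labels (such as clientid) with an allowlist

Prometheus storage (these are command-line flags passed at startup, not prometheus.yml settings):

--storage.tsdb.retention.time=15d  # Keep 15 days of data
--storage.tsdb.wal-compression  # Compress the write-ahead log

Grafana queries:
- Lower the panel refresh frequency (30s or longer for non-critical panels)
- Use variables and templates to cut down on duplicate queries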

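9.2 Metrics Reference

Category	Key metric	Purpose
Connections	emqx_connections_count	Current number of connections
	emqx_client_connected	Successful client connections
	emqx_client_disconnected	Client disconnections
Messages	emqx_messages_received	Total messages received
	emqx_messages_sent	Total messages sent
	emqx_messages_dropped	Total messages dropped
System	emqx_node_cpu_usage	Node CPU usage
	emqx_node_memory_usage	Node memory usage
	emqx_node_uptime	Node uptime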

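10. Summary and Outlook

With the configuration in this article, you now have a complete monitoring and alerting stack for EMQX. Key takeaways:

- An open-source monitoring solution based on Prometheus + Grafana that keeps deployment costs low
- End-to-end monitoring that covers device connections, message flow, and system resources
- An extensible alert notification mechanism with multi-channel delivery

Possible directions for evolving the monitoring stack:

- Introducing a service mesh to govern monitoring traffic
- LLM-based anomaly detection and root cause analysis
- Correlating monitoring data with logs and distributed traces

Check the official EMQX documentation regularly for the latest monitoring features and best practices.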

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.