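📚Overview
Introduction
The RocketMQ website provides an example of monitoring RocketMQ. This article expands on that example and walks through it in practice.
Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter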
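📗Installing rocketmq-exporter
This article uses version 4.9.4 as an example. For other versions, change the version number accordingly and replace the package bundled with the scripts.
🧩rocketmq-exporter configuration
Steps: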
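🧾Download the source code and fix the bug
Related GitHub issue: BrokerRuntimeStats#loadTps NPE #131
The upstream rocketmq-exporter has a bug: in the constructor org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, change getTransferredTps to getTransferedTps. The broker metric is named getTransferedTps (with a single "r"), so a lookup with the other spelling presumably finds nothing, which is consistent with the NullPointerException in loadTps reported in the issue.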
📑Modify the configuration
✨pom.xml configuration
In pom.xml, change the RocketMQ version to the one your cluster runs.

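✨application.yml configuration
In application.yml, set namesrvAddr and the other settings to match your environment. The task execution schedules can be left unchanged or adjusted as needed.

- rocketmq.config.enableACL: set this to true if ACL verification is enabled on the RocketMQ cluster, and put the corresponding access key and secret key in accessKey and secretKey.
- rocketmq.config.outOfTimeSeconds: the expiry time for stored metrics and their values. If the cache entry for a key has not been written to within this time, it is deleted. 60s is usually enough. Choose the value based on the Prometheus scrape interval: the expiry time only needs to be greater than or equal to that interval.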
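The rule for outOfTimeSeconds can be checked before deploying. The sketch below is illustrative only: check_expiry is a hypothetical helper, not part of rocketmq-exporter, and it assumes both values are given as whole seconds (for example 60 for the expiry and 30 for a 30s scrape interval).

```shell
# Hypothetical helper: succeeds when the cache expiry is at least as long
# as the Prometheus scrape interval (both given in whole seconds).
check_expiry() {
    local expiry_seconds="$1" scrape_seconds="$2"
    if [ "$expiry_seconds" -ge "$scrape_seconds" ]; then
        echo "ok: expiry ${expiry_seconds}s >= scrape interval ${scrape_seconds}s"
    else
        echo "too short: expiry ${expiry_seconds}s < scrape interval ${scrape_seconds}s"
        return 1
    fi
}

# Example values: 60s expiry with a 30s scrape interval.
check_expiry 60 30
```

A non-zero return value makes the helper easy to use as a guard in a deployment script.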
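📑Package and start
Package
Build the project with Maven and use the resulting rocketmq-exporter-0.0.2-SNAPSHOT-exec.jar. The scripts in this article refer to the file as rocketmq-exporter.jar, so rename it accordingly.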

Start script
# rocketmq.config.namesrvAddr sets the nameserver address; separate multiple addresses with semicolons
nohup java -Xms512m -Xmx512m -jar rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/dev/null 2>&1 &
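When several nameserver addresses are passed, keep in mind that the shell itself treats a semicolon as a command separator, so the value has to be quoted on the command line. The sketch below shows one way to check the value first. The helper name validate_namesrv_addr and the second address 127.0.0.2:9876 are made up for illustration and are not part of rocketmq-exporter.

```shell
# Hypothetical helper (not part of rocketmq-exporter): returns 0 when the
# argument looks like "host:port" entries separated by semicolons.
validate_namesrv_addr() {
    local rest="$1" entry host port
    [ -n "$rest" ] || return 1
    while [ -n "$rest" ]; do
        entry="${rest%%;*}"              # text before the first semicolon
        if [ "$entry" = "$rest" ]; then
            rest=""                      # last entry, nothing left to parse
        else
            rest="${rest#*;}"            # drop the entry and its semicolon
        fi
        case "$entry" in *:*) ;; *) return 1 ;; esac
        host="${entry%:*}"
        port="${entry##*:}"
        [ -n "$host" ] || return 1
        case "$port" in ''|*[!0-9]*) return 1 ;; esac
    done
    return 0
}

# Quote the value so the shell does not split the command at ";".
NAMESRV_ADDR="127.0.0.1:9876;127.0.0.2:9876"
if validate_namesrv_addr "$NAMESRV_ADDR"; then
    echo "--rocketmq.config.namesrvAddr=$NAMESRV_ADDR"
fi
```

The echoed argument can then be appended, still quoted, to the java command shown above.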
Complete scripts

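🔊Note:
Shell environment variables such as JAVA_HOME cannot be used in the service file, so the install script checks at install time whether the JDK is installed and symlinks it to /usr/bin/java. The later scripts use that path directly.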
Install script (install.sh):
#!/bin/bash
# Installation directory
installDir="/opt/gdmp/exporter"
# Exporter name, also used as the service name
exporterName="rocketmq-exporter"
# Exporter package name
exporterPackageName="${exporterName}"
exporterPackageNameTar="${exporterPackageName}.jar"
# Exporter port
exporterPort="5557"
# Description
description="The default exposed port is ${exporterPort}. To change the configuration, edit the registered service file /etc/systemd/system/${exporterName}.service, then run systemctl daemon-reload && systemctl restart ${exporterName} to restart the ${exporterName} service"
# Only CentOS 7 is supported
if ! grep -E "7\.[0-9]" /etc/redhat-release &>/dev/null; then
    printf -- '\033[31m ERROR: only CentOS 7 is supported \033[0m\n'
    exit 1
fi
# Create the directory if it does not exist
function mkdirIfNotExist() {
    if [ ! -d "$1" ]; then
        echo "mkdir -p $1"
        mkdir -p "$1"
    fi
}
# Symlink java to /usr/bin/java unless that path already exists
if [ -n "$JAVA_HOME" ]; then
    if [ ! -e /usr/bin/java ]; then
        echo "ln -s $JAVA_HOME/bin/java /usr/bin/java"
        ln -s "$JAVA_HOME/bin/java" /usr/bin/java
    fi
else
    echo "JDK is not installed or JAVA_HOME is not set"
    exit 1
fi
# Create the installation directory
mkdirIfNotExist "${installDir}/${exporterName}"
# Copy the package
echo "/usr/bin/cp -rf ${exporterPackageNameTar} ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf "${exporterPackageNameTar}" "${installDir}/${exporterPackageName}/"
# Copy the start script
echo "/usr/bin/cp -rf start.sh ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf start.sh "${installDir}/${exporterPackageName}/"
# Copy the service file
echo "/usr/bin/cp -f ${exporterName}.service /etc/systemd/system/"
/usr/bin/cp -f "${exporterName}.service" /etc/systemd/system/
systemctl daemon-reload
systemctl enable ${exporterName}
systemctl start ${exporterName}
echo "Started the ${exporterName} client"
echo "Registered the ${exporterName} service daemon"
printf -- "\033[32m ${exporterName} status: \033[0m\n"
systemctl --type=service --state=active | grep ${exporterName}
printf -- "\033[32m exporter URL: http://127.0.0.1:${exporterPort}/metrics \033[0m\n"
echo "${description}"
Service file (rocketmq-exporter.service):
[Unit]
Description=https://github.com/apache/rocketmq-exporter
After=network-online.target
[Service]
ExecStart=/opt/gdmp/exporter/rocketmq-exporter/start.sh
#ExecStart=/usr/bin/java -jar -Xms1G -Xmx1G /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/data/rocketmq/rocketmq-exporter/exporter.log 2>&1
Restart=always
RestartSec=5
StartLimitInterval=0
StartLimitBurst=10
StandardOutput=append:/data/rocketmq/rocketmq-exporter/startup.log
StandardError=append:/data/rocketmq/rocketmq-exporter/error.log
[Install]
WantedBy=multi-user.target
Start script (start.sh):
#!/bin/bash
# Prefer the JDK from JAVA_HOME; otherwise fall back to /usr/bin/java
if [ -n "$JAVA_HOME" ]; then
    JAVA="$JAVA_HOME/bin/java"
else
    JAVA='/usr/bin/java'
fi
echo "$JAVA"
# rocketmq.config.namesrvAddr sets the nameserver address; separate multiple addresses with semicolons
"$JAVA" -Xms1G -Xmx1G -jar /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 2>&1
Uninstall script:
#!/bin/bash
# Installation directory
installDir="/opt/gdmp/exporter"
# Exporter name
exporterName="rocketmq-exporter"
echo "systemctl stop ${exporterName}"
systemctl stop ${exporterName}
systemctl disable ${exporterName}
# Remove the installed files
echo "rm -rf ${installDir}/${exporterName}"
rm -rf "${installDir}/${exporterName}"
# Remove the service file
echo "rm -rf /etc/systemd/system/${exporterName}.service"
rm -rf "/etc/systemd/system/${exporterName}.service"
# Reload systemd after the unit file has been removed
systemctl daemon-reload
printf -- "\033[32m Uninstall complete \033[0m\n"
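Installation package:
Link: https://pan.baidu.com/s/1f9nMH1oSxyr8azUepu-Q1g
Extraction code: gcjk
🧫Installation
Run the install.sh script directly.

Access URL: http://127.0.0.1:5557/metrics (the address printed by install.sh)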
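A quick way to confirm that the exporter is serving data is to count the RocketMQ samples in the response. The sketch below is only an illustration: count_rocketmq_samples is a hypothetical helper, and it assumes the endpoint returns the usual Prometheus text format, in which every sample line starts with the metric name.

```shell
# Hypothetical helper: counts sample lines whose metric name starts with
# "rocketmq_" in Prometheus text-format output read from stdin.
# "|| true" keeps the exit status at 0 when the count is 0.
count_rocketmq_samples() {
    grep -c '^rocketmq_' || true
}

# Usage against a running exporter (not executed here):
#   curl -s http://127.0.0.1:5557/metrics | count_rocketmq_samples
```

A count of 0 right after startup may only mean that the first collection task has not run yet, so it is worth retrying after a short wait.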

🧾Log path
# View the log
tail -f ~/logs/exporterlogs/rocketmq-exporter.log

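🔖Problem log
🔊Note:
- The upstream rocketmq-exporter has a bug: in org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, change getTransferredTps to getTransferedTps. The stack trace below shows the error seen with the unpatched version.
- If your RocketMQ version differs, change the version in rocketmq-exporter accordingly. This involves the pom.xml and application.yml files.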
java.lang.NullPointerException: null
at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.loadTps(BrokerRuntimeStats.java:149)
at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.<init>(BrokerRuntimeStats.java:94)
at org.apache.rocketmq.exporter.task.MetricsCollectTask.collectBrokerRuntimeStats(MetricsCollectTask.java:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

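📘How it works
Rocketmq-exporter is a system for monitoring all relevant metrics of the RocketMQ broker and clients. It fetches metric values from the broker through mqAdmin and wraps them into 87 caches.
🔊Warning
Earlier versions used 87 ConcurrentHashMap instances. Because a Map never removes expired metrics, every label change produced a new metric while the old, useless one could not be removed automatically, which over time led to out-of-memory errors. A Cache structure supports expiry-based eviction, and the expiry time is configurable.
This problem is described on the RocketMQ website, and it is also something to watch for when writing your own exporter. Rocketmq-exporter is an important reference for developing one.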
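The expire-after-write idea can be illustrated with a toy example. This is not the exporter's Java code: evict_expired is a hypothetical helper, and the input format (one "key last_write_time" pair per line, times in epoch seconds) is invented for the sketch.

```shell
# Toy illustration of expire-after-write (not the exporter's Java code).
# Reads "key last_write_epoch_seconds" lines from stdin and prints only the
# keys written within the last TTL seconds; older keys are dropped.
# Usage: evict_expired NOW_EPOCH TTL_SECONDS
evict_expired() {
    awk -v now="$1" -v ttl="$2" '(now - $2) <= ttl { print $1 }'
}

# Example: with now=1000 and ttl=60, topicB (last written at t=900) is dropped.
printf 'topicA 990\ntopicB 900\n' | evict_expired 1000 60
# prints: topicA
```

A plain map corresponds to never running this eviction step, which is why stale label combinations accumulate.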
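Rocketmq-exporter collects metrics as follows: the exporter requests data from the MQ cluster through MQAdminExt, MetricService normalizes the returned data into the format Prometheus expects, and the /metrics endpoint exposes it to Prometheus.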

🗞️Metric structure

For details, refer to the official documentation; they are not repeated here. Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter#metric-%E7%BB%93%E6%9E%84

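🧩Prometheus configuration
🌵Configure and start Prometheus as described on its official site
In the Prometheus configuration, set targets under static_configs to the IP and port the exporter listens on, for example localhost:5557.
- job_name: 'rocketmq'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.0.107.158:5557']
      labels:
        instance: 'monitoring(10.0.107.158:5557)'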
☘️Grafana dashboard
The dashboard below is a modified version of the one provided on the official site.
Rocketmq_dashboard.json

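📙Metrics
💻Server-side metrics
| Metric name | Meaning | Broker metric name |
|---|---|---|
| rocketmq_broker_tps | Broker-level production TPS | |
| rocketmq_broker_qps | Broker-level consumption QPS | |
| rocketmq_broker_commitlog_diff | Size of the messages by which the slave nodes of a broker group lag behind in replication | |
| rocketmq_brokeruntime_pmdt_0ms | Time from when the server starts processing a write request until the write completes (0 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_0to10ms | Time from when the server starts processing a write request until the write completes (0~10 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_10to50ms | Time from when the server starts processing a write request until the write completes (10~50 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_50to100ms | Time from when the server starts processing a write request until the write completes (50~100 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_100to200ms | Time from when the server starts processing a write request until the write completes (100~200 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_200to500ms | Time from when the server starts processing a write request until the write completes (200~500 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_500to1s | Time from when the server starts processing a write request until the write completes (500~1000 ms) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_1to2s | Time from when the server starts processing a write request until the write completes (1~2 s) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_2to3s | Time from when the server starts processing a write request until the write completes (2~3 s) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_3to4s | Time from when the server starts processing a write request until the write completes (3~4 s) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_4to5s | Time from when the server starts processing a write request until the write completes (4~5 s) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_5to10s | Time from when the server starts processing a write request until the write completes (5~10 s) | putMessageDistributeTime |
| rocketmq_brokeruntime_pmdt_10stomore | Time from when the server starts processing a write request until the write completes (> 10 s) | putMessageDistributeTime |
| rocketmq_brokeruntime_dispatch_behind_bytes | Bytes of messages not yet dispatched (operations such as index building) so far | dispatchBehindBytes |
| rocketmq_brokeruntime_put_message_size_total | Total size of the messages written by the broker | putMessageSizeTotal |
| rocketmq_brokeruntime_put_message_average_size | Average size of the messages written by the broker | putMessageAverageSize |
| rocketmq_brokeruntime_remain_transientstore_buffer_numbs | Capacity of the queue in TransientStorePool | remainTransientStoreBufferNumbs |
| rocketmq_brokeruntime_earliest_message_timestamp | Timestamp of the earliest message stored on the broker | earliestMessageTimeStamp |
| rocketmq_brokeruntime_putmessage_entire_time_max | Maximum message write latency since the broker started running | putMessageEntireTimeMax |
| rocketmq_brokeruntime_start_accept_sendrequest_time | Time at which the broker started accepting send requests | startAcceptSendRequestTimeStamp |
| rocketmq_brokeruntime_putmessage_times_total | Total number of message writes on the broker | putMessageTimesTotal |
| rocketmq_brokeruntime_getmessage_entire_time_max | Maximum latency of handling a message pull since the broker started | getMessageEntireTimeMax |
| rocketmq_brokeruntime_pagecache_lock_time_mills | | pageCacheLockTimeMills |
| rocketmq_brokeruntime_commitlog_disk_ratio | Usage ratio of the disk that holds the commitLog | commitLogDiskRatio |
| rocketmq_brokeruntime_dispatch_maxbuffer | Not computed by the broker; always 0 | dispatchMaxBuffer |
| rocketmq_brokeruntime_pull_threadpoolqueue_capacity | Capacity of the thread pool queue that handles pull requests | pullThreadPoolQueueCapacity |
| rocketmq_brokeruntime_send_threadpoolqueue_capacity | Capacity of the thread pool queue that handles send requests | sendThreadPoolQueueCapacity |
| rocketmq_brokeruntime_query_threadpool_queue_capacity | Capacity of the thread pool queue that handles query requests | queryThreadPoolQueueCapacity |
| rocketmq_brokeruntime_pull_threadpoolqueue_size | Actual size of the thread pool queue that handles pull requests | pullThreadPoolQueueSize |
| rocketmq_brokeruntime_query_threadpoolqueue_size | Actual size of the thread pool queue that handles query requests | queryThreadPoolQueueSize |
| rocketmq_brokeruntime_send_threadpool_queue_size | Actual size of the thread pool queue that handles send requests | sendThreadPoolQueueSize |
| rocketmq_brokeruntime_pull_threadpoolqueue_headwait_timemills | Wait time of the task at the head of the thread pool queue that handles pull requests | pullThreadPoolQueueHeadWaitTimeMills |
| rocketmq_brokeruntime_query_threadpoolqueue_headwait_timemills | Wait time of the task at the head of the thread pool queue that handles query requests | queryThreadPoolQueueHeadWaitTimeMills |
| rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills | Wait time of the task at the head of the thread pool queue that handles send requests | sendThreadPoolQueueHeadWaitTimeMills |
| rocketmq_brokeruntime_msg_gettotal_yesterdaymorning | Total number of message reads as of 00:00 yesterday | msgGetTotalYesterdayMorning |
| rocketmq_brokeruntime_msg_puttotal_yesterdaymorning | Total number of message writes as of 00:00 yesterday | msgPutTotalYesterdayMorning |
| rocketmq_brokeruntime_msg_gettotal_todaymorning | Total number of message reads as of 00:00 today | msgGetTotalTodayMorning |
| rocketmq_brokeruntime_msg_puttotal_todaymorning | Total number of message writes as of 00:00 today | msgPutTotalTodayMorning |
| rocketmq_brokeruntime_msg_put_total_today_now | Number of message writes on each broker up to now | msgPutTotalTodayNow |
| rocketmq_brokeruntime_msg_gettotal_today_now | Number of message reads on each broker up to now | msgGetTotalTodayNow |
| rocketmq_brokeruntime_commitlogdir_capacity_free | Free space in the directory that holds the commitLog | commitLogDirCapacity |
| rocketmq_brokeruntime_commitlogdir_capacity_total | Total space of the directory that holds the commitLog | commitLogDirCapacity |
| rocketmq_brokeruntime_commitlog_maxoffset | Maximum offset of the commitLog | commitLogMaxOffset |
| rocketmq_brokeruntime_commitlog_minoffset | Minimum offset of the commitLog | commitLogMinOffset |
| rocketmq_brokeruntime_remain_howmanydata_toflush | | remainHowManyDataToFlush |
| rocketmq_brokeruntime_getfound_tps600 | Average TPS over 600 s of getMessage calls that found messages | getFoundTps |
| rocketmq_brokeruntime_getfound_tps60 | Average TPS over 60 s of getMessage calls that found messages | getFoundTps |
| rocketmq_brokeruntime_getfound_tps10 | Average TPS over 10 s of getMessage calls that found messages | getFoundTps |
| rocketmq_brokeruntime_gettotal_tps600 | Average TPS over 600 s of getMessage calls | getTotalTps |
| rocketmq_brokeruntime_gettotal_tps60 | Average TPS over 60 s of getMessage calls | getTotalTps |
| rocketmq_brokeruntime_gettotal_tps10 | Average TPS over 10 s of getMessage calls | getTotalTps |
| rocketmq_brokeruntime_gettransfered_tps600 | | getTransferedTps |
| rocketmq_brokeruntime_gettransfered_tps60 | | getTransferedTps |
| rocketmq_brokeruntime_gettransfered_tps10 | | getTransferedTps |
| rocketmq_brokeruntime_getmiss_tps600 | Average TPS over 600 s of getMessage calls that found no messages | getMissTps |
| rocketmq_brokeruntime_getmiss_tps60 | Average TPS over 60 s of getMessage calls that found no messages | getMissTps |
| rocketmq_brokeruntime_getmiss_tps10 | Average TPS over 10 s of getMessage calls that found no messages | getMissTps |
| rocketmq_brokeruntime_put_tps600 | Average TPS over 600 s of message writes | putTps |
| rocketmq_brokeruntime_put_tps60 | Average TPS over 60 s of message writes | putTps |
| rocketmq_brokeruntime_put_tps10 | Average TPS over 10 s of message writes | putTps |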
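💻Producer-side metrics
| Metric name | Meaning |
|---|---|
| rocketmq_producer_offset | Current maximum offset of the topic |
| rocketmq_topic_retry_offset | Current maximum offset of the retry topic |
| rocketmq_topic_dlq_offset | Current maximum offset of the dead-letter topic |
| rocketmq_producer_tps | Production TPS of the topic on one broker group |
| rocketmq_producer_message_size | Produced message size per second for the topic on one broker group |
| rocketmq_queue_producer_tps | Queue-level production TPS |
| rocketmq_queue_producer_message_size | Queue-level produced message size per second |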
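💻Consumer-side metrics
| Metric name | Meaning |
|---|---|
| rocketmq_group_diff | Number of messages backlogged for the consumer group |
| rocketmq_group_retrydiff | Number of messages backlogged in the consumer group's retry queue |
| rocketmq_group_dlqdiff | Number of messages backlogged in the consumer group's dead-letter queue |
| rocketmq_group_count | Number of consumers in the consumer group |
| rocketmq_client_consume_fail_msg_count | Number of failed consumptions by the consumer in the past hour |
| rocketmq_client_consume_fail_msg_tps | TPS of failed consumption by the consumer |
| rocketmq_client_consume_ok_msg_tps | TPS of successful consumption by the consumer |
| rocketmq_client_consume_rt | Time from when a message is pulled until it is consumed |
| rocketmq_client_consumer_pull_rt | Time the client takes to pull messages |
| rocketmq_client_consumer_pull_tps | TPS of message pulls by the client |
| rocketmq_consumer_tps | Consumption TPS of the subscription group on each broker group |
| rocketmq_group_consume_tps | Current consumption TPS of the subscription group (rocketmq_consumer_tps aggregated by broker) |
| rocketmq_consumer_offset | Current consumption offset of the subscription group on one broker group |
| rocketmq_group_consume_total_offset | Current consumption offset of the subscription group (rocketmq_consumer_offset aggregated by broker) |
| rocketmq_consumer_message_size | Consumed message size per second for the subscription group on one broker group |
| rocketmq_send_back_nums | Number of times the subscription group failed to consume on one broker group and wrote a retry message |
| rocketmq_group_get_latency_by_storetime | Consumption delay of the consumer group: the exporter gets the message and subtracts its time from the current time |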
🧱Selected monitoring metrics
| Metric | PromQL |
|---|---|
| Production message TPS | sum by (broker,topic) (rocketmq_producer_tps{instance="$instance",broker=~"$broker"}) |
| Consumption message TPS | sum by (broker) (rocketmq_consumer_tps{instance="$instance",broker=~"$broker"}) |
| Message backlog | sum(rocketmq_producer_offset{instance="$instance"}) by (topic) - on(topic) group_right sum(rocketmq_consumer_offset{instance="$instance"}) by (group,topic) |
| Maximum disk usage (%) | max(rocketmq_brokeruntime_commitlog_disk_ratio{instance="$instance"}) * 100 |
| Consumer group consumption delay | sum by (group) (rocketmq_group_get_latency_by_storetime{instance="$instance"}) |
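The message backlog expression subtracts the consumer offsets, summed per group and topic, from the producer offsets, summed per topic. The sketch below reproduces that arithmetic on plain text so the result can be checked by hand. It is not how Prometheus evaluates PromQL. The helper name backlog_by_group, the input format ("producer TOPIC OFFSET" and "consumer GROUP TOPIC OFFSET" lines) and the sample values are all invented for illustration.

```shell
# Hypothetical helper: prints "GROUP TOPIC BACKLOG" for every consumer
# group/topic pair whose topic also has producer offsets.
backlog_by_group() {
    awk '
        $1 == "producer" { produced[$2] += $3 }      # sum per topic
        $1 == "consumer" {
            key = $2 " " $3                          # group + topic
            if (!(key in consumed)) order[++n] = key # remember input order
            consumed[key] += $4                      # sum per group/topic
            topic_of[key] = $3
        }
        END {
            for (i = 1; i <= n; i++) {
                key = order[i]
                topic = topic_of[key]
                # Pairs without a matching producer topic are skipped.
                if (topic in produced)
                    printf "%s %.0f\n", key, produced[topic] - consumed[key]
            }
        }
    '
}

# Sample data: topic "orders" is written on two brokers (100 + 50).
printf 'producer orders 100\nproducer orders 50\nconsumer g1 orders 120\nconsumer g2 orders 150\n' | backlog_by_group
# prints:
#   g1 orders 30
#   g2 orders 0
```

Each consumer group is compared against the same per-topic producer total, which is why a single topic can show a different backlog for each group.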
🧶Alert rule examples
Define the specific rules according to your own needs.
groups:
- name: 'RocketMQ abnormal'
  rules:
  - alert: 'Production message TPS'
    expr: sum by (instance) (rocketmq_producer_tps{instance="10.0.107.158:5557"}/60) >= 50
    for: 1m
    labels:
      severity: '4'
    annotations:
      description: 'Production message TPS of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} messages/s, please handle it promptly!'
      currentValue: '{{ $value | printf "%.2f" }} messages/s'
      thresholdValue: 'Production message TPS ≥ 50 messages/s'
  - alert: 'Consumption message TPS'
    expr: sum by (instance) (rocketmq_consumer_tps{instance="10.0.107.158:5557"}/60) >= 50
    for: 5m
    labels:
      severity: '4'
    annotations:
      description: 'Consumption message TPS of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} messages/s, please handle it promptly!'
      currentValue: '{{ $value | printf "%.2f" }} messages/s'
      thresholdValue: 'Consumption message TPS ≥ 50 messages/s'
  - alert: 'Message backlog'
    expr: sum by (instance) (sum(rocketmq_producer_offset{instance="10.0.107.158:5557"}) by (topic,gdmpId) - on(topic,gdmpId) group_right sum(rocketmq_consumer_offset{instance="10.0.107.158:5557"}) by (group,topic,gdmpId)) >= 100
    for: 5m
    labels:
      severity: '4'
    annotations:
      description: 'Message backlog of {{ $labels.gdmpName }} is currently {{ $value }} messages, please handle it promptly!'
      currentValue: '{{ $value }} messages'
      thresholdValue: 'Message backlog ≥ 100 messages'
  - alert: 'Maximum disk usage'
    expr: max by (instance)(rocketmq_brokeruntime_commitlog_disk_ratio{instance="10.0.107.158:5557"}) * 100 >= 80
    for: 5m
    labels:
      severity: '4'
    annotations:
      description: 'Maximum disk usage of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }}%, please handle it promptly!'
      currentValue: '{{ $value | printf "%.2f" }}%'
      thresholdValue: 'Maximum disk usage ≥ 80%'
  - alert: 'Maximum consumption delay'
    expr: max by (instance)(rocketmq_group_get_latency_by_storetime{instance="10.0.107.158:5557"}) / 1000 >= 50
    for: 5m
    labels:
      severity: '4'
    annotations:
      description: 'Maximum consumption delay of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} s, please handle it promptly!'
      currentValue: '{{ $value | printf "%.2f" }} s'
      thresholdValue: 'Maximum consumption delay ≥ 50 s'

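This article covered how to install and configure rocketmq-exporter to monitor RocketMQ: fixing the upstream bug, modifying the configuration files, packaging and starting the exporter, and using Prometheus and Grafana to display and analyze the metrics. It also provided installation scripts and a record of problems you may run into, to help readers understand how RocketMQ monitoring works and which metrics matter.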