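Apache RocketMQ Container Health Checks: Custom Scripts and Probes
1. Why Port Checks Fall Short in Containerized Deployments
When Apache RocketMQ runs on Kubernetes, a plain TCP port probe (for example, checking port 9876) only shows that a socket is listening; it does not show that the service can do useful work. Typical production failures that such a probe misses:
- The port is listening, but the Namesrv routing table has not been initialized
- A Broker storage fault causes message reads and writes to fail
- A split brain in the Controller cluster breaks leader election
This article presents three kinds of probe implementations, from basic to advanced, so that a passing check reflects real availability.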
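2. Basic Health Checks: HTTP Endpoints
2.1 Namesrv Health Endpoint
// NamesrvController.java: health check endpoint added for illustration.
// Assumes an HTTP layer (Spring MVC annotations here) has been added to the process;
// stock Namesrv serves its remoting protocol on port 9876.
@GetMapping("/health")
public ResponseEntity<Map<String, Object>> healthCheck() {
    Map<String, Object> status = new HashMap<>();
    // Note: always reports RUNNING. This proves the process answers HTTP,
    // not that the routing table is loaded.
    status.put("namesrvStatus", "RUNNING");
    status.put("timestamp", System.currentTimeMillis());
    status.put("version", VersionUtils.getVersion());
    return ResponseEntity.ok(status);
}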
2.2 Exposing Broker Health Status
// BrokerController.java: collects health indicators (illustrative)
public HealthStatus getHealthStatus() {
    HealthStatus status = new HealthStatus();
    status.setStoreStatus(storeService.getStoreCheckStatus());
    status.setCommitLogMaxOffset(storeService.getMaxPhyOffset());
    status.setConsumerOffsetLag(statisticsService.calculateOffsetLag());
    // Report the DLedger role (for example LEADER or FOLLOWER), not the node id
    status.setDLedgerRole(dLedgerServer == null ? "STANDALONE" : dLedgerServer.getMemberState().getRole().name());
    return status;
}
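The fields collected above only become a probe result once something maps them to pass or fail. The following is a minimal sketch of that mapping, assuming the probe has already extracted the store status and consumer lag as plain values. The function name `evaluate_broker_health` and the lag limit are hypothetical and not part of RocketMQ.

```bash
#!/bin/bash
# Hypothetical helper: turns already-extracted health fields into a probe verdict.
# Arguments: <store_ok: true|false> <consumer_lag> <max_lag>
evaluate_broker_health() {
    local store_ok="$1" lag="$2" max_lag="$3"

    if [ "$store_ok" != "true" ]; then
        echo "UNHEALTHY: store check failed"
        return 1
    fi
    # Reject non-numeric lag values instead of letting the comparison error out.
    case "$lag" in
        ''|*[!0-9]*) echo "UNHEALTHY: lag is not a number: '$lag'"; return 1 ;;
    esac
    if [ "$lag" -gt "$max_lag" ]; then
        echo "UNHEALTHY: consumer lag $lag exceeds $max_lag"
        return 1
    fi
    echo "HEALTHY"
}
```

Returning a non-zero status together with a one-line reason keeps the helper usable both as an exec probe and as a log source when a probe fails.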
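3. Advanced Probes: Custom Shell Scripts
3.1 Multi-Dimensional Check Script (rocketmq-health.sh)
#!/bin/bash
# Requires nc, curl and jq in the container image.
# The HTTP endpoints below are the custom ones assumed in this article.
set -e
# 1. Basic network check
if ! nc -z localhost 9876; then
    echo "Namesrv port 9876 not listening"
    exit 1
fi
# 2. Metadata integrity check (jq -r strips the JSON quotes so the comparison can match)
ROUTE_INFO=$(curl -sf --max-time 3 http://localhost:8080/namesrv/route) || { echo "Route endpoint unreachable"; exit 1; }
if [ "$(echo "$ROUTE_INFO" | jq -r '.status')" != "success" ]; then
    echo "Route info unavailable: $ROUTE_INFO"
    exit 1
fi
# 3. Storage layer health check
STORE_STATUS=$(curl -sf --max-time 3 http://localhost:8081/broker/store/status) || { echo "Store endpoint unreachable"; exit 1; }
if [ "$(echo "$STORE_STATUS" | jq -r '.diskAvailable')" = "false" ]; then
    echo "Disk space insufficient: $STORE_STATUS"
    exit 1
fi
# 4. Controller cluster status check
CONTROLLER_STATUS=$(curl -sf --max-time 3 http://localhost:9878/controller/status) || { echo "Controller endpoint unreachable"; exit 1; }
if [ "$(echo "$CONTROLLER_STATUS" | jq -r '.leaderAvailable')" != "true" ]; then
    echo "No active controller leader"
    exit 1
fi
echo "Health check passed"
exit 0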
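3.2 Script Permissions and Container Integration
# Dockerfile: health check configuration
# The image must also provide nc, curl and jq, which the script calls.
FROM apache/rocketmq:5.1.4
COPY rocketmq-health.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/rocketmq-health.sh
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD /usr/local/bin/rocketmq-health.sh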
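4. Kubernetes Probe Configuration
4.1 Example Deployment Manifest
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rocketmq-broker
spec:
  serviceName: rocketmq-broker   # assumes a headless Service with this name
  selector:
    matchLabels:
      app: rocketmq-broker
  template:
    metadata:
      labels:
        app: rocketmq-broker
    spec:
      containers:
        - name: broker
          image: custom-rocketmq:5.1.4
          ports:
            - containerPort: 10911
          livenessProbe:
            # The script also checks localhost ports 9876, 8080 and 9878,
            # so it only passes where those listen in the same pod.
            exec:
              command: ["/usr/local/bin/rocketmq-health.sh"]
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /broker/health
              port: 8081
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 2
          startupProbe:
            tcpSocket:
              port: 10911
            failureThreshold: 30
            periodSeconds: 10
            timeoutSeconds: 5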
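4.2 Probe Parameter Tuning Matrix
| Probe type | Initial delay (s) | Period (s) | Timeout (s) | Failure threshold | Use case |
|---|---|---|---|---|---|
| Startup probe | 0 | 10 | 5 | 30 | Brokers with a slow cold start |
| Liveness probe | 120 | 30 | 10 | 3 | Runtime health checking |
| Readiness probe | 60 | 10 | 5 | 2 | Controlling traffic admission |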
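The matrix values translate into time budgets: a startup probe tolerates roughly period × failure threshold seconds of startup, and a liveness or readiness probe needs about the same product of consecutive failures before it acts. The sketch below works through that arithmetic for the three rows. The helper names are hypothetical, and per-attempt timeouts are ignored, so the results are estimates.

```bash
#!/bin/bash
# Upper bound (seconds) a container gets to start before the startup probe gives up.
startup_budget() {
    local initial_delay="$1" period="$2" failure_threshold="$3"
    echo $(( initial_delay + period * failure_threshold ))
}

# Approximate time (seconds) for consecutive failures to trip a probe once a fault begins.
# Ignores per-attempt timeouts, so treat it as a rough estimate.
detection_window() {
    local period="$1" failure_threshold="$2"
    echo $(( period * failure_threshold ))
}

echo "startup budget:     $(startup_budget 0 10 30)s"
echo "liveness detection: $(detection_window 30 3)s"
echo "readiness removal:  $(detection_window 10 2)s"
```

With the table's values this prints 300s, 90s and 20s: a Broker has about five minutes to start, a hung Broker is restarted after about a minute and a half, and an unready one leaves the load-balancing pool after about twenty seconds.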
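5. Distributed Health Check Architecture
5.1 Flowchart of the Coordinated Multi-Component Check
5.2 Controller Cluster Health Check Implementation
// Core logic of the Controller health check (illustrative)
public boolean isControllerClusterHealthy() {
    // 1. Check the quorum state.
    //    Note: as written, every non-leader node reports unhealthy.
    if (!quorumPeer.getSelf().isLeader()) {
        return false;
    }
    // 2. Verify that Broker heartbeats are still arriving
    long lastHeartbeat = brokerManager.getLastHeartbeatTimestamp(brokerId);
    if (System.currentTimeMillis() - lastHeartbeat > HEARTBEAT_TIMEOUT) {
        log.warn("Broker heartbeat timeout: {}", brokerId);
        return false;
    }
    // 3. Check the state of the persistent metadata store
    if (metadataStore.getStoreFileStatus() != StoreStatus.NORMAL) {
        return false;
    }
    return true;
}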
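The heartbeat test in step 2 is a plain timestamp comparison, so the same rule can be reused in a probe script. The sketch below assumes the last heartbeat time is available to the script as epoch milliseconds; how it is obtained is not covered here, and the function name is hypothetical.

```bash
#!/bin/bash
# Returns 0 (healthy) if the last heartbeat is within timeout_ms of now_ms.
# last_ms and now_ms are epoch milliseconds; timeout_ms is a duration.
heartbeat_is_fresh() {
    local last_ms="$1" now_ms="$2" timeout_ms="$3"
    [ $(( now_ms - last_ms )) -le "$timeout_ms" ]
}

# Example with fixed timestamps so the result does not depend on the clock:
# the heartbeat is 4000 ms old and the timeout is 5000 ms.
if heartbeat_is_fresh 1700000000000 1700000004000 5000; then
    echo "heartbeat ok"
fi
```

As in the Java version, a heartbeat whose age equals the timeout still counts as fresh; only a strictly larger age fails.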
6. Production Best Practices
6.1 Health Check Threshold Configuration
# broker.conf: health check thresholds
healthCheck.diskUsageThreshold=85
healthCheck.msgBacklogThreshold=100000
healthCheck.replicationLagThreshold=5000
healthCheck.controllerPingInterval=5000
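A probe script can read these thresholds from the same file rather than hard-coding them. The following is a minimal sketch, assuming the keys are stored as plain `key=value` lines with no surrounding whitespace. The helper names are hypothetical, and the disk check relies on the POSIX `df -P` output format.

```bash
#!/bin/bash
# Reads one key from a key=value properties file; prints the default if the key is absent.
read_threshold() {
    local file="$1" key="$2" default="$3" k v
    # The extra test keeps a final line without a trailing newline from being skipped.
    while IFS='=' read -r k v || [ -n "$k" ]; do
        if [ "$k" = "$key" ]; then
            echo "$v"
            return 0
        fi
    done < "$file"
    echo "$default"
}

# Prints the usage percentage (integer, no % sign) of the filesystem holding a path.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Returns 0 if disk usage for <path> is at or below <threshold> percent.
disk_within_threshold() {
    local path="$1" threshold="$2" used
    used=$(disk_usage_percent "$path")
    [ "$used" -le "$threshold" ]
}
```

A script would call `limit=$(read_threshold broker.conf healthCheck.diskUsageThreshold 85)` and then `disk_within_threshold "$STORE_DIR" "$limit"`, where `STORE_DIR` stands for wherever the Broker store is mounted in the image.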
6.2 Monitoring and Alerting Integration
# Prometheus scrape configuration
scrape_configs:
  - job_name: 'rocketmq-health'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['namesrv:8080', 'broker:8081']
Alert rule for the health metrics:
groups:
  - name: rocketmq_health
    rules:
      - alert: HighDiskUsage
        expr: rocketmq_disk_usage_percent > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Broker disk usage is too high"
          description: "Disk usage is {{ $value }}%, above the 85% threshold"
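The alert rule can only fire if something exposes `rocketmq_disk_usage_percent` on the scraped `/metrics` path. As a sketch of what that output looks like, the function below prints one gauge in the Prometheus text exposition format. The label name `broker` and the value `broker-a` are assumptions made for illustration, and label values are not escaped, so they must not contain quotes, backslashes or newlines.

```bash
#!/bin/bash
# Prints one gauge in the Prometheus text exposition format.
# Arguments: <metric_name> <help_text> <label_name> <label_value> <value>
# Limitation: the label value is written as-is, without escaping.
emit_gauge() {
    local name="$1" help="$2" label="$3" label_value="$4" value="$5"
    printf '# HELP %s %s\n' "$name" "$help"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s{%s="%s"} %s\n' "$name" "$label" "$label_value" "$value"
}

emit_gauge rocketmq_disk_usage_percent "Disk usage of the broker store volume" broker broker-a 42
```

The last printed line is `rocketmq_disk_usage_percent{broker="broker-a"} 42`, which is the sample the `HighDiskUsage` expression compares against 85.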
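7. Troubleshooting and Solutions
7.1 Common Health Check Failures
| Symptom | Root cause | Solution |
|---|---|---|
| Health script times out | Slow DNS resolution of the Namesrv address | Add a local hosts entry for the name the script resolves |
| False alarms from the storage check | Transient disk I/O spikes | Confirm a failure over three samples before reporting it |
| Controller split brain | A network partition drops heartbeats | Add quorum vote verification |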
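The three-sample confirmation from the table can be written as a small wrapper around any check command. This is a sketch under the assumption that a single passing sample is enough to treat the fault as transient; the function name is hypothetical.

```bash
#!/bin/bash
# Runs a check up to <samples> times and reports failure only if every sample fails.
# Usage: check_with_confirmation <samples> <pause_seconds> <command> [args...]
# Returns 0 as soon as one sample passes, 1 if all samples fail.
check_with_confirmation() {
    local samples="$1" pause="$2" i
    shift 2
    for (( i = 1; i <= samples; i++ )); do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final sample.
        if (( i < samples )); then
            sleep "$pause"
        fi
    done
    return 1
}
```

For example, `check_with_confirmation 3 1 store_check` would call a `store_check` function up to three times, one second apart. The total run time must stay below the probe timeout, so the pause and sample count need to be chosen with that limit in mind.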
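7.2 Script Debugging Tips
# Run with shell tracing enabled
bash -x /usr/local/bin/rocketmq-health.sh
# Test the storage check on its own
# (requires --module/--verbose handling, which the script in section 3.1 does not implement)
./rocketmq-health.sh --module=store --verbose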
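The script in section 3.1 does not parse any options, so `--module=store --verbose` needs some argument handling first. One possible sketch is shown below; the module names (`network`, `route`, `store`, `controller`) and both function names are assumptions chosen to match the four numbered checks in that script.

```bash
#!/bin/bash
# Hypothetical option parser for rocketmq-health.sh.
# Sets MODULE (default "all") and VERBOSE (default 0) from --module=<name> and --verbose.
parse_health_args() {
    MODULE="all"
    VERBOSE=0
    local arg
    for arg in "$@"; do
        case "$arg" in
            --module=*) MODULE="${arg#--module=}" ;;
            --verbose)  VERBOSE=1 ;;
            *) echo "Unknown option: $arg" >&2; return 2 ;;
        esac
    done
    case "$MODULE" in
        all|network|route|store|controller) ;;
        *) echo "Unknown module: $MODULE" >&2; return 2 ;;
    esac
}

# Returns 0 if the named check should run for the current MODULE selection.
should_run() {
    [ "$MODULE" = "all" ] || [ "$MODULE" = "$1" ]
}
```

In the script, `parse_health_args "$@" || exit 2` would run first, and each numbered check would then be wrapped in `if should_run store; then ... fi` with the matching module name.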
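8. Future Directions
- Adaptive health checks: adjust the check frequency dynamically using machine learning
- Predictive health management: anticipate faults from metric trends
- Distributed consensus checks: verify cluster health cooperatively across nodes
The RocketMQ community plans to build Prometheus health metric exposure into version 5.2.0 and to provide an official Helm Chart that integrates the probe schemes above.
Appendix: Complete Health Check Script
For the complete script, see the healthcheck/ directory in the official RocketMQ containerized deployment toolkit.
Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.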



