Integrating CMAK with Azure Monitor: Centralized Log and Metric Management
1. Introduction: Kafka Monitoring Pain Points and a Solution
Are you struggling with Kafka cluster logs scattered across many servers? Have you missed a troubleshooting window because there was no real-time alerting on key metrics? CMAK (Cluster Management for Apache Kafka), the Kafka cluster management tool open-sourced by Yahoo, provides rich cluster management features, but it has no native integration with Azure Monitor. This article shows how to integrate CMAK with Azure Monitor through log forwarding and metric export, building a unified monitoring platform.
After reading this article, you will be able to:
- Ship CMAK logs to Azure Log Analytics
- Bridge JMX metrics into Azure Monitor
- Create custom dashboards and alert rules
- Resolve common integration issues
2. Architecture: Data Flow and Component Interaction
2.1 Integration Architecture
(Architecture diagram not reproduced here.) The integration has two data paths: CMAK application logs flow through Filebeat and Logstash into Azure Log Analytics, while JMX metrics collected from the Kafka brokers are pushed to Azure Monitor by a custom exporter.
2.2 Core Components
| Component | Role | Implementation |
|---|---|---|
| CMAK JMX client | Collects Kafka broker metrics | KafkaJMX.scala |
| Logback logging framework | Produces application logs | logback.xml configuration |
| Filebeat | Forwards logs toward Azure | Lightweight log shipper |
| Azure Log Analytics | Log storage and querying | Kusto Query Language (KQL) |
| Azure Monitor | Metric storage and alerting | Custom metrics API |
3. Log Integration: From CMAK to Log Analytics
3.1 Tuning the CMAK Log Configuration
CMAK uses Logback as its logging framework, with the default configuration file at conf/logback.xml. To make the output easy for Azure Log Analytics to parse, adjust the log pattern to include the metadata fields below:
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${application.home}/logs/application.log</file>
<encoder>
<pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %message - { "cluster": "%X{cluster}", "user": "%X{user}", "requestId": "%X{requestId}" }%n</pattern>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>${application.home}/logs/application.%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>5</maxHistory>
<totalSizeCap>5GB</totalSizeCap>
</rollingPolicy>
</appender>
Key improvements:
- ISO8601-formatted timestamps
- A structured JSON metadata fragment on every log line
- Context fields such as cluster name, user name, and request ID, populated via MDC (see the sketch below)
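The %X{cluster}, %X{user}, and %X{requestId} conversion words resolve against SLF4J's MDC, so the application must populate those keys before logging. A minimal sketch, assuming a hypothetical helper (the name and call sites are illustrative, not part of CMAK):
import org.slf4j.MDC

// Populate MDC so the logback pattern's %X{cluster}/%X{user}/%X{requestId}
// fields resolve; clear afterwards to avoid leaking context across pooled threads.
def withRequestContext[T](cluster: String, user: String, requestId: String)(body: => T): T = {
  MDC.put("cluster", cluster)
  MDC.put("user", user)
  MDC.put("requestId", requestId)
  try body
  finally MDC.clear()
}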
3.2 Forwarding Logs with Filebeat
3.2.1 Installing Filebeat
Install Filebeat on the CMAK server:
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.14.0-linux-x86_64.tar.gz
tar xzvf filebeat-7.14.0-linux-x86_64.tar.gz
cd filebeat-7.14.0-linux-x86_64
3.2.2 Configuring Filebeat
Create the filebeat.yml configuration file:
filebeat.inputs:
- type: log
paths:
- /data/web/disk1/git_repo/gh_mirrors/cm/CMAK/logs/application*.log
fields:
log_type: cmak_application
fields_under_root: true
output.logstash:
hosts: ["logstash-eastus-01:5044"]
ssl.certificate_authorities: ["/etc/pki/tls/certs/ca-bundle.crt"]
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
- add_docker_metadata: ~
3.2.3 Configuring Logstash Output to Azure Log Analytics
The configuration below assumes the community logstash-output-azure_loganalytics output plugin is installed:
input {
beats {
port => 5044
}
}
output {
azure_loganalytics {
customer_id => "<your-customer-id>"
shared_key => "<your-shared-key>"
log_type => "CMAKLogs"
resource_id => "/subscriptions/<subscription-id>/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/<workspace-name>"
}
}
3.3 Azure Log Analytics Query Examples
3.3.1 Querying Error Logs
CMAKLogs
| where Level == "ERROR"
| where TimeGenerated > ago(1h)
| project TimeGenerated, Logger, Message, cluster
| sort by TimeGenerated desc
3.3.2 Cluster Operation Audit
CMAKLogs
| where Message contains "Created topic" or Message contains "Deleted topic"
| extend Operation = case(
Message contains "Created topic", "CreateTopic",
Message contains "Deleted topic", "DeleteTopic",
"Unknown"
)
| project TimeGenerated, Operation, User, cluster, Message
| sort by TimeGenerated desc
4. Metric Integration: Exporting JMX Data to Azure Monitor
4.1 How CMAK Collects JMX Data
CMAK handles JMX connections and metric collection in KafkaJMX.scala; the core code is:
def doWithConnection[T](jmxHost: String, jmxPort: Int, jmxUser: Option[String], jmxPass: Option[String], jmxSsl: Boolean)(fn: MBeanServerConnection => T) : Try[T] = {
val urlString = s"service:jmx:rmi:///jndi/rmi://$jmxHost:$jmxPort/jmxrmi"
val url = new JMXServiceURL(urlString)
try {
require(jmxPort > 0, "No jmx port but jmx polling enabled!")
// Assemble connector properties from credentials and SSL settings
val jmxConnectorProperties = List(credsProps, sslProps).flatten.foldRight(defaultJmxConnectorProperties)(_ ++ _)
val jmxc = JMXConnectorFactory.connect(url, jmxConnectorProperties.asJava)
try {
Try {
fn(jmxc.getMBeanServerConnection)
}
} finally {
jmxc.close()
}
} catch {
case e: Exception =>
logger.error(s"Failed to connect to $urlString",e)
Failure(e)
}
}
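As a usage sketch, doWithConnection can be called like this to read a single broker metric (host, port, and attribute below are placeholders, not CMAK code):
import javax.management.ObjectName
import scala.util.Try

// Read the one-minute BytesInPerSec rate from one broker's JMX endpoint;
// the MBean name matches the reference table in section 4.3.
val bytesInRate: Try[Double] =
  doWithConnection("broker-host", 9999, None, None, jmxSsl = false) { mbsc =>
    val name = new ObjectName("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec")
    mbsc.getAttribute(name, "OneMinuteRate").asInstanceOf[Double]
  }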
4.2 Implementing a Custom Metric Exporter
4.2.1 Adding the Azure Monitor Dependency
Add the Azure Monitor SDK dependency to build.sbt (the coordinates below are illustrative; check the current SDK release for the exact artifact and version):
libraryDependencies += "com.microsoft.azure" % "azure-monitor-metrics-ingestion" % "1.0.0-beta.1"
4.2.2 Implementing the Metric Export Utility
package kafka.manager.utils
import com.microsoft.azure.monitor.metricsingestion.models.MetricDataPoint
import com.microsoft.azure.monitor.metricsingestion.MetricsIngestionClient
import java.time.OffsetDateTime
import scala.collection.JavaConverters._
object AzureMetricsExporter {
private val client = MetricsIngestionClient.builder()
.connectionString(System.getenv("AZURE_METRICS_CONNECTION_STRING"))
.buildClient()
def exportMetrics(clusterName: String, brokerId: Int, metrics: Map[String, Double]): Unit = {
val dataPoints = metrics.map { case (metricName, value) =>
MetricDataPoint.builder()
.time(OffsetDateTime.now())
.name(metricName)
.value(value)
.build()
}.asJava
val metricNamespace = "CMAK/Kafka"
val resourceId = s"/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Kafka/Clusters/$clusterName"
client.upload(resourceId, metricNamespace, dataPoints)
}
}
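A quick smoke test for the exporter might look like this (cluster name and values are made up; it assumes AZURE_METRICS_CONNECTION_STRING is set in the environment):
// Push two sample data points for broker 1 of a hypothetical cluster.
AzureMetricsExporter.exportMetrics(
  clusterName = "prod-kafka",
  brokerId    = 1,
  metrics     = Map(
    "BytesInPerSec"    -> 1048576.0,
    "MessagesInPerSec" -> 2400.0
  )
)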
4.2.3 Wiring the Exporter into BrokerViewCacheActor
// Add metric export logic in BrokerViewCacheActor.scala
import kafka.manager.utils.AzureMetricsExporter
def updateBrokerMetrics(id: Int, metrics: BrokerMetrics): Unit = {
// existing logic ...
// export key metrics
val exportMetrics = Map(
"BytesInPerSec" -> metrics.bytesInPerSec.oneMinuteRate,
"BytesOutPerSec" -> metrics.bytesOutPerSec.oneMinuteRate,
"MessagesInPerSec" -> metrics.messagesInPerSec.oneMinuteRate,
"FailedProduceRequestsPerSec" -> metrics.failedProduceRequestsPerSec.oneMinuteRate,
"FailedFetchRequestsPerSec" -> metrics.failedFetchRequestsPerSec.oneMinuteRate
)
AzureMetricsExporter.exportMetrics(clusterName, id, exportMetrics)
}
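Because the Azure upload is a network call, invoking it inline would block the actor's polling loop. One way to decouple it, assuming a suitable ExecutionContext and the actor's logger are in scope (a sketch, not the project's actual threading model):
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Fire-and-forget export; failures are logged instead of crashing the actor.
Future {
  AzureMetricsExporter.exportMetrics(clusterName, id, exportMetrics)
}.failed.foreach(e => logger.warn(s"Metric export for broker $id failed", e))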
4.3 Reference Table of Common Metrics
| Metric | Description | JMX Source | Recommended Aggregation |
|---|---|---|---|
| BytesInPerSec | Inbound bytes per second | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Average |
| BytesOutPerSec | Outbound bytes per second | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Average |
| MessagesInPerSec | Messages per second | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | Average |
| FailedProduceRequestsPerSec | Failed produce requests per second | kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec | Sum |
| FailedFetchRequestsPerSec | Failed fetch requests per second | kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec | Sum |
| ActiveControllerCount | Number of active controllers | kafka.controller:type=KafkaController,name=ActiveControllerCount | Max |
| OfflinePartitionsCount | Number of offline partitions | kafka.controller:type=KafkaController,name=OfflinePartitionsCount | Max |
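For reference, the table above maps onto JMX ObjectNames as a plain lookup table that a poller could iterate with the doWithConnection helper from section 4.1:
// Metric name -> JMX MBean, as listed in the table above.
val brokerMetricMBeans: Map[String, String] = Map(
  "BytesInPerSec"               -> "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
  "BytesOutPerSec"              -> "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",
  "MessagesInPerSec"            -> "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
  "FailedProduceRequestsPerSec" -> "kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec",
  "FailedFetchRequestsPerSec"   -> "kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec",
  "ActiveControllerCount"       -> "kafka.controller:type=KafkaController,name=ActiveControllerCount",
  "OfflinePartitionsCount"      -> "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
)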
5. Custom Dashboards and Alert Configuration
5.1 Creating an Azure Dashboard
- Sign in to the Azure portal and navigate to "Dashboard"
- Click "New" and select "Blank dashboard"
- Add the following tiles:
  - Metric chart: cluster throughput (sum of BytesInPerSec)
  - Metric chart: message latency (P95)
  - Log query: recent error logs
  - Metric alert: offline-partition alert status
5.2 Configuring Key Alert Rules
5.2.1 Offline Partitions Alert
- Signal: OfflinePartitionsCount
- Condition: greater than 0, sustained for 5 minutes
- Severity: Warning
- Action group: send an email notification, trigger an Azure Function for automated remediation
5.2.2 High Request Failure Rate Alert
- Signal: FailedProduceRequestsPerSec
- Condition: 5-minute average greater than 10
- Severity: Error
- Action group: send an SMS notification, create an incident ticket
5.3 Dashboard JSON Example
{
"properties": {
"lenses": {
"0": {
"order": 0,
"parts": {
"0": {
"position": {
"x": 0,
"y": 0,
"rowSpan": 2,
"colSpan": 4
},
"metadata": {
"inputs": [
{
"name": "queryInputs",
"value": {
"query": "CMAKMetrics | where MetricName == 'BytesInPerSec' | summarize sum(Value) by bin(TimeGenerated, 5m), ClusterName | render timechart",
"timeRange": {
"duration": "PT1H"
}
}
}
],
"type": "Extension/HubsExtension/PartType/MonitorChartPart"
}
}
}
}
}
}
}
6. Common Issues and Solutions
6.1 Parsing Errors from Non-Conforming Log Formats
Symptom: fields missing or malformed in Azure Log Analytics.
Solution: make sure the JSON fragment in the Logback pattern is well-formed and use %X{} to inject MDC context:
<pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %message - { "cluster": "%X{cluster}", "user": "%X{user}", "requestId": "%X{requestId}" }%n</pattern>
6.2 JMX连接超时或认证失败
问题表现:CMAK日志中出现Failed to connect to service:jmx:rmi:///jndi/rmi://...
解决方案:
- Check that the JMX port on the Kafka broker is open:
netstat -tln | grep 9999
- Verify the JMX authentication configuration:
// Enable verbose logging in KafkaJMX.scala
logger.debug(s"JMX connection params: host=$jmxHost, port=$jmxPort, user=${jmxUser.isDefined}")
- Confirm network connectivity between the CMAK server and the broker:
telnet broker-host 9999
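If telnet is unavailable on the host, a small in-process probe can perform the same check (hypothetical helper, not CMAK code):
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Returns true if a TCP connection to the JMX port succeeds within the timeout.
def jmxPortReachable(host: String, port: Int, timeoutMs: Int = 3000): Boolean =
  Try {
    val socket = new Socket()
    try { socket.connect(new InetSocketAddress(host, port), timeoutMs); true }
    finally socket.close()
  }.getOrElse(false)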
6.3 Delayed or Missing Metric Data
Symptom: metric data in Azure Monitor is incomplete or lags by more than 5 minutes.
Solutions:
- Check the metric exporter's logs:
grep AzureMetricsExporter logs/application.log
- Verify the Azure SDK connection configuration:
// Add connection debug logging
logger.info(s"Azure metrics exported to resourceId: $resourceId, points: ${dataPoints.size}")
- Batch the export to stay within API limits:
// Upload data points in batches (dataPoints is the Scala collection built in
// AzureMetricsExporter before the .asJava conversion)
val batchSize = 100
dataPoints.grouped(batchSize).foreach { batch =>
  client.upload(resourceId, metricNamespace, batch.asJava)
}
7. Summary and Best Practices
7.1 Key Takeaways
This article walked through a complete CMAK-to-Azure Monitor integration, covering:
- Centralized log management with Logback and Filebeat
- A metric export path based on JMX and the Azure SDK
- Custom dashboard and alert configuration
- Diagnosis and resolution of common issues
7.2 Recommended Practices
- Logging:
  - Always include the cluster name and a timestamp field
  - Error logs should carry full stack traces
  - Prune historical logs regularly to avoid exhausting disk space
- Metrics:
  - Export key business metrics first to avoid data overload
  - Tier the metrics: sample core metrics at high frequency, non-core metrics at low frequency
  - Use tags to distinguish environments (dev/test/prod)
- Security:
  - Enable SSL on JMX connections
  - Inject the Azure connection string via environment variables
  - Rotate access keys and certificates regularly
- Performance:
  - Forward logs asynchronously
  - Export metrics in batches
  - Deploy the monitoring server in the same region as the Kafka cluster
With this integration in place, you can take full advantage of Azure's monitoring capabilities to give your Kafka cluster enterprise-grade observability, significantly improving troubleshooting efficiency and system stability.