Troubleshooting Node Crashes in a Production Kafka Cluster


Host and process information

Host: 6 cores, 64 GB RAM, 5.3 TB disk
Kafka process: 4 GB memory, about 1K partitions, 3.7 TB of message data

This morning I found that one Kafka node had gone down. The logs on that host showed the following error:

Java HotSpot(TM) 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f04249ed000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/appweb/hs_err_pid5566.log
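
When several brokers fail the same way, it helps to pull the requested size and errno out of each `os::commit_memory` warning so the crashes can be compared side by side. The following is a minimal sketch, not part of the original investigation. The function name `parse_commit_failure` is made up for illustration, and reading the third argument as HotSpot's "executable" flag is my assumption about the JVM's log format.

```python
import errno
import re

# Matches the HotSpot warning printed when committing reserved memory fails.
COMMIT_FAILURE = re.compile(
    r"os::commit_memory\((0x[0-9a-fA-F]+), (\d+), (\d+)\) failed; "
    r"error='([^']*)' \(errno=(\d+)\)"
)


def parse_commit_failure(line):
    """Return the fields of a commit_memory warning as a dict, or None if absent."""
    m = COMMIT_FAILURE.search(line)
    if m is None:
        return None
    code = int(m.group(5))
    return {
        "address": int(m.group(1), 16),
        "bytes": int(m.group(2)),
        # Assumption: the third argument is the "executable" flag.
        "flag": int(m.group(3)),
        "message": m.group(4),
        "errno": code,
        # Symbolic name for the errno value, e.g. ENOMEM for 12.
        "errno_name": errno.errorcode.get(code, "UNKNOWN"),
    }
```

Applied to the warning above, this reports a 12288-byte request failing with ENOMEM. A request that small failing on a 64 GB host suggests the limit being hit is probably not simply the amount of installed RAM.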

I backed up the logs and restarted the broker.

The startup log kept emitting WARN entries like this one:

server.log.2021-07-22-11:[2021-07-22 11:37:18,003] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/data/kafka_2/ai_jl_simple-4/00000000000000000082.index) has non-zero size but the last offset is 82 which is no larger than the base offset 82.}. deleting /data/kafka_2/ai_jl_simple-4/00000000000000000082.timeindex, /data/kafka_2/ai_jl_simple-4/00000000000000000082.index and rebuilding index... (kafka.log.Log)

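These warnings continued until the broker finally came up. Judging by the timestamp of the last such WARN entry, the whole startup took 48 minutes.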
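
The startup time can also be measured from the log instead of read off by eye. The sketch below is a hypothetical helper (`rebuild_window` is a name chosen here). It assumes every relevant entry carries a `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp, as the WARN line above does. It returns the first and last timestamp of the index-rebuild warnings and how many there were.

```python
import re
from datetime import datetime

# Timestamp format used by the broker log lines shown in this post.
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")
MARKER = "Found a corrupted index file"


def rebuild_window(lines):
    """Return (first, last, count) for index-rebuild WARN entries, or None if none exist."""
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        m = STAMP.search(line)
        if m:
            # Second resolution is enough for a duration measured in minutes.
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)
```

Feeding it the lines of the startup log and subtracting `first` from `last` gives the length of the rebuild phase. The count shows how many index files had to be rebuilt after the unclean shutdown.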

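That afternoon the Kafka processes on two other nodes also went down, one after the other.
The error log from one of them: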

[2021-07-22 16:29:47,593] INFO Rolled new log segment for 'schumann-biz.app.log-0' in 10 ms. (kafka.log.Log)
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f3646dd0000, 65536, 1) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/appweb/hs_err_pid18240.log
#
# Compiler replay data is saved as:
# /home/appweb/replay_pid18240.log

The error log from the other:

[2021-07-22 16:31:17,899] FATAL [Replica Manager on Broker 2]: Halting due to unrecoverable I/O error while handling produce request:  (kafka.server.ReplicaManager)
kafka.common.KafkaStorageException: I/O exception in append to log 'wdp-monitor.app.log-0'
	at kafka.log.Log.append(Log.scala:362)
	at kafka.cluster.Partition$$anonfun$11.apply(Partition.scala:451)
	at kafka.cluster.Partition$$anonfun$11.apply(Partition.scala:439)
	at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
	at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:219)
	at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:438)
	at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:389)
	at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:375)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:375)
	at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:312)
	at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:427)
	at kafka.server.KafkaApis.handle(KafkaApis.scala:80)
	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:62)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Map failed
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:907)
	at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:61)
	at kafka.log.TimeIndex.<init>(TimeIndex.scala:55)
	at kafka.log.LogSegment.<init>(LogSegment.scala:73)
	at kafka.log.Log.roll(Log.scala:809)
	at kafka.log.Log.maybeRoll(Log.scala:775)
	at kafka.log.Log.append(Log.scala:419)
	... 21 more
Caused by: java.lang.OutOfMemoryError: Map failed
	at sun.nio.ch.FileChannelImpl.map0(Native Method)
	at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:904)
	... 27 more
Java HotSpot(TM) 64-Bit Server VM warning: Attempt to deallocate stack guard pages failed.
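
The logs above do not establish a root cause. One check that may be worth running is whether the broker is close to the Linux per-process mapping limit (`vm.max_map_count`). Kafka memory-maps its index files, and `Map failed` together with errno 12 on a host with plenty of RAM is often reported when that limit is reached. The sketch below is an assumption-laden diagnostic, not a confirmed explanation of this incident. The helper names are hypothetical, and it relies only on the standard `/proc/<pid>/maps` and `/proc/sys/vm/max_map_count` files.

```python
from pathlib import Path


def count_mappings(maps_text):
    """Count memory mappings in the text of /proc/<pid>/maps (one per non-empty line)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def mapping_headroom(maps_text, max_map_count):
    """Return how many more mappings the process could create before reaching the limit."""
    return max_map_count - count_mappings(maps_text)


def read_live_values(pid):
    """Read both inputs from a Linux /proc filesystem; needs permission to read the pid."""
    maps_text = Path(f"/proc/{pid}/maps").read_text()
    limit = int(Path("/proc/sys/vm/max_map_count").read_text())
    return maps_text, limit
```

On a running broker, `mapping_headroom(*read_live_values(pid))` near zero would point toward the mapping limit. A large positive value would suggest looking elsewhere, for example at the details in the `hs_err_pid*.log` files.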