I. Symptoms
1. Flink job log
2022-05-17 02:48:25,688 INFO org.apache.hadoop.hdfs.DataStreamer [] - Slow ReadProcessor read fields for block BP-1487117196-110.246.92.211-1641695608175:blk_1305548153_236768430 took 44369ms (threshold=30000ms); ack: seqno: 2041 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 44349240826 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[110.246.92.130:9866,DS-3ee15ef0-7de4-49ed-b031-1a2769cd9a97,DISK], DatanodeInfoWithStorage[110.246.92.217:9866,DS-382900d8-67b2-4aa0-8a66-e1ea435aaba4,DISK], DatanodeInfoWithStorage[110.246.92.218:9866,DS-e28bb23b-7a03-498a-b8a2-859382ad300e,DISK]]
2. Hadoop DataNode log
2022-05-24 01:27:12,269 WARN [DataXceiver for client DFSClient_NONMAPREDUCE_-1719586450_1 at /110.246.92.219:60654 [Receiving block BP-1487117196-110.246.92.211-1641695608175:blk_1328326532_260060124]] datanode.DataNode (BlockReceiver.java:manageWriterOsCache(934)) - Slow manageWriterOsCache took 510ms (threshold=300ms), volume=file:/data/hadoop/, blockId=1328326532
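
Both messages are driven by configurable thresholds: the client-side ReadProcessor warning in the Flink log is governed by dfs.client.slow.io.warning.threshold.ms (default 30000 ms), and the DataNode-side warning by dfs.datanode.slow.io.warning.threshold.ms (default 300 ms), matching the threshold=30000ms and threshold=300ms values printed above. A minimal sketch for checking the effective values, assuming the hdfs CLI is on the PATH:

# Print the effective slow-I/O warning thresholds (the values fall back to
# the HDFS defaults of 30000 ms and 300 ms when not set in hdfs-site.xml).
$ hdfs getconf -confKey dfs.client.slow.io.warning.threshold.ms
$ hdfs getconf -confKey dfs.datanode.slow.io.warning.threshold.ms

Raising a threshold only silences the warning; the message counts collected below are the better guide to where the time is actually going.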
II. Problem analysis and handling
1. Run the following command on each DataNode to count all "Slow" messages (a sketch for collecting the counts from every node in one pass follows the output):
$ egrep -o "Slow.*?(took|cost)" hadoop-root-datanode-datanode-1.out | sort | uniq -c
1105 Slow BlockReceiver write data to disk cost
330 Slow BlockReceiver write packet to mirror took
102 Slow flushOrSync took
5031 Slow manageWriterOsCache took
388 Slow PacketResponder send ack to upstream took
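
The dominant message here is Slow manageWriterOsCache (5031 occurrences), which points at the local disk path rather than the network. To gather the same counts from every DataNode at once, a minimal sketch assuming passwordless SSH; the hostnames and log path below are placeholders for your own cluster:

# Hypothetical host list and log location; adjust both to your cluster.
for h in datanode-1 datanode-2 datanode-3; do
  echo "== $h =="
  ssh "$h" 'egrep -o "Slow.*?(took|cost)" /var/log/hadoop/hadoop-root-datanode-*.out | sort | uniq -c | sort -rn'
done

The trailing sort -rn orders each node's messages by frequency, so the dominant bottleneck surfaces first.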

This article analyzes performance problems observed in a Hadoop cluster, with a focus on slow read and write operations. Going through the logs surfaced a series of DataNode-related warnings, and concrete troubleshooting steps and tools are proposed, such as using iostat to monitor disk activity (a sample invocation follows).
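
A hedged example of the iostat check (iostat ships with the sysstat package; the device name sdb below is an assumption, find yours with lsblk):

# Extended per-device statistics, refreshed every 5 seconds; watch %util
# and await on the volume backing /data/hadoop/ (sdb is an assumption).
$ iostat -dx sdb 5

Sustained %util near 100%, or await well above the device's normal latency, indicates disk saturation, which is consistent with the high Slow manageWriterOsCache and Slow BlockReceiver write data to disk counts above.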