HDFS DataNode marked dead

This post records an incident in which a DataNode in a Hadoop cluster was marked dead, walks through the log entries about overly long GC pauses, I/O exceptions and block-write errors, and discusses the likely cause and remedies.


1 Symptom

One DataNode in the cluster was marked dead; the node's own log around the incident is shown below:

GC pool 'PS MarkSweep' had collection(s): count=13 time=24256ms
2018-03-28 17:52:51,624 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-229667812-10.100.208.74-1520931759525:blk_1078749177_5008401 received exception java.io.IOException: Premature EOF from inputStream
2018-03-28 17:52:59,065 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6944ms
GC pool 'PS MarkSweep' had collection(s): count=15 time=27935ms
2018-03-28 17:54:59,834 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 14932ms
GC pool 'PS MarkSweep' had collection(s): count=20 time=37701ms
2018-03-28 17:56:10,452 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: tf73:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.100.208.68:46164 dst: /10.100.208.73:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:202)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:503)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:903)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
        at java.lang.Thread.run(Thread.java:748)
2018-03-28 17:58:03,025 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 13437ms (threshold=300ms), isSync:false, flushTotalNanos=13436879837ns
2018-03-28 17:56:42,980 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 25844ms
GC pool 'PS MarkSweep' had collection(s): count=63 time=118826ms
2018-03-28 17:59:02,118 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3395ms
GC pool 'PS MarkSweep' had collection(s): count=108 time=206400ms
2018-03-28 17:59:07,882 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3312ms
GC pool 'PS MarkSweep' had collection(s): count=4 time=7664ms
2018-03-28 17:59:13,599 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 src: /10.100.208.74:51732 dest: /10.100.208.73:50010
2018-03-28 17:59:21,115 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6989ms
GC pool 'PS MarkSweep' had collection(s): count=7 time=13229ms
2018-03-28 17:59:40,030 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 5667ms (threshold=300ms), isSync:false, flushTotalNanos=5666402795ns
2018-03-28 17:59:41,916 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749180_5008405 src: /10.100.208.74:51746 dest: /10.100.208.73:50010
2018-03-28 17:59:49,420 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3333ms
GC pool 'PS MarkSweep' had collection(s): count=10 time=18909ms
2018-03-28 18:00:01,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749179_5008404 src: /10.100.208.68:46194 dest: /10.100.208.73:50010
2018-03-28 18:00:55,777 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 already exists in state TEMPORARY and thus cannot be created.
2018-03-28 18:01:37,902 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:13334ms (threshold=300ms)
2018-03-28 18:02:26,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: tf73:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.100.208.74:51732 dst: /10.100.208.73:50010; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 already exists in state TEMPORARY and thus cannot be created.
2018-03-28 18:02:26,014 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7111ms
GC pool 'PS MarkSweep' had collection(s): count=80 time=152336ms
2018-03-28 18:03:38,774 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1366ms
GC pool 'PS MarkSweep' had collection(s): count=11 time=21098ms
2018-03-28 18:03:53,946 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:13320ms (threshold=300ms)
2018-03-28 18:04:32,159 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 13024ms
GC pool 'PS MarkSweep' had collection(s): count=61 time=116740ms
2018-03-28 18:06:50,214 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-229667812-10.100.208.74-1520931759525:blk_1078749180_5008405
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:202)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:503)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:903)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
        at java.lang.Thread.run(Thread.java:748)
2018-03-28 18:07:43,834 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:11756ms (threshold=300ms)
2018-03-28 18:07:43,834 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749182_5008407 src: /10.100.208.74:51760 dest: /10.100.208.73:50010
2018-03-28 18:07:17,288 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5282ms
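
To confirm which nodes the NameNode currently considers dead, the standard cluster report is enough; a quick check (the grep filter is only a convenience, the full report also shows per-node capacity and last-contact time):

# Cluster membership as the NameNode sees it
hdfs dfsadmin -report | grep -E "Live datanodes|Dead datanodes"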



2 Cause

Everything in the log points at memory pressure inside the DataNode JVM rather than at HDFS itself. 'PS MarkSweep' is the full-GC collector of the default parallel collector, and the JvmPauseMonitor entries show stop-the-world pauses of roughly 7 s, 15 s and up to ~26 s, with bursts of 60-100+ full collections costing 100-200 s of GC time between checks. While the JVM is frozen the DataNode can neither heartbeat to the NameNode nor keep the write pipeline moving, which explains the rest of the log: upstream writers give up and close their sockets ('Premature EOF from inputStream' on WRITE_BLOCK), and when a client retries a block that the stalled node had already begun receiving, the retry is rejected with ReplicaAlreadyExistsException (state TEMPORARY). If the heap stays exhausted, back-to-back full GCs can delay heartbeats long enough for the NameNode to declare the node dead. The 'Slow flushOrSync' and 'Slow BlockReceiver write data to disk' warnings (5-13 s against a 300 ms threshold) are most likely another symptom of the same pauses, but they could also indicate a genuinely slow or failing disk, so disk health is worth ruling out as well.
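
To watch the collector while the node is misbehaving, the stock JDK tools are sufficient; a minimal sketch (the PID lookup and the 5-second sampling interval are illustrative):

# Identify the DataNode JVM, then sample GC counters every 5 seconds
jps | grep DataNode
jstat -gcutil <datanode-pid> 5000

An O (old generation) column pinned near 100 while FGC and FGCT keep climbing confirms that the heap is simply too small for the block count this node carries.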


3 Remedies

Give the DataNode JVM enough headroom and make its GC observable: raise -Xms/-Xmx in HADOOP_DATANODE_OPTS (a DataNode tracking millions of block replicas needs several gigabytes of heap, and the out-of-the-box value is usually far smaller), enable GC logging so the next incident can be read directly from the GC log instead of inferred from JvmPauseMonitor, and consider a low-pause collector such as CMS or G1 in place of the default parallel collector. Independently, check the disks behind the slow-write warnings (iostat, SMART). If short pauses remain unavoidable, the NameNode's dead-node timeout, derived from dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval (about 10.5 minutes with default values), can be reviewed so that a transient pause does not trigger an unnecessary re-replication storm.
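
A minimal sketch of the corresponding settings in the DataNode's hadoop-env.sh, assuming Hadoop 2.x; the 4 GB heap, the choice of CMS and the GC log path are illustrative and should be sized to the node's actual block count:

# hadoop-env.sh on the DataNode (values are illustrative)
export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g \
    -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -Xloggc:/var/log/hadoop-hdfs/datanode-gc.log \
    ${HADOOP_DATANODE_OPTS}"

Appending the previous ${HADOOP_DATANODE_OPTS} at the end keeps whatever options the distribution already sets (for example the security logger flags) instead of overwriting them.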




A side effect worth keeping in mind: once a node is marked dead and its blocks are re-replicated elsewhere, the cluster can end up with DataNodes holding very different amounts of data. This data imbalance leaves some nodes overloaded while others sit nearly idle.

The usual causes of imbalance are:
1. Initial placement: as data enters HDFS its replicas are spread across DataNodes, but network latency and differences in node capacity can leave some nodes with more replicas than others.
2. Re-replication after failures: when a node dies or goes offline, the blocks it held are copied onto the remaining healthy nodes, which can concentrate blocks on a few of them.

The usual ways to restore balance are:
1. Replica balancing: the Balancer compares per-node utilisation and moves blocks from heavily used nodes to lightly used ones.
2. Block placement: when new blocks are written, the placement policy takes each node's remaining space (and the network topology) into account.
3. Administrator action: an administrator can run the Balancer on demand, or move specific replicas by hand, to even things out.

In short, imbalance is a routine condition in HDFS clusters; the Balancer, sensible block placement and occasional administrator intervention together keep data evenly distributed and preserve both reliability and performance.
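
When the dead node comes back (or is replaced) and the cluster has drifted out of balance, running the Balancer is the usual fix; a sketch, with an illustrative 10% threshold:

# Per-node "DFS Used%" shows how skewed the cluster is
hdfs dfsadmin -report
# Move blocks until every node is within 10 percentage points of the cluster average
hdfs balancer -threshold 10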