HBase RegionServer abnormal crash

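This post analyzes a case in which an HBase RegionServer crashed with a DroppedSnapshotException. The error occurred while flushing a Memstore, because writing to HDFS took too long. Flushes can be triggered by the Memstore size limit, the Region Server global memory limit, and the HLog count limit, among others. The fixes involve tuning the memstore settings, checking HDFS health, improving DataNode cluster load and network conditions, and restarting the server if necessary.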


Root cause analysis:

On our production HBase cluster, at around 1 a.m., one regionserver was found to have restarted (the regionserver is watched by a daemon that restarts it).

1. Check the master log:

2020-02-27 01:04:57,001 ERROR [RpcServer.FifoRWQ.default.read.handler=26,queue=10,port=16000] master.MasterRpcServices: Region server a3ster,16020,1582342923163 reported a fatal error:

ABORTING region server a3ser,16020,1582342923163: Replay of WAL required. Forcing server shutdown

Cause:

org.apache.hadoop.hbase.DroppedSnapshotException: region: T_BL,\x0A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1572576275632.069e4d877a4ff46f9964ac8bcddb09ef.

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2509)

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2186)

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2148)

at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2039)

at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1965)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:505)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)

at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=101793126, WAL system stuck?

at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1406)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1400)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1512)

at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:126)

at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:75)

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2486)

... 9 more

2020-02-27 01:04:57,032 ERROR [RpcServer.FifoRWQ.default.read.handler=29,queue=8,port=16000] master.MasterRpcServices: Region server a3ser,16020,1582342923163 reported a fatal error:

ABORTING region server a3serz,16020,1582342923163: Replay of WAL required. Forcing server shutdown

Cause:

2. Check the regionserver log:

2020-02-27 01:04:56,813 WARN [ResponseProcessor for block BP-1884348122-10.62.2.1-1545175191847:blk_1489206371_467735337] hdfs.DFSClient: Slow ReadProcessor read fields took 327586ms (threshold=30000ms); ack: seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 965211 4: "\000\000", targets: [11.23.3.3:9866, 11.23.3.5:9866]

2020-02-27 01:04:56,816 FATAL [MemStoreFlusher.6] regionserver.HRegionServer: ABORTING region server a3serz,16020,1582342923163: Replay of WAL required. Forcing server shutdown

org.apache.hadoop.hbase.DroppedSnapshotException: region: T_BL,\x0A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1572576275632.069e4d877a4ff46f9964ac8bcddb09ef.

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2509)

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2186)

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2148)

at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2039)

at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1965)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:505)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)

at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)

at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for ringBufferSequence=101793126, WAL system stuck?

at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:174)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1406)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:1400)

at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1512)

at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:126)

at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:75)

at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2486)

Analysis:

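A DroppedSnapshotException generally means something went wrong while flushing a memstore. The log lines above, notably the WAL sync timeout after 300000 ms and the DFSClient Slow ReadProcessor warning (327586ms against a 30000ms threshold), show that while flushing the memstore of one region, writing to HDFS took too long, which brought the regionserver down.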
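To make this check repeatable, the slow-write warnings can be pulled out of the regionserver log with a small script. The following is a minimal Python sketch; the function name `find_slow_read_processor` and the regular expression are my own assumptions, inferred from the single DFSClient warning quoted above, and other Hadoop versions may word the message differently. The sample line abbreviates the block ID with "...".

```python
import re

# Pattern inferred from the one DFSClient warning quoted in the log above
# (assumption: the message format is the same in your Hadoop version).
SLOW_RE = re.compile(
    r"Slow ReadProcessor read fields took (\d+)ms \(threshold=(\d+)ms\)"
)


def find_slow_read_processor(lines):
    """Return a (took_ms, threshold_ms) pair for every matching log line."""
    hits = []
    for line in lines:
        match = SLOW_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


# Sample line shortened from the regionserver log above.
sample = [
    "2020-02-27 01:04:56,813 WARN [ResponseProcessor for block ...] "
    "hdfs.DFSClient: Slow ReadProcessor read fields took 327586ms "
    "(threshold=30000ms); ack: seqno: 1 status: SUCCESS",
]

for took_ms, threshold_ms in find_slow_read_processor(sample):
    ratio = took_ms / threshold_ms
    print(f"ack took {took_ms} ms, {ratio:.1f}x the {threshold_ms} ms threshold")
```

In this incident the ack took 327586 ms, which is longer than the 300000 ms WAL sync timeout reported in the TimeoutIOException, so the two log lines are consistent with each other.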

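The conditions that trigger an HBase memstore flush are as follows:

HBase triggers a flush in the situations below. Note that the smallest flush unit is the HRegion, not a single MemStore. It follows that if an HRegion contains many Memstores, every flush is expensive, which is why we also recommend keeping the number of ColumnFamilies as small as possible when designing tables.

Memstore-level limit: when any single MemStore in a Region reaches its limit (hbase.hregion.memstore.flush.size, default 128MB), a Memstore flush is triggered.

Region-level limit: when the combined size of all Memstores in a Region reaches its limit (hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size, default 2 * 128M = 256M), a memstore flush is triggered.

Region Server-level limit: when the combined size of all Memstores on a Region Server reaches its limit (hbase.regionserver.global.memstore.upperLimit * hbase_heapsize, default 40% of the JVM heap), some of the Memstores are flushed. Flushing proceeds from the largest Memstore to the smallest: the Region with the largest Memstore is flushed first, then the next largest, until total Memstore memory usage drops below the lower threshold (hbase.regionserver.global.memstore.lowerLimit * hbase_heapsize, default 38% of the JVM heap).

HLog count limit: when the number of HLogs on a Region Server reaches its limit (configurable through hbase.regionserver.maxlogs), the system picks the one or more Regions associated with the oldest HLog and flushes them.

Periodic flush: HBase flushes Memstores periodically, by default every hour, so that no Memstore goes too long without being persisted. To avoid the problems caused by all MemStores flushing at the same moment, periodic flushes are given a random delay of about 20000 ms.

Manual flush: a user can run the shell command flush 'tablename' or flush 'region name' to flush a table or a single Region, respectively.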
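The size-based limits above are simple arithmetic, which can be sketched in Python as follows. The function name `flush_thresholds` and the 16 GB heap are hypothetical and only for illustration; the default values are the ones quoted in this post and may differ in your HBase version.

```python
MB = 1024 * 1024


def flush_thresholds(heap_bytes,
                     flush_size=128 * MB,    # hbase.hregion.memstore.flush.size
                     block_multiplier=2,     # hbase.hregion.memstore.block.multiplier
                     global_upper=0.40,      # ...global.memstore.upperLimit
                     global_lower=0.38):     # ...global.memstore.lowerLimit
    """Compute the size-based flush limits described above, in bytes."""
    return {
        "memstore_flush_bytes": flush_size,
        "region_limit_bytes": block_multiplier * flush_size,
        "rs_upper_bytes": int(heap_bytes * global_upper),
        "rs_lower_bytes": int(heap_bytes * global_lower),
    }


# Hypothetical regionserver heap of 16 GB (not taken from the incident).
heap = 16 * 1024 * MB
limits = flush_thresholds(heap)
for name, value in limits.items():
    print(f"{name}: {value / MB:.0f} MB")
```

With these defaults the Region-level limit works out to 256 MB regardless of heap size, while the two Region Server-level limits scale with the heap.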

The problem described above is not purely an HBase issue; HDFS is involved as well.

3. Check the "Slow" messages on the datanode while HBase was writing data

$ egrep -o "Slow.*?(took|cost)" hadoop-hduser-datanode-a3ser.log.1 | sort | uniq -c
     36 Slow BlockReceiver write data to disk cost
   2743 Slow BlockReceiver write packet to mirror took
      2 Slow flushOrSync took
     35 Slow manageWriterOsCache took
     21 Slow PacketResponder send ack to upstream took

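Notes:

Slow BlockReceiver write data to disk cost: indicates a delay in writing the block to the OS cache or disk

Slow BlockReceiver write packet to mirror took: indicates a delay in writing the block across the network

Slow manageWriterOsCache took: indicates a delay in writing the block to the OS cache or disk

Slow PacketResponder send ack to upstream took: not sure what this one indicates...

Slow flushOrSync took: indicates a delay in writing the block to the OS cache or disk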
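Grouping the counts by the explanations above makes it easier to see whether disk or network dominates. Below is a minimal Python sketch; the counts are copied from the egrep output in step 3, while the category labels and the helper `summarize` are my own naming, and the message whose meaning is unclear is filed under "unknown".

```python
# Counts copied from the egrep output in step 3.
counts = {
    "Slow BlockReceiver write data to disk cost": 36,
    "Slow BlockReceiver write packet to mirror took": 2743,
    "Slow flushOrSync took": 2,
    "Slow manageWriterOsCache took": 35,
    "Slow PacketResponder send ack to upstream took": 21,
}

# Category per message, following the notes above (labels are assumptions).
CATEGORY = {
    "Slow BlockReceiver write data to disk cost": "disk",
    "Slow BlockReceiver write packet to mirror took": "network",
    "Slow flushOrSync took": "disk",
    "Slow manageWriterOsCache took": "disk",
    "Slow PacketResponder send ack to upstream took": "unknown",
}


def summarize(message_counts):
    """Sum the message counts per category."""
    totals = {}
    for message, count in message_counts.items():
        category = CATEGORY.get(message, "unknown")
        totals[category] = totals.get(category, 0) + count
    return totals


totals = summarize(counts)
grand_total = sum(totals.values())
for category, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {count} ({count / grand_total:.1%})")
```

On this datanode the network-related message accounts for the large majority of the slow events, which is why the network and the downstream datanodes are worth checking alongside the disks.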

4. Solutions

1. Tune the memstore size and the HLog count settings.

2. Check HDFS and repair any problems found.

3. Check the datanode cluster load and the network.

4. Restart the server.

a3ster
