1. YouAreDeadException
FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=cloud13,60020,1348890729197, load=(requests=0, regions=375, usedHeap=2455, maxHeap=6035): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing cloud13,60020,1348890729197 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing cloud13,60020,1348890729197 as dead server
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:734)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:595)
at java.lang.Thread.run(Thread.java:722)
Now look at the Javadoc on YouAreDeadException:
/**
* This exception is thrown by the master when a region server reports and is
* already being processed as dead. This can happen when a region server loses
* its session but didn't figure it yet.
*/
Clearly, this is caused by a session timeout. Suppose the ZooKeeper session timeout is 30s: if the region server fails to heartbeat within those 30s, its session expires and the master starts processing it as a dead server. When the region server later reports in again, the master rejects the report with this exception. By far the most common trigger is a long GC pause, so pay close attention to the GC logs.
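The usual first-line mitigations are to raise the session timeout and to shorten GC pauses. Below is a minimal sketch, assuming an hbase-site.xml/hbase-env.sh setup of this era; the 120s value is only an example, and note that ZooKeeper caps the negotiated timeout at the server's maxSessionTimeout:

<!-- hbase-site.xml: give the region server more headroom before its session expires -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>

# hbase-env.sh: a low-pause collector helps keep GC pauses well under the timeout
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

Raising the timeout only buys headroom; if full-GC pauses routinely approach the session timeout, the heap size and collector need tuning regardless.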
--------------------------------------------------------------------------------
2. Got error for OP_READ_BLOCK
2012-10-09 02:22:41,788 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.0.1.170:50010 for file /hbase/pp_mac_all/784dcfc3fa060b66402a242080f5cd91/nf/5190449121954817199 for block blk_5558099265298248729_681382:java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.1.170:23458, remote=/10.0.1.170:50010, for file /hbase/pp_mac_all/784dcfc3fa060b66402a242080f5cd91/nf/5190449121954817199, for block 5558099265298248729_681382
at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1476)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1992)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2066)
at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:46)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2066)
at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:46)
at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:101)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:113)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.loadBlock(HFile.java:1442)
at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1299)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:136)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77)
at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1351)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2284)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1135)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1127)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1111)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:3009)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2911)
at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1661)
at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2551)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
Seen in isolation, this warning is usually harmless.
It is raised while reading a block from HDFS; the relevant retry loop in DFSClient (fetchBlockByteRange, per the stack trace) looks like this:
while (true) {
  // cached block locations may have been updated by chooseDataNode()
  // or fetchBlockAt(). Always get the latest list of locations at the
  // start of the loop.
  block = getBlockAt(block.getStartOffset(), false);
  DNAddrPair retval = chooseDataNode(block);
  DatanodeInfo chosenNode = retval.info;
  InetSocketAddress targetAddr = retval.addr;
  BlockReader reader = null;
  int len = (int) (end - start + 1);
  try {
    Token<BlockTokenIdentifier> accessToken = block.getBlockToken();
    // first try reading the block locally.
    if (shouldTryShortCircuitRead(targetAddr)) {
      try {
        reader = getLocalBlockReader(conf, src, block.getBlock(),
            accessToken, chosenNode, DFSClient.this.socketTimeout, start);
      } catch (AccessControlException ex) {
        LOG.warn("Short circuit access failed ", ex);
        // Disable short circuit reads
        shortCircuitLocalReads = false;
        continue;
      }
    } else {
      // go to the datanode
      dn = socketFactory.createSocket();
      NetUtils.connect(dn, targetAddr, socketTimeout);
      dn.setSoTimeout(socketTimeout);
      reader = BlockReader.newBlockReader(dn, src,
          block.getBlock().getBlockId(), accessToken,
          block.getBlock().getGenerationStamp(), start, len, buffersize,
          verifyChecksum, clientName);
    }
    int nread = reader.readAll(buf, offset, len);
    if (nread != len) {
      throw new IOException("truncated return from reader.read(): " +
          "excpected " + len + ", got " + nread);
    }
    return;
  } catch (ChecksumException e) {
    LOG.warn("fetchBlockByteRange(). Got a checksum exception for " +
        src + " at " + block.getBlock() + ":" +
        e.getPos() + " from " + chosenNode.getName());
    reportChecksumFailure(src, block.getBlock(), chosenNode);
  } catch (IOException e) {
    if (refetchToken > 0 && tokenRefetchNeeded(e, targetAddr)) {
      refetchToken--;
      fetchBlockAt(block.getStartOffset());
      continue;
    } else {
      LOG.warn("Failed to connect to " + targetAddr + " for file " + src
          + " for block " + block.getBlock() + ":" + e);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Connection failure ", e);
      }
    }
  } finally {
    IOUtils.closeStream(reader);
    IOUtils.closeSocket(dn);
  }
  // Put chosen node into dead list, continue
  addToDeadNodes(chosenNode);
}
Putting the code above together with the exception message, we can see the failure occurred while HDFS was reading a block: OP_READ_BLOCK is the read-block operation in the DataNode transfer protocol. Note that the final addToDeadNodes(chosenNode) does not put the DataNode onto any cluster-wide dead list; it only stops this particular read from trying that DataNode again, after which the while(true) loop retries against another replica.
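To make that concrete, here is a simplified sketch of the pattern (hypothetical code, not the actual Hadoop source): the dead-node set lives inside the input stream, so a DataNode blacklisted here is skipped only for the remainder of this read, while the cluster's view of the node is untouched.

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PerReadDeadNodes {
    // Per-stream exclusion set; NOT the NameNode's view of live/dead DataNodes.
    private final Set<String> deadNodes = new HashSet<String>();

    // Called after a replica fails: exclude it for this read only.
    void addToDeadNodes(String datanode) {
        deadNodes.add(datanode);
    }

    // Pick the first replica location we have not yet blacklisted.
    String chooseDataNode(List<String> replicas) throws IOException {
        for (String dn : replicas) {
            if (!deadNodes.contains(dn)) {
                return dn;
            }
        }
        throw new IOException("No live replicas left for this read");
    }
}

Because the set is per-stream, the next user operation starts with a clean slate, which is part of why these warnings are usually transient.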
The comment on the failure counter in DFSInputStream is also worth reading:
/**
* This variable tracks the number of failures since the start of the
* most recent user-facing operation. That is to say, it should be reset
* whenever the user makes a call on this stream, and if at any point
* during the retry logic, the failure count exceeds a threshold,
* the errors will be thrown back to the operation.
*
* Specifically this counts the number of times the client has gone
* back to the namenode to get a new list of block locations, and is
* capped at maxBlockAcquireFailures
*/
private int failures = 0;
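So the loop does not retry forever: each time the client has to go back to the NameNode for a fresh list of locations, failures is incremented, and once it reaches the cap (dfs.client.max.block.acquire.failures, default 3) the error is finally thrown back to the caller. Below is a rough, simplified sketch of that guard (hypothetical code, not the actual DFSClient source):

import java.io.IOException;
import java.util.Set;

class AcquireFailureGuard {
    private int failures = 0;
    private final int maxBlockAcquireFailures = 3; // dfs.client.max.block.acquire.failures

    // Invoked when every replica of the block has failed once for this read.
    void onAllReplicasFailed(Set<String> deadNodes, String blockId) throws IOException {
        if (failures >= maxBlockAcquireFailures) {
            // The real client surfaces a BlockMissingException at this point.
            throw new IOException("Could not obtain block: " + blockId);
        }
        failures++;
        deadNodes.clear(); // forget this read's exclusions before refetching locations
        // ...then sleep briefly and refetch block locations from the NameNode
    }
}

In other words, the warning in the log is just one failed attempt inside this retry budget; only when the budget is exhausted does the read actually fail.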