Symptom:
Starting at around 06:01, the Hadoop cluster became abnormal and every HDFS operation failed.
Dozens of jobs failed with one of the following errors:

```
FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
```

or:

```
FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
```
Root cause analysis:
1) The NameNode log showed that the failure happened while writing to the JournalNodes, so a network problem was suspected first. However, there are 5 JournalNodes, and under the quorum (majority) algorithm writes to at least 3 of them would have to fail before anything breaks. Although some of the JournalNode and NameNode machines are not under the same switch, monitoring showed no network anomaly during that window, so a network problem was ruled out.
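The quorum arithmetic behind that reasoning can be sketched as follows (a minimal illustration, not Hadoop source code): QJM considers an edit-log write successful once a strict majority of JournalNodes acknowledge it, so with 5 JournalNodes, 2 can be unreachable without impact, and only a third simultaneous failure breaks the quorum.

```java
// Minimal sketch of QJM-style quorum arithmetic (illustration only, not Hadoop source).
public class QuorumMath {
    // A quorum is a strict majority of the JournalNodes.
    static int quorumSize(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    // A write succeeds only while at least a quorum of nodes can acknowledge it.
    static boolean writeSucceeds(int journalNodes, int failedNodes) {
        return journalNodes - failedNodes >= quorumSize(journalNodes);
    }

    public static void main(String[] args) {
        System.out.println("quorum of 5   = " + quorumSize(5));       // 3
        System.out.println("2 of 5 down: " + writeSucceeds(5, 2));    // true
        System.out.println("3 of 5 down: " + writeSucceeds(5, 3));    // false
    }
}
```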
2) A closer reading of the NameNode log, combined with the source code, showed that the NameNode ran an unusually long full GC (96s) at exactly that time. The pause made the JournalNode writes exceed the timeout (20s by default), and the NameNode process exited. The JournalNode logs confirm this: no further requests arrived from the NameNode afterwards, because it had already died.
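The `JvmPauseMonitor` warning that shows up in the NameNode log below is based on a simple idea: a daemon thread sleeps for a short, fixed interval and compares how long it actually slept with how long it meant to; any large gap means the whole JVM was paused (e.g. by a stop-the-world GC). A minimal sketch of that check (the sleep interval and warning threshold here are illustrative values, not Hadoop's actual defaults):

```java
// Sketch of the pause-detection idea behind Hadoop's JvmPauseMonitor.
// Illustration only; SLEEP_MS and WARN_THRESHOLD_MS are assumed values.
public class PauseCheck {
    static final long SLEEP_MS = 500;             // how long the monitor thread intends to sleep
    static final long WARN_THRESHOLD_MS = 10_000; // extra delay that counts as a serious pause

    // Extra time beyond the intended sleep, i.e. the detected JVM pause.
    static long detectedPauseMs(long actuallySleptMs) {
        return Math.max(0, actuallySleptMs - SLEEP_MS);
    }

    public static void main(String[] args) {
        // In this incident the thread meant to sleep ~500ms but woke up ~96s later.
        long pause = detectedPauseMs(96_645);
        System.out.println("pause of approximately " + pause + "ms"
                + (pause > WARN_THRESHOLD_MS ? " (WARN)" : ""));
    }
}
```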
GC log:

```
2014-05-25T05:59:53.578+0800: 2478553.316: [Full GC [PSYoungGen: 2784K->0K(4179328K)] [PSOldGen: 28566944K->9208608K(28573696K)] 28569728K->9208608K(32753024K) [PSPermGen: 41405K->41405K(524288K)], 95.6706740 secs] [Times: user=95.62 sys=0.00, real=95.67 secs]
Heap
 PSYoungGen      total 4179328K, used 87037K [0x00007f70e5000000, 0x00007f71e5000000, 0x00007f71e5000000)
  eden space 4164800K, 2% used [0x00007f70e5000000,0x00007f70ea4ff590,0x00007f71e3330000)
  from space 14528K, 0% used [0x00007f71e3330000,0x00007f71e3330000,0x00007f71e4160000)
  to   space 14336K, 0% used [0x00007f71e4200000,0x00007f71e4200000,0x00007f71e5000000)
 PSOldGen        total 28573696K, used 9208608K [0x00007f6a15000000, 0x00007f70e5000000, 0x00007f70e5000000)
  object space 28573696K, 32% used [0x00007f6a15000000,0x00007f6c470c8078,0x00007f70e5000000)
 PSPermGen       total 524288K, used 41466K [0x00007f69f5000000, 0x00007f6a15000000, 0x00007f6a15000000)
  object space 524288K, 7% used [0x00007f69f5000000,0x00007f69f787e8f8,0x00007f6a15000000)
```
JournalNode log:

```
2014-05-25 05:58:45,432 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_inprogress_0000000001665741687 -> /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_0000000001665741687-0000000001665777062
2014-05-25 06:44:54,299 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2014-05-25 06:44:54,302 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:
```
NameNode error log:

```
2014-05-25 06:01:29,258 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96275 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2014-05-25 06:01:29,259 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 96145ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=95670ms
GC pool 'PS Scavenge' had collection(s): count=1 time=592ms
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96281ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96283ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 20019 Total time for transactions(ms): 436 Number of transactions batched in Syncs: 9441 Number of syncs: 5160 SyncTimes(ms): 41623 9305
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96275ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at txid 1665777063))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:490)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:53)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:486)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:581)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1879)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1845)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:439)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:207)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44942)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96276ms to send a batch of 13 edits (1638 bytes) to remote journal 10.201.202.21:8485
2014-05-25 06:01:29,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /tmp/hive-hdfs/hive_2014-05-25_05-59-16_296_5866222407565920996/_task_tmp.-ext-10002/_tmp.001379_0. BP-180494678-xxx-1366627257763 blk_-3717787322078480343_437088977{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW]]}
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 2 to reach 3
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 1665777063
......
2014-05-25 06:01:29,268 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96285ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
......
2014-05-25 06:01:29,324 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-05-25 06:01:29,364 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
```
When the JournalNode write timed out, the `terminate` method of the `ExitUtil` class was invoked, killing the current process.
In the `JournalSet` class:
```java
for (JournalAndStream jas : journals) {
  try {
    closure.apply(jas);
  } catch (Throwable t) {
    if (jas.isRequired()) {
      final String msg = "Error: " + status + " failed for required journal ("
        + jas + ")";
      LOG.fatal(msg, t);
      // If we fail on *any* of the required journals, then we must not
      // continue on any of the other journals. Abort them to ensure that
      // retry behavior doesn't allow them to keep going in any way.
      abortAllJournals();
      // the current policy is to shutdown the NN on errors to shared edits
      // dir. There are many code paths to shared edits failures - syncs,
      // roll of edits etc. All of them go through this common function
      // where the isRequired() check is made. Applying exit policy here
      // to catch all code paths.
      terminate(1, msg);
    } else {
      LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
      badJAS.add(jas);
    }
  }
}
```
The `terminate` method of the `ExitUtil` class calls `System.exit`:
```java
/**
 * Terminate the current process. Note that terminate is the *only* method
 * that should be used to terminate the daemon processes.
 * @param status exit code
 * @param msg message used to create the ExitException
 * @throws ExitException if System.exit is disabled for test purposes
 */
public static void terminate(int status, String msg) throws ExitException {
  LOG.info("Exiting with status " + status);
  if (systemExitDisabled) {
    ExitException ee = new ExitException(status, msg);
    LOG.fatal("Terminate called", ee);
    if (null == firstExitException) {
      firstExitException = ee;
    }
    throw ee;
  }
  System.exit(status);
}
```
Immediate fix:
The cluster was restarted and service recovered at around 06:50.
Follow-up measures:
1) Increase the JournalNode write timeout, which is controlled by:
```
dfs.qjournal.write-txns.timeout.ms
```
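For example, the timeout could be raised in `hdfs-site.xml`; the 60-second value below is only an illustrative choice (the incident report does not specify a recommended value), to be tuned against observed GC pause times:

```xml
<!-- hdfs-site.xml: raise the QJM write timeout from the 20s default.
     60000ms is an assumed example value, not a recommendation from the incident. -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
```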
2) Tune the NameNode's JVM parameters so that full GC is triggered earlier, while less garbage has accumulated, so each full GC pause is shorter.
3) By default the NameNode uses the parallel (throughput) collector, whose full GCs are entirely stop-the-world; switch to CMS instead by adjusting the NameNode startup parameters:
```
-XX:+UseCompressedOops
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled
-XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
```
Another gotcha: Flume does not recover on its own after such an outage and has to be restarted manually; otherwise it keeps logging `java.lang.InterruptedException: sleep interrupted`.

Summary: this post analyzed a Hadoop cluster failure in detail. The failure was caused by a long NameNode full GC that made writes to the JournalNodes time out, which in turn triggered a cascade of errors. The post walks through the log analysis and fault localization, and lists the fixes, including increasing the JournalNode write timeout and tuning the NameNode's JVM parameters.