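Symptom: the NameNode went down, and after every restart it crashed again within about a minute. The cluster became unhealthy and every job failed.
Initial finding: a colleague's mistaken operation had filled the disk to 100%.
I deleted some unneeded files right away and restarted. Damn, I restarted it many times, and the NameNode still would not start.
The situation was more serious than I had expected, so I dug further.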
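Before and after freeing space, it helps to confirm which filesystem is actually full. The following is only a sketch: `full_filesystems` is a hypothetical helper name, the 90% default threshold is arbitrary, and it assumes mount points contain no spaces. It reads `df -P` style output on stdin so the parsing can be checked without touching a live host.

```shell
# List mount points whose usage is at or above a threshold (default 90%).
# Input: `df -P` formatted text on stdin. Output: "<mount point> <use%>" per line.
# Assumption: mount points do not contain spaces.
full_filesystems() {
  threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {                       # skip the header row
      use = $5                     # Capacity column, e.g. "100%"
      sub(/%/, "", use)            # strip the percent sign
      if (use + 0 >= t + 0) print $6 " " $5
    }'
}

# Usage on a live host:
#   df -P | full_filesystems 95
```

If the NameNode or JournalNode data directories sit on one of the reported mount points, free space there before restarting.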
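Step 1: Based on the log below, I suspected safe mode was involved, so I tried to force the NameNode out of safe mode with sudo -u hdfs hdfs dfsadmin -fs hdfs://master01:8020 -safemode leave. The command failed because it could not connect to the NameNode (what? I retried n times; this road was a dead end).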
Execution of '/usr/hdp/current/hadoop-hdfs-namenode/bin/hdfs dfsadmin -fs hdfs://master01:8020 -safemode get | grep 'Safe mode is OFF'' returned 1
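The check in the log above returns 1 whenever the text "Safe mode is OFF" is missing, which also happens when the NameNode cannot be reached at all. A small sketch that separates the three cases is shown here; `safemode_state` is a hypothetical helper, and it only classifies whatever text `hdfs dfsadmin -safemode get` printed.

```shell
# Classify the output of `hdfs dfsadmin -safemode get` read from stdin.
# Prints "off", "on", or "unknown" (for example, a connection error message).
safemode_state() {
  output=$(cat)
  case "$output" in
    *"Safe mode is OFF"*) echo "off" ;;
    *"Safe mode is ON"*)  echo "on" ;;
    *)                    echo "unknown" ;;
  esac
}

# Usage on a live cluster (host and port taken from the command above):
#   sudo -u hdfs hdfs dfsadmin -fs hdfs://master01:8020 -safemode get 2>&1 | safemode_state
```

An "unknown" result here fits what happened in this step: -safemode leave is itself a request to the NameNode, so it cannot help while the NameNode process is not up to answer it.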
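Step 2: Check the NameNode log (below). It reports recoverUnfinalizedSegments failed for required journal, so the problem is most likely related to the JournalNode service (one step closer to the cause).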
2021-03-13 16:40:12,443 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [172.16.16.226:8485, 172.16.16.227:8485, 172.16.16.223:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:204)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:443)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:616)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:613)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1602)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1216)
at org.apache.hadoop.hdfs.serv<