JournalNode无法启动问题排查
1. 问题说明
- 1.1 JournalNode重启后又失败,一直重启不成功,经过观察,发现日志报错,经排查报错原因是edit log损坏导致的
2018-05-28 16:06:07,896 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(359)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/DHTestCluster/current/edits_inprogress_0000000000019770365 while determining its valid length. Position was 1044480
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4974)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:189)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
2018-05-28 16:06:07,896 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(364)) - After resync, position is 1044480
2. 解决方法
2.1 从正常运行的JournalNode机器上copy edit log
2.2 切换到edit log所在的current目录
- cd /hadoop/hdfs/journal/DHTestCluster/current(根据自己的配置文件找到current目录)
2.3 压缩current目录
- tar -zcvf current.tar.gz ./current
2.4 删除损坏的edit log
cd /hadoop/hdfs/journal/DHTestCluster/
rm -rf current/
2.5 copy current.tar.gz到目标机器上
scp current.tar.gz hdfs@hadoop:/hadoop/hdfs/journal/DHTestCluster
tar -zxvf current.tar.gz
2.6 重启JournalNode即可