EC Append Error

Test Report

1. Scenario reproduction

Run an append against an EC file (based on the HDFS-14581 test):

import java.io.IOException;
import java.util.EnumSet;

import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

// dir is an EC-enabled directory; dfs is a DistributedFileSystem instance.
public static void appendTest() throws IOException {
    Path file = new Path(dir, "appendTest-1");
    // Create the file and write some initial data
    FSDataOutputStream out = dfs.create(file);
    out.write("testAppendWithoutNewBlock\n".getBytes());
    out.close();
    // Append without the NEW_BLOCK flag; on an EC file the NameNode
    // rejects this with an UnsupportedOperationException
    out = dfs.append(file, EnumSet.of(CreateFlag.APPEND), 4096, null);
    out.write("testAppendWithoutNewBlock2".getBytes());
    out.close();
}
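
The test assumes dir points at a directory that already has an erasure coding policy applied, e.g. (policy name illustrative):

hdfs ec -setPolicy -path /user/wanghongbing/ec -policy RS-6-3-1024k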

Client exception

[hadoop@bigdata-nmg-hdfstest13.nmg01 ~/whb]$ hadoop jar demo-mr-1.0-SNAPSHOT-jar-with-dependencies.jar
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.lang.UnsupportedOperationException): Append on EC file without new block is not supported.
	at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.prepareFileForAppend(FSDirAppendOp.java:190)
	at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:135)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2735)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:842)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

	at org.apache.hadoop.ipc.Client.call(Client.java:1503)
	at org.apache.hadoop.ipc.Client.call(Client.java:1441)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy10.append(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:332)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
	at com.sun.proxy.$Proxy11.append(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1811)
	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1880)
	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1850)
	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:395)
	at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:391)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:403)
	at cn.whbing.hadoop.ec.AppendECTest.appendTest(AppendECTest.java:57)
	at cn.whbing.hadoop.ec.AppendECTest.main(AppendECTest.java:25)

NameNode exceptions

  • ANN log:
2020-01-15 21:15:57,957 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call Call#4 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 10.83.78.58:56180
java.lang.UnsupportedOperationException: Append on EC file without new block is not supported.
        at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.prepareFileForAppend(FSDirAppendOp.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:135)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2735)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:842)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
  • The Standby NameNode then crashed (60 minutes later); its NameNode log shows:
2020-01-15 22:17:57,720 ... to transaction ID 144098821
2020-01-15 22:17:57,737 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/user/wanghongbing/ec/appendTest-1, replication=1, mtime
=1579097758258, atime=1579094157672, blockSize=134217728, blocks=[blk_-9223372036854775776_28378092], permissions=hadoop:supergroup:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storag
ePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=144098822]
java.io.IOException: File is not under construction: /user/wanghongbing/ec/appendTest-1
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
2020-01-15 22:17:57,738 ERROR org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
java.io.IOException: File is not under construction: /user/wanghongbing/ec/appendTest-1
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:476)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:258)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:161)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:898)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:329)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
2020-01-15 22:17:57,751 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: FSImageSaver clean checkpoint: txid=144098277 when meet shutdown.
...
2020-01-15 22:17:57,751 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: FSImageSaver clean checkpoint: txid=144089742 when meet shutdown.
2020-01-15 22:17:57,751 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigdata-nmg-hdfsmaster02.nmg01.diditaxi.com/10.83.75.42
************************************************************/
Exception analysis

Dumping all of the ANN's edit logs for that day with hdfs oev (command sketched just below) and searching for the file path, the related operations are as follows:
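
The XML dumps can be produced like this (segment name taken from the logs above):

hdfs oev -p XML -i edits_0000000000144098821-0000000000144098823 -o edits_0000000000144098821-0000000000144098823.xml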

  • Operations at Jan 15 21:17 (edits_0000000000144098698-0000000000144098704.xml):
- <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
- <OPCODE>OP_ADD</OPCODE>
- <OPCODE>OP_ALLOCATE_BLOCK_ID</OPCODE>
- <OPCODE>OP_SET_GENSTAMP_V2</OPCODE>
- <OPCODE>OP_ADD_BLOCK</OPCODE>
- <OPCODE>OP_CLOSE</OPCODE>
- <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
  • Operations at Jan 15 22:17 (edits_0000000000144098821-0000000000144098823.xml):
- <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
- <OPCODE>OP_CLOSE</OPCODE>
- <OPCODE>OP_END_LOG_SEGMENT</OPCODE>

The final CLOSE was executed a full hour later, which suggests it was generated by a forced close after the lease exceeded the hard limit (one hour by default). The ANN log confirms this:

2020-01-15 22:15:58,257 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: DFSClient_NONMAPREDUCE_-848910938_1, pending creates: 1] has expired hard limit
2020-01-15 22:15:58,257 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-848910938_1, pending creates: 1], src=/user/wanghongbing/ec/appendTest-1

2. Code walkthrough

FSDirAppendOp.java

appendFile() {
  // ① Acquire a lease for the append
  fsn.recoverLeaseInternal(RecoverLeaseOp.APPEND_FILE, ...);
  // ② Prepare the file for append
  prepareFileForAppend() {
    // Switch the file to the UnderConstruction state
    file.toUnderConstruction(leaseHolder, clientMachine);
    // ③ Throw when this is an EC (striped) file and the append
    //   does not start a new block
    if (!newBlock) {
      if (file.isStriped()) {
        throw new UnsupportedOperationException(...);
      }
    }
  }
}

The patch moves check ③ above in front of ①, throwing the exception early so that no lease is ever acquired; with no lease left dangling, the one-hour hard-limit force-release can no longer occur. A simplified sketch follows.
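
A minimal, self-contained sketch of the reordered logic, using stand-in methods rather than the real FSDirAppendOp internals:

// Illustrative only: recoverLease() and prepareFileForAppend() stand in
// for fsn.recoverLeaseInternal(...) and the real preparation logic.
class AppendOrderSketch {
  static void appendFile(boolean fileIsStriped, boolean newBlock) {
    // ③ moved first: reject EC append-without-NEW_BLOCK before any
    // lease (or other NameNode state) is created
    if (!newBlock && fileIsStriped) {
      throw new UnsupportedOperationException(
          "Append on EC file without new block is not supported.");
    }
    recoverLease();          // ① only reached for supported requests
    prepareFileForAppend();  // ② UnderConstruction conversion, etc.
  }
  static void recoverLease() { /* fsn.recoverLeaseInternal(...) */ }
  static void prepareFileForAppend() { /* file.toUnderConstruction(...) */ }
}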

3. Fix verification

  • 1. First run the ANN and SNN for a while on the temporary skip-ec-close build so that the bad record in the edit log is skipped over. (Mandatory; it had already been rolled out in production, so this step is skipped here.)
  • 2. Switch the SNN and ANN to the new build containing the fix.
  • 3. Test EC append, as follows:
[hadoop@bigdata-nmg-hdfstest13.nmg01 ~]$ hadoop fs -appendToFile - /user/wanghongbing/ec/test2
appendToFile: Append on EC file without new block is not supported. Use NEW_BLOCK create flag while appending file.
[hadoop@bigdata-nmg-hdfstest13.nmg01 ~]$ hdfs fsck /user/wanghongbing/ec/test2 -openforwrite
FSCK started by hadoop(null) (auth:SIMPLE) from /10.83.78.58 for path /user/wanghongbing/ec/test at Fri Jan 17 14:26:40 CST 2020

Test criteria

  • No OPENFORWRITE is reported, which shows the file was already closed when the exception was thrown, so no lease hard-limit release will follow. As expected.
  • After 60 minutes both the ANN and SNN were still running normally, with no errors in the NameNode logs.
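
For completeness, the supported way to append to an EC file is to pass the NEW_BLOCK create flag, as the error message above suggests. A minimal client sketch (target path reused from the test; adjust as needed):

import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EcAppendNewBlock {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    Path file = new Path("/user/wanghongbing/ec/test2");
    // NEW_BLOCK makes the NameNode start a fresh block group instead of
    // reopening the last striped block, so the append is accepted.
    try (FSDataOutputStream out = dfs.append(file,
        EnumSet.of(CreateFlag.APPEND, CreateFlag.NEW_BLOCK), 4096, null)) {
      out.write("appended with NEW_BLOCK\n".getBytes());
    }
  }
}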

4. Summary

The sequence of the problem:

1. An append was issued against an EC block; the client and the NN threw the exception;
2. Before throwing, the NN had already performed an add-lease operation, which is the crux of the problem. Because the append failed, the file was left in the openforwrite state even though no write ever started and the block never switched to UNDER_CONSTRUCTION;
3. 60 minutes later the lease exceeded the hard limit, the file was forcibly closed, and the edit log recorded that close;
4. While tailing the edit log, the SNN encountered a close on a file that was not UNDER_CONSTRUCTION and shut itself down.