区块链项目：生产环境上chaincode容器不断重启的排查

最新推荐文章于 2023-05-19 10:02:40 发布

原创最新推荐文章于 2023-05-19 10:02:40 发布 · 2w 阅读

2 ·

CC 4.0 BY-SA版权

文章标签：

#区块链 #容器 #运维

区块链同时被 2 个专栏收录

34 篇文章

订阅专栏

fabric

13 篇文章

订阅专栏

区块链项目中，业务chaincode容器频繁重启，日志显示为`transport is closing`错误。分析发现，原因是银行内部网络环境中，F5在300秒无数据通信时会断开连接。解决方案是启用grpc的keepalive选项，保持连接活跃，避免容器重启。状态chaincode因每30秒有交易，避免了类似问题。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

#背景

在做区块链项目时，银行发送业务交易时有时会出现超时，报错的response;于是跟踪部署的服务的相关容器，发现业务chaincode容器存在不断重启的现象，业务chaincode容器存活一般几分钟到几小时不等。这说明chaincode容器异常挂掉过，然后再被拉起。
但是获取区块链状态的状态chaincode没有重启过，一直在运行。

#现象描述
具体地，chaincode容器的日志如下：
chaincode err log
错误为：

[E] [chaincode.go:304] Received error from server: [rpc error: code = Internal desc = transport is closing], ending chaincode stream

相应地，peer上的错误日志如下：
peer err log

错误为：

2018/07/11 08:11:11 [E] [handler.go:329] Error handling chaincode support stream: [rpc error: code = Canceled desc = context canceled]

从以上现象，可知chaincode确实因为错误挂掉了，这个错误表面上看是因为一个rpc error导致的，再具体一点的原因是：transport is closing, 也即是因为rpc的这个链接被关掉了，但是为何被关掉现在还不得知。

#原因分析

原因分析一：查看代码，追溯何因导致

chaincode启动过程：

main()->shim.Start()->userChaincodeStreamGetter():主要是创建grpc的连接，返回stream对象-> chatWithPeer(chaincodename, stream, cc):与peer建立通信，处理一些消息

###grpc client端：chaincode处理peer发过来消息方法：

go func() {

		...
		
		for {
		in = nil
		err = nil
		nsInfo = nil
		if recv {
			recv = false
			go func() {
				var in2 *pb.ChaincodeMessage
				in2, err = stream.Recv()
				msgAvail <- in2
			}()
		}
		select {
		case sendErr := <-errc:
			//serialSendAsync successful?
			if sendErr == nil {
				continue
			}
			//no, bail
			err = fmt.Errorf("Error sending %s: %s", in.Type.String(), sendErr)
			return
		case in = <-msgAvail:
			if err == io.EOF {
				chaincodeLogger.Debugf("Received EOF, ending chaincode stream, %s", err)
				return
			} else if err != nil {
				chaincodeLogger.Errorf("Received error from server: %s, ending chaincode stream", err) 
				return
			} else if in == nil {
				err = fmt.Errorf("Received nil message, ending chaincode stream")
				chaincodeLogger.Debug("Received nil message, ending chaincode stream")
				return
			}
			chaincodeLogger.Debugf("[%s]Received message %s from shim", shorttxid(in.Txid), in.Type.String())
			recv = true
		case nsInfo = <-handler.nextState:
			in = nsInfo.msg
			if in == nil {
				panic("nil msg")
			}
			chaincodeLogger.Debugf("[%s]Move state message %s", shorttxid(in.Txid), in.Type.String())
		}
		// Call FSM.handleMessage()
		err = handler.handleMessage(in)
		if err != nil {
			err = fmt.Errorf("Error handling message: %s", err)
			return
		}

		//keepalive messages are PONGs to the fabric's PINGs
		if in.Type == pb.ChaincodeMessage_KEEPALIVE {
			chaincodeLogger.Debug("Sending KEEPALIVE response")
			//ignore any errors, maybe next KEEPALIVE will work
			handler.serialSendAsync(in, nil)
		} else if nsInfo != nil && nsInfo.sendToCC {
			chaincodeLogger.Debugf("[%s]send state message %s", shorttxid(in.Txid), in.Type.String())
			handler.serialSendAsync(in, errc)
		}
	}
}()

...

从代码上可知因为：

case in = <-msgAvail:
	if err == io.EOF {
		chaincodeLogger.Debugf("Received EOF, ending chaincode stream, %s", err)
		return
	} else if err != nil {
		chaincodeLogger.Errorf("Received error from server: %s, ending chaincode stream", err)
	return
	}

stream.Recv()时出现了错误，这个错误是：transport is closing.

###grpc server端：peer的相关处理方法：

1.chaincode通过gprc调用register方法：

// Register the bidi stream entry point called by chaincode to register with the Peer.
func (chaincodeSupport *ChaincodeSupport) Register(stream pb.ChaincodeSupport_RegisterServer) error {
	return chaincodeSupport.HandleChaincodeStream(stream.Context(), stream)
}

2.HandleChaincodeStream

// HandleChaincodeStream Main loop for handling the associated Chaincode stream
func HandleChaincodeStream(chaincodeSupport *ChaincodeSupport, ctxt context.Context, stream ccintf.ChaincodeStream) error {
	deadline, ok := ctxt.Deadline()
	chaincodeLogger.Debugf("Current context deadline = %s, ok = %v", deadline, ok)
	handler := newChaincodeSupportHandler(chaincodeSupport, stream)
	return handler.processStream()
}

3.处理chaincode发过的消息：

processStream（）{
...
for {
...
	select {
	case in = <-msgAvail:
		// Defer the deregistering of the this handler.
		if err == io.EOF {
		chaincodeLogger.Debugf("Received EOF, ending chaincode support stream, %s", err)
			return err
		} else if err != nil {
			chaincodeLogger.Errorf("Error handling chaincode support stream: %s", err)
			return err
		} else if in == nil {
			err = fmt.Errorf("Received nil message, ending chaincode support stream")
			chaincodeLogger.Debug("Received nil message, ending chaincode support stream")
			return err
		}
	}
	
	...

}

通过peer日志和代码可知grpc server端也出现错误：chaincodeLogger.Errorf(“Error handling chaincode support stream: %s”, err)，这个错误具体地为：contexted canceled. 一般地，导致此错误的原因是对方的线程关掉了。

##原因分析二：

通过以上日志和代码初步怀疑：grpc连接出现问题，导致chaincode容器退出，server端接收到context canceled的错误。
检查chaincode和peer之前的网络连接情况。
由于区块链项目部署在各家银行内部，银行内部网络关系比较复杂，具体地一般访问需要开通网络端口，经过F5：
抓业务chaicode：7052映射为（32007）与peer之间的通信的报文：
1、所有通信端口交互的情况：

2、业务chaincode与peer通信的报文情况：
从以上抓包可知在发生业务chaincode挂掉的时间段里，chaincode与peer没有抓到报文，也即是没有通信。
从银行运维了解到我们申请的端口访问关系都是短链接，f5规定所有短链接，如果超过300秒没有数据通信，会进行拆链操作，也就是把此链接断掉

这也就解释了为什么会出现grpc connetion err: transport is closing的错误，在300秒内没有业务的话，这条链接会被f5断掉，导致业务的chaincode挂掉。

#解决方案

grpc在创建连接时有一个keepalive的配置选项，也就是server端会给这个链接发送keepalive message:

case <-handler.waitForKeepaliveTimer():
			if handler.chaincodeSupport.keepalive <= 0 {
				chaincodeLogger.Errorf("Invalid select: keepalive not on (keepalive=%d)", handler.chaincodeSupport.keepalive)
				continue
			}

			//if no error message from serialSend, KEEPALIVE happy, and don't care about error
			//(maybe it'll work later)
			handler.serialSendAsync(&pb.ChaincodeMessage{Type: pb.ChaincodeMessage_KEEPALIVE}, nil)
			continue

如果keepalive选项打开就会定时发送keepalive消息，这样f5就不再会进行拆链，因此grpc链接可以保持。

综上，解决方案为：打开keepalive选项

# keepalive in seconds. In situations where the communiction goes through a
    # proxy that does not support keep-alive, this parameter will maintain connection
    # between peer and chaincode.
    # A value <= 0 turns keepalive off
    keepalive: 30s

经过验证，业务容器不在频繁挂掉。

#总结
至此，排查出业务chaincode频繁重启的原因，但是还有一个问题我们的状态chaincode为什么没有频繁挂掉重启呢？网络关系也是一样的，keepalive同样没有开启。

这是因为我们状态chaincode每30秒会有一个获取区块状态的交易，这样的话f5就不会对这个链接进行拆链操作，所以不会出现chaincode挂掉的情况。

最后一点，在跨网进行服务部署的时候，一定要弄清楚网络访问的关系和特性，不然会耗费很大的人力和物力去排查解决，特别是已经上线的系统。