Symptom
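Recently, in a customer environment, every Trafodion startup ended the same way: right after dcsstart, the DcsServer went straight to the Down state. The relevant dcs_master and dcs_server logs are shown below.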
-- dcs master log
2017-08-29 13:05:10,895, ERROR, org.trafodion.dcs.master.ServerManager, Node Number: , CPU: , PID: , Process Name: , , ,exit code [1], stdout [server 21612. Stop it first.]
2017-08-29 13:05:10,895, INFO, org.trafodion.dcs.util.RetryCounter, Node Number: , CPU: , PID: , Process Name: , , ,Sleeping 2000ms before retry #1...
2017-08-29 13:05:12,896, INFO, org.trafodion.dcs.master.ServerManager, Node Number: , CPU: , PID: , Process Name: , , ,Restarting DcsServer [bdtest04.novalocal:1], script [
-- dcs server log
2017-08-29 13:04:28,918, ERROR, org.trafodion.dcs.server.ServerManager, Node Number: , CPU: , PID: , Process Name: , , ,org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /trafodion/dcs/servers/registered/bdtest04.novalocal:1:4
2017-08-29 13:04:28,920, ERROR, org.trafodion.dcs.server.DcsServer, Node Number: , CPU: , PID: , Process Name: , , ,java.util.concurrent.ExecutionException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /trafodion/dcs/servers/registered/bdtest04.novalocal:1:4
Analysis
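Based on the errors above, the first suspicion was that the znode /trafodion/dcs/servers/registered/bdtest04.novalocal had not been created successfully in ZooKeeper. Checking with zkCli.sh, however, showed that the node already existed. A NoNode error for a node that is visibly present suggests that DCS may not have been talking to the same ZooKeeper ensemble that was being inspected.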
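When the logs are long, it can help to pull out exactly which znode paths ZooKeeper reported as missing before checking them in zkCli.sh. The following is a minimal sketch, not part of Trafodion; the function name `missing_znodes` is hypothetical, and it assumes the error text has the same `KeeperErrorCode = NoNode for <path>` shape as in the dcs server log above.

```python
import re

# Matches the ZooKeeper NoNode message and captures the znode path after it.
NO_NODE = re.compile(r"KeeperErrorCode = NoNode for (\S+)")


def missing_znodes(log_lines):
    """Return the distinct znode paths reported as NoNode, in first-seen order."""
    seen = []
    for line in log_lines:
        match = NO_NODE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    sample = [
        "2017-08-29 13:04:28,918, ERROR, org.trafodion.dcs.server.ServerManager, "
        "Node Number: , CPU: , PID: , Process Name: , , ,"
        "org.apache.zookeeper.KeeperException$NoNodeException: "
        "KeeperErrorCode = NoNode for "
        "/trafodion/dcs/servers/registered/bdtest04.novalocal:1:4",
    ]
    for path in missing_znodes(sample):
        print(path)
```

Each printed path can then be looked up in zkCli.sh on every ZooKeeper host, which makes it easier to spot a host that disagrees with the others.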
Resolution
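Cloudera Manager (CDH) showed the cause: after an earlier disk failure on one node, one of the ZooKeeper servers had been moved to a different host, but the "dcs.zookeeper.quorum" property in dcs-site.xml had never been updated to match. Correcting dcs-site.xml, syncing it to all Trafodion nodes, and restarting Trafodion resolved the problem.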
<property>
<name>dcs.zookeeper.quorum</name>
<value>bdtest04.novalocal,bdtest06.novalocal,bdtest03.novalocal</value>
</property>
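To catch this kind of drift earlier, the configured quorum can be compared against the host list the cluster manager currently reports. The sketch below is only an illustration under stated assumptions: the helper names `read_quorum` and `quorum_drift` are hypothetical, the real dcs-site.xml is assumed to wrap its properties in a Hadoop-style `<configuration>` root, and the "actual" host list (including `bdtest05.novalocal`) is invented for the demo and must be supplied by hand from the cluster manager.

```python
import xml.etree.ElementTree as ET


def read_quorum(xml_text, key="dcs.zookeeper.quorum"):
    """Return the host list configured under `key`, or [] if the property is absent."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so a bare <property> fragment works too.
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == key:
            value = prop.findtext("value", default="")
            return [host.strip() for host in value.split(",") if host.strip()]
    return []


def quorum_drift(configured, actual):
    """Return (stale, missing): hosts only in the config, and hosts only in the ensemble."""
    stale = sorted(set(configured) - set(actual))
    missing = sorted(set(actual) - set(configured))
    return stale, missing


# Assumed file layout: a <configuration> root around the property shown above.
SAMPLE = """<configuration>
  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>bdtest04.novalocal,bdtest06.novalocal,bdtest03.novalocal</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    configured = read_quorum(SAMPLE)
    # Hypothetical ensemble as reported by the cluster manager (made up for the demo).
    actual = ["bdtest04.novalocal", "bdtest06.novalocal", "bdtest05.novalocal"]
    stale, missing = quorum_drift(configured, actual)
    print("configured but not in ensemble:", stale)
    print("in ensemble but not configured:", missing)
```

Running the same check against the dcs-site.xml on every Trafodion node would also confirm that the file was synced everywhere before the restart.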