When the ResourceManager is restarted, it may print the following error messages.
2018-01-11 16:02:23,991 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1515636545026_0001 failed 2 times due to Error launching appattempt_1515636545026_0001_000002. Got exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:99)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy82.startContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:251)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(com.google.protobuf.InvalidProtocolBufferException): Protocol message contained an invalid tag (zero).
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
2018-01-09 15:51:31,651 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e06_1515465880229_0007_02_000002 and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This is because the protobuf format changed between Hadoop 2.6 and Hadoop 2.7.
To solve this problem, the ResourceManager and the NodeManagers must not run simultaneously during the upgrade.
Steps to upgrade YARN:
1. Back up the nodemanagers file.
cp /home/hadoop/hadoop-2.6.0-cdh5.7.5/etc/hadoop/nodemanagers /home/hadoop/hadoop-2.6.0-cdh5.7.5/etc/hadoop/nodemanagers_bak
2. Stop the standby ResourceManager.
yarn-daemon.sh stop resourcemanager
3. Log in to the active ResourceManager and stop all NodeManagers.
echo "localhost" > /usr/local/hadoop/etc/hadoop/nodemanagers
yarn rmadmin -refreshNodes
4. Wait until all NodeManagers have stopped, then stop the active ResourceManager.
yarn-daemon.sh stop resourcemanager
5. Start the new version of the ResourceManager.
export NEW_VERSION_OF_HADOOP_LOCATION=/home/hadoop/hadoop-2.7.5
export HADOOP_HOME=${NEW_VERSION_OF_HADOOP_LOCATION}
export HADOOP_COMMON_LIB_NATIVE_DIR=${NEW_VERSION_OF_HADOOP_LOCATION}/lib/native
export HADOOP_HDFS_HOME=${NEW_VERSION_OF_HADOOP_LOCATION}
export HADOOP_COMMON_HOME=${NEW_VERSION_OF_HADOOP_LOCATION}
export HADOOP_CONF_DIR=${NEW_VERSION_OF_HADOOP_LOCATION}/etc/hadoop
export HADOOP_MAPRED_HOME=${NEW_VERSION_OF_HADOOP_LOCATION}
${NEW_VERSION_OF_HADOOP_LOCATION}/sbin/yarn-daemon.sh start resourcemanager
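Steps 4 and 5 both depend on a check that is easy to skip: that no NodeManager is still running, and that the environment really points at the new Hadoop version. The following is a minimal sketch of two helpers for those checks, not a tested procedure. The function names and the 10-second polling interval are hypothetical. It assumes that `yarn node -list -all` prints one row per node with the node state in the second whitespace-separated column, and that the first line of `hadoop version` has the form "Hadoop <version>".

```shell
# Count nodes whose state is RUNNING in `yarn node -list -all` output read from stdin.
# Assumption: the node state is the second whitespace-separated field of each node row.
count_running_nodes() {
    awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}

# Block until the ResourceManager reports no RUNNING NodeManager.
# The 10-second interval is an arbitrary choice.
wait_for_nodemanagers_to_stop() {
    while [ "$(yarn node -list -all 2>/dev/null | count_running_nodes)" -gt 0 ]; do
        sleep 10
    done
}

# Extract the version from `hadoop version` output read from stdin.
# Assumption: the first line looks like "Hadoop 2.7.5".
hadoop_version_from_output() {
    awk 'NR == 1 && $1 == "Hadoop" { print $2 }'
}

# Return non-zero if ${HADOOP_HOME} does not contain the expected Hadoop version.
check_hadoop_version() {
    expected="$1"
    actual="$("${HADOOP_HOME}/bin/hadoop" version 2>/dev/null | hadoop_version_from_output)"
    if [ "$actual" != "$expected" ]; then
        echo "expected Hadoop ${expected}, found '${actual}'" >&2
        return 1
    fi
}
```

With these defined, `wait_for_nodemanagers_to_stop` could be run on the active ResourceManager before stopping it in step 4, and `check_hadoop_version 2.7.5` could be run after the exports in step 5 and before the start command. If a NodeManager is running on the host left in the nodemanagers file, the wait loop will not finish on its own.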
By default, the ResourceManager considers an ApplicationMaster (AM) failed if it sends no heartbeat for 10 minutes. Adding the following property to yarn-site.xml shortens this interval to 1 minute, so the application is relaunched sooner.
<property>
<name>yarn.am.liveness-monitor.expiry-interval-ms</name>
<value>60000</value>
</property>
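To confirm that the property above is actually present in the configuration the new ResourceManager reads, the value can be extracted from yarn-site.xml. The following is a rough sketch; the function name `get_hadoop_property` is hypothetical, and it assumes each `<name>` and `<value>` element sits on a single line, as in the fragment above. It is a text match, not a real XML parser.

```shell
# Print the <value> that follows a given <name> in a Hadoop XML config file.
# $1 = property name, $2 = path to the XML file
get_hadoop_property() {
    awk -v name="$1" '
        # Remember that the wanted <name> element has been seen.
        index($0, "<name>" name "</name>") { found = 1 }
        # The first <value> at or after that line belongs to the property.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$2"
}
```

For example, `get_hadoop_property yarn.am.liveness-monitor.expiry-interval-ms "${HADOOP_CONF_DIR}/yarn-site.xml"` prints 60000 when the property is set as shown, and prints nothing when it is absent.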