Symptom: the following error appears in the log (taken from the JobManager):
[2019-02-16 09:18:50,218] INFO Diagnostics for container container_e31_1548733575161_1174_01_000003 in state COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_e31_1548733575161_1174_01_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
The JobManager starts TaskManagers through the NodeManager, so I went to the NodeManager and checked its log. The relevant entries are:
2019-02-16 09:19:32,065 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from LOCALIZING to LOCALIZED
2019-02-16 09:19:32,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from LOCALIZED to RUNNING
2019-02-16 09:19:32,079 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /data/yarn/nm/usercache/flink/appcache/application_1548733575161_1175/container_e31_1548733575161_1175_01_000004/default_container_executor.sh]
2019-02-16 09:19:32,166 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:32,181 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 18989 for container-id container_e31_1548733575161_1175_01_000004: 15.8 MB of 1 GB physical memory used; 1.7 GB of 2.1 GB virtual memory used
2019-02-16 09:19:32,194 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 2073 for container-id container_e31_1548733575161_0314_01_000003: 514.5 MB of 1 GB physical memory used; 2.0 GB of 2.1 GB virtual memory used
2019-02-16 09:19:32,202 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 5863 for container-id container_e31_1548733575161_0909_01_000001: 376.8 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used
2019-02-16 09:19:34,381 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e31_1548733575161_1175_01_000004 is : 1
2019-02-16 09:19:34,381 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e31_1548733575161_1175_01_000004 and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2019-02-16 09:19:34,381 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=1:
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.run(Shell.java:507)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.lang.Thread.run(Thread.java:748)
2019-02-16 09:19:34,382 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
2019-02-16 09:19:34,382 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /data/yarn/nm/usercache/flink/appcache/application_1548733575161_1175/container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,396 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=flink OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1548733575161_1175 CONTAINERID=container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e31_1548733575161_1175_01_000004 transitioned from EXITED_WITH_FAILURE to DONE
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_e31_1548733575161_1175_01_000004 from application application_1548733575161_1175
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_e31_1548733575161_1175_01_000004 for log-aggregation
2019-02-16 09:19:34,396 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1548733575161_1175
2019-02-16 09:19:34,396 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_e31_1548733575161_1175_01_000004
2019-02-16 09:19:35,204 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e31_1548733575161_1175_01_000004
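The container launch clearly failed, but this log alone does not explain why the launch script exited with an error.
I then remembered a friend mentioning that deletion of the container files can be delayed. A Baidu search turned up these articles: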
https://blog.youkuaiyun.com/xiao_jun_0820/article/details/76081321
https://blog.youkuaiyun.com/wangming520liwei/article/details/78923216
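According to these articles:
yarn.nodemanager.delete.debug-delay-sec
Default: 0, meaning the local files are deleted as soon as the app finishes.
Description: the number of seconds after an application finishes before the NodeManager's DeletionService deletes the application's localized files and log directories. To diagnose YARN application problems, set this property to a large enough value (for example 600 seconds, i.e. 10 minutes) to allow these directories to be inspected.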
---------------------
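As a rough illustration of the setting above, here is a minimal Python sketch that adds or updates the property in a Hadoop-style configuration document. It assumes the usual `<configuration><property><name>/<value>` layout of yarn-site.xml. The helper name `set_property` and the sample input are made up for this example. On a managed cluster the value is normally changed through the cluster manager instead, and the NodeManagers have to be restarted either way.

```python
import xml.etree.ElementTree as ET


def set_property(xml_text: str, name: str, value: str) -> str:
    """Return xml_text with the property `name` set to `value`.

    Hypothetical helper: updates the property if it already exists,
    otherwise appends a new <property> element.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            node = prop.find("value")
            if node is None:
                node = ET.SubElement(prop, "value")
            node.text = value
            break
    else:
        # No existing entry with this name: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Sample input is an empty configuration, not a real yarn-site.xml.
sample = "<configuration></configuration>"
updated = set_property(sample, "yarn.nodemanager.delete.debug-delay-sec", "600")
print(updated)
```

Keep the delay only as long as the investigation needs it: while it is set, the files of every finished container stay on the NodeManager's local disks for that many seconds.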
PS: on log aggregation, see https://blog.youkuaiyun.com/lrf2454224026/article/details/82700129
I gave it another try:
after asking ops to redeploy with this setting, the TaskManager log became visible:
[2019-02-16 11:17:01,557] INFO Unable to start Queryable State Server. All ports in provided range ([9067]) are occupied. org.apache.flink.queryablestate.network.AbstractServerBase.start(AbstractServerBase.java:197)
[2019-02-16 11:17:01,557] INFO Shutting down Queryable State Server @ null org.apache.flink.queryablestate.network.AbstractServerBase.shutdownServer(AbstractServerBase.java:288)
[2019-02-16 11:17:01,558] INFO Queryable State Server was shutdown successfully. org.apache.flink.queryablestate.server.KvStateServerImpl.shutdown(KvStateServerImpl.java:107)
[2019-02-16 11:17:01,559] ERROR Error while starting up taskManager grizzled.slf4j.Logger.error(slf4j.scala:116)
java.io.IOException: Failed to start the Queryable State Data Server.
at org.apache.flink.runtime.io.network.NetworkEnvironment.start(NetworkEnvironment.java:319)
at org.apache.flink.runtime.taskexecutor.TaskManagerServices.fromConfiguration(TaskManagerServices.java:240)
at org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:2023)
at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1854)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$1.apply$mcV$sp(TaskManager.scala:1964)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$1.apply(TaskManager.scala:1942)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$1.apply(TaskManager.scala:1942)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.flink.runtime.akka.AkkaUtils$.retryOnBindException(AkkaUtils.scala:766)
at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1942)
at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1713)
at org.apache.flink.runtime.taskmanager.TaskManager.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala)
at org.apache.flink.yarn.YarnTaskManagerRunnerFactory$Runner.call(YarnTaskManagerRunnerFactory.java:70)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.YarnTaskManager$.main(YarnTaskManager.scala:78)
at org.apache.flink.yarn.YarnTaskManager.main(YarnTaskManager.scala)
Caused by: org.apache.flink.util.FlinkRuntimeException: Unable to start Queryable State Server. All ports in provided range are occupied.
at org.apache.flink.queryablestate.network.AbstractServerBase.start(AbstractServerBase.java:198)
at org.apache.flink.queryablestate.server.KvStateServerImpl.start(KvStateServerImpl.java:95)
at org.apache.flink.runtime.io.network.NetworkEnvironment.start(NetworkEnvironment.java:315)
... 18 more
---
Solution:
Add the following to flink-conf.yaml [reference: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/queryable_state.html#Configuration
Related class: org.apache.flink.configuration.QueryableStateOptions]
query.proxy.ports: 50100-50200,50300-59900,59999
query.server.ports: 50100-50200,50300-59900,59999
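The TaskManager failed because its only candidate port (9067) was already taken, and the values above fix that by offering a much larger set of ports. The Python sketch below shows how many ports such a value covers and how to look for a free one on a host. It assumes the value is a comma-separated list of single ports and `lo-hi` ranges, as in the lines above. All function names are invented for this example and are not part of Flink.

```python
import socket


def parse_port_ranges(spec):
    """Parse '50100-50200,59999' into a list of inclusive (lo, hi) tuples."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ranges.append((int(lo), int(hi)))
        else:
            ranges.append((int(part), int(part)))
    return ranges


def count_ports(ranges):
    """Number of candidate ports covered by the inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def first_free_port(ranges, host="0.0.0.0"):
    """Return the first bindable port in the ranges, or None if all are taken."""
    for lo, hi in ranges:
        for port in range(lo, hi + 1):
            if is_port_free(port, host):
                return port
    return None


ranges = parse_port_ranges("50100-50200,50300-59900,59999")
print(ranges)               # [(50100, 50200), (50300, 59900), (59999, 59999)]
print(count_ports(ranges))  # 9703
```

Running `is_port_free(9067)` on the affected NodeManager host is a quick way to confirm that the port was occupied. On a shared host, another TaskManager on the same machine is a likely cause.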