Common Errors with Flink on YARN

License: CC 4.0 BY-SA

Tags: #flink #technical-details

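This article lists five common errors when running Flink on YARN: 1) failure to connect to the ResourceManager: start the Hadoop environment first; 2) unable to get the ClusterClient status, possibly because of insufficient resources: disable Hadoop's virtual memory check or reduce the allocated memory; 3) failure to instantiate a user function: increase the resource parameters Flink is started with; 4) failure to resolve an Akka configuration substitution: this needs to be configured in the maven-shade-plugin; 5) NoClassDefFoundError, possibly caused by a classpath conflict: check and clean up YARN's lib directory.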

1 Retrying connect to server
2 Unable to get ClusterClient status from Application Client
3 Cannot instantiate user function
4 Could not resolve substitution to a value: ${akka.stream.materializer}
5 java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer

1 Retrying connect to server

Flink on YARN depends on a Hadoop cluster. Suppose you run the Flink startup command before Hadoop has been started:

./bin/yarn-session.sh -n 1 -jm 1024 -tm 4096

Flink cannot reach the ResourceManager, and the script hangs while it keeps retrying the connection:

2018-05-19 14:36:08,062 INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032  
2018-05-19 14:36:09,231 INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)  
2018-05-19 14:36:10,234 INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)  
2018-05-19 14:36:11,235 INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)  
2018-05-19 14:36:12,238 INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)  
2018-05-19 14:36:13,240 INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)  
2018-05-19 14:36:14,247 INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

So do not rush: bring the Hadoop environment up first, then start Flink.
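
As a hedged sketch of that advice, the wrapper below retries a probe command until it succeeds and only then starts the session. The function name `retry_until`, the `nc -z` probe, and the host `hadoop100` are assumptions for illustration (8032 is the port shown in the log above); substitute whatever check fits your cluster.

```shell
#!/bin/sh
# retry_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping DELAY_SECONDS between attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
# Note: POSIX sh has no `local`, so these variables are global.
retry_until() {
    tries=$1
    delay=$2
    shift 2
    attempt=0
    while [ "$attempt" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (commented out; host, port and the `nc` probe are assumptions):
# if retry_until 30 2 nc -z hadoop100 8032; then
#     ./bin/yarn-session.sh -n 1 -jm 1024 -tm 4096
# else
#     echo "ResourceManager is not reachable; start Hadoop first." >&2
# fi
```

The probe is passed in as a command so the same helper works with any check that exits 0 once the ResourceManager is reachable.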

2 Unable to get ClusterClient status from Application Client

Hadoop is now running, so run the Flink startup command again:

./bin/yarn-session.sh -n 1 -jm 1024 -tm 4096

Flink still fails to start:

2018-05-19 15:30:10,456 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@hadoop100:55053] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2018-05-19 15:30:21,680 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Could not retrieve the current cluster status. Skipping current retrieval attempt ...
java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
        at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:253)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:443)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:720)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:514)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:511)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:511)
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
        at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:862)
        at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248)
        ... 9 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)
        at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:857)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
        at scala.concurrent.Await.result(package.scala)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)
        ... 11 more
2018-05-19 15:30:21,691 WARN  org.apache.flink.yarn.YarnClusterClient                       - YARN reported application state FAILED
2018-05-19 15:30:21,692 WARN  org.apache.flink.yarn.YarnClusterClient                       - Diagnostics: Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with  exitCode: -103
For more detailed output, check application tracking page:http://hadoop100:8088/cluster/app/application_1521277661809_0006Then, click on links to logs of each attempt.
Diagnostics: Container [pid=6386,containerID=container_1521277661809_0006_01_000001] is running beyond virtual memory limits. Current usage: 250.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1521277661809_0006_01_000001 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
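
The diagnostics at the end of this log point at the cause: the ApplicationMaster container was given 1 GB of physical memory, used 2.2 GB of virtual memory, and was killed for exceeding the 2.1 GB virtual memory limit. YARN derives that limit by multiplying the container's physical memory by `yarn.nodemanager.vmem-pmem-ratio` (default 2.1), and enforces it when `yarn.nodemanager.vmem-check-enabled` is true (the default). A minimal sketch of the arithmetic, where the helper name `vmem_limit_mb` is hypothetical:

```shell
#!/bin/sh
# vmem_limit_mb PHYSICAL_MB RATIO
# Prints the virtual memory ceiling (in MB, rounded to the nearest integer)
# for a container with PHYSICAL_MB of physical memory, given the value of
# yarn.nodemanager.vmem-pmem-ratio.
vmem_limit_mb() {
    awk -v pmem="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", pmem * ratio }'
}

# Example: the 1024 MB JobManager container from `-jm 1024`
# with the default ratio of 2.1.
# vmem_limit_mb 1024 2.1   # prints 2150
```

With `-jm 1024` the ceiling is about 2150 MB (2.1 GB), which matches the limit reported in the log. Setting `yarn.nodemanager.vmem-check-enabled` to `false` in `yarn-site.xml`, or raising the ratio, and then restarting the NodeManagers is one way to stop the container from being killed; treat the exact values as something to tune for your own cluster.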
