Yarn application has already ended! It might have been killed or unable to launch application master

Environment: Ambari + HDP 2.7.3

Background: the NameNode server hit a fault and was rebooted.

Symptom: a PySpark script that used to run fine now fails with "Yarn application has already ended! It might have been killed or unable to launch application master".

Troubleshooting:

1. Restarted YARN from Ambari; the problem persisted.

2. Restarted HDFS from Ambari; the problem persisted.

3. Restarted Spark from Ambari; the problem persisted.

4. Wrote a test script and ran Spark in local mode. It ran fine, confirming the problem was on the YARN side.

5. Ran Ambari's "Run Service Check" against YARN, which produced:

  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
    tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 303, in _call
    raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of 'yarn org.apache.hadoop.yarn.applications.distributedshell.Client -shell_command ls -num_containers 1 -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -timeout 300000 --queue default' returned 2. 19/01/25 06:03:17 INFO distributedshell.Client: Initializing Client
19/01/25 06:03:17 INFO distributedshell.Client: Running Client
19/01/25 06:03:17 INFO client.RMProxy: Connecting to ResourceManager at lntbdnn1.lnt/10.250.10.67:8050
19/01/25 06:03:17 INFO client.AHSProxy: Connecting to Application History server at lntbddn2.lnt/10.250.10.69:10200
19/01/25 06:03:17 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=4
19/01/25 06:03:17 INFO distributedshell.Client: Got Cluster node info from ASM
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbddn1:45454, nodeAddresslntbddn1:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbdnn1:45454, nodeAddresslntbdnn1:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbddn3:45454, nodeAddresslntbddn3:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Got node report from ASM for, nodeId=lntbddn2:45454, nodeAddresslntbddn2:8042, nodeRackName/default-rack, nodeNumContainers0
19/01/25 06:03:17 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.0, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=llap, userAcl=SUBMIT_APPLICATIONS
19/01/25 06:03:17 INFO distributedshell.Client: User ACL Info for Queue, queueName=llap, userAcl=ADMINISTER_QUEUE
19/01/25 06:03:17 INFO distributedshell.Client: Max mem capability of resources in this cluster 98304
19/01/25 06:03:17 INFO distributedshell.Client: Max virtual cores capabililty of resources in this cluster 25
19/01/25 06:03:17 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
19/01/25 06:03:18 INFO distributedshell.Client: Set the environment for the application master
19/01/25 06:03:18 INFO distributedshell.Client: Setting up app master command
19/01/25 06:03:18 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx100m 

Then I glanced at the local time: my machine said 14:20, but checking each server showed that the master server's clock was 8 hours behind. After correcting the master server's time and rerunning the script, everything worked; problem solved.
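The root cause was clock skew between nodes, which breaks the time-sensitive security tokens YARN and HDFS exchange and can kill the Application Master at launch. A quick local check is sketched below with standard Linux tooling; the timezone is an assumption based on the 8-hour offset seen above, and the `timedatectl` commands are left commented because they need a running systemd and root privileges.

```shell
# Print this host's clock and timezone; run the same command on every node
# (e.g. over ssh in a loop) and compare the outputs side by side.
date '+%Y-%m-%d %H:%M:%S %Z'

# On systemd hosts, show whether NTP synchronization is active:
# timedatectl status

# To fix a drifted node and keep it synced (Asia/Shanghai is an assumption):
# timedatectl set-timezone Asia/Shanghai
# timedatectl set-ntp true
```

Running this on all four nodes here would have surfaced the master's 8-hour drift immediately, before any service restarts.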

### Why `spark-submit --master yarn --deploy-mode client` fails to start on Linux

When an application submitted on Linux with `spark-submit --master yarn --deploy-mode client` fails to start, several causes are common. Typical errors and their fixes:

#### 1. Insufficient YARN resources

The YARN cluster may not have enough free resources to schedule the submitted application. This usually shows up as the application being rejected or stuck waiting for a long time.

- **Fix**: adjust the requested memory and other resource parameters to fit what the cluster can actually provide, for example by reducing the memory allocated to the driver and executors.

#### 2. Incorrect Hadoop configuration

A wrong HDFS address or other missing Hadoop settings can prevent the client from reaching the NameNode and break the whole submission flow.

- **Fix**: check that the directory pointed to by HADOOP_CONF_DIR contains valid core-site.xml and hdfs-site.xml files, and confirm that the service ports defined there are reachable over the network.

#### 3. Spark version incompatibility

Different versions can have API changes or dependency differences, particularly in their support for specific features.

- **Fix**: make sure the Apache Spark version matches the target platform (CDH, HDP, etc.); upgrade or downgrade to a matching version if necessary.

#### 4. Application Master fails to launch

Sometimes the client side looks healthy, but the Application Master never initializes, which terminates the whole job.

- **Error message**: `org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.`
- **Fix**: check the YARN ResourceManager logs for a more detailed error, and look for permission problems or anything else blocking AM creation:

```bash
yarn logs -applicationId <your_application_id>
```

These steps cover most failures caused by misconfiguration when submitting in `spark-submit --master yarn --deploy-mode client` mode.
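Putting the first cause above (insufficient YARN resources) into practice, a resubmission with explicitly reduced resource requests might look like the sketch below. The flag values and the script name `your_app.py` are illustrative placeholders, not taken from the original job; tune them to your queue's free capacity.

```bash
# Smaller driver/executor heaps so the request fits the queue's free capacity.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --executor-cores 1 \
  your_app.py
```

If the application still sits in the ACCEPTED state, compare these requests against the queue capacity shown in the ResourceManager UI before lowering them further.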