Spark Execution Optimization: Uploading Dependencies to HDFS (Using spark.yarn.jars and spark.yarn.archive)

1. Overview

When a Spark application is submitted through YARN with neither spark.yarn.archive nor spark.yarn.jars configured, the client logs "Neither spark.yarn.jars nor spark.yarn.archive is set" and then uploads local jar after local jar to HDFS, as in the excerpt below. This process can be very time-consuming. Setting spark.yarn.archive or spark.yarn.jars in spark-defaults.conf (see the snippet after the excerpt) shortens the application's startup time.

 Will allocate AM container, with 896 MB memory including 384 MB overhead
2020-12-01 11:16:11 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 11:16:11 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 11:16:11 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 11:16:12 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2020-12-01 11:16:14 INFO  Client:54 - Uploading resource file:/tmp/spark-897c6291-e0bd-47e6-8d42-7f67225c4819/__spark_libs__5294834939010995385.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/__spark_libs__5294834939010995385.zip
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/wordcount.jar
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/zookeeper-3.4.6.jar
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xz-1.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/xz-1.0.jar
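As a preview of the fix, either of the following lines in spark-defaults.conf avoids the repeated uploads; sections 3 and 4 below walk through each option step by step (the HDFS URL is this cluster's NameNode and must be adapted to yours):

spark.yarn.jars     hdfs://hadoop122:9000/spark-yarn/jars/*.jar
spark.yarn.archive  hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip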

2. How the official Spark documentation describes these two settings

From the Running on YARN page of the official documentation (Spark 2.3.x):

spark.yarn.jars — List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive — An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

3. Using spark.yarn.jars

3.1 Upload all the jars under the Spark home's jars directory to HDFS
 hadoop fs -mkdir -p  /spark-yarn/jars
 hadoop fs -put /opt/module/spark-2.3.2-bin-hadoop2.7/jars/* /spark-yarn/jars/
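A quick listing confirms the jars landed where spark.yarn.jars will point:

 hadoop fs -ls /spark-yarn/jars | head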
3.2 Edit spark-defaults.conf
spark.yarn.jars hdfs://hadoop122:9000/spark-yarn/jars/*.jar
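The same setting can also be passed per submission instead of globally; a sketch, where "..." stands for the usual class, mode, and jar arguments (quote the value so the shell does not try to expand the glob locally):

 spark-submit --conf "spark.yarn.jars=hdfs://hadoop122:9000/spark-yarn/jars/*.jar" ...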
3.3 Result
2020-12-01 13:53:52 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 13:53:52 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 13:53:52 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 13:53:53 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/JavaEWAH-0.3.2.jar
2020-12-01 13:53:53 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/RoaringBitmap-0.5.11.jar
3.4 Possible errors

ERROR client.TransportClient: Failed to send RPC

Caused by: java.io.IOException: Failed to send RPC 5353749227723805834 to /192.168.10.122:58244: java.nio.channels.ClosedChannelException
	at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
	at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

The ClosedChannelException means the connection was torn down unexpectedly; in practice this is usually YARN killing the container for exceeding its physical or virtual memory limits, which from the client side looks like a timeout. The same error can also appear when running spark-shell --master yarn-client. Adding the following to yarn-site.xml works around it (a note on distributing this change follows the snippet):

<property>
		<name>yarn.nodemanager.pmem-check-enabled</name>
		<value>false</value>
</property>
<property>
		<name>yarn.nodemanager.vmem-check-enabled</name>
		<value>false</value>
</property>
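These checks are enforced by the NodeManagers, so the change must reach every node's yarn-site.xml and YARN must be restarted. A minimal sketch, assuming a second node named hadoop123 (the host name and $HADOOP_HOME are assumptions, not from the original setup):

 # push the updated config to the other node(s), then restart YARN
 scp $HADOOP_HOME/etc/hadoop/yarn-site.xml hadoop123:$HADOOP_HOME/etc/hadoop/
 stop-yarn.sh && start-yarn.sh

Note that disabling the memory checks is a workaround; raising the container memory limits is the more careful fix.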

4. Using spark.yarn.archive

4.1 Zip all the jars under the Spark home's jars directory and upload the archive to HDFS

When packaging, make sure every jar ends up at the root of the zip archive (a quick check follows the commands below).

cd /opt/module/spark-2.3.2-bin-hadoop2.7/jars/
zip -q -r spark_jars_2.3.2.zip *
hadoop fs -mkdir /spark-yarn/zip
hadoop fs -put spark_jars_2.3.2.zip /spark-yarn/zip/
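To verify the layout requirement, list the archive; every entry should be a bare jar name with no directory prefix:

 unzip -l spark_jars_2.3.2.zip | head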
4.2 Edit spark-defaults.conf
spark.yarn.archive hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip
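As with spark.yarn.jars, the archive can be supplied per submission; a sketch, where "..." stands for the remaining submit arguments:

 spark-submit --conf spark.yarn.archive=hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip ...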
4.3 Result
2020-12-01 14:41:53 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 14:41:53 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 14:41:53 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 14:41:54 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip
2020-12-01 14:41:54 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/wordcount.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/zstd-jni-1.3.2-2.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/zookeeper-3.4.6.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xz-1.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xz-1.0.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xmlenc-0.52.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xmlenc-0.52.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xml-apis-1.3.04.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xml-apis-1.3.04.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xercesImpl-2.9.1.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xercesImpl-2.9.1.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xbean-asm5-shaded-4.4.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xbean-asm5-shaded-4.4.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/spark-core_2.11-2.3.2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/spark-core_2.11-2.3.2.jar
4.4 Possible errors

From the application's driver log:

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

If the archive is built as follows, the directory hierarchy is preserved inside the zip, and the error above is the result:

zip -q -r spark_jars_2.3.2.zip /opt/module/spark-2.3.2-bin-hadoop2.7/jars/*

[Screenshot: the zip's entries carry the opt/module/spark-2.3.2-bin-hadoop2.7/jars/ directory prefix instead of sitting at the root]

5. Comparing the results

The official Spark documentation makes two relevant statements about these settings.

Under Preparations: if neither spark.yarn.archive nor spark.yarn.jars is specified, Spark creates a zip file containing all the jars under $SPARK_HOME/jars and uploads it to the distributed cache.

Under spark.yarn.archive: if both parameters are configured, spark.yarn.archive takes precedence; it replaces spark.yarn.jars, and the archive is used in all of the application's containers.

To make the differences easier to see, I submitted a yarn-cluster job under each of the configurations below and compared the console logs (a representative submit command is sketched next).
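A submission along these lines produces logs like the ones below; the main class name is hypothetical, and passing the lib/ jars via --jars is my assumption about how the dependency uploads seen in the logs were triggered:

 spark-submit \
   --class com.example.WordCount \
   --master yarn \
   --deploy-mode cluster \
   --jars $(echo /home/workspace/wordcount/lib/*.jar | tr ' ' ',') \
   /home/workspace/wordcount/wordcount.jar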

5.1 Neither spark.yarn.jars nor spark.yarn.archive configured

Uploaded: __spark_conf__xxx.zip, __spark_libs__xxx.zip, the dependency jars, and wordcount.jar

2020-12-02 14:40:01 INFO  Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020-12-02 14:40:01 INFO  Client:54 - Setting up container launch context for our AM
2020-12-02 14:40:01 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-02 14:40:01 INFO  Client:54 - Preparing resources for our AM container
2020-12-02 14:40:02 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2020-12-02 14:40:04 INFO  Client:54 - Uploading resource file:/tmp/spark-343f5dda-9476-4330-b2ed-407ec6aa00e9/__spark_libs__2239070030081220213.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/__spark_libs__2239070030081220213.zip
2020-12-02 14:40:07 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/wordcount.jar
2020-12-02 14:40:07 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/zstd-jni-1.3.2-2.jar
2020-12-02 14:40:07 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/zookeeper-3.4.6.jar
...
...
2020-12-02 14:40:14 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/aopalliance-repackaged-2.4.0-b34.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/aopalliance-repackaged-2.4.0-b34.jar
2020-12-02 14:40:14 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/activation-1.1.1.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/activation-1.1.1.jar
2020-12-02 14:40:14 INFO  Client:54 - Uploading resource file:/tmp/spark-343f5dda-9476-4330-b2ed-407ec6aa00e9/__spark_conf__7169080322038125856.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0003/__spark_conf__.zip
2020-12-02 14:40:14 INFO  SecurityManager:54 - Changing view acls to: root
...
5.2 spark.yarn.jars configured

Uploaded: __spark_conf__xxx.zip and wordcount.jar; dependency jars whose file names match jars already in /spark-yarn/jars are skipped with a "Same name resource ... added multiple times" warning, while the remaining ones are still uploaded

2020-12-02 14:47:07 INFO  Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020-12-02 14:47:07 INFO  Client:54 - Setting up container launch context for our AM
2020-12-02 14:47:07 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-02 14:47:07 INFO  Client:54 - Preparing resources for our AM container
2020-12-02 14:47:08 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/JavaEWAH-0.3.2.jar
2020-12-02 14:47:08 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/RoaringBitmap-0.5.11.jar
2020-12-02 14:47:08 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/ST4-4.0.4.jar
2020-12-02 14:47:08 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/activation-1.1.1.jar
...
...
2020-12-02 14:47:08 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/zookeeper-3.4.6.jar
2020-12-02 14:47:08 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/zstd-jni-1.3.2-2.jar
2020-12-02 14:47:08 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0004/wordcount.jar
2020-12-02 14:47:09 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar added multiple times to distributed cache
2020-12-02 14:47:09 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/zookeeper-3.4.6.jar added multiple times to distributed cache
...
...
2020-12-02 14:47:09 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/scala-parser-combinators_2.11-1.0.4.jar added multiple times to distributed cache
2020-12-02 14:47:09 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/scalap-2.11.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0004/scalap-2.11.0.jar
2020-12-02 14:47:09 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/scala-logging_2.11-3.5.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0004/scala-logging_2.11-3.5.0.jar
2020-12-02 14:47:09 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/scala-library-2.11.12.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0004/scala-library-2.11.12.jar
2020-12-02 14:47:09 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/scala-compiler-2.11.12.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0004/scala-compiler-2.11.12.jar
...
...
2020-12-02 14:47:11 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/aopalliance-repackaged-2.4.0-b34.jar added multiple times to distributed cache
2020-12-02 14:47:11 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/activation-1.1.1.jar added multiple times to distributed cache
2020-12-02 14:47:11 INFO  Client:54 - Uploading resource file:/tmp/spark-b763698e-77bd-4004-8d71-c0eca5a1006d/__spark_conf__5511617263425666988.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0004/__spark_conf__.zip
2020-12-02 14:47:11 INFO  SecurityManager:54 - Changing view acls to: root
5.3 spark.yarn.archive configured

Uploaded: __spark_conf__xxx.zip, the dependency jars, and wordcount.jar

2020-12-02 14:53:43 INFO  Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020-12-02 14:53:43 INFO  Client:54 - Setting up container launch context for our AM
2020-12-02 14:53:43 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-02 14:53:43 INFO  Client:54 - Preparing resources for our AM container
2020-12-02 14:53:44 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip
2020-12-02 14:53:44 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0005/wordcount.jar
2020-12-02 14:53:44 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0005/zstd-jni-1.3.2-2.jar
2020-12-02 14:53:45 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0005/zookeeper-3.4.6.jar
...
...
2020-12-02 14:53:51 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/aopalliance-repackaged-2.4.0-b34.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0005/aopalliance-repackaged-2.4.0-b34.jar
2020-12-02 14:53:51 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/activation-1.1.1.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0005/activation-1.1.1.jar
2020-12-02 14:53:51 INFO  Client:54 - Uploading resource file:/tmp/spark-0ce3c5f1-6083-499a-b5cd-1a5700e74bf3/__spark_conf__6154114087017136415.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0005/__spark_conf__.zip
2020-12-02 14:53:51 INFO  SecurityManager:54 - Changing view acls to: root
5.4 Both spark.yarn.jars and spark.yarn.archive configured

Uploaded: __spark_conf__xxx.zip, the dependency jars, and wordcount.jar

2020-12-02 14:59:08 INFO  Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020-12-02 14:59:08 INFO  Client:54 - Setting up container launch context for our AM
2020-12-02 14:59:08 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-02 14:59:08 INFO  Client:54 - Preparing resources for our AM container
2020-12-02 14:59:09 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip
2020-12-02 14:59:09 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0006/wordcount.jar
2020-12-02 14:59:10 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0006/zstd-jni-1.3.2-2.jar
2020-12-02 14:59:10 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0006/zookeeper-3.4.6.jar
...
...
2020-12-02 14:59:16 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/aopalliance-repackaged-2.4.0-b34.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0006/aopalliance-repackaged-2.4.0-b34.jar
2020-12-02 14:59:16 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/activation-1.1.1.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0006/activation-1.1.1.jar
2020-12-02 14:59:16 INFO  Client:54 - Uploading resource file:/tmp/spark-3451a4e5-fb97-45a5-85dc-36f645fb7db3/__spark_conf__4794321975937827120.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0006/__spark_conf__.zip
2020-12-02 14:59:16 INFO  SecurityManager:54 - Changing view acls to: root
5.5 spark.yarn.jars configured, with all of the application's dependency jars also uploaded to /spark-yarn/jars/ on HDFS

Uploaded: __spark_conf__xxx.zip and wordcount.jar
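The only step added on top of 5.2 is pushing the application's own dependency jars into the same HDFS directory as the Spark jars (lib path taken from the logs above):

 hadoop fs -put /home/workspace/wordcount/lib/*.jar /spark-yarn/jars/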

2020-12-02 15:19:30 INFO  Client:54 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020-12-02 15:19:30 INFO  Client:54 - Setting up container launch context for our AM
2020-12-02 15:19:30 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-02 15:19:30 INFO  Client:54 - Preparing resources for our AM container
2020-12-02 15:19:31 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/JavaEWAH-0.3.2.jar
2020-12-02 15:19:31 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/RoaringBitmap-0.5.11.jar
2020-12-02 15:19:31 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/ST4-4.0.4.jar
2020-12-02 15:19:31 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/activation-1.1.1.jar
...
...
2020-12-02 15:19:31 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/zookeeper-3.4.6.jar
2020-12-02 15:19:31 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/zstd-jni-1.3.2-2.jar
2020-12-02 15:19:31 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0007/wordcount.jar
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar added multiple times to distributed cache
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/zookeeper-3.4.6.jar added multiple times to distributed cache
...
...
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/scala-parser-combinators_2.11-1.0.4.jar added multiple times to distributed cache
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/scalap-2.11.0.jar added multiple times to distributed cache
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/scala-logging_2.11-3.5.0.jar added multiple times to distributed cache
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/scala-library-2.11.12.jar added multiple times to distributed cache
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/scala-compiler-2.11.12.jar added multiple times to distributed cache
...
...
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/aopalliance-repackaged-2.4.0-b34.jar added multiple times to distributed cache
2020-12-02 15:19:31 WARN  Client:66 - Same name resource file:///home/workspace/wordcount/lib/activation-1.1.1.jar added multiple times to distributed cache
2020-12-02 15:19:32 INFO  Client:54 - Uploading resource file:/tmp/spark-e9eca080-9f38-4fe5-ae5f-663a3e54c718/__spark_conf__3182593855268881202.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606887365975_0007/__spark_conf__.zip
2020-12-02 15:19:32 INFO  SecurityManager:54 - Changing view acls to: root
Conclusion:

Approach 5.5 (spark.yarn.jars configured, with all of the application's dependency jars also uploaded to /spark-yarn/jars/ on HDFS) reduces resource uploads the most: only __spark_conf__.zip and the application jar itself are uploaded at submit time.
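Putting it together, the recommended setup is a one-time upload of both the Spark jars and the application's dependencies, plus one line in spark-defaults.conf (paths and the NameNode URL as used throughout this article; adapt them to your cluster):

 hadoop fs -mkdir -p /spark-yarn/jars
 hadoop fs -put /opt/module/spark-2.3.2-bin-hadoop2.7/jars/* /spark-yarn/jars/
 hadoop fs -put /home/workspace/wordcount/lib/*.jar /spark-yarn/jars/

 # spark-defaults.conf
 spark.yarn.jars hdfs://hadoop122:9000/spark-yarn/jars/*.jar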

