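Spark-on-YARN: the jar upload problem
While spark-submit runs, Spark's jars are uploaded to HDFS under the submitting user's .sparkStaging directory (for example /user/hadoop/.sparkStaging) and are cleaned up once the job finishes. This upload step is fairly time-consuming.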
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
When neither spark.yarn.jars nor spark.yarn.archive is set, Spark uploads all of its jars on every submit, packed into a __spark_libs__ zip:
INFO yarn.Client: Uploading resource file:/tmp/spark-668107c8-8b33-46ba-abea-ec3d6ccf12ef/__spark_libs__1763828378893967375.zip -> hdfs://hadoop001:9000/user/wzj/.sparkStaging/application_1585137346352_0005/__spark_libs__1763828378893967375.zip
INFO yarn.Client: Uploading resource file:/tmp/spark-668107c8-8b33-46ba-abea-ec3d6ccf12ef/__spark_conf__1888492531721785739.zip -> hdfs://hadoop001:9000/user/wzj/.sparkStaging/application_1585137346352_0005/__spark_conf__.zip
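One way to tell in advance whether a submit will hit this fallback is to check if either property already appears in spark-defaults.conf. The sketch below is only an illustration: the function name check_yarn_jars_conf is hypothetical, it assumes SPARK_HOME is exported and the default conf location is used, and it does not see properties passed with --conf on the command line.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints "set" if spark.yarn.jars or spark.yarn.archive
# appears as an uncommented entry in the given properties file, else "unset".
check_yarn_jars_conf() {
  local conf_file="$1"
  # Keys may be followed by whitespace, '=' or ':' in a properties file.
  if grep -Eq '^[[:space:]]*spark\.yarn\.(jars|archive)[[:space:]=:]' "$conf_file" 2>/dev/null; then
    echo "set"
  else
    echo "unset"
  fi
}

# Assumption: the default conf file under SPARK_HOME is the one in use.
check_yarn_jars_conf "${SPARK_HOME:-/nonexistent}/conf/spark-defaults.conf"
```

If this prints "unset" and nothing is passed via --conf, the WARN line above is to be expected.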
Optimization
1. Create a directory on HDFS and upload all of Spark's jars to it
[wzj@hadoop001 logs]$ hadoop fs -mkdir -p /spark/j

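This post looks at the jar upload problem in a Spark-on-YARN environment and notes that the upload takes a long time. When neither `spark.yarn.jars` nor `spark.yarn.archive` is set, Spark uploads all of its jars to HDFS. To optimize this, create a dedicated directory on HDFS and upload the required jars to it ahead of time. In practice, this optimization saved about ten seconds of run time.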
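The remaining steps are not shown in the text, so the following is only a sketch of how the pre-upload and the matching setting might look. The function names and the script name are hypothetical, the HDFS directory is passed in as an argument (use whatever directory you created in step 1), and the `hdfs:///` form assumes the cluster's default filesystem is the target NameNode. `spark.yarn.jars` accepts globs, which is why the generated value ends in `*.jar`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the spark-defaults.conf line for an absolute
# HDFS directory (e.g. "/some/dir" becomes "hdfs:///some/dir/*.jar").
spark_yarn_jars_line() {
  local hdfs_dir="${1%/}"   # drop one trailing slash, if present
  printf 'spark.yarn.jars hdfs://%s/*.jar\n' "$hdfs_dir"
}

# Hypothetical helper: copies the local Spark jars into the HDFS directory.
# Assumes SPARK_HOME is exported and the hadoop CLI is on PATH.
upload_spark_jars() {
  local hdfs_dir="$1"
  hadoop fs -mkdir -p "$hdfs_dir"
  hadoop fs -put -f "$SPARK_HOME"/jars/*.jar "$hdfs_dir"/
}

# Runs only when a target directory is given on the command line.
if [ "$#" -ge 1 ]; then
  upload_spark_jars "$1"
  echo "Add this line to \$SPARK_HOME/conf/spark-defaults.conf:"
  spark_yarn_jars_line "$1"
fi
```

Saved as, say, upload_spark_jars.sh, it would be run once with the HDFS directory as its only argument. After the printed line is added to spark-defaults.conf, later submits should no longer log the "Neither spark.yarn.jars nor spark.yarn.archive is set" warning or upload the __spark_libs__ zip.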