submit Spark code to YARN

- Configure Eclipse: install the Scala IDE plugin and the m2e-scala plugin (update site: http://alchim31.free.fr/m2e-scala/update-site/).
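If the build goes through Maven, the Scala sources also need a compiler plugin wired into the lifecycle; a minimal pom.xml sketch, assuming the scala-maven-plugin (from the same author as the m2e-scala update site above) with an illustrative version:

```xml
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>3.4.1</version> <!-- illustrative version -->
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```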

- Configure Spark in code to submit to the remote YARN cluster:

```scala
val sparkConf = new SparkConf()
  .setAppName(s"Bulk Import $manualNbr")
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client") // "deploy-mode" is not a valid conf key
  // .setJars(Seq("hdfs://ubuntu1:50071/jarcache/spark-assembly-2.3.0.jar",
  //              "hdfs://ubuntu1:50071/jarcache/spark-2.3.0-yarn-shuffle.jar"))
  .set("spark.yarn.jars", "hdfs://ubuntu1:50071/cache/spark/import-0.0.1-SNAPSHOT.jar")
  .set("spark.executor.memory", "512m")
  .set("spark.driver.host", "192.168.1.105")
  .set("spark.yarn.historyServer.address", "http://spark_history_server:history_port")

// Point the in-process Hadoop client at the cluster configuration files
System.setProperty("HADOOP_CONF_DIR", "\\configuration\\src\\main\\resources")
System.setProperty("YARN_CONF_DIR", "\\configuration\\src\\main\\resources")
```

Make sure the Hadoop and YARN configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, etc.) are placed under HADOOP_CONF_DIR and YARN_CONF_DIR.

- Make sure no firewall is enabled between the cluster and the driver machine, so the YARN ApplicationMaster can talk back to the driver (otherwise the job fails with "cannot connect to driver").

- Make sure YARN has enough memory and vcores configured; otherwise no container can be allocated to start the ApplicationMaster or the requested executors (this also surfaces as "cannot connect to driver"):

<configuration>
<!--
<property>
    <description>Amount of physical memory, in MB, that can be allocated 
    for containers.</description>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>

<property>
    <description>Ratio between virtual memory to physical memory when
    setting memory limits for containers. Container allocations are
    expressed in terms of physical memory, and virtual memory usage
    is allowed to exceed this allocation by this ratio.
    </description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
  </property>

<property>
    <description>Number of vcores that can be allocated
    for containers. This is used by the RM scheduler when allocating
    resources for containers. This is not used to limit the number of
    physical cores used by YARN containers.</description>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>1</value>
  </property>-->

<property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>ubuntu1</value>
  </property>   
  
  <property>
    <description>The address of the applications manager interface in the RM.</description>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
  </property>
  
  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>

<property>
    <description>Ratio between virtual memory to physical memory when
    setting memory limits for containers. Container allocations are
    expressed in terms of physical memory, and virtual memory usage
    is allowed to exceed this allocation by this ratio.
    </description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>3</value>
  </property>

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
<!-- Site specific YARN configuration properties -->
<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
		<value>org.apache.hadoop.mapred.ShuffleHandler</value>
	</property>
	<property>
		<name>yarn.nodemanager.log-dirs</name>
		<value>file:///home//bigdata/hadoop-2.7.6/userlog</value>
		<final>true</final>
	</property>
	<property>
		<name>yarn.nodemanager.local-dirs</name>
		<value>file:///home//bigdata/hadoop-2.7.6/temp/nm-local-dir</value>
	</property>
	<property>
		<name>yarn.log.server.url</name>
		<value>...</value>
	</property>
	<property>
		<name>yarn.nodemanager.delete.debug-delay-sec</name>
		<value>600</value>
	</property>
	<!--<property>
		<name>yarn.application.classpath</name>
		<value>file:///home//bigdata/hadoop-2.7.6/,file:///home//bigdata/hadoop-2.7.6/share/hadoop/common/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/common/lib/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/hdfs/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/hdfs/lib/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/mapreduce/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/mapreduce/lib/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/yarn/*,file:///home//bigdata/hadoop-2.7.6/share/hadoop/yarn/lib/*</value>
	</property>-->

</configuration>
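With yarn.scheduler.minimum-allocation-mb at 512, it is easy to underestimate what a "512m" executor actually costs the cluster. A rough sketch (not actual Spark/YARN source), assuming Spark 2.x defaults of max(384 MB, 10% of executor memory) for memory overhead, with YARN rounding each request up to a multiple of the minimum allocation:

```python
import math

def container_size_mb(executor_mb, min_alloc_mb=512):
    # Spark asks YARN for executor memory plus an overhead (assumed defaults:
    # at least 384 MB, or 10% of the executor memory if that is larger).
    overhead = max(384, int(executor_mb * 0.10))
    requested = executor_mb + overhead
    # YARN's scheduler allocates in increments of the minimum allocation.
    return math.ceil(requested / min_alloc_mb) * min_alloc_mb

print(container_size_mb(512))   # 512 + 384 = 896 MB requested -> 1024 MB container
```

So a 512m executor ends up occupying a full 1024 MB container under this configuration, which matters when deciding how many executors a small node can host.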

- Resolve dependency conflicts with Maven dependency exclusions.
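For example, when two libraries drag in incompatible versions of the same transitive dependency, the conflicting copy can be excluded where it enters the graph; the artifact and version below are illustrative, chosen to match the Spark 2.3.0 jars referenced above:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.0</version>
  <exclusions>
    <exclusion>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```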

- Assemble the application into a single fat jar with the maven-shade-plugin:

<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<executions>
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<filters>
								<filter>
									<artifact>*:*</artifact>
									<excludes>
										<exclude>META-INF/*.SF</exclude>
										<exclude>META-INF/*.DSA</exclude>
										<exclude>META-INF/*.RSA</exclude>
										<exclude>META-INF/LICENSE</exclude>
										<exclude>LICENSE</exclude> <!-- not required if it duplicates the entry above -->
										<exclude>/*.png</exclude>
										<exclude>/*.html</exclude>
										<exclude>/*.jpeg</exclude>
									</excludes>
								</filter>
							</filters>
							<artifactSet>
								<excludes>
									<exclude>junit:junit</exclude>
								</excludes>
							</artifactSet>
							<transformers>
								<transformer
									implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
							</transformers>
						</configuration>
					</execution>
				</executions>

			</plugin>

The ServicesResourceTransformer merges service provider files of the same name under META-INF/services, instead of letting one jar's copy overwrite another's.
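A minimal sketch of what that merging amounts to: without the transformer, when two input jars both contain META-INF/services/org.example.SomeInterface, the last one copied into the shaded jar wins; with it, the provider lines from every jar are kept (the class names below are illustrative):

```python
def merge_service_files(contents):
    """contents: the text of the same META-INF/services file taken from each input jar."""
    seen, merged = set(), []
    for text in contents:
        for line in text.splitlines():
            provider = line.strip()
            # skip blanks, comments, and duplicate provider entries
            if provider and not provider.startswith("#") and provider not in seen:
                seen.add(provider)
                merged.append(provider)
    return "\n".join(merged)

jar_a = "org.example.JsonCodec\n"
jar_b = "org.example.XmlCodec\n"
print(merge_service_files([jar_a, jar_b]))  # both providers survive in the shaded jar
```

This is exactly why the transformer matters for Spark fat jars: Hadoop's FileSystem implementations are discovered through such service files, and losing entries breaks e.g. `hdfs://` URLs at runtime.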

- To debug, check the application logs in YARN (the web UI, or `yarn logs -applicationId <appId>` once log aggregation is enabled), or go to the NodeManager machine and inspect the container logs under <yarn.nodemanager.log-dirs> and the localized files under <yarn.nodemanager.local-dirs> (kept around by yarn.nodemanager.delete.debug-delay-sec above).

### How spark-submit works

#### Job submission

When an application is submitted to the cluster with `spark-submit`, the command-line arguments and the settings in the configuration files are parsed first. These specify options such as the master address, the classpath, and the other required resources.

```bash
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
```

This command sends a request to the specified master to launch an application instance, telling it where the program lives and what its entry point is (a main class or a Python script).

#### Driver initialization and scheduling

Once a valid submission is received, the master creates a new driver process, in some cases choosing the most suitable worker to host it. This step involves a resource assessment: finding a node with enough free memory and other hardware to satisfy the request.

Under YARN or Mesos, that placement decision is made by the framework's own scheduler; in standalone mode, Spark's built-in algorithm chooses where the driver runs.

#### Status updates during execution

After the driver starts, it loads the user's application code (Java classes or a PySpark script) and distributes tasks to the executors, each of which processes its partitions of the data. Meanwhile, the driver monitors the progress of the whole job and reports status back to the client until everything finishes.

```python
from pyspark import SparkConf, SparkContext

def main():
    conf = SparkConf().setAppName("example")
    sc = SparkContext(conf=conf)
    # Your code here...

if __name__ == "__main__":
    main()
```

Throughout this process, logging helps developers track down error messages and locate performance bottlenecks, so that optimizations can be applied afterwards.