Spark Core: Spark Terminology and Runtime Architecture

This article takes a close look at Apache Spark's architecture and core components, explaining the relationships between Application, Job, Stage, Task, and Partitions and the roles they play while a Spark program runs. It also covers how Spark is deployed on different cluster managers such as Standalone, Mesos, and YARN.



Official docs: Deploying -> Overview
http://spark.apache.org/docs/2.4.2/cluster-overview.html
[Figure: cluster mode overview diagram from the Spark docs]

1. Glossary

1.1 Terminology

  • Application
    User program built on Spark. Consists of a driver program and executors on the cluster.

  • Application jar
    A jar containing the user’s Spark application. In some cases users will want to create an “uber jar” containing their application along with its dependencies. The user’s jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
    Simply put, this is the jar you build in IDEA. The tip here is to keep it thin: leave the Hadoop and Spark libraries out, since they are added at runtime.

  • Driver program
    The process running the main() function of the application and creating the SparkContext
    A minimal driver program is sketched after this glossary.

  • Cluster manager
    An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)

  • Deploy mode
    Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster.
    In client mode the driver runs locally, i.e. on whichever machine you submitted the application from.

  • Worker node
    Any node that can run application code in the cluster
    This term comes from the standalone architecture; when Spark runs on YARN, the equivalent is a NodeManager (NM) node.

  • Executor
    A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
    When running on YARN, an executor corresponds to a Container.

  • Task
    A unit of work that will be sent to one executor

  • Job
    A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
    Simply put: whenever Spark encounters an action, it submits a job.

  • Stage
    Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce);
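
To make the Application and Driver program entries concrete, here is a minimal sketch of a self-contained Spark application. The object name, master URL, and input data are placeholders for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// The process that runs this main() is the driver program; the
// executors this application acquires on worker nodes run its tasks.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountApp")   // the application name shown in the Spark UI
      .setMaster("local[2]")        // local mode for this sketch; on a cluster the master comes from spark-submit
    val sc = new SparkContext(conf) // creating the SparkContext makes this process the driver

    // transformations (map, reduceByKey) are lazy; the action (collect) triggers a job
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()

    counts.foreach(println)
    sc.stop()                       // releases the executors held by this application
  }
}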

1.2 How Job, Stage, Task, and Partitions Relate

One action triggers one job.
A job is split into multiple stages; the split points are the operators that cause a shuffle.
A stage consists of a set of tasks; the number of tasks depends on the number of partitions you set.
One partition corresponds to one task.
When saving output, each task writes one file (a quick check follows this list).
If the result is stored on HDFS, the number of files therefore equals the number of tasks in the final stage; the replication factor only controls how many copies HDFS keeps of each block, not how many files there are.
For example, reduceByKey causes a shuffle (map does not), and each shuffle boundary starts a new stage.
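
A quick way to see the one-file-per-task behavior; the output path below is a placeholder, and this is a sketch rather than a prescribed workflow:

// two partitions -> the save action runs 2 tasks -> 2 part files
scala> sc.parallelize(1 to 10, 2).saveAsTextFile("hdfs:///tmp/spark-demo-out")

// the output directory then contains part-00000 and part-00001
// (plus a _SUCCESS marker), regardless of the HDFS replication factor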

# create the RDD with two partitions
scala> val a =sc.parallelize(List(1,1,1,2,2,2,3,3,4,5,6,6,7),2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala>  a.map((_,1)).reduceByKey(_+_).collect
res2: Array[(Int, Int)] = Array((4,1), (6,2), (2,3), (1,3), (3,2), (7,1), (5,1))

Here you can see that this one job produced 4 tasks. In the Spark UI, click into "collect at <console>:26" and scroll down:
[Screenshot: Spark UI Jobs page showing one job with 4 tasks]
Here you can see that the reduceByKey step introduced a shuffle, so the job was split into two stages. Each stage ran two tasks because the RDD was created with two partitions, for 4 tasks in total.
[Screenshot: Spark UI stage view showing two stages of two tasks each]
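
If you do not have the UI handy, the same stage split is visible from the shell with toDebugString. A sketch follows; the res numbers and RDD ids are illustrative and vary by session and Spark version:

scala> a.getNumPartitions
res3: Int = 2

scala> a.map((_,1)).reduceByKey(_+_).toDebugString
res4: String =
(2) ShuffledRDD[5] at reduceByKey at <console>:26 []
 +-(2) MapPartitionsRDD[4] at map at <console>:26 []
    |  ParallelCollectionRDD[3] at parallelize at <console>:24 []

The indentation step at the ShuffledRDD marks the shuffle (stage) boundary, and the (2) prefixes show two partitions, hence two tasks, per stage.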

2. Components

  • Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

  • Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
    [Figure: components diagram from the Spark docs: driver program, cluster manager, and worker nodes with executors]
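
As a sketch of how an application asks the cluster manager for executors; these are standard Spark configuration keys, but the values are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative executor sizing: the cluster manager (e.g. YARN) uses
// these settings to allocate executor processes for this application.
val conf = new SparkConf()
  .setAppName("components-demo")
  .set("spark.executor.instances", "4") // how many executor processes to request
  .set("spark.executor.cores", "2")     // task slots per executor (tasks run as threads)
  .set("spark.executor.memory", "2g")   // heap per executor

val sc = new SparkContext(conf) // connects to the cluster manager and acquires the executors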

There are several useful things to note about this architecture:

  1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.

  2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).

  3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
    In other words, the driver and its executors must be able to communicate with each other for the application's entire lifetime (a configuration sketch follows this list).

  4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
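
A sketch for notes 3 and 4: pinning the driver's address and port so executors (and firewall rules) can reach it. spark.driver.host and spark.driver.port are standard Spark configs; the hostname and port below are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("driver-network-demo")
  .set("spark.driver.host", "driver-host.example.com") // the address executors connect back to
  .set("spark.driver.port", "35000")                   // fixed port instead of a random one, which eases firewall setup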
