Spark Cluster Mode Overview

蓝色水彼

Article link: https://blog.youkuaiyun.com/xingyx1990/article/details/89031133

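This article describes how Spark runs in a cluster environment, covering its components, the types of cluster managers, and the application submission process, and explains some key terms. A Spark program is coordinated by a SparkContext, which connects to a cluster manager such as standalone, Mesos, YARN, or Kubernetes to acquire executors. The driver program schedules tasks, executors run on worker nodes, and applications share data with each other through external storage.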

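This document describes how Spark runs on clusters, to make the components involved easier to understand. Read the application submission guide to learn how to submit applications to a cluster.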

Components

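Spark applications run on a cluster as independent sets of processes, coordinated by the SparkContext object in the main program (called the driver program).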

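Specifically, the SparkContext can connect to several types of cluster managers (such as Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources to applications. Once connected, Spark acquires executors on nodes in the cluster; these are processes that run computations and store data for the application. Next, it sends the application code to the executors. Finally, SparkContext sends tasks to the executors to run.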
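Which cluster manager the SparkContext connects to is selected by the master URL the application is given. The following is a minimal sketch of that mapping, assuming the commonly documented URL forms (`spark://`, `mesos://`, `yarn`, `k8s://`, `local`). The function `cluster_manager_for` and the host names are hypothetical and used only for illustration; this is a simplified classifier, not Spark's own parsing logic.

```python
def cluster_manager_for(master: str) -> str:
    """Classify a Spark master URL by the cluster manager it targets.

    Hypothetical helper for illustration only; Spark does its own parsing.
    """
    if master == "yarn":
        return "YARN"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("k8s://"):
        return "Kubernetes"
    # "local", "local[4]" and "local[*]" run Spark in a single JVM
    # without any cluster manager.
    if master == "local" or master.startswith("local["):
        return "local (no cluster manager)"
    raise ValueError(f"unrecognized master URL: {master!r}")


# Example usage with made-up host names:
#   cluster_manager_for("spark://master.example.com:7077")
#   cluster_manager_for("yarn")
```

Whichever manager is chosen, the steps after connection (acquire executors, ship the application code, send tasks) are the same from the application's point of view.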

There are a few things to note:

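1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This isolates applications from each other, both on the scheduling side (each driver schedules its own tasks) and on the executor side (tasks from different applications run in different JVMs). However, it also means that different Spark applications (instances of SparkContext) cannot share data unless it is written to an external storage system.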

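2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these can communicate with each other, it can run even on a cluster manager that also supports other applications.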

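3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime. The driver program must therefore be network addressable from the worker nodes.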

4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
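The requirement that executors can connect back to the driver can be checked with a plain TCP probe run from a worker node. The sketch below is a minimal example using only the Python standard library. The helper `can_reach_driver` is hypothetical, and the host name and port in the usage comment are made up. It assumes the driver's port has been pinned to a known value (Spark exposes the `spark.driver.host` and `spark.driver.port` configuration properties for this), since the driver otherwise picks a random port.

```python
import socket


def can_reach_driver(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests reachability and does not speak
    Spark's RPC protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage from a worker node (host and port are illustrative):
#   can_reach_driver("driver.example.com", 7078)
```

A failed probe usually points to a firewall or NAT sitting between the driver and the workers, which is another reason to run the driver on the same local network as the cluster.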