Introduction to Spark

License: CC 4.0 BY-SA

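This article introduces the architecture of Apache Spark and what its main components do. Spark is a fast, general-purpose cluster computing system that offers several high-level tools, such as SQL and machine learning. The article explains how a Spark application runs on a cluster, including how it interacts with the different cluster managers.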

Spark

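Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.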

Components

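A Spark application runs on a cluster as an independent set of processes, coordinated by the SparkContext object in your main program (called the driver program).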

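To run on a cluster, the SparkContext can connect to several types of cluster managers, which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster; executors are processes that run computations and store data for your application. Next, it sends your application code (defined by the JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.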
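The flow above can be mimicked in plain Python. This is a toy model only, not Spark's API: the names `split_into_partitions`, `square_sum_task`, and `run_job` are made up for illustration, and a thread pool stands in for the executors. The "driver" function splits the data into partitions, hands one task per partition to the pool, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal slices."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def square_sum_task(partition):
    """A task: the unit of work that processes a single partition."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, executor_threads=2):
    """Toy driver: schedule one task per partition and combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # The thread pool is a stand-in for executors running tasks in threads.
    with ThreadPoolExecutor(max_workers=executor_threads) as executors:
        partial_results = list(executors.map(square_sum_task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_job(list(range(1, 11))))
```

In a real application the driver would create a SparkContext and the cluster manager would decide where the executors run; the sketch only shows the division of labor between the driver and the tasks.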

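Points to note:

1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This isolates applications from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.

2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these can communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).

3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (the port is covered in the network configuration section, e.g. spark.driver.port). As such, the driver program must be network addressable from the worker nodes.

4. Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. If you want to send requests to the cluster remotely, it is better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.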
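As a rough illustration of notes 3 and 4, the sketch below assembles a `spark-submit` argument list that pins the driver's host and port so that executors can connect back to it. The helper `build_submit_command` is hypothetical, as are the host names, the port, and the file name `my_app.py`; the flags `--master`, `--deploy-mode`, and `--conf`, and the keys `spark.driver.host` and `spark.driver.port`, come from Spark's documentation. The function only builds the command and does not run it.

```python
def build_submit_command(app_file, master, deploy_mode="client", conf=None):
    """Return a spark-submit argument list; nothing is executed here."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Sort the keys so the generated command is stable across runs.
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app_file)
    return command


if __name__ == "__main__":
    # Hypothetical host names and port; use values valid for your own network.
    driver_conf = {
        "spark.driver.host": "driver.example.internal",
        "spark.driver.port": 40000,
    }
    print(" ".join(build_submit_command(
        "my_app.py", "spark://master.example.internal:7077", conf=driver_conf)))
```

Fixing the driver port like this is mainly useful when a firewall sits between the driver and the worker nodes; otherwise Spark picks a port on its own.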

Cluster Manager Types

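Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
Hadoop YARN – the resource manager in Hadoop 2.
Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.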
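In practice, the cluster manager is selected through the master URL given to Spark. The sketch below maps the common URL forms from Spark's documentation to the manager types listed above. The function name `cluster_manager_for` is made up for illustration, the `local` form (no cluster manager, everything in one process) is not discussed in this article, and less common forms are deliberately left out.

```python
def cluster_manager_for(master_url):
    """Map a Spark master URL to the cluster manager type it selects."""
    if master_url == "local" or master_url.startswith("local["):
        return "local (no cluster manager)"
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "yarn":
        return "Hadoop YARN"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"unrecognized master URL: {master_url!r}")


if __name__ == "__main__":
    # Hypothetical hosts; only the URL scheme matters for this sketch.
    for url in ("spark://host:7077", "mesos://host:5050", "yarn",
                "k8s://https://host:6443", "local[4]"):
        print(url, "->", cluster_manager_for(url))
```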

Glossary

Application	User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar	A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however; these will be added at runtime.
Driver program	The process running the main() function of the application and creating the SparkContext.
Cluster manager	An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode	Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node	Any node that can run application code in the cluster.
Executor	A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task	A unit of work that will be sent to one executor.
Job	A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage	Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
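To make the job, stage, and task terms more concrete, here is a toy word count in plain Python arranged like the map and reduce stages mentioned above. It is not Spark code: the function names, the hash-based routing, and the in-memory "shuffle" are all simplifying assumptions. Each list element produced inside a stage corresponds to one task.

```python
from collections import Counter, defaultdict


def map_stage(partitions):
    """Stage 1: one task per input partition counts the words it contains."""
    return [Counter(word for line in part for word in line.split())
            for part in partitions]


def shuffle(map_outputs, num_reducers):
    """Route every word to the reducer partition chosen by its hash."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for word, count in output.items():
            buckets[hash(word) % num_reducers][word].append(count)
    return buckets


def reduce_stage(buckets):
    """Stage 2: one task per reducer partition sums the counts for its words."""
    return [{word: sum(counts) for word, counts in bucket.items()}
            for bucket in buckets]


def word_count_job(partitions, num_reducers=2):
    """A job: the map stage must finish before the reduce stage can start."""
    reduced = reduce_stage(shuffle(map_stage(partitions), num_reducers))
    result = {}
    for part in reduced:
        result.update(part)
    return result


if __name__ == "__main__":
    print(word_count_job([["a b a"], ["b c"]]))
```

The dependency between the two functions mirrors the glossary entry for Stage: the reduce tasks cannot start until every map task has produced its output.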