Dive into Spark Context


License: CC 4.0 BY-SA

Original link: https://my.oschina.net/hunglish/blog/1547371


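This article takes a detailed look at SparkContext, a core component of Apache Spark: its role, how it is created, and the main functionality it provides. A WordCount example shows how to use SparkContext, followed by a summary of its important functions.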


Let's start with the familiar architecture diagram:

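SparkContext is arguably the "gateway to all wonders" of Apache Spark. In any Spark project, the most important step is creating SparkContext correctly; it is the foundation on which all of your code runs. Why? Because Spark's authors built a wide range of functionality into the SparkContext object. For example, it lets your Spark application communicate with the whole Spark cluster through the cluster manager (also called the resource manager, whether that is YARN or Mesos).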

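Before creating SparkContext you need a SparkConf, the object that holds Spark's configuration. SparkConf carries the configuration parameters that our driver program passes to SparkContext. Engineers with some Spark experience will already know this, and the example later in this post illustrates it further.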

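Since readers come with different backgrounds, I want to cover the following points:

What role does SparkContext play in Spark?
How do you create SparkContext?
A worked example
A summary of SparkContext functionality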

Role

As mentioned above, SparkContext is the foundation of a working Spark application: it is the entry point to Spark functionality.

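How to create SparkContext

The first step in creating SparkContext is creating a SparkConf. The SparkConf object passes parameters to SparkContext. These parameters define the properties of the driver application, and they can also be used to allocate resources on the cluster, such as the number of executors on the worker nodes, their memory size, and the number of cores they use to run tasks.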
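As a minimal sketch of the configuration step described above, the object below builds a SparkConf with a few resource settings. The object name `SparkConfSketch`, the application name, and the memory and core values are assumptions chosen for illustration; in local mode the executor settings have little practical effect and are shown only to illustrate the API.

```scala
import org.apache.spark.SparkConf

object SparkConfSketch {
  // Build a SparkConf; every value below is an illustrative assumption.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("ConfSketch")            // name shown in the Spark UI
      .setMaster("local[2]")               // run locally with 2 threads
      .set("spark.executor.memory", "1g")  // memory per executor
      .set("spark.executor.cores", "2")    // cores per executor

  def main(args: Array[String]): Unit = {
    // toDebugString lists the key=value pairs held by the conf
    println(build().toDebugString)
  }
}
```

When an application is launched with spark-submit, the master and resource settings are usually supplied on the command line instead, so hard-coding them as above mostly makes sense for local experiments.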

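In short, SparkContext shows us the way to the Spark cluster. Once it has been created, we can call functions such as textFile, sequenceFile, and parallelize. The master it connects to can be local, yarn-client (since Spark 2.0, the master yarn with client deploy mode), a Mesos URL, or a Spark URL.

Once SparkContext has been created, it can be used to create RDDs, broadcast variables, and accumulators, to access Spark services, and to run jobs. All of this remains available until SparkContext is stopped.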
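To make the list above concrete, here is a small sketch that uses one SparkContext to create an RDD, a broadcast variable, and an accumulator, and then runs two jobs. It assumes Spark 2.0 or later (for `longAccumulator`) and a local master; the object name and the numbers are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextFeaturesSketch {
  // Returns (sum of 1..10, how many of those numbers exceed the broadcast threshold).
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("ContextFeaturesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 10)                 // RDD from a local collection
      val threshold = sc.broadcast(5)                       // read-only value shared with tasks
      val aboveThreshold = sc.longAccumulator("aboveThreshold")

      // foreach is an action, so this runs a job and updates the accumulator
      numbers.foreach { n => if (n > threshold.value) aboveThreshold.add(1) }

      // reduce is a second action on the same RDD
      val total = numbers.map(_.toLong).reduce(_ + _)
      (total, aboveThreshold.value.longValue())
    } finally {
      sc.stop()                                             // nothing can be run after this
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```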

A worked example

I believe an example will make all of this sink in faster:

package com.panda.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Wordcount {
  def main(args: Array[String]): Unit = {
    // Check the arguments before starting Spark
    if (args.length < 2) {
      println("Usage: Wordcount <input> <output>")
      System.exit(1)
    }

    // Create the conf object (the master URL is expected to come from spark-submit)
    val conf = new SparkConf().setAppName("WordCount")

    // Create the Spark context object
    val sc = new SparkContext(conf)

    // Create an RDD from an external source
    val rawData = sc.textFile(args(0))

    // Use flatMap to turn lines into words
    val words = rawData.flatMap(line => line.split(" "))

    // Count the occurrences of each word
    val wordCount = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Save the result
    wordCount.saveAsTextFile(args(1))

    // Stop the Spark context
    sc.stop()
  }
}
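If you want to try the same pipeline without input files or a cluster, something like the following should work. It is a sketch under assumptions: the object name `LocalWordCountSketch` is hypothetical, the master is hard-coded to local mode, and the input is an in-memory sequence rather than a path passed on the command line.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalWordCountSketch {
  // Counts words in the given lines using a throwaway local SparkContext.
  def count(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setAppName("LocalWordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)                 // in-memory data instead of sc.textFile
        .flatMap(line => line.split(" "))   // lines -> words
        .map(word => (word, 1))
        .reduceByKey(_ + _)                 // sum the counts per word
        .collect()                          // bring the small result back to the driver
        .toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    count(Seq("a b a", "b c")).foreach(println)
}
```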

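Summary of SparkContext functions

As the figure above shows, Spark gives us all kinds of functions, which fall roughly into the following groups. Some of these are easier to appreciate in the original English: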

Getting the current state of the Spark application

SparkEnv – The runtime environment through which Spark provides its common services. A SparkEnv object holds the runtime services required to run a Spark application, with separate environments for the driver and the executors.
SparkConf – Spark properties control most application settings and are configured separately for each application. They are easy to set on a SparkConf: common properties such as the master URL and the application name have dedicated setters, and arbitrary key-value pairs are configured through the set() method.
Deployment environment (as master URL) – There are two kinds of Spark deployment environment: local and clustered. Local mode is a non-distributed, single-JVM deployment mode in which all the execution components (driver, executor, LocalSchedulerBackend, and master) live in the same JVM. It is therefore the only mode in which the driver itself is used for execution. Local mode suits testing, debugging, and demonstrations because it needs no prior setup to launch a Spark application. In clustered mode, Spark runs distributed.
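The state described above can be read back from a running context. The sketch below assumes a local master and a hypothetical object name; `appName`, `master`, `isLocal`, and `getConf` are public members of SparkContext.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextStateSketch {
  // Returns (application name, master URL, whether the context runs in local mode).
  def describe(): (String, String, Boolean) = {
    val conf = new SparkConf().setAppName("ContextStateSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // getConf returns a copy of the configuration the context was created with
      require(sc.getConf.get("spark.app.name") == sc.appName)
      (sc.appName, sc.master, sc.isLocal)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(describe())
}
```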

Setting configuration

Master URL – The master method returns the current value of spark.master, which is the deployment environment in use.
Local properties (creating logical job groups) – Local properties exist to form logical groups of jobs: through them, separate jobs launched from different threads can belong to a single logical group. We can set a local property that affects the Spark jobs submitted from a thread, such as the Spark fair scheduler pool.
Default logging level – setLogLevel lets you set the root logging level in a Spark application, for example in Spark Shell.
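A short sketch of these settings follows. The pool name "reporting" and the object name are hypothetical, and with the default FIFO scheduler the pool property is simply carried along without changing scheduling; it only matters when the fair scheduler is enabled.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalPropertiesSketch {
  // Returns (the pool set for the current thread, the result of a job run under it).
  def run(): (String, Long) = {
    val conf = new SparkConf().setAppName("LocalPropertiesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.setLogLevel("WARN")                                    // root logging level

      // Local properties apply to jobs submitted from the current thread
      sc.setLocalProperty("spark.scheduler.pool", "reporting")  // hypothetical pool name
      val pool = sc.getLocalProperty("spark.scheduler.pool")

      val n = sc.parallelize(1 to 100).count()                  // submitted with the property set

      sc.setLocalProperty("spark.scheduler.pool", null)         // passing null clears the property
      (pool, n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```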

Accessing services

SparkContext gives access to services such as TaskScheduler, LiveListenerBus, BlockManager, SchedulerBackend, ShuffleManager, and the optional ContextCleaner.

Cancelling a job

Cancelling a job asks the DAGScheduler to drop the Spark job.

Cancelling a stage

Likewise, cancelling a stage asks the DAGScheduler to drop the Spark stage.
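One way to see cancellation in action is sketched below: a listener cancels every job as soon as it starts, and the action that submitted the job then fails with a SparkException. This is a toy setup under assumptions (local master, hypothetical object name, artificially slow tasks), not a recommended production pattern. `cancelStage(stageId)`, `cancelJobGroup(groupId)`, and `cancelAllJobs()` are the related calls on SparkContext.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkException}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

object CancelJobSketch {
  // Returns true if the action was aborted because its job was cancelled.
  def run(): Boolean = {
    val conf = new SparkConf().setAppName("CancelJobSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Cancel each job as soon as the listener hears that it has started
      sc.addSparkListener(new SparkListener {
        override def onJobStart(jobStart: SparkListenerJobStart): Unit =
          sc.cancelJob(jobStart.jobId)
      })
      try {
        // Slow tasks, so the cancellation arrives while the job is still running
        sc.parallelize(1 to 4, 4).map { n => Thread.sleep(2000); n }.count()
        false
      } catch {
        case _: SparkException => true   // the cancelled job surfaces as an exception
      }
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(s"job cancelled: ${run()}")
}
```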

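Cleaning closures for Spark

Before each action runs, Spark calls ClosureCleaner.clean to clean the closure. We saw the same thing in the source code when discussing RDD functions: the closure is cleaned thoroughly before it is serialized and sent to the executors. Note that the cleaning covers every closure that is referenced.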

Registering Spark listeners

We can register a custom SparkListenerInterface with the help of addSparkListener method. We can also register custom listeners using the spark.extraListeners setting.
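A minimal listener might look like the sketch below. It extends SparkListener, the no-op base class that implements SparkListenerInterface, and counts finished jobs; the class and object names are made up for illustration. Listener events are delivered asynchronously, so the count is read only after the context has been stopped.

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Counts how many jobs have ended; only the callback we care about is overridden.
class JobCounter extends SparkListener {
  val finishedJobs = new AtomicInteger(0)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    finishedJobs.incrementAndGet()
}

object ListenerSketch {
  def run(): Int = {
    val conf = new SparkConf().setAppName("ListenerSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val counter = new JobCounter
    sc.addSparkListener(counter)
    try {
      sc.parallelize(1 to 10).count()                // first job
      sc.parallelize(1 to 10).map(_ * 2).collect()   // second job
    } finally {
      sc.stop()   // shuts down the listener bus once queued events are delivered
    }
    counter.finishedJobs.get()
  }

  def main(args: Array[String]): Unit = println(s"jobs finished: ${run()}")
}
```

The same class could instead be registered through the spark.extraListeners setting mentioned above, in which case Spark instantiates it by class name and there is no need to call addSparkListener.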

Programmable Dynamic allocation

SparkContext also provides the following methods as the developer API for dynamic allocation of executors: requestExecutors, killExecutors, requestTotalExecutors, getExecutorIds.