1: Getting to Know Spark
Official site: http://spark.apache.org
Apache Spark™ is a unified analytics engine for large-scale data processing
Apache Spark is a unified engine for large-scale data processing and analytics, with four notable characteristics:
1.1: Fast execution
Compared with Hadoop, the programming model is different: MapReduce is process-based and nearly every intermediate step is written to disk, whereas Spark is thread-based and runs DAG-based, pipelined computation largely in memory.
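To make the pipeline point concrete, here is a minimal sketch (the HDFS path is a placeholder and an existing SparkContext named sc is assumed): the chained transformations are lazy, are fused into one in-memory pipeline per partition, and only run when the action count() is called.

// Minimal sketch, assuming an existing SparkContext `sc`; the input path is hypothetical.
val numWords = sc.textFile("hdfs:///path/to/input")
  .filter(line => line.nonEmpty)           // drop empty lines
  .flatMap(line => line.split("\\s+"))     // split lines into words
  .count()                                 // action: triggers the whole pipeline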
1.2: Ease of use
Applications can be written in Scala, Python, Java, R, and SQL, and Spark provides more than 80 high-level operators.
1.3: Generality
This shows in the ecosystem stack: the same engine powers libraries for SQL, streaming, machine learning (MLlib), and graph processing (GraphX), so a wide range of problems can be solved effectively in one framework.
1.4: Runs everywhere
Spark runs on Hadoop (on YARN), Apache Mesos, Kubernetes (supported since 2.3), standalone (Spark's own cluster manager), or in the cloud. It can access diverse data sources.
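As a hedged illustration of "runs everywhere": the application code stays the same and only the master URL changes per cluster manager (in practice it is passed via spark-submit rather than hardcoded). Host names and ports below are placeholders.

import org.apache.spark.SparkConf

// Placeholder master URLs for the main deployment modes.
val localConf      = new SparkConf().setMaster("local[2]")                     // local threads
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077")     // standalone cluster
val yarnConf       = new SparkConf().setMaster("yarn")                         // Hadoop YARN
val k8sConf        = new SparkConf().setMaster("k8s://https://k8s-host:6443")  // Kubernetes (2.3+)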
2: RDD Source Code
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an
- immutable (transformations such as map do not modify the RDD; they produce a new one),
- partitioned collection of elements
- that can be operated on in parallel.
This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join.
2.1 Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split (partition)
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
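These five properties can also be observed through the public RDD API; a small sketch, assuming an existing SparkContext sc:

// Sketch, assuming an existing SparkContext `sc`.
val pairs = sc.parallelize(1 to 100, numSlices = 4).map(n => (n % 10, n))

pairs.partitions.length                        // a list of partitions (4 here)
pairs.dependencies                             // dependencies on the parent RDD
pairs.partitioner                              // None: no partitioner before a shuffle
pairs.preferredLocations(pairs.partitions(0))  // preferred locations (empty for parallelize)

val summed = pairs.reduceByKey(_ + _)
summed.partitioner                             // Some(HashPartitioner) after the shuffle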
RDD is an abstract class that extends Serializable and mixes in Logging:
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
2.2 The five methods through which an RDD implements its five main properties:
These must be implemented concretely by subclasses such as HadoopRDD and JdbcRDD; a minimal custom-RDD sketch follows the source excerpt below.
/**
 * Implemented by subclasses to compute a given partition.
 */
def compute(split: Partition, context: TaskContext): Iterator[T]

/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 *
 * The partitions in this array must satisfy the following property:
 *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
 */
protected def getPartitions: Array[Partition]

/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps

/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None

// =======================================================================
// Methods and fields available on all RDDs
// =======================================================================

/** The SparkContext that created this RDD. */
def sparkContext: SparkContext = sc
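To see how a subclass such as HadoopRDD fills these in, here is a deliberately tiny custom RDD sketch (illustrative only, not Spark source): it exposes an integer range as an RDD by implementing getPartitions and compute, and keeps the default getDependencies, getPreferredLocations, and partitioner.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each partition covers one slice of the range.
class RangePartition(override val index: Int, val start: Int, val end: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, from: Int, to: Int, slices: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs, hence no dependencies

  // Property 1: a list of partitions.
  override protected def getPartitions: Array[Partition] = {
    val step = math.max(1, (to - from) / slices)
    (0 until slices).map { i =>
      val start = from + i * step
      val end   = if (i == slices - 1) to else start + step
      new RangePartition(i, start, end): Partition
    }.toArray
  }

  // Property 2: a function for computing each split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Properties 3-5 (dependencies, partitioner, preferred locations) keep the defaults.
}

Running new SimpleRangeRDD(sc, 0, 100, 4).collect() would return the numbers 0 to 99, computed in four parallel tasks.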
3: Initializing Spark
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
3.1: SparkContext
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
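A brief hedged sketch of those three capabilities, assuming an already-created SparkContext sc (longAccumulator requires Spark 2.0+):

// Sketch, assuming an existing SparkContext `sc`.
val rdd    = sc.parallelize(Seq(1, 2, 3, 4, 5))            // RDD from a local collection
val misses = sc.longAccumulator("misses")                  // accumulator: executors add, driver reads
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))     // broadcast: read-only value cached per executor

rdd.foreach { n =>
  if (!lookup.value.contains(n)) misses.add(1L)
}
println(s"numbers without a name: ${misses.value}")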
3.2: SparkConf
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load
values from any spark.* Java system properties set in your application as well. In this case,
parameters you set directly on the SparkConf object take priority over system properties.
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("First").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // TODO----

    sc.stop()
  }
}
In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
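A hedged sketch of that pattern: leave master unset in code and supply it at submit time; the class and jar names below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitFriendlyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit's --master flag (and other --conf settings) supply it.
    val conf = new SparkConf().setAppName("SubmitFriendlyApp")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}

// Illustrative launch (master and path are placeholders):
//   spark-submit --class SubmitFriendlyApp --master yarn path/to/app.jar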
This article has introduced Apache Spark as a unified big-data analytics engine, covering its speed, ease of use, generality, and ability to run anywhere. It analyzed the internal design of Spark's basic abstraction, the RDD, and provided practical guidance on initializing a Spark environment.