1: Getting to Know Spark
Official site: http://spark.apache.org
Apache Spark™ is a unified analytics engine for large-scale data processing
Apache Spark is a unified engine for large-scale data processing and analytics, with four notable characteristics:
1.1: Fast execution
Compared with Hadoop, the programming model is different: MapReduce is process-based and nearly every intermediate step is written to disk, whereas Spark is thread-based and runs DAG-based, pipelined computation largely in memory.
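To make the pipeline point concrete, here is a minimal sketch (the HDFS path is a placeholder and an existing SparkContext named sc is assumed): the chained transformations are lazy, are fused into one in-memory pipeline per partition, and only run when the action count() is called.

// Minimal sketch, assuming an existing SparkContext `sc`; the input path is hypothetical.
val numWords = sc.textFile("hdfs:///path/to/input")
  .filter(line => line.nonEmpty)           // drop empty lines
  .flatMap(line => line.split("\\s+"))     // split lines into words
  .count()                                 // action: triggers the whole pipeline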
1.2: Ease of use
Applications can be written in Scala, Python, Java, R, and SQL, and Spark provides more than 80 high-level operators.
1.3: Generality
This shows in the ecosystem stack: the same engine powers libraries for SQL, streaming, machine learning (MLlib), and graph processing (GraphX), so a wide range of problems can be solved effectively in one framework.
1.4: Runs everywhere
Spark runs on Hadoop (on YARN), Apache Mesos, Kubernetes (supported since 2.3), standalone (Spark's own cluster manager), or in the cloud. It can access diverse data sources.
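As a hedged illustration of "runs everywhere": the application code stays the same and only the master URL changes per cluster manager (in practice it is passed via spark-submit rather than hardcoded). Host names and ports below are placeholders.

import org.apache.spark.SparkConf

// Placeholder master URLs for the main deployment modes.
val localConf      = new SparkConf().setMaster("local[2]")                     // local threads
val standaloneConf = new SparkConf().setMaster("spark://master-host:7077")     // standalone cluster
val yarnConf       = new SparkConf().setMaster("yarn")                         // Hadoop YARN
val k8sConf        = new SparkConf().setMaster("k8s://https://k8s-host:6443")  // Kubernetes (2.3+)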
2: RDD Source Code
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an
- immutable (transformations such as map do not modify the RDD; they produce a new one),
- partitioned collection of elements
- that can be operated on in parallel.
This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join.
2.1 Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split (partition)
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
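These five properties can also be observed through the public RDD API; a small sketch, assuming an existing SparkContext sc:

// Sketch, assuming an existing SparkContext `sc`.
val pairs = sc.parallelize(1 to 100, numSlices = 4).map(n => (n % 10, n))

pairs.partitions.length                        // a list of partitions (4 here)
pairs.dependencies                             // dependencies on the parent RDD
pairs.partitioner                              // None: no partitioner before a shuffle
pairs.preferredLocations(pairs.partitions(0))  // preferred locations (empty for parallelize)

val summed = pairs.reduceByKey(_ + _)
summed.partitioner                             // Some(HashPartitioner) after the shuffle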
RDD is an abstract class that extends Serializable and mixes in Logging:
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
2.2 The five methods through which an RDD implements its five main properties:
These must be implemented concretely by subclasses such as HadoopRDD and JdbcRDD; a minimal custom-RDD sketch follows the source excerpt below.
/**
 * Implemented by subclasses to compute a given partition.
 */
def compute(split: Partition, context: TaskContext): Iterator[T]

/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 *
 * The partitions in this array must satisfy the following property:
 *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
 */
protected def getPartitions: Array[Partition]

/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps

/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None

// =======================================================================
// Methods and fields available on all RDDs
// =======================================================================

/** The SparkContext that created this RDD. */
def sparkContext: SparkContext = sc
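To see how a subclass such as HadoopRDD fills these in, here is a deliberately tiny custom RDD sketch (illustrative only, not Spark source): it exposes an integer range as an RDD by implementing getPartitions and compute, and keeps the default getDependencies, getPreferredLocations, and partitioner.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each partition covers one slice of the range.
class RangePartition(override val index: Int, val start: Int, val end: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, from: Int, to: Int, slices: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs, hence no dependencies

  // Property 1: a list of partitions.
  override protected def getPartitions: Array[Partition] = {
    val step = math.max(1, (to - from) / slices)
    (0 until slices).map { i =>
      val start = from + i * step
      val end   = if (i == slices - 1) to else start + step
      new RangePartition(i, start, end): Partition
    }.toArray
  }

  // Property 2: a function for computing each split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // Properties 3-5 (dependencies, partitioner, preferred locations) keep the defaults.
}

Running new SimpleRangeRDD(sc, 0, 100, 4).collect() would return the numbers 0 to 99, computed in four parallel tasks.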
3: Initializing Spark
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
3.1: SparkContext
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
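A brief hedged sketch of those three capabilities, assuming an already-created SparkContext sc (longAccumulator requires Spark 2.0+):

// Sketch, assuming an existing SparkContext `sc`.
val rdd    = sc.parallelize(Seq(1, 2, 3, 4, 5))            // RDD from a local collection
val misses = sc.longAccumulator("misses")                  // accumulator: executors add, driver reads
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))     // broadcast: read-only value cached per executor

rdd.foreach { n =>
  if (!lookup.value.contains(n)) misses.add(1L)
}
println(s"numbers without a name: ${misses.value}")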
3.2: SparkConf
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load
values from any spark.* Java system properties set in your application as well. In this case,
parameters you set directly on the SparkConf object take priority over system properties.
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("First").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // TODO----

    sc.stop()
  }
}
In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
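A hedged sketch of that pattern: leave master unset in code and supply it at submit time; the class and jar names below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitFriendlyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit's --master flag (and other --conf settings) supply it.
    val conf = new SparkConf().setAppName("SubmitFriendlyApp")
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}

// Illustrative launch (master and path are placeholders):
//   spark-submit --class SubmitFriendlyApp --master yarn path/to/app.jar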
This article has introduced Apache Spark as a unified big-data analytics engine, covering its speed, ease of use, generality, and ability to run anywhere. It analyzed the internal design of Spark's basic abstraction, the RDD, and provided practical guidance on initializing a Spark environment.