RDD Creation and Partitioning
1. Creating an RDD
There are three ways to create an RDD in Spark: from a collection, from external storage, or from another RDD.
a) Create a new Maven project named SparkCoreTest
b) Add Scala framework support
c) Create a scala folder and mark it as Sources Root
d) Add the following to the pom.xml file:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>
<build>
    <finalName>SparkCoreTest</finalName>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
① Creating an RDD from a collection
package com.xiao_after.createadd
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* @author xiaohu
* @create 2020-09-24 18:17
*/
object createadd01_array {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    // 2. Create a SparkContext, the entry point for submitting a Spark application
    val sc: SparkContext = new SparkContext(conf)
    // 3. Create an RDD with parallelize()
    val rdd1: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8))
    // Print the elements of rdd1
    rdd1.collect().foreach(println)
    println("----------------------------------")
    // 4. Create an RDD with makeRDD()
    val rdd2: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5, 6, 7, 8))
    // Print the elements of rdd2
    rdd2.collect().foreach(println)
    // 5. Close the connection
    sc.stop()
  }
}
Note: in the source code, the makeRDD method calls parallelize, but the two are not completely interchangeable.
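In particular, makeRDD has an additional overload that accepts a preferred location for each element, which parallelize does not offer. A minimal sketch (the host names are placeholder assumptions):

val rddWithLocs: RDD[String] = sc.makeRDD(Seq(
  ("a", Seq("hadoop102")),  // element "a" should preferably be computed on host hadoop102
  ("b", Seq("hadoop103"))   // element "b" on host hadoop103
))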
② Creating an RDD from external storage
An RDD can be created from datasets in external storage systems: the local file system, as well as any dataset supported by Hadoop, such as HDFS and HBase.
Data preparation: right-click the new SparkCoreTest project => create an input folder => right-click the input folder => create 1.txt and 2.txt, and put a few words in each file.
package com.xiao_after.createadd
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* @author xiaohu
* @create 2020-09-24 18:27
*/
object createadd02_file {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    // 2. Create a SparkContext, the entry point for submitting a Spark application
    val sc: SparkContext = new SparkContext(conf)
    // 3. Read the file. For a cluster path, use e.g. hdfs://hadoop102:9000/input
    val rdd: RDD[String] = sc.textFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\input\\1.txt")
    // 4. Print the elements
    rdd.collect().foreach(println)
    // 5. Close the connection
    sc.stop()
  }
}
③ Creating an RDD from another RDD
A new RDD is produced by applying an operation to an existing RDD, as in the sketch below.
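A small sketch (assuming the same SparkContext sc as in the examples above):

val rdd1: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4))
// map produces a new RDD derived from rdd1
val rdd2: RDD[Int] = rdd1.map(_ * 2)
rdd2.collect().foreach(println)   // 2, 4, 6, 8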
2. Partitioning Rules
2.1 Default partitioning source code (RDD created from a collection)
1) Reading the source code for the default number of partitions
a) Create an RDD
val rdd: RDD[Int] = sc.makeRDD(Array(1,2,3,4))
b) Ctrl + left-click makeRDD to view the makeRDD source code
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
c) Ctrl + left-click defaultParallelism to view the defaultParallelism source code
def defaultParallelism: Int = {
  assertNotStopped()
  taskScheduler.defaultParallelism
}
d) Ctrl + left-click the defaultParallelism called on taskScheduler; it points to
def defaultParallelism(): Int
e) defaultParallelism is an abstract method, so we need to find its implementation. Press Ctrl + H; the implementing class is:
TaskSchedulerImpl(org.apache.spark.scheduler)
// open this implementation class
f) Press Ctrl + F and search for defaultParallelism
override def defaultParallelism(): Int = backend.defaultParallelism()
g) Ctrl + left-click the defaultParallelism on the right-hand side, then press Ctrl + H to find its implementations. There are two:
CoarseGrainedSchedulerBackend(org.apache.spark.scheduler.cluster)
LocalSchedulerBackend(org.apache.spark.scheduler.local)
h) Since we are running in a local environment, open the second one, LocalSchedulerBackend(org.apache.spark.scheduler.local), and press Ctrl + F to search for defaultParallelism
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
The source code shows that when running locally without specifying a partition count, the default is spark.default.parallelism if that property is set, otherwise totalCores, i.e. the number of cores assigned to the local master (all CPU cores of the machine for local[*]).
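A minimal sketch of how this default can be influenced (the concrete values below are illustrative assumptions): limiting the cores of the local master changes totalCores, and spark.default.parallelism takes precedence when set.

val conf: SparkConf = new SparkConf()
  .setAppName("SparkCoreTest")
  .setMaster("local[4]")                     // totalCores = 4
  .set("spark.default.parallelism", "8")     // overrides totalCores when set
val sc: SparkContext = new SparkContext(conf)
println(sc.defaultParallelism)               // 8 here; 4 if the property were not set
sc.stop()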
2) Verifying with code:
package com.xiao_after.partition
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* @author xiaohu
* @create 2020-09-24 18:49
*/
object partition01_default {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    // 2. Create a SparkContext, the entry point for submitting a Spark application
    val sc: SparkContext = new SparkContext(conf)
    // 3. Create an RDD
    val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4))
    // 3.1 Save the RDD to an output folder (one file per partition)
    rdd.saveAsTextFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\output")
    // 4. Close the connection
    sc.stop()
  }
}
Check the number of partitions in the output folder: since the author's machine has 8 cores and 16 threads, 16 partitions (16 part files) are produced.
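A quicker check that does not require writing files (a small sketch using the rdd from the code above, run before sc.stop()):

println(rdd.getNumPartitions)      // e.g. 16 on an 8-core / 16-thread machine with local[*]
println(rdd.partitions.length)     // equivalent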
2.2 Partitioning source code (RDD created from a collection)
1) Partitioning test (RDD created from a collection)
package com.xiao_after.partition
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* @author xiaohu
* @create 2020-09-24 18:49
*/
object partition02_Array {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    // 2. Create a SparkContext, the entry point for submitting a Spark application
    val sc: SparkContext = new SparkContext(conf)
    // 3.1 First RDD: 4 elements, 4 partitions; result: partition 0 -> 1; partition 1 -> 2; partition 2 -> 3; partition 3 -> 4
    val rdd1: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4), 4)
    // Save rdd1 to files
    rdd1.saveAsTextFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\output1")
    // 3.2 Second RDD: 4 elements, 3 partitions; result: partition 0 -> 1; partition 1 -> 2; partition 2 -> 3, 4
    val rdd2: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4), 3)
    // Save rdd2 to files
    rdd2.saveAsTextFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\output2")
    // 3.3 Third RDD: 5 elements, 3 partitions; result: partition 0 -> 1; partition 1 -> 2, 3; partition 2 -> 4, 5
    val rdd3: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5), 3)
    // Save rdd3 to files
    rdd3.saveAsTextFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\output3")
    // 4. Close the connection
    sc.stop()
  }
}
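Instead of inspecting the output folders, the per-partition contents can also be printed directly with glom() (a small sketch using rdd2 from the code above, run before sc.stop()):

rdd2.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")
}
// partition 0: 1
// partition 1: 2
// partition 2: 3, 4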
2) Partitioning source code (RDD created from a collection)
a) Create an RDD
val rdd3: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5), 3)
b) Ctrl + left-click makeRDD to view the makeRDD source code
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}
c) Ctrl + left-click parallelize to view the parallelize source code
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
d) Ctrl + left-click ParallelCollectionRDD to view the ParallelCollectionRDD source code
private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
  extends RDD[T](sc, Nil) {
e) Below it there is a companion object of ParallelCollectionRDD; look at its slice method:
private object ParallelCollectionRDD {
  /**
   * Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range
   * collections specially, encoding the slices as other Ranges to minimize memory cost. This makes
   * it efficient to run Spark over RDDs representing large sets of numbers. And if the collection
   * is an inclusive Range, we use inclusive range for the last slice.
   */
  def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    if (numSlices < 1) {
      throw new IllegalArgumentException("Positive number of partitions required")
    }
    // Sequences need to be sliced at the same set of index positions for operations
    // like RDD.zip() to behave as expected
    def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
      (0 until numSlices).iterator.map { i =>
        val start = ((i * length) / numSlices).toInt
        val end = (((i + 1) * length) / numSlices).toInt
        (start, end)
      }
    }
    seq match {
      case r: Range =>
        positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
          // If the range is inclusive, use inclusive range for the last slice
          if (r.isInclusive && index == numSlices - 1) {
            new Range.Inclusive(r.start + start * r.step, r.end, r.step)
          }
          else {
            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
          }
        }.toSeq.asInstanceOf[Seq[Seq[T]]]
      case nr: NumericRange[_] =>
        // For ranges of Long, Double, BigInteger, etc
        val slices = new ArrayBuffer[Seq[T]](numSlices)
        var r = nr
        for ((start, end) <- positions(nr.length, numSlices)) {
          val sliceSize = end - start
          slices += r.take(sliceSize).asInstanceOf[Seq[T]]
          r = r.drop(sliceSize)
        }
        slices
      case _ =>
        val array = seq.toArray // To prevent O(n^2) operations for List etc
        positions(array.length, numSlices).map { case (start, end) =>
          array.slice(start, end).toSeq
        }.toSeq
    }
  }
}
f) The key parts are:
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
}
// and
case _ =>
  val array = seq.toArray // To prevent O(n^2) operations for List etc
  positions(array.length, numSlices).map { case (start, end) =>
    array.slice(start, end).toSeq
  }.toSeq
From the source code we can draw the following conclusions:
partition start index = partition number * total data length / number of partitions (integer division)
partition end index = (partition number + 1) * total data length / number of partitions (integer division)
For rdd3 above, length (data length) = 5 and numSlices (number of partitions) = 3:

computation | index range [start, end) | partition contents
---|---|---
i = 0: start = (0*5)/3 = 0, end = (0+1)*5/3 = 1 | [0, 1) | partition 0: 1
i = 1: start = (1*5)/3 = 1, end = (1+1)*5/3 = 3 | [1, 3) | partition 1: 2, 3
i = 2: start = (2*5)/3 = 3, end = (2+1)*5/3 = 5 | [3, 5) | partition 2: 4, 5
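The same arithmetic can be reproduced in a few lines of plain Scala (a standalone sketch, no Spark required), which prints exactly the assignment shown in the table:

def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
  (0 until numSlices).iterator.map { i =>
    (((i * length) / numSlices).toInt, (((i + 1) * length) / numSlices).toInt)
  }

val data = Array(1, 2, 3, 4, 5)
positions(data.length, 3).zipWithIndex.foreach { case ((start, end), i) =>
  println(s"partition $i: ${data.slice(start, end).mkString(", ")}")   // same slicing as the case _ branch
}
// partition 0: 1
// partition 1: 2, 3
// partition 2: 4, 5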
2.3 Default partitioning source code (RDD created by reading a file)
1) Partitioning test
package com.xiao_after.partition
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* @author xiaohu
* @create 2020-09-24 18:49
*/
object partition03_file_default {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
    // 2. Create a SparkContext, the entry point for submitting a Spark application
    val sc: SparkContext = new SparkContext(conf)
    // 3. Default minimum number of partitions: min(defaultParallelism, 2), i.e. min(current number of cores, 2)
    val rdd: RDD[String] = sc.textFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\input\\1.txt")
    // Save the RDD to files
    rdd.saveAsTextFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\output4")
    // 4. Close the connection
    sc.stop()
  }
}
2) Default partitioning source code (RDD created by reading a file)
a) Ctrl + left-click textFile to view the textFile source code
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
b) Ctrl + left-click defaultMinPartitions to view the defaultMinPartitions source code
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Note: the source code shows that for an RDD created from a file, the default minimum number of partitions is the smaller of defaultParallelism and 2, where defaultParallelism in local mode is the number of CPU cores. The actual number of partitions is then determined by how the Hadoop InputFormat splits the file, using this value as a hint.
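minPartitions can also be passed explicitly to textFile. A minimal sketch (reusing the project paths above as an assumption):

val rdd: RDD[String] =
  sc.textFile("D:\\MyselfPractice\\HdfsClientDemo\\SparkCoreTest\\input\\1.txt", 3)
println(rdd.getNumPartitions)   // typically 3 or slightly more here; the exact count depends on the input splits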