A review of 30+ Spark operators

wtzhm

Licensed under CC 4.0 BY-SA

Category: sparksql. Tags: spark operators, rdd operators

Article link: https://blog.youkuaiyun.com/wtzhm/article/details/86288844

This article walks through Spark's operators, which fall into two classes: Transformation and Action. It describes what each of the following operators does and when to use it: map, flatMap, mapPartitions, glom, union, cartesian, filter, distinct, subtract, sample, takeSample, persist and cache, intersection, mapValues, groupByKey, reduceByKey, aggregateByKey, sortByKey, reduce, collect, count, first, take, takeOrdered, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, countByKey, and foreach.

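1. Spark operator categories

Transformation operators

Transformations are evaluated lazily: converting one RDD into another does not run immediately. The computation is only triggered once an Action is invoked.

Action operators

An Action makes Spark submit a job and moves data out of the Spark system, either back to the driver or into external storage.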
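
The lazy behaviour described above can be observed with an accumulator. The following is a minimal sketch rather than a definitive test: the object name `LazyEvalSketch` and the accumulator name are made up for illustration, and it assumes Spark 2.0 or later (for `longAccumulator`) running in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  // Returns how many times the map function had run before and after the Action.
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls") // hypothetical name, counts map invocations
    // Transformation only: this line records the lineage but runs nothing.
    val doubled = sc.parallelize(1 to 5).map { x => calls.add(1); x * 2 }
    val before: Long = calls.value
    doubled.collect() // Action: this is what actually submits the job
    val after: Long = calls.value
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("LazyEvalSketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

The first count stays at zero because `map` alone schedules no work; only `collect()` causes the function to be applied to the five elements.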

2. Transformation operators

2.1 Transformation operators on Value-type data

map

 def map[U](f: (T) ⇒ U): RDD[U]
 Return a new RDD by applying a function to all elements of this RDD.
 map passes each element of the original RDD through the user-defined function f and produces exactly one new element per input element.
 
 scala> val rdd = sc.parallelize(Array(1,2,3))
 
 scala> rdd.map(x=>x+2).collect
 res0: Array[Int] = Array(3, 4, 5)

flatMap

 def flatMap[U](f: (T) ⇒ TraversableOnce[U]): RDD[U]
 Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
 flatMap applies f to each element of the original RDD, where f returns a collection, and then merges the elements of all those collections into a single RDD. In older Spark versions the implementation creates FlatMappedRDD(this, sc.clean(f)); current versions build a MapPartitionsRDD instead.

 scala> val rdd = sc.parallelize(Array(1,2,3))
 
 scala> rdd.flatMap(x=>(x to 3)).collect
 res4: Array[Int] = Array(1, 2, 3, 2, 3, 3)

mapPartitions

 def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
 Return a new RDD by applying a function to each partition of this RDD.
 mapPartitions hands the function an iterator over each whole partition, so the function works on a partition at a time rather than on one element at a time. Internally it creates a MapPartitionsRDD.

 import org.apache.spark.{SparkConf, SparkContext}

 object SparkRdd {
     def main(args: Array[String]): Unit = {
         val conf = new SparkConf().setMaster("local[2]").setAppName("SparkRdd")
         val sc = new SparkContext(conf)
         val data = Array(1, 2, 3, 4)
         val rdd = sc.parallelize(data, 3)
         // The function is called once per partition, not once per element
         val rdd1 = rdd.mapPartitions(myFuncPerPartition)
         rdd1.foreach(println)
         sc.stop()
     }

     // Doubles every element of one partition, lazily, without buffering it in a list
     def myFuncPerPartition(iter: Iterator[Int]): Iterator[Int] =
         iter.map(_ * 2)
 }

glom

 def glom(): RDD[Array[T]]
 Return an RDD created by coalescing all elements within each partition into an array.

 glom turns each partition into a single array. In older Spark versions the implementation returns a GlommedRDD.

 scala> var rdd = sc.makeRDD(1 to 10,3)
 rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:21

 scala> rdd.glom().collect
 res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
 // glom puts the elements of each partition into one array, so the result is 3 arrays

union

 def union(other: RDD[T]): RDD[T]
 Return the union of this RDD and another one.
 
 union requires both RDDs to have the same element type, and the returned RDD has that same element type. No deduplication is performed: every element of both RDDs is kept.

 scala> val data = Array(1,2,3)
 data: Array[Int] = Array(1, 2, 3)
 
 scala> val data1=Array(2,4,5)
 data1: Array[Int] = Array(2, 4, 5)
 
 scala> val rdd1 = sc.parallelize(data)
 rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
 
 scala> val rdd2 = sc.parallelize(data1)
 rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26
 
 scala> rdd1.union(rdd2).collect
 res0: Array[Int] = Array(1, 2, 3, 2, 4, 5)

cartesian

 def cartesian[U](other: RDD[U]): RDD[(T, U)]
 
 Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this and b is in other.

 cartesian computes the Cartesian product of all elements of the two RDDs. Internally it returns a CartesianRDD.
 
 scala> val data = Array(1,2,3)
 data: Array[Int] = Array(1, 2, 3)

 scala> val data1=Array(2,4,5)
 data1: Array[Int] = Array(2, 4, 5)
 
 scala> val rdd1 = sc.parallelize(data)
 rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
 
 scala> val rdd2 = sc.parallelize(data1)
 rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26

 scala> rdd1.cartesian(rdd2).collect
 res5: Array[(Int, Int)] = Array((1,2), (1,4), (1,5), (2,2), (3,2), (2,4), (2,5), (3,4), (3,5))

filter

 def filter(f: (T) ⇒ Boolean): RDD[T]
  
 Return a new RDD containing only the elements that satisfy a predicate.
 
 filter applies the function f to every element: elements for which f returns true are kept in the RDD, and elements for which it returns false are dropped.
 
 scala> val rdd = sc.parallelize(1 to 10)
 scala> rdd.filter(x=>(x%2==0)).collect
 res7: Array[Int] = Array(2, 4, 6, 8, 10)

distinct

 def distinct(): RDD[T]
  
 Return a new RDD containing the distinct elements in this RDD.
 
 distinct removes duplicate elements from the RDD.
 
 scala> val rdd = sc.parallelize(Array(1,2,3,4,1,2))
  
 scala> rdd.distinct.collect
 res8: Array[Int] = Array(4, 2, 1, 3)

subtract

def subtract(other: RDD[T]): RDD[T]

Return an RDD with the elements from this that are not in other.

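subtract is the equivalent of set difference: it keeps the elements of this RDD that do not appear in other.

scala> val rdd1 = sc.parallelize(Array(1,3,5,7,9,2,4))

scala> val rdd2 = sc.parallelize(Array(2,3,4,5,7)) // assumed contents: the source does not show rdd2

scala> rdd1.subtract(rdd2).collect
res9: Array[Int] = Array(1, 9)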

sample

def sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T]

Return a sampled subset of this RDD.
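sample draws a sample from the elements of the RDD and returns a subset of them. The caller chooses whether to sample with replacement, the fraction to sample, and the random seed, which together determine how the sampling is done.

withReplacement: true samples with replacement, false samples without replacement.
fraction: the expected size of the sample as a fraction of this RDD's size. The actual sample size varies from run to run.
seed: the seed for the random number generator.

import org.apache.spark.{SparkConf, SparkContext}

object sample{
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getName)
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
    println(rdd1.sample(false,0.6).collect().mkString(",")) // prints e.g. 2,3,6,8,9 (varies per run)
    println(rdd1.sample(true,0.6).collect().mkString("."))  // prints e.g. 2.4.4.6 (varies per run)
    sc.stop()
  }
}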

takeSample

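takeSample() works on the same principle as sample above, but instead of a relative fraction it samples a fixed number of elements. The result is also no longer an RDD: it is as if collect() had been applied to the sampled data, so the result comes back as a local array on a single machine.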
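
To make the contrast with sample concrete, here is a small sketch of takeSample. The object name `TakeSampleSketch`, the data, and the seed value are illustrative choices, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeSampleSketch {
  // Returns a local Array, not an RDD.
  def run(sc: SparkContext): Array[Int] = {
    val rdd = sc.parallelize(1 to 9)
    // Ask for a fixed number of elements (3) instead of a fraction.
    // Without replacement, the three elements are distinct.
    rdd.takeSample(withReplacement = false, num = 3, seed = 42L)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("TakeSampleSketch"))
    try println(run(sc).mkString(",")) finally sc.stop()
  }
}
```

Which three elements come back depends on the seed; what stays fixed is the size of the result and the fact that it already lives on the driver.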

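persist and cache

cache() is a shortcut for persist(): it uses the default storage level, StorageLevel.MEMORY_ONLY.

* StorageLevel.MEMORY_ONLY: memory only, no serialization; this is the level that cache() gives you. Partitions that do not fit in memory are recomputed when they are needed.

* StorageLevel.MEMORY_ONLY_SER: stores the RDD as serialized Java objects. This is more space-efficient than the deserialized form, especially with a fast serializer, but reading the data costs more CPU.

* StorageLevel.MEMORY_AND_DISK: stores the RDD as deserialized Java objects in the JVM. Partitions that do not fit in memory are stored on disk and read from there when they are needed.

* StorageLevel.MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed each time they are needed.

* StorageLevel.DISK_ONLY: stores the RDD partitions only on disk.

* If memory is plentiful and you want two replicas for high reliability, pick a level with the _2 suffix, such as StorageLevel.MEMORY_ONLY_2.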
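
As a rough illustration of the list above, the sketch below caches one RDD with cache() and another with an explicit level, then reads back the level each one ended up with. The object name `PersistSketch` and the data are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns the storage levels assigned by cache() and by persist(MEMORY_AND_DISK).
  def run(sc: SparkContext): (StorageLevel, StorageLevel) = {
    val cached = sc.parallelize(1 to 100).map(_ * 2).cache()
    val spilled = sc.parallelize(1 to 100).map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)
    // Persisting is lazy as well: the data is stored when the first Action runs.
    cached.count()
    spilled.count()
    val levels = (cached.getStorageLevel, spilled.getStorageLevel)
    // Release the stored partitions once they are no longer needed.
    cached.unpersist()
    spilled.unpersist()
    levels
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("PersistSketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```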

intersection

	def intersection(other: RDD[T]): RDD[T]
	  
	Return the intersection of this RDD and another one.
	Computes the intersection of this RDD and another dataset. The result is deduplicated and returned in no particular order.

	scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5))
	scala> val rdd2 = sc.parallelize(Array(2,4,6))
	scala> rdd1.intersection(rdd2).collect
	res0: Array[Int] = Array(4, 2)

2.2 Transformation operators on Key-Value data

mapValues

mapValues applies a map operation to the Value of each (Key, Value) pair and leaves the Key untouched.

scala> val rdd = sc.parallelize(Array("kjlksdj","kdjio","ab","a")).map(x=>(x,x.length))

scala> rdd.collect
res1: Array[(String, Int)] = Array((kjlksdj,7), (kdjio,5), (ab,2), (a,1))

scala> rdd.mapValues(x=>x*10).collect
res2: Array[(String, Int)] = Array((kjlksdj,70), (kdjio,50), (ab,20), (a,10))

groupByKey

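 Called on a pair RDD, that is, an RDD of (K, V) pairs, it returns an RDD of (K, Iterable[V]) pairs. Its main purpose is to gather all values that share a key into one collection; the order of the values within that collection is not guaranteed.
 groupByKey holds all the values of a key in memory at once, so a key with too many values can easily cause an out-of-memory error.

 scala> val words = Array("one", "two", "two", "three", "three", "three")
 scala> val wordsRDD = sc.parallelize(words).map(word => (word, 1))
 scala> wordsRDD.groupByKey().collect
 res5: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))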

reduceByKey

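 reduceByKey resembles groupByKey but differs in what it produces. Take the pairs (a,1), (a,2), (b,1), (b,2): groupByKey yields the grouped result (a, (1, 2)), (b, (1, 2)), whereas reduceByKey with addition yields (a,3), (b,3).
 
 reduceByKey is mainly for aggregation; groupByKey is mainly for grouping.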
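
The difference can be seen by computing the same per-key sum both ways. This is a minimal sketch using the four pairs from the text; the object name `ReduceVsGroupSketch` is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroupSketch {
  // Returns the per-key sums computed with reduceByKey and with groupByKey.
  def run(sc: SparkContext): (Map[String, Int], Map[String, Int]) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1), ("b", 2)))
    // Aggregation: values of the same key are merged with the given function.
    val reduced = pairs.reduceByKey(_ + _).collect().toMap
    // Grouping: values are first collected per key, then summed in a second step.
    val grouped = pairs.groupByKey().mapValues(_.sum).collect().toMap
    (reduced, grouped)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ReduceVsGroupSketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Both routes give the same sums here; the groupByKey route simply materializes every value of a key before adding them up, which is what makes it the more memory-hungry choice for plain aggregation.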

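aggregateByKey

Similar to reduceByKey, it aggregates the values that share a key in a pair RDD. It takes an initial (zero) value, which is used in seqOp but not in combOp. The return value is a pair RDD, which distinguishes it from aggregate, whose return value is not an RDD.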
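
One common use of aggregateByKey is a per-key average, where the accumulated type (a sum and a count) differs from the value type. The sketch below assumes made-up data and a hypothetical object name `AggregateByKeySketch`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  // Returns the average value per key.
  def run(sc: SparkContext): Map[String, Double] = {
    val scores = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4), ("b", 6), ("b", 8)), 2)
    val sumAndCount = scores.aggregateByKey((0, 0))(
      // seqOp: folds one value into the (sum, count) accumulator, starting from the zero value
      (acc, v) => (acc._1 + v, acc._2 + 1),
      // combOp: merges two accumulators built on different partitions
      (x, y) => (x._1 + y._1, x._2 + y._2))
    sumAndCount.mapValues { case (sum, n) => sum.toDouble / n }.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("AggregateByKeySketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Because the zero value may be applied once per key in each partition, it is safest to choose one that is neutral for seqOp, such as (0, 0) here.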
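
sortByKey

Operates on a pair RDD and sorts it by key. The ascending parameter defaults to true, meaning ascending order; the optional numTasks parameter sets the number of tasks used for the sort.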
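
A short sketch of both sort directions follows. The object name `SortByKeySketch` and the three pairs are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByKeySketch {
  // Returns the pairs sorted by key in ascending and in descending order.
  def run(sc: SparkContext): (List[(Int, String)], List[(Int, String)]) = {
    val rdd = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    val ascending = rdd.sortByKey().collect().toList                    // ascending = true by default
    val descending = rdd.sortByKey(ascending = false).collect().toList
    (ascending, descending)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("SortByKeySketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```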

join

Joins this RDD with another one. Called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs.
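
The following sketch shows which keys survive a join and how repeated keys pair up. The object name `JoinSketch` and both datasets are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  // Returns the joined pairs as a Set, since join does not guarantee an order.
  def run(sc: SparkContext): Set[(String, (Int, String))] = {
    val left = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("a", "z"), ("c", "y")))
    // Only keys present on both sides are kept; "b" and "c" drop out.
    // Key "a" appears twice on the right, so it produces two output pairs.
    left.join(right).collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("JoinSketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```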

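3. Action operators

reduce(function)

 reduce passes the elements of the RDD to the given function two at a time, producing a new value; that value and the next element are then passed to the function again, and so on until only a single value remains.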
 
 val rdd = sc.parallelize(Array(1,2,3,4,5))
 
 rdd.reduce((x:Int,y:Int)=>{ x+y })
 res0: Int = 15

collect()

Returns all the elements of the RDD as an Array.

count()

Returns the number of elements in the dataset, as a Long.

first()

Returns the first element of the dataset (similar to take(1)).

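takeSample(withReplacement, num, [seed])

 Randomly samples a dataset and returns an array of num sampled elements. withReplacement chooses whether to sample with replacement, and seed sets the seed of the random number generator.
 
 Use this method only when the resulting array is expected to be small, because all of its data is loaded into the driver's memory.

take(n)

 Returns an array with the first n elements of the dataset (the elements at indices 0 through n-1), without sorting.

takeOrdered(n, [ordering])

Returns the first n elements of the RDD as ordered by the default (ascending) ordering or by a custom comparator, that is, the n smallest elements under that ordering.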
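
The sketch below puts first, take, and takeOrdered side by side on the same data. The object name `TakeSketch` and the five numbers are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeSketch {
  // Returns (first, take(3), takeOrdered(3), takeOrdered(3) with a reversed ordering).
  def run(sc: SparkContext): (Int, List[Int], List[Int], List[Int]) = {
    val rdd = sc.parallelize(Seq(5, 3, 8, 1, 9), 2)
    val head = rdd.first()                                           // first element in partition order
    val firstThree = rdd.take(3).toList                              // first three elements, unsorted
    val smallest = rdd.takeOrdered(3).toList                         // three smallest, ascending
    val largest = rdd.takeOrdered(3)(Ordering[Int].reverse).toList   // custom ordering: three largest
    (head, firstThree, smallest, largest)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("TakeSketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

take follows the order in which the data was laid out across partitions, while takeOrdered ignores that layout and depends only on the ordering it is given.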
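
saveAsTextFile(path)

Writes the elements of the dataset as a text file to the local filesystem, HDFS, or a similar store. Spark calls toString on each element to turn it into one line of the text file.

saveAsSequenceFile(path) (Java and Scala)

Writes the elements of the dataset as a Hadoop SequenceFile to the local filesystem, HDFS, or a similar store. (This operates on pair RDDs.)

saveAsObjectFile(path) (Java and Scala)

Writes the elements of the dataset in ObjectFile form to the local filesystem, HDFS, or a similar store.

countByKey()

Counts the elements for each key K in an RDD[(K, V)] and returns a map of (K, Long) pairs holding the count for each key.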
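
A small sketch of countByKey follows, reusing the word list from the groupByKey section; the object name `CountByKeySketch` is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountByKeySketch {
  // Returns a local map from each key to the number of pairs carrying that key.
  def run(sc: SparkContext): scala.collection.Map[String, Long] = {
    val words = sc.parallelize(Seq("one", "two", "two", "three", "three", "three"))
    // The values are ignored; only the number of pairs per key is counted.
    words.map(w => (w, 1)).countByKey()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("CountByKeySketch"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Since the result is a local map on the driver, this Action suits datasets with a modest number of distinct keys.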

foreach(function)

Runs the function on each element of the dataset.