Transformation operations:
0. Reference: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#intersection
$> spark-shell --master spark://master:7077
http://master:8080/
1. map, flatMap, distinct
map: applies the given function to every item of the RDD, mapping each into a new element.
Input and output partitions correspond one to one: the output has exactly as many partitions as the input.
flatMap: like map, but flattens all the resulting collections into a single collection at the end.
distinct: removes duplicate elements from the RDD.
Note: when flatMap is applied to an RDD[String], each String is itself treated as a sequence of characters.
scala> val rdd =sc.textFile("/input/input1.txt")
rdd: org.apache.spark.rdd.RDD[String] = /worldcount/test1.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val rdd1 = rdd.map(x=>x.split(" "))
rdd1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26
scala> rdd1.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(how, are, you?), Array(ni, hao), Array(hello, tom))
scala> val rdd2 = rdd1.flatMap(x=>x)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:28
scala> rdd2.collect
res1: Array[String] = Array(hello, world, how, are, you?, ni, hao, hello, tom)
scala> rdd2.flatMap(x=>x).collect
res3: Array[Char] = Array(h, e, l, l, o, w, o, r, l, d, h, o, w, a, r, e, y, o, u, ?, n, i, h, a, o, h, e, l, l, o, t, o, m)
scala> val rdd3 = rdd2.distinct
rdd3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at distinct at <console>:30
scala> rdd3.collect
res4: Array[String] = Array(are, tom, how, you?, hello, hao, world, ni)
2. coalesce and repartition:
coalesce changes an RDD's partition count; repartition repartitions.
coalesce: changes the number of partitions and produces a new RDD.
It takes two parameters: the target partition count, and a Boolean shuffle flag, which defaults to false.
Reducing the partition count works with shuffle = false;
increasing it beyond the current count requires shuffle = true (otherwise the count stays unchanged).
Typical use: when a filter or similar operation sharply shrinks the data within partitions, consider repartitioning.
Check the RDD's default partition count:
scala> rdd.partitions.size
res4: Int = 2
The default is 2 partitions; shrinking works. Change the count and produce a new RDD:
scala> val rdd4 = rdd.coalesce(1)
rdd4: org.apache.spark.rdd.RDD[String] = CoalescedRDD[8] at coalesce at <console>:26
scala> rdd4.partitions.size
res10: Int = 1
With the default 2 partitions, growing without shuffle does NOT work:
scala> val rdd5 = rdd.coalesce(5)
rdd5: org.apache.spark.rdd.RDD[String] = CoalescedRDD[9] at coalesce at <console>:26
scala> rdd5.partitions.size
res12: Int = 2
With the default 2 partitions, growing works once shuffle is set to true:
scala> val rdd5 = rdd.coalesce(5,true)
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at coalesce at <console>:26
scala> rdd5.partitions.size
res13: Int = 5
repartition can both increase and decrease the partition count:
scala> val rdd6 = rdd5.repartition(8)
rdd6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:34
scala> rdd6.partitions.size
res6: Int = 8
******* Changing the partition count changes the number of Tasks *******
1) textFile can set the partition count at load time; to change it after loading, use the two methods above.
2) Typical scenario: after data cleaning the data shrinks, and glom may show empty partitions; repartitioning solves this.
3. randomSplit:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
Randomly splits the RDD according to the given weights, returning an array with one RDD per weight.
Example application: total ordering (as in a Hadoop total sort).
scala> val rdd = sc.parallelize(List(1,2,3,4,5,6,7))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
// 0.7 + 0.1 + 0.2 = 1: the 7 elements are allocated by weight (weights that do not sum to 1 are normalized)
scala> val rdd1 = rdd.randomSplit(Array(0.7,0.1,0.2))
rdd1: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[1] at randomSplit at <console>:26, MapPartitionsRDD[2] at randomSplit at <console>:26, MapPartitionsRDD[3] at randomSplit at <console>:26)
scala> rdd1(0).collect
res0: Array[Int] = Array(1, 5)
scala> rdd1(1).collect
res1: Array[Int] = Array()
scala> rdd1(2).collect
res2: Array[Int] = Array(2, 3, 4, 6, 7)
The RDD is split into new RDDs according to the weights.
4. glom — returns the data items of each partition as an array:
scala>val a = sc.parallelize(1 to 100, 3)
scala>a.glom.collect
5. union: merges two RDDs without deduplication:
scala>val rdd1 = sc.parallelize(1 to 5)
scala>val rdd2 = sc.parallelize(5 to 10)
scala>val rdd3 =rdd1.union(rdd2)
6. subtract: set difference:
val a = sc.parallelize(1 to 9, 3)
val b = sc.parallelize(1 to 3, 3)
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)
7. intersection: intersection of two RDDs; the result is deduplicated:
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)
z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)
8. mapPartitions
Operates on one whole partition at a time.
Similar to map, but where map operates on each individual element of the RDD,
mapPartitions (like foreachPartition) operates on the iterator of each partition.
val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
var res = List[(T, T)]()
var pre = iter.next
while (iter.hasNext)
{
val cur = iter.next;
res .::= (pre, cur) // equivalent to res = (pre, cur) :: res (right-associative operator)
pre = cur;
}
res.iterator
}
Builds res (a List[(T, T)]) by pairing adjacent items within each partition:
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
If the map step must repeatedly create an expensive object
(for example, when writing the RDD's data to a database over JDBC,
map would open one connection per element while mapPartitions opens one per partition),
then mapPartitions is far more efficient than map.
Application: for database work on an RDD, use mapPartitions so each partition instantiates a single connection (conn) object.
Both map and mapPartitions are common in Spark; the code below illustrates the difference:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer
object MapAndPartitions {
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf()
.setAppName("map_mapPartitions_demo").setMaster("local"))
val arrayRDD =sc.parallelize(Array(1,2,3,4,5,6,7,8,9))
//map processes one element (row) at a time
arrayRDD.map(element=>{
element
}).foreach(println)
/*mapPartitions processes one batch at a time:
arrayRDD is split into x batches (partitions),
elements is one such batch,
and mapPartitions returns a batch (an iterator)*/
arrayRDD.mapPartitions(elements=>{
var result = new ArrayBuffer[Int]()
elements.foreach(element=>{
result.+=(element)
})
result.iterator
}).foreach(println)
}
}
Both functions ultimately produce the same result.
mapPartitions suits batch processing, e.g. inserting data into a table: each batch opens the database connection only once,
greatly reducing connection overhead. Pseudocode:
arrayRDD.mapPartitions(datas=>{
dbConnect = getDbConnect() //open a database connection
datas.foreach(data=>{
dbConnect.insert(data) //insert each element
})
dbConnect.commit() //commit the transaction
dbConnect.close() //close the connection
})
9. mapPartitionsWithIndex
Like mapPartitions, but the function takes two parameters.
The first is the partition index; the second is an iterator over all items in that partition.
The output is an iterator containing the items after whatever transformation the function encodes.
import org.apache.spark.{SparkConf, SparkContext}
object MapPartitionWithIndex {
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new
SparkConf().setMaster("local").setAppName("map_partition_index"))
val rdd = sc.parallelize(1 to 10,3)
println(rdd.mapPartitionsWithIndex(myfun).collect().toList)
}
def myfun(index:Int,iter:Iterator[Int]):Iterator[String]={
iter.map(x=>index+","+x)
}
}
Note: the iterator's type (here Iterator[Int]) must match the RDD's element type.
scala> rdd.mapPartitionsWithIndex(myfun).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)
10. zip: combines two RDDs into a new pair RDD
def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]
Notes: 1. the two RDDs may have different element types;
2. both RDDs must have the same number of partitions;
3. each pair of corresponding partitions must hold the same number of elements.
If the element counts differ, the job fails: TaskSetManager: Lost task 0.0 in stage 11.0 (TID 29, 192.168.179.131, executor 1):
org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
scala> val rdd = sc.parallelize(1 to 10 ,3)
scala> val rdd1 = sc.parallelize(List("a","b","c","d","e","f","g","h","i","j"),3)
scala> val rdd2 =rdd.zip(rdd1)
scala> rdd2.collect
res18: Array[(Int, String)] = Array((1,a), (2,b), (3,c), (4,d), (5,e), (6,f), (7,g), (8,h), (9,i), (10,j))
11. zipPartitions — requires only that the RDDs have the same number of partitions (per-partition element counts may differ).
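No transcript is shown for zipPartitions above. As a rough sketch in plain Scala (no Spark needed; the object and helper names are made up for illustration), its per-partition behavior can be mimicked by handing each pair of corresponding partition iterators to a user function:

```scala
object ZipPartitionsSketch {
  // Two "RDDs" represented as sequences of partitions (same partition count,
  // but different element counts per partition — allowed here, unlike zip).
  val a: Seq[Seq[Int]]    = Seq(Seq(1, 2), Seq(3, 4, 5))
  val b: Seq[Seq[String]] = Seq(Seq("a"), Seq("b", "c"))

  // zipPartitions-style combinator: apply f to each pair of partition iterators.
  def zipPartitions[T, U, R](x: Seq[Seq[T]], y: Seq[Seq[U]])
                            (f: (Iterator[T], Iterator[U]) => Iterator[R]): Seq[Seq[R]] =
    x.zip(y).map { case (px, py) => f(px.iterator, py.iterator).toList }

  def main(args: Array[String]): Unit = {
    // Here f zips the two iterators; extra elements in a partition are dropped.
    val result = zipPartitions(a, b)((i, j) => i.zip(j))
    println(result) // List(List((1,a)), List((3,b), (4,c)))
  }
}
```

The real zipPartitions passes the actual partition iterators of both RDDs to the user function, so this only illustrates the calling contract, not Spark's execution.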
12. zipWithIndex
def zipWithIndex(): RDD[(T, Long)]
Pairs each element of the RDD with its index, producing a new RDD[(T, Long)].
rdd.zipWithIndex.collect
res8: Array[(Int, Long)] = Array((1,0), (2,1), (3,2), (4,3), (5,4), (6,5), (7,6), (8,7), (9,8), (10,9))
scala> res8.glom().collect
res9: Array[Array[(Int, Long)]] = Array(Array((1,0), (2,1), (3,2)), Array((4,3), (5,4), (6,5)), Array((7,6), (8,7), (9,8), (10,9)))
13. zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]
Pairs each RDD element with a unique ID, generated as follows:
the first element of each partition gets the partition index as its ID;
the N-th element of a partition gets (the previous element's ID) + (the RDD's total partition count).
scala> val rdd = sc.parallelize(List(1,2,3,4,5),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
scala> rdd.glom.collect
res25: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
scala> val rdd2 = rdd.zipWithUniqueId()
rdd2: org.apache.spark.rdd.RDD[(Int, Long)] = MapPartitionsRDD[23] at zipWithUniqueId at <console>:26
scala> rdd2.collect
res26: Array[(Int, Long)] = Array((1,0), (2,2), (3,1), (4,3), (5,5))
How the IDs are computed:
step 1: the first element of partition 0 gets 0; the first element of partition 1 gets 1.
step 2: the second element of partition 0 gets 0 + 2 = 2.
step 3: the second element of partition 1 gets 1 + 2 = 3; the third gets 3 + 2 = 5.
So the partitions Array(Array(1, 2), Array(3, 4, 5)) receive IDs (0, 2) and (1, 3, 5).
With 3 partitions:
scala> val rdd = sc.parallelize(List(1,2,3,4,5),3)
res37: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
IDs per partition: (0), (1, 4), (2, 5)
res38: Array[(Int, Long)] = Array((1,0), (2,1), (3,4), (4,2), (5,5))
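The rule above can be checked in plain Scala without Spark (the object and the `uniqueIds` helper are hypothetical names for this sketch): element k, counting from 0, of partition p receives ID p + k * numPartitions.

```scala
object UniqueIdSketch {
  // zipWithUniqueId-style IDs for data already split into partitions:
  // element k (0-based) of partition p gets ID  p + k * numPartitions.
  def uniqueIds[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = partitions.size
    for {
      (part, p) <- partitions.zipWithIndex
      (elem, k) <- part.zipWithIndex
    } yield (elem, p + k.toLong * n)
  }

  def main(args: Array[String]): Unit = {
    // 2 partitions, as in the transcript: Array(Array(1,2), Array(3,4,5))
    println(uniqueIds(Seq(Seq(1, 2), Seq(3, 4, 5))))
    // List((1,0), (2,2), (3,1), (4,3), (5,5))

    // 3 partitions: Array(Array(1), Array(2,3), Array(4,5))
    println(uniqueIds(Seq(Seq(1), Seq(2, 3), Seq(4, 5))))
    // List((1,0), (2,1), (3,4), (4,2), (5,5))
  }
}
```

Both outputs match res26 and res38 in the transcripts above (up to element ordering, which Spark preserves per partition).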
14.reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Merges the values that share the same key using the given function:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
15.keyBy
def keyBy[K](f: T => K): RDD[(K, T)]
Uses the return value of f as the key, pairing it with each element of the RDD to form a pair RDD (RDD[(K, T)]):
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
16.groupByKey()
def groupByKey(): RDD[(K, Iterable[V])]
Groups values with the same key together, returning RDD[(K, Iterable[V])]:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
b.groupByKey.collect
res11: Array[(Int, Iterable[String])] = Array((6,CompactBuffer(salmon, salmon)), (3,CompactBuffer(rat, dog)), (8,CompactBuffer(elephant)))
17.keys
def keys: RDD[K]
Returns the keys of a key-value RDD as an RDD:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
scala> b.collect
res1: Array[(Int, String)] = Array((3,dog), (5,tiger), (4,lion), (3,cat), (7,panther), (5,eagle))
b.keys.collect
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)
18.values
def values: RDD[V]
Returns the values of a key-value RDD as an RDD:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.values.collect
res3: Array[String] = Array(dog, tiger, lion, cat, panther, eagle)
19.sortByKey
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P]
Sorts by key; ascending defaults to true:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3)) // a < c < d < g < o
c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
20.partitionBy
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
Repartitions the RDD with the given Partitioner:
scala> val rdd = sc.parallelize(List((1,"a"),(2,"b"),(3,"c"),(4,"d")),2)
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[26] at parallelize at <console>:24
Check the original partitioning:
scala> rdd.glom.collect
res28: Array[Array[(Int, String)]] = Array(Array((1,a), (2,b)), Array((3,c), (4,d)))
Partition into 2 with a HashPartitioner; as with Java's hashCode, keys whose hash % 2 == 0 land in the same partition:
scala> val rdd1=rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[28] at partitionBy at <console>:26
Check the partitioning after the HashPartitioner:
scala> rdd1.glom.collect
res29: Array[Array[(Int, String)]] = Array(Array((4,d), (2,b)), Array((1,a), (3,c)))
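The placement rule can be sketched in plain Scala without Spark (the object and `partitionOf` are hypothetical names; the function mirrors HashPartitioner's non-negative hashCode-mod rule):

```scala
object HashPartitionSketch {
  // HashPartitioner's rule: key.hashCode modulo numPartitions, made non-negative.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val pairs = List((1, "a"), (2, "b"), (3, "c"), (4, "d"))
    // Group pairs by their target partition (2 partitions):
    val byPartition = pairs.groupBy { case (k, _) => partitionOf(k, 2) }
    println(byPartition(0)) // List((2,b), (4,d))
    println(byPartition(1)) // List((1,a), (3,c))
  }
}
```

For Int keys the hashCode is the value itself, so even keys go to partition 0 and odd keys to partition 1, matching the glom output above.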
21.mapValues[Pair]
def mapValues[U](f: V => U): RDD[(K, U)]
Maps RDD[(K, V)] to RDD[(K, U)]: applies f: V => U to each value and leaves keys unchanged:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))
22.flatMapValues[Pair]
def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l), (4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p), (7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g), (5,l), (5,e), (5,x))
23. subtractByKey[Pair]
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
Removes the elements whose key also appears in the other RDD:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
b.subtractByKey(d).collect
res15: Array[(Int, String)] = Array((4,lion))
24. combineByKey[Pair] (aggregation)
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
The three key function parameters:
createCombiner: V => C — takes the current value and returns it after any setup work, e.g. a type conversion (an initialization step);
mergeValue: (C, V) => C — merges a value V into an existing combiner C (runs within each partition);
mergeCombiners: (C, C) => C — merges two combiners (runs across partitions).
In other words: createCombiner fires the first time a key appears in a partition;
mergeValue fires on every later occurrence of that key in the same partition;
mergeCombiners merges the values of the same key from different partitions.
Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
Step by step:
a.glom: Array[Array[String]] = Array(Array(dog, cat, gnu), Array(salmon, rabbit, turkey), Array(wolf, bear, bee))
b.glom: Array[Array[Int]] = Array(Array(1, 1, 2), Array(2, 2, 1), Array(2, 2, 2))
c = b.zip(a): Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey), (2,wolf), (2,bear), (2,bee))
c.glom: Array[Array[(Int, String)]] = Array(Array((1,dog), (1,cat), (2,gnu)), Array((2,salmon), (2,rabbit), (1,turkey)), Array((2,wolf), (2,bear), (2,bee)))
d: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
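The same three-function contract can be sketched in plain Scala over pre-partitioned data (a hypothetical helper for illustration, not Spark's implementation), and it reproduces the trace above:

```scala
object CombineByKeySketch {
  // Within each partition: createCombiner on a key's first value, mergeValue after;
  // across partitions: mergeCombiners on the per-partition results.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])
                           (createCombiner: V => C,
                            mergeValue: (C, V) => C,
                            mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k) match {
          case None    => createCombiner(v) // first occurrence of k in this partition
          case Some(c) => mergeValue(c, v)  // k already seen in this partition
        })
      }
    }
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq((1, "dog"), (1, "cat"), (2, "gnu")),
      Seq((2, "salmon"), (2, "rabbit"), (1, "turkey")),
      Seq((2, "wolf"), (2, "bear"), (2, "bee")))
    val res = combineByKey(parts)((v: String) => List(v),
                                  (x: List[String], y: String) => y :: x,
                                  (x: List[String], y: List[String]) => x ::: y)
    println(res(1)) // List(cat, dog, turkey)
    println(res(2)) // List(gnu, rabbit, salmon, bee, bear, wolf)
  }
}
```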
25.foldByKey[Pair]
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Similar to reduceByKey, but curried: an initial zeroValue must be supplied first:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("")(_ + _).collect
res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))
With initial value "@":
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.foldByKey("@")(_ + _).collect
res85: Array[(Int, String)] = Array((4,@lion), (3,@cat@dog), (7,@panther), (5,@eagle@tiger))
26.join
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Performs an inner join of two pair RDDs by key:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
join example:
--------------------------------------
name.txt
1 zhangsan
2 lisi
3 wangwu
4 zhagnliu
5 haha
score.txt
1 600
2 590
3 450
4 610
5 544
import org.apache.spark.{SparkConf, SparkContext}
object JoinTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setAppName("join")
conf.setMaster("local")
val sc = new SparkContext(conf)
//the name list
val nameRdd1 = sc.textFile("E:/inputwordcount/name.txt")
val nameRdd2 = nameRdd1.map(line => {
var arr = line.split(" ")
(arr(0).toInt,arr(1))
})
//the scores
val score1 = sc.textFile("E:/inputwordcount/score.txt")
val score2 = score1.map(line =>{
var arr = line.split(" ")
(arr(0).toInt,arr(1).toInt)
})
val joinrdd = nameRdd2.join(score2)
//val joinrdd2 = joinrdd.sortBy(x=>(x._1),false)
joinrdd.collect().foreach(r=>{
println(r._1+":"+r._2)
})
//val joinrdd2 = joinrdd.sortBy(x=>(x._1),false).map(r=>(r._1+":"+r._2))
//joinrdd2.saveAsTextFile("E:/inputwordcount/spark_wordcount_res3")
}
}
----------------------------------------------------
27. rightOuterJoin
Joins two RDDs by key, keeping every key of the second (right) RDD; left values with no match become None (right outer join):
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.rightOuterJoin(d).collect
res2: Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))
28. leftOuterJoin
Joins two RDDs by key, keeping every key of the first (left) RDD; right values with no match become None (left outer join):
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect
29. cogroup: groups the values that share a key across up to three key-value RDDs (a full outer grouping):
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c))),
(3,(ArrayBuffer(b),ArrayBuffer(c))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)
val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] =
Array(
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)
val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),
(2,(ArrayBuffer(banana),ArrayBuffer())),
(3,(ArrayBuffer(orange),ArrayBuffer())),
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
(5,(ArrayBuffer(),ArrayBuffer(computer))))
30. cartesian: Cartesian product. Example: generating every account/password combination (simulating a brute-force attempt):
import org.apache.spark.{SparkConf, SparkContext}
object CartesianDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
conf.setAppName("cartesian")
conf.setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.parallelize(Array("zhangsan","lisi","wangwu","zhaoliu"))
val rdd2 = sc.parallelize(Array("123456","123","456","236"))
val rdd = rdd1.cartesian(rdd2)
rdd.collect().foreach(println(_))
}
}