1. Key-value operations (single RDD)
(1) collectAsMap (converts a key-value RDD into a Map; duplicate keys are collapsed, with later values overwriting earlier ones)
scala> val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6)), 2)
pairRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> pairRDD.collectAsMap()
res2: scala.collection.Map[Int,Int] = Map(1 -> 2, 3 -> 6)
(2) lookup (for a key-value RDD, returns all values associated with the given key; note the result is a sequence of values, not a sum)
scala> val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6)), 2)
pairRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> pairRDD.lookup(3)
res5: Seq[Int] = WrappedArray(4, 6)
(3) combineByKey (for pairStrRDD, compute per key both the sum of all its values and the number of times the key appears)
val pairStrRDD = sc.parallelize[(String, Int)](Seq(("coffee", 1),
("coffee", 2), ("panda", 3), ("coffee", 9)), 2)
def createCombiner = (value: Int) => (value, 1)
def mergeValue = (acc: (Int, Int), value: Int) => (acc._1 + value, acc._2 + 1)
def mergeCombiners = (acc1: (Int, Int), acc2: (Int, Int)) =>
(acc1._1 + acc2._1, acc1._2 + acc2._2)
//Goal: for each key in pairStrRDD, compute the sum of all its values and the number of times the key appears
//The three required arguments:
//createCombiner: V => C, ==> Int -> (Int, Int)
//mergeValue: (C, V) => C, ==> ((Int, Int), Int) -> (Int, Int)
//mergeCombiners: (C, C) => C ==> ((Int, Int), (Int, Int)) -> (Int, Int)
val testCombineByKeyRDD =
pairStrRDD.combineByKey(createCombiner, mergeValue, mergeCombiners)
testCombineByKeyRDD.collect()
scala> testCombineByKeyRDD.collect()
res7: Array[(String, (Int, Int))] = Array((coffee,(12,3)), (panda,(3,1)))
(4) aggregateByKey (for a key-value RDD, sums the values of each key while counting how many times the key appears; the zero value plus the seqOp play the role of createCombiner. The pairRDD used here contains (1,2), (3,4), (3,6), (5,6), as the collect below shows.)
scala> pairRDD.aggregateByKey((0, 0))( // createCombiner = mergeValue((0, 0), v)
| (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), //mergeValue
| (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) // mergeCombiners
| ).collect()
res10: Array[(Int, (Int, Int))] = Array((1,(2,1)), (3,(10,2)), (5,(6,1)))
scala> pairRDD.collect
res11: Array[(Int, Int)] = Array((1,2), (3,4), (3,6), (5,6))
(5) reduceByKey (merges all the values of each key with the given function; here the values are summed, so collect returns (key, sum of values) pairs)
val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6), (5, 6)), 2)
scala> pairRDD.collect
res13: Array[(Int, Int)] = Array((1,2), (3,4), (3,6), (5,6))
scala> pairRDD.reduceByKey((x, y) => x + y).collect()
res14: Array[(Int, Int)] = Array((1,2), (3,10), (5,6))
(6) groupByKey (note the result type is RDD[(Int, Iterable[Int])]: each key maps to the collection of all its values)
scala> val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6), (5, 6)), 2)
pairRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> pairRDD.groupByKey().collect()
res3: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(2)), (3,CompactBuffer(6, 4)), (5,CompactBuffer(6)))
++++++++++++++++++++++++++++++++++++++++++++++
pairRDD.groupByKey().map { case (key, iter) =>
val sortedValues = iter.toArray.sorted
(key, sortedValues)
}.collect()
++++++++++++++++++++++++++++++++++++++++
val pairRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 1), ("a", 2),
("c", 4), ("b", 1), ("a", 1), ("a", 1)), 3)
val a = pairRDD.groupByKey()
scala> val b = a.map{case (key, iter) =>
| val sortedValues = iter.toArray.sorted
| val m1 = sortedValues.max
| (key, m1)
| }
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[24] at map at <console>:28
scala> b.collect
res9: Array[(String, Int)] = Array((c,4), (a,2), (b,2))
(7) countByKey (for a key-value RDD, counts how many times each key appears and returns the counts as a Map[K, Long]. Note this is an action, so the result is no longer an RDD; it looks a lot like collectAsMap, except that collectAsMap keeps only one value per key while countByKey counts every occurrence.)
scala> val pair = sc.parallelize((1 to 10).zipWithIndex)
pair: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[30] at parallelize at <console>:24
scala> val b = pair.countByKey
b: scala.collection.Map[Int,Long] = Map(5 -> 1, 10 -> 1, 1 -> 1, 6 -> 1, 9 -> 1, 2 -> 1, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 1)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("B",3),("A",0),("A",0)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[39] at makeRDD at <console>:25
scala>
scala> val rdd2 = rdd1.countByKey
rdd2: scala.collection.Map[String,Long] = Map(A -> 4, B -> 3)
(8) mapValues (for a key-value RDD, applies a function to the values only; equivalent to val a = pairRDD.map(x => (x._1, x._2 + 1)))
scala> pairRDD
res39: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[40] at parallelize at <console>:25
scala> pairRDD.collect
res40: Array[(Int, Int)] = Array((5,2), (7,4), (3,3), (2,4))
scala> val mapValuesRDD = pairRDD.mapValues(x => x + 1)
mapValuesRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[41] at mapValues at <console>:26
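For reference (expected output, not copied from the original transcript), collecting should show each value incremented by one while the keys stay unchanged:
scala> mapValuesRDD.collect
// expected: Array((5,3), (7,5), (3,4), (2,5))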
(9) sortByKey (sorts a key-value RDD by key; string keys are sorted alphabetically)
val pairRDD =
sc.parallelize[(Int, Int)](Seq((5, 2), (7, 4), (3, 3), (2, 4)), 4)
scala> val a = pairRDD.sortByKey()
a: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[74] at sortByKey at <console>:26
scala> a.collect
res69: Array[(Int, Int)] = Array((2,4), (3,3), (5,2), (7,4))
(10) sortBy (sorts an RDD of tuples by whichever component you pick; string components are sorted alphabetically. A very handy function, in the same spirit as filter and groupBy; a descending variant is sketched after the example.)
scala> val test = sc.parallelize(Seq(("aworld",2),("hello",1)))
test: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[90] at parallelize at <console>:24
scala> val test1 = test.sortBy(_._2)
test1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[95] at sortBy at <console>:26
scala> test1.collect
res74: Array[(String, Int)] = Array((hello,1), (aworld,2))
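sortBy also takes an ascending flag (and an optional partition count), so a descending sort by the second component is a one-liner — a sketch, not from the original transcript:
scala> test.sortBy(_._2, ascending = false).collect
// expected: Array((aworld,2), (hello,1))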
(11) filterByRange (keeps only the pairs whose key falls inside the given inclusive range; note it is defined only for key-value (two-element tuple) RDDs, and it can skip whole partitions when the RDD is range-partitioned, e.g. after sortByKey)
scala> val rangeTestRDD =
sc.parallelize[(Int, Int)](Seq((5, 2), (7, 4), (3, 6), (2, 6), (3, 6), (4,2),(3,777),(2, 6)), 4)
rangeTestRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[99] at parallelize at <console>:25
scala> val test = rangeTestRDD.filterByRange(3, 5)
test: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[100] at filterByRange at <console>:26
scala> test.collect
res76: Array[(Int, Int)] = Array((5,2), (3,6), (3,6), (4,2), (3,777))
(12) foldByKey (for a key-value RDD, folds the values of each key, like reduceByKey except that a zero value is supplied. The zero value is applied once per key per partition, which explains the 270 for key 2 below; see the check sketched after the example.)
scala> val a = sc.parallelize(Seq((1,2),(2,20),(3,30),(2,50)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[34] at parallelize at <console>:25
scala> val b = a.foldByKey(100)(_+_)
b: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[35] at foldByKey at <console>:27
scala> b.collect
res15: Array[(Int, Int)] = Array((1,102), (2,270), (3,130))
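To see why key 2 yields 270 rather than 170, inspect the partition layout with glom (a quick check; the exact layout depends on the default parallelism, which here put (2,20) and (2,50) in different partitions):
scala> a.glom.collect
// e.g. Array(Array((1,2)), Array((2,20)), Array((3,30)), Array((2,50)))
// foldByKey applies the zero value 100 once per key per partition: (100+20) + (100+50) = 270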
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
2. Operations on the elements of a single RDD
(1) distinct (removes duplicate elements from an RDD; the elements can be numbers, strings, or even mixed types, as the examples show)
val rdd = sc.parallelize(Seq(1,2,2,3,1))
scala> val distinctRDD = rdd.distinct()
distinctRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at distinct at <console>:26
scala> distinctRDD.collect
res3: Array[Int] = Array(1, 2, 3)
+++++++++++++++++++++++++++++++++++++++++++++++++++++
scala> val rdd = sc.parallelize(Seq("hello","ni","hao","hello"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[21] at parallelize at <console>:24
scala> val b = rdd.distinct
b: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at distinct at <console>:26
scala> b.collect
res8: Array[String] = Array(hello, ni, hao)
+++++++++++++++++++++++++++++++++++++++++++++++++++++
scala> val rdd = sc.parallelize(Seq("hello","ni","hao","hello",1,2))
rdd: org.apache.spark.rdd.RDD[Any] = ParallelCollectionRDD[25] at parallelize at <console>:24
scala> val b = rdd.distinct
b: org.apache.spark.rdd.RDD[Any] = MapPartitionsRDD[28] at distinct at <console>:26
scala> b.collect
res9: Array[Any] = Array(1, hello, 2, ni, hao)
(2) take(n) (returns the first n elements; the elements may be plain values or tuples. Note the return type is an Array on the driver, not an RDD.)
val a = sc.parallelize(Seq(1,2,23,4,5),2)
scala> a.take(4)
res14: Array[Int] = Array(1, 2, 23, 4)
scala> c.take(2)   // c is the (String, Int) pair RDD used in (3) below
res19: Array[(String, Int)] = Array((hello,1), (nihao,2))
(3) count (returns the number of elements in the RDD, whether plain values or tuples; here c is a tuple-typed RDD)
scala> c.collect
res16: Array[(String, Int)] = Array((hello,1), (nihao,2), (hello111111111,1), (nihao11111,2))
scala> c.count
res18: Long = 4
(4) top(n) (returns the n largest elements in descending order; works on plain values or tuples — here `pair` is the zipWithIndex RDD from (7) above)
scala> pair.top(2)
res28: Array[(Int, Int)] = Array((10,9), (9,8))
scala> pair.collect
res29: Array[(Int, Int)] = Array((1,0), (2,1), (3,2), (4,3), (5,4), (6,5), (7,6), (8,7), (9,8), (10,9))
(5) map (applies a function to every element of the RDD)
val a = sc.parallelize(1 to 9, 3)
val b = a.map(x => x*2) // x => x*2 is a function: x is each element of the RDD, x*2 is the result
a.collect
//Result: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
b.collect
//Result: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Of course, map can also turn each element into a key-value pair:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", " eagle"), 2)
val b = a.map(x => (x, 1))
b.collect.foreach(println(_))
/*
(dog,1)
(tiger,1)
(lion,1)
(cat,1)
(panther,1)
( eagle,1)
*/
(6) mapPartitions(function) (applies the function once per partition instead of once per element; a minimal sketch follows)
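A minimal sketch (assumed example, not from the original transcript): the function receives one Iterator per partition and must return an Iterator, which makes it handy for per-partition work such as computing one partial result per partition:
val a = sc.parallelize(1 to 9, 3)
// one output element per partition: the sum of that partition's elements
val partitionSums = a.mapPartitions(iter => Iterator(iter.sum))
partitionSums.collect()
// expected: Array(6, 15, 24), assuming the partitions are [1,2,3], [4,5,6], [7,8,9]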
(7) flatMap(function) (first maps each element to a collection, then flattens all the collections into one RDD; keep that two-step picture in mind and it is hard to get wrong)
val a = sc.parallelize(1 to 4, 2)
val b = a.flatMap(x => 1 to x) // each element x expands to the sequence 1..x
b.collect
/*
Result: Array[Int] = Array( 1,
1, 2,
1, 2, 3,
1, 2, 3, 4)
*/
++++++++++++++++++++++++++++++++++++++++
scala> val l = List(List(1,2,3), List(2,3,4))
l: List[List[Int]] = List(List(1, 2, 3), List(2, 3, 4))
scala> l.flatMap(x => x)
res36: List[Int] = List(1, 2, 3, 2, 3, 4)
(8) flatMapValues(function) (applies flatMap to the values of a key-value RDD; each key is paired with every value the function produces)
val a = sc.parallelize(List((1,2),(3,4),(5,6)))
val b = a.flatMapValues(x=>1 to x)
b.collect.foreach(println(_))
/*
(1,1)
(1,2)
(3,1)
(3,2)
(3,3)
(3,4)
(5,1)
(5,2)
(5,3)
(5,4)
(5,5)
(5,6)
*/
(9) pairRDD.keys / pairRDD.values (very handy: they pull out just the keys, or just the values, of a key-value RDD as a new single-element RDD)
val pairRDD =
sc.parallelize[(Int, Int)](Seq((5, 2), (7, 4), (3, 3), (2, 4)), 4)
scala> pairRDD.collect
res60: Array[(Int, Int)] = Array((5,2), (7,4), (3,3), (2,4))
scala> val b = pairRDD.keys
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[50] at keys at <console>:26
scala> b.collect
res58: Array[Int] = Array(5, 7, 3, 2)
scala> val c = pairRDD.values
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[51] at values at <console>:26
scala> c.collect
res57: Array[Int] = Array(2, 4, 3, 4)
(10) filter (most commonly applied to single-element RDDs, but there are more flexible uses; see (11))
scala> val rangeTestRDD = sc.parallelize(Seq(11,2,3,4,45,6,799,49))
rangeTestRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[101] at parallelize at <console>:24
scala> rangeTestRDD.filter(x => x>20).collect
res83: Array[Int] = Array(45, 799, 49)
(11) filter on tuples (a handy variant: the predicate can test whichever tuple component you want to filter by)
scala> val rangeTestRDD = sc.parallelize(Seq(("hell",2),("world",1)))
rangeTestRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[104] at parallelize at <console>:24
scala> val a = rangeTestRDD.filter(x => x._2 > 1)
a: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[105] at filter at <console>:26
scala> a.collect
res84: Array[(String, Int)] = Array((hell,2))
(12) reduce (aggregates all elements of the RDD with the given function. Note that reduce combines within each partition first and then combines the partition results, so the function must be commutative and associative; otherwise the result depends on how the data is partitioned — see the sketch after this example.)
val a = sc.parallelize(Seq(1,2,3,4,5))
scala> val b = a.reduce(_+_)
b: Int = 15
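A sketch of the caveat above (hypothetical run, not from the original transcript): with a function that is neither associative nor commutative, such as subtraction, the grouping into partitions changes the answer, so stick to commutative and associative functions with reduce:
val x = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
x.reduce(_ - _)
// with partitions [1,2] and [3,4,5] this may give (1-2) - (3-4-5) = 5,
// not the sequential reduceLeft result 1-2-3-4-5 = -13, and it can vary between runs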
(13) groupBy (a very powerful function: it groups by any function of the element — any tuple component or field — much as filter can test any component; note the two examples below)
scala> val a = sc.parallelize(Seq((1,2,3),(2,3,5),(1,2,4),(6,7,3)))
a: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[14] at parallelize at <console>:24
scala> val b = a.groupBy(_._3)
b: org.apache.spark.rdd.RDD[(Int, Iterable[(Int, Int, Int)])] = ShuffledRDD[16] at groupBy at <console>:26
scala> b.collect
res110: Array[(Int, Iterable[(Int, Int, Int)])] = Array((3,CompactBuffer((1,2,3), (6,7,3))), (4,CompactBuffer((1,2,4))), (5,CompactBuffer((2,3,5))))
+++++++++++++++++++++++++++++++++++
scala> case class TC(val Number:Int,val Name:String)
defined class TC
scala> val a = sc.parallelize(Seq(TC(4,"li"),TC(5,"feng"),TC(6,"yu")))
a: org.apache.spark.rdd.RDD[TC] = ParallelCollectionRDD[6] at parallelize at <console>:26
scala> val b = a.groupBy(x => x.Name)
b: org.apache.spark.rdd.RDD[(String, Iterable[TC])] = ShuffledRDD[8] at groupBy at <console>:28
scala> b.collect
res105: Array[(String, Iterable[TC])] = Array((feng,CompactBuffer(TC(5,feng))), (yu,CompactBuffer(TC(6,yu))), (li,CompactBuffer(TC(4,li))))
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
3. Operations on two RDDs (both RDDs are key-value)
(1) cogroup (for two key-value RDDs A and B, groups both by key; each result element is (key, (all of A's values for that key, all of B's values for that key)), so the element type here is (Int, (Iterable[Int], Iterable[Int])))
scala> val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6), (5, 6)), 4)
pairRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val otherRDD = sc.parallelize(Seq((3, 9), (4, 5)))
otherRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> pairRDD.cogroup(otherRDD).collect()
res4: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2),CompactBuffer())), (3,(CompactBuffer(4, 6),CompactBuffer(9))), (4,(CompactBuffer(),CompactBuffer(5))), (5,(CompactBuffer(6),CompactBuffer())))
(2) join (for key-value RDDs; note the result type (Int, (Int, Int)). Keys that appear in only one of the RDDs are simply dropped (an inner join); each result is (key, (value from the left RDD, value from the right RDD)), one element per matching pair of values.)
scala> val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6), (5, 6)), 4)
pairRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val otherRDD = sc.parallelize(Seq((3, 9), (4, 5)))
otherRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> pairRDD.join(otherRDD).collect()
res5: Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))
(3) subtractByKey (for key-value RDDs; removes from the left RDD every pair whose key also appears in the right RDD — if the left RDD holds several pairs with that key, all of them are removed)
scala> val pairRDD = sc.parallelize[(Int, Int)](Seq((1, 2), (3, 4), (3, 6), (5, 6)), 4)
pairRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> val otherRDD = sc.parallelize(Seq((3, 9), (4, 5)))
otherRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> pairRDD.subtractByKey(otherRDD).collect()
res6: Array[(Int, Int)] = Array((1,2), (5,6))
(4) cartesian(otherDataset): the Cartesian product of the two RDDs' elements; the elements may be plain values or key-value pairs (verified)
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(2 to 5)
val cartesianRDD = rdd1.cartesian(rdd2)
cartesianRDD.collect.foreach(println(_))
(1,2)
(1,3)
(1,4)
(1,5)
(2,2)
(2,3)
(2,4)
(2,5)
(3,2)
(3,3)
(3,4)
(3,5)
(5) intersection(otherDataset): returns the elements common to both RDDs
val rdd1 = sc.parallelize(1 to 3)
val rdd2 = sc.parallelize(3 to 5)
val intersectionRDD = rdd1.intersection(rdd2)
intersectionRDD.collect.foreach(x => print(x + " ")) // prints: 3
4. Operations on two RDDs (the elements may be plain values or tuples)
(1) union (concatenates two RDDs, whether their elements are plain values or tuples; note that union is equivalent to the ++ operator)
++++++++++++++++++++++++++++++++++++++++
scala> val a = sc.parallelize(Seq(("hello", 1), ("nihao", 2)),2)
a: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[31] at parallelize at <console>:24
scala> val b = sc.parallelize(Seq(("AAA", 1), ("BBB", 2)),2)
b: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:24
scala> val c = a union b   // equivalently: a ++ b
scala> c.collect
res11: Array[(String, Int)] = Array((hello,1), (nihao,2), (AAA,1), (BBB,2))
(2) zip (pairs the two RDDs element by element into tuples of (element of the first RDD, element of the second RDD). The element types need not match, but zip is strict: the two RDDs must have the same number of partitions and the same number of elements in each partition, as illustrated after the examples.)
scala> val b = sc.parallelize(11 to 20, 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at <console>:24
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24
scala> val d = a zip b
d: org.apache.spark.rdd.RDD[(Int, Int)] = ZippedPartitionsRDD2[24] at zip at <console>:28
scala> d.collect
res17: Array[(Int, Int)] = Array((1,11), (2,12), (3,13), (4,14), (5,15), (6,16), (7,17), (8,18), (9,19), (10,20))
scala>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
scala> val a = sc.parallelize(Seq((1,2),(3,4)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at <console>:24
scala> val b = sc.parallelize(Seq((8,9),(10,11)))
b: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> val c = a zip b
c: org.apache.spark.rdd.RDD[((Int, Int), (Int, Int))] = ZippedPartitionsRDD2[27] at zip at <console>:28
scala> c.collect
res18: Array[((Int, Int), (Int, Int))] = Array(((1,2),(8,9)), ((3,4),(10,11)))
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
scala> val a = sc.parallelize(Seq("hello","lifengyu"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24
scala> val b = sc.parallelize(Seq((8,9),(10,11)))
b: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> val c = a zip b
c: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ZippedPartitionsRDD2[29] at zip at <console>:28
scala> c.collect
res19: Array[(String, (Int, Int))] = Array((hello,(8,9)), (lifengyu,(10,11)))
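One caveat worth noting (hypothetical example, not from the original transcript): if the two RDDs do not have the same number of elements in each partition, zip fails at runtime rather than truncating:
val p = sc.parallelize(1 to 10, 3)
val q = sc.parallelize(1 to 9, 3)
// p.zip(q).collect()  // throws SparkException: "Can only zip RDDs with same number of elements in each partition"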
(3) subtract (removes from the left RDD every element that also appears in the right RDD; tuples work too, but the whole tuple must match — a pair with the same key and a different value is not removed)
scala> val oneRDD = sc.parallelize[Int](Seq(1, 2, 3), 3)
oneRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24
scala> val otherRDD = sc.parallelize(Seq(3, 4), 2)
otherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
scala> val subtractRDD = oneRDD.subtract(otherRDD)
subtractRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[24] at subtract at <console>:28
scala> subtractRDD.collect
res16: Array[Int] = Array(1, 2)
++++++++++++++++++++++++++++++++++++++
scala> val a = sc.parallelize(Seq((1,2),(3,4)))
a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[25] at parallelize at <console>:24
scala> val b = sc.parallelize(Seq((3,4), (5,6)))
b: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:24
scala> val c = a.subtract(b)
c: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[30] at subtract at <console>:28
scala> c.collect
res17: Array[(Int, Int)] = Array((1,2))
(4) zipPartitions (requires only that the two RDDs have the same number of partitions; the element counts may differ. The supplied function receives one iterator per RDD for each pair of corresponding partitions; here it sums the elements of both partitions, so the output has exactly one element per partition. In short: zip pairs elements by position, zipPartitions operates partition by partition.)
scala> val oneRDD = sc.parallelize[Int](Seq(1, 2, 3), 3)
oneRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:24
scala> val otherRDD = sc.parallelize(Seq(3, 4, 5,6,7,8,9), 3)
otherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at parallelize at <console>:24
scala> val zipPartitionRDD = oneRDD.zipPartitions(otherRDD)((iterator1, iterator2) => Iterator(iterator1.sum + iterator2.sum))
zipPartitionRDD: org.apache.spark.rdd.RDD[Int] = ZippedPartitionsRDD2[46] at zipPartitions at <console>:28
scala> zipPartitionRDD.collect
res30: Array[Int] = Array(8, 13, 27)
5. Creating RDDs
val hdfsFileRDD = sc.textFile("hdfs://master:9999/users/hadoop-twq/word.txt")
hdfsFileRDD.count()
val listRDD = sc.parallelize[Int](Seq(1, 2, 3, 3, 4), 2)
listRDD.collect()
listRDD.glom().collect()
val rangeRDD = sc.range(0, 10, 2, 4)
rangeRDD.collect()
val makeRDD = sc.makeRDD(Seq(1, 2, 3, 3))
makeRDD.collect()
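For reference (expected results under parallelize's default slicing; an assumption, not output copied from a shell):
// listRDD.glom().collect()  ==> Array(Array(1, 2), Array(3, 3, 4)) — one inner array per partition
// rangeRDD.collect()        ==> Array(0, 2, 4, 6, 8) — from 0 (inclusive) to 10 (exclusive) in steps of 2, over 4 partitions
// makeRDD.collect()         ==> Array(1, 2, 3, 3) — makeRDD is an alias for parallelize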