map()
Applies the function passed to map to every element of the RDD, producing one new element per input element.
Input and output partitions correspond one to one: the output RDD has exactly as many partitions as the input RDD.
scala> val data = sc.textFile("/data/spark_rdd.txt")
data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val map_rdd = data.map(line => line.split("\\s+"))
map_rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

scala> map_rdd.collect
res0: Array[Array[String]] = Array(Array(insert, overwrite, table), Array(dataset, intersect, tochar), Array(linux, alter))
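Because map never repartitions the data, the one-to-one partition claim can be checked directly. A minimal sketch (the actual count depends on how /data/spark_rdd.txt was read):

// the partition count is identical before and after a map
val inParts  = data.getNumPartitions
val outParts = map_rdd.getNumPartitions
assert(inParts == outParts)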
flatMap()
The first step works like map, but each element is mapped to a collection; those collections are then flattened (concatenated) into a single RDD of elements.
scala> val flatMap_rdd = data.flatMap(line => line.split("\\s+"))
flatMap_rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26

scala> flatMap_rdd.collect
res1: Array[String] = Array(insert, overwrite, table, dataset, intersect, tochar, linux, alter)
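The function passed to flatMap may return any number of elements per input, including zero. A small sketch with a made-up numeric RDD (nums is hypothetical, not from the text above):

// each input n expands to n copies of itself; 0 expands to nothing
val nums = sc.parallelize(Seq(0, 1, 2, 3))
val expanded = nums.flatMap(n => Seq.fill(n)(n))
expanded.collect   // Array(1, 2, 2, 3, 3, 3)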
Note: when flatMap is applied directly to a String, the string itself is treated as a sequence of characters, so it is flattened into individual Chars:

scala> data.map(_.toUpperCase).collect
res3: Array[String] = Array("INSERT OVERWRITE TABLE ", DATASET INTERSECT TOCHAR, LINUX ALTER)

scala> data.flatMap(_.toUpperCase).collect
res4: Array[Char] = Array(I, N, S, E, R, T, , O, V, E, R, W, R, I, T, E, , T, A, B, L, E, , D, A, T, A, S, E, T, , I, N, T, E, R, S, E, C, T, , T, O, C, H, A, R, L, I, N, U, X, , A, L, T, E, R)
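Since the result only needs to be something Spark can iterate over, flatMap is also a handy way to map and drop bad records in one pass. A sketch with a hypothetical RDD of raw strings (not part of the example above):

import scala.util.Try

// a made-up RDD of raw strings, some of which are not numbers
val raw = sc.parallelize(Seq("1", "x", "42", ""))

// Try(...).toOption yields Some(n) for valid ints and None otherwise;
// flatMap keeps the Some values and silently drops the Nones
val parsed = raw.flatMap(s => Try(s.toInt).toOption)
parsed.collect   // Array(1, 42)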
distinct()
Removes duplicate elements from the RDD.
scala> val distinct_rdd = data.flatMap(_.toUpperCase).distinct
distinct_rdd: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[13] at distinct at <console>:26

scala> distinct_rdd.collect
res6: Array[Char] = Array(T, L, R, B, O, A, I, , S, H, C, E, , U, V, X, N, W, D)
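Unlike map and flatMap, distinct needs a shuffle to bring equal elements together. Conceptually it is equivalent to something like the following pair-RDD dedup (a sketch of the idea, not the exact Spark source):

// pair each element with a dummy value, collapse duplicates by key,
// then drop the dummy value again
val dedup = data.flatMap(_.toUpperCase)
  .map(c => (c, null))
  .reduceByKey((x, _) => x)
  .map(_._1)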
filter()
Applies the function passed to filter to every element of the RDD and keeps only the elements for which the function returns true:

scala> val data_1 = sc.parallelize(Array(1, 2, 3, 4, 23, 5, 123, 98))
data_1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> val filter_rdd = data_1.filter(_ < 10)
filter_rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at filter at <console>:26

scala> filter_rdd.collect
res7: Array[Int] = Array(1, 2, 3, 4, 5)
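Like map, filter keeps the partition count unchanged, so after discarding most of a large RDD some partitions can end up nearly empty. A common follow-up (a sketch, unnecessary for tiny data like this) is to pack the surviving elements into fewer partitions with coalesce:

// keep the small values, then shrink to a single partition
val small = data_1.filter(_ < 10)
val compacted = small.coalesce(1)
compacted.collect   // Array(1, 2, 3, 4, 5)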