Spark RDD Programming
一、RDD Operations
1. Creation Operations
① Creating from a file
Each line of the file becomes one element of the RDD.
a. Creating from a local file
//Format: sc.textFile("file://<absolute path of the local file>")
val rdd = sc.textFile("file:///home/centos7/infos.txt")
b. Creating from an HDFS file
//Format 1: sc.textFile("hdfs://<absolute path of the HDFS file>")
val rdd = sc.textFile("hdfs:///user/centos7/infos.txt")
//Format 2: sc.textFile("<absolute path of the HDFS file>")
val rdd = sc.textFile("/user/centos7/infos.txt")
//Format 3: sc.textFile("<relative path>"); the relative path is implicitly prefixed with "/user/<username>/"
val rdd = sc.textFile("infos.txt")
② Creating from a parallelized collection
//sc.parallelize(arr), where arr must be a collection, array, or sequence
val rdd2 = sc.parallelize(arr)
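For example, a minimal sketch (the array below is only an illustrative value, not taken from the original data):
//define a local collection
val arr = Array(1, 2, 3, 4, 5)
//turn it into an RDD; an optional second argument sets the number of partitions
val rdd2 = sc.parallelize(arr, 2)
//view all elements of the RDD
rdd2.collect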
2. Transformation Operations / Transformation Operators / Transformations
① map(func):
Passes each element to the function func and returns the results as a new dataset (the new RDD has exactly as many elements as the original RDD).
② flatMap(func):
Passes each element to the function func and "flattens" the results into a new dataset (the number of elements in the new RDD has no fixed relationship to the number in the original RDD).
//create an array
scala> val arr = Array("zhangsan lisi wangwu","zhaoliu")
arr: Array[String] = Array(zhangsan lisi wangwu, zhaoliu)
//turn the array into an RDD
scala> val rdd4 = sc.parallelize(arr)
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:26
//view all elements of the RDD
scala> rdd4.collect
res19: Array[String] = Array(zhangsan lisi wangwu, zhaoliu)
//the map operator
scala> rdd4.map(_.split(" ")).collect
res20: Array[Array[String]] = Array(Array(zhangsan, lisi, wangwu), Array(zhaoliu))
//the flatMap operator
scala> rdd4.flatMap(_.split(" ")).collect
res21: Array[String] = Array(zhangsan, lisi, wangwu, zhaoliu)
③ filter(func):
func must return a Boolean; each element is passed to func, and only the elements for which func returns true are kept in the new RDD.
//create an RDD
scala> val rdd2 = sc.parallelize(Array(1, 2, 3, 4, 5, 6))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24
//view all elements of the RDD
scala> rdd2.collect
res30: Array[Int] = Array(1, 2, 3, 4, 5, 6)
//filter: keep the elements that satisfy the condition _%2==0
scala> rdd2.filter(_%2==0).collect
res31: Array[Int] = Array(2, 4, 6)
scala> rdd2.filter(_%2!=0).collect
res32: Array[Int] = Array(1, 3, 5)
scala> rdd2.filter(_<=3).collect
res33: Array[Int] = Array(1, 2, 3)
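filter works the same way on string RDDs. A minimal sketch (the sample data is illustrative only) that keeps the lines containing "spark":
//create a small RDD of lines
val lines = sc.parallelize(List("hello spark", "hello hadoop", "spark rdd"))
//keep only the lines that contain the substring "spark"
lines.filter(_.contains("spark")).collect
//expected result: Array(hello spark, spark rdd)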
④ groupByKey():
Groups together the values that share the same key; it must be applied to a key-value (pair) RDD.
//create an RDD
scala> val rdd5 = sc.parallelize(List("hadoop","spark","spark"))
rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[35] at parallelize at <console>:24
//view all elements of the RDD
scala> rdd5.collect
res35: Array[String] = Array(hadoop, spark, spark)
//turn every element of the RDD into a key-value pair (a tuple); an RDD whose elements are key-value pairs is called a pair RDD (Pair RDD)
scala> rdd5.map((_,1)).collect
res36: Array[(String, Int)] = Array((hadoop,1), (spark,1), (spark,1))
//run the groupByKey operator to group together the values with the same key
scala> rdd5.map((_,1)).groupByKey().collect
res41: Array[(String, Iterable[Int])] = Array((spark,CompactBuffer(1, 1)), (hadoop,CompactBuffer(1)))
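The grouped values can then be summed per key with mapValues; a minimal sketch continuing the example above, which produces the same counts that reduceByKey yields in the next section:
//sum the values grouped under each key to get per-word counts
rdd5.map((_,1)).groupByKey().mapValues(_.sum).collect
//expected result: Array((spark,2), (hadoop,1)) (element order may vary)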
⑤ reduceByKey(func):
Combines the values that share the same key by calling func; func must take two arguments.
//create an RDD
scala> val rdd5 = sc.parallelize(List("hadoop","spark","spark"))
rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[35] at parallelize at <console>:24
//view all elements of the RDD
scala> rdd5.collect
res35: Array[String] = Array(hadoop, spark, spark)
//turn every element of the RDD into a key-value pair (a tuple); an RDD whose elements are key-value pairs is called a pair RDD (Pair RDD)
scala> rdd5.map((_,1)).collect
res36: Array[(String, Int)] = Array((hadoop,1), (spark,1), (spark,1))
//run the reduceByKey(_+_) operator to add up the values with the same key
//_+_ is shorthand for (x,y)=>x+y
scala> rdd5.map((_,1)).reduceByKey(_+_).collect
res48: Array[(String, Int)] = Array((spark,2), (hadoop,1))
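Putting the operators together gives the classic word count; a minimal sketch that reuses rdd4 from the map/flatMap example above:
//split each line into words, pair every word with 1, then add up the 1s per word
rdd4.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
//expected result: one (word, count) pair per distinct word, e.g. (zhangsan,1), (lisi,1), (wangwu,1), (zhaoliu,1)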