Introduction to RDDs

The RDD is Spark's core abstraction: an immutable, partitioned, distributed dataset that is resilient and fault-tolerant. RDD operations fall into transformations and actions; transformations include map and filter, while actions include reduce and count. SparkContext is the entry point of a Spark program and is used to connect to a cluster. RDDs can be created from local collections and files or from external storage systems such as HDFS. When working with RDDs, pay attention to data distribution and the partitioning strategy, as well as the lazy evaluation of transformations.

RDDs greatly lower the barrier to developing distributed applications and improve execution efficiency.

RDD source code: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/rdd

RDD: Resilient Distributed Dataset, an immutable, partitioned collection of elements that can be operated on in parallel.
Resilient: Spark can tolerate failures during distributed computation.
Distributed: the data may be stored across different nodes, and the computation can also run on different nodes.

An RDD is made up of multiple partitions.
RDDA: (1,2,3,4,5,6,7,8,9). To add 1 to every element, the RDD applies the operation to each partition, and the partitions are processed at the same time:

   hadoop001:  Partition1: (1,2,3)  +1

   hadoop002:  Partition2: (4,5,6)  +1

   hadoop003:  Partition3: (7,8,9)  +1
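A minimal spark-shell sketch of the same idea (the three-way split is just for illustration; glom() groups each partition's elements into an array so the layout is visible):

val rddA = sc.parallelize(1 to 9, 3)   // 3 partitions: (1,2,3) (4,5,6) (7,8,9)
rddA.glom().collect()                  // Array(Array(1,2,3), Array(4,5,6), Array(7,8,9))
rddA.map(_ + 1).collect()              // Array(2,3,...,10); the +1 runs on every partition in parallel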
These points come from the RDD class declaration (simplified: abstract class RDD[T] extends Serializable with Logging):
  1. RDD is an abstract class: every RDD you use is actually a concrete subclass, so in practice you simply work with the subclasses.
  2. Serializable: how well serialization performs directly determines how well the framework performs.
  3. Logging: in Spark 1.6 the Logging trait could be used directly; in 2.0 it is no longer public, so write your own or copy the original.
  4. T: a type parameter, so an RDD can hold elements of any type.

The three essential properties of an RDD

  1. It is made up of a list of partitions.
  2. The computation is applied to each partition.
  3. RDDs depend on one another: the first RDD is loaded from the data source, and every later RDD depends on the ones it was derived from.

In Spark, when a computation runs, each partition is handled by one task, so the number of partitions determines the number of tasks, as the sketch below shows.
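These properties can be inspected directly in spark-shell (a minimal sketch, assuming sc is already available):

val rdd1 = sc.parallelize(1 to 100, 4)
rdd1.getNumPartitions        // 4 partitions, so an action on this RDD runs 4 tasks
val rdd2 = rdd1.map(_ * 2)
rdd2.dependencies            // rdd2 depends on rdd1
rdd2.toDebugString           // prints the lineage (the chain of dependencies)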

SparkContext
http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#initializing-spark

The first step of a Spark program: SparkConf --> SparkContext

SparkContext tells Spark how to connect to a cluster and which mode to run in: local, standalone, YARN, or Mesos.

Create a SparkConf object (key-value pairs; chained set calls are supported). SparkConf carries the application's information (application name, master, memory).
When running on a cluster, do not hard-code a master in the code; pass it via spark-submit instead (see the sketch after the code below).

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {
  def main(args: Array[String]): Unit = {
    // set the application name and master; setMaster("local[2]") is for local testing only
    val sparkConf = new SparkConf().setAppName("sparkcontextApp11").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // TODO: business logic goes here

    sc.stop()
  }
}
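To keep the master out of the code, package the application and pass the master at submit time with spark-submit (a sketch; the jar path is a placeholder, and setMaster would be removed from the code, since properties set directly on SparkConf take precedence over spark-submit flags):

spark-submit \
  --class SparkContextApp \
  --master yarn \
  --name SparkContextApp \
  /path/to/your-app.jar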

Using spark-shell
spark-shell is an interactive command line: you type a command and see the result immediately, with no need for an IDE. It is generally used for testing; real development should still be done with IDEA + Maven.

  1. Start from --help to see all options.
  2. Important parameters:
    --master: do not hard-code the master in your code; specify it here
    --name: the application name
    --jars: a comma-separated list of local jars to ship
    --conf: set a configuration property
    --queue: the queue to submit to
    --num-executors: the number of executors
spark-shell --master local[2]
spark-shell --master local[4] --jars code.jar
spark-shell --master local[4] --packages "org.example:example:0.1"

Ways to create an RDD

  1. Parallelize an existing collection (used for testing)
  2. Reference a dataset in an external storage system such as HDFS or HBase (used in production)

Method 1:

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)


// convert the array to an RDD
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> distData.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5)                                         

// sum all the numbers in the array (pairwise reduction)
scala> distData.reduce((a, b) => a + b)
res1: Int = 15

// one task corresponds to one partition: the command below creates 5 partitions, hence 5 tasks
scala> val distData = sc.parallelize(data,5)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26

Method 2:

Create an RDD from an external dataset. Hadoop, HBase, S3, ... are supported, as are text files and SequenceFiles.

textFile:
Read a text file from HDFS, a local file system (available on all nodes), or any
Hadoop-supported file system URI, and return it as an RDD of Strings.

In standalone mode:
avoid relying on local files; if you do use them, the input file must be present on every node.

// local file
scala> val distFile = sc.textFile("file:///home/hadoop/source/a.dat")
distFile: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/source/a.dat MapPartitionsRDD[3] at textFile at <console>:24

scala> distFile.collect
res0: Array[String] = Array(hello	eurecom	hello, eurecom	hello	yuan)

// HDFS
scala> val distFile = sc.textFile("hdfs://hadoop001:9000/data")
distFile: org.apache.spark.rdd.RDD[String] = hdfs://hadoop001:9000/data MapPartitionsRDD[1] at textFile at <console>:24

scala> distFile.collect
res0: Array[String] = Array(hello       eurecom hello, eurecom  hello   yuan) 

Notes:

  1. When using a local file, it must be accessible at the same path on every worker node: every node needs a copy of the file, at exactly the same path.
  2. For HDFS you can point to a single file, a directory, or use wildcards to match multiple files.
  3. The second argument of textFile controls the number of partitions, as the sketch below shows.
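A minimal sketch of the third point, reusing the HDFS path from above (the second argument of textFile is the minimum number of partitions):

val logs = sc.textFile("hdfs://hadoop001:9000/data", 4)  // ask for at least 4 partitions
logs.getNumPartitions                                    // typically 4; it is a minimum, not an exact count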

RDD operations
1) Transformations. RDDs are immutable, so a transformation always produces a new RDD (RDDA -> RDDB).
All transformations are lazy: nothing is computed until an action is encountered (see the sketch below). Strictly speaking, some of the operations listed afterwards (collect, count, sum, first, take, top) are actions rather than transformations; they are included here because they make each result visible.
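A minimal sketch of laziness, assuming sc is available (in local mode the println output appears in the console; on a cluster it would go to the executor logs):

val planned = sc.parallelize(1 to 5).map { x =>
  println(s"processing $x")  // not printed yet: map only records the transformation
  x * 10
}
planned.collect()            // the action triggers the computation; the printlns run now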

  • map: applies a function to every element of the RDD and returns a new RDD
scala> var a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.map(_*2)
res0: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:27

scala> a.map(_*2).collect()
res1: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)

scala> var a = sc.parallelize(List("dog","tiger","cat"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val b =a.map(x=>(x,1))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26

scala> b.collect()
res2: Array[(String, Int)] = Array((dog,1), (tiger,1), (cat,1))
  • filter: keeps only the elements that satisfy the predicate
scala> a.filter(_%2 == 0).collect()
res3: Array[Int] = Array(2, 4, 6, 8, 10)

scala> a.filter(_<4).collect()
res4: Array[Int] = Array(1, 2, 3)

scala> var a = sc.parallelize(1 to 6)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> val mapRDD = a.map(_*2)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:26

scala> mapRDD.collect()
res5: Array[Int] = Array(2, 4, 6, 8, 10, 12)

scala> mapRDD.filter(_>5).collect
res7: Array[Int] = Array(6, 8, 10, 12)
  • The difference between flatMap and map
    flatMap flattens the results after mapping; the word count example below relies on flatMap
scala> var a = sc.parallelize(1 to 9)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> var nums = a.map(x=>(x*x))
nums: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

scala> nums.collect
res0: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)

// expand each x into the range 1 to x
scala> nums.flatMap(x => 1 to x).collect
res1: Array[Int] = Array(1, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 1, 2, 3, 4, 5, ...
  • mapValues: applies the function to the values only; the keys are left untouched
scala> val a = sc.parallelize(List("dog","tiger","cat"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> val b = a.map(x=>(x,x.length))
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:26

scala> b.collect()
res9: Array[(String, Int)] = Array((dog,3), (tiger,5), (cat,3))

scala> b.mapValues("x"+_+"x").collect()
res10: Array[(String, String)] = Array((dog,x3x), (tiger,x5x), (cat,x3x))
  • count: returns the number of elements in the dataset
  • sum: sums the elements
scala> var a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24

scala> a.sum()
res11: Double = 5050.0

scala> a.reduce(_+_)
res12: Int = 5050
  • first: returns the first element of the dataset, similar to take(1)
scala> val a = sc.parallelize(List("dog","tiger","cat"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24

scala> a.first()
res13: String = dog

scala> a.take(1)
res14: Array[String] = Array(dog)
  • top: returns the n largest elements (sorted in descending order)
scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res15: Array[Int] = Array(9, 8)

// with a reversed implicit Ordering in scope, top returns the smallest elements instead
scala> implicit val myorder = implicitly[Ordering[Int]].reverse
myorder: scala.math.Ordering[Int] = scala.math.Ordering$$anon$4@44473eed

scala> sc.parallelize(Array(6,9,4,7,5,8)).top(2)
res16: Array[Int] = Array(4, 5)
  • subtract
    not arithmetic subtraction: a.subtract(b) returns the elements of a that do not appear in b
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> b.collect
res4: Array[Int] = Array(2, 3)

scala> a.subtract(b).collect
res5: Array[Int] = Array(4, 1, 5)
  • intersection
    returns the elements that appear in both RDDs
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> b.collect
res4: Array[Int] = Array(2, 3)

scala> a.intersection(b).collect
res6: Array[Int] = Array(2, 3)
  • cartesian
    returns the Cartesian product of the two RDDs
scala> val a = sc.parallelize(1 to 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.collect
res3: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val b = sc.parallelize(2 to 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> b.collect
res4: Array[Int] = Array(2, 3)

scala> a.cartesian(b).collect
res7: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))

2) Actions, which return a value to the driver program after running a computation on the dataset.
reduce aggregates all of the elements into a single result.

Using join in Spark Core

scala> val a=sc.parallelize(Array(("A","a1"),("B","b1"),("C","c1"),("D","d1"),("D","d2")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> val b=sc.parallelize(Array(("A","a2"),("B","b2"),("C","c1"),("D","d2"),("D","d3"),("E","E1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[7] at parallelize at <console>:24

// join returns only the keys that appear on both sides (an inner join)
scala> a.join(b).collect
res10: Array[(String, (String, String))] = Array((B,(b1,b2)), (D,(d1,d2)), (D,(d1,d3)), (D,(d2,d2)), (D,(d2,d3)), (A,(a1,a2)), (C,(c1,c1)))

// leftOuterJoin, rightOuterJoin and fullOuterJoin work the same way as their SQL counterparts (see the sketch below)
scala> a.leftOuterJoin(b).collect
res2: Array[(String, (String, Option[String]))] = Array((B,(b1,Some(b2))), (D,(d1,Some(d2))), (D,(d1,Some(d3))), (D,(d2,Some(d2))), (D,(d2,Some(d3))), (A,(a1,Some(a2))), (C,(c1,Some(c1))))
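For completeness, a minimal sketch of the other joins on the same a and b (keys with no match on one side come back as None):

a.rightOuterJoin(b).collect  // keeps every key of b; E shows up as (E,(None,E1))
a.fullOuterJoin(b).collect   // keeps every key from either side; both values are wrapped in Option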

Word count in practice

scala> val log = sc.textFile("/home/hadoop/source/a.dat")
log: org.apache.spark.rdd.RDD[String] = /home/hadoop/source/a.dat MapPartitionsRDD[1] at textFile at <console>:24

scala> val splits = log.flatMap(x => x.split("\t"))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> val wordone  = splits.map(x=>(x,1)) 
wordone: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> wordone.collect
res0: Array[(String, Int)] = Array((hello,1), (eurecom,1), (hello,1), (eurecom,1), (hello,1), (yuan,1))

scala> val res1 = wordone.reduceByKey(_+_) 
res1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:30

scala> res1.collect
res2: Array[(String, Int)] = Array((hello,3), (eurecom,2), (yuan,1))
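The same word count can also be written as a single chained expression (a sketch using the same file path and tab delimiter as above):

sc.textFile("/home/hadoop/source/a.dat")
  .flatMap(_.split("\t"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect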
