- 3.1 Case development && uploading the jar to the server && preparing test data
- 3.2 Writing results to the console && to an HDFS directory
- 3.3 Handling multiple input files && input file pattern matching && word count with sorting
- 3.4 Testing in spark-shell
1. Revisiting common RDD operators
1) Create a dataset:
scala> val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
2) Square each element of the dataset (each element multiplied by itself):
scala> val b = a.map(x =>x*x)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25
scala> b.collect
res0: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)
3) flatMap explained: 1 yields 1; 2 yields 1, 2; 3 yields 1, 2, 3; 4 yields 1, 2, 3, 4; and so on:
scala> a.flatMap(x=>1 to x).collect
res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Reading SequenceFile files with Spark Core (very common in production)
Historical background: some Hive tables are stored in the SequenceFile format. If Spark is now to be the distributed compute framework, Spark Core must be able to read those SequenceFile files.
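For reference, a minimal sketch of reading a SequenceFile with Spark Core; the path and the Text/IntWritable key and value types here are assumptions for illustration and must match how the Hive table was actually written:
import org.apache.hadoop.io.{IntWritable, Text}

// Hypothetical path and Writable types -- adjust to the real table layout.
val seq = sc.sequenceFile("hdfs://hadoop004:9000/data/seq/", classOf[Text], classOf[IntWritable])
  // Hadoop reuses Writable objects, so convert them to plain Scala values
  // before collecting or caching.
  .map { case (k, v) => (k.toString, v.get()) }

seq.take(10).foreach(println)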
2. Using JOIN in Spark Core
1) Create two RDDs, a and b:
scala> val a = sc.parallelize(Array(("A","a1"),("C","c1"),("D","d1"),("F","f1")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> val b = sc.parallelize(Array(("A","a2"),("C","c2"),("C","c3"),("E","e1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[6] at parallelize at <console>:24
2) Test the result of a.join(b); the return type is Array[(String, (String, String))]:
// Equivalent to an inner join: only keys that match on both sides are returned
scala> a.join(b).collect
res5: Array[(String, (String, String))] = Array((A,(a1,a2)), (C,(c1,c2)), (C,(c1,c3)))
3) leftOuterJoin
// Look at the returned data structure: a is the driving (left) side, matched against b, and every record from the left side is returned
scala> a.leftOuterJoin(b).collect
res6: Array[(String, (String, Option[String]))] = Array((F,(f1,None)), (D,(d1,None)), (A,(a1,Some(a2))), (C,(c1,Some(c2))), (C,(c1,Some(c3))))
4) rightOuterJoin returns every record from the right side:
scala> a.rightOuterJoin(b).collect
res7: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(None,e1)))
Why does the right outer join return only 4 records? Because b, the right-side RDD, has only 4 records.
5) fullOuterJoin returns all keys from both sides; check the data structure:
Array[(String, (Option[String], Option[String]))]
scala> a.fullOuterJoin(b).collect
res8: Array[(String, (Option[String], Option[String]))] = Array((F,(Some(f1),None)), (D,(Some(d1),None)), (A,(Some(a1),Some(a2))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (E,(None,Some(e1))))
- Always learn to read the data structure; it determines what the next operation can do. Since the outer joins return Option values, unwrapping them is usually the next step, as in the sketch below.
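A minimal sketch on the a and b RDDs above; the "NULL" placeholder is just an illustrative default:
// Unwrap the Option produced by leftOuterJoin; keys with no match in b
// get the placeholder instead of Some(...).
a.leftOuterJoin(b)
  .map { case (k, (left, rightOpt)) => (k, left, rightOpt.getOrElse("NULL")) }
  .collect()
// e.g. Array((F,f1,NULL), (D,d1,NULL), (A,a1,a2), (C,c1,c2), (C,c1,c3))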
2.1 Word count analysis with Spark Core
1) Read the file:
scala> val log=sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24
2) Use the map operator to split each element of the dataset:
scala> log.map(x =>x.split("\t")).collect
res0: Array[Array[String]] = Array(Array(hello, hello, hello), Array(world, world), Array(john))
3) Note that flatMap is used here: it splits every element and flattens the results into a single collection:
scala> val splits=log.flatMap(x =>x.split("\t"))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:25
4) The flatMap result looks like this:
scala> splits.collect
res1: Array[String] = Array(hello, hello, hello, world, world, john)
5) Then sum up the counts:
scala> splits.map(x =>(x,1)).reduceByKey(_+_).collect
res2: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
Note: reduceByKey involves a shuffle; records with the same key are sent to the same reducer, where their values are summed.
Sorting the output by each word's number of occurrences, descending or ascending, is shown in the sketch below and revisited in 3.3 and 3.4.
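A minimal sketch using RDD.sortBy directly on the (word, count) pairs, as an alternative to the swap-and-sortByKey approach used later:
val counts = splits.map((_, 1)).reduceByKey(_ + _)

// Descending by count; use ascending = true (the default) for ascending order.
counts.sortBy(_._2, ascending = false).collect()
// e.g. Array((hello,3), (world,2), (john,1))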
2.2 subtract && intersection && cartesian on RDDs
1) subtract
- In RDD.scala, press Ctrl+O to bring up all the methods and search for subtract:
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be <= us.
*/
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }
// Subtracting one RDD from another is very common in practice:
val a = sc.parallelize(1 to 5)
val b = sc.parallelize(2 to 3)
a.subtract(b).collect
Output: Array[Int] = Array(4, 1, 5)
2) intersection
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }
scala> a.intersection(b).collect
res4: Array[Int] = Array(2, 3)
// the intersection of the two RDDs
3) cartesian (Cartesian product):
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in `this` and b is in `other`.
*/
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
scala> a.cartesian(b).collect
res5: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
spark-shell here is only for quick tests; actual development uses IDEA + Maven + Scala.
- The overall flow of writing a Spark application (code it in IDEA, package it, upload the jar, submit with spark-submit) is walked through below.
3. Building a Spark application with IDEA and Maven
1) Steps to create a new project in IDEA:
Click File --> New --> Project, tick Create from archetype, and select org.scala-tools.archetypes:scala-archetype-simple;
2) Click Next and fill in GroupId: com.ruozedata.bigdata and ArtifactId: g6-spark;
3) Select the maven_home directory, use the settings.xml under maven_home, then choose the local repository.
Two places to check the imported dependencies:
1) External Libraries
2) View --> Tool Windows --> Maven Projects, click the project name, then expand Dependencies
The following needs to be added to pom.xml:
1) Besides the built-in repository, one more must be added, because the CDH artifacts cannot be downloaded from the default repository.
<repository>
<id>cloudera</id>
<name>cloudera</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
2) Add the dependencies:
<!-- spark-core dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- hadoop-client dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
Note: declare the versions in the properties block as follows, which makes later version maintenance easier.
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.4.0</spark.version>
<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
</properties>
Back in pom.xml, right-click --> Maven --> Reimport to re-resolve the dependencies; then go to View -> Tool Windows -> Maven Projects, expand Lifecycle, and run clean. After a short wait the errors disappear.
Points to note here:
1) Change the Scala version
2) Add the spark-core dependency
3) Add the hadoop-client dependency
4) Add the Cloudera (CDH) repository
3.1 Case development && uploading the jar to the server && preparing test data
1) Quickly write a WordCount program in IDEA, named WordCountApp:
package com.ruozedata.bigdata.SparkCore01

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)

    // args(0): input path; split each line on tabs and count every word
    val textFile = sc.textFile(args(0))
    val wc = textFile.flatMap(line => line.split("\t")).map((_, 1)).reduceByKey(_ + _)

    wc.collect().foreach(println)
    sc.stop()
  }
}
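A note on running this program directly in IDEA rather than through spark-submit: there is then no launcher to provide --master, so for local debugging it is common to set the master and an application name on the SparkConf. A minimal sketch of how the sparkConf line above could look in that case (the local[2] master and the name are illustrative choices for local testing only):
// Local-debugging settings only; remove (or guard) them before packaging,
// because spark-submit is expected to supply the master at submit time.
val sparkConf = new SparkConf()
  .setAppName("WordCountApp-local")
  .setMaster("local[2]")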
After WordCountApp is written, go to Maven Projects ⇒ Lifecycle ⇒ package ⇒ Run Maven Build to build the jar; when packaging finishes, the console prints the build info and the path of the jar:
[INFO] Building jar: G:\bigdata_workspace\g7-spark\target\g6-spark-1.0.jar
Upload it to the server with the rz command:
[hadoop@hadoop004 lib]$ ll
total 52
-rw-r--r--. 1 hadoop hadoop 49347 Jun 7 2020 g6-spark-1.0.jar
[hadoop@hadoop004 lib]$ pwd
/home/hadoop/lib
A side note: how are files uploaded to servers in production?
In production rz is generally used; FTP tools are not.
Production access goes through a jump server: you first log in to the jump server and from there log in to the target machine;
there is also a bastion host in between, so an FTP connection simply cannot get through.
Using rz in production is fine, because it transfers the file over the existing terminal session instead of opening a separate connection.
Submitting the Spark program to run:
See:
http://spark.apache.org/docs/latest/submitting-applications.html
- Once an application is bundled, it can be launched with the bin/spark-submit script; the script takes care of setting up the classpath with the dependencies and supports the different cluster managers and deploy modes.
1) Submit, reading the input from HDFS:
[hadoop@hadoop004 bin]$ ./spark-submit --class com.ruozedata.bigdata.SparkCore01 --master local[2] /home/hadoop/lib/g6-spark-1.0.jar hdfs://hadoop004:9000/data/input/ruozeinput.txt
20/06/06 17:17:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
// Why it fails: --class must be followed by the full package name + class name. We defined object WordCountApp, so select WordCountApp in IDEA, right-click, Copy Reference, and use com.ruozedata.bigdata.SparkCore01.WordCountApp.
3.2 Writing results to the console && to an HDFS directory
1) Output on the console:
20/06/06 19:20:25 INFO DAGScheduler: ResultStage 1 (collect at WordCountApp.scala:14) finished in 0.121 s
20/06/06 19:20:25 INFO DAGScheduler: Job 0 finished: collect at WordCountApp.scala:14, took 2.328096 s
(hello,3)
(world,2)
(john,1)
20/06/06 19:20:25 INFO SparkUI: Stopped Spark web UI at http://hadoop004:4041
20/06/06 19:20:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2) Write the results to an HDFS directory:
1) Change one line of code in IDEA:
// wc.collect().foreach(println)   // previously: print straight to the console
wc.saveAsTextFile(args(1))          // the output path is passed in at spark-submit time
2) Submit with spark-submit:
[hadoop@hadoop004 bin]$ ./spark-submit --class com.ruozedata.bigdata.SparkCore01.WordCountApp \
--master local[2] \
/home/hadoop/lib/g6-spark-1.0.jar \
hdfs://hadoop004:9000/data/input/ruozeinput.txt \
hdfs://hadoop004:9000/data/output1
3) Why are there 2 partitions (two part files)? With no explicit partition count, sc.textFile uses defaultMinPartitions = min(defaultParallelism, 2), so this small input is read as 2 partitions and saveAsTextFile writes one part file per partition (see the sketch after the output below):
[hadoop@hadoop004 lib]$ hdfs dfs -text /data/output1/part-00000
20/06/06 19:36:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(hello,3)
(world,2)
[hadoop@hadoop004 lib]$ hdfs dfs -text /data/output1/part-00001
20/06/06 19:36:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(john,1)
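If a single output file is preferred for such small test data, the number of partitions can be reduced before saving. A minimal sketch of two options inside WordCountApp (for small outputs only, since coalesce(1) funnels everything through one task):
// Either read the input as a single partition...
val textFile = sc.textFile(args(0), 1)

// ...or shrink the partition count just before writing,
// so only one part-00000 file appears under the output path.
wc.coalesce(1).saveAsTextFile(args(1))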
3.3 Handling multiple input files && input file pattern matching && word count with sorting
1) Handling multiple input files
1) Submit with spark-submit:
[hadoop@hadoop004 bin]$ ./spark-submit --class com.ruozedata.bigdata.SparkCore01.WordCountApp --master local[2] /home/hadoop/lib/g6-spark-1.0.jar hdfs://hadoop004:9000/data/input/ hdfs://hadoop004:9000/data/output2
2) One line in the log shows that the HDFS input directory contains 3 files:
20/06/06 19:41:07 INFO FileInputFormat: Total input paths to process : 3
3) FileAlreadyExistsException is thrown if the output path already exists.
2) Wildcards are supported:
- For example, an input path of hdfs://hadoop004:9000/data/input/*.txt matches the files on HDFS whose names end in .txt.
Note: "Total input paths to process : 3" counts input files. A single larger file of, say, 130 MB would still be 1 input path; with a 128 MB block size it would instead be read as 2 splits (tasks).
3) Word count with sorting
- Output of the previous code:
(world,6)
(hello,9)
(john,3)
- To sort the output by value, modify the code in IDEA as follows:
val wc = textFile.flatMap(line => line.split("\t")).map((_, 1)).reduceByKey(_ + _)
// swap to (count, word), sort by the count, then swap back to (word, count)
val sorted = wc.map(x => (x._2, x._1)).sortByKey(true).map(x => (x._2, x._1))
sorted.saveAsTextFile(args(1))
Every change means rebuilding the jar, uploading it to the server, and submitting with spark-submit again, which is a bit tedious; simple tests can be run directly in spark-shell.
3.4 Quick testing in spark-shell
1) Read the file:
val textFile = sc.textFile("hdfs://hadoop004:9000/data/input/ruozeinput.txt")
2) flatMap flattens the lines into individual words, then map pairs each word with a 1:
scala> val wc=textFile.flatMap(line =>line.split("\t")).map((_,1))
wc: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[37] at map at <console>:25
scala> wc.collect
res14: Array[(String, Int)] = Array((hello,1), (hello,1), (hello,1), (world,1), (world,1), (john,1))
3) Use reduceByKey to add up the counts, map to swap each pair to (count, word), and sortByKey, which sorts in ascending order by default:
wc.reduceByKey(_+_).map(x =>(x._2,x._1)).sortByKey(true).collect
4) For descending order, change true to false:
wc.reduceByKey(_+_).map(x =>(x._2,x._1)).sortByKey(false).collect
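For a quick "most frequent words" check without sorting the whole RDD, top can also be used with an Ordering on the count; a minimal sketch (2 is just an example size):
// Take the 2 most frequent words, ranked by their counts.
wc.reduceByKey(_ + _).top(2)(Ordering.by(_._2))
// e.g. Array((hello,3), (world,2))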
The definition of sortByKey in the source:
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }
Both of sortByKey's parameters have default values, so sortByKey() can be called with an empty argument list to use the defaults; to override a default, pass the value explicitly, e.g. sortByKey(false) or sortByKey(ascending = false). The sketch below illustrates how default parameters behave.
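A standalone illustration of Scala default parameters (not Spark-specific; the method name and values are made up for the example):
// Both parameters have defaults, mirroring sortByKey(ascending, numPartitions).
def sort(ascending: Boolean = true, numPartitions: Int = 1): String =
  s"ascending=$ascending, numPartitions=$numPartitions"

sort()                  // empty argument list: both defaults apply
sort(false)             // override the first parameter positionally
sort(numPartitions = 4) // override only the second parameter, by name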