Spark Basics (5): Advanced RDD Operators, and Building and Developing a Spark Application in IDEA

Article link: https://blog.youkuaiyun.com/SparkOnYarn/article/details/106604525

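This article covers advanced Spark RDD operations, including reading SequenceFile files, using JOIN, and the subtract, intersection, and cartesian operators. It also explains in detail how to set up a Spark application in IDEA with Maven: creating the project, configuring dependencies, packaging and uploading the JAR, preparing test data, and running the program. Finally, it covers writing results to the console and to HDFS, and quick testing in spark-shell.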


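I. Revisiting Common RDD Operators

II. Using JOIN in Spark Core

2.1 Word Count with Spark Core
2.2 subtract, intersection, and cartesian on RDDs in Detail

III. Building a Spark Application in IDEA with Maven

3.1 Developing the Example, Uploading the JAR to the Server, and Preparing Test Data
3.2 Writing Results to the Console and to an HDFS Directory
3.3 Multiple Input Files, Input Path Pattern Matching, and Sorted Word Count
3.4 Quick Testing in spark-shell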

I. Revisiting Common RDD Operators

1. Create a new dataset:
scala> val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

2. Multiply each element of the dataset by itself and return the result:
scala> val b = a.map(x =>x*x)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

scala> b.collect
res0: Array[Int] = Array(1, 4, 9, 16, 25, 36, 49, 64, 81)

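3. flatMap expands each element x into the range 1 to x and flattens the results: 1 yields 1; 2 yields 1, 2; 3 yields 1, 2, 3; 4 yields 1, 2, 3, 4; and so on.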
scala> a.flatMap(x=>1 to x).collect
res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9)

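Reading SequenceFile files with Spark Core (very common in production)

For historical reasons, some Hive tables are stored in SequenceFile format. If Spark is now to be used as the distributed computing framework, Spark Core has to read those SequenceFile files.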
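The original does not show the read itself, so here is a minimal sketch of a SequenceFile round trip using `saveAsSequenceFile` and `sc.sequenceFile`. The object name, the sample records, and the temporary directory are hypothetical. The `IntWritable`/`Text` classes match the file written in this sketch only; for a real Hive table the key and value classes must be checked against the table (commonly `BytesWritable` keys and `Text` values, which is an assumption to verify).

```scala
import java.nio.file.Files

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileSketch {
  // Reads (Int, String) records back from a SequenceFile directory.
  // Hadoop reuses Writable instances, so every record is copied into
  // plain Scala values inside the map before it is collected.
  def readPairs(sc: SparkContext, path: String): Array[(Int, String)] =
    sc.sequenceFile(path, classOf[IntWritable], classOf[Text])
      .map { case (k, v) => (k.get, v.toString) }
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SequenceFileSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical scratch location; the output directory must not exist yet.
      val path = Files.createTempDirectory("seqfile").resolve("out").toUri.toString
      sc.parallelize(Seq((1, "hello"), (2, "world"))).saveAsSequenceFile(path)
      readPairs(sc, path).sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Copying the Writable values before `collect` matters: without the `map`, every collected element could end up pointing at the same reused object.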

II. Using JOIN in Spark Core

1. Create two RDDs, a and b:
scala> val a = sc.parallelize(Array(("A","a1"),("C","c1"),("D","d1"),("F","f1")))
a: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> val b = sc.parallelize(Array(("A","a2"),("C","c2"),("C","c3"),("E","e1")))
b: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[6] at parallelize at <console>:24

2. Test the result of a join b. The collected result has type Array[(String, (String, String))]:
// Equivalent to an inner join: only keys that match on both sides are returned
scala> a.join(b).collect
res5: Array[(String, (String, String))] = Array((A,(a1,a2)), (C,(c1,c2)), (C,(c1,c3)))

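3. leftOuterJoin
// Look at the returned data structure: a is the main (left) table and is matched against b; every record of the left table is returned, with None where b has no match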
scala> a.leftOuterJoin(b).collect
res6: Array[(String, (String, Option[String]))] = Array((F,(f1,None)), (D,(d1,None)), (A,(a1,Some(a2))), (C,(c1,Some(c2))), (C,(c1,Some(c3))))

4. rightOuterJoin: returns every record of the right table
scala> a.rightOuterJoin(b).collect
res7: Array[(String, (Option[String], String))] = Array((A,(Some(a1),a2)), (C,(Some(c1),c2)), (C,(Some(c1),c3)), (E,(None,e1)))
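Why does the right outer join return only 4 records? Because b has only 4 records, and each key in b matches at most one record in a.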

5. Full outer join: returns every record from both tables, whether or not it has a match. Look at the data structure:
Array[(String, (Option[String], Option[String]))]
scala> a.fullOuterJoin(b).collect
res8: Array[(String, (Option[String], Option[String]))] = Array((F,(Some(f1),None)), (D,(Some(d1),None)), (A,(Some(a1),Some(a2))), (C,(Some(c1),Some(c2))), (C,(Some(c1),Some(c3))), (E,(None,Some(e1))))

Make sure you can read these data structures; the structure of a result is crucial for whatever operation comes next.
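To illustrate why the result type matters, here is a minimal sketch that unpacks the `Option` on the right-hand side of a `leftOuterJoin` with pattern matching and `getOrElse`. The object name, the function name, and the placeholder string are hypothetical; the two RDDs reuse the a and b data from above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinOptionSketch {
  // Left outer join of a and b; a missing right-hand value becomes `default`.
  // The result is sorted on the driver only to make the output order stable.
  def leftJoinWithDefault(sc: SparkContext, default: String): Array[(String, String, String)] = {
    val a = sc.parallelize(Seq(("A", "a1"), ("C", "c1"), ("D", "d1"), ("F", "f1")))
    val b = sc.parallelize(Seq(("A", "a2"), ("C", "c2"), ("C", "c3"), ("E", "e1")))
    a.leftOuterJoin(b)
      .map { case (key, (left, rightOpt)) => (key, left, rightOpt.getOrElse(default)) }
      .collect()
      .sorted
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinOptionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      leftJoinWithDefault(sc, "N/A").foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The pattern `case (key, (left, rightOpt))` mirrors the type `(String, (String, Option[String]))` one-to-one, which is why reading the type first makes the next step straightforward.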

2.1 Word Count with Spark Core

1. Read the file:
scala> val log=sc.textFile("file:///home/hadoop/data/ruozeinput.txt")
log: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/data/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24

2. Split each element of the dataset with map (each line becomes its own array):
scala> log.map(x =>x.split("\t")).collect
res0: Array[Array[String]] = Array(Array(hello, hello, hello), Array(world, world), Array(john))

3. Note that flatMap is used here: it splits every element of the dataset and flattens the pieces into a single collection:
scala> val splits=log.flatMap(x =>x.split("\t"))
splits: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:25

4. The result of flatMap:
scala> splits.collect
res1: Array[String] = Array(hello, hello, hello, world, world, john)

5. Then sum the counts for each word:
scala> splits.map(x =>(x,1)).reduceByKey(_+_).collect
res2: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))

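Note: reduceByKey involves a shuffle. Records with the same key are sent to the same reduce task, where their values are added together.

Sorting the words by number of occurrences, in descending or ascending order, is covered in sections 3.3 and 3.4.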
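As a compact alternative to the key-swapping approach used later in this article, the sketch below sorts the counts directly with `RDD.sortBy`. The object and function names are hypothetical, and the input lines are an in-memory stand-in for ruozeinput.txt rather than the real file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortedWordCountSketch {
  // Counts tab-separated words and sorts the (word, count) pairs by count.
  def countAndSort(sc: SparkContext, lines: Seq[String], ascending: Boolean): Array[(String, Int)] =
    sc.parallelize(lines)
      .flatMap(_.split("\t"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending) // sort by the count, not by the word
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SortedWordCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the contents of ruozeinput.txt (assumed from the output shown above).
      val lines = Seq("hello\thello\thello", "world\tworld", "john")
      countAndSort(sc, lines, ascending = false).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Like sortByKey, `sortBy` triggers a shuffle, so the choice between the two is mostly a matter of readability.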

2.2 subtract, intersection, and cartesian on RDDs in Detail

1. subtract (remove the elements that appear in another RDD)

Open RDD.scala, press Ctrl+O to list all the methods, and search for subtract:

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }

// Subtracting one RDD from another is a very common operation:
val a=sc.parallelize(1 to 5)
val b=sc.parallelize(2 to 3)
a.subtract(b).collect

Output: Array[Int] = Array(4, 1, 5)

2. intersection

/**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }


scala> a.intersection(b).collect
res4: Array[Int] = Array(2, 3)
// the elements that appear in both a and b

3. cartesian (Cartesian product):

/**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }

scala> a.cartesian(b).collect
res5: Array[(Int, Int)] = Array((1,2), (2,2), (1,3), (2,3), (3,2), (4,2), (5,2), (3,3), (4,3), (5,3))
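The three operators treat duplicate elements differently, which the small ranges above do not show. The following sketch uses hypothetical input values chosen to contain duplicates: subtract keeps duplicates from the left RDD, intersection removes them (as its Scaladoc states), and cartesian yields one pair per combination.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetOperatorsSketch {
  // Returns (a minus b, a intersect b, number of pairs in a x b).
  def run(sc: SparkContext): (List[Int], List[Int], Long) = {
    val a = sc.parallelize(Seq(1, 1, 2, 3, 3))
    val b = sc.parallelize(Seq(3, 3, 4))
    val difference = a.subtract(b).collect().sorted.toList // List(1, 1, 2): duplicates kept
    val common = a.intersection(b).collect().sorted.toList // List(3): duplicates removed
    val pairCount = a.cartesian(b).count()                 // 5 * 3 = 15 pairs
    (difference, common, pairCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SetOperatorsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(run(sc))
    } finally {
      sc.stop()
    }
  }
}
```

The results are sorted on the driver because the order returned by `collect` after a shuffle depends on partitioning.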

The spark-shell used so far is only suitable for testing; development is done with IDEA + Maven + Scala.

The overall execution flow when programming a Spark application:

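III. Building a Spark Application in IDEA with Maven

1. Creating a new project in IDEA, in brief:
Click File --> New --> Project, tick Create from archetype, and select org.scala-tools.archetypes:scala-archetype-simple.

2. Click Next and set GroupId: com.ruozedata.bigdata and ArtifactId: g6-spark.

3. Select the Maven home directory, use the settings.xml under the Maven home, and then choose the local repository.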

The imported packages can be viewed in two places:
1. External Libraries
2. View --> Tool Windows --> Maven Projects: click the project name, then click Dependencies

The following needs to be added to pom.xml:
1. In addition to the default repository, add one more; the CDH artifacts cannot be resolved from the default repository.
<repository>
      <id>cloudera</id>
      <name>cloudera</name>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    2. Add the dependencies
    <!-- spark-core dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <!-- hadoop-client dependency -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    
    Note: define the versions in the properties element as follows, so that they are easy to maintain later.
    <properties>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.4.0</spark.version>
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
  </properties>

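In pom.xml, right-click, choose Maven, and click Reimport to resolve the dependencies again. Then open View --> Tool Windows --> Maven Projects, expand Lifecycle, and click clean. Once the dependencies have finished downloading, the errors should disappear.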
Points to note here:
1) Change the Scala version
2) Add the spark-core dependency
3) Add the hadoop-client dependency
4) Add the CDH repository

3.1 Developing the Example, Uploading the JAR to the Server, and Preparing Test Data

1. Quickly write a WordCount program in IDEA, named WordCountApp:

package com.ruozedata.bigdata.SparkCore01

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf() // the master is passed in by spark-submit
    val sc = new SparkContext(sparkConf)

    val textFile = sc.textFile(args(0)) // args(0): input path

    val wc = textFile.flatMap(line => line.split("\t")).map((_, 1)).reduceByKey(_ + _) // split on tabs, count per word

    wc.collect.foreach(println) // bring the result to the driver and print it

    sc.stop()
  }

}

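Once WordCountApp is written, go to Maven Projects ⇒ Lifecycle ⇒ package ⇒ Run Maven Build to package it. When packaging finishes, the console at the bottom prints the build messages and the path of the JAR: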
```
  [INFO] Building jar: G:\bigdata_workspace\g7-spark\target\g6-spark-1.0.jar
```

Upload it to the server with the rz command:

[hadoop@hadoop004 lib]$ ll
total 52
-rw-r--r--. 1 hadoop hadoop 49347 Jun  7  2020 g6-spark-1.0.jar
[hadoop@hadoop004 lib]$ pwd
/home/hadoop/lib

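A side note: how are files uploaded to servers in production?

In production, files are usually uploaded with rz; FTP tools are not used.

Production servers sit behind a jump host: you normally log in to the jump host first and then log in to the server from there.
There is also a bastion host in between, so an FTP client simply cannot connect to the server.
rz works without problems in production, because it transfers the file over the terminal session that is already open instead of opening a separate connection.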

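Submitting the Spark program:

See:
http://spark.apache.org/docs/latest/submitting-applications.html

Once an application has been built, it can be launched with the bin/spark-submit script. The script takes care of setting up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes.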

1. Reading the input from HDFS:
[hadoop@hadoop004 bin]$ ./spark-submit --class com.ruozedata.bigdata.SparkCore01 --master local[2] /home/hadoop/lib/g6-spark-1.0.jar hdfs://hadoop004:9000/data/input/ruozeinput.txt
20/06/06 17:17:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

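// Cause of the failure: --class must be followed by the fully qualified name (package name + class name), but only the package name was given. We defined an object named WordCountApp; in IDEA, select WordCountApp, right-click, and choose Copy Reference to get the full name.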

3.2 Writing Results to the Console and to an HDFS Directory

1. Printing the result to the console:

20/06/06 19:20:25 INFO DAGScheduler: ResultStage 1 (collect at WordCountApp.scala:14) finished in 0.121 s
20/06/06 19:20:25 INFO DAGScheduler: Job 0 finished: collect at WordCountApp.scala:14, took 2.328096 s
(hello,3)
(world,2)
(john,1)
20/06/06 19:20:25 INFO SparkUI: Stopped Spark web UI at http://hadoop004:4041
20/06/06 19:20:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

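2. Writing the result to an HDFS directory:

1. Change one line of code in IDEA:
//    wc.collect().foreach(println)    // before: printed directly to the console
      wc.saveAsTextFile(args(1))       // the output path is passed as the second spark-submit argument

2. Submit with spark-submit as follows: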
[hadoop@hadoop004 bin]$ ./spark-submit --class com.ruozedata.bigdata.SparkCore01.WordCountApp \
--master local[2] \
/home/hadoop/lib/g6-spark-1.0.jar \
hdfs://hadoop004:9000/data/input/ruozeinput.txt \
hdfs://hadoop004:9000/data/output1

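3. Why are there two partitions (part-00000 and part-00001)? By default sc.textFile asks for at least min(defaultParallelism, 2) partitions, which is 2 with --master local[2]; reduceByKey keeps the partition count of its parent RDD here, and saveAsTextFile writes one part file per partition: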
[hadoop@hadoop004 lib]$ hdfs dfs -text /data/output1/part-00000
20/06/06 19:36:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(hello,3)
(world,2)
[hadoop@hadoop004 lib]$ hdfs dfs -text /data/output1/part-00001
20/06/06 19:36:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(john,1)
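The link between partitions and part files can be checked with `getNumPartitions`. In this minimal sketch, `parallelize` with 2 slices stands in for the two-partition `textFile` input (an assumption, as are the object and function names), and `coalesce(1)` shows how a single part file could be produced if that were wanted.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  // Returns (partitions after reduceByKey, partitions after coalesce(1)).
  // saveAsTextFile would write one part-XXXXX file per partition.
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val words = sc.parallelize(Seq("hello", "hello", "hello", "world", "world", "john"), 2)
    val wc = words.map((_, 1)).reduceByKey(_ + _)
    (wc.getNumPartitions, wc.coalesce(1).getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionCountSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc))
    } finally {
      sc.stop()
    }
  }
}
```

Collapsing to one partition is fine for a tiny result like this one, but it funnels all the output through a single task, so it is usually avoided for large datasets.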


3.3 Multiple Input Files, Input Path Pattern Matching, and Sorted Word Count

1. Processing multiple input files:

1. Submit with spark-submit:
[hadoop@hadoop004 bin]$ ./spark-submit --class com.ruozedata.bigdata.SparkCore01.WordCountApp --master local[2] /home/hadoop/lib/g6-spark-1.0.jar hdfs://hadoop004:9000/data/input/ hdfs://hadoop004:9000/data/output2


2. One line in the log shows that the HDFS input directory contains 3 files:
20/06/06 19:41:07 INFO FileInputFormat: Total input paths to process : 3

3. FileAlreadyExistsException: this error is thrown when the output path already exists.
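One common way to avoid this error while testing is to delete the output directory before writing. The sketch below does so with the Hadoop `FileSystem` API; the object and helper names are hypothetical, and the deletion is recursive and destructive, so it is only appropriate for scratch output paths.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

object OutputPathSketch {
  // Deletes `output` recursively if it exists; returns true if something was deleted.
  def deleteIfExists(sc: SparkContext, output: String): Boolean = {
    val path = new Path(output)
    val fs = path.getFileSystem(sc.hadoopConfiguration) // resolves hdfs:// or file:// from the path
    fs.exists(path) && fs.delete(path, true)
  }

  def main(args: Array[String]): Unit = {
    // As in WordCountApp, the master is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("OutputPathSketch"))
    try {
      val wc = sc.textFile(args(0)).flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _)
      deleteIfExists(sc, args(1)) // destructive: removes any earlier result at this path
      wc.saveAsTextFile(args(1))
    } finally {
      sc.stop()
    }
  }
}
```

The alternative used in this article, writing each run to a new directory such as output1 and output2, keeps earlier results at the cost of manual cleanup.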

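2. Wildcards are supported:

For example, the input path hdfs://hadoop004:9000/data/input/*.txt matches every file in that HDFS directory whose name ends in .txt.

Note: Total input paths to process : 3 is the number of input files matched, not the number of splits. A large file (say 130 MB) still counts as a single input path; how many partitions it is read into is decided separately, when the input splits are computed.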

3. Word count with sorting

Output of the previous code:

(world,6)
(hello,9)
(john,3)

How can the output be sorted by value?

Change the code in IDEA as follows:
val wc = textFile.flatMap(line =>line.split("\t")).map((_,1)).reduceByKey(_+_)
val sorted = wc.map(x =>(x._2,x._1)).sortByKey(true).map(x =>(x._2,x._1))
sorted.saveAsTextFile(args(1))

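Every change submitted this way requires building the JAR, uploading it to the server, and running spark-submit, which is somewhat tedious. Simple tests can be run directly in spark-shell instead.

3.4 Quick Testing in spark-shell

1. Read the file: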
val textFile = sc.textFile("hdfs://hadoop004:9000/data/input/ruozeinput.txt")

2. flatMap splits each line and flattens the words; map then pairs each word with a count of 1:
scala> val wc=textFile.flatMap(line =>line.split("\t")).map((_,1))
wc: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[37] at map at <console>:25

scala> wc.collect
res14: Array[(String, Int)] = Array((hello,1), (hello,1), (hello,1), (world,1), (world,1), (john,1))

3. reduceByKey sums the counts per word, map swaps key and value, and sortByKey sorts in ascending order by default:
wc.reduceByKey(_+_).map(x =>(x._2,x._1)).sortByKey(true).collect

4. For descending order, change true to false:
wc.reduceByKey(_+_).map(x =>(x._2,x._1)).sortByKey(false).collect

The definition of sortByKey in the source code:


  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }