Structured Streaming

Chapter 1 Introduction to Structured Streaming

1.1 Overview of Structured Streaming

Starting with Spark 2.0, Spark introduced a new streaming computation model: Structured Streaming.

This component further reduces processing latency and implements exactly-once semantics, guaranteeing that data is consumed precisely once.

Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. With Structured Streaming you express a streaming computation the same way you would express a batch computation on static data (data in a table).

As streaming data keeps arriving, the Spark SQL engine runs continuously and incrementally computes the final result. We can use the Dataset/DataFrame API to express streaming aggregations, event-time windows, stream-to-batch joins, and so on. All of these computations run on the optimized Spark SQL engine. Finally, through checkpointing and Write-Ahead Logs (WALs), the system guarantees end-to-end exactly-once semantics.

In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing, without the user having to reason about the stream itself (for example, the various stream transformations in Spark Streaming).

By default, Structured Streaming queries are processed internally by a micro-batch processing engine, which treats the data stream as a series of small batch jobs and achieves latencies as low as 100 milliseconds. Since Spark 2.3, a new low-latency processing model called Continuous Processing is available, with latencies as low as 1 millisecond.
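
Switching a query from the default micro-batch engine to continuous processing is just a matter of the trigger. A minimal sketch, assuming a rate source and console sink purely for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").appName("continuous-demo").getOrCreate()
val df = spark.readStream.format("rate").load()

df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second")) // remove this line to fall back to the micro-batch engine
  .start()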

Chapter 2 Quick Start

2.1 Import the Dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

2.2 WordCount

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object WordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).master("local[*]").getOrCreate()
    import sparkSession.implicits._
    val df = sparkSession.readStream
      .format("socket")
      .option("host", "hadoop102")
      .option("port", "7777")
      .load()
      .as[String]
    val result = df.flatMap(_.split(" ")).groupBy("value").count()
    val query = result.writeStream.outputMode(OutputMode.Update()).format("console").start()
    query.awaitTermination()
  }
}

2.3 Testing

1. Listen on the port on hadoop102

[atguigu@hadoop102 ~]$ nc -lk 7777

2. Start the Structured Streaming job, type in some words, and check the results.

Chapter 3 Programming Model

3.1 Basic Concepts

3.1.1 The Input Table

Treat the input data stream as an input table. Each data item arriving on the stream is like a new row appended to the input table.

3.1.2 The Result Table

A query on the input table produces a "Result Table". At every trigger interval (e.g., 1 s), new rows are appended to the input table, which eventually updates the result table. Whenever the result table is updated, we want to write the changed result rows to an external sink.

3.2 Output Modes

(1) Complete Mode: the entire updated result table is written to the external sink. The connector decides how to handle writing the whole table.

(2) Append Mode: only the new rows appended to the result table are written to the external sink. This mode applies when existing result rows never change.

(3) Update Mode: only the rows of the result table that were updated are written to the external sink. If no historical rows ever change in this mode, it behaves like Append mode.

Structured Streaming does not materialize the whole input table; that is, it does not keep its full detailed data. It reads the latest available data from the streaming source, processes it incrementally to update the result, and then keeps only the latest result while discarding the source data. This model is noticeably different from many other stream processing engines, which require users to maintain running aggregations themselves and therefore to reason about fault tolerance and data consistency.

3.3 Data Sources

(1) File source: reads files written to a directory as a stream of data. Supports CSV, JSON, ORC, and Parquet. Fault-tolerant. Supports glob paths, but not comma-separated lists of glob paths.

(2) Socket source: reads UTF-8 text data from a socket. Generally used for testing; does not provide end-to-end fault tolerance.

(3) Kafka source: reads data from Kafka, version 0.10.0 or higher. Fault-tolerant.

(4) Rate source: generates data at the specified number of rows per second, each row carrying a timestamp. Intended for testing.

3.3.1 Socket Source

See the WordCount example in the Quick Start chapter.

3.3.2 File Source

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

object FileSource {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).master("local[*]").getOrCreate()
    import sparkSession.implicits._
    val schema = new StructType().add("value", "String")
    // val df = sparkSession.readStream.textFile("C:\\Users\\lzt\\Desktop\\test")
    val df = sparkSession.readStream
      .schema(schema)
      .option("sep", ";")
      .csv("C:\\Users\\lzt\\Desktop\\test")
      .as[String]
      .flatMap(_.split(","))
      .groupBy("value")
      .count()
    val query = df.writeStream.outputMode(OutputMode.Update()).format("console").start()
    query.awaitTermination()
  }
}

3.3.3 Kafka Source

Add the dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

Write the code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object KafkaSource {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).master("local[*]").getOrCreate()
    val df = sparkSession.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "hadoop102:9092,hadoop103:9092,hadoop104:9092")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("kafka.group.id", "test-group")
      .load()
    import sparkSession.implicits._
    val query = df.selectExpr("CAST(value AS STRING)").as[String]
      .flatMap(_.split(" ")).groupBy("value").count()
      .writeStream.outputMode(OutputMode.Update()).format("console").start()
    query.awaitTermination()
  }
}

Send test data:

[atguigu@hadoop104 kafka]$ bin/kafka-console-producer.sh --broker-list hadoop104:9092 --topic test

3.3.4 Rate Source

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object RateSource {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).master("local[*]").getOrCreate()
    val rows = sparkSession.readStream.format("rate")
      .option("rowsPerSecond", 10) // rows generated per second, default 1
      .option("rampUpTime", 1)     // seconds taken to ramp up to the target rate, default 0
      .option("numPartitions", 2)  // number of partitions, default is Spark's default parallelism
      .load
    rows.writeStream.outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}

3.4 Handling Late Data

Event time is the time embedded in the data itself. Many applications want to process data based on this time rather than on the time at which Spark receives the data. In Structured Streaming, window aggregations can be performed based on event time.

This model naturally handles data that arrives later than expected according to its event time. Starting with Spark 2.1, watermarking is supported, which lets the user specify a threshold on how late data may be and allows the engine to clear old state accordingly.

3.5 Fault-Tolerance Semantics

Exactly-once delivery is one of the main design goals of Structured Streaming. To achieve it, Structured Streaming's sources, sinks, and execution engine reliably track the exact progress of processing, so that the query can handle any kind of failure by restarting and/or reprocessing. For a streaming source such as Kafka, for example, Structured Streaming tracks the offsets it has read, and the engine uses checkpointing and write-ahead logs (WAL) to record the offset range of every trigger. Replayable sources combined with idempotent sinks allow Structured Streaming to guarantee end-to-end exactly-once delivery.

Chapter 4 Working with Structured Streaming

4.2 Window Operations Based on Event Time

In Structured Streaming, data can be aggregated according to the time at which events occurred, i.e., based on event time.

With this mechanism, you need to worry neither about whether Spark receives events in the same order in which they occurred, nor about the relationship between the time an event reaches Spark and the time it actually happened.

It therefore improves the accuracy of the results while greatly reducing the developer's workload.

Suppose we want to count words over 10-minute windows, updated every 5 minutes; that is, count the words received in the windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, and so on. Note that 12:00 - 12:10 means data that arrived after 12:00 and before 12:10.

Now consider a word received at 12:07. It should increment the counts of both windows 12:00 - 12:10 and 12:05 - 12:15, so the counts are indexed by both the grouping key (the word) and the window (which can be computed from the event time).

The expected result is a count for each (window, word) pair. The test code is as follows:

import java.sql.Timestamp
import WaterMarkTest.Student
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

/**
 * Test data: 1001,2020-08-06 16:50:54
 * lzt
 * Tests event-time windows
 */
object WordCountWindow {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).master("local[*]").getOrCreate()
    import sparkSession.implicits._
    val line = sparkSession.readStream.format("socket")
      .option("host", "hadoop102")
      .option("port", "7777")
      .load()
      .as[String]
    // Student(id, time) is a case class defined in WaterMarkTest (not shown here)
    val result = line.as[String].map(item => {
      val arry = item.split(",")
      Student(arry(0).toInt, Timestamp.valueOf(arry(1)))
    })
    import org.apache.spark.sql.functions._
    val query = result.groupBy(window(column("time"), "10 minutes", "4 minutes")).count()
      .writeStream.outputMode(OutputMode.Complete()).format("console")
      .option("truncate", "false").start()
    query.awaitTermination()
  }
}

From the output above you can see that two windows are produced. Structured Streaming generates the corresponding time windows from the event time and then aggregates according to the specified rules.

4.3 How Event-Time Windows Are Generated

org.apache.spark.sql.catalyst.analysis.TimeWindowing

// maxNumOverlapping = ceil(windowDuration / slideDuration)    // number of candidate windows
// for (i <- 0 until maxNumOverlapping)
//   windowId    = ceil((timestamp - startTime) / slideDuration)
//   windowStart = windowId * slideDuration + (i - maxNumOverlapping) * slideDuration + startTime
//   windowEnd   = windowStart + windowDuration
//   return windowStart, windowEnd

According to the source code, ceil(window duration / slide duration) gives the maximum number of overlapping windows generated per record (not necessarily the number actually kept, since candidates that do not contain the event time are filtered out).

The loop runs over that number of candidates. windowId is computed as ceil((timestamp - startTime) / slideDuration), where timestamp is the event time of the test record, 2020-08-06 16:50:54, and startTime is 0 in the source code; dividing by the slide duration and rounding up yields the windowId.

The start of each candidate window is then computed from the source-code expression windowId * slideDuration + (i - maxNumOverlapping) * slideDuration + startTime. Because windowId is obtained with ceil, windowId * slideDuration is a multiple of the slide duration at or after the record's event time, here 2020-08-06 16:52:00. With a 10-minute window and a 4-minute slide, maxNumOverlapping = ceil(10 / 4) = 3, so the candidate start times are 16:52:00 minus 3, 2, and 1 slide steps: 16:40:00, 16:44:00, and 16:48:00. The 16:40:00 candidate does not contain the event time and is filtered out, so the earliest window actually kept starts at 2020-08-06 16:44:00.

Finally, the window end time, windowEnd, is the window start time plus the window duration, giving the two windows [16:44:00, 16:54:00) and [16:48:00, 16:58:00).
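
A minimal Scala sketch of this rule applied to the example record above; it only replays the arithmetic described in the text, it is not Spark's actual implementation:

import java.sql.Timestamp

object WindowRuleSketch extends App {
  val windowMs    = 10 * 60 * 1000L // window duration: "10 minutes"
  val slideMs     = 4 * 60 * 1000L  // slide duration: "4 minutes"
  val startTimeMs = 0L
  val ts = Timestamp.valueOf("2020-08-06 16:50:54").getTime

  val maxNumOverlapping = math.ceil(windowMs.toDouble / slideMs).toInt            // 3 candidates
  val windowId          = math.ceil((ts - startTimeMs).toDouble / slideMs).toLong // next 4-minute boundary

  val windows = (0 until maxNumOverlapping).map { i =>
    val windowStart = windowId * slideMs + (i - maxNumOverlapping) * slideMs + startTimeMs
    (new Timestamp(windowStart), new Timestamp(windowStart + windowMs))
  }.filter { case (start, end) => start.getTime <= ts && ts < end.getTime } // drop candidates that miss the record

  windows.foreach(println) // (16:44:00, 16:54:00) and (16:48:00, 16:58:00)
}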

4.4 Handling Late Data with Watermarks

In a data analytics system, Structured Streaming can continuously aggregate data by event time, but there is no guarantee that data arrives in event-time order: the event time of a newly received record may be far earlier than event times that have already been processed. When this happens, you usually need to filter late data according to the business requirements.

Consider the impact of late events. Suppose a word is generated at 12:04 (event time) but reaches the application at 12:11. The application should use 12:04 to update the count of the window 12:00 - 12:10, not 12:11. This happens naturally in window-based aggregations, because Structured Streaming can maintain the intermediate state of partial aggregates for a long time.

However, if the query runs for days, the system must bound the amount of intermediate state accumulated in memory. That means the system needs to know when an old aggregate can be dropped from in-memory state because the application will no longer accept late data for it.

To meet this need, Spark 2.1 introduced watermarks, which let the engine automatically track the current event time and attempt to clean up old state accordingly.

You define a query's watermark by specifying the event-time column and an estimated upper bound on how late events may be. For a window ending at time T, the engine keeps its state and accepts late data until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold is aggregated, while data later than the threshold starts being dropped.

A watermark is defined with withWatermark().

Watermark computation: watermark = MaxEventTime - Threshold

Also, the watermark can only increase; it never decreases.

Summary:

Structured Streaming introduced the watermark mechanism mainly to solve two problems:

handling late data in aggregations;

bounding the aggregation state maintained in memory.

Watermarks behave differently in the different output modes (complete, append, update).


4.4.1 Using Watermarks in update Mode

In update mode, only rows that are new or updated relative to the previous batch are output.

import java.sql.Timestamp
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object UpdateWaterMarkTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).master("local[*]").getOrCreate()
    import sparkSession.implicits._
    val df = sparkSession.readStream.format("socket")
      .option("host", "hadoop102")
      .option("port", "7777")
      .load()
      .as[String]
    val result = df.map(item => {
      val array = item.split(",")
      (Timestamp.valueOf(array(0)), array(1))
    }).toDF("timestamp", "value")
    import org.apache.spark.sql.functions._
    val wordCount = result.withWatermark("timestamp", "2 minutes") // set the watermark: allow data up to 2 minutes late
      .groupBy(window($"timestamp", "10 minutes", "2 minutes"), $"value")
      .count()
    val query = wordCount.writeStream.outputMode("update").format("console").option("truncate", "false")
      .start()
    query.awaitTermination()
  }
}

Enter the test record 2020-10-15 10:55:00,dog. With the window size of 10 minutes and slide of 2 minutes configured in the code, 5 windows are produced.

Given the current event time 2020-10-15 10:55:00 and the 2-minute watermark delay, the current watermark is 2020-10-15 10:53:00. Now enter a second test record: 2020-10-15 11:00:00,dog.

Update mode only outputs rows that changed: windows with count 2 were updated, and windows with count 1 were newly added. Because the maximum event time is now 2020-10-15 11:00:00, the watermark rises with it to 2020-10-15 10:58:00; 2 of the windows from the first batch are now below this watermark, so their maintained state is dropped.

Likewise, new data whose window falls below the watermark is filtered out. To test this, send 2020-10-15 10:55:00,dog again.

Only 3 windows appear: the 2 windows below the watermark had their state dropped, and they are also filtered out of the output.

4.4.2 Using Watermarks in append Mode

Simply change update to append in the previous example.

val query = wordCount.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()

In append mode, only newly added rows are output, and once output they never change.

Send the test record 2020-10-15 10:55:00,dog. Nothing is output, because Structured Streaming cannot yet be sure that later data will not update the current result. Only when a window's end time falls below the watermark can the engine be sure that no later data will update that window, and only then is its result output.

At this point the watermark is 2020-10-15 10:55:00 minus 2 minutes, i.e., 2020-10-15 10:53:00. Send the test record 2020-10-15 11:00:00,dog so that the maximum event time, and with it the watermark, rises (to 2020-10-15 10:58:00). Once the watermark rises, the corresponding 2 windows expire (see the update-mode example); their end times are below the watermark 2020-10-15 10:58:00, so their final results are output.

4.4.3 Watermark Mechanism Summary

Watermarks are used in time-based stateful aggregations; the time can be a window or the event time itself.

The output mode must be append or update. Complete mode (which requires an aggregation) must output all aggregation results on every trigger, while the whole point of a watermark is to discard stale aggregation state, so using a watermark in complete mode has no effect and no purpose.

In append mode, a watermark must be set before an aggregation can be used. In effect, the watermark defines when append mode emits an aggregation result (state) and cleans up expired state.

In update mode, the watermark is mainly used to filter out expired data and clean up expired state promptly.

The watermark is updated while the current batch is processed and takes effect when the next batch is processed; if a node fails, it may take several more batches before it takes effect.

withWatermark must use the same timestamp column as the aggregation: df.withWatermark("time", "1 min").groupBy("time2").count() is invalid.

withWatermark must be called before the aggregation: df.groupBy("time").count().withWatermark("time", "1 min") is invalid.

4.5 Stream Deduplication

import java.sql.Timestamp
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

/**
 * Test data:
 * 1,1001,20,2020-01-01 11:50:00
 * 1,1002,22,2020-01-01 11:55:00
 * 2,1003,22,2020-01-01 11:52:00
 * 3,1004,55,2020-01-01 11:54:00
 */
object StreamDropDuplicate {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    val line = sparkSession.readStream.format("socket")
      .option("host", "hadoop102")
      .option("port", "7777")
      .load()
    import sparkSession.implicits._
    val result = line.as[String].map(item => {
      val array = item.split(",")
      (array(0), array(1), array(2), Timestamp.valueOf(array(3)))
    }).withWatermark("_4", "2 minutes")
      .dropDuplicates("_1")
    result.writeStream.outputMode(OutputMode.Update()).format("console")
      .start().awaitTermination()
  }
}

The dropDuplicates operator deduplicates on the given columns. You can also set a watermark delay for deduplication: if one is set, duplicates are only tracked within the delay window; without a watermark, deduplication is global. A sketch of the two variants follows.
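
A minimal sketch of the two variants; the userId/eventTime column names and the socket source are illustrative, not part of the example above:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dedup-sketch").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", "9999").load()
  .as[String]
  .map { line => val a = line.split(","); (a(0), Timestamp.valueOf(a(1))) }
  .toDF("userId", "eventTime")

// Bounded state: duplicates are only tracked within the 10-minute watermark window.
val dedupWithinWatermark = events.withWatermark("eventTime", "10 minutes")
  .dropDuplicates("userId", "eventTime")

// Unbounded state: global deduplication; every userId ever seen must be kept in state.
val dedupGlobal = events.dropDuplicates("userId")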

4.6 Join

Spark 2.0 supports joins between a stream and static data; Spark 2.3 added stream-stream joins.

4.6.1 Stream-Static Join

First prepare the test CSV file

and upload it to an HDFS path.

Write the test code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

/**
 * 1001,school1
 * 1002,school2
 * 1003,school3
 */
object StreamStaciJoin {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    import sparkSession.implicits._
    val df = sparkSession.read.csv("hdfs://mycluster/user/atguigu/static.csv")
      .toDF("uid", "name", "age")
    val dstream = sparkSession.readStream.format("socket")
      .option("host", "hadoop102")
      .option("port", "7777")
      .load()
      .as[String]
    val dsStudent = dstream.map(item => {
      val array = item.split(",")
      (array(0), array(1))
    }).toDF("uid", "school")
    val result = dsStudent.join(df, Seq("uid"))
    result.writeStream.outputMode(OutputMode.Update()).format("console")
      .start().awaitTermination()
  }
}

To test, listen on port 7777 on the hadoop102 machine and run the Spark code.

Check the run results.

The join test succeeds.

4.6.2 Stream-Stream Join

import java.sql.Timestamp
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

// Test data:
// 1001,imporession1,2020-08-07 10:00:00
// 1002,imporession1,2020-08-07 10:05:00
// 1003,imporession1,2020-08-07 9:08:00
// 1001,clieck1,2020-08-07 13:00:00
// 1002,clieck1,2020-08-07 13:00:00
// 1003,clieck1,2020-08-07 10:00:00

case class ImpressionData(impressionAdId: Int, data: String, impressionTime: Timestamp)
case class ClickData(clickId: Int, data: String, clickTime: Timestamp)

object StreamJoin {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    val line1 = sparkSession.readStream.format("socket")
      .option("host", "hadoop102").option("port", "7777").load()
    val line2 = sparkSession.readStream.format("socket")
      .option("host", "hadoop102").option("port", "7778").load()
    import sparkSession.implicits._
    val impressionStream = line1.as[String].map(item => {
      val datas = item.split(",")
      ImpressionData(datas(0).toInt, datas(1), Timestamp.valueOf(datas(2)))
    }).withWatermark("impressionTime", "2 hours")
    val clieckStream = line2.as[String].map(item => {
      val datas = item.split(",")
      ClickData(datas(0).toInt, datas(1), Timestamp.valueOf(datas(2)))
    }).withWatermark("clickTime", "3 hours")
    import org.apache.spark.sql.functions._
    val result = clieckStream.join(impressionStream, expr("impressionAdId = clickId"))
    result.writeStream.outputMode(OutputMode.Append()).format("console")
      .option("truncate", "false").start().awaitTermination()
  }
}

In a Structured Streaming stream-stream join, the effective watermark switches back and forth between the two streams: the global watermark follows the smaller of the two event-time watermarks. In the example above, impressionStream and clieckStream have watermark delays of 2 hours and 3 hours respectively. While impressionStream's maximum event time is smaller than clieckStream's, impressionStream's watermark rule applies: its maximum event time comes from the record 1002,imporession1,2020-08-07 10:05:00, so the watermark is 2020-08-07 08:05:00, and data from either stream can be joined as long as it is later than 2020-08-07 08:05:00. Later, after the test record 1001,imporession1,2020-08-07 16:00:00 is sent, impressionStream's maximum event time exceeds clieckStream's maximum event time (from 1002,clieck1,2020-08-07 13:00:00), so the watermark rule switches to clieckStream with its 3-hour delay, giving a watermark of 2020-08-07 10:00:00; data from both streams that is later than 2020-08-07 10:00:00 can still be joined.

4.7 Handling Multiple Watermarks

When streams are joined (or unioned) together, each input stream has its own watermark threshold. While executing the query, Structured Streaming separately tracks the maximum event time seen in each input stream, computes a watermark for each based on its delay, and picks one of them as the global watermark used for stateful operations.

By default, the minimum is chosen as the global watermark. In other words, the global watermark safely moves at the pace of the slowest stream, and the results are delayed accordingly.

Starting with Spark 2.4, the setting spark.sql.streaming.multipleWatermarkPolicy controls which watermark becomes the global one. The default value is min, the watermark of the most delayed stream. Setting it to max uses the watermark of the least delayed stream instead, which lets the global watermark advance at the pace of the fastest stream; the side effect is that data from the slower streams may be dropped.
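
A minimal sketch of switching the policy to max; spark.sql.streaming.multipleWatermarkPolicy is the Spark SQL configuration named above, everything else is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-watermark-policy")
  .config("spark.sql.streaming.multipleWatermarkPolicy", "max") // default is "min"
  .getOrCreate()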

4.8 Unsupported Operations

(1) Multiple streaming aggregations (i.e., a chain of aggregations on a streaming DataFrame) are not supported.

(2) limit and take are not supported.

(3) distinct is not supported.

(4) Sorting is only supported after an aggregation, and only in Complete output mode.

(5) count() cannot be called directly to count a stream; use ds.groupBy().count() to get a grouped count instead.

(6) foreach() cannot be called directly; use ds.writeStream.foreach() instead.

(7) show() cannot be called directly; use the console sink instead.

In the current version, attempting any of these operations raises an AnalysisException.

4.9 Output Sinks

File sink - stores the output to a directory

writeStream
    .format("parquet")        // can be "orc", "json", "csv", etc.
    .option("path", "path/to/destination/dir")
    .start()

Kafka sink - writes the output to one or more topics in Kafka

writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "updates")
    .start()

Foreach sink - runs arbitrary computation on each output record

writeStream
    .foreach(...)
    .start()

Console sink (for debugging) - prints the output to the console

writeStream
    .format("console")
    .start()

Memory sink (for debugging) - stores the output in memory as an in-memory table. For debugging only.

writeStream
    .format("memory")
    .queryName("tableName")
    .start()

| Sink | Supported Output Modes | Options | Fault-tolerant |
| File Sink | Append | path: path to the output directory, must be specified. For file-format-specific options, see the related methods in DataFrameWriter (Scala/Java/Python/R), e.g. for the "parquet" format see DataFrameWriter.parquet() | Yes (exactly-once) |
| Kafka Sink | Append, Update, Complete | See the Kafka Integration Guide | Yes (at-least-once) |
| Foreach Sink | Append, Update, Complete | None | Yes (at-least-once) |
| ForeachBatch Sink | Append, Update, Complete | None | Depends on the implementation |
| Console Sink | Append, Update, Complete | numRows: number of rows to print every trigger (default: 20); truncate: whether to truncate the output if too long (default: true) | No |
| Memory Sink | Append, Complete | None | No. But in Complete Mode, a restarted query will recreate the full table. |

4.10 Trigger

The trigger settings of a streaming query define the timing of streaming data processing: whether the query runs as a micro-batch query with a fixed batch interval or as a continuous processing query.

The supported triggers are listed below (see the sketch after the list):

(1) Unspecified (default): a new micro-batch is generated as soon as the previous micro-batch finishes processing.

(2) Fixed-interval micro-batches: the query runs in micro-batch mode, and micro-batches are kicked off at the user-specified interval. If the previous micro-batch finishes within the interval, the engine waits until the interval has elapsed before starting the next one; if it takes longer than the interval, the next micro-batch starts as soon as it completes, with no waiting. If no new data arrives, no micro-batch is started.

(3) One-time micro-batch: the query executes a single micro-batch to process all available data and then stops. This is useful for scheduled jobs.

(4) Continuous with a fixed checkpoint interval: the query runs in the new low-latency continuous processing mode introduced in Spark 2.3, which achieves end-to-end latencies of around 1 ms with fault-tolerance guarantees. By comparison, the default micro-batch trigger can achieve exactly-once guarantees but with latencies no better than about 100 ms, while continuous mode provides at-least-once guarantees. Continuous mode does not support watermark operations.
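
A minimal sketch of setting these triggers; Trigger.ProcessingTime, Trigger.Once, and Trigger.Continuous are the standard Spark APIs, df stands for any streaming DataFrame (such as the one from the quick-start example), and the console sink is used only for illustration:

import org.apache.spark.sql.streaming.Trigger

// (1) default: omit .trigger(...) entirely
// (2) fixed-interval micro-batches, every 10 seconds
df.writeStream.format("console").trigger(Trigger.ProcessingTime("10 seconds")).start()
// (3) one-time micro-batch, then stop
df.writeStream.format("console").trigger(Trigger.Once()).start()
// (4) continuous processing with a 1-second checkpoint interval
df.writeStream.format("console").trigger(Trigger.Continuous("1 second")).start()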

4.11 Using foreach and foreachBatch

Both foreach and foreachBatch let you define custom output logic; the difference is that foreach is invoked per row, while foreachBatch is invoked per micro-batch.

data.writeStream.foreach(new ForeachWriter[Student2] {
  override def open(partitionId: Long, epochId: Long): Boolean = ???
  override def process(value: Student2): Unit = ???
  override def close(errorOrNull: Throwable): Unit = ???
})

data.writeStream.foreachBatch(new VoidFunction2[Dataset[Student2], java.lang.Long] {
  override def call(v1: Dataset[Student2], v2: lang.Long): Unit = {
  }
})

data.writeStream.foreachBatch { (batchDF: Dataset[Student2], batchId: Long) =>
  batchDF.persist()
  batchDF.write.format("").save() // location
  batchDF.write.format("").save() // location2
  batchDF.unpersist()
}

foreach works much like in Flink: open opens the resource connection, process handles each record, and close releases the resources.

foreachBatch provides at-least-once semantics; to get exactly-once semantics you have to deduplicate yourself using the batchId, for example as sketched below.
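
A minimal sketch of batchId-based deduplication; it continues the data/Student2 snippet above, and isBatchDone/markBatchDone are hypothetical helpers backed by your own metadata store, not a Spark API:

data.writeStream.foreachBatch { (batchDF: Dataset[Student2], batchId: Long) =>
  // Skip batches that were already committed; a replayed batch then becomes a no-op.
  if (!isBatchDone(batchId)) {                     // hypothetical: look up batchId in a metadata store
    batchDF.write.mode("append").format("parquet").save("/warehouse/output") // write this batch
    markBatchDone(batchId)                         // hypothetical: record batchId atomically with the write
  }
}.start()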

4.12 Kafka Integration

4.12.1 Import the Required Dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

The Kafka version must be 0.11.0.0 or higher.

4.12.2 Required Options

| Option | Value | Meaning |
| assign | JSON string {"topicA":[0,1],"topicB":[0,1]} | The specific topic partitions to consume, as a JSON string. Only one of assign, subscribe, and subscribePattern may be specified. |
| subscribe | topicA,topicB | A comma-separated list of topics to subscribe to. |
| subscribePattern | Java regex string | A regular expression used to match the topics to subscribe to. |
| kafka.bootstrap.servers | hadoop101:9092,hadoop102:9092,hadoop103:9092 | The Kafka cluster addresses. |

4.12.3 Optional Options

| Option | Value | Default | Query type | Meaning |
| startingOffsetsByTimestamp | json string """{"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}}""" | none | streaming and batch | A starting timestamp for each partition when a query is started; the query fails if no offset exists for a given timestamp. Behaves like KafkaConsumer.offsetsForTimes. |
| startingOffsets | "earliest", "latest" (streaming only), or json string """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""" | "latest" for streaming, "earliest" for batch | streaming and batch | The start point of a query: the offset at which to begin for each partition. In the JSON, -2 denotes the earliest offset and -1 the latest offset. |
| endingOffsetsByTimestamp | json string """{"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}}""" | latest | batch query | The end point of a batch query: an ending timestamp for each partition. |
| endingOffsets | latest or json string {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}} | latest | batch query | The end point of a batch query: the ending offset for each partition. In the JSON, -1 denotes the latest offset; -2 is not allowed. |
| failOnDataLoss | true or false | true | streaming and batch | Whether to fail the query when data may have been lost (the topic was deleted or offsets are out of range). Can be disabled when you consider the alarm a false positive. |
| kafkaConsumer.pollTimeoutMs | long | 512 | streaming and batch | Timeout in milliseconds for polling data from Kafka in the executors. |
| fetchOffset.numRetries | int | 3 | streaming and batch | Number of times to retry fetching Kafka offsets. |
| fetchOffset.retryIntervalMs | long | 10 | streaming and batch | Milliseconds to wait between retries after a failure. |
| maxOffsetsPerTrigger | long | none | streaming and batch | Rate limit on the maximum number of offsets processed per trigger interval; the specified total is divided proportionally across the topic partitions. |
| minPartitions | int | none | streaming and batch | The minimum number of Spark partitions to read from Kafka. By default Spark maps Kafka partitions to Spark partitions 1:1; if this is set larger than the number of topic partitions, Spark splits large Kafka partitions into smaller pieces. |
| groupIdPrefix | string | spark-kafka-source | streaming and batch | Prefix of the consumer group identifier (group.id) generated by Structured Streaming queries. Ignored if kafka.group.id is set. |
| kafka.group.id | string | none | streaming and batch | The consumer group id to use. |
| includeHeaders | boolean | false | streaming and batch | Whether to include the Kafka headers in each row. |

4.12.4 Consumer Caching

Initializing a Kafka consumer is expensive, especially in streaming scenarios, so Spark caches Kafka consumers on the executors using Apache Commons Pool. The cache key includes the topic name, topic partition, and group id.

The pool is configured as follows:

| Property Name | Default | Meaning | Since Version |
| spark.kafka.consumer.cache.capacity | 64 | The maximum number of cached consumers. This is a soft limit and does not affect normal operation. | 3.0 |
| spark.kafka.consumer.cache.timeout | 5 minutes | The maximum time a consumer may sit idle in the pool. | 3.0 |
| spark.kafka.consumer.cache.evictorThreadRunInterval | 1 minute | The interval at which the evictor thread removes idle consumers from the pool. | 3.0 |
| spark.kafka.consumer.cache.jmx.enable | false | Whether to enable JMX when creating the pool. Disabled by default. | 3.0 |

4.12.5 Writing Data to Kafka

Structured Streaming can write to Kafka in both batch mode and streaming mode; reading works the same way with read and readStream.

Use writeStream for streaming output and write for batch output, as in the sketch below.
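
A minimal sketch of the two write paths; the server, topic, and checkpoint path are illustrative, and df stands for a streaming DataFrame and staticDf for a batch DataFrame, each with key/value columns:

// Streaming write: the running query keeps pushing rows to the topic.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
  .option("topic", "output_topic")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()

// Batch write: a one-off write of a static DataFrame to the same topic.
staticDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
  .option("topic", "output_topic")
  .save()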

    

4.12.6 Example: Counting by Type from Kafka

(1) Create the topic

[root@hadoop102 module]# kafka_2.11-2.4.0/bin/kafka-topics.sh --zookeeper hadoop102:2181/kafka_2.4  --create --replication-factor 2 --partitions 10  --topic register_topic

(2) Write producer code to send data

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.{SparkConf, SparkContext}

object RegisterProducer {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("registerProducer").setMaster("local[*]")
    val ssc = new SparkContext(sparkConf)
    ssc.textFile("file://" + this.getClass.getResource("/register.log").getPath, 10)
      .foreachPartition(partition => {
        val props = new Properties()
        props.put("bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
        props.put("acks", "1")
        props.put("batch.size", "16384")
        props.put("linger.ms", "10")
        props.put("buffer.memory", "33554432")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(item => {
          val msg = new ProducerRecord[String, String]("register_topic", item)
          producer.send(msg)
        })
        producer.flush()
        producer.close()
      })
  }
}

(3) Data format: count the number of records grouped by the second (type) field.

(4) Write the Structured Streaming code

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}

case class Data(appName: String, value: Int)

object WortCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("word count").setMaster("local[*]")
      .set("spark.sql.shuffle.partitions", "10") // same as the number of topic partitions
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    val df = sparkSession.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
      .option("subscribe", "register_topic")
      .option("startingOffsets", "earliest")
      .option("enable.auto.commit", "false") // do not commit offsets during testing
      .option("maxOffsetsPerTrigger", "3000")
      .load()
    import sparkSession.implicits._
    val result = df.selectExpr("cast(value as string)").as[String]
      .filter(item => item.split("\t").length == 3)
      .mapPartitions((partition: Iterator[String]) => partition.map(item => {
        val datas = item.split("\t")
        val app_name = datas(1) match {
          case "1" => "PC"
          case "2" => "APP"
          case _ => "Other"
        }
        Data(app_name, 1)
      })).groupBy("appName").count()
    val query = result.writeStream.outputMode("update").format("console")
      .option("truncate", "false").start()
    query.awaitTermination()
  }
}

Testing shows that the AQE feature has no effect on Structured Streaming; it only applies to batch processing.

(5) Write the results to MySQL

Table creation statement:

create table wordcount(
  id int PRIMARY KEY AUTO_INCREMENT,
  appname varchar(20),
  `value` int,
  UNIQUE INDEX index_appname(appname)
) ENGINE=INNODB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;

Prepare the required jar packages and utility classes:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>druid</artifactId>
    <version>1.1.16</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.29</version>
</dependency>

import java.io.InputStream;
import java.util.Properties;

/**
 * Utility class for reading the configuration file.
 */
public class ConfigurationManager {
    private static Properties prop = new Properties();

    static {
        try {
            InputStream inputStream = ConfigurationManager.class.getClassLoader()
                    .getResourceAsStream("comerce.properties");
            prop.load(inputStream);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Get a configuration entry
    public static String getProperty(String key) {
        return prop.getProperty(key);
    }

    // Get a boolean configuration entry
    public static boolean getBoolean(String key) {
        String value = prop.getProperty(key);
        try {
            return Boolean.valueOf(value);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }
}

import com.alibaba.druid.pool.DruidDataSourceFactory;import javax.sql.DataSource;import java.io.Serializable;import java.sql.Connection;import java.sql.PreparedStatement;import java.sql.ResultSet;import java.sql.SQLException;import java.util.Properties;/** * 德鲁伊连接池 */public class DataSourceUtil implements Serializable { public static DataSource dataSource = null; static { try { Properties props = new Properties(); props.setProperty("url", ConfigurationManager.getProperty("jdbc.url")); props.setProperty("username", ConfigurationManager.getProperty("jdbc.user")); props.setProperty("password", ConfigurationManager.getProperty("jdbc.password")); props.setProperty("initialSize", "5"); //初始化大小 props.setProperty("maxActive", "20"); //最大连接 props.setProperty("minIdle", "5"); //最小连接 props.setProperty("maxWait", "60000"); //等待时长 props.setProperty("timeBetweenEvictionRunsMillis", "2000");//配置多久进行一次检测,检测需要关闭的连接 单位毫秒 props.setProperty("minEvictableIdleTimeMillis", "600000");//配置连接在连接池中最小生存时间 单位毫秒 props.setProperty("maxEvictableIdleTimeMillis", "900000"); //配置连接在连接池中最大生存时间 单位毫秒 props.setProperty("validationQuery", "select 1"); props.setProperty("testWhileIdle", "true"); props.setProperty("testOnBorrow", "false"); props.setProperty("testOnReturn", "false"); props.setProperty("keepAlive", "true"); props.setProperty("phyMaxUseCount", "100000"); dataSource = DruidDataSourceFactory.createDataSource(props); } catch (Exception e) { e.printStackTrace(); } } //提供获取连接的方法 public static Connection getConnection() throws SQLException { return dataSource.getConnection(); } // 提供关闭资源的方法【connection是归还到连接池】 // 提供关闭资源的方法 【方法重载】3 dql public static void closeResource(ResultSet resultSet, PreparedStatement preparedStatement, Connection connection) { // 关闭结果集 // ctrl+alt+m 将java语句抽取成方法 closeResultSet(resultSet); // 关闭语句执行者 closePrepareStatement(preparedStatement); // 关闭连接 closeConnection(connection); } private static void closeConnection(Connection connection) { if (connection != null) { try { connection.close(); } catch (SQLException e) { e.printStackTrace(); } } } private static void closePrepareStatement(PreparedStatement preparedStatement) { if (preparedStatement != null) { try { preparedStatement.close(); } catch (SQLException e) { e.printStackTrace(); } } } private static void closeResultSet(ResultSet resultSet) { if (resultSet != null) { try { resultSet.close(); } catch (SQLException e) { e.printStackTrace(); } } }}

import java.sql.{Connection, PreparedStatement, ResultSet}trait QueryCallback { def process(rs: ResultSet)}class SqlProxy { private var rs: ResultSet = _ private var psmt: PreparedStatement = _ /** * 执行修改语句 * * @param conn * @param sql * @param params * @return */ def executeUpdate(conn: Connection, sql: String, params: Array[Any]): Int = { var rtn = 0 try { psmt = conn.prepareStatement(sql) if (params != null && params.length > 0) { for (i <- 0 until params.length) { psmt.setObject(i + 1, params(i)) } } rtn = psmt.executeUpdate() } catch { case e: Exception => e.printStackTrace() } rtn } /** * 执行查询语句 * 执行查询语句 * * @param conn * @param sql * @param params * @return */ def executeQuery(conn: Connection, sql: String, params: Array[Any], queryCallback: QueryCallback) = { rs = null try { psmt = conn.prepareStatement(sql) if (params != null && params.length > 0) { for (i <- 0 until params.length) { psmt.setObject(i + 1, params(i)) } } rs = psmt.executeQuery() queryCallback.process(rs) } catch { case e: Exception => e.printStackTrace() } } def shutdown(conn: Connection): Unit = DataSourceUtil.closeResource(rs, psmt, conn)}

Put the database connection configuration file under the resources directory:

jdbc.url=jdbc:mysql://hadoop101:3306/qz_course?useUnicode=true&characterEncoding=utf8&serverTimezone=Asia/Shanghai&useSSL=false
jdbc.user=root
jdbc.password=123456

Modify the Structured Streaming code:

import java.sql.Connectionimport com.atguigu.test.util.{DataSourceUtil, SqlProxy}import org.apache.spark.SparkConfimport org.apache.spark.sql.{ForeachWriter, Row, SparkSession}case class Data(appName: String, value: Int)object WortCount { def main(args: Array[String]): Unit = { val sparkConf = new SparkConf().setAppName("word count").setMaster("local[*]") .set("spark.sql.shuffle.partitions", "10") //和topic 分区数一致 val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate() val df = sparkSession.readStream.format("kafka") .option("kafka.bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092") .option("subscribe", "register_topic") .option("startingOffsets", "earliest") .option("enable.auto.commit", "false") //测试不提交偏移量 .option("maxOffsetsPerTrigger", "3000") .load() import sparkSession.implicits._ val result = df.selectExpr("cast(value as string)").as[String] .filter(item => item.split("\t").length == 3) .mapPartitions((partition: Iterator[String]) => partition.map(item => { val datas = item.split("\t") val app_name = datas(1) match { case "1" => "PC" case "2" => "APP" case _ => "Other" } Data(app_name, 1) })).groupBy("appName").count() val query = result.writeStream.outputMode("update").foreach(new ForeachWriter[Row] { var client: Connection = _ var sqlProxy: SqlProxy = _ override def open(partitionId: Long, epochId: Long): Boolean = { client = DataSourceUtil.getConnection sqlProxy = new SqlProxy true } override def process(value: Row): Unit = { val appName = value.getString(0) val count = value.getLong(1) sqlProxy.executeUpdate(client, "insert into wordcount(appname,`value`) values(?,?) on duplicate key update appname=?,`value`=?", Array(appName, count, appName, count)) } override def close(errorOrNull: Throwable): Unit = { client.close() } }).option("truncate", "false").start() query.awaitTermination() }} 查看效果

4.13 Submitting to YARN with Checkpointing

Remove .setMaster("local[*]") from the code and add the checkpoint option on the output side, as in the sketch below.
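
A minimal sketch of the checkpoint option; the HDFS path is illustrative and result stands for the streaming result from the previous example. On restart, the query recovers its offsets and state from this location:

val query = result.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "hdfs://mycluster/structuredstreaming/checkpoint/wordcount")
  .start()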

Modify pom.xml and mark the Spark dependencies as provided.

<dependencies> <!-- Spark的依赖引入 --> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.12</artifactId> <scope>provided</scope> <version>3.0.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.12</artifactId> <scope>provided</scope> <version>3.0.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-hive_2.12</artifactId> <scope>provided</scope> <version>3.0.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql-kafka-0-10_2.12</artifactId> <scope>provided</scope> <version>3.0.0</version> </dependency> <dependency> <groupId>com.alibaba</groupId> <artifactId>druid</artifactId> <version>1.1.16</version> </dependency> <dependency> <groupId>mysql</groupId> <artifactId>mysql-connector-java</artifactId> <version>5.1.29</version> </dependency></dependencies><build> <plugins> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <version>3.6.1</version> <!-- 所有的编译都依照JDK1.8来搞 --> <configuration> <source>1.8</source> <target>1.8</target> </configuration> </plugin> <plugin> <groupId>org.scala-tools</groupId> <artifactId>maven-scala-plugin</artifactId> <version>2.15.1</version> <executions> <execution> <id>compile-scala</id> <goals> <goal>add-source</goal> <goal>compile</goal> </goals> </execution> <execution> <id>test-compile-scala</id> <goals> <goal>add-source</goal> <goal>testCompile</goal> </goals> </execution> </executions> </plugin> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-assembly-plugin</artifactId> <configuration> <archive> <manifest> </manifest> </archive> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> </descriptorRefs> </configuration> </plugin> <plugin> <groupId>net.alchim31.maven</groupId> <artifactId>scala-maven-plugin</artifactId> <version>3.2.2</version> <executions> <execution> <!-- 声明绑定到maven的compile阶段 --> <goals> <goal>compile</goal> <goal>testCompile</goal> </goals> </execution> </executions> </plugin> <!-- 用于项目的打包插件 --> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-assembly-plugin</artifactId> <version>3.0.0</version> <executions> <execution> <id>make-assembly</id> <phase>package</phase> <goals> <goal>single</goal> </goals> </execution> </executions> </plugin> </plugins></build>

Then package the project and upload the jar to the cluster.

 [root@hadoop103 ~]# spark-submit --master yarn --deploy-mode cluster --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 --driver-memory 1g --num-executors 5 --executor-cores 2 --executor-memory 2g --queue spark --class com.atguigu.test.WortCount spark3test-1.0-SNAPSHOT-jar-with-dependencies.jar

The checkpoint stores progress information that can be used to recover the query when it is started again.

4.14 Top-N Example

Testing shows that Structured Streaming now supports the limit operation (the official documentation still says otherwise), so a top N can be computed by sorting and then applying limit. The output mode must be Complete.

case class Data(appName: String, value: Int)

object WortCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("word count").setMaster("local[*]")
      .set("spark.sql.shuffle.partitions", "10") // same as the number of topic partitions
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    val df = sparkSession.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
      .option("subscribe", "register_topic")
      .option("startingOffsets", "earliest")
      .option("enable.auto.commit", "false") // do not commit offsets during testing
      .option("maxOffsetsPerTrigger", "3000")
      .load()
    import sparkSession.implicits._
    import org.apache.spark.sql.functions._
    val result = df.selectExpr("cast(value as string)").as[String]
      .filter(item => item.split("\t").length == 3)
      .mapPartitions((partition: Iterator[String]) => partition.map(item => {
        val datas = item.split("\t")
        val app_name = datas(1) match {
          case "1" => "PC"
          case "2" => "APP"
          case _ => "Other"
        }
        Data(app_name, 1)
      })).groupBy("appName").count().orderBy(desc("count")).limit(1)
    val query = result.writeStream.outputMode("Complete").format("console").start()
    query.awaitTermination()
  }
}

Chapter 5 Hands-On Project

5.1 Environment Preparation

(1) JDK 1.8, ZooKeeper, Kafka, Hadoop, HBase

(2) Fully distributed HBase installation

In a fully distributed configuration, the cluster contains multiple nodes, each running one or more HBase daemons: a primary Master instance and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.

| Node Name | Master | ZooKeeper | RegionServer |
| hadoop101 | yes | yes | yes |
| hadoop102 | backup | yes | yes |
| hadoop103 | no | yes | yes |

(1) Upload and extract hbase-2.2.4-bin.tar.gz

[root@hadoop101 hadoop]# cd /opt/software/
[root@hadoop101 software]# tar -zxvf hbase-2.2.4-bin.tar.gz -C /opt/module/

(2) Edit conf/regionservers: remove localhost and list the domain name or IP of each host

[root@hadoop101 software]# cd /opt/module/hbase-2.2.4/
[root@hadoop101 hbase-2.2.4]# vim conf/regionservers
hadoop101
hadoop102
hadoop103

(3) Create a file named backup-masters under conf and add hadoop102's domain name to it

[root@hadoop101 hbase-2.2.4]# vim conf/backup-masters
hadoop102

(4) Edit conf/hbase-site.xml

[root@hadoop101 hbase-2.2.4]# cd conf/
[root@hadoop101 conf]# vim hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://mycluster/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>16000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/root/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop101,hadoop102,hadoop103</value>
  </property>
  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>
</configuration>

(5) Edit hbase-env.sh: declare the JDK path and disable HBase's bundled ZooKeeper

[root@hadoop101 conf]# vim hbase-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_211
export HBASE_MANAGES_ZK=false

(6) Copy hdfs-site.xml into HBase's conf directory

[root@hadoop101 conf]# cp /opt/module/hadoop-3.1.3/etc/hadoop/hdfs-site.xml /opt/module/hbase-2.2.4/conf/

(7) Distribute HBase to the other nodes

[root@hadoop101 module]# scp -r hbase-2.2.4/ hadoop102:/opt/module/

[root@hadoop101 module]# scp -r hbase-2.2.4/ hadoop103:/opt/module/

(8) Configure the HBase environment variables

[root@hadoop101 module]# vim /etc/profile
#HBASE_HOME
export HBASE_HOME=/opt/module/hbase-2.2.4
export PATH=$PATH:$HBASE_HOME/bin
[root@hadoop101 module]# source /etc/profile
[root@hadoop102 module]# vim /etc/profile
#HBASE_HOME
export HBASE_HOME=/opt/module/hbase-2.2.4
export PATH=$PATH:$HBASE_HOME/bin
[root@hadoop102 module]# source /etc/profile
[root@hadoop103 module]# vim /etc/profile
#HBASE_HOME
export HBASE_HOME=/opt/module/hbase-2.2.4
export PATH=$PATH:$HBASE_HOME/bin
[root@hadoop103 module]# source /etc/profile

(9) Start HBase

[root@hadoop101 module]# start-hbase.sh

(10) Access the web UI at http://hadoop101:16010

5.2 Data Preparation

5.2.1 Fact Table Data

(1) Order master table data format

{"id":4777,"consignee":"司马震","consignee_tel":"13315310287","total_amount":671.00,"order_status":"1001","user_id":24,"payment_way":null,"delivery_address":"第7大街第3号楼4单元629门","order_comment":"描述168877","out_trade_no":"864124284531429","trade_body":"小米(MI) 小米路由器4 双千兆路由器 无线家用穿墙1200M高速双频wifi 千兆版 千兆端口光纤适用等3件商品","create_time":"2020-10-21 23:10:49","operate_time":null,"expire_time":"2020-10-21 23:25:49","process_status":null,"tracking_no":null,"parent_order_id":null,"img_url":"http://img.gmall.com/169344.jpg","province_id":21,"activity_reduce_amount":0.00,"coupon_reduce_amount":0.00,"original_total_amount":666.00,"feight_fee":5.00,"feight_fee_reduce":null,"refundable_time":null}}

(2) Order detail table data format

{"id":12756,"order_id":4768,"sku_id":14,"sku_name":"Dior迪奥口红唇膏送女友老婆礼物生日礼物 烈艳蓝金999+888两支装礼盒","img_url":"http://kAXllAQEzJWHwiExxVmyJIABfXyzbKwedeofMqwh","order_price":496.00,"sku_num":"3","create_time":"2020-10-21 23:10:49","source_type":"2401","source_id":null,"pay_amount":null,"split_total_amount":null,"split_activity_amount":null,"split_coupon_amount":100.00}

(3) Promotion activity data format

{"id":1624,"order_id":4769,"order_detail_id":12761,"activity_id":1,"activity_rule_id":2,"sku_id":12,"create_time":"2020-10-21 23:10:49"}

(4) Coupon data format

{"id":1325455744978726914,"order_id":4768,"order_detail_id":12756,"coupon_id":3,"coupon_use_id":null,"sku_id":14,"create_time":"2020-10-21 23:10:49"}

5.2.2 Dimension Table Data

(1) Activity rules

{"database":"gmall2020","table":"activity_rule","type":"bootstrap-insert","ts":1604887946,"data":{"id":1,"activity_id":1,"activity_type":"3101","condition_amount":10000.00,"condition_num":null,"benefit_amount":500.00,"benefit_discount":null,"benefit_level":1}}

(2) Activity scope

{"database":"gmall2020","table":"activity_sku","type":"bootstrap-insert","ts":1604887948,"data":{"id":1,"activity_id":1,"sku_id":11,"create_time":null}}

(3) Shopping vouchers

{"database":"gmall2020","table":"coupon_info","type":"bootstrap-insert","ts":1604887950,"data":{"id":1,"coupon_name":"口红品类券","coupon_type":"3201","condition_amount":99.00,"condition_num":null,"activity_id":null,"benefit_amount":30.00,"benefit_discount":null,"create_time":"2020-10-24 01:37:05","range_type":"3301","limit_num":100,"taken_count":0,"start_time":null,"end_time":null,"operate_time":null,"expire_time":null,"range_desc":null}}

5.3 Architecture Diagram

5.4 Create the Topics

(1) Create the ODS-layer topic, which stores all of the raw data

[root@hadoop101 kafka_2.11-2.4.0]# bin/kafka-topics.sh --zookeeper hadoop102:2181/kafka_2.4 --create --replication-factor 2 --partitions 12 --topic order_all

(2) Create the DWD-layer topics for the 4 fact tables

1. Order master table

[root@hadoop101 kafka_2.11-2.4.0]# bin/kafka-topics.sh --zookeeper hadoop102:2181/kafka_2.4 --create --replication-factor 2 --partitions 12 --topic order_main

2. Order details

[root@hadoop101 kafka_2.11-2.4.0]# bin/kafka-topics.sh --zookeeper hadoop102:2181/kafka_2.4 --create --replication-factor 2 --partitions 12 --topic order_details

3. Promotion activities

[root@hadoop101 kafka_2.11-2.4.0]# bin/kafka-topics.sh --zookeeper hadoop102:2181/kafka_2.4 --create --replication-factor 2 --partitions 12 --topic preferential_activities

4. Coupons

[root@hadoop101 kafka_2.11-2.4.0]# bin/kafka-topics.sh --zookeeper hadoop102:2181/kafka_2.4 --create --replication-factor 2 --partitions 12 --topic coupon

5.5 Mock Data Generation

(1) Write the Kafka producer code that generates the mock data

package com.atguigu.producerimport java.text.DecimalFormatimport java.util.{Properties, Random}import com.alibaba.fastjson.JSONimport com.alibaba.fastjson.serializer.SerializerFeatureimport org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}import scala.beans.BeanProperty/** * 订单主表 * * @param id * @param consignee * @param consignee_tel * @param total_amount * @param order_status * @param user_id * @param payment_way * @param delivery_address * @param order_comment * @param out_trade_no * @param trade_body * @param create_time * @param operate_time * @param expire_time * @param process_status * @param tracking_no * @param parent_order_id * @param img_url * @param province_id * @param activity_reduce_amount * @param coupon_reduce_amount * @param original_total_amount * @param feight_fee * @param feight_fee_reduce * @param refundable_time */case class OrderMain(@BeanProperty id: Long, @BeanProperty consignee: String, @BeanProperty consignee_tel: String, @BeanProperty total_amount: String, @BeanProperty order_status: String, @BeanProperty user_id: Long, @BeanProperty payment_way: String, @BeanProperty delivery_address: String, @BeanProperty order_comment: String, @BeanProperty out_trade_no: Long, @BeanProperty trade_body: String, @BeanProperty create_time: Long, @BeanProperty operate_time: String, @BeanProperty expire_time: Long, @BeanProperty process_status: String, @BeanProperty tracking_no: String, @BeanProperty parent_order_id: String, @BeanProperty img_url: String, @BeanProperty province_id: Int, @BeanProperty activity_reduce_amount: String, @BeanProperty coupon_reduce_amount: String, @BeanProperty original_total_amount: String, @BeanProperty feight_fee: String, @BeanProperty feight_fee_reduce: String, @BeanProperty refundable_time: String, @BeanProperty table: String)/** * 订单明细表 * * @param id * @param order_id * @param sku_id * @param sku_name * @param img_url * @param order_price * @param sku_num * @param create_time * @param source_type * @param source_id * @param pay_amount * @param split_total_amount * @param split_activity_amount * @param split_coupon_amount */case class OrderDeatails(@BeanProperty id: Long, @BeanProperty order_id: Long, @BeanProperty sku_id: Int, @BeanProperty sku_name: String, @BeanProperty img_url: String, @BeanProperty order_price: String, @BeanProperty sku_num: Int, @BeanProperty create_time: Long, @BeanProperty source_type: String, @BeanProperty source_id: Int, @BeanProperty pay_amount: String, @BeanProperty split_total_amount: String, @BeanProperty split_activity_amount: String, @BeanProperty split_coupon_amount: String, @BeanProperty table: String)/** * 优惠活动表 * * @param id * @param order_id * @param order_detail_id * @param activity_id * @param activity_rule_id * @param sku_id * @param create_time */case class PreferentialActivities(@BeanProperty id: Long, @BeanProperty order_id: Long, @BeanProperty order_detail_id: Long, @BeanProperty activity_id: Int, @BeanProperty activity_rule_id: Int, @BeanProperty sku_id: Int, @BeanProperty create_time: Long, @BeanProperty table: String)/** * 优惠卷 * * @param id * @param order_id * @param order_detail_id * @param coupon_id * @param coupon_use_id * @param sku_id * @param create_time */case class Coupon(@BeanProperty id: Long, @BeanProperty order_id: Long, @BeanProperty order_detail_id: Long, @BeanProperty coupon_id: Int, @BeanProperty coupon_use_id: Int, @BeanProperty sku_id: Int, @BeanProperty create_time: Long, @BeanProperty table: String)/** * 活动规则维度表 * * @param database * @param table * @param `type` * 
@param data * @param ts */case class ActivityRulesDim(@BeanProperty database: String, @BeanProperty table: String, @BeanProperty `type`: String, @BeanProperty data: RuleData, @BeanProperty ts: Long)case class RuleData(@BeanProperty id: Int, @BeanProperty activity_id: Int, @BeanProperty activity_type: Int, @BeanProperty condition_amount: Double, @BeanProperty condition_num: Int, @BeanProperty benefit_amount: Double, @BeanProperty benefit_discount: Int, @BeanProperty benefit_level: Int)/** * 活动范围维度表 * * @param database * @param table * @param `type` * @param data * @param ts */case class ActivityRangeDim(@BeanProperty database: String, @BeanProperty table: String, @BeanProperty `type`: String, @BeanProperty data: RangeData, @BeanProperty ts: Long)case class RangeData(@BeanProperty id: Int, @BeanProperty sku_id: Int, @BeanProperty create_time: Long)/** * 购物券维度表 * * @param database * @param table * @param `type` * @param data * @param ts */case class ShoppingVoucherDim(@BeanProperty database: String, @BeanProperty table: String, `type`: String, @BeanProperty data: VoucherData, @BeanProperty ts: Long)case class VoucherData(@BeanProperty id: Int, @BeanProperty coupon_name: String, @BeanProperty coupon_type: Int, @BeanProperty condition_amount: Double, @BeanProperty condition_num: Int, @BeanProperty activity_id: Int, @BeanProperty benefit_amount: Double, @BeanProperty benefit_discount: Int, @BeanProperty create_time: Long, @BeanProperty range_type: Long, @BeanProperty limit_num: Int, @BeanProperty taken_count: Int, @BeanProperty start_time: Long, @BeanProperty end_time: Long)object OdsDataProducer { def main(args: Array[String]): Unit = { val props = new Properties props.put("bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092") props.put("acks", "-1") props.put("buffer.memory", "5000000") props.put("max.block.ms", "300000") props.put("compression.type", "snappy") props.put("linger.ms", "50") props.put("retries", Integer.MAX_VALUE.toString) props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer") props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer") try { val producer = new KafkaProducer[String, String](props) /** * ---------------------------------------------维度表数据---------------------------------------------- * 添加活动规则信息表数据 */ for (i <- 0 until 1000) { val model = generateActivityRulesDim(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() /** * 添加活动范围数据 */ for (i <- 0 until 800000) { val model = generateActivityRangeDim(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() /** * 添加购物券数 */ for (i <- 0 until 1000) { val model = generateShoppingVoucherDim(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() /** -------------------------------------------事实表数据--------------------------------------- * 添加商品信息表 */ for (i <- 0 until 3000000) { val model = generateOrderMainLog(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() /** * 添加商品明细表 */ for (i <- 0 until 5000000) { val model = generateOrderDetails(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() /** * 
添加优惠活动数据 */ for (i <- 0 until 2000000) { val model = generateOrderDetailActivity(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() /** * 添加优惠卷事实表数据 */ for (i <- 0 until 3000000) { val model = generateOrderDetailCoupon(i) producer.send(new ProducerRecord[String, String]("order_all", JSON.toJSONString(model, SerializerFeature.QuoteFieldNames))) } producer.flush() } catch { case e: Exception => println(e.getCause.getMessage) } } def generateOrderMainLog(id: Int): OrderMain = { val df = new DecimalFormat("0.00") val random = new Random() val consignee = "用户" + id; val consignee_tel = "13315310287" val toal_amount = df.format(random.nextDouble() * 1000) val order_status = "1001" val user_id = random.nextInt(10000) val payment_way = null val delivery_address = "第7大街第" + id + "号楼4单元629门" val order_comment = "描述" + id val out_trade_no = random.nextLong() val trade_body = "小米(MI) 小米路由器4 双千兆路由器 无线家用穿墙1200M高速双频wifi 千兆版 千兆端口光纤适用等3件商品" val create_time = System.currentTimeMillis() val operate_time = null; val expire_time = System.currentTimeMillis() val process_status = null val tracking_no = null; val parent_order_id = null val img_url = "http://img.gmall.com/169344.jpg" val province_id = random.nextInt(100) val activity_reduce_amount = "0.00" val coupon_reduce_amount = "0.00" val original_total_amount = "10" val feight_fee = "10" val feight_fee_reduce = null val refundable_time = null new OrderMain(id, consignee, consignee_tel, toal_amount, order_status, user_id, payment_way, delivery_address, order_comment , out_trade_no, tracking_no, create_time, operate_time, expire_time, process_status, tracking_no, parent_order_id, img_url, province_id, activity_reduce_amount, coupon_reduce_amount, original_total_amount, feight_fee, feight_fee_reduce, refundable_time, "OrderMain") } /** * 生成800万条明细 * * @param id * @return */ def generateOrderDetails(id: Int): OrderDeatails = { val df = new DecimalFormat("0.00") val random = new Random() val order_id = random.nextInt(3000000) val sku_id = random.nextInt(800000) //80万条sku val sku_name = "Dior迪奥口红唇膏送女友老婆礼物生日礼物 烈艳蓝金999+888两支装礼盒" val img_url = "http://kAXllAQEzJWHwiExxVmyJIABfXyzbKwedeofMqwh" val order_price = "496.00" val sku_num = 3 val createtime = System.currentTimeMillis() val source_type = "00000" val source_id = 0 val pay_amount = null val split_total_amount = null val split_activity_amount = null val split_coupon_amount = "100.00" new OrderDeatails(id, order_id, sku_id, sku_name, img_url, order_price, sku_num, createtime, source_type, source_id, pay_amount, split_total_amount, split_activity_amount, split_coupon_amount, "OrderDeatail") } /** * 生成200万条优惠活动数据 * * @param id */ def generateOrderDetailActivity(id: Int) = { val random = new Random() val order_id = random.nextInt(3000000) val order_detail_id = 0 val activity_id = random.nextInt(1000) // 1000条活动 val actiity_rule_id = random.nextInt(1000) //1000条活动规则 val sku_id = 0 val create_time = System.currentTimeMillis() new PreferentialActivities(id, order_id, order_detail_id, actiity_rule_id, actiity_rule_id, sku_id, create_time, "PreferentialActivities") } /** * 生成300万订单优惠卷 * * @param id */ def generateOrderDetailCoupon(id: Int) = { val random = new Random() val order_id = random.nextInt(3000000) val order_detail_id = 12756 val coupon_id = random.nextInt(1000) //1000条优惠卷 val coupon_use_id = 0 val sku_id = 0 val createtime = System.currentTimeMillis() new Coupon(id, order_id, order_detail_id, coupon_id, coupon_use_id, 
sku_id, createtime, "Coupon") } /** * 生成活动规则 * * @param id * @return */ def generateActivityRulesDim(id: Int): ActivityRulesDim = { val database = "gamll2020" val table = "activity_rule" val `type` = "bootstrap-insert" val ts = System.currentTimeMillis() val activity_id = 1 val activity_type = 3101 val condition_amount = 10000.00 val condition_num = 0 val benefit_amount = 500.00 val benefit_discount = 0 val benefit_level = 1 new ActivityRulesDim(database, table, `type`, new RuleData(id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level), ts) } /** * 生成活动规则 * * @param id * @return */ def generateActivityRangeDim(id: Int) = { val database = "gamll2020" val table = "activity_sku" val `type` = "bootstrap-insert" val ts = System.currentTimeMillis() val activity_id = 1 val sku_id = id val create_time = System.currentTimeMillis() new ActivityRangeDim(database, table, `type`, new RangeData(id, sku_id, create_time), ts) } def generateShoppingVoucherDim(id: Int) = { val database = "gamll2020" val table = "coupon_info" val `type` = "bootstrap-insert" val ts = System.currentTimeMillis() val coupon_name = "口红品类卷" + id val coupon_type = 3201 val condition_amount = 99.00 val condition_num = 0 val activity_id = 0 val benefit_amount = 30.00 val benefit_discount = 0 val create_time = System.currentTimeMillis() val range_type = 3301 val limit_num = 100 val taken_count = 0 val start_time = create_time val end_time = create_time new ShoppingVoucherDim(database, table, `type`, new VoucherData(id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time), ts) }}

5.6 Create the HBase Tables

(1) Enter the HBase shell

[root@hadoop103 ~]# hbase shell

(2) Create the namespace

hbase(main):001:0> create_namespace 'orders'

(3) Create the dimension tables

hbase(main):001:0> create 'orders:dwd_activity_rule',{NAME=>'info',VERSIONS => '3', TTL => 'FOREVER'},{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
hbase(main):002:0> create 'orders:dwd_activity_sku',{NAME=>'info',VERSIONS => '3', TTL => 'FOREVER'},{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
hbase(main):003:0> create 'orders:dwd_coupon_info',{NAME=>'info',VERSIONS => '3', TTL => 'FOREVER'},{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

(4) Create the wide table

hbase(main):004:0> create 'orders:dim_order_details',{NAME=>'info',VERSIONS => '3', TTL => 'FOREVER'},{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

5.7 Write the DWD-Layer Logic

(1) Write a hash utility class

package com.atguigu.util

import java.math.BigInteger
import java.security.MessageDigest

object Utils {
  /**
   * MD5-hash a string.
   */
  def generateHash(input: String): String = {
    try {
      if (input == null) return null
      val md = MessageDigest.getInstance("MD5")
      md.update(input.getBytes())
      val digest = md.digest()
      val bi = new BigInteger(1, digest)
      var hashText = bi.toString(16)
      while (hashText.length() < 32) {
        hashText = "0" + hashText
      }
      hashText
    } catch {
      case e: Exception =>
        e.printStackTrace()
        null
    }
  }
}

(2) Write the DWD logic

package com.atguigu.dwd.stream

import java.util.Properties

import com.alibaba.fastjson.{JSON, JSONObject}
import com.atguigu.util.Utils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{ForeachWriter, SparkSession}

object DwdStructuredStream {

  val groupid = "orders_dwd_stream_groupid"

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("dwdOrderStream")
      .set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    import sparkSession.implicits._

    // Read the raw change stream from the order_all topic
    val dStream = sparkSession.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
      .option("subscribe", "order_all")
      .option("startingOffsets", "earliest")
      .option("kafka.group.id", groupid)
      .option("maxOffsetsPerTrigger", "4800")
      .load()
      .selectExpr("cast(value as string)").as[String]

    dStream.writeStream.foreach(new ForeachWriter[String] {

      var hbaseConfig: Configuration = _
      var connection: Connection = _
      var activityRuleTable: Table = _
      var activitySkuTable: Table = _
      var couponInfoTable: Table = _

      val props = new Properties
      props.put("bootstrap.servers", "hadoop101:9092,hadoop102:9092,hadoop103:9092")
      props.put("acks", "-1")
      props.put("buffer.memory", "5000000")
      props.put("max.block.ms", "300000")
      props.put("compression.type", "snappy")
      props.put("linger.ms", "50")
      props.put("retries", Integer.MAX_VALUE.toString)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      var producer: KafkaProducer[String, String] = _

      // Open the HBase and Kafka connections once per partition/epoch
      override def open(partitionId: Long, epochId: Long): Boolean = {
        hbaseConfig = HBaseConfiguration.create()
        hbaseConfig.set("hbase.zookeeper.property.clientPort", "2181")
        hbaseConfig.set("hbase.zookeeper.quorum", "hadoop101,hadoop102,hadoop103")
        connection = ConnectionFactory.createConnection(hbaseConfig)
        activityRuleTable = connection.getTable(TableName.valueOf("orders:dwd_activity_rule"))
        activitySkuTable = connection.getTable(TableName.valueOf("orders:dwd_activity_sku"))
        couponInfoTable = connection.getTable(TableName.valueOf("orders:dwd_coupon_info"))
        producer = new KafkaProducer[String, String](props)
        true
      }

      // Route dimension-table changes to HBase and fact-table changes to per-table Kafka topics
      override def process(value: String): Unit = {
        val jsonObject = JSON.parseObject(value)
        val table = jsonObject.getString("table")
        table match {
          case "activity_rule" => upserDwdData(jsonObject, activityRuleTable)
          case "activity_sku" => upserDwdData(jsonObject, activitySkuTable)
          case "coupon_info" => upserDwdData(jsonObject, couponInfoTable)
          case "OrderMain" => producer.send(new ProducerRecord[String, String]("order_main", jsonObject.toJSONString))
          case "OrderDeatail" => producer.send(new ProducerRecord[String, String]("order_details", jsonObject.toJSONString))
          case "PreferentialActivities" => producer.send(new ProducerRecord[String, String]("preferential_activities", jsonObject.toJSONString))
          case "Coupon" => producer.send(new ProducerRecord[String, String]("coupon", jsonObject.toJSONString))
          case _ => ""
        }
      }

      override def close(errorOrNull: Throwable): Unit = {
        producer.flush()
        producer.close()
        activityRuleTable.close()
        activitySkuTable.close()
        couponInfoTable.close()
        connection.close()
      }
    }).option("checkpointLocation", "hdfs://mycluster/structuredstreaming/checkpoint/" + groupid)
      .start().awaitTermination()

    // Upsert one dimension record into HBase; rowkey = hash prefix + "_" + id
    def upserDwdData(jsonObject: JSONObject, hbaseTable: Table) = {
      val ts = jsonObject.getLong("ts")
      val data = jsonObject.getJSONObject("data")
      val id = data.getIntValue("id")
      val rowkey = Utils.generateHash(String.valueOf(id)).substring(0, 5) + "_" + id
      val put = new Put(Bytes.toBytes(rowkey))
      val keySet = data.keySet().toArray()
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("ts"), Bytes.toBytes(ts))
      for (key <- keySet) {
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes(key.toString), Bytes.toBytes(data.getString(key.toString)))
      }
      hbaseTable.put(put)
    }
  }
}
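Both the DWD job above and the DIM job below build their HBase rowkeys through Utils.generateHash, whose implementation is not included in this document. A minimal sketch, assuming generateHash is simply a hex-encoded MD5 digest whose leading characters serve as a salting prefix:

package com.atguigu.util

import java.security.MessageDigest

object Utils {
  // Assumption: generateHash returns the hex MD5 digest of the input string;
  // the callers prepend its first 5-6 characters to the id so that rowkeys
  // are spread evenly across HBase regions instead of hot-spotting.
  def generateHash(input: String): String = {
    MessageDigest.getInstance("MD5")
      .digest(input.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
  }
}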

5.8 Implement the DIM layer logic

package com.atguigu.dwd.stream

import java.util

import com.alibaba.fastjson.serializer.SerializerFeature
import com.alibaba.fastjson.{JSON, JSONObject}
import com.atguigu.producer.{Coupon, OrderDeatails, OrderMain, PreferentialActivities}
import com.atguigu.util.Utils
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{Dataset, SparkSession}

import scala.beans.BeanProperty

// spark-submit --master yarn --deploy-mode client --driver-memory 1g --num-executors 3 --executor-cores 3 --executor-memory 4g --queue spark --class com.atguigu.dwd.stream.DimStructuredStream Structured-Streaming-1.0-SNAPSHOT-jar-with-dependencies.jar
object DimStructuredStream {

  val orderMaingGroupid = "orders_dim_stream_groupid_orderMain"
  val orderDetalGroupid = "orders_dim_stream_groupid_orderDetail"
  val preferentialGroupid = "orders_dim_stream_groupid_preferential"
  val couponGroupid = "orders_dim_stream_groupid_coupon"
  val bootStrapServers = "hadoop101:9092,hadoop102:9092,hadoop103:9092"

  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf().setAppName("dimOrderStream")
      //.setMaster("local[*]")
      .set("spark.sql.shuffle.partitions", "12")
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    val sparkContext = sparkSession.sparkContext
    import org.apache.spark.sql.functions._
    import sparkSession.implicits._

    // Four fact streams, one per Kafka topic produced by the DWD job; each one
    // gets an event-time column and a 2-hour watermark.
    val orderMainDStream = getDStream(bootStrapServers, "order_main", "earliest", orderMaingGroupid, "4800", sparkSession)
      .as[String]
      .map(item => JSON.parseObject[OrderMain](item, classOf[OrderMain]))
      .withColumn("create_time", to_timestamp(from_unixtime(col("create_time") / 1000)))
      .withColumnRenamed("create_time", "orderMain_create_time")
      .withColumnRenamed("operate_time", "orderMain_operte_time")
      .withColumnRenamed("table", "orderMain_table")
      .withColumnRenamed("img_url", "orderMain_img_url")
      .withWatermark("orderMain_create_time", "2 hours")

    val orderDetailDStream = getDStream(bootStrapServers, "order_details", "earliest", orderDetalGroupid, "4800", sparkSession)
      .as[String]
      .map(item => JSON.parseObject[OrderDeatails](item, classOf[OrderDeatails]))
      .withColumn("create_time", to_timestamp(from_unixtime(col("create_time") / 1000)))
      .withColumnRenamed("id", "orderDetail_id")
      .withColumnRenamed("order_id", "orderDetail_order_id")
      .withColumnRenamed("order_price", "orderDetail_order_price")
      .withColumnRenamed("create_time", "orderDetail_create_time")
      .withColumnRenamed("table", "orderDetail_table")
      .withColumnRenamed("sku_id", "orderDetail_sku_id")
      .withColumnRenamed("img_url", "orderDetail_img_url")
      .withWatermark("orderDetail_create_time", "2 hours")

    val preferentialDStream = getDStream(bootStrapServers, "preferential_activities", "earliest", preferentialGroupid, "4800", sparkSession)
      .as[String]
      .map(item => JSON.parseObject[PreferentialActivities](item, classOf[PreferentialActivities]))
      .withColumn("create_time", to_timestamp(from_unixtime(col("create_time") / 1000)))
      .withColumnRenamed("id", "preferential_id")
      .withColumnRenamed("order_id", "preferential_order_id")
      .withColumnRenamed("create_time", "preferential_create_time")
      .withColumnRenamed("table", "preferential_table")
      .withWatermark("preferential_create_time", "2 hours")

    val couponDStream = getDStream(bootStrapServers, "coupon", "earliest", couponGroupid, "4800", sparkSession)
      .as[String]
      .map(item => JSON.parseObject[Coupon](item, classOf[Coupon]))
      .withColumn("create_time", to_timestamp(from_unixtime(col("create_time") / 1000)))
      .withColumnRenamed("id", "copon_id")
      .withColumnRenamed("order_id", "copon_order_id")
      .withColumnRenamed("create_time", "copon_create_time")
      .withColumnRenamed("table", "copon_table")
      .drop("sku_id").drop("order_detail_id")
      .withWatermark("copon_create_time", "2 hours")

    // Join the fact streams: each join keys on the order id plus an event-time
    // range condition so Spark can bound the join state on both sides.
    val resultDStream = orderMainDStream.join(orderDetailDStream,
        orderMainDStream("id") === orderDetailDStream("orderDetail_order_id")
          && orderMainDStream("orderMain_create_time") <= orderDetailDStream("orderDetail_create_time")
          && (orderMainDStream("orderMain_create_time") + expr("interval 1 hour")) >= orderDetailDStream("orderDetail_create_time"),
        "left")
      .join(preferentialDStream,
        orderMainDStream("id") === preferentialDStream("preferential_order_id")
          && orderMainDStream("orderMain_create_time") <= preferentialDStream("preferential_create_time")
          && (orderMainDStream("orderMain_create_time") + expr("interval 1 hour")) > preferentialDStream("preferential_create_time"),
        "left")
      .join(couponDStream,
        orderMainDStream("id") === couponDStream("copon_order_id")
          && orderMainDStream("orderMain_create_time") <= couponDStream("copon_create_time")
          && (orderMainDStream("orderMain_create_time") + expr("interval 1 hour")) > couponDStream("copon_create_time"),
        "left")
      .as[OrderDetails]

    // Enrich with the dimension tables, batch by batch
    resultDStream.writeStream
      .foreachBatch { (batchDF: Dataset[OrderDetails], batchId: Long) =>
        // Cache the micro-batch: it is traversed once for the dimension lookup
        // and once more for the HBase write.
        batchDF.cache()
        // Declare three HashMaps and broadcast them to hold the dimension-table data
        val brRuleDataMap = sparkContext.broadcast(new util.HashMap[String, String]())
        val brSkuRuleDataMap = sparkContext.broadcast(new util.HashMap[String, String])
        val brCoponDataMap = sparkContext.broadcast(new util.HashMap[String, String])
        // Look up the dimension-table data
        getDimemsionData(batchDF, brRuleDataMap, brSkuRuleDataMap, brCoponDataMap)
        // Build the wide-table rows and write them to HBase
        putDimData(batchDF, brRuleDataMap, brSkuRuleDataMap, brCoponDataMap)
        batchDF.unpersist()
        brCoponDataMap.unpersist()
        brRuleDataMap.unpersist()
        brSkuRuleDataMap.unpersist()
      }
      .option("checkpointLocation", "hdfs://mycluster/structuredstreaming/checkpoint/dimorders")
      .start().awaitTermination()
  }

  // Common Kafka source builder
  def getDStream(bootStrapServers: String, subscribe: String, startingOffsets: String,
                 groupid: String, maxOffsetsPerTrigger: String, sparkSession: SparkSession) = {
    sparkSession.readStream.format("kafka")
      .option("kafka.bootstrap.servers", bootStrapServers)
      .option("subscribe", subscribe)
      .option("startingOffsets", startingOffsets)
      .option("kafka.group.id", groupid)
      .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
      .load()
      .selectExpr("cast (value as string)")
  }

  case class OrderDetails(@BeanProperty id: String, @BeanProperty consignee: String,
                          @BeanProperty consignee_tel: String, @BeanProperty total_amount: String,
                          @BeanProperty order_status: String, @BeanProperty user_id: String,
                          @BeanProperty payment_way: String, @BeanProperty delivery_address: String,
                          @BeanProperty order_comment: String, @BeanProperty out_trade_no: String,
                          @BeanProperty trade_body: String, @BeanProperty orderMain_create_time: String,
                          @BeanProperty orderMain_operte_time: String, @BeanProperty expire_time: String,
                          @BeanProperty process_status: String, @BeanProperty tracking_no: String,
                          @BeanProperty parent_order_id: String, @BeanProperty orderMain_img_url: String,
                          @BeanProperty province_id: String, @BeanProperty activity_reduce_amount: String,
                          @BeanProperty coupon_reduce_amount: String, @BeanProperty original_total_amount: String,
                          @BeanProperty feight_fee: String, @BeanProperty feight_fee_reduce: String,
                          @BeanProperty refundable_time: String, @BeanProperty orderMain_table: String,
                          @BeanProperty orderDetail_id: String, @BeanProperty orderDetail_order_id: String,
                          @BeanProperty orderDetail_sku_id: String, @BeanProperty sku_name: String,
                          @BeanProperty orderDetail_img_url: String, @BeanProperty orderDetail_order_price: String,
                          @BeanProperty sku_num: String, @BeanProperty orderDetail_create_time: String,
                          @BeanProperty source_type: String, @BeanProperty source_id: String,
                          @BeanProperty pay_amount: String, @BeanProperty split_total_amount: String,
                          @BeanProperty split_activity_amount: String, @BeanProperty split_coupon_amount: String,
                          @BeanProperty orderDetail_table: String, @BeanProperty preferential_id: String,
                          @BeanProperty preferential_order_id: String, @BeanProperty order_detail_id: String,
                          @BeanProperty activity_id: String, @BeanProperty activity_rule_id: String,
                          @BeanProperty sku_id: String, @BeanProperty preferential_create_time: String,
                          @BeanProperty preferential_table: String, @BeanProperty copon_id: String,
                          @BeanProperty copon_order_id: String, @BeanProperty coupon_id: String,
                          @BeanProperty coupon_use_id: String, @BeanProperty copon_create_time: String,
                          @BeanProperty copon_table: String)

  // Build the HBase Get requests for one wide-table record
  def addActivityRuleGets(jsonObject: JSONObject, activityRuleGets: util.ArrayList[Get],
                          activityRuleSkuGets: util.ArrayList[Get], couponInfoTableGets: util.ArrayList[Get]): Unit = {
    // Request the dimension-table rows by id
    val activity_id = jsonObject.getString("activity_id")
    val sku_id = jsonObject.getString("sku_id")
    val copon_id = jsonObject.getString("copon_id")
    val activityGet = new Get(Bytes.toBytes(Utils.generateHash(activity_id).substring(0, 5) + "_" + activity_id))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("activity_type"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("condition_amount"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("benefit_amount"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("benefit_discount"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("benefit_level"))
    val skuGet = new Get(Bytes.toBytes(Utils.generateHash(sku_id).substring(0, 5) + "_" + sku_id))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("sku_id"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("create_time"))
    val coponGet = new Get(Bytes.toBytes(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("coupon_name"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("coupon_type"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("condition_amount"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("benefit_amount"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("benefit_discount"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("create_time"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("range_type"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("limit_num"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("taken_count"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("start_time"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("end_time"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("operate_time"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("expire_time"))
      .addColumn(Bytes.toBytes("info"), Bytes.toBytes("range_desc"))
    activityRuleGets.add(activityGet)
    activityRuleSkuGets.add(skuGet)
    couponInfoTableGets.add(coponGet)
  }

  /**
   * Look up the dimension-table data in HBase and cache it in the broadcast maps.
   *
   * @param batchDF
   * @param brRuleDataMap
   * @param brSkuRuleDataMap
   * @param brCoponDataMap
   */
  def getDimemsionData(batchDF: Dataset[OrderDetails],
                       brRuleDataMap: Broadcast[util.HashMap[String, String]],
                       brSkuRuleDataMap: Broadcast[util.HashMap[String, String]],
                       brCoponDataMap: Broadcast[util.HashMap[String, String]]) = {
    batchDF.foreachPartition((partitions: Iterator[OrderDetails]) => {
      // Batch-query HBase: three lists to hold the Get requests
      val activityRuleGets = new util.ArrayList[Get]()
      val activityRuleSkuGets = new util.ArrayList[Get]()
      val couponInfoTableGets = new util.ArrayList[Get]()
      import org.apache.hadoop.hbase.HBaseConfiguration
      import org.apache.hadoop.hbase.client.ConnectionFactory
      val hbaseConfig = HBaseConfiguration.create
      hbaseConfig.set("hbase.zookeeper.property.clientPort", "2181")
      hbaseConfig.set("hbase.zookeeper.quorum", "hadoop101,hadoop102,hadoop103")
      val connection = ConnectionFactory.createConnection(hbaseConfig)
      val activityRuleTable: Table = connection.getTable(TableName.valueOf("orders:dwd_activity_rule"))
      val activitySkuTable: Table = connection.getTable(TableName.valueOf("orders:dwd_activity_sku"))
      val couponInfoTable: Table = connection.getTable(TableName.valueOf("orders:dwd_coupon_info"))
      // Iterate over the partition and add one Get per record to each list
      partitions.foreach(item => {
        val jsonObject: JSONObject = JSON.parseObject(JSON.toJSONString(item, SerializerFeature.QuoteFieldNames))
        addActivityRuleGets(jsonObject, activityRuleGets, activityRuleSkuGets, couponInfoTableGets)
      })
      val ruleResult = activityRuleTable.get(activityRuleGets)
      val skuResult = activitySkuTable.get(activityRuleSkuGets)
      val coponResult = couponInfoTable.get(couponInfoTableGets)
      // Put the batch query results into the broadcast maps
      for (result <- ruleResult) {
        val cells = result.rawCells()
        for (cell <- cells) {
          brRuleDataMap.value.put(Bytes.toString(CellUtil.cloneRow(cell)) + "_rule_" + Bytes.toString(CellUtil.cloneQualifier(cell)),
            Bytes.toString(CellUtil.cloneValue(cell)))
        }
      }
      for (result <- skuResult) {
        val cells = result.rawCells()
        for (cell <- cells) {
          brSkuRuleDataMap.value.put(Bytes.toString(CellUtil.cloneRow(cell)) + "_sku_" + Bytes.toString(CellUtil.cloneQualifier(cell)),
            Bytes.toString(CellUtil.cloneValue(cell)))
        }
      }
      for (result <- coponResult) {
        val cells = result.rawCells()
        for (cell <- cells) {
          brCoponDataMap.value.put(Bytes.toString(CellUtil.cloneRow(cell)) + "_copon_" + Bytes.toString(CellUtil.cloneQualifier(cell)),
            Bytes.toString(CellUtil.cloneValue(cell)))
        }
      }
      activityRuleTable.close()
      activitySkuTable.close()
      couponInfoTable.close()
      connection.close()
    })
  }

  /**
   * Join each record with the dimension data and write the wide-table rows to HBase.
   *
   * @param batchDF
   * @param brRuleDataMap
   * @param brSkuRuleDataMap
   * @param brCoponDataMap
   */
  def putDimData(batchDF: Dataset[OrderDetails],
                 brRuleDataMap: Broadcast[util.HashMap[String, String]],
                 brSkuRuleDataMap: Broadcast[util.HashMap[String, String]],
                 brCoponDataMap: Broadcast[util.HashMap[String, String]]) = {
    batchDF.foreachPartition((partitions: Iterator[OrderDetails]) => {
      import org.apache.hadoop.hbase.HBaseConfiguration
      import org.apache.hadoop.hbase.client.ConnectionFactory
      val hbaseConfig = HBaseConfiguration.create
      hbaseConfig.set("hbase.zookeeper.property.clientPort", "2181")
      hbaseConfig.set("hbase.zookeeper.quorum", "hadoop101,hadoop102,hadoop103")
      val connection = ConnectionFactory.createConnection(hbaseConfig)
      val dimOrderDetailsTable = connection.getTable(TableName.valueOf("orders:dim_order_details"))
      val putList = new util.ArrayList[Put]()
      // Join each record with the dimension tables
      partitions.foreach(item => {
        val jsonObject: JSONObject = JSON.parseObject(JSON.toJSONString(item, SerializerFeature.QuoteFieldNames))
        println(jsonObject.toString)
        val activity_id = jsonObject.getString("activity_id")
        val sku_id = jsonObject.getString("sku_id")
        val copon_id = jsonObject.getString("copon_id")
        // With the dimension ids, fetch the dimension values from the broadcast maps
        val ruledataMap = brRuleDataMap.value
        val skudataMap = brSkuRuleDataMap.value
        val copondataMap = brCoponDataMap.value
        val activity_type = ruledataMap.getOrDefault(Utils.generateHash(activity_id).substring(0, 5) + "_" + activity_id + "_rule_activity_type", "")
        val condition_amount = ruledataMap.getOrDefault(Utils.generateHash(activity_id).substring(0, 5) + "_" + activity_id + "_rule_condition_amount", "")
        val benefit_amount = ruledataMap.getOrDefault(Utils.generateHash(activity_id).substring(0, 5) + "_" + activity_id + "_rule_benefit_amount", "")
        val benefit_discount = ruledataMap.getOrDefault(Utils.generateHash(activity_id).substring(0, 5) + "_" + activity_id + "_rule_benefit_discount", "")
        val benefit_level = ruledataMap.getOrDefault(Utils.generateHash(activity_id).substring(0, 5) + "_" + activity_id + "_rule_benefit_level", "")
        val sku_createtime = skudataMap.getOrDefault(Utils.generateHash(sku_id).substring(0, 5) + "_" + sku_id + "_sku_create_time", "")
        val coupon_name = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_coupon_name", "")
        val coupon_type = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_coupon_type", "")
        val coponCondition_amount = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_condition_amount", "")
        val coponCenefit_amount = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_benefit_amount", "")
        val coponBenefit_discount = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_benefit_discount", "")
        val coponCreate_time = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_create_time", "")
        val range_type = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_range_type", "")
        val limit_num = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_limit_num", "")
        val taken_count = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_taken_count", "")
        val start_time = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_start_time", "")
        val end_time = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_end_time", "")
        val operate_time = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_operate_time", "")
        val expire_time = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_expire_time", "")
        val range_desc = copondataMap.getOrDefault(Utils.generateHash(copon_id).substring(0, 5) + "_" + copon_id + "_copon_range_desc", "")
        // Add the dimension values to the wide-table jsonObject
        jsonObject.put("activity_type", activity_type)
        jsonObject.put("condition_amount", condition_amount)
        jsonObject.put("benefit_amount", benefit_amount)
        jsonObject.put("benefit_discount", benefit_discount)
        jsonObject.put("benefit_level", benefit_level)
        jsonObject.put("sku_createtime", sku_createtime)
        jsonObject.put("coupon_name", coupon_name)
        jsonObject.put("coupon_type", coupon_type)
        jsonObject.put("coponCondition_amount", coponCondition_amount)
        jsonObject.put("coponBenefit_amount", coponCenefit_amount)
        jsonObject.put("coponBenefit_discount", coponBenefit_discount)
        jsonObject.put("coponCreate_time", coponCreate_time)
        jsonObject.put("range_type", range_type)
        jsonObject.put("limit_num", limit_num)
        jsonObject.put("taken_count", taken_count)
        jsonObject.put("start_time", start_time)
        jsonObject.put("end_time", end_time)
        jsonObject.put("operate_time", operate_time)
        jsonObject.put("expire_time", expire_time)
        jsonObject.put("range_desc", range_desc)
        // Wide-table rowkey: hash(order main id).substring(0, 6) + "_" + order main id + "_" + order detail id
        val rowkey = Utils.generateHash(jsonObject.getString("id")).substring(0, 6) + "_" + jsonObject.getString("id") + "_" + jsonObject.getString("orderDetail_id")
        val put = new Put(Bytes.toBytes(rowkey))
        val keySet = jsonObject.keySet().toArray
        for (key <- keySet) {
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes(key.toString), Bytes.toBytes(jsonObject.getString(key.toString)))
        }
        putList.add(put)
        jsonObject.clear()
      })
      dimOrderDetailsTable.put(putList)
      putList.clear()
      dimOrderDetailsTable.close()
      connection.close()
    })
  }
}
