0. Preface
- This article belongs to the Spark Streaming series.
- It introduces sliding-window operations in stream processing and how to optimize them.
- Main methods: `window`, `reduceByKeyAndWindow`
1. Window Operations

| Transformation | Meaning |
|---|---|
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
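For orientation, here is a hedged sketch of how a few of these look in practice; it assumes the `iDS: InputDStream[String]` and the 10-second batch interval defined in the code shell of section 2 below:

```scala
// countByWindow and countByValueAndWindow aggregate incrementally under the
// hood (inverse reduce), so a checkpoint directory is required for them.
ssc.checkpoint("checkpoint")

// Total number of elements seen in the last 20 s, emitted every 10 s
val total: DStream[Long] = iDS.countByWindow(Seconds(20), Seconds(10))

// Per-value frequencies over the same window, e.g. word frequencies
val freq: DStream[(String, Long)] = iDS.countByValueAndWindow(Seconds(20), Seconds(10))

// One aggregated value per window; the function must be associative and commutative
val chars: DStream[Int] = iDS.map(_.length).reduceByWindow(_ + _, Seconds(20), Seconds(10))
```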
2. Code Shell (copy and adapt)
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object Hello {
  def main(args: Array[String]): Unit = {
    // Create the SparkContext and the StreamingContext (10 s batch interval)
    val c0: SparkConf = new SparkConf().setAppName("a0").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(c0)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
    // Create an RDD queue and wrap it in a QueueInputDStream
    val rddQueue: mutable.Queue[RDD[String]] = new mutable.Queue[RDD[String]]()
    val iDS: InputDStream[String] = ssc.queueStream(rddQueue, oneAtATime = false)
    //============================ sliding window ===============================
    // paste one of the snippets from 2.1 / 2.2 / 2.3 here to define dS
    //===========================================================================
    // Print the results
    dS.print()
    // Start the job
    ssc.start()
    // Feed the queue from stdin; this loop blocks forever,
    // so awaitTermination below is never actually reached
    while (true) {
      rddQueue += sc.makeRDD(scala.io.StdIn.readLine.split(" "))
    }
    // Wait for the computation to terminate
    ssc.awaitTermination()
  }
}
```
2.1 window
```scala
// 20 s window; the slide interval defaults to the batch interval (10 s)
val wDS: DStream[String] = iDS.window(Seconds(20))
val dS: DStream[(String, Int)] = wDS.map((_, 1)).reduceByKey(_ + _)
```
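If you want the slide interval to differ from the batch interval, there is a two-argument overload; a minimal variant of the line above with the slide written out explicitly:

```scala
// Same 20 s window, slide interval passed explicitly instead of defaulting
val wDS2: DStream[String] = iDS.window(Seconds(20), Seconds(10))
```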
2.2 reduceByKeyAndWindow
```scala
val dS: DStream[(String, Int)] = iDS.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(20), Seconds(10))
```
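Note that both durations must be integer multiples of the batch interval of the source DStream (10 s in the shell above); otherwise Spark rejects the window when the DStream graph is validated. A sketch of another valid configuration:

```scala
// 30 s window sliding every 20 s: both are multiples of the 10 s batch interval.
// Something like Seconds(15) for either argument would fail DStream validation.
val wide: DStream[(String, Int)] = iDS.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(20))
```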
2.3 Optimizing reduceByKeyAndWindow
- Optimization point: introduce `invReduceFunc`, short for "inverse reduce function".
- Contract: `invReduceFunc(reduceFunc(x, y), x) = y`, i.e. it undoes one application of the reduce.
- Idea: avoid recomputation. Instead of re-reducing every batch in the window on each slide, Spark reduces in the new batch entering the window and inverse-reduces out the old batch leaving it (see the toy sketch after this list).
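A toy sketch (plain Scala, no Spark) of why this saves work when the window slides from batches [b1, b2] to [b2, b3]:

```scala
val reduceFunc    = (a: Int, b: Int) => a + b
val invReduceFunc = (a: Int, b: Int) => a - b

val (b1, b2, b3) = (3, 4, 5)              // per-batch partial sums
val oldWindow = reduceFunc(b1, b2)        // 7: full computation of the old window
// Incremental update: reduce in the batch entering, inverse-reduce out the batch leaving
val newWindow = invReduceFunc(reduceFunc(oldWindow, b3), b1) // (7 + 5) - 3 = 9
assert(newWindow == reduceFunc(b2, b3))   // matches a full recompute
```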
```scala
// A checkpoint directory must be set before the invReduceFunc version
// of reduceByKeyAndWindow can run
ssc.checkpoint("checkpoint")
// reduceByKeyAndWindow with invReduceFunc
val dS: DStream[(String, Int)] = iDS.map((_, 1)).reduceByKeyAndWindow(
  reduceFunc = (a: Int, b: Int) => a + b,
  invReduceFunc = (a: Int, b: Int) => a - b,
  windowDuration = Seconds(20),
  slideDuration = Seconds(10)
)
```
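One caveat with the incremental version: a key whose windowed count has dropped to 0 lingers in the window state and keeps appearing in the output until it is filtered out. The same overload also accepts `numPartitions` and a `filterFunc` for exactly this; a hedged sketch:

```scala
// Drop keys once their windowed count falls to 0, so stale words stop printing
val dS2: DStream[(String, Int)] = iDS.map((_, 1)).reduceByKeyAndWindow(
  reduceFunc = (a: Int, b: Int) => a + b,
  invReduceFunc = (a: Int, b: Int) => a - b,
  windowDuration = Seconds(20),
  slideDuration = Seconds(10),
  numPartitions = 2,
  filterFunc = (kv: (String, Int)) => kv._2 > 0
)
```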
2.4 Printed results of the three variants above

All three variants should print the same per-window word counts (a 20 s window sliding every 10 s). The one visible difference: the invReduceFunc version keeps keys whose count has dropped to 0 in its output until a filterFunc removes them, as shown in the sketch at the end of 2.3.
3. Source Code
3.1 window
```scala
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}
```
Returns a `WindowedDStream`, a subclass of `DStream`.
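For intuition, `WindowedDStream` materializes each window by slicing the parent DStream's RDDs that fall inside the window interval and unioning them; a paraphrased sketch of its `compute` method (shape, not a verbatim copy of the Spark source):

```scala
// Paraphrased from WindowedDStream in the Spark source
override def compute(validTime: Time): Option[RDD[T]] = {
  // all parent RDDs whose time falls inside (validTime - windowDuration, validTime]
  val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
  val rddsInWindow = parent.slice(currentWindow)
  Some(ssc.sc.union(rddsInWindow))
}
```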
3.2 reduceByKeyAndWindow
The overload corresponding to 2.2 above:
```scala
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}
```
The overload corresponding to 2.3 above adds the `invReduceFunc` and `filterFunc` parameters. Note the contrast: the plain version above reduces each batch, windows the partial results, and re-reduces across the whole window on every slide, whereas this one updates the previous window's result incrementally:
```scala
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {
  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}
```
Returns a `ReducedWindowedDStream`, which maintains the window state incrementally across slides; this is why a checkpoint directory must be set in 2.3.