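0. Preface
- This article is a sub-chapter of the Spark Streaming series.
- It introduces sliding-window operations in stream processing and how to optimize them.
- Main methods: window and reduceByKeyAndWindow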
1. Window Operations
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. |
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
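To make windowLength and slideInterval concrete, here is a minimal plain-Scala sketch that does not use Spark at all. It models a stream as a sequence of batches and assumes a 10-second batch interval, so a 20-second window covers 2 batches and a 10-second slide advances by 1 batch. `WindowSketch` and its method names are hypothetical, and unlike Spark the sketch only emits full windows.

```scala
object WindowSketch {
  // Each inner Seq stands for the records of one 10-second batch (assumption).
  type Batch = Seq[String]

  // Group batches into windows of `windowBatches` batches,
  // advancing by `slideBatches` batches each time (full windows only).
  def windows(batches: Seq[Batch], windowBatches: Int, slideBatches: Int): List[Seq[String]] =
    batches
      .sliding(windowBatches, slideBatches)
      .filter(_.size == windowBatches)
      .map(_.flatten)
      .toList

  // Rough analogue of countByWindow: number of records in each window.
  def countByWindow(batches: Seq[Batch], windowBatches: Int, slideBatches: Int): List[Int] =
    windows(batches, windowBatches, slideBatches).map(_.size)

  // Rough analogue of countByValueAndWindow: frequency of each value per window.
  def countByValueAndWindow(
      batches: Seq[Batch],
      windowBatches: Int,
      slideBatches: Int
  ): List[Map[String, Int]] =
    windows(batches, windowBatches, slideBatches)
      .map(w => w.groupBy(identity).map { case (k, vs) => k -> vs.size })
}
```

For the batches `Seq("a", "b")`, `Seq("a")`, `Seq("c", "a")` with a window of 2 batches and a slide of 1 batch, the sketch yields two windows, each holding 3 records, and "a" has a frequency of 2 in both.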
2. Code Skeleton (copy and adapt)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
object Hello {
  def main(args: Array[String]): Unit = {
    // Create the SparkContext and the StreamingContext (10-second batch interval)
    val c0: SparkConf = new SparkConf().setAppName("a0").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(c0)
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
    // Create an RDD queue and use it as the source of a queue-backed input DStream
    val rddQueue: mutable.Queue[RDD[String]] = new mutable.Queue[RDD[String]]()
    val iDS: InputDStream[String] = ssc.queueStream(rddQueue, oneAtATime = false)
    //================================ Window ====================================
    // Paste one of the snippets from 2.1 to 2.3 here; each of them defines dS
    //===========================================================================
    // Print the result
    dS.print()
    // Start the job
    ssc.start()
    // Read input in a loop: each line typed on stdin becomes one RDD of words
    while (true) {
      val words = scala.io.StdIn.readLine.split(" ")
      // The streaming scheduler reads the queue from another thread, so guard the update
      rddQueue.synchronized { rddQueue += sc.makeRDD(words) }
    }
    // Wait for the computation to terminate (not reached here, because the loop above never exits)
    ssc.awaitTermination()
  }
}
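2.1. window
// Collect the last 20 seconds of data; the slide interval defaults to the batch interval (10 seconds here)
val wDS: DStream[String] = iDS.window(Seconds(20))
// Count each word within the window
val dS: DStream[(String, Int)] = wDS.map((_, 1)).reduceByKey(_ + _)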
2.2. reduceByKeyAndWindow
// Count each word over a 20-second window that slides every 10 seconds
val dS: DStream[(String, Int)] = iDS.map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => (a + b), Seconds(20), Seconds(10))
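2.3. Optimizing reduceByKeyAndWindow
- Optimization: introduce invReduceFunc
- Full name: inverse reduce function
- Principle: invReduceFunc(reduceFunc(x, y), x) = y
- Idea: avoid repeated computation. When the window slides, the previous window's result is reused: the batches that entered the window are reduced in, and the batches that left it are removed with invReduceFunc, instead of re-reducing every batch in the window.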
// Checkpointing must be enabled before the invReduceFunc variant of reduceByKeyAndWindow can run
ssc.checkpoint("checkpoint")
// The invReduceFunc variant of reduceByKeyAndWindow
val dS: DStream[(String, Int)] = iDS.map((_, 1)).reduceByKeyAndWindow(
  reduceFunc = (a: Int, b: Int) => (a + b),
  invReduceFunc = (a: Int, b: Int) => (a - b),
  windowDuration = Seconds(20),
  slideDuration = Seconds(10)
)
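The incremental update that invReduceFunc makes possible can be sketched in plain Scala, without Spark. This is an illustration under assumptions rather than Spark's actual implementation: `IncrementalWindowSketch` is a hypothetical name, each batch is assumed to be already reduced to a single Int total, and the window slides by one batch at a time.

```scala
object IncrementalWindowSketch {
  // Baseline: re-reduce every batch total inside each window.
  def fullRecompute(batchTotals: Seq[Int], windowBatches: Int): List[Int] = {
    require(windowBatches > 0, "window must cover at least one batch")
    batchTotals
      .sliding(windowBatches)
      .filter(_.size == windowBatches)
      .map(_.reduce(_ + _))
      .toList
  }

  // Incremental: reuse the previous window's result, reduce in the batch
  // that entered the window and "inverse reduce" the batch that left it.
  def incremental(
      batchTotals: Seq[Int],
      windowBatches: Int,
      reduceFunc: (Int, Int) => Int,
      invReduceFunc: (Int, Int) => Int
  ): List[Int] = {
    require(windowBatches > 0, "window must cover at least one batch")
    if (batchTotals.size < windowBatches) Nil
    else {
      val first = batchTotals.take(windowBatches).reduce(reduceFunc)
      val entering = batchTotals.drop(windowBatches)
      // zip pairs each entering batch with the batch that leaves at the same step
      entering.zip(batchTotals).scanLeft(first) { case (prev, (in, out)) =>
        invReduceFunc(reduceFunc(prev, in), out)
      }.toList
    }
  }
}
```

With batch totals 1, 2, 3, 4, 5 and a window of 3 batches, both methods give 6, 9, 12, but the incremental version performs a constant amount of work per slide regardless of the window length. This only works when reduceFunc has a true inverse, as addition and subtraction do.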
2.4. Printed Results of the Three Variants Above
3. Source Code
3.1. window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}
Returns a WindowedDStream, which extends DStream.
3.2. reduceByKeyAndWindow
Corresponds to 2.2 above:
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
    .window(windowDuration, slideDuration)
    .reduceByKey(reduceFunc, partitioner)
}
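The source above reduces each batch first, then windows, then reduces again. A small plain-Scala sketch (no Spark; `TwoStageReduceSketch` and its methods are hypothetical names) suggests why this gives the same result as the 2.1 approach of windowing the raw records and reducing once, provided the reduce function is associative and commutative:

```scala
object TwoStageReduceSketch {
  // One batch of (key, value) records.
  type Batch = Seq[(String, Int)]

  // Local stand-in for reduceByKey on a plain collection.
  def reduceByKey(records: Seq[(String, Int)], f: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // 2.1 style: gather all raw records of the window, then reduce once.
  def windowThenReduce(window: Seq[Batch], f: (Int, Int) => Int): Map[String, Int] =
    reduceByKey(window.flatten, f)

  // 3.2 style: reduce each batch, gather the partial results, reduce again.
  def reduceWindowReduce(window: Seq[Batch], f: (Int, Int) => Int): Map[String, Int] =
    reduceByKey(window.flatMap(batch => reduceByKey(batch, f).toSeq), f)
}
```

The practical difference is in how much data the window has to carry: in the second style each batch contributes at most one record per key, which is usually far fewer than the raw records.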
Corresponds to 2.3 above, with two extra parameters, invReduceFunc and filterFunc:
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {
  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}
Returns a ReducedWindowedDStream.
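To see what filterFunc is for, the following plain-Scala sketch models one slide step of the incremental update for per-key counts. It is an illustration under assumptions, not Spark's implementation: `FilterFuncSketch` and `slide` are hypothetical names, and each batch is assumed to be already reduced to a Map. The optional filter mirrors the `Some`/`None` handling of `cleanedFilterFunc` in the source above.

```scala
object FilterFuncSketch {
  type Counts = Map[String, Int]

  // One slide step: reduce in the entering batch, inverse-reduce the leaving
  // batch, then keep only the pairs accepted by the optional filter.
  def slide(
      prev: Counts,
      entering: Counts,
      leaving: Counts,
      reduceFunc: (Int, Int) => Int,
      invReduceFunc: (Int, Int) => Int,
      filterFunc: Option[((String, Int)) => Boolean]
  ): Counts = {
    val added = entering.foldLeft(prev) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(reduceFunc(_, v)).getOrElse(v))
    }
    val removed = leaving.foldLeft(added) { case (acc, (k, v)) =>
      acc.get(k).map(old => acc.updated(k, invReduceFunc(old, v))).getOrElse(acc)
    }
    filterFunc.fold(removed)(f => removed.filter(f))
  }
}
```

Starting from `Map("a" -> 2, "b" -> 1)`, with `Map("a" -> 1)` entering and `Map("a" -> 1, "b" -> 1)` leaving, the unfiltered result still contains `"b" -> 0`, while a filter of `_._2 > 0` leaves only `"a" -> 2`. In a word-count job, passing such a filter is a common way to keep expired keys from accumulating in the window state.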