Spark Streaming provides window operations, which let you apply transformations over a sliding window of data, as shown in the figure below:

As the figure shows, the window slides over the source DStream at each step. Two parameters are involved:

1. window length: the duration of the window
2. sliding interval: the interval at which the window operation is performed

Both are expressed in units of the batch interval, so each of these two parameters must be an integer multiple of the batch interval.
Example:

```scala
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```
The window length is 30 seconds and the window slides every 10 seconds. In plain terms: every 10 seconds, aggregate the data from the last 30 seconds. A fuller, runnable version of this snippet is sketched below.
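To make the snippet above self-contained, here is a minimal runnable sketch. The socket source on localhost:9999, the local[2] master, and the 10-second batch interval are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    // Batch interval of 10 seconds; the window length (30s) and sliding
    // interval (10s) below are both integer multiples of it, as required.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed source: words arriving as text lines on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Every 10 seconds, sum the per-word counts over the last 30 seconds.
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```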
Below are some commonly used window transformations:
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength,slideInterval) | Return a sliding window count of elements in the stream. |
reduceByWindow(func, windowLength,slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above, where each window's reduce value is computed incrementally from the previous window's value: new data entering the window is reduced with func, and old data leaving the window is "inverse reduced" with invFunc. Note that checkpointing must be enabled to use this operation. See the sketch after this table. |
countByValueAndWindow(windowLength,slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow , the number of reduce tasks is configurable through an optional argument. |
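For the incremental reduceByKeyAndWindow referenced above, a minimal sketch follows, reusing the pairs DStream and the ssc context from the earlier example; the checkpoint directory is an illustrative choice.

```scala
// Checkpointing is required for the invFunc variant, since window state
// must be recoverable. The directory below is only an example.
ssc.checkpoint("/tmp/streaming-checkpoint")

val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce counts of batches entering the window
  (a: Int, b: Int) => a - b, // "inverse reduce" counts of batches leaving it
  Seconds(30),
  Seconds(10)
)
incrementalCounts.print()
```

Because only the entering and leaving batches are processed on each slide, this variant avoids re-reducing all 30 seconds of data every 10 seconds.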