Spark Streaming provides window operations, which let you apply transformations over a sliding window of data, as shown in the figure below:

As the figure shows, the window slides over the source DStream at each step. Two parameters are involved:

1. window length: the duration of the window
2. sliding interval: the interval at which the window operation is performed

Both are expressed in units of the batch interval, so each of these two parameters must be an integer multiple of the batch interval.
Example:

```scala
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```
The window length is 30 seconds and the window slides every 10 seconds. In plain terms: every 10 seconds, aggregate the data from the last 30 seconds. A fuller, runnable version of this snippet is sketched below.
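To make the snippet above self-contained, here is a minimal runnable sketch. The socket source on localhost:9999, the local[2] master, and the 10-second batch interval are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
    // Batch interval of 10 seconds; the window length (30s) and sliding
    // interval (10s) below are both integer multiples of it, as required.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed source: words arriving as text lines on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Every 10 seconds, sum the per-word counts over the last 30 seconds.
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```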
Below are some commonly used window transformations:
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength,slideInterval) | Return a sliding window count of elements in the stream. |
reduceByWindow(func, windowLength,slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above, where each window's reduce value is computed incrementally from the previous window's value: new data entering the window is reduced with func, and old data leaving the window is "inverse reduced" with invFunc. Note that checkpointing must be enabled to use this operation. See the sketch after this table. |
countByValueAndWindow(windowLength,slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow , the number of reduce tasks is configurable through an optional argument. |
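For the incremental reduceByKeyAndWindow referenced above, a minimal sketch follows, reusing the pairs DStream and the ssc context from the earlier example; the checkpoint directory is an illustrative choice.

```scala
// Checkpointing is required for the invFunc variant, since window state
// must be recoverable. The directory below is only an example.
ssc.checkpoint("/tmp/streaming-checkpoint")

val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce counts of batches entering the window
  (a: Int, b: Int) => a - b, // "inverse reduce" counts of batches leaving it
  Seconds(30),
  Seconds(10)
)
incrementalCounts.print()
```

Because only the entering and leaving batches are processed on each slide, this variant avoids re-reducing all 30 seconds of data every 10 seconds.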