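Flink Windows

This article introduces the core concepts of Flink window computation in stream processing, including the window lifecycle, assigners, functions, triggers, evictors, and event time. It focuses on Keyed Windows and Non-Keyed Windows and on the use of Window Join and Interval Join. It also covers Flink's EventTime processing and watermark mechanism, and the role of CEP patterns in complex event processing.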

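Flink Window Computation / Stream Processing

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html

Overview

Window computation is at the core of stream processing. Windows split an unbounded data stream along the time axis into finite data sets (buckets), and the computation is then applied to each bucket. Flink divides window computation into two broad categories, depending on whether the stream is keyed.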

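  • Keyed Windows
stream
       .keyBy(...)               <-  groups the data by key
       .window(...)              <-  required: "assigner", decides how elements are assigned to windows
      [.trigger(...)]            <-  optional: "trigger", every window has a default trigger; it decides when the window fires
      [.evictor(...)]            <-  optional: "evictor", removes elements from the window before or after the aggregation
      [.allowedLateness(...)]    <-  optional: "lateness", no lateness is allowed by default; sets how late window data may arrive (EventTime)
      [.sideOutputLateData(...)] <-  optional: "output tag", sends data that arrives too late to a dedicated stream via a side output
       .reduce/aggregate/fold/apply() <-  required: "function", performs the window aggregation
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the data that arrived too late
  • Non-Keyed Windows
stream
       .windowAll(...)           <-  required: "assigner", decides how elements are assigned to windows
      [.trigger(...)]            <-  optional: "trigger", every window has a default trigger; it decides when the window fires
      [.evictor(...)]            <-  optional: "evictor", removes elements from the window before or after the aggregation
      [.allowedLateness(...)]    <-  optional: "lateness", no lateness is allowed by default; sets how late window data may arrive (EventTime)
      [.sideOutputLateData(...)] <-  optional: "output tag", sends data that arrives too late to a dedicated stream via a side output
       .reduce/aggregate/fold/apply() <-  required: "function", performs the window aggregation
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the data that arrived too late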

Window Lifecycle

In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows.

In addition, each window will have a Trigger (see Triggers) and a function (ProcessWindowFunction, ReduceFunction, AggregateFunction or FoldFunction) (see Window Functions) attached to it. The function will contain the computation to be applied to the contents of the window, while the Trigger specifies the conditions under which the window is considered ready for the function to be applied. A triggering policy might be something like “when the number of elements in the window is more than 4”, or “when the watermark passes the end of the window”. A trigger can also decide to purge a window’s contents any time between its creation and removal. Purging in this case only refers to the elements in the window, and not the window metadata. This means that new data can still be added to that window.

Apart from the above, you can specify an Evictor (see Evictors) which will be able to remove elements from the window after the trigger fires and before and/or after the function is applied.
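
The removal rule described above can be sketched in a few lines of plain Scala. This is an illustration under stated assumptions, not Flink's own code: `WindowSpan`, `cleanupTime`, and `isTooLate` are hypothetical names, and the rule shown (last timestamp of the window plus the allowed lateness, compared against the current watermark) is how event-time cleanup is usually described.

```scala
// Pure-Scala sketch with no Flink dependency; all names are hypothetical.
object WindowLifecycleSketch {
  // A time window covering [start, end), in milliseconds.
  final case class WindowSpan(start: Long, end: Long) {
    // Largest timestamp that still belongs to the window.
    def maxTimestamp: Long = end - 1
  }

  // Point in time after which the window and its state may be removed.
  def cleanupTime(w: WindowSpan, allowedLatenessMs: Long): Long =
    w.maxTimestamp + allowedLatenessMs

  // An element for window w is dropped (or sent to the late side output)
  // once the watermark has reached the cleanup time.
  def isTooLate(w: WindowSpan, allowedLatenessMs: Long, watermark: Long): Boolean =
    cleanupTime(w, allowedLatenessMs) <= watermark

  def main(args: Array[String]): Unit = {
    val w = WindowSpan(0L, 5000L)
    println(cleanupTime(w, 2000L))      // 6999
    println(isTooLate(w, 2000L, 6000L)) // false: still within the allowed lateness
    println(isTooLate(w, 2000L, 7000L)) // true: the window has been removed
  }
}
```

With an allowed lateness of 0 the window is removed as soon as the watermark passes its end, which matches the default of not accepting late data.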

Window Assigners

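Tumbling Windows (time-based)

The window size equals the slide, so windows never overlap.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// Word count over 5-second tumbling processing-time windows
fsEnv.socketTextStream("HadoopNode00",9999)
.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()

fsEnv.execute("FlinkWordCountsTumblingWindow_ReduceFunction")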
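
To make the "no overlap" property concrete, the following plain-Scala sketch computes which tumbling window a timestamp falls into. It does not use Flink; `TumblingSketch`, `windowStart`, and `windowFor` are hypothetical names, and the arithmetic assumes non-negative millisecond timestamps with windows aligned to the epoch plus an optional offset.

```scala
// Pure-Scala sketch with no Flink dependency; all names are hypothetical.
object TumblingSketch {
  // Start of the tumbling window that contains `timestamp`.
  def windowStart(timestamp: Long, sizeMs: Long, offsetMs: Long = 0L): Long =
    timestamp - (timestamp - offsetMs + sizeMs) % sizeMs

  // The single window [start, end) an element is assigned to.
  def windowFor(timestamp: Long, sizeMs: Long): (Long, Long) = {
    val start = windowStart(timestamp, sizeMs)
    (start, start + sizeMs)
  }

  def main(args: Array[String]): Unit = {
    println(windowFor(4999L, 5000L)) // (0,5000)
    println(windowFor(5000L, 5000L)) // (5000,10000)
    println(windowFor(7200L, 5000L)) // (5000,10000)
  }
}
```

Every timestamp maps to exactly one window, which is why a tumbling window needs only a size and no slide.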

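Sliding Windows (time-based)

The window size is normally greater than or equal to the slide. If the slide is larger than the size, elements that fall into the gaps between windows are lost; if the size is larger than the slide, consecutive windows overlap.

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// Word count over 4-second windows that slide every 2 seconds
fsEnv.socketTextStream("HadoopNode00",9999)
.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
.aggregate(new AggregateFunction[(String,Int),(String,Int),(String,Int)]{
    // The accumulator has the same shape as the input: (word, count)
    override def createAccumulator(): (String, Int) = ("",0)

    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
        (value._1,value._2+accumulator._2)
    }

    override def getResult(accumulator: (String, Int)): (String, Int) = {
        accumulator
    }

    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        (a._1,a._2+b._2)
    }
})
.print()

fsEnv.execute("FlinkWordCountsSlidingWindow_AggregateFunction")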
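
The relationship between size and slide can be checked with a small plain-Scala sketch that lists every sliding window a timestamp belongs to. It is an illustration only: `SlidingSketch` and `windowsFor` are hypothetical names, there is no Flink dependency, and non-negative millisecond timestamps with no offset are assumed.

```scala
// Pure-Scala sketch with no Flink dependency; all names are hypothetical.
object SlidingSketch {
  // All windows [start, end) of length sizeMs, starting every slideMs,
  // that contain `timestamp`. Newest window first.
  def windowsFor(timestamp: Long, sizeMs: Long, slideMs: Long): List[(Long, Long)] = {
    // Start of the most recent window that begins at or before the timestamp.
    val lastStart = timestamp - (timestamp + slideMs) % slideMs
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(_ > timestamp - sizeMs)
      .map(start => (start, start + sizeMs))
      .toList
  }

  def main(args: Array[String]): Unit = {
    // size 4 s, slide 2 s: the element lands in two overlapping windows
    println(windowsFor(5000L, 4000L, 2000L)) // List((4000,8000), (2000,6000))
    // size 2 s, slide 4 s: an element in the gap belongs to no window
    println(windowsFor(3000L, 2000L, 4000L)) // List()
  }
}
```

The second call shows the data loss mentioned above: with a slide larger than the size, timestamps between the end of one window and the start of the next are assigned to nothing.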

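Session Windows (time-based)

Every element initially creates its own window. If the gap between two windows is smaller than the specified window gap, the system merges them. Unlike the two previous window types, session windows are mergeable and have no fixed length.

import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// A session closes after 5 seconds without a new element for the key
fsEnv.socketTextStream("HadoopNode00",9999)
.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(t=>t._1)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
.apply(new WindowFunction[(String,Int),String,String,TimeWindow]{
    override def apply(key: String, window: TimeWindow, input: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
        // Print the session boundaries followed by all buffered elements
        val start = window.getStart
        val end = window.getEnd
        val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        out.collect(sdf.format(start)+" ~ "+sdf.format(end)+"\t"+input.mkString(","))
    }
})
.print()

fsEnv.execute("FlinkWordCountsSessionWindow_WindowFunction")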
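
The merging behaviour can be illustrated without Flink. In the plain-Scala sketch below each timestamp opens a window [t, t + gap), and windows that overlap are merged into one session. `SessionSketch` and `sessions` are hypothetical names. The sketch also merges windows that merely touch, which is an assumption about the boundary case rather than something stated in this article.

```scala
// Pure-Scala sketch with no Flink dependency; all names are hypothetical.
object SessionSketch {
  // Groups timestamps (ms) of one key into sessions [start, end).
  def sessions(timestamps: Seq[Long], gapMs: Long): List[(Long, Long)] =
    timestamps.sorted.foldLeft(List.empty[(Long, Long)]) {
      // The new element's window overlaps (or touches) the current session: extend it.
      case ((start, end) :: rest, t) if t <= end =>
        (start, math.max(end, t + gapMs)) :: rest
      // Otherwise the element starts a new session.
      case (acc, t) =>
        (t, t + gapMs) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    // 0 and 3000 are less than 5 s apart and form one session; 9000 starts another.
    println(sessions(Seq(0L, 3000L, 9000L), 5000L)) // List((0,8000), (9000,14000))
  }
}
```

The resulting sessions have different lengths (8 s and 5 s here), which is what "no fixed length" means in practice.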

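Global Windows (not time-based)

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// GlobalWindows puts all elements of a key into one window that never ends,
// so a trigger is required; here the window fires on every 2nd element per key
fsEnv.socketTextStream("HadoopNode00",9999)
.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(t=>t._1)
.window(GlobalWindows.create())
.trigger(CountTrigger.of(2))
.fold(("",0))((z,v)=>(v._1,z._2+v._2))
.print()

fsEnv.execute("FlinkWordCountsGlobalWindow_FoldFunction")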
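
One detail of the job above is easy to miss: a count trigger that fires without purging leaves the window contents in place, so the emitted counts keep growing (2, 4, 6, ...) instead of restarting at 2. The plain-Scala simulation below contrasts the two behaviours. It is a sketch under that assumption about `CountTrigger`; `CountTriggerSketch`, `run`, and the `purgeOnFire` flag are hypothetical names, not Flink API.

```scala
import scala.collection.mutable

// Pure-Scala simulation with no Flink dependency; all names are hypothetical.
object CountTriggerSketch {
  // Feeds words through a per-key counting window and returns what is emitted.
  def run(words: Seq[String], fireEvery: Int, purgeOnFire: Boolean): List[(String, Int)] = {
    val folded = mutable.Map.empty[String, Int]    // fold result kept per key
    val sinceFire = mutable.Map.empty[String, Int] // trigger counter per key
    val out = List.newBuilder[(String, Int)]
    for (w <- words) {
      folded(w) = folded.getOrElse(w, 0) + 1
      sinceFire(w) = sinceFire.getOrElse(w, 0) + 1
      if (sinceFire(w) >= fireEvery) {
        out += ((w, folded(w)))           // the window fires
        sinceFire(w) = 0                  // the trigger counter restarts
        if (purgeOnFire) folded.remove(w) // window contents are dropped only when purging
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("a", "a", "a", "a")
    println(run(words, 2, purgeOnFire = false)) // List((a,2), (a,4))
    println(run(words, 2, purgeOnFire = true))  // List((a,2), (a,2))
  }
}
```

If per-batch counts are wanted rather than running totals, the trigger has to purge on fire; wrapping the trigger in `PurgingTrigger` is the usual way to get that in Flink.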

Window Functions

After defining the window assigner, we need to specify the computation that we want to perform on each of these windows. This is the responsibility of the window function, which is used to process the elements of each (possibly keyed) window once the system determines that a window is ready for processing (see triggers for how Flink determines when a window is ready).

ReduceFunction

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }

AggregateFunction

class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  // The accumulator is (sum, count)
  override def createAccumulator() = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double before dividing so the average is not truncated
  override def getResult(accumulator: (Long, Long)): Double =
    accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate)
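
The accumulator protocol (create, add, merge, get result) can be exercised without a cluster. The plain-Scala sketch below mirrors the four methods as ordinary functions and checks that merging two partial accumulators gives the same average as adding every element to one accumulator. `AverageSketch` is a hypothetical name and the sample values are made up for illustration. The division converts to `Double` first, because dividing two `Long` values would truncate the result.

```scala
// Pure-Scala sketch with no Flink dependency; all names are hypothetical.
object AverageSketch {
  type Acc = (Long, Long) // (sum, count)

  def createAccumulator(): Acc = (0L, 0L)

  def add(value: (String, Long), acc: Acc): Acc =
    (acc._1 + value._2, acc._2 + 1L)

  def merge(a: Acc, b: Acc): Acc =
    (a._1 + b._1, a._2 + b._2)

  def getResult(acc: Acc): Double =
    acc._1.toDouble / acc._2

  // Folds a batch of elements into a fresh accumulator.
  def accumulate(values: Seq[(String, Long)]): Acc =
    values.foldLeft(createAccumulator())((acc, v) => add(v, acc))

  def main(args: Array[String]): Unit = {
    val values = Seq(("k", 1L), ("k", 2L), ("k", 3L), ("k", 6L))
    val (left, right) = values.splitAt(2)
    println(getResult(accumulate(values)))                         // 3.0
    println(getResult(merge(accumulate(left), accumulate(right)))) // 3.0
    println(getResult((7L, 2L)))                                   // 3.5, not 3
  }
}
```

Because only the small (sum, count) pair is kept per window, this style of aggregation does not need to buffer the window's elements.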

FoldFunction

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .fold("") { (acc, v) => acc + v._2 }

fold() cannot be used with session windows.

WindowFunction (Legacy)

In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and does not have some advanced features, such as per-window keyed state. This interface will be deprecated at some point.

trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .apply(<