Flink Windows

深夜的星星

License: CC 4.0 BY-SA

Tags: flink, big data

Article link: https://blog.youkuaiyun.com/DataJunGe/article/details/105359658

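This article introduces the core concepts of Flink window computation in stream processing: the window lifecycle, assigners, functions, triggers, evictors, and event time. It focuses on Keyed Windows and Non-Keyed Windows and on the use of Window Join and Interval Join. It also covers Flink's EventTime handling and watermark mechanism, and the role of CEP patterns in complex event processing.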

Flink Window Computation / Stream Processing

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html

Overview

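Windowing is at the heart of stream processing. A window slices an unbounded data stream along the time axis into finite data sets (buckets), and the computation is then applied to each bucket. Flink divides window computation into two broad categories, depending on whether the stream is keyed.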

Keyed Windows

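stream
       .keyBy(...)               <-  groups the stream by key
       .window(...)              <-  required: "assigner", determines how elements are assigned to windows
      [.trigger(...)]            <-  optional: "trigger", decides when a window fires (every window has a default trigger)
      [.evictor(...)]            <-  optional: "evictor", removes elements from the window before and/or after the aggregation
      [.allowedLateness(...)]    <-  optional: "lateness", how late elements may arrive (event time); no lateness is allowed by default
      [.sideOutputLateData(...)] <-  optional: "output tag", routes elements that arrive too late to a side output stream
       .reduce/aggregate/fold/apply() <-  required: "function", performs the window aggregation
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the elements that arrived too late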

Non-Keyed Windows

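stream
       .windowAll(...)           <-  required: "assigner", determines how elements are assigned to windows
      [.trigger(...)]            <-  optional: "trigger", decides when a window fires (every window has a default trigger)
      [.evictor(...)]            <-  optional: "evictor", removes elements from the window before and/or after the aggregation
      [.allowedLateness(...)]    <-  optional: "lateness", how late elements may arrive (event time); no lateness is allowed by default
      [.sideOutputLateData(...)] <-  optional: "output tag", routes elements that arrive too late to a side output stream
       .reduce/aggregate/fold/apply() <-  required: "function", performs the window aggregation
      [.getSideOutput(...)]      <-  optional: "output tag", retrieves the elements that arrived too late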

Window Lifecycle

In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows.

In addition, each window will have a Trigger (see Triggers) and a function (ProcessWindowFunction, ReduceFunction, AggregateFunction or FoldFunction) (see Window Functions) attached to it. The function will contain the computation to be applied to the contents of the window, while the Trigger specifies the conditions under which the window is considered ready for the function to be applied. A triggering policy might be something like “when the number of elements in the window is more than 4”, or “when the watermark passes the end of the window”. A trigger can also decide to purge a window’s contents any time between its creation and removal. Purging in this case only refers to the elements in the window, and not the window metadata. This means that new data can still be added to that window.

Apart from the above, you can specify an Evictor (see Evictors) which will be able to remove elements from the window after the trigger fires and before and/or after the function is applied.
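
The removal rule above (end timestamp plus allowed lateness) can be sketched in plain Scala. This is only a model of the rule, not Flink code: the names Window, cleanupTime and isExpired are made up for illustration, and treating the last valid timestamp as end - 1 is an assumption about how the half-open interval [start, end) is handled.

```scala
// Plain-Scala model of the window lifecycle rule; not Flink code.
object WindowLifecycleSketch {
  // A time window covering [start, end), in epoch milliseconds.
  final case class Window(start: Long, end: Long) {
    // Largest timestamp that still belongs to the window (assumed convention).
    def maxTimestamp: Long = end - 1
  }

  // Time at which the window and its state may be dropped.
  def cleanupTime(w: Window, allowedLatenessMs: Long): Long =
    w.maxTimestamp + allowedLatenessMs

  // True once the current time (watermark or processing time) has reached the cleanup time.
  def isExpired(w: Window, allowedLatenessMs: Long, currentTime: Long): Boolean =
    cleanupTime(w, allowedLatenessMs) <= currentTime

  def main(args: Array[String]): Unit = {
    val w = Window(0L, 5000L)
    println(isExpired(w, 2000L, 6000L)) // still inside the lateness period
    println(isExpired(w, 2000L, 6999L)) // cleanup time reached
  }
}
```

With an allowed lateness of 0 the window is dropped as soon as time passes its end; a larger lateness keeps the window state around so late elements can still be added.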

Window Assigners

Tumbling Windows (time-based)

The window size equals the slide, so windows never overlap.

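import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// Word count over 5-second tumbling processing-time windows
fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0) // group by the word (tuple field 0)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .reduce((v1, v2) => (v1._1, v1._2 + v2._2)) // sum the counts per word
  .print()

fsEnv.execute("FlinkWordCountsTumblingWindow_ReduceFunction")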
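
To make "no overlap" concrete, the following plain-Scala sketch maps a timestamp to the single tumbling window that contains it. It is an illustration rather than Flink's implementation; the object and method names are hypothetical, and the modulo formula assumes windows aligned to the epoch plus an optional offset.

```scala
// Plain-Scala sketch of tumbling window assignment; not Flink code.
object TumblingAssignSketch {
  // Start of the window containing `timestamp`, for windows of `size` ms shifted by `offset` ms.
  def windowStart(timestamp: Long, offset: Long, size: Long): Long =
    timestamp - (timestamp - offset + size) % size

  // The one window [start, end) that a timestamp falls into (offset 0).
  def assign(timestamp: Long, size: Long): (Long, Long) = {
    val start = windowStart(timestamp, 0L, size)
    (start, start + size)
  }

  def main(args: Array[String]): Unit = {
    println(assign(9999L, 5000L))  // last millisecond of one window
    println(assign(10000L, 5000L)) // first millisecond of the next window
  }
}
```

Every timestamp lands in exactly one window, which is why a tumbling window never counts an element twice.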

Sliding Windows (time-based)

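The window size is normally greater than or equal to the slide; if it is smaller, elements that fall between two windows are lost. When the size is greater than the slide, consecutive windows overlap.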

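import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// Word count over 4-second windows that slide every 2 seconds
fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .window(SlidingProcessingTimeWindows.of(Time.seconds(4), Time.seconds(2)))
  .aggregate(new AggregateFunction[(String, Int), (String, Int), (String, Int)] {
    // Start with an empty word and a count of zero
    override def createAccumulator(): (String, Int) = ("", 0)

    // Add one element's count to the accumulator
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) =
      (value._1, value._2 + accumulator._2)

    // The accumulator is already the final result
    override def getResult(accumulator: (String, Int)): (String, Int) = accumulator

    // Combine two partial accumulators for the same key
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) =
      (a._1, a._2 + b._2)
  })
  .print()

fsEnv.execute("FlinkWordCountsSlidingWindow_AggregateFunction")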
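
The overlap, and the data loss when the size is smaller than the slide, can be seen in a small plain-Scala sketch that lists every sliding window a timestamp belongs to. This is a model for illustration only, with hypothetical names, and it assumes windows aligned to the epoch with no offset.

```scala
// Plain-Scala sketch of sliding window assignment; not Flink code.
object SlidingAssignSketch {
  // All windows [start, start + size) that contain `timestamp`, latest first.
  def assign(timestamp: Long, size: Long, slide: Long): List[(Long, Long)] = {
    // Start of the latest window that could contain the timestamp
    val lastStart = timestamp - (timestamp + slide) % slide
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > timestamp - size) // stop once the window would end at or before the timestamp
      .map(start => (start, start + size))
      .toList
  }

  def main(args: Array[String]): Unit = {
    println(assign(5000L, 4000L, 2000L)) // size > slide: the element is in two windows
    println(assign(3500L, 1000L, 2000L)) // size < slide: the element is in no window
  }
}
```

With a size of 4 seconds and a slide of 2 seconds, as in the job above, each element is counted in two windows.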

Session Windows (time-based)

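Every element initially produces its own window. If the gap between two windows is smaller than the configured session gap, the system merges them. Unlike the two window types above, session windows are mergeable windows and have no fixed length.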

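import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// A session closes after 5 seconds without a new element for the key
fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(t => t._1)
  .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
  .apply(new WindowFunction[(String, Int), String, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
      val start = window.getStart
      val end = window.getEnd
      val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      // Emit the session bounds followed by every element collected in the session
      out.collect(sdf.format(start) + " ~ " + sdf.format(end) + "\t" + input.mkString(","))
    }
  })
  .print()

fsEnv.execute("FlinkWordCountsSessionWindow_WindowFunction")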
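
The merging behaviour can be modelled in a few lines of plain Scala. The sketch below is not Flink code: it gives each timestamp a window [ts, ts + gap) and merges windows that overlap or touch. The names are hypothetical, and treating windows that merely touch as mergeable is an assumption.

```scala
// Plain-Scala sketch of session window merging for a single key; not Flink code.
object SessionMergeSketch {
  // Builds the merged session windows [start, end) for the given element timestamps.
  def sessions(timestamps: Seq[Long], gap: Long): List[(Long, Long)] =
    timestamps.sorted.foldLeft(List.empty[(Long, Long)]) {
      // The element falls inside (or touches) the latest session: extend that session
      case ((start, end) :: rest, ts) if ts <= end =>
        (start, math.max(end, ts + gap)) :: rest
      // Otherwise the element opens a new session
      case (acc, ts) =>
        (ts, ts + gap) :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    // 1000 and 3000 are within the gap of each other; 9000 starts a new session
    println(sessions(Seq(1000L, 3000L, 9000L), 5000L))
  }
}
```

The resulting windows have different lengths, which is what distinguishes session windows from tumbling and sliding windows.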

Global Windows (not time-based)

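import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// A global window never ends by itself, so a trigger must decide when it fires
fsEnv.socketTextStream("HadoopNode00", 9999)
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(t => t._1)
  .window(GlobalWindows.create())
  .trigger(CountTrigger.of(2)) // fire after every 2 elements per key
  .fold(("", 0))((z, v) => (v._1, z._2 + v._2))
  .print()

fsEnv.execute("FlinkWordCountsGlobalWindow_FoldFunction")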
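
What the job above prints depends on whether the trigger clears the window when it fires. The plain-Scala sketch below models a count trigger that fires without purging, which is my understanding of CountTrigger; under that assumption the folded value keeps growing across firings. It is a simulation with hypothetical names, not Flink code.

```scala
// Plain-Scala simulation of a count trigger on a global window for one key; not Flink code.
object CountTriggerSketch {
  // Emits the running sum every `maxCount` elements, without clearing it (assumed behaviour).
  def run(values: Seq[Int], maxCount: Int): List[Int] = {
    var acc = 0   // fold accumulator, kept across firings
    var count = 0 // trigger counter, reset after each firing
    val out = List.newBuilder[Int]
    for (v <- values) {
      acc += v
      count += 1
      if (count >= maxCount) {
        out += acc
        count = 0
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    // Five occurrences of the same word, each mapped to a count of 1
    println(run(Seq(1, 1, 1, 1, 1), 2))
  }
}
```

If each firing should start from an empty window instead, wrapping the trigger as PurgingTrigger.of(CountTrigger.of(2)) is the usual approach.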

Window Functions

After defining the window assigner, we need to specify the computation that we want to perform on each of these windows. This is the responsibility of the window function, which is used to process the elements of each (possibly keyed) window once the system determines that a window is ready for processing (see triggers for how Flink determines when a window is ready).

ReduceFunction

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .reduce { (v1, v2) => (v1._1, v1._2 + v2._2) }

AggregateFunction

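import org.apache.flink.api.common.functions.AggregateFunction

// Computes the average of the second tuple field; the accumulator is (sum, count)
class AverageAggregate extends AggregateFunction[(String, Long), (Long, Long), Double] {
  override def createAccumulator(): (Long, Long) = (0L, 0L)

  override def add(value: (String, Long), accumulator: (Long, Long)): (Long, Long) =
    (accumulator._1 + value._2, accumulator._2 + 1L)

  // Convert to Double first, otherwise Long division truncates the average
  override def getResult(accumulator: (Long, Long)): Double =
    accumulator._1.toDouble / accumulator._2

  override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)
}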

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate)
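
The four steps of an aggregate function (create an accumulator, add elements, merge partial accumulators, extract the result) can be tried out without a cluster. The sketch below re-implements the averaging logic as plain Scala functions; it is an illustration with hypothetical names, not the Flink interface, and it uses floating-point division so the average is not truncated.

```scala
// Plain-Scala sketch of the accumulate / merge / result steps; not Flink code.
object AverageSketch {
  type Acc = (Long, Long) // (sum, count)

  def createAccumulator(): Acc = (0L, 0L)

  // Fold one element into the accumulator
  def add(value: (String, Long), acc: Acc): Acc = (acc._1 + value._2, acc._2 + 1L)

  // Combine two partial accumulators, e.g. when two windows are merged
  def merge(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Turn the accumulator into the final average
  def getResult(acc: Acc): Double = acc._1.toDouble / acc._2

  // Accumulate a batch of elements starting from an empty accumulator
  def accumulate(values: Seq[(String, Long)]): Acc =
    values.foldLeft(createAccumulator())((acc, v) => add(v, acc))

  def main(args: Array[String]): Unit = {
    val left = accumulate(Seq(("a", 1L), ("a", 2L)))
    val right = accumulate(Seq(("a", 4L)))
    println(getResult(merge(left, right)))
  }
}
```

Merging the two partial accumulators gives the same average as accumulating all three elements in one pass, which is the property a merge step has to preserve.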

FoldFunction

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .fold("") { (acc, v) => acc + v._2 }

fold() cannot be used with session windows or other mergeable windows.
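
A fold starts from an initial value and can produce a result type that differs from the element type. The plain-Scala sketch below mirrors the fold above with an ordinary foldLeft; the names are hypothetical and this is not Flink code. It also hints at the likely reason for the session window restriction: a fold only knows how to add one element to an accumulator, not how to combine two accumulators when windows merge.

```scala
// Plain-Scala sketch of the fold above, using foldLeft on a local collection; not Flink code.
object FoldSketch {
  // Appends the second field of every element to a String accumulator.
  def foldToString(values: Seq[(String, Long)]): String =
    values.foldLeft("")((acc, v) => acc + v._2)

  def main(args: Array[String]): Unit = {
    println(foldToString(Seq(("a", 1L), ("a", 2L), ("a", 3L))))
  }
}
```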

WindowFunction (Legacy)

In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and does not have some advanced features, such as per-window keyed state. This interface will be deprecated at some point.

trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}

val input: DataStream[(String, Long)] = ...

input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .apply(<
