[Flink Series] An Introduction to Multi-Stream Transformations

请叫我阿炜

License: CC 4.0 BY-SA

Category: [Flink Series]. Tags: flink, java, android

Article link: https://blog.youkuaiyun.com/qq_42875020/article/details/127378301

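This article takes a close look at multi-stream transformations in Apache Flink, covering both splitting and merging streams. Splitting is done with the side outputs of a process function, using an OutputTag to define and extract each side stream. Merging covers union, connect, and the time-based joins: window join, interval join, and window coGroup. These operations are central to stream processing because they let you combine and process data from different streams.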

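Overview of multi-stream transformations

Multi-stream transformations fall into two categories: splitting a stream and merging streams. Splitting is currently done with side outputs, whereas merging has a richer set of operators: depending on the requirement, you can call union, connect, join, or coGroup.

1. Splitting streams

Use the side outputs of a process function. A process function can itself be regarded as a transformation operator with a single output type, and its result is still a DataStream. A side output, however, can emit data of any user-defined type, with no such restriction; side outputs are like "branches" split off from the "main stream".

To emit to a side output, call the output() method of the context ctx; this can emit data of any type. Both marking and extracting a side stream rely on an output tag (OutputTag).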

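import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

// Event and ClickSource are user-defined classes that are not shown in this article.
public class KeyByStreamTest {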

    public static void main(String[] args) throws Exception {
        // 1. Create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timeStamp;
                            }
                        }));
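        // Define the side-output tags (the trailing {} creates an anonymous subclass so the generic type is retained)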
        OutputTag<Tuple3<String, String, Long>> maryTag = new OutputTag<Tuple3<String, String, Long>>("Mary"){};
        OutputTag<Tuple3<String, String, Long>> bobTag = new OutputTag<Tuple3<String, String, Long>>("Bob"){};

        SingleOutputStreamOperator<Event> process = stream.process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event event, ProcessFunction<Event, Event>.Context context, Collector<Event> collector) throws Exception {
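                // Constant-first equals avoids a NullPointerException when user is null.
                // The names match the tag ids; adjust the casing to whatever ClickSource emits.
                if ("Mary".equals(event.user)) {
                    context.output(maryTag, Tuple3.of(event.user, event.url, event.timeStamp));
                } else if ("Bob".equals(event.user)) {
                    context.output(bobTag, Tuple3.of(event.user, event.url, event.timeStamp));
                } else {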
                    collector.collect(event);
                }
            }
        });

        process.print();
        process.getSideOutput(maryTag).print();
        process.getSideOutput(bobTag).print();
        
        env.execute();
    }

}

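2. Merging streams

Union

The simplest way to merge is to combine several streams directly. Union requires all streams to have the same element type; the resulting stream contains every element of every input, and the element type is unchanged.
In code, call union() on a DataStream and pass the other DataStreams as arguments; the result is still a DataStream.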

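import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

import java.time.Duration;

/**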
 * @ClassName: UnionStream
 * @Author: VV
 * @Version: 1.0.0
 * @Description: TODO
 * @MyEmail: vv1213418894@163.com
 * @CreateTime: 2022-10-21 21:50:47
 */
public class UnionStream {

    public static void main(String[] args) throws Exception {
        // 1. Create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Event> stream1 = env.socketTextStream("hadoop101",7777)
                .map(data -> {
                    String[] split = data.split(",");
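                    // Long.parseLong parses the text; Long.getLong would look up a system property and return null
                    return new Event(split[0], split[1], Long.parseLong(split[2].trim()));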
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timeStamp;
                            }
                        }));

        SingleOutputStreamOperator<Event> stream2 = env.socketTextStream("hadoop102",8888)
                .map(data -> {
                    String[] split = data.split(",");
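                    // Long.parseLong parses the text; Long.getLong would look up a system property and return null
                    return new Event(split[0], split[1], Long.parseLong(split[2].trim()));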
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timeStamp;
                            }
                        }));

        stream1.union(stream2)
                .process(new ProcessFunction<Event, String>() {
                    @Override
                    public void processElement(Event event, ProcessFunction<Event, String>.Context context, Collector<String> collector) throws Exception {
                        collector.collect("Current watermark: " + context.timerService().currentWatermark());
                    }
                }).print();
        

        env.execute();

    }

}
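
The job above prints the operator's current watermark after the union. As a rough mental model, the watermark downstream of a merge is the minimum of the watermarks of its inputs, so the stream with the larger out-of-orderness bound holds event time back. The plain-Java sketch below illustrates that rule without using any Flink API. The class name UnionWatermarkSketch and its methods are hypothetical; the per-input formula (largest timestamp seen, minus the delay, minus 1 ms) is the one used by forBoundedOutOfOrderness. Real jobs emit watermarks periodically, so the printed values can lag behind what this idealized model computes.

```java
import java.util.Arrays;

/** Plain-Java sketch (not Flink API): the watermark after a union is the minimum over its inputs. */
final class UnionWatermarkSketch {
    private final long[] delaysMs;      // out-of-orderness bound per input
    private final long[] maxTimestamps; // largest event timestamp seen per input

    UnionWatermarkSketch(long... delaysMs) {
        this.delaysMs = delaysMs.clone();
        this.maxTimestamps = new long[delaysMs.length];
        // Sentinel meaning "no element seen on this input yet"
        Arrays.fill(maxTimestamps, Long.MIN_VALUE);
    }

    /** Records an element with the given event timestamp on the given input. */
    void onElement(int input, long timestampMs) {
        maxTimestamps[input] = Math.max(maxTimestamps[input], timestampMs);
    }

    /** Watermark of a single input: max timestamp - delay - 1, or Long.MIN_VALUE before any element. */
    long inputWatermark(int input) {
        if (maxTimestamps[input] == Long.MIN_VALUE) {
            return Long.MIN_VALUE;
        }
        return maxTimestamps[input] - delaysMs[input] - 1;
    }

    /** Watermark after the union: the minimum of all input watermarks. */
    long currentWatermark() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < delaysMs.length; i++) {
            min = Math.min(min, inputWatermark(i));
        }
        return min;
    }
}
```

With delays of 2 s and 5 s as in the example, an element at 10000 ms on each input gives input watermarks of 7999 and 4999, so the merged watermark is 4999. Until the second input has produced an element, the merged watermark stays at Long.MIN_VALUE.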

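Connect

Connect is more flexible: it allows the two streams to have different element types. Because a DataStream can hold only one element type, the result of connect is not a DataStream but a ConnectedStreams. It unifies the two streams in form only: they sit in one stream object, yet internally each keeps its own element type and the two remain independent. To get a new DataStream, apply a co-process transformation (such as a CoMapFunction or CoProcessFunction) that specifies how elements from each source and type are processed and converted into a single output type.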

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<Integer> stream1 = env.fromElements(1, 2, 3);
        DataStreamSource<Long> stream2 = env.fromElements(4L, 5L, 6L, 7L);

        SingleOutputStreamOperator<String> map = stream2.connect(stream1).map(new CoMapFunction<Long, Integer, String>() {
            @Override
            public String map1(Long aLong) throws Exception {
                return "Long";
            }

            @Override
            public String map2(Integer integer) throws Exception {
                return "Integer";
            }
        });

        map.print();

        env.execute();
    }

Example: reconciling app orders against third-party payment records in real time:

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Tuple3<String, String, Long>> stream1 = env.fromElements(
                Tuple3.of("order-1", "app", 1000L),
                Tuple3.of("order-2", "app", 2000L),
                Tuple3.of("order-3", "app", 3500L)
                ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> element, long recordTimestamp) {
                        return element.f2;
                    }
                }));

        SingleOutputStreamOperator<Tuple4<String, String, String, Long>> stream2 = env.fromElements(
                Tuple4.of("order-1", "third-party", "Success", 3000L),
                Tuple4.of("order-3", "third-party", "Success", 4000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple4<String, String, String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple4<String, String, String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple4<String, String, String, Long> element, long recordTimestamp) {
                        return element.f3;
                    }
                }));

        SingleOutputStreamOperator<String> process = stream1.connect(stream2)
                .keyBy(data -> data.f0, data -> data.f0)
                .process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>() {

                    private transient ValueState<Tuple3<String, String, Long>> appState;
                    private transient ValueState<Tuple4<String, String, String, Long>> orderState;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        appState = getRuntimeContext().getState(new ValueStateDescriptor<Tuple3<String, String, Long>>("appState", Types.TUPLE(Types.STRING, Types.STRING, Types.LONG)));
                        orderState = getRuntimeContext().getState(new ValueStateDescriptor<Tuple4<String, String, String, Long>>("orderState", Types.TUPLE(Types.STRING, Types.STRING, Types.STRING, Types.LONG)));
                    }

                    @Override
                    public void processElement1(Tuple3<String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context context, Collector<String> collector) throws Exception {
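                        // An app event arrived: check whether the matching third-party event is already here
                        if (orderState.value() != null) {
                            collector.collect("Reconciliation succeeded: " + value + " " + orderState.value());
                            orderState.clear();
                        } else {
                            appState.update(value);
                            // Register a timer and wait up to 5 seconds for the other stream
                            context.timerService().registerEventTimeTimer(value.f2 + 5000L);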
                        }
                    }

                    @Override
                    public void processElement2(Tuple4<String, String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context context, Collector<String> collector) throws Exception {
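                        // A third-party event arrived: check whether the matching app event is already here
                        if (appState.value() != null) {
                            collector.collect("Reconciliation succeeded: " + value + " " + appState.value());
                            appState.clear();
                        } else {
                            orderState.update(value);
                            // Wait up to 5 seconds for the app event, mirroring processElement1
                            context.timerService().registerEventTimeTimer(value.f3 + 5000L);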
                        }
                    }

                    @Override
                    public void onTimer(long timestamp, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
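                        // The timer fired and one side is still waiting: emit the failed reconciliation
                        if (appState.value() != null) {
                            out.collect("Reconciliation failed: " + appState.value() + " arrived, but the third-party payment record did not");
                        }
                        if (orderState.value() != null) {
                            out.collect("Reconciliation failed: " + orderState.value() + " arrived, but the app log did not");
                        }
                        appState.clear();
                        orderState.clear();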
                    }
                });

        process.print();

        env.execute();
    }

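Time-based merging: two-stream joins (Join)

In SQL, "join" is usually rendered in Chinese as 连接. To keep the operators apart, this article uses 连接 (connect) for the general-purpose connect operation and 联结 (join) for join.

Window Join
Interval Join
Flink Join documentation (detailed)

1. Calling a Window Join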

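A window join works on elements of two streams that share the same key and fall into the same window. The windows are defined by a window assigner, and elements from both streams are used to compute each window's result.

Once elements from the two streams are paired, each pair is passed to a user-defined JoinFunction or FlatJoinFunction, which emits the results that satisfy the join criteria.

The general usage can be summarized as follows: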

stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>);

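Tumbling Window Join

With a tumbling window join, all elements that have the same key and share a tumbling window are paired up and passed to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, an element of one stream that is not paired with any element of the other stream is not emitted.
[Figure: tumbling window join]
As the figure shows, we define a tumbling window of size 2 milliseconds, which yields the windows [0,1], [2,3], and so on. The figure shows how the elements in each window are paired; the pairs are then passed to the JoinFunction. Note that the tumbling window [6,7] emits nothing, because the green stream has no elements to pair with ⑥ and ⑦ from the orange stream.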

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
 
...

DataStream<Integer> orangeStream = ...;
DataStream<Integer> greenStream = ...;

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply (new JoinFunction<Integer, Integer, String> (){
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
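
To make the pairing rule concrete, here is a plain-Java sketch (not Flink API) of what a tumbling window join computes for a single key. It assumes, as in the figure, that each element's value doubles as its event timestamp in milliseconds. The class TumblingJoinSketch, its methods, and the sample values used in the note below are hypothetical; watermarks, triggers, and keys are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of an inner join over tumbling windows for one key. */
final class TumblingJoinSketch {

    /** Start of the tumbling window [start, start + size) that contains the timestamp. */
    static long windowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    /** Groups timestamps by the start of their tumbling window, ordered by window start. */
    private static Map<Long, List<Long>> groupByWindow(long[] timestamps, long size) {
        Map<Long, List<Long>> byWindow = new TreeMap<>();
        for (long t : timestamps) {
            byWindow.computeIfAbsent(windowStart(t, size), k -> new ArrayList<>()).add(t);
        }
        return byWindow;
    }

    /** Emits "orange,green" for every pair of elements that share a window. */
    static List<String> join(long[] orange, long[] green, long size) {
        Map<Long, List<Long>> orangeByWindow = groupByWindow(orange, size);
        Map<Long, List<Long>> greenByWindow = groupByWindow(green, size);
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> entry : orangeByWindow.entrySet()) {
            List<Long> partners = greenByWindow.get(entry.getKey());
            if (partners == null) {
                continue; // inner join: a window with only one side emits nothing
            }
            for (long o : entry.getValue()) {
                for (long g : partners) {
                    out.add(o + "," + g);
                }
            }
        }
        return out;
    }
}
```

For example, with orange elements 0, 1, 3, 6, 7, green elements 0, 3, 4, and a window size of 2, the sketch returns 0,0 then 1,0 then 3,3. Orange 6 and 7 fall into window [6,7], which has no green element, so they are dropped.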

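Sliding Window Join

With a sliding window join, all elements that have the same key and fall into the same sliding window are paired up and passed to a JoinFunction or FlatJoinFunction. Within a given sliding window, an element of one stream that is not paired with any element of the other stream is not emitted. Note that elements joined in one sliding window are not necessarily joined in another.
[Figure: sliding window join]
In this example we define sliding windows with a size of two milliseconds and a slide of one millisecond, which yields the windows [-1,0], [0,1], [1,2], [2,3], and so on. Below the X axis are the elements that are joined in each sliding window and passed to the JoinFunction. You can see that orange ② joins green ③ in window [2,3] but joins nothing in window [1,2].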

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...;
DataStream<Integer> greenStream = ...;

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply (new JoinFunction<Integer, Integer, String> (){
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
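
Because sliding windows overlap, one element belongs to several windows. The plain-Java sketch below (not Flink API; the class SlidingWindowSketch and its methods are hypothetical) lists the windows a timestamp falls into and checks whether two timestamps share at least one window. It assumes a window offset of zero and treats each window as the half-open interval [start, start + size).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch: which sliding windows contain a given timestamp (offset assumed to be 0). */
final class SlidingWindowSketch {

    /** Start timestamps of all windows [start, start + size) containing the timestamp, latest first. */
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** True if the two timestamps fall into at least one common sliding window. */
    static boolean shareWindow(long a, long b, long size, long slide) {
        List<Long> common = new ArrayList<>(windowStarts(a, size, slide));
        common.retainAll(windowStarts(b, size, slide));
        return !common.isEmpty();
    }
}
```

With size 2 and slide 1, timestamp 2 belongs to the windows starting at 2 and 1, which the text writes as [2,3] and [1,2]. Timestamps 2 and 3 share the window starting at 2, which is why orange ② can be paired with green ③ there, while timestamps 2 and 4 share no window at all.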

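Session Window Join

With a session window join, all elements that have the same key and, taken together, satisfy the session criteria are paired up and passed to a JoinFunction or FlatJoinFunction. This is again an inner join, so a session window that contains elements from only one stream produces no output.
[Figure: session window join]
Here we define a session window with a gap of at least one millisecond. There are three sessions in the figure. In the first two, both streams have elements, which are joined and passed to the JoinFunction. In the third session the green stream has no elements, so ⑧ and ⑨ are not joined.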

DataStream<Integer> orangeStream = ...;
DataStream<Integer> greenStream = ...;

orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply (new JoinFunction<Integer, Integer, String> (){
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });

2. Interval Join

public static void main(String[] args) throws Exception {
        // 1. Create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Tuple2<String, Long>> orderStream = env.fromElements(
                Tuple2.of("Mary", 5000L),
                Tuple2.of("Alice", 5000L),
                Tuple2.of("Bob", 20000L),
                Tuple2.of("Alice", 20000L),
                Tuple2.of("Cary", 51000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        return element.f1;
                    }
                }));

        SingleOutputStreamOperator<Event> clickeStream = env.fromElements(
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Alice", "./prod?id=200", 3500L),
                new Event("Bob", "./prod?id=2", 2500L),
                new Event("Alice", "./prod?id=300", 36000L),
                new Event("Bob", "./home", 30000L),
                new Event("Bob", "./prod?id=1", 23000L),
                new Event("Bob", "./prod?id=3", 33000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                    @Override
                    public long extractTimestamp(Event element, long recordTimestamp) {
                        return element.timeStamp;
                    }
                }));

        SingleOutputStreamOperator<String> process = orderStream.keyBy(data -> data.f0)
                .intervalJoin(clickeStream.keyBy(data -> data.user))
                .between(Time.seconds(-5), Time.seconds(5))
                .process(new ProcessJoinFunction<Tuple2<String, Long>, Event, String>() {
                    @Override
                    public void processElement(Tuple2<String, Long> left, Event right, ProcessJoinFunction<Tuple2<String, Long>, Event, String>.Context context, Collector<String> collector) throws Exception {
                        collector.collect(left + "  " + right);
                    }
                });
        process.print();

        env.execute();
    }
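
The matching rule behind between(lower, upper) can be stated in one line: a right-hand element joins a left-hand element with the same key when left.ts + lower <= right.ts <= left.ts + upper, with both bounds inclusive by default. The plain-Java sketch below (not Flink API) applies that rule to the sample data of the example above. The class IntervalJoinSketch, the Rec type, and the output format are hypothetical, and the sketch ignores watermarks, state cleanup, and late data, so it shows which pairs satisfy the condition rather than the exact output of a running job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of the interval join condition, applied to the example's sample data. */
final class IntervalJoinSketch {

    /** A keyed record with an event timestamp in milliseconds. */
    static final class Rec {
        final String key;
        final long ts;

        Rec(String key, long ts) {
            this.key = key;
            this.ts = ts;
        }
    }

    /** Same key, and right.ts within [left.ts + lowerMs, left.ts + upperMs], bounds inclusive. */
    static boolean matches(Rec left, Rec right, long lowerMs, long upperMs) {
        return left.key.equals(right.key)
                && right.ts >= left.ts + lowerMs
                && right.ts <= left.ts + upperMs;
    }

    /** Emits "key:leftTs->rightTs" for every matching pair, in left-then-right order. */
    static List<String> join(List<Rec> lefts, List<Rec> rights, long lowerMs, long upperMs) {
        List<String> out = new ArrayList<>();
        for (Rec left : lefts) {
            for (Rec right : rights) {
                if (matches(left, right, lowerMs, upperMs)) {
                    out.add(left.key + ":" + left.ts + "->" + right.ts);
                }
            }
        }
        return out;
    }

    /** Orders and clicks copied from the example, joined with between(-5 s, +5 s). */
    static List<String> sampleResult() {
        List<Rec> orders = Arrays.asList(
                new Rec("Mary", 5000L), new Rec("Alice", 5000L), new Rec("Bob", 20000L),
                new Rec("Alice", 20000L), new Rec("Cary", 51000L));
        List<Rec> clicks = Arrays.asList(
                new Rec("Bob", 2000L), new Rec("Alice", 3000L), new Rec("Alice", 3500L),
                new Rec("Bob", 2500L), new Rec("Alice", 36000L), new Rec("Bob", 30000L),
                new Rec("Bob", 23000L), new Rec("Bob", 33000L));
        return join(orders, clicks, -5000L, 5000L);
    }
}
```

Under this rule the Alice order at 5000 ms matches the Alice clicks at 3000 ms and 3500 ms, and the Bob order at 20000 ms matches the Bob click at 23000 ms. The Mary and Cary orders have no clicks at all, and the Alice order at 20000 ms has no click within 5 seconds.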

3. Window CoGroup

Window coGroup is used much like a window join: the two streams are merged, windowed, and the matching elements are processed. To call it, replace join() with coGroup().

stream1.coGroup(stream2)
     .where(<KeySelector>)
     .equalTo(<KeySelector>)
     .window(TumblingEventTimeWindows.of(Time.hours(1)))
     .apply(<CoGroupFunction>)

The apply() method takes a CoGroupFunction, a functional interface whose source is:

public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {
 	void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}

coGroup is more general than a window join: besides the equivalent of SQL's inner join, it can also implement left outer join, right outer join, and full outer join. In fact, window join is itself implemented on top of coGroup.

public static void main(String[] args) throws Exception {

        // 1. Create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        SingleOutputStreamOperator<Tuple2<String, Long>> stream1 = env.fromElements(
                Tuple2.of("Mary", 5000L),
                Tuple2.of("Alice", 5000L),
                Tuple2.of("Bob", 20000L),
                Tuple2.of("Alice", 20000L),
                Tuple2.of("Cary", 51000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        return element.f1;
                    }
                }));

        SingleOutputStreamOperator<Event> stream2 = env.fromElements(
                new Event("Bob", "./cart", 2000L),
                new Event("Alice", "./prod?id=100", 3000L),
                new Event("Alice", "./prod?id=200", 3500L),
                new Event("Bob", "./prod?id=2", 2500L),
                new Event("Alice", "./prod?id=300", 36000L),
                new Event("Bob", "./home", 30000L),
                new Event("Bob", "./prod?id=1", 23000L),
                new Event("Bob", "./prod?id=3", 33000L)
        ).assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
                .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                    @Override
                    public long extractTimestamp(Event element, long recordTimestamp) {
                        return element.timeStamp;
                    }
                }));

        stream1.coGroup(stream2)
                .where(data -> data.f0)
                .equalTo(data -> data.user)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .apply(new CoGroupFunction<Tuple2<String, Long>, Event, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, Long>> left, Iterable<Event> right, Collector<String> out) throws Exception {
                        out.collect(left + " " + right);
                    }
                }).print();


        env.execute();

    }
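
Because coGroup hands the function both groups for a key and window, even when one of them is empty, outer joins come down to handling the empty case. The plain-Java sketch below (not Flink API) shows the body one might write for a left outer join. It mirrors the shape of coGroup(first, second, out) but returns a list instead of using a Collector; the class CoGroupOuterJoinSketch, the String element type, and the "=>" output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of left-outer-join logic for one key and one window. */
final class CoGroupOuterJoinSketch {

    /** Pairs every left element with every right element, or with null if the right group is empty. */
    static List<String> leftOuterJoin(Iterable<String> left, Iterable<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            boolean matched = false;
            for (String r : right) {
                out.add(l + " => " + r);
                matched = true;
            }
            if (!matched) {
                // An inner join would drop this element; a left outer join keeps it
                out.add(l + " => null");
            }
        }
        return out;
    }
}
```

Swapping the roles of the two groups gives a right outer join, and also emitting unmatched right elements gives a full outer join.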

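Summary

Multi-stream transformations are a common requirement in real-world stream processing, and they fall into two categories: merging and splitting. In Flink, splitting is done with the side outputs of a process function, while merging is supported by APIs at several levels.

The most basic ways to merge are union and connect. They differ in that union can merge any number of streams but requires the same element type, whereas connect combines exactly two streams whose types may differ. In fact, connect exposes the lowest-level process function interface, so any custom merge logic can be built on it with state and timers.

Flink also provides several built-in joins that merge two streams over a period of time. These are higher-level APIs specialized for particular needs: window join, interval join, and window coGroup. Window join and window coGroup are both based on time windows; the window assigner is defined in the usual way, while the window function is limited to a single kind, invoked through .apply(). Interval join does not involve windows; instead, for each element it selects a time interval and joins against it, and the final processing is done by calling .process() with a ProcessJoinFunction.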