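Introduction to Multi-Stream Transformations
Multi-stream transformations fall into two categories: splitting one stream into several, and merging several streams into one. Splitting is currently done with side outputs. Merging has a richer set of operators: depending on the requirement, you can call union, connect, join, or coGroup.
1. Splitting streams
Splitting relies on the side outputs of a process function. A process function is itself a transformation operator whose main output has a single type and is still a DataStream. A side output has no such restriction: it can emit data of any type you define, like a tributary branching off the main stream.
To emit to a side output, call the output() method on the context ctx; the data can be of whatever type the tag declares. Marking and retrieving a side output both depend on an output tag (OutputTag).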
public class KeyByStreamTest {
public static void main(String[] args) throws Exception {
// 1. Create the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
.assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
.withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
@Override
public long extractTimestamp(Event element, long recordTimestamp) {
return element.timeStamp;
}
}));
// Define the output tags for the side outputs
OutputTag<Tuple3<String, String, Long>> maryTag = new OutputTag<Tuple3<String, String, Long>>("Mary"){};
OutputTag<Tuple3<String, String, Long>> bobTag = new OutputTag<Tuple3<String, String, Long>>("Bob"){};
SingleOutputStreamOperator<Event> process = stream.process(new ProcessFunction<Event, Event>() {
@Override
public void processElement(Event event, ProcessFunction<Event, Event>.Context context, Collector<Event> collector) throws Exception {
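// Assumption: ClickSource emits capitalized user names such as "Mary" and "Bob"; adjust the literals to match your source
if ("Mary".equals(event.user)) {
context.output(maryTag, Tuple3.of(event.user, event.url, event.timeStamp));
} else if ("Bob".equals(event.user)) {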
context.output(bobTag, Tuple3.of(event.user, event.url, event.timeStamp));
} else {
collector.collect(event);
}
}
});
process.print();
process.getSideOutput(maryTag).print();
process.getSideOutput(bobTag).print();
env.execute();
}
}
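The role of the output tag can be modeled without Flink. The sketch below is a plain-Java illustration, not Flink's implementation: `Tag`, `SideOutputs`, and `route` are hypothetical names, and the events are made up. It shows why a tag carries a type parameter: the tag is both the key under which records are stored and the guarantee of their type when they are read back, so each side output can have its own element type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SideOutputSketch {
    /** Hypothetical stand-in for OutputTag<T>: an id plus a compile-time element type. */
    static final class Tag<T> {
        final String id;
        Tag(String id) { this.id = id; }
    }

    /** Holds one list of records per tag id. */
    static final class SideOutputs {
        private final Map<String, List<Object>> byTag = new HashMap<>();

        <T> void output(Tag<T> tag, T value) {
            byTag.computeIfAbsent(tag.id, k -> new ArrayList<>()).add(value);
        }

        @SuppressWarnings("unchecked")
        <T> List<T> get(Tag<T> tag) {
            // Safe in this sketch because output() only accepts values of the tag's type.
            return (List<T>) (List<?>) byTag.getOrDefault(tag.id, new ArrayList<>());
        }
    }

    static final Tag<String> MARY = new Tag<>("Mary");
    static final Tag<Integer> BOB_URL_LENGTH = new Tag<>("Bob");

    /** Routes {user, url} pairs: Mary and Bob go to differently typed side outputs, the rest stay in the main output. */
    static List<String> route(String[][] events, SideOutputs side) {
        List<String> main = new ArrayList<>();
        for (String[] e : events) {
            String user = e[0];
            String url = e[1];
            if ("Mary".equals(user)) {
                side.output(MARY, url);                    // String side output
            } else if ("Bob".equals(user)) {
                side.output(BOB_URL_LENGTH, url.length()); // Integer side output
            } else {
                main.add(user + " " + url);
            }
        }
        return main;
    }

    public static void main(String[] args) {
        SideOutputs side = new SideOutputs();
        String[][] events = {{"Mary", "./home"}, {"Bob", "./cart"}, {"Alice", "./prod"}};
        System.out.println(route(events, side));
        System.out.println(side.get(MARY));
        System.out.println(side.get(BOB_URL_LENGTH));
    }
}
```

With the three sample events, the main output keeps only Alice's record, the Mary side output holds a String, and the Bob side output holds an Integer.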
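2. Merging streams
Union
The simplest merge operation combines several streams directly into one. Union requires every stream to have the same data type; the resulting stream contains all elements of all inputs, and the type is unchanged.
In code, call union() on a DataStream and pass the other DataStreams as arguments; the result is still a DataStream.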
/**
* @ClassName: UnionStream
* @Author: VV
* @Version: 1.0.0
* @Description: Unions two socket streams and prints the current watermark
* @MyEmail: vv1213418894@163.com
* @CreateTime: 2022-10-21 21:50:47
*/
public class UnionStream {
public static void main(String[] args) throws Exception {
// 1. Create the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Event> stream1 = env.socketTextStream("hadoop101",7777)
.map(data -> {
String[] split = data.split(",");
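// Long.getLong() reads a system property; Long.parseLong() parses the text itself
return new Event(split[0].trim(), split[1].trim(), Long.parseLong(split[2].trim()));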
})
.assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
.withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
@Override
public long extractTimestamp(Event element, long recordTimestamp) {
return element.timeStamp;
}
}));
SingleOutputStreamOperator<Event> stream2 = env.socketTextStream("hadoop102",8888)
.map(data -> {
String[] split = data.split(",");
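// Long.getLong() reads a system property; Long.parseLong() parses the text itself
return new Event(split[0].trim(), split[1].trim(), Long.parseLong(split[2].trim()));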
})
.assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
@Override
public long extractTimestamp(Event element, long recordTimestamp) {
return element.timeStamp;
}
}));
stream1.union(stream2)
.process(new ProcessFunction<Event, String>() {
@Override
public void processElement(Event event, ProcessFunction<Event, String>.Context context, Collector<String> collector) throws Exception {
collector.collect("Current watermark: " + context.timerService().currentWatermark());
}
}).print();
env.execute();
}
}
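The watermark printed by this job follows two rules that are easy to check by hand: a bounded-out-of-orderness strategy emits the largest timestamp seen so far minus the delay minus 1 millisecond, and an operator with several inputs uses the minimum of its inputs' watermarks. The sketch below models both rules in plain Java; `MinWatermarkTracker` and the sample timestamps are hypothetical, and this is not Flink's own code.

```java
import java.util.Arrays;

class UnionWatermarkSketch {
    /** Watermark of a bounded-out-of-orderness strategy after it has seen maxTimestampMs. */
    static long boundedOutOfOrdernessWatermark(long maxTimestampMs, long delayMs) {
        return maxTimestampMs - delayMs - 1;
    }

    /** Tracks the latest watermark per input and exposes the minimum as the operator's watermark. */
    static final class MinWatermarkTracker {
        private final long[] inputWatermarks;

        MinWatermarkTracker(int inputs) {
            inputWatermarks = new long[inputs];
            Arrays.fill(inputWatermarks, Long.MIN_VALUE); // no watermark received yet
        }

        /** Records a watermark from one input and returns the combined watermark. */
        long onWatermark(int input, long watermark) {
            inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
            long min = Long.MAX_VALUE;
            for (long w : inputWatermarks) {
                min = Math.min(min, w);
            }
            return min;
        }
    }

    public static void main(String[] args) {
        MinWatermarkTracker tracker = new MinWatermarkTracker(2);
        // Input 0 (2 s delay) has seen timestamp 10000; input 1 has sent nothing yet.
        System.out.println(tracker.onWatermark(0, boundedOutOfOrdernessWatermark(10_000L, 2_000L)));
        // Input 1 (5 s delay) has now also seen timestamp 10000.
        System.out.println(tracker.onWatermark(1, boundedOutOfOrdernessWatermark(10_000L, 5_000L)));
    }
}
```

The first call still returns Long.MIN_VALUE because the second input has not produced a watermark; the second call returns 4999, the slower of the two inputs. This is why the unioned stream's watermark does not advance until both sockets have received data.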
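Connect
Connect is more flexible: it allows the two streams to have different data types. Because a DataStream can hold only one type, the result of connect is not a DataStream but a ConnectedStreams. It can be seen as a formal unification of two streams: they are placed in one stream, yet each keeps its own data form internally and stays independent of the other. To get a new DataStream, define a co-process transformation that specifies how data from each source and of each type is processed and converted into one unified output type.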
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<Integer> stream1 = env.fromElements(1, 2, 3);
DataStreamSource<Long> stream2 = env.fromElements(4L, 5L, 6L, 7L);
SingleOutputStreamOperator<String> map = stream2.connect(stream1).map(new CoMapFunction<Long, Integer, String>() {
@Override
public String map1(Long aLong) throws Exception {
return "Long";
}
@Override
public String map2(Integer integer) throws Exception {
return "Integer";
}
});
map.print();
env.execute();
}
Example: reconciling app orders with third-party payment records
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple3<String, String, Long>> stream1 = env.fromElements(
Tuple3.of("order-1", "app", 1000L),
Tuple3.of("order-2", "app", 2000L),
Tuple3.of("order-3", "app", 3500L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
@Override
public long extractTimestamp(Tuple3<String, String, Long> element, long recordTimestamp) {
return element.f2;
}
}));
SingleOutputStreamOperator<Tuple4<String, String, String, Long>> stream2 = env.fromElements(
Tuple4.of("order-1", "third-party", "Success", 3000L),
Tuple4.of("order-3", "third-party", "Success", 4000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple4<String, String, String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple4<String, String, String, Long>>() {
@Override
public long extractTimestamp(Tuple4<String, String, String, Long> element, long recordTimestamp) {
return element.f3;
}
}));
SingleOutputStreamOperator<String> process = stream1.connect(stream2)
.keyBy(data -> data.f0, data -> data.f0)
.process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>() {
private transient ValueState<Tuple3<String, String, Long>> appState;
private transient ValueState<Tuple4<String, String, String, Long>> orderState;
@Override
public void open(Configuration parameters) throws Exception {
appState = getRuntimeContext().getState(new ValueStateDescriptor<Tuple3<String, String, Long>>("appState", Types.TUPLE(Types.STRING, Types.STRING, Types.LONG)));
orderState = getRuntimeContext().getState(new ValueStateDescriptor<Tuple4<String, String, String, Long>>("orderState", Types.TUPLE(Types.STRING, Types.STRING, Types.STRING, Types.LONG)));
}
@Override
public void processElement1(Tuple3<String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context context, Collector<String> collector) throws Exception {
// Check whether the matching third-party payment event has already arrived
if (orderState.value() != null) {
collector.collect("Reconciliation succeeded: " + value + " " + orderState.value());
orderState.clear();
} else {
appState.update(value);
// Register a timer and wait up to 5 seconds for the other stream
context.timerService().registerEventTimeTimer(value.f2 + 5000L);
}
}
@Override
public void processElement2(Tuple4<String, String, String, Long> value, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.Context context, Collector<String> collector) throws Exception {
// Check whether the matching app event has already arrived
if (appState.value() != null) {
collector.collect("Matched: " + value + " " + appState.value());
appState.clear();
} else {
orderState.update(value);
// Wait up to 5 seconds for the app event, mirroring processElement1
context.timerService().registerEventTimeTimer(value.f3 + 5000L);
}
}
@Override
public void onTimer(long timestamp, CoProcessFunction<Tuple3<String, String, Long>, Tuple4<String, String, String, Long>, String>.OnTimerContext ctx, Collector<String> out) throws Exception {
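// Whichever state is still set when the timer fires never found its counterpart
if (appState.value() != null) {
out.collect("Reconciliation failed: " + appState.value() + " arrived, but no third-party payment record did");
}
if (orderState.value() != null) {
out.collect("Reconciliation failed: " + orderState.value() + " arrived, but no app log did");
}
appState.clear();
orderState.clear();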
}
});
process.print();
env.execute();
}
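The state-and-timer logic of this example can be traced without a cluster. The sketch below is a simplified plain-Java model, not the Flink job itself: sets stand in for the keyed ValueState, a sorted map stands in for the event-time timers, and `advanceWatermark` is a hypothetical method that fires every timer at or below the given watermark. It assumes both sides wait 5 seconds for their counterpart.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class ReconciliationSketch {
    static final long WAIT_MS = 5000L; // assumed waiting time for both sides

    private final Set<String> pendingApp = new HashSet<>();     // app events still waiting
    private final Set<String> pendingPayment = new HashSet<>(); // payment events still waiting
    private final TreeMap<Long, List<String>> timers = new TreeMap<>();
    final List<String> out = new ArrayList<>();

    void onAppEvent(String orderId, long ts) {
        if (pendingPayment.remove(orderId)) {
            out.add("matched " + orderId);
        } else {
            pendingApp.add(orderId);
            registerTimer(ts + WAIT_MS, orderId);
        }
    }

    void onPaymentEvent(String orderId, long ts) {
        if (pendingApp.remove(orderId)) {
            out.add("matched " + orderId);
        } else {
            pendingPayment.add(orderId);
            registerTimer(ts + WAIT_MS, orderId);
        }
    }

    /** Fires every timer whose timestamp is at or below the watermark. */
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            for (String orderId : timers.pollFirstEntry().getValue()) {
                // A timer for an already matched order finds both sets empty and does nothing.
                if (pendingApp.remove(orderId)) out.add("app only " + orderId);
                if (pendingPayment.remove(orderId)) out.add("payment only " + orderId);
            }
        }
    }

    private void registerTimer(long ts, String orderId) {
        timers.computeIfAbsent(ts, k -> new ArrayList<>()).add(orderId);
    }

    public static void main(String[] args) {
        ReconciliationSketch s = new ReconciliationSketch();
        s.onAppEvent("order-1", 1000L);
        s.onAppEvent("order-2", 2000L);
        s.onAppEvent("order-3", 3500L);
        s.onPaymentEvent("order-1", 3000L);
        s.onPaymentEvent("order-3", 4000L);
        s.advanceWatermark(Long.MAX_VALUE); // end of input
        System.out.println(s.out);
    }
}
```

Fed the sample data from the example, the model matches order-1 and order-3 and reports order-2 as present only on the app side once the watermark passes its timer.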
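Time-based merging: two-stream join (Join)
In SQL, join is usually translated into Chinese as 连接. To tell the operators apart, these notes render the general merge operation connect as 连接 (connect) and join as 联结 (join).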
- Window Join
- Interval Join
Flink Join documentation, explained in detail
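1. Window Join: basic usage
A window join works on elements of two streams that share the same key and lie in the same window. The windows are defined by a window assigner, and elements from both streams are used to compute the window result.
The paired elements from the two sides are then passed to a user-defined JoinFunction or FlatJoinFunction, which emits the results that satisfy the join criteria.
The general usage can be summarized as follows: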
stream.join(otherStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>);
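Tumbling Window Join
In a tumbling window join, all elements with the same key that share a tumbling window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, an element of one stream that has no counterpart from the other stream in its window is not emitted.
As the figure shows, we define a tumbling window of size 2 milliseconds, which yields windows of the form [0,1], [2,3], and so on. The figure shows the pairwise combinations of all elements in each window that are passed to the JoinFunction. Note that the tumbling window [6,7] emits nothing, because the green stream has no elements to pair with the orange elements ⑥ and ⑦.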
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...;
DataStream<Integer> greenStream = ...;
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});
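How elements end up in the same tumbling window comes down to one piece of arithmetic: the window start is the timestamp rounded down to a multiple of the window size. The sketch below is a plain-Java model under simplifying assumptions (zero window offset, all elements share one key, element value equals its timestamp); `windowStart`, `join`, and the sample arrays are hypothetical and not part of the Flink API.

```java
import java.util.ArrayList;
import java.util.List;

class TumblingJoinSketch {
    /** Start of the tumbling window containing ts; the window covers [start, start + size). */
    static long windowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    /** Inner join: pairs every orange element with every green element in the same window. */
    static List<String> join(long[] orange, long[] green, long size) {
        List<String> pairs = new ArrayList<>();
        for (long o : orange) {
            for (long g : green) {
                if (windowStart(o, size) == windowStart(g, size)) {
                    pairs.add(o + "," + g);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        long[] orange = {0, 1, 2, 3, 4, 5, 6, 7}; // made-up sample timestamps in milliseconds
        long[] green = {0, 1, 3, 4};
        System.out.println(join(orange, green, 2));
    }
}
```

With a window size of 2, the orange elements 6 and 7 fall into the window starting at 6, which contains no green element, so they produce no pairs. That is the inner-join behavior described above.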
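Sliding Window Join
In a sliding window join, all elements with the same key that lie in the same sliding window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Within the current sliding window, an element of one stream that has no counterpart from the other stream is not emitted. Note that elements joined in one sliding window are not necessarily joined in another.
In this example we use sliding windows with a size of two milliseconds and a slide of one millisecond, giving the windows [-1,0], [0,1], [1,2], [2,3], and so on. The joined elements below the x-axis are the ones passed to the JoinFunction for each sliding window. You can see that orange ② is joined with green ③ in the window [2,3], but is not joined with anything in the window [1,2].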
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...;
DataStream<Integer> greenStream = ...;
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});
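With sliding windows an element belongs to several windows at once (size divided by slide of them), and two elements are joined only in the windows they share. The sketch below computes those windows in plain Java; it assumes a zero window offset, and `windowStarts` and `sharedWindows` are hypothetical helper names rather than Flink methods.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowSketch {
    /** Start times of every sliding window [start, start + size) that contains ts, newest first. */
    static List<Long> windowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, slide);
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** Start times of the windows that contain both timestamps. */
    static List<Long> sharedWindows(long a, long b, long size, long slide) {
        List<Long> shared = new ArrayList<>(windowStarts(a, size, slide));
        shared.retainAll(windowStarts(b, size, slide));
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(windowStarts(2, 2, 1));     // windows containing timestamp 2
        System.out.println(sharedWindows(2, 3, 2, 1)); // windows containing both 2 and 3
    }
}
```

For size 2 and slide 1, timestamp 2 lies in the windows starting at 2 and 1, written [2,3] and [1,2] in the notation above, while timestamp 3 lies in those starting at 3 and 2. The only shared window starts at 2, which is why orange ② and green ③ are joined in [2,3] only.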
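Session Window Join
In a session window join, all elements with the same key that together satisfy the session criteria are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. This is again an inner join, so a session window that contains elements from only one stream emits nothing.
Here we define a session window with a gap of at least one millisecond. There are three sessions. In the first two, both streams contribute elements, which are joined and passed to the JoinFunction. In the third session the green stream has no elements, so ⑧ and ⑨ are not joined.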
DataStream<Integer> orangeStream = ...;
DataStream<Integer> greenStream = ...;
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});
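Session boundaries depend on the elements of both streams taken together, which is easier to see in a small model. The sketch below is plain Java with made-up timestamps, not Flink code. It assumes one shared key, element values equal to their timestamps, and that elements exactly one gap apart still belong to the same session, which is how the integer-millisecond example above reads. `sessions` and `join` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SessionWindowSketch {
    /** Splits timestamps into sessions; a new session starts when the distance to the previous element exceeds gapMs. */
    static List<List<Long>> sessions(long[] timestamps, long gapMs) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sorted) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMs) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    /** Inner join: sessions are built over both streams together, then pairs are formed inside each session. */
    static List<String> join(long[] orange, long[] green, long gapMs) {
        long[] all = new long[orange.length + green.length];
        System.arraycopy(orange, 0, all, 0, orange.length);
        System.arraycopy(green, 0, all, orange.length, green.length);
        List<String> pairs = new ArrayList<>();
        for (List<Long> session : sessions(all, gapMs)) {
            for (long o : orange) {
                for (long g : green) {
                    if (session.contains(o) && session.contains(g)) {
                        pairs.add(o + "," + g);
                    }
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        long[] orange = {1, 2, 5, 6, 8, 9}; // made-up sample timestamps in milliseconds
        long[] green = {0, 4, 5};
        System.out.println(join(orange, green, 1));
    }
}
```

The sample data forms three sessions. The last one contains only the orange elements 8 and 9, so it contributes no pairs.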
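2. Interval Join
An interval join pairs each element of one keyed stream with the elements of the other keyed stream that have the same key and whose timestamps fall within a time interval relative to it. In the example below the interval runs from 5 seconds before to 5 seconds after each order.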
public static void main(String[] args) throws Exception {
// 1. Create the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple2<String, Long>> orderStream = env.fromElements(
Tuple2.of("Mary", 5000L),
Tuple2.of("Alice", 5000L),
Tuple2.of("Bob", 20000L),
Tuple2.of("Alice", 20000L),
Tuple2.of("Cary", 51000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
@Override
public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
return element.f1;
}
}));
SingleOutputStreamOperator<Event> clickeStream = env.fromElements(
new Event("Bob", "./cart", 2000L),
new Event("Alice", "./prod?id=100", 3000L),
new Event("Alice", "./prod?id=200", 3500L),
new Event("Bob", "./prod?id=2", 2500L),
new Event("Alice", "./prod?id=300", 36000L),
new Event("Bob", "./home", 30000L),
new Event("Bob", "./prod?id=1", 23000L),
new Event("Bob", "./prod?id=3", 33000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
@Override
public long extractTimestamp(Event element, long recordTimestamp) {
return element.timeStamp;
}
}));
SingleOutputStreamOperator<String> process = orderStream.keyBy(data -> data.f0)
.intervalJoin(clickeStream.keyBy(data -> data.user))
.between(Time.seconds(-5), Time.seconds(5))
.process(new ProcessJoinFunction<Tuple2<String, Long>, Event, String>() {
@Override
public void processElement(Tuple2<String, Long> left, Event right, ProcessJoinFunction<Tuple2<String, Long>, Event, String>.Context context, Collector<String> collector) throws Exception {
collector.collect(left + " " + right);
}
});
process.print();
env.execute();
}
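The matching rule behind between(lower, upper) is a single inequality: a right element matches a left element with the same key when leftTs + lower <= rightTs <= leftTs + upper, with both bounds inclusive by default. The sketch below applies that rule to the sample data of this example in plain Java, which makes the expected matches easy to verify by hand. `inInterval` and `join` are hypothetical helpers, and watermarks and late data are ignored.

```java
import java.util.ArrayList;
import java.util.List;

class IntervalJoinSketch {
    /** True when rightTs lies in [leftTs + lowerMs, leftTs + upperMs], both bounds inclusive. */
    static boolean inInterval(long leftTs, long rightTs, long lowerMs, long upperMs) {
        return rightTs >= leftTs + lowerMs && rightTs <= leftTs + upperMs;
    }

    /** Pairs every left element with every right element of the same key inside its interval. */
    static List<String> join(String[] leftKeys, long[] leftTs, String[] rightKeys, long[] rightTs,
                             long lowerMs, long upperMs) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < leftKeys.length; i++) {
            for (int j = 0; j < rightKeys.length; j++) {
                if (leftKeys[i].equals(rightKeys[j])
                        && inInterval(leftTs[i], rightTs[j], lowerMs, upperMs)) {
                    out.add(leftKeys[i] + "@" + leftTs[i] + " -> " + rightTs[j]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] orderUsers = {"Mary", "Alice", "Bob", "Alice", "Cary"};
        long[] orderTs = {5000, 5000, 20000, 20000, 51000};
        String[] clickUsers = {"Bob", "Alice", "Alice", "Bob", "Alice", "Bob", "Bob", "Bob"};
        long[] clickTs = {2000, 3000, 3500, 2500, 36000, 30000, 23000, 33000};
        System.out.println(join(orderUsers, orderTs, clickUsers, clickTs, -5000, 5000));
    }
}
```

Only three pairs satisfy the rule: Alice's order at 5000 with her clicks at 3000 and 3500, and Bob's order at 20000 with his click at 23000. Orders from Mary and Cary have no clicks at all, and Alice's order at 20000 has none inside its interval.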
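3. Window CoGroup
Usage is very similar to a window join: the two streams are merged, windowed, and the matching elements processed. The only change in the call is to replace join() with coGroup().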
stream1.coGroup(stream2)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(TumblingEventTimeWindows.of(Time.hours(1)))
.apply(<CoGroupFunction>)
apply() takes a CoGroupFunction, a function interface whose source is:
public interface CoGroupFunction<IN1, IN2, O> extends Function, Serializable {
void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out) throws Exception;
}
coGroup is more general than a window join: besides an SQL-style inner join, it can also implement a left outer join, a right outer join, and a full outer join. In fact, the window join itself is implemented on top of coGroup.
public static void main(String[] args) throws Exception {
// 1. Create the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
SingleOutputStreamOperator<Tuple2<String, Long>> stream1 = env.fromElements(
Tuple2.of("Mary", 5000L),
Tuple2.of("Alice", 5000L),
Tuple2.of("Bob", 20000L),
Tuple2.of("Alice", 20000L),
Tuple2.of("Cary", 51000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
@Override
public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
return element.f1;
}
}));
SingleOutputStreamOperator<Event> stream2 = env.fromElements(
new Event("Bob", "./cart", 2000L),
new Event("Alice", "./prod?id=100", 3000L),
new Event("Alice", "./prod?id=200", 3500L),
new Event("Bob", "./prod?id=2", 2500L),
new Event("Alice", "./prod?id=300", 36000L),
new Event("Bob", "./home", 30000L),
new Event("Bob", "./prod?id=1", 23000L),
new Event("Bob", "./prod?id=3", 33000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ZERO)
.withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
@Override
public long extractTimestamp(Event element, long recordTimestamp) {
return element.timeStamp;
}
}));
stream1.coGroup(stream2)
.where(data -> data.f0)
.equalTo(data -> data.user)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new CoGroupFunction<Tuple2<String, Long>, Event, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, Long>> left, Iterable<Event> right, Collector<String> out) throws Exception {
out.collect(left + " " + right);
}
}).print();
env.execute();
}
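Because coGroup hands over the complete group from each side for every key and window, even when one side is empty, an outer join is only a matter of how the two groups are combined. The sketch below models a left outer join over the sample data of this example in plain Java. It assumes 5-second tumbling windows with zero offset, and `group` and `leftOuterJoin` are hypothetical helpers rather than Flink methods. In a real CoGroupFunction the same branching would sit inside coGroup(), iterating over the two Iterable arguments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CoGroupSketch {
    /** Start of the tumbling window [start, start + size) containing ts. */
    static long windowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    /** Groups timestamps by "user@windowStart", the unit that one coGroup call would see. */
    static Map<String, List<Long>> group(String[] users, long[] ts, long size) {
        Map<String, List<Long>> groups = new LinkedHashMap<>();
        for (int i = 0; i < users.length; i++) {
            String key = users[i] + "@" + windowStart(ts[i], size);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(ts[i]);
        }
        return groups;
    }

    /** Left outer join: every left element is emitted, paired with null when the right group is empty. */
    static List<String> leftOuterJoin(Map<String, List<Long>> left, Map<String, List<Long>> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : left.entrySet()) {
            List<Long> matches = right.getOrDefault(e.getKey(), new ArrayList<>());
            for (Long l : e.getValue()) {
                if (matches.isEmpty()) {
                    rows.add(e.getKey() + ": " + l + ",null");
                } else {
                    for (Long r : matches) {
                        rows.add(e.getKey() + ": " + l + "," + r);
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] users1 = {"Mary", "Alice", "Bob", "Alice", "Cary"};
        long[] ts1 = {5000, 5000, 20000, 20000, 51000};
        String[] users2 = {"Bob", "Alice", "Alice", "Bob", "Alice", "Bob", "Bob", "Bob"};
        long[] ts2 = {2000, 3000, 3500, 2500, 36000, 30000, 23000, 33000};
        for (String row : leftOuterJoin(group(users1, ts1, 5000), group(users2, ts2, 5000))) {
            System.out.println(row);
        }
    }
}
```

Only Bob's element at 20000 shares a key and window with a right-side element (23000). An inner join would emit that single pair, while the left outer join emits five rows, four of them with null on the right.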
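Summary
Multi-stream transformations are a common requirement in real stream-processing applications and fall into two categories: merging and splitting. In Flink, splitting is done with the side outputs of a process function, while merging is covered by APIs at several levels.
The most basic ways to merge are union and connect. union can merge more than two streams but requires them all to have the same data type; connect joins only two streams, whose types may differ. connect also exposes the lowest-level process function interface, so state and timers can be used to implement any custom merge logic.
Flink also provides several built-in join operations that merge two streams over a period of time. They are higher-level APIs specialized for particular needs: window join, interval join, and window coGroup. window join and coGroup are both based on time windows; the window assigner is defined as introduced earlier, but the window function is limited to a single kind, invoked through .apply(). interval join is independent of windows: for each element it takes a corresponding time interval to join against, and the final processing is done by calling .process() with a ProcessJoinFunction.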