5 Flink流处理核心编程
和其他所有的计算框架一样,Flink也有一些基础的开发步骤以及核心的API。从开发步骤的角度来讲,主要分为四大部分:
- Environment
- Source
- Transform
- Sink
5.1 Environment
Flink Job在提交执行计算时,需要首先建立和Flink框架之间的联系,也就是获取当前的Flink运行环境。只有获取了环境信息,才能将Task调度到不同的TaskManager执行。而这个环境对象的获取方式相对比较简单。
// 批处理环境
ExecutionEnvironment benv = ExecutionEnvironment.getExecutionEnvironment();
// 流式数据处理环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
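如果希望在本地(IDEA)调试时也能访问Flink的Web UI,还可以显式创建带Web UI的本地环境。下面是一个示意写法(端口8081仅为假设,需要后面pom中引入的flink-runtime-web依赖):
// 需要 import org.apache.flink.configuration.Configuration 和 org.apache.flink.configuration.RestOptions
Configuration conf = new Configuration();
conf.setInteger(RestOptions.PORT, 8081); // Web UI端口, 8081为假设值
StreamExecutionEnvironment envWithUI = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);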
5.2 Source
Flink框架可以从不同的来源获取数据,将数据提交给框架进行处理, 我们将获取数据的来源称之为数据源(Source)。
5.2.1 准备工作
<properties>
<flink.version>1.13.0</flink.version>
<java.version>1.8</java.version>
<scala.binary.version>2.12</scala.binary.version>
<slf4j.version>1.7.30</slf4j.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-to-slf4j</artifactId>
<version>2.14.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.16</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
1. 导入注解工具依赖, 方便生成POJO类
<!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.16</version>
</dependency>
2. 准备一个WaterSensor类方便演示
package com.atguigu.flink.source;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
* 水位传感器:用于接收水位数据
*
* id:传感器编号
* ts:时间戳
* vc:水位
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class WaterSensor {
private String id;
private Long ts;
private Integer vc;
}
5.2.2 从Java的集合中读取数据
一般情况下,可以将数据临时存储到内存中,形成特殊的数据结构后,作为数据源使用。这里的数据结构采用集合类型是比较普遍的。
package com.atguigu.flink.source;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
public class Test01_Source_Collection {
public static void main(String[] args) throws Exception {
List<WaterSensor> waterSensors = Arrays.asList(
new WaterSensor("ws_001", System.currentTimeMillis(), new Random().nextInt(50)),
new WaterSensor("ws_002", System.currentTimeMillis(), new Random().nextInt(50)),
new WaterSensor("ws_003", System.currentTimeMillis(), new Random().nextInt(50)));
// 1 创建执行环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
env
.fromCollection(waterSensors)
.print();
env.execute();
}
}
5.2.3 从文件读取数据
package com.atguigu.flink.source;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test02_Source_File {
public static void main(String[] args) throws Exception {
// 1 创建执行环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.readTextFile("input")
.print();
env.execute();
}
}
说明:
- 参数可以是目录,也可以是文件
- 路径可以是相对路径,也可以是绝对路径
- 相对路径是从系统属性 user.dir 获取路径:idea下的是 project 根目录,standalone 模式下是集群节点根目录
- 也可以从HDFS目录下读取,使用路径 hdfs://hadoop102:8020/… 由于Flink没有提供hadoop相关依赖,需要pom中添加hadoop 客户端依赖
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.3</version>
</dependency>
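添加依赖后,从HDFS读取的写法和读取本地文件一致,只需把路径换成HDFS路径即可(下面的具体目录仅为示意):
env
    .readTextFile("hdfs://hadoop102:8020/input") // HDFS路径, 目录名仅为示意
    .print();
env.execute();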
5.2.4 从Socket读取数据
package com.atguigu.flink.source;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test03_Source_Socket {
public static void main(String[] args) throws Exception {
// 1 创建执行环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 2 从Socket读取数据, 需要先在hadoop102上启动: nc -lk 9999
DataStreamSource<String> lineDS = env.socketTextStream("hadoop102", 9999);
lineDS
.print();
env.execute();
}
}
5.2.5 从Kafka读取数据
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.12</artifactId>
<version>1.13.0</version>
</dependency>
package com.atguigu.flink.source;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class Test04_Source_Kafka {
public static void main(String[] args) throws Exception {
// 0 todo Kafka相关配置
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "hadoop102:9092,hadoop103:9092,hadoop104:9092");
properties.setProperty("group.id", "Flink01_Source_Kafka");
properties.setProperty("auto.offset.reset", "latest");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.addSource(new FlinkKafkaConsumer<>("sensor", new SimpleStringSchema(), properties))
.print("kafka source");
env.execute();
}
}
kafka-console-producer.sh --broker-list hadoop102:9092 --topic sensor
5.2.6 自定义Source
大多数情况下,前面的数据源已经能够满足需要,但是难免会存在特殊情况的场合,所以Flink也提供了自定义数据源的方式。
package com.atguigu.flink.source;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
public class Test05_Source_Custom {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.addSource(new MySource("hadoop102", 9999))
.print();
env.execute();
}
public static class MySource implements SourceFunction<WaterSensor> {
private String host;
private int port;
private volatile boolean isRunning = true;
private Socket socket;
public MySource(String host, int port) {
this.host = host;
this.port = port;
}
@Override
public void run(SourceContext<WaterSensor> ctx) throws Exception {
// 实现一个从Socket读取数据的source
socket = new Socket(host, port);
BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
String line = null;
while (isRunning && (line = reader.readLine()) != null) {
String[] split = line.split(",");
ctx.collect(new WaterSensor(split[0], Long.valueOf(split[1]), Integer.valueOf(split[2])));
}
}
@Override
public void cancel() {
isRunning = false;
try {
socket.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
自定义 SourceFunction:
- 实现 SourceFunction 相关接口
- 重写两个方法:
- run(): 主要逻辑
- cancel(): 停止逻辑
如果希望 Source 可以指定并行度,那么就 实现 ParallelSourceFunction 这个接口。
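下面是一个可并行自定义Source的简单示意(随机生成WaterSensor数据,字段取值和生成间隔仅为演示假设):
// 需要 import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;
public static class MyParallelSource implements ParallelSourceFunction<WaterSensor> {
    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<WaterSensor> ctx) throws Exception {
        Random random = new Random();
        while (isRunning) {
            // 随机生成一条水位数据(仅为演示)
            ctx.collect(new WaterSensor("sensor_" + random.nextInt(10), System.currentTimeMillis(), random.nextInt(100)));
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
// 使用时可以指定大于1的并行度: env.addSource(new MyParallelSource()).setParallelism(2).print();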
5.3 Transform
转换算子可以把一个或多个DataStream转成一个新的DataStream,程序可以把多个转换组合成复杂的数据流拓扑。
5.3.1 map
作用:将数据流中的数据进行转换, 形成新的数据流,消费一个元素并产出一个元素
参数:lambda表达式 或 MapFunction实现类
返回值:DataStream -> DataStream
package com.atguigu.flink.transform;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test01_Map_Anonymous {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromElements(1, 2, 3, 4, 5, 6)
// 匿名内部类
/* .map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) throws Exception {
return value * value;
}
})
.print();*/
// Lambda表达式
/*
.map(ele -> ele * ele)
.print();
*/
// 静态内部类
.map(new MyMapFunction())
.print();
env.execute();
}
private static class MyMapFunction implements MapFunction<Integer, Integer> {
@Override
public Integer map(Integer value) throws Exception {
return value * value;
}
}
}
Rich…Function类
所有Flink函数类都有其Rich版本。它与常规函数的不同在于,可以获取运行环境的上下文,并拥有一些生命周期方法,所以可以实现更复杂的功能,也意味着提供了更多、更丰富的功能。例如:RichMapFunction
// 得到一个新的数据流: 新的流的元素是原来流的元素的平方
package com.atguigu.flink.transform;
import org.apache.flink.api.common.functions.IterationRuntimeContext;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test02_Map_RichMapFunction {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromElements(1, 2, 3, 4, 5)
.map(new MyRichMapFunction())
.setParallelism(2)
.print();
env.execute();
}
private static class MyRichMapFunction extends RichMapFunction<Integer, Integer> {
@Override
public void setRuntimeContext(RuntimeContext t) {
System.out.println("设置运行时上下文 执行一次");
}
@Override
public RuntimeContext getRuntimeContext() {
System.out.println("运行上下文 执行一次");
return super.getRuntimeContext();
}
@Override
public IterationRuntimeContext getIterationRuntimeContext() {
System.out.println("迭代时运行上下文 运行一次");
return super.getIterationRuntimeContext();
}
// 默认生命周期方法,初始化方法,在每个并行度上只会被调用一次
@Override
public void open(Configuration parameters) throws Exception {
System.out.println("open 执行一次");
}
// 默认生命周期方法,最后一个方法,做一些清理工作,在每个并行度上只调用一次
@Override
public void close() throws Exception {
System.out.println("close 执行一次");
}
@Override
public Integer map(Integer value) throws Exception {
System.out.println("map 一个元素执行一次");
return value * value;
}
}
}
设置运行时上下文 执行一次
设置运行时上下文 执行一次
open 执行一次
open 执行一次
map 一个元素执行一次
map 一个元素执行一次
map 一个元素执行一次
map 一个元素执行一次
map 一个元素执行一次
close 执行一次
close 执行一次
1> 16
7> 9
6> 1
8> 25
12> 4
Process finished with exit code 0
方法执行次数:
- 默认生命周期方法,初始化方法 open() 在每个并行度上只会被调用一次,而且最先被调用。
- 默认生命周期方法,最后一个方法 close() 做一些清理工作,在每个并行度上只调用一次,而且是最后被调用,但读文件时,在每个并行度上会调用两次。
- 运行时上下文方法 getRuntimeContext() 提供了函数 RuntimeContext 的一些信息,例如函数执行的并行度,任务的名字,以及 state 状态,开发人员在需要的时候可以自行调用获取运行时上下文对象。
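例如,可以在 map() 中通过 getRuntimeContext() 获取子任务编号、任务名称等信息(示意写法,可放在任意 RichMapFunction 中):
@Override
public Integer map(Integer value) throws Exception {
    // 通过运行时上下文获取当前子任务编号和任务名称(示意)
    int subtaskIndex = getRuntimeContext().getIndexOfThisSubtask();
    String taskName = getRuntimeContext().getTaskName();
    System.out.println("subtask " + subtaskIndex + " of " + taskName + " 处理元素: " + value);
    return value * value;
}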
5.3.2 flatMap
作用:消费一个元素并产生零个或多个元素
参数:FlatMapFunction实现类
返回:DataStream -> DataStream
package com.atguigu.flink.transform;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class Test03_FlatMap {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromElements(1, 2, 3, 4, 5)
// 匿名内部类
/* .flatMap(new FlatMapFunction<Integer, Integer>() {
@Override
public void flatMap(Integer value, Collector<Integer> out) throws Exception {
out.collect(value*value);
out.collect(value*value*value);
}
})
.print();*/
// Lambda
.flatMap((Integer value, Collector<Integer> out) -> {
out.collect(value * value);
out.collect(value * value * value);
})
.returns(Types.INT)
.print();
env.execute();
}
}
说明:在使用Lambda表达式的时候, 由于泛型擦除的存在, 在运行的时候无法获取泛型的具体类型, 全部当做Object来处理, 极其低效, 所以Flink要求当参数中有泛型的时候, 必须明确指定泛型的类型.
5.3.3 filter
作用:根据指定的规则将满足条件(true)的数据保留,不满足条件(false)的数据丢弃。
参数:FilterFunction实现类
返回:DataStream -> DataStream
package com.atguigu.flink.transform;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test04_Fliter {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/* env
.socketTextStream("localhost", 9999)
.map(ele -> Integer.valueOf(ele))
.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) throws Exception {
return value % 2 == 0;
}
})
.print();*/
env
.socketTextStream("localhost", 9999)
.map(ele -> Integer.valueOf(ele))
.filter(ele -> ele % 2 == 0)
.print();
env.execute();
}
}
5.3.4 keyBy
作用:
1 把流中的数据分到不同的分区中,具有相同 key 的元素会分到同一个分区中,一个分区中可以有多个不同的key
2 在内部是使用hash分区来实现的
分组和分区的区别:
分组:是一个逻辑上的划分,按照key进行划分,经过keyby,同一个分组的数据肯定会进入同一个分区
分区:下游算子的一个并行实例(等价于一个slot),同一个分区内,可能有多个分组
参数:
key选择器函数:interface KeySelector<IN, KEY>
注意:什么值不可以作为keySelect的Key:
1 没有覆写hashCode方法、而是依赖Object.hashCode的POJO:因为每个对象的hash值都是独一无二的,每个元素都会分到一个单独的组,实际情况是可以运行,但是分组没有意义
2 任何类型的数组
返回:DataStream -> KeyedStream
package com.atguigu.flink.transform;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test05_KeyBy {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromElements(10, 3, 5, 9, 20, 8)
/* .keyBy(new KeySelector<Integer, String>() {
@Override
public String getKey(Integer value) throws Exception {
return value % 2 == 0 ? "偶数" : "奇数";
}
})
.print();*/
.keyBy(value -> value % 2 == 0 ? "偶数" : "奇数")
.print();
env.execute();
}
}
总结:
- 指定 位置索引,只能用于 Tuple 的数据类型
KeyedStream<WaterSensor, Tuple> sensorKS = sensorDS.keyBy(0);
- 指定 字段名字,适用于 POJO
KeyedStream<WaterSensor, Tuple> sensorKS = sensorDS.keyBy("id");
- 推荐使用 KeySelector
KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(new KeySelector<WaterSensor, String>() {
@Override
public String getKey(WaterSensor value) throws Exception {
return value.getId();
}
});
5.3.5 shuffle

作用:把流中的元素随机打乱。对同一组数据,每次执行得到的分布结果可能不同
参数:无
返回:DataStream -> DataStream
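下面是 shuffle 的一个简单示意,多次运行时同一元素被分到的分区(输出前面的编号)可能不同:
// shuffle示意: 把元素随机分发到下游分区
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);
env
    .fromElements(10, 3, 5, 9, 20, 8)
    .shuffle()
    .print();
env.execute();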
5.3.6 Split和select
已经过时, 在1.12中已经被移除
作用:在某些情况下,我们需要将数据流根据某些特征拆分成两个或者多个数据流,给不同数据流增加标记以便于从流中取出.
split用于给流中的每个元素添加标记. select用于根据标记取出对应的元素, 组成新的流.
参数
split参数: interface OutputSelector<OUT>
select参数: 字符串
返回
split: SingleOutputStreamOperator -> SplitStream
select: SplitStream -> DataStream
// 匿名内部类写法
// 奇数一个流, 偶数一个流
SplitStream<Integer> splitStream = env
.fromElements(10, 3, 5, 9, 20, 8)
.split(new OutputSelector<Integer>() {
@Override
public Iterable<String> select(Integer value) {
return value % 2 == 0
? Collections.singletonList("偶数")
: Collections.singletonList("奇数");
}
});
splitStream
.select("偶数")
.print("偶数");
splitStream
.select("奇数")
.print("奇数");
env.execute();
// Lambda表达式写法
// 奇数一个流, 偶数一个流
SplitStream<Integer> splitStream = env
.fromElements(10, 3, 5, 9, 20, 8)
.split(value -> value % 2 == 0
? Collections.singletonList("偶数")
: Collections.singletonList("奇数"));
splitStream
.select("偶数")
.print("偶数");
splitStream
.select("奇数")
.print("奇数");
env.execute();
5.3.7 connect
作用:在某些情况下,我们需要将两个不同来源的数据流进行连接,实现数据匹配,比如订单支付和第三方交易信息,这两个信息的数据就来自于不同数据源,连接后,将订单支付和第三方交易信息进行对账,此时,才能算真正的支付完成。Flink中的connect算子可以连接两个保持他们类型的数据流,两个数据流被connect之后,只是被放在了一个同一个流中,内部依然保持各自的数据和形式不发生任何变化,两个流相互独立。
参数:另外一个流
返回:DataStream[A], DataStream[B] -> ConnectedStreams[A,B]
package com.atguigu.flink.transform;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test06_Connect {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> intStream = env.fromElements(1, 2, 3, 4, 5);
DataStreamSource<String> stringStream = env.fromElements("a", "b", "c");
ConnectedStreams<Integer, String> connectedStreams = intStream.connect(stringStream);
connectedStreams.getFirstInput().print("first");
connectedStreams.getSecondInput().print("second");
env.execute();
}
}
注意:
- 两个流中存储的数据类型可以不同
- 只是机械的合并在一起,内部仍然是分离的2个流
- 只能2个流进行connect,不能有第3个流参与
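connect 之后通常会使用 CoMapFunction/CoFlatMapFunction 等算子对两条流分别处理,下面是一个把两条流都转成 String 输出的示意写法(基于上面的 connectedStreams):
// 需要 import org.apache.flink.streaming.api.functions.co.CoMapFunction;
connectedStreams
    .map(new CoMapFunction<Integer, String, String>() {
        @Override
        public String map1(Integer value) throws Exception {
            // 处理第一条流(Integer)的元素
            return "Integer: " + value;
        }
        @Override
        public String map2(String value) throws Exception {
            // 处理第二条流(String)的元素
            return "String: " + value;
        }
    })
    .print();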
5.3.8 union
作用:对两个或两个以上的DataStream进行union操作,产生一个包含所有DataStream元素的新DataStream
package com.atguigu.flink.transform;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test07_Union {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> stream1 = env.fromElements(1, 2, 3, 4, 5);
DataStreamSource<Integer> stream2 = env.fromElements(10, 20, 30, 40, 50);
DataStreamSource<Integer> stream3 = env.fromElements(100, 200, 300, 400, 500);
stream1
.union(stream2)
.union(stream3)
.print();
env.execute();
}
}
connect 和 union 的区别:
- union之前两个流的类型必须是一样,connect可以不一样
- connect 只能操作两个流,union可以操作多个流
5.3.9 简单滚动聚合算子
sum min max minBy maxBy
作用:对KeyedStream的每一个支流做聚合。执行完成后,会将聚合的结果合成一个流返回,所以结果都是DataStream。
参数:
1 如果流中存储的是POJO或者Scala的样例类,参数使用字段名
2 如果流中存储的是元组,参数就是位置(基于 0 1...)
返回:KeyedStream -> SingleOutputStreamOperator
DataStreamSource<Integer> stream = env.fromElements(1, 2, 3, 4, 5);
KeyedStream<Integer, String> kbStream = stream.keyBy(ele -> ele % 2 == 0 ? "偶数" : "奇数");
kbStream.sum(0).print("sum");
kbStream.max(0).print("max");
kbStream.min(0).print("min");
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 30));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
KeyedStream<WaterSensor, String> kbStream = env
.fromCollection(waterSensors)
.keyBy(WaterSensor::getId);
kbStream
.sum("vc")
.print("max...");
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
KeyedStream<WaterSensor, String> kbStream = env
.fromCollection(waterSensors)
.keyBy(WaterSensor::getId);
kbStream
.maxBy("vc", false)
.print("maxBy...");
env.execute();
注意:
滚动聚合算子:来一条,聚合一条
- 聚合算子在 keyby 之后调用,因为这些算子都是属于 KeyedStream 里的
- 聚合算子,作用范围,都是分组内。 也就是说,不同分组,要分开算。
- max、maxBy的区别:
- max:取指定字段当前的最大值,如果有多个字段,其他非比较字段,以第一条为准
- maxBy:取指定字段当前的最大值,如果有多个字段,其他字段以最大值那条数据为准;
- 如果出现两条数据都是最大值,由第二个参数决定:
- true => 其他字段取 比较早的值
- false => 其他字段,取最新的值
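基于上面的 WaterSensor 数据,可以分别调用 max 和 maxBy 直观对比两者对非比较字段的处理(示意写法,基于上面的 kbStream):
// max: 非比较字段(如ts)保留该分组第一条数据的值
kbStream.max("vc").print("max");
// maxBy: 非比较字段随取得最大值的那条数据一起输出
kbStream.maxBy("vc", false).print("maxBy");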
5.3.10 reduce
作用:
一个分组数据流的聚合操作,合并当前的元素和上次聚合的结果,产生一个新的值,返回的流中包含每一次聚合的结果,而不是只返回最后一次聚合的最终结果。为什么保留聚合的中间值?考虑流式数据的特点: 没有终点, 也就没有最终的概念了. 任何一个中间的聚合结果都是输出值!
参数:
interface ReduceFunction<T>
返回:
KeyedStream => SingleOutputStreamOperator
package com.atguigu.flink.transform;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
public class Test08_Reduce {
public static void main(String[] args) throws Exception {
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
KeyedStream<WaterSensor, String> kbStream = env
.fromCollection(waterSensors)
.keyBy(WaterSensor::getId);
/* kbStream
.reduce(new ReduceFunction<WaterSensor>() {
@Override
public WaterSensor reduce(WaterSensor value1, WaterSensor value2) throws Exception {
System.out.println("reduce function...");
return new WaterSensor(value1.getId(), value1.getTs(), value1.getVc() + value2.getVc());
}
})
.print("reduce...");*/
kbStream
.reduce((value1, value2) -> {
System.out.println("reduce function...");
return new WaterSensor(value1.getId(), value1.getTs(), value1.getVc() + value2.getVc());
})
.print();
env.execute();
}
}
注意:
- 一个分组的第一条数据来的时候,不会进入reduce方法
- 输入和输出的数据类型,一定要一样
5.3.11 process
作用:process算子在Flink里算是一个比较底层的算子,很多类型的流上都可以调用,可以从流中获取更多的信息(不仅仅是数据本身)
// 在keyBy之前的流上使用
env
.fromCollection(waterSensors)
.process(new ProcessFunction<WaterSensor, Tuple2<String, Integer>>() {
@Override
public void processElement(WaterSensor value,
Context ctx,
Collector<Tuple2<String, Integer>> out) throws Exception {
out.collect(new Tuple2<>(value.getId(), value.getVc()));
}
})
.print();
// 在keyBy之后的流上使用
env
.fromCollection(waterSensors)
.keyBy(WaterSensor::getId)
.process(new KeyedProcessFunction<String, WaterSensor, Tuple2<String, Integer>>() {
@Override
public void processElement(WaterSensor value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
out.collect(new Tuple2<>("key是:" + ctx.getCurrentKey(), value.getVc()));
}
})
.print();
5.3.12 对流重新分区的几个算子
keyBy: 先按照key分组,按照key的双重hash来选择后面的分区
shuffle: 对流中的元素随机分区
rebalance: 对流中的元素平均分布到每个分区,处理倾斜数据的时候,可以进行性能优化
rescale: 同rebalance一样,也是以轮询的方式平均分布数据。但是要比rebalance更高效,因为rescale不需要通过网络,完全走的“管道”
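这些重分区算子直接作用在 DataStream 上即可,下面是一个调用方式的示意:
// 重分区算子调用示意
DataStreamSource<Integer> stream = env.fromElements(1, 2, 3, 4, 5, 6);
stream.shuffle().print("shuffle");     // 随机分区
stream.rebalance().print("rebalance"); // 轮询方式平均分区
stream.rescale().print("rescale");     // 分组内轮询, 不经过网络传输
env.execute();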
5.4 Sink
Sink有下沉的意思,在Flink中所谓的Sink其实可以表示为将数据存储起来的意思,也可以将范围扩大,表示将处理完的数据发送到指定的存储系统的输出操作.
之前我们一直在使用的print方法其实就是一种Sink
public DataStreamSink<T> print(String sinkIdentifier) {
PrintSinkFunction<T> printFunction = new PrintSinkFunction<>(sinkIdentifier, false);
return addSink(printFunction).name("Print to Std. Out");
}
Flink内置了一些Sink, 除此之外的Sink需要用户自定义!
5.4.1 KafkaSink
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.12</artifactId>
<version>1.13.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.75</version>
</dependency>
package com.atguigu.flink.sink;
import com.alibaba.fastjson.JSON;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import java.util.ArrayList;
public class Test01_KafkaSink {
public static void main(String[] args) throws Exception {
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromCollection(waterSensors)
.map(JSON::toJSONString)
.addSink(new FlinkKafkaProducer<String>("hadoop102:9092","topic_sensor",new SimpleStringSchema()));
env.execute();
}
}
bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_sensor
5.4.2 RedisSink
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
<exclusions>
<exclusion>
<artifactId>flink-streaming-java_2.11</artifactId>
<groupId>org.apache.flink</groupId>
</exclusion>
<exclusion>
<artifactId>flink-runtime_2.11</artifactId>
<groupId>org.apache.flink</groupId>
</exclusion>
<exclusion>
<artifactId>flink-core</artifactId>
<groupId>org.apache.flink</groupId>
</exclusion>
<exclusion>
<artifactId>flink-java</artifactId>
<groupId>org.apache.flink</groupId>
</exclusion>
</exclusions>
</dependency>
package com.atguigu.flink.sink;
import com.alibaba.fastjson.JSON;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;
import java.util.ArrayList;
public class Test02_RedisSink {
public static void main(String[] args) throws Exception {
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
FlinkJedisPoolConfig redisConfig = new FlinkJedisPoolConfig.Builder()
.setHost("hadoop102")
.setPort(6379)
.setMaxTotal(100)
.setTimeout(1000 * 10)
.build();
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromCollection(waterSensors)
.addSink(new RedisSink<>(redisConfig, new RedisMapper<WaterSensor>() {
@Override
public RedisCommandDescription getCommandDescription() {
// 返回存在Redis中的数据类型,存储的是Hash,第二个参数是外面的key
return new RedisCommandDescription(RedisCommand.HSET, "sensor");
}
@Override
public String getKeyFromData(WaterSensor data) {
// 从数据中获取key:Hash的key
return data.getId();
}
@Override
public String getValueFromData(WaterSensor data) {
// 从数据中获取value: Hash的value
return JSON.toJSONString(data);
}
}));
env.execute();
}
}
redis-cli --raw
hgetall sensor
发送了5条数据, Redis中只有2条数据. 原因是Hash的field重复了, 后面的会把前面的覆盖掉。
5.4.3 ElasticsearchSink
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-elasticsearch6 -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch6_2.12</artifactId>
<version>1.13.0</version>
</dependency>
package com.atguigu.flink.sink;
import com.alibaba.fastjson.JSON;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentType;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test03_ElasticsearchSink {
public static void main(String[] args) throws Exception {
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
List<HttpHost> esHosts = Arrays.asList(
new HttpHost("hadoop102", 9200),
new HttpHost("hadoop103", 9200),
new HttpHost("hadoop104", 9200)
);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env
.fromCollection(waterSensors)
.addSink(new ElasticsearchSink.Builder<WaterSensor>(esHosts, new ElasticsearchSinkFunction<WaterSensor>() {
@Override
public void process(WaterSensor element, RuntimeContext ctx, RequestIndexer indexer) {
// 1. 创建es写入请求
IndexRequest request = Requests
.indexRequest("sensor")
.type("_doc")
.id(element.getId())
.source(JSON.toJSONString(element), XContentType.JSON);
// 2. 写入到es
indexer.add(request);
}
}).build());
env.execute();
}
}
5.4.4 自定义Sink
MySQLSink
create database test;
use test;
CREATE TABLE `sensor` (
`id` varchar(20) NOT NULL,
`ts` bigint(20) NOT NULL,
`vc` int(11) NOT NULL,
PRIMARY KEY (`id`,`ts`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.49</version>
</dependency>
package com.atguigu.flink.sink;
import com.atguigu.flink.source.WaterSensor;
import org.apache.flink.api.common.functions.IterationRuntimeContext;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import com.mysql.jdbc.Driver;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
public class Test04_MySink {
public static void main(String[] args) throws Exception {
ArrayList<WaterSensor> waterSensors = new ArrayList<>();
waterSensors.add(new WaterSensor("sensor_1", 1607527992000L, 20));
waterSensors.add(new WaterSensor("sensor_1", 1607527994000L, 50));
waterSensors.add(new WaterSensor("sensor_1", 1607527996000L, 50));
waterSensors.add(new WaterSensor("sensor_2", 1607527993000L, 10));
waterSensors.add(new WaterSensor("sensor_2", 1607527995000L, 30));
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
env.fromCollection(waterSensors)
.addSink(new RichSinkFunction<WaterSensor>() {
private PreparedStatement ps;
private Connection conn;
@Override
public void open(Configuration parameters) throws Exception {
conn = DriverManager.getConnection("jdbc:mysql://hadoop102:3306/test?useSSL=false", "root",
"123456");
ps = conn.prepareStatement("insert into sensor values(?, ?, ?)");
}
@Override
public void close() throws Exception {
ps.close();
conn.close();
}
@Override
public void invoke(WaterSensor value, Context context) throws Exception {
ps.setString(1, value.getId());
ps.setLong(2, value.getTs());
ps.setInt(3, value.getVc());
ps.execute();
}
});
env.execute();
}
}
5.4.5 JDBCSink
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc_2.12</artifactId>
<version>1.13.0</version>
</dependency>
package com.atguigu.flink.sink;
import com.atguigu.flink.source.WaterSensor;
import com.mysql.jdbc.Driver;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Test05_JDBCSink {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
SingleOutputStreamOperator<WaterSensor> result = streamSource.map(new MapFunction<String, WaterSensor>() {
@Override
public WaterSensor map(String value) throws Exception {
String[] split = value.split(",");
WaterSensor waterSensor = new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
return waterSensor;
}
});
result.addSink(JdbcSink.sink(
"insert into sensor values (? ? ?)",
(ps, t) -> {
ps.setString(1, t.getId());
ps.setLong(2, t.getTs());
ps.setInt(3, t.getVc());
},
new JdbcExecutionOptions.Builder()
// 一条一条写 和ES中类似
.withBatchSize(1)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://hadoop102:3306/test?useSSL=false")
.withUsername("root")
.withPassword("123456")
.withDriverName(Driver.class.getName())
.build()
));
env.execute();
}
}
5.5 执行模式(Execution Mode)
Flink在1.12.0上对流式API新增一项特性:可以根据你的使用情况和Job的特点, 可以选择不同的运行时执行模式(runtime execution modes)。
流式API的传统执行模式我们称之为 STREAMING 执行模式, 这种模式一般用于无界流, 需要持续的在线处理。
1.12.0新增了一个BATCH执行模式, 这种执行模式在执行方式上类似于MapReduce框架, 一般用于有界数据, 目的是为了实现流批一体。
默认是使用的STREAMING 执行模式。
5.5.1 选择执行模式
BATCH执行模式仅仅用于有界数据, 而STREAMING 执行模式可以用在有界数据和无界数据。
一个公用的规则就是: 当你处理的数据是有界的就应该使用 BATCH 执行模式, 因为它更加高效. 当你的数据是无界的, 则必须使用STREAMING 执行模式, 因为只有这种模式才能处理持续的数据流。
5.5.2 配置BATCH执行模式
执行模式有3个选择可配置:
- STREAMING(默认)
- BATCH
- AUTOMATIC
// 1 通过命令行配置
bin/flink run -Dexecution.runtime-mode=BATCH ...
// 2 通过代码配置
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
建议: 不要在代码中配置, 而是使用命令行配置, 因为这样更灵活: 同一个应用既可以用于无界数据也可以用于有界数据。注意: 无界数据不能使用BATCH模式。
5.5.3 有界数据用STREAMING和BATCH的区别
STREAMING模式下, 数据是来一条输出一次结果。
BATCH模式下, 数据处理完之后, 一次性输出结果。
下面展示WordCount的程序读取文件内容在不同执行模式下的执行结果对比:
// 默认流式模式,可以不用配置
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
// 批处理模式
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// 自动模式
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
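下面给出一个用于对比的 WordCount 示意程序(从前面用过的 input 目录读取文本,包名等仅为假设):STREAMING 模式下每来一条数据就输出一次累加结果,BATCH 模式下每个单词只输出最终统计结果。
package com.atguigu.flink.wordcount;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class Test_ExecutionMode_WordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 切换 STREAMING / BATCH 即可对比输出结果
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env
            .readTextFile("input")
            .flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
                for (String word : line.split(" ")) {
                    out.collect(Tuple2.of(word, 1L));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .sum(1)
            .print();
        env.execute();
    }
}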