How do we guarantee data sharing when performing a connect operation on two parallel data sources?
1. Scenario:
- In the project, integrating the two data sources led to some data being lost.
- Requirement: translate the domain field in the Kafka data into a userid by looking it up in the MySQL data.
- Kafka(ip, domain, traffic) + MySQL(userid, domain), joined via Flink connect ==> Result(ip, userid, traffic)
- Data source 1: log records from Kafka (containing ip, domain, and traffic).
- Data source 2: configuration data in MySQL (the mapping between domain and userid); see the source sketches right below.
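The code in section 2 refers to two custom sources that are not shown in the snippet: getConsumer (the Kafka consumer) and JdbcSourceFunction (the MySQL source). Below is a minimal sketch of what they might look like, assuming a plain-string Kafka topic and a MySQL table user_domain_config(userid, domain); the topic name, table name, column names, and connection settings are placeholders, not values from the original project.

import java.sql.{Connection, DriverManager, PreparedStatement}
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import scala.collection.mutable

// Kafka source: reads the raw access log as strings (broker, group id, and topic are placeholders).
def getConsumer: FlinkKafkaConsumer[String] = {
  val props = new Properties()
  props.setProperty("bootstrap.servers", "localhost:9092")
  props.setProperty("group.id", "log-analysis")
  new FlinkKafkaConsumer[String]("access-log", new SimpleStringSchema(), props)
}

// MySQL source: loads the domain -> userid mapping and emits it as a single HashMap element,
// which matches the second input type of the CoFlatMapFunction used later.
class JdbcSourceFunction extends RichSourceFunction[mutable.HashMap[String, String]] {
  private var connection: Connection = _
  private var statement: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/flink", "root", "root")
    statement = connection.prepareStatement("select userid, domain from user_domain_config")
  }

  override def run(ctx: SourceFunction.SourceContext[mutable.HashMap[String, String]]): Unit = {
    val map = new mutable.HashMap[String, String]()
    val rs = statement.executeQuery()
    while (rs.next()) {
      map += rs.getString("domain") -> rs.getString("userid")
    }
    ctx.collect(map)
  }

  override def cancel(): Unit = {}

  override def close(): Unit = {
    if (statement != null) statement.close()
    if (connection != null) connection.close()
  }
}

The sketch emits the mapping once; a production version would typically re-read the table periodically so configuration changes are picked up.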
2. The code is as follows:
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
import scala.collection.mutable

val env = StreamExecutionEnvironment.getExecutionEnvironment
// use event time as the time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val data = env.addSource(getConsumer)
// data read from Kafka
val logData = data.map(new MapSplitFunction)
.filter(_._1 == "E")
.filter(_._2 != 0)
.map(x => {
(x._2, x._3, x._4)
})
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator) // assign event-time timestamps and watermarks (sketched after this pipeline)
.keyBy(1) // key by domain
.window(TumblingEventTimeWindows.of(Time.seconds(10))) // 10-second tumbling event-time windows
.apply(new MyWindowFunction) // aggregate the traffic per domain within each window
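// The BoundedOutOfOrdernessGenerator used above is also not part of the snippet; the sketch below
// shows one possible implementation (in a real project it would live in its own file). It assumes
// the pre-window records are (time, domain, traffic) tuples with the event time in milliseconds in
// field _1 and tolerates 5 seconds of out-of-orderness; both are assumptions, not taken from the
// original project.
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark
class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[(Long, String, Long)] {
  val maxOutOfOrderness = 5000L // allow events up to 5 seconds late
  var currentMaxTimestamp: Long = _
  override def extractTimestamp(element: (Long, String, Long), previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = math.max(element._1, currentMaxTimestamp)
    element._1
  }
  override def getCurrentWatermark(): Watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}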
// data read from MySQL (the domain -> userid mapping)
val sqlData = env.addSource(new JdbcSourceFunction)
// connect the two streams
val connectData = logData.connect(sqlData)
.flatMap(new CoFlatMapFunction[(String, String, Long), mutable.HashMap[String, String], (String, String, Long)] {
// per-subtask cache of the domain -> userid mapping received from the MySQL stream
var userDomainMap = new mutable.HashMap[String, String]()
override def flatMap1(value: (String, String, Long), out: Collector[(String, String, Long)]): Unit = {
val domain = value._2
// getOrElse instead of get, so a plain String is emitted rather than an Option
val userid = userDomainMap.getOrElse(domain, "")
out.collect((value._1, domain + "=" + userid, value._3))
}
override def flatMap2(value: