Flink Dual-Stream Join

Contents

DataStream API (functional programming)

window Join

join

coGroup

interval Join

Table API (Flink SQL)

Regular Join

inner join

left join / right join

full join

interval join

lookup join

Window Join

INNER/LEFT/RIGHT/FULL OUTER 


DataStream API (functional programming)

window Join

join

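Joins elements of the two streams that fall into the same window.

Time semantics: processing time, event time

Limitations:

1. Elements that do not fall into the same window cannot be joined.
2. Only inner join is supported.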

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream.join(greenStream)
    .where(<KeySelector>)   // key of the left stream
    .equalTo(<KeySelector>) // key of the right stream
    // window assigner: tumbling, sliding, or session
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
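To make the first limitation concrete, here is a minimal plain-Java sketch of how a tumbling window groups elements. It is not Flink code: the class `TumblingWindowJoinSketch`, its helper names, and the sample timestamps are assumptions made for illustration, keys are left out for brevity, and timestamps are assumed to be non-negative with no window offset.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of tumbling-window join semantics; this is not Flink code.
class TumblingWindowJoinSketch {

    // Start of the tumbling window that contains the timestamp
    // (assumes non-negative timestamps and no window offset).
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - (timestamp % windowSize);
    }

    // Inner join: a pair is emitted only when both timestamps map to the same window.
    static List<String> join(long[] leftTs, long[] rightTs, long windowSize) {
        List<String> out = new ArrayList<>();
        for (long l : leftTs) {
            for (long r : rightTs) {
                if (windowStart(l, windowSize) == windowStart(r, windowSize)) {
                    out.add(l + "," + r);
                }
            }
        }
        return out;
    }
}
```

With a window size of 10, timestamps 9 and 10 are only one unit apart, yet they land in windows [0, 10) and [10, 20) and therefore never meet.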

coGroup

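coGroup is the lower-level operation underneath join. With coGroup you can implement all four join types: inner, left, right, and full.

Time semantics: processing time, event time

Limitation: elements that do not fall into the same window cannot be joined.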

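import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

latedata.coGroup(stream)
    .where(a -> a.getStr("a"))
    .equalTo(a -> a.getStr("a"))
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .apply(new CoGroupFunction<JSONObject, JSONObject, Object>() {
        @Override
        public void coGroup(Iterable<JSONObject> left, Iterable<JSONObject> right, Collector<Object> out) throws Exception {
            // left and right hold all elements of each stream for one key and one window;
            // either side may be empty, which is what makes outer joins possible.
        }
    });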
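As an illustration of how the empty coGroup body could be filled in, the sketch below shows left-outer-join logic in plain Java. It is not Flink code: the class `CoGroupLeftJoinSketch`, the `String` element type, and the use of a `List` in place of the `Collector` are assumptions. It iterates the right side once per left element, so copy the right side into a list first if the iterable you receive can only be traversed once.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of left-outer-join logic for a coGroup body; this is not Flink code.
class CoGroupLeftJoinSketch {

    // `left` and `right` stand for the two Iterables that coGroup receives for
    // one key and one window; `out` plays the role of the Collector.
    static void leftJoin(Iterable<String> left, Iterable<String> right, List<String> out) {
        for (String l : left) {
            boolean matched = false;
            for (String r : right) {
                out.add(l + "," + r);
                matched = true;
            }
            if (!matched) {
                // No right-side element in this window: keep the left element, pad with null.
                out.add(l + ",null");
            }
        }
    }
}
```

A right join swaps the roles of the two sides, and a full join additionally emits right elements when the left side is empty.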

interval Join

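Interval join addresses the main limitation of window join: elements in different windows cannot be joined.

Time semantics: event time

For each element of the left stream, interval join takes its timestamp, looks a configured distance backward and forward in the right stream, and joins the right-stream elements whose timestamps fall inside that range.

Limitation: the interval is always anchored on the left stream's timestamps, so only inner join is supported.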

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    // the right timestamp must lie in [left - 2 ms, left + 1 ms]
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {

        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
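The matching rule behind between(lower, upper) can be sketched in plain Java as follows. This is not Flink code: the class `IntervalJoinSketch` and the sample timestamps are assumptions, keys are omitted, and both bounds are treated as inclusive, which is the default for between.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the interval-join condition; this is not Flink code.
class IntervalJoinSketch {

    // A right element matches when left + lower <= right <= left + upper.
    static boolean inInterval(long leftTs, long rightTs, long lowerBound, long upperBound) {
        return rightTs >= leftTs + lowerBound && rightTs <= leftTs + upperBound;
    }

    // Inner join over timestamps only: emits "left,right" for every matching pair.
    static List<String> join(long[] leftTs, long[] rightTs, long lowerBound, long upperBound) {
        List<String> out = new ArrayList<>();
        for (long l : leftTs) {
            for (long r : rightTs) {
                if (inInterval(l, r, lowerBound, upperBound)) {
                    out.add(l + "," + r);
                }
            }
        }
        return out;
    }
}
```

With bounds of -2 and 1, a left element at timestamp 4 accepts right timestamps 2 through 5. If the right stream has nothing in that range, the left element produces no output, which is the inner-join behavior noted above.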

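Table API (Flink SQL)

Regular Join

By default a regular join has no time bound: any row can join with any other row at any time, so state keeps growing and will eventually become unmanageable.

A state retention time must therefore be configured:

tableEnv.getConfig().setIdleStateRetention(xx)

Once a retention time is set, the four join types expire their state differently.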

inner join

For an inner join, rows from both the left and the right stream start their expiration countdown as soon as they are created.

SELECT *
FROM Orders
INNER JOIN Product
ON Orders.product_id = Product.id

left join / right join

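left: a left-stream row starts its expiration countdown when it is created, but every successful join resets the countdown.

right: a right-stream row starts its expiration countdown when it is created, but every successful join resets the countdown.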

SELECT *
FROM Orders
LEFT JOIN Product
ON Orders.product_id = Product.id

SELECT *
FROM Orders
RIGHT JOIN Product
ON Orders.product_id = Product.id

full join

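Rows from both the left and the right stream start their expiration countdown when they are created, but every successful join resets the countdown.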

SELECT *
FROM Orders
FULL OUTER JOIN Product
ON Orders.product_id = Product.id
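The two expiration behaviors described in this section can be modeled with a small plain-Java sketch. It mirrors the description in this article only and makes no claim about how Flink implements state TTL internally; the class `JoinStateTtlSketch`, its fields, and the explicit `now` parameter are assumptions.

```java
// Plain-Java model of the expiration rules described above; this is not Flink's state backend.
class JoinStateTtlSketch {
    private final long ttlMillis;
    // false: countdown fixed at creation (inner join as described above)
    // true:  countdown reset by every successful join (left/right/full as described above)
    private final boolean refreshOnJoin;
    private long expiresAt;

    JoinStateTtlSketch(long createdAt, long ttlMillis, boolean refreshOnJoin) {
        this.ttlMillis = ttlMillis;
        this.refreshOnJoin = refreshOnJoin;
        this.expiresAt = createdAt + ttlMillis;
    }

    // Called when the stored row is matched by a row from the other stream.
    void onSuccessfulJoin(long now) {
        if (refreshOnJoin && !isExpired(now)) {
            expiresAt = now + ttlMillis;
        }
    }

    boolean isExpired(long now) {
        return now >= expiresAt;
    }
}
```

With a retention of 100 ms and a match at 80 ms, the fixed-countdown row is gone at 120 ms, while the refreshed row survives until 180 ms.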

interval join

The SQL interval join is an upgraded version of the DataStream API interval join: it supports both processing-time and event-time semantics.

SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.order_id
AND o.order_time BETWEEN s.ship_time - INTERVAL '4' HOUR AND s.ship_time

lookup join

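The effect is similar to CDC, but every incoming record triggers a query against the database to perform the join, which is inefficient.

A cache can be configured so that a row, once looked up, is kept for a specified time. During that time, however, changes in MySQL are not picked up in real time, and the join then behaves much like a regular join.

Lookup join is therefore suited to dictionary (dimension) data that does change, but whose changes do not need to be visible in real time; some delay is acceptable.

Always use a left join (LEFT JOIN) for lookup joins. A lookup join uses the processing-time temporal join syntax; with an (INNER) JOIN, a row from the main stream is dropped whenever its dimension row does not exist, so take care.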
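The difference between the two join kinds on a lookup miss can be illustrated with a plain-Java sketch. This is not Flink code: the class `LookupJoinKindSketch`, the in-memory `Map` standing in for the dimension table, and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of INNER vs LEFT behavior on a lookup miss; this is not Flink code.
class LookupJoinKindSketch {

    // Enriches each customer id with its country from the dimension map.
    static List<String> enrich(List<Integer> customerIds, Map<Integer, String> dimension, boolean leftJoin) {
        List<String> out = new ArrayList<>();
        for (Integer id : customerIds) {
            String country = dimension.get(id);
            if (country != null) {
                out.add(id + "," + country);
            } else if (leftJoin) {
                // LEFT JOIN keeps the main-stream row and pads the dimension columns with null.
                out.add(id + ",null");
            }
            // INNER JOIN: the main-stream row is silently dropped.
        }
        return out;
    }
}
```

For customer ids 1 and 2 with only id 1 present in the dimension data, the inner variant emits one row and the left variant emits two.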

Key clause

FOR SYSTEM_TIME AS OF o.proc_time

lookup.cache.max-rows

Required: optional. Default: none. Type: Integer.

The maximum number of rows in the lookup cache; beyond this value, the oldest rows are expired. The lookup cache is disabled by default. See the Lookup Cache section of the connector documentation for more details.

In short: the maximum number of rows to cache.

lookup.cache.ttl

Required: optional. Default: none. Type: Duration.

The maximum time to live for each row in the lookup cache; beyond this time, the oldest rows are expired. The lookup cache is disabled by default. See the Lookup Cache section of the connector documentation for more details.

In short: the TTL of cached rows, for example 1 DAY or 1 HOUR.
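The trade-off between the two cache options and data freshness can be seen in a small plain-Java sketch. It is not the connector's implementation: the class `LookupCacheSketch`, the `Function` standing in for the database query, and the explicit `now` parameter are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java sketch of a lookup cache bounded by row count and TTL; this is not connector code.
class LookupCacheSketch {

    private static final class CachedRow {
        final String value;
        final long loadedAt;

        CachedRow(String value, long loadedAt) {
            this.value = value;
            this.loadedAt = loadedAt;
        }
    }

    private final long ttlMillis;
    private final Function<Integer, String> database;
    private final LinkedHashMap<Integer, CachedRow> cache;
    int databaseHits = 0; // number of queries that actually reached the database

    LookupCacheSketch(final int maxRows, long ttlMillis, Function<Integer, String> database) {
        this.ttlMillis = ttlMillis;
        this.database = database;
        // Insertion-ordered map that evicts the oldest row once maxRows is exceeded.
        this.cache = new LinkedHashMap<Integer, CachedRow>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, CachedRow> eldest) {
                return size() > maxRows;
            }
        };
    }

    String lookup(int key, long now) {
        CachedRow row = cache.get(key);
        if (row != null && now - row.loadedAt < ttlMillis) {
            return row.value; // served from cache; may be stale relative to the database
        }
        cache.remove(key); // re-insert below so the refreshed row counts as the newest
        String fresh = database.apply(key);
        databaseHits++;
        cache.put(key, new CachedRow(fresh, now));
        return fresh;
    }
}
```

Within the TTL a changed database row is not seen, which is the delay described above; once the TTL has passed, the next lookup goes back to the database.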

CREATE TEMPORARY TABLE Orders (
  id INT,
  order_id INT,
  customer_id INT,
  total INT,
  proc_time AS PROCTIME()
) WITH (
  'connector' = 'kafka',
  ...
);

-- Customers is backed by the JDBC connector and can be used for lookup joins
CREATE TEMPORARY TABLE Customers (
  id INT,
  name STRING,
  country STRING,
  zip STRING
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://mysqlhost:3306/customerdb',
  'table-name' = 'customers',
  'lookup.cache.max-rows' = '10',
  'lookup.cache.ttl' = '1 hour'
);

-- enrich each order with customer information
SELECT o.order_id, o.total, c.country, c.zip
FROM Orders AS o
-- prefer LEFT JOIN; INNER JOIN is rarely the right choice here
LEFT JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;

-- Flink 1.16+ adds lookup hints with a retry strategy: on a lookup miss, retry up to 600 times at 1-second intervals.
-- In this synchronous mode the retries block the main stream: the next record is processed only after they finish.
SELECT /*+ LOOKUP('table'='c', 'retry-predicate'='lookup_miss', 'retry-strategy'='fixed_delay', 'fixed-delay'='1s', 'max-attempts'='600') */
o.order_id, o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers
FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;
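The blocking behavior of a fixed-delay retry on lookup miss can be sketched in plain Java. This is not Flink's implementation: the class `LookupRetrySketch`, the injected `sleeper`, and the choice to count the first lookup as one attempt are assumptions.

```java
import java.util.function.Function;
import java.util.function.LongConsumer;

// Plain-Java sketch of a synchronous fixed-delay retry on lookup miss; this is not Flink code.
class LookupRetrySketch {

    // Repeats the lookup while it returns null (a "lookup miss"), waiting a fixed delay
    // between attempts. The caller is blocked until a row is found or attempts run out.
    // Assumption: the first lookup counts as attempt 1.
    static String lookupWithRetry(int key, Function<Integer, String> lookup,
                                  int maxAttempts, long fixedDelayMillis, LongConsumer sleeper) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String row = lookup.apply(key);
            if (row != null) {
                return row;
            }
            if (attempt < maxAttempts) {
                sleeper.accept(fixedDelayMillis); // e.g. a wrapper around Thread.sleep
            }
        }
        return null; // still missing after all attempts
    }
}
```

Because the loop does not return until it succeeds or gives up, one persistently missing key can hold up every record queued behind it for close to max-attempts times the delay.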

-- asynchronous retry; requires 'output-mode'='allow_unordered', i.e. output order is not preserved
SELECT /*+ LOOKUP('table'='c', 'retry-predicate'='lookup_miss', 'output-mode'='allow_unordered', 'retry-strategy'='fixed_delay', 'fixed-delay'='1s', 'max-attempts'='600') */
o.order_id, o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers /*+ OPTIONS('lookup.async'='true', 'lookup.async-thread-number'='16') */
FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;

Reference: Lookup Join and Hints, Apache Flink documentation.

Window Join

A window join can only be used after a windowing TVF has been applied to the tables:

TABLE(TUMBLE(TABLE tablegreen, DESCRIPTOR(rt), INTERVAL '5' MINUTES))

Time semantics: processing time, event time

INNER/LEFT/RIGHT/FULL OUTER 

SELECT ...
FROM L [LEFT|RIGHT|FULL OUTER] JOIN R -- L and R are relations applied windowing TVF
ON L.window_start = R.window_start AND L.window_end = R.window_end AND ...
