Small-table join order and broadcast issues

1. Background

Recently, DI ingestion of the data reported by the CRM platform kept running into problems, so we needed a basic reconciliation feature that compares the CRM platform's reported data against the backend statistics. The rough steps:

  1. CRM ingests the data into a near-real-time log table via Flink + Hive
  2. From the stats backend, get all tasks with entries yesterday and each task's user count
  3. Using those backend tasks, count yesterday's users per task from the log table
  4. Compare each task's backend user count with the user count computed from the log table

At step 3, i.e. when counting yesterday's users per task, the Spark job hit an OOM.

2. Analysis

Example code:

explain
with be as (
  select
    task_tab.region as region,
    journey_tab.journey_name as journey_name,
    task_tab.journey_id as journey_id,
    task_tab.version_id as version_id,
    task_tab.id as task_id,
    task_tab.total_user_count as be_user_count
  from table_xx1 as task_tab
  left join table_xx2 as journey_tab on journey_tab.id = task_tab.journey_id and journey_tab.region = task_tab.region
  left join table_xx3 as version_tab on version_tab.id = task_tab.version_id and version_tab.region = task_tab.region
  where task_tab.create_time >= unix_timestamp('2022-11-03', 'yyyy-MM-dd')
    and task_tab.create_time < unix_timestamp('2022-11-04', 'yyyy-MM-dd')
    and get_json_object(version_tab.schedule_setting,'$.repeatSetting.frequency') = 1
),
de as (
  select
    be.region as region,
    be.journey_name as journey_name,
    be.journey_id as journey_id,
    be.version_id as version_id,
    be.task_id as task_id,
    count(distinct user_id) as de_user_count
  from be
  left join table_xx4 as tmp on tmp.grass_region = be.region
    and tmp.journey_id = be.journey_id
    and tmp.task_id = be.task_id
    and tmp.grass_date = '2022-11-03'
  group by 1,2,3,4,5
)
select
  be.region,
  be.journey_name,
  be.journey_id,
  be.version_id,
  be.task_id,
  be.be_user_count,
  de.de_user_count
from be
left join de on de.region = be.region and de.task_id = be.task_id
;

The job kept getting stuck on this piece of SQL:

select
  be.region as region,
  be.journey_name as journey_name,
  be.journey_id as journey_id,
  be.version_id as version_id,
  be.task_id as task_id,
  count(distinct user_id) as de_user_count
from be
left join table_xx4 as tmp on tmp.grass_region = be.region
  and tmp.journey_id = be.journey_id
  and tmp.task_id = be.task_id
  and tmp.grass_date = '2022-11-03'
group by 1,2,3,4,5

Digging in, the temporary table be actually holds only a few dozen rows, while the log table ods_crm_di_user_task_rt_live (table_xx4 in the examples) still has around 2 billion rows after filtering. In theory the small-table broadcast optimization should kick in and the query should not be this slow, yet the YARN UI showed massive disk spills and shuffle activity.

Preliminary analysis points to two issues: first, the join order of the small and large tables; second, the physical planner cannot determine the size of a temporary (CTE) result. The temporary table be is not recognized as a small table, so it cannot be optimized and broadcast, and a plain left join (SortMergeJoin) is used instead. Detailed analysis follows.
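For context, auto-broadcast is governed by spark.sql.autoBroadcastJoinThreshold: a table is broadcast automatically only if its estimated size is below this threshold. A quick sanity check (standard Spark SQL; the value shown is Spark's default, not necessarily our cluster's setting):

-- show the current auto-broadcast threshold
SET spark.sql.autoBroadcastJoinThreshold;

-- default is 10MB (10485760 bytes); setting it to -1 disables auto-broadcast
SET spark.sql.autoBroadcastJoinThreshold = 10485760;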

2.1. Temporary small table joining the large table and broadcast

2.1.1. Small temporary table LEFT JOIN large table: not broadcast

Running EXPLAIN on the original SQL shows the left join is actually executed as a SortMergeJoin; the small table is not broadcast and the expected BroadcastHashJoin never appears.
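For reference, the join fragment of that physical plan looks roughly like this (an illustrative, abridged sketch; expression IDs and surrounding operators are elided):

SortMergeJoin [region#.., journey_id#.., task_id#..], [grass_region#.., journey_id#.., task_id#..], LeftOuter
:- Sort ... +- Exchange hashpartitioning(region#.., journey_id#.., task_id#.., 200)
+- Sort ... +- Exchange hashpartitioning(grass_region#.., journey_id#.., task_id#.., 200)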

2.1.2. Small temporary table INNER JOIN large table: not broadcast

Changing the left join to an inner join does not affect the result we want (see the fragment below). The physical plan, however, still uses SortMergeJoin for the inner join; it is still not optimized into a BroadcastHashJoin.
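The only change relative to 2.1.1 is the join keyword in the de CTE:

  from be
  inner join table_xx4 as tmp on tmp.grass_region = be.region
    and tmp.journey_id = be.journey_id
    and tmp.task_id = be.task_id
    and tmp.grass_date = '2022-11-03'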

2.1.3. Large table LEFT JOIN small temporary table: not broadcast

Swapping the left join order so the large table comes first and the small table second (see the reconstructed snippet below) still does not turn the SortMergeJoin into a BroadcastHashJoin!
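A sketch of this variant's de CTE (reconstructed from the description above; the grass_date filter moves to WHERE so it still restricts table_xx4 under the left join):

de as (
  select
    be.region as region,
    be.journey_name as journey_name,
    be.journey_id as journey_id,
    be.version_id as version_id,
    be.task_id as task_id,
    count(distinct user_id) as de_user_count
  from table_xx4 as tmp
  left join be on tmp.grass_region = be.region
    and tmp.journey_id = be.journey_id
    and tmp.task_id = be.task_id
  where tmp.grass_date = '2022-11-03'
  group by 1,2,3,4,5
)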

2.2. BroadcastHashJoin and hint usage

2.2.1. Small temporary table LEFT JOIN large table + broadcast hint: not broadcast

Example code. The BROADCAST(be) hint is ignored here: in a LEFT OUTER JOIN only the non-preserved right side can serve as the broadcast build side, and be sits on the preserved left side, so the plan remains a SortMergeJoin.

explain
with be as (
    select
        task_tab.region as region,
        journey_tab.journey_name as journey_name,
        task_tab.journey_id as journey_id,
        task_tab.version_id as version_id,
        task_tab.id as task_id,
        task_tab.total_user_count as be_user_count
    from table_xx1 as task_tab
    left join table_xx2 as journey_tab on journey_tab.id = task_tab.journey_id and journey_tab.region = task_tab.region
    left join table_xx3 as version_tab on version_tab.id = task_tab.version_id and version_tab.region = task_tab.region
    where task_tab.create_time >= unix_timestamp('2022-11-03', 'yyyy-MM-dd')
        and task_tab.create_time < unix_timestamp('2022-11-04', 'yyyy-MM-dd')
        and get_json_object(version_tab.schedule_setting,'$.repeatSetting.frequency') = 1
),
de as (
    select
        /*+ BROADCAST(be) */
        be.region as region,
        be.journey_name as journey_name,
        be.journey_id as journey_id,
        be.version_id as version_id,
        be.task_id as task_id,
        count(distinct user_id) as de_user_count
    from be
    left join table_xx4 as tmp on tmp.grass_region = be.region
        and tmp.journey_id = be.journey_id
        and tmp.task_id = be.task_id
        and tmp.grass_date = '2022-11-03'
    group by 1,2,3,4,5
)
select
    be.region,
    be.journey_name,
    be.journey_id,
    be.version_id,
    be.task_id,
    be.be_user_count,
    de.de_user_count
from be
left join de on de.region = be.region and de.task_id = be.task_id
;

2.2.2. Small temporary table INNER JOIN large table + broadcast hint: broadcast

Example code. With an inner join either side may be the build side, so the hint is honored and the SortMergeJoin becomes a BroadcastHashJoin.

explain
with be as (
    select
        task_tab.region as region,
        journey_tab.journey_name as journey_name,
        task_tab.journey_id as journey_id,
        task_tab.version_id as version_id,
        task_tab.id as task_id,
        task_tab.total_user_count as be_user_count
    from table_xx1 as task_tab
    left join table_xx2 as journey_tab on journey_tab.id = task_tab.journey_id and journey_tab.region = task_tab.region
    left join table_xx3 as version_tab on version_tab.id = task_tab.version_id and version_tab.region = task_tab.region
    where task_tab.create_time >= unix_timestamp('2022-11-03', 'yyyy-MM-dd')
        and task_tab.create_time < unix_timestamp('2022-11-04', 'yyyy-MM-dd')
        and get_json_object(version_tab.schedule_setting,'$.repeatSetting.frequency') = 1
),
de as (
    select
        /*+ BROADCAST(be) */
        be.region as region,
        be.journey_name as journey_name,
        be.journey_id as journey_id,
        be.version_id as version_id,
        be.task_id as task_id,
        count(distinct user_id) as de_user_count
    from be
    inner join table_xx4 as tmp on tmp.grass_region = be.region
        and tmp.journey_id = be.journey_id
        and tmp.task_id = be.task_id
        and tmp.grass_date = '2022-11-03'
    group by 1,2,3,4,5
)
select
    be.region,
    be.journey_name,
    be.journey_id,
    be.version_id,
    be.task_id,
    be.be_user_count,
    de.de_user_count
from be
left join de on de.region = be.region and de.task_id = be.task_id
;
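Once the hint takes effect, the join fragment of the plan changes to roughly (again an illustrative sketch):

BroadcastHashJoin [region#.., journey_id#.., task_id#..], [grass_region#.., journey_id#.., task_id#..], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(...)
:  +- ... (the be subquery)
+- ... (scan and filter of table_xx4)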

2.2.3. Large table LEFT JOIN small temporary table + broadcast hint: broadcast

Example code. With the large table first, be is on the non-preserved right side and can legally be broadcast; since Spark still cannot estimate the CTE's size, the hint is still needed.

explain
with be as (
    select
        task_tab.region as region,
        journey_tab.journey_name as journey_name,
        task_tab.journey_id as journey_id,
        task_tab.version_id as version_id,
        task_tab.id as task_id,
        task_tab.total_user_count as be_user_count
    from table_xx1 as task_tab
    left join table_xx2 as journey_tab on journey_tab.id = task_tab.journey_id and journey_tab.region = task_tab.region
    left join table_xx3 as version_tab on version_tab.id = task_tab.version_id and version_tab.region = task_tab.region
    where task_tab.create_time >= unix_timestamp('2022-11-03', 'yyyy-MM-dd')
        and task_tab.create_time < unix_timestamp('2022-11-04', 'yyyy-MM-dd')
        and get_json_object(version_tab.schedule_setting,'$.repeatSetting.frequency') = 1
),
de as (
    select
        /*+ BROADCAST(be) */
        be.region as region,
        be.journey_name as journey_name,
        be.journey_id as journey_id,
        be.version_id as version_id,
        be.task_id as task_id,
        count(distinct user_id) as de_user_count
    from table_xx4 as tmp
    left join be on tmp.grass_region = be.region
        and tmp.journey_id = be.journey_id
        and tmp.task_id = be.task_id
        and tmp.grass_date = '2022-11-03'
    group by 1,2,3,4,5
)
select
    be.region,
    be.journey_name,
    be.journey_id,
    be.version_id,
    be.task_id,
    be.be_user_count,
    de.de_user_count
from be
left join de on de.region = be.region and de.task_id = be.task_id
;
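For reference, Spark SQL accepts several equivalent names for the broadcast hint, so any of the following spellings would request the same optimization:

select /*+ BROADCAST(be) */ ...
select /*+ BROADCASTJOIN(be) */ ...
select /*+ MAPJOIN(be) */ ...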

2.3. Small intermediate table joining the large table and broadcast

Materialize the temporary small table into an intermediate table:

drop table if exists temp;
create table temp as
select
    task_tab.region as region,
    journey_tab.journey_name as journey_name,
    task_tab.journey_id as journey_id,
    task_tab.version_id as version_id,
    task_tab.id as task_id,
    task_tab.total_user_count as be_user_count
from table_xx1 as task_tab
left join table_xx2 as journey_tab on journey_tab.id = task_tab.journey_id and journey_tab.region = task_tab.region
left join table_xx3 as version_tab on version_tab.id = task_tab.version_id and version_tab.region = task_tab.region
where task_tab.create_time >= unix_timestamp('2022-11-03', 'yyyy-MM-dd')
    and task_tab.create_time < unix_timestamp('2022-11-04', 'yyyy-MM-dd')
    and get_json_object(version_tab.schedule_setting,'$.repeatSetting.frequency') = 1
;
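Materializing matters because Spark can now estimate the table's size from its files and catalog statistics. If the estimate looks stale, it can be refreshed and inspected explicitly (standard Spark SQL):

-- compute table-level statistics (size in bytes, row count) for the optimizer
ANALYZE TABLE temp COMPUTE STATISTICS;

-- check the Statistics entry the optimizer will use
DESCRIBE EXTENDED temp;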

2.3.1. Small intermediate table LEFT JOIN large table: not broadcast

Example code. Even though temp's size is now known, it sits on the preserved left side of the left join, so it still cannot be broadcast.

explain
select
    tmp.grass_region as region,
    tmp.journey_name as journey_name,
    tmp.journey_id as journey_id,
    tmp.version_id as version_id,
    tmp.task_id as task_id,
    count(distinct user_id) as de_user_count
from temp as be
left join table_xx4 as tmp on tmp.grass_region = be.region
    and tmp.journey_id = be.journey_id
    and tmp.task_id = be.task_id
    and tmp.grass_date = '2022-11-03'
    and tmp.grass_region != 'ID'
group by 1,2,3,4,5;

2.3.2. Small intermediate table INNER JOIN large table: broadcast

Example code. With an inner join, Spark sees that temp is small and broadcasts it automatically, with no hint required.

explain
select
    tmp.grass_region as region,
    tmp.journey_name as journey_name,
    tmp.journey_id as journey_id,
    tmp.version_id as version_id,
    tmp.task_id as task_id,
    count(distinct user_id) as de_user_count
from temp as be
inner join table_xx4 as tmp on tmp.grass_region = be.region
    and tmp.journey_id = be.journey_id
    and tmp.task_id = be.task_id
    and tmp.grass_date = '2022-11-03'
    and tmp.grass_region != 'ID'
group by 1,2,3,4,5;

2.3.3. Large table LEFT JOIN small intermediate table: broadcast

Example code. With temp on the non-preserved right side, it is likewise auto-broadcast without a hint.

explain
select
    tmp.grass_region as region,
    tmp.journey_name as journey_name,
    tmp.journey_id as journey_id,
    tmp.version_id as version_id,
    tmp.task_id as task_id,
    count(distinct user_id) as de_user_count
from table_xx4 as tmp
left join temp as be on tmp.grass_region = be.region
    and tmp.journey_id = be.journey_id
    and tmp.task_id = be.task_id
    and tmp.grass_date = '2022-11-03'
    and be.task_id is not null
group by 1,2,3,4,5;

3. Summary

  1. Temporary tables: Spark SQL cannot estimate the size of a temporary table produced by complex SQL, so even one holding only a handful of rows is not recognized as small and is never auto-broadcast.
  2. left join order: when a small table LEFT JOINs a large table, the small table is never broadcast, because in a LEFT OUTER JOIN only the non-preserved right side can act as the broadcast build side; to broadcast the small table, the large table must LEFT JOIN the small table (reproduced in the demo after this list).
  3. Explicit broadcast via hint: for a small temporary table INNER JOIN a large table, or a large table LEFT JOIN a small temporary table, a hint can explicitly request the broadcast and turn the SortMergeJoin into a BroadcastHashJoin.
  4. Intermediate table instead of a temporary table: once the small temporary table is materialized into an intermediate table, Spark SQL can determine its size on its own; a small intermediate table INNER JOIN a large table, or a large table LEFT JOIN a small intermediate table, is then optimized from SortMergeJoin to BroadcastHashJoin without any hint.
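The rule in point 2 is easy to reproduce in isolation (a self-contained sketch; demo_small and demo_big are made-up table names for illustration):

-- a tiny table and a large one (made-up demo tables)
create table demo_small as select 1 as id, 'a' as v;
create table demo_big as select id, id % 10 as k from range(100000000);

-- small table on the preserved left side of the left join:
-- the hint cannot be satisfied and the plan stays a shuffle-based join
explain select /*+ BROADCAST(s) */ *
from demo_small s left join demo_big b on s.id = b.id;

-- small table on the non-preserved right side:
-- auto-broadcast applies and the plan becomes a BroadcastHashJoin
explain select *
from demo_big b left join demo_small s on s.id = b.id;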