Adding distribute by when writing Hive dynamic partitions

License: CC 4.0 BY-SA

Article tags: #hive #hadoop #data-warehouse

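This article describes how adding DistributeBy in a Spark program reduces the large number of small files produced when writing partitions keyed by device ID type and salt, which improves processing efficiency and keeps the job from failing.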
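To make the small-files claim concrete, here is a rough simulation in Python. It is not Hive or Spark code, only a counting sketch under stated assumptions: each writer task produces one file per partition it holds rows for, rows are spread randomly across tasks when no distribution key is given, and distributing by the partition key sends all rows of a partition to a single task. The function name `count_files`, the 200 tasks, and the row count are hypothetical; the four types and the 512 salt buckets follow the setup described in this article.

```python
import random
import zlib


def count_files(num_tasks, partitions, num_rows, distribute_by_partition, seed=0):
    """Count output files, assuming one file per (task, partition) pair."""
    rng = random.Random(seed)
    files = set()
    for _ in range(num_rows):
        part = rng.choice(partitions)
        if distribute_by_partition:
            # Assumption: all rows with the same partition key go to the same task.
            task = zlib.crc32(repr(part).encode()) % num_tasks
        else:
            # Assumption: rows land on an arbitrary task.
            task = rng.randrange(num_tasks)
        files.add((task, part))
    return len(files)


types = ["cuid", "imei", "oaid", "idfa"]
partitions = [(t, salt) for t in types for salt in range(512)]  # 2048 partitions

without_distribute = count_files(200, partitions, 100_000, False)
with_distribute = count_files(200, partitions, 100_000, True)
print("files without distribute by:", without_distribute)
print("files with distribute by:   ", with_distribute)
```

With the distribution key in place, the file count can never exceed the number of partitions, whereas without it the count grows toward tasks times partitions as the data volume increases.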

Background

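We need to write each device ID into the partition determined by its type (one of cuid, imei, oaid, idfa) and its mumuhash value (the salt). The table is partitioned by type and salt, and the write uses dynamic partitioning, because both type and salt are computed inside the job itself. The SQL is as follows: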

insert overwrite table ugc_test_new.dwd_cpa_act_user_df_txt partition (dt = '{@date}', app = 'zuoyebang', type, salt)
select
    device_id,
    type,
    salt
from
    (
        select
            'cuid' as type,
            cuid as device_id,
            cast(mumuhash(cuid, 512) as string)