Incremental Refresh with FULL JOIN

This article shows how to merge incremental (delta) data into a base table in Hive using FULL OUTER JOIN together with the COALESCE function, walking through a complete worked example and a final optimization.


Verifying the delta-merge approach

1. Create the base and delta test data

[root@node1 delta_merge]# pwd
/root/delta_merge
[root@node1 delta_merge]# cat base.txt 
1001,gongshaocheng
1002,LIDACHAO
[root@node1 delta_merge]# cat delta.txt 
1002,lidachao
1003,chenjianzhong

[root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta
[root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta/base
[root@node1 delta_merge]# hdfs dfs -mkdir /user/merge_delta/delta
[root@node1 delta_merge]# hdfs dfs -put base.txt /user/merge_delta/base
[root@node1 delta_merge]# hdfs dfs -put delta.txt /user/merge_delta/delta


2. Create the test tables

hive> create external table base(id string,name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    > location "/user/merge_delta/base/";
OK
Time taken: 0.304 seconds
hive> select * from base;
OK
1001    gongshaocheng
1002    LIDACHAO
Time taken: 0.875 seconds, Fetched: 2 row(s)
hive> create external table delta(id string,name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    > location "/user/merge_delta/delta/";
OK
Time taken: 0.134 seconds
hive> select * from delta;
OK
1002    lidachao
1003    chenjianzhong
Time taken: 0.321 seconds, Fetched: 2 row(s)

3. Tests

a. FULL OUTER JOIN syntax:

hive> select base.*,delta.* from base full outer join delta on base.id = delta.id;

The result:

1001    gongshaocheng    NULL    NULL
1002    LIDACHAO    1002    lidachao
NULL    NULL    1003    chenjianzhong

The answer we ultimately want is:

1001 gongshaocheng --an unchanged record

1002 lidachao  --the latest version of a modified record

1003 chenjianzhong --a newly added record
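The FULL OUTER JOIN semantics above can be sketched in plain Python (an illustration only; the real work happens in Hive). Keys present on only one side produce NULL (`None` here) on the other side:

```python
# Simulate: select base.*, delta.* from base full outer join delta
#           on base.id = delta.id
base = {"1001": "gongshaocheng", "1002": "LIDACHAO"}
delta = {"1002": "lidachao", "1003": "chenjianzhong"}

rows = []
for key in sorted(base.keys() | delta.keys()):
    rows.append((
        key if key in base else None,   # base.id
        base.get(key),                  # base.name
        key if key in delta else None,  # delta.id
        delta.get(key),                 # delta.name
    ))

for row in rows:
    print(row)
```

The three rows printed match the three join cases shown above: base-only, matched, and delta-only.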

b. The COALESCE function:

select 
coalesce(base.id, delta.id)
from base full outer join delta on base.id = delta.id
where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);

Result:

1001
1002
1003

This verifies that for the primary-key column, COALESCE guarantees the key is always non-NULL in the result set.
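COALESCE simply returns its first non-NULL argument. A one-line Python equivalent (an illustrative sketch, not Hive itself) covering the three join cases for the key column:

```python
def coalesce(*vals):
    # Return the first non-None argument, mirroring Hive's COALESCE;
    # returns None only if every argument is None.
    return next((v for v in vals if v is not None), None)

print(coalesce("1001", None))    # base-only row  -> 1001
print(coalesce("1002", "1002"))  # matched row    -> 1002
print(coalesce(None, "1003"))    # delta-only row -> 1003
```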

c. The IF function

select 
if(delta.id is NULL, base.name,delta.name)
from base full outer join delta on base.id = delta.id
where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);

Result:

gongshaocheng
lidachao
chenjianzhong

This verifies that for the ordinary columns, an unchanged row (delta.id is NULL) takes its value from the base table, while any row present in the delta takes the delta's value.
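The IF-based column selection can likewise be sketched in Python (an illustration of the logic, not Hive code); the three tuples below are the (base.name, delta.id, delta.name) values from the three join rows above:

```python
def hive_if(cond, then_val, else_val):
    # Mirrors Hive's IF(cond, a, b).
    return then_val if cond else else_val

cases = [
    ("gongshaocheng", None,   None),            # unchanged row
    ("LIDACHAO",      "1002", "lidachao"),      # modified row
    (None,            "1003", "chenjianzhong"), # new row
]
names = [hive_if(d_id is None, b_name, d_name)
         for b_name, d_id, d_name in cases]
print(names)  # ['gongshaocheng', 'lidachao', 'chenjianzhong']
```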

 

Putting it all together, we get the HQL statement we want:

select 
coalesce(base.id, delta.id),
if(delta.id is NULL, base.name,delta.name)
from base full outer join delta on base.id = delta.id
where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);

The result:

hive> select 
    > coalesce(base.id, delta.id),
    > if(delta.id is NULL, base.name,delta.name)
    > from base full outer join delta on base.id = delta.id
    > where (delta.id is NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NOT NULL) OR (delta.id is NOT NULL AND base.id is NULL);
Query ID = root_20151230235050_befa6322-f78f-4166-8bbd-4fde04a1a9b1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1451024710809_0005, Tracking URL = http://node1.clouderachina.com:8088/proxy/application_1451024710809_0005/
Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1451024710809_0005
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2015-12-30 23:51:04,904 Stage-1 map = 0%,  reduce = 0%
2015-12-30 23:51:13,245 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
2015-12-30 23:51:22,685 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.16 sec
MapReduce Total cumulative CPU time: 5 seconds 160 msec
Ended Job = job_1451024710809_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.16 sec   HDFS Read: 12293 HDFS Write: 52 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 160 msec
OK
1001    gongshaocheng
1002    lidachao
1003    chenjianzhong
Time taken: 31.376 seconds, Fetched: 3 row(s)

 

Note: every one of the HQL statements above needs only a single MapReduce job. That is the essence of this solution!

 

Finally, one further optimization: the WHERE clause was added earlier to keep the logic explicit by treating the three FULL OUTER JOIN cases separately, but the three OR-ed conditions together cover the entire result set, so the WHERE clause is actually redundant. The final HQL is:

select 
coalesce(base.id, delta.id),
if(delta.id is NULL, base.name,delta.name)
from base full outer join delta on base.id = delta.id;
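Logically, the final statement just says "delta rows override base rows on matching keys, and everything else passes through". A Python sketch of the same semantics (an illustration, not Hive):

```python
base = {"1001": "gongshaocheng", "1002": "LIDACHAO"}
delta = {"1002": "lidachao", "1003": "chenjianzhong"}

# Delta entries overwrite base entries with the same key; base-only and
# delta-only entries survive unchanged -- exactly what the FULL OUTER
# JOIN plus COALESCE/IF produces.
merged = {**base, **delta}
print(sorted(merged.items()))
```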