hive 练习案例:
①wordcount
上传数据到hdfs上,创建external表去关联数据,
执行操作:
说明:explode函数可以把行数据按照指定的规则切分成多行数据
select count(*) from (select explode(split(line, ‘,’)) a1 from word) w1 group by w1.a1; //则完成单词计数的需求
②regrex_replace操作
数据准备:
| ***http://ww****w.baidu.com:192.168.23@@@4.22:312 http://www.sina.com:192.168.221.12:321 @@@http://www.ali.com:192.16@@@8.127.23:123 ht***tp://www.baidu.com:192.168.343.45:213 http://w**ww.sin***a.com:192.16%%%%8.223.45:2***31 http://www.jd.com:192.168.345.23:123 htt%%%%p://www.yahoo.co&&&&m:192.168.342.251:3323 http://www.go%%ogle.c@@@om:192.168.231.231:23***22 http://www.badiu.com:192.168.123.121:4422 ht@@@tp://www.wangyi.c@@@om:192.16***8.343.223:123 http://www.wangyi.com:192.168.455.21:3344 htt#####p://www..com:192.168.232.1:222 http://ww******w.baidu.com:19%%2.168.456.23:1***234 |
创建external table 关联HDfs上的数据目录
create external table info(line string) location ‘/replace’;
select regrex_replace(regrex_replace(line, ‘[*|@|$|#|%]’, ‘’), ‘[.][.]’, ‘.’) from info;
③数据extract
| http://www.baidu.com:192.168.234.22:312 http://www.sina.com:192.168.221.12:321 http://www.ali.com:192.168.127.23:123 http://www.baidu.com:192.168.343.45:213 http://www.sina.com:192.168.223.45:231 http://www.jd.com:192.168.345.23:123 http://www.yahoo.com:192.168.342.251:3323 http://www.google.com:192.168.231.231:2322 http://www.badiu.com:192.168.123.121:4422 http://www.wangyi.com:192.168.343.223:123 http://www.wangyi.com:192.168.455.21:3344 http://www.com:192.168.232.1:222 http://www.baidu.com:192.168.456.23:1234 |
抽取其中的域名
select regexp_extract(line, 'http://(.*)(:)(.*)(:)(.*)',1) from r1;
发现有错误的数据:www.com
select * from (select regexp_extract(line, 'http://(.*)(:)(.*)(:)(.*)',1) a1 from r1) w1 where size(split(w1.a1, '[.]')) == 3;
接下来总结一下hive的join操作:
product.txt
P001 xiaomi 2999
P002 huawei 3999
P005 chuizi 4000
order.txt
1 20180710 P001 20
2 20180710 P002 14
3 20180710 P001 35
4 20180710 P002 40
5 20180710 P001 10
6 20180710 P003 20
7 20180710 P004 12
1. inner join
select * from product_t join order_t on product_t.pid=order_t.pid;
select * from product_t inner join order_t on product_t.pid=order_t.pid;

2. left join
select * from product_t left join order_t on product_t.pid=order_t.pid;

3.right join
select * from product_t right join order_t on product_t.pid=order_t.pid;

4. left outer join
select * from product_t full outer join order_t on product_t.pid=order_t.pid;

5 left semi join
这种join解决的是exist in(是否存在) 的问题
a表里哪些数据在b表中出现过
select * from product_t left semi join order_t on product_t.pid=order_t.pid;
![]()
本文提供了Hive的练习案例,包括wordcount、regrex_replace操作、数据extract等,介绍了相关操作步骤和SQL语句。还总结了Hive的join操作,如inner join、left join、right join等,并给出了对应的SQL示例。
553

被折叠的 条评论
为什么被折叠?



