Requirement: Top N most popular products within each area.
The Hive data currently contains no area information.
Area and product information is stored in MySQL (two tables: city_info and product_info).
The user behavior log user_click.txt is stored at /home/hadoop/data/user_click.txt; each line holds user_id, session_id, action_time, city_id, product_id:
[hadoop@hadoop001 shell]$ head -10 /home/hadoop/data/user_click.txt
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:01:56,1,72
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:52:26,1,68
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:17:03,1,40
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:32:07,1,21
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:26:06,1,63
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:03:11,1,60
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:43:43,1,30
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:09:58,1,96
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:18:45,1,71
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:42:39,1,8
[hadoop@hadoop001 shell]$
mysql> select * from city_info ;
+---------+-----------+------+
| city_id | city_name | area |
+---------+-----------+------+
|       1 | BEIJING   | NC   |
|       2 | SHANGHAI  | EC   |
|       3 | NANJING   | EC   |
|       4 | GUANGZHOU | SC   |
|       5 | SANYA     | SC   |
|       6 | WUHAN     | CC   |
|       7 | CHANGSHA  | CC   |
|       8 | XIAN      | NW   |
|       9 | CHENGDU   | SW   |
|      10 | HAERBIN   | NE   |
+---------+-----------+------+
10 rows in set (0.00 sec)
mysql> select * from product_info limit 5 ;
+------------+--------------+----------------------+
| product_id | product_name | extend_info |
+------------+--------------+----------------------+
|          1 | product1     | {"product_status":1} |
|          2 | product2     | {"product_status":1} |
|          3 | product3     | {"product_status":1} |
|          4 | product4     | {"product_status":1} |
|          5 | product5     | {"product_status":1} |
+------------+--------------+----------------------+
5 rows in set (0.00 sec)
mysql>
1) Put the city_info and product_info tables into Hive.
2) Join user_click with city_info and product_info in Hive.
3) Use a window function to get the top N within each group.
Approach:
1. Import the city_info and product_info tables into Hive with sqoop.
2. Join user_click with city_info and product_info in Hive.
3. Use a window function to compute the top N within each group, then export the result to MySQL with sqoop (the core window-function pattern is sketched right after this list).
4. Wrap all the code for this pipeline in one shell script. One point worth mentioning: city_info and product_info change rarely, so they can be imported by separate scripts and this shell script would not need to handle them.
5. Trigger the script with crontab, running at 2:00 a.m. every day.
Notes:
a) Every temporary table must be dropped before it is recreated: use drop table if exists together with create table if not exists.
b) The key steps should produce log output.
c) The shell script has to handle idempotency, i.e. re-running it for the same day must not produce duplicate data.
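The core of step 3 is a row_number() window over each area. A minimal sketch of the pattern, assuming a table that already holds area, product_id and click_sum per product (the script below builds it as tmp_area_product_click_count_full_info):

select area, product_id, click_sum
from (
    select area, product_id, click_sum,
           row_number() over (partition by area order by click_sum desc) r
    from tmp_area_product_click_count_full_info
) t
where t.r <= 3;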
The shell script is as follows:
#!/bin/bash
#hive top n
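# Flow: load yesterday's user_click log into a Hive partition, import the
# city_info and product_info dimension tables from MySQL with sqoop, compute
# the per-area top products with a window function, and export the result
# back to MySQL.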
current_date=$(date +%Y-%m-%d)   # current date
echo "current_date=${current_date}"
pre_date=$(date -d '1 day ago' +%Y-%m-%d)   # yesterday, the day being processed
echo "pre_date=${pre_date}"
hive -e "use default;drop table if exists user_click;create table if not exists user_click(
user_id int,
session_id string,
action_time string,
city_id int,
product_id int
) partitioned by (date string)
row format delimited fields terminated by ',';"
hive -e "use default;load data local inpath '/home/hadoop/data/user_click.txt' overwrite into table user_click partition(date='${pre_date}');"
echo "create user_click table and load data success.."
#hive create city_info and product_info
hive -e "use default;drop table if exists city_info;create table if not exists city_info(
city_id int,
city_name string,
area string
)
row format delimited fields terminated by '\t';"
echo "create hive city_info table"
hive -e "use default;drop table if exists product_info; create table if not exists
product_info(
product_id int,
product_name string,
extend_info string
)
row format delimited fields terminated by '\t';"
echo "create hive product_info table"
#city_info load hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "city_info" \
--mapreduce-job-name city_info \
--delete-target-dir \
--hive-database default \
--hive-table city_info \
--hive-import \
--hive-overwrite \
--fields-terminated-by '\t' \
--split-by city_id \
-m 2
echo "city_info load hive successful ...."
#product_info load hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "product_info" \
--mapreduce-job-name product_info \
--delete-target-dir \
--hive-database default \
--hive-table product_info \
--hive-import \
--hive-overwrite \
--fields-terminated-by '\t' \
--split-by product_id \
-m 2
echo "product_info load hive successful ...."
hive -e "use default;drop table if exists tmp_product_click_basic_info;
create table if not exists tmp_product_click_basic_info
as
select t.city_id,t.product_id,c.city_name,c.area from
(select city_id,product_id from user_click where date='${pre_date}') //${pre_date}这里如果用" "获取不pre_date这个值,最好用''
as t
join
(select city_id,city_name,area from city_info) as c
on t.city_id=c.city_id ;"
echo "hive create tmp_product_click_basic_info table ......"
hive -e "use default;drop table if exists tmp_area_product_click_count;
create table if not exists tmp_area_product_click_count as
select product_id,count(*) click_sum,area from tmp_product_click_basic_info
group by product_id,area ;"
echo "hive create tmp_area_product_click_count table ......"
hive -e "use default;drop table if exists tmp_area_product_click_count_full_info;
create table if not exists tmp_area_product_click_count_full_info as
select t.product_id,t.click_sum,t.area,p.product_name from
tmp_area_product_click_count as t join product_info as p
on t.product_id=p.product_id ;"
eho "hive tmp_area_product_click_count_full_info table ......"
hive -e "use default;drop table if exists area_product_click_count_top5 ;
create table if not exists area_product_click_count_top5
row format delimited fields terminated by '\t' //这里最好指定一下分割符为'\t',不然后面将这个hive表导入到mysql中不知道分割符是什么,会报错
as
select * from
(select '${pre_date}' day,product_id,click_sum,product_name,area, // '${pre_date}' day 这是给该表新加一个字段
row_number() over(partition by area order by click_sum desc) r
from tmp_area_product_click_count_full_info) t
where t.r<=3 ;"
echo "##############################################pre_date=${pre_date}"
### Create the area_product_click_count_top5 result table in the ruozedata database in MySQL (done here from inside the script).
/usr/local/mysql/bin/mysql -uroot -proot --socket=/usr/local/mysql/data/mysql.sock <<EOF
use ruozedata;
create table if not exists area_product_click_count_top5(
day varchar(15),
product_id int(10),
click_sum int(11),
product_name varchar(15),
area varchar(10),
r int(10)
);
EOF
echo "mysql creat area_product_click_count_top5 table ......"
### For idempotency, first delete any existing rows for ${pre_date} from the MySQL result table; the statement is passed with -e so that the shell substitutes ${pre_date}.
/usr/local/mysql/bin/mysql -uroot -proot --socket=/usr/local/mysql/data/mysql.sock -e "use ruozedata;delete from area_product_click_count_top5 where day='${pre_date}';"
### Export the final aggregated Hive table to MySQL.
sqoop export \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--mapreduce-job-name area_product_click_count_top5 \
--table area_product_click_count_top5 \
--export-dir /user/hive/warehouse/area_product_click_count_top5 \
--columns "day,product_id,click_sum,product_name,area,r" \
--fields-terminated-by '\t' \
-m 2
echo "hive load mysql data successful ...."
Then set up a scheduled task:
crontab -e
0 2 * * * /home/hadoop/shell/topN.sh &>/dev/null
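Note b above asks for log output from the key steps, and &>/dev/null discards everything. A variant that keeps a per-day log file could look like this (the log directory is an assumption; % has to be escaped inside crontab):

0 2 * * * /home/hadoop/shell/topN.sh >> /home/hadoop/shell/logs/topN_$(date +\%Y-\%m-\%d).log 2>&1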
Note: prefix temporary tables with tmp, e.g. tmp_area_product_click_count_full_info, and drop all of them once the statistics have completed successfully (a sketch of that cleanup follows below).
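A sketch of that cleanup, which could be appended to the end of the script after the sqoop export succeeds:

hive -e "use default;
drop table if exists tmp_product_click_basic_info;
drop table if exists tmp_area_product_click_count;
drop table if exists tmp_area_product_click_count_full_info;"
echo "drop temporary tables success ...."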