Requirement: Top N most popular products within each area.
The Hive data currently contains no area information.
Area and product information is stored in MySQL (two tables: city_info and product_info).
The user behavior log user_click.txt is stored at /home/hadoop/data/user_click.txt; each line holds user_id, session_id, action_time, city_id, product_id:
[hadoop@hadoop001 shell]$ head -10 /home/hadoop/data/user_click.txt
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:01:56,1,72
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:52:26,1,68
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:17:03,1,40
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:32:07,1,21
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:26:06,1,63
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:03:11,1,60
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:43:43,1,30
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:09:58,1,96
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:18:45,1,71
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:42:39,1,8
[hadoop@hadoop001 shell]$
mysql> select * from city_info ;
+---------+-----------+------+
| city_id | city_name | area |
+---------+-----------+------+
|       1 | BEIJING   | NC   |
|       2 | SHANGHAI  | EC   |
|       3 | NANJING   | EC   |
|       4 | GUANGZHOU | SC   |
|       5 | SANYA     | SC   |
|       6 | WUHAN     | CC   |
|       7 | CHANGSHA  | CC   |
|       8 | XIAN      | NW   |
|       9 | CHENGDU   | SW   |
|      10 | HAERBIN   | NE   |
+---------+-----------+------+
10 rows in set (0.00 sec)
mysql> select * from product_info limit 5 ;
+------------+--------------+----------------------+
| product_id | product_name | extend_info |
+------------+--------------+----------------------+
|          1 | product1     | {"product_status":1} |
|          2 | product2     | {"product_status":1} |
|          3 | product3     | {"product_status":1} |
|          4 | product4     | {"product_status":1} |
|          5 | product5     | {"product_status":1} |
+------------+--------------+----------------------+
5 rows in set (0.00 sec)
mysql>
1) Put the city_info and product_info tables into Hive.
2) Join user_click with city_info and product_info in Hive.
3) Use a window function to get the top N within each group.
Approach:
1. Import the city_info and product_info tables into Hive with sqoop.
2. Join user_click with city_info and product_info in Hive.
3. Use a window function to compute the top N within each group, then export the result to MySQL with sqoop (the core window-function pattern is sketched right after this list).
4. Wrap all the code for this pipeline in one shell script. One point worth mentioning: city_info and product_info change rarely, so they can be imported by separate scripts and this shell script would not need to handle them.
5. Trigger the script with crontab, running at 2:00 a.m. every day.
Notes:
a) Every temporary table must be dropped before it is recreated: use drop table if exists together with create table if not exists.
b) The key steps should produce log output.
c) The shell script has to handle idempotency, i.e. re-running it for the same day must not produce duplicate data.
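The core of step 3 is a row_number() window over each area. A minimal sketch of the pattern, assuming a table that already holds area, product_id and click_sum per product (the script below builds it as tmp_area_product_click_count_full_info):

select area, product_id, click_sum
from (
    select area, product_id, click_sum,
           row_number() over (partition by area order by click_sum desc) r
    from tmp_area_product_click_count_full_info
) t
where t.r <= 3;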
The shell script is as follows:
#!/bin/bash
#hive top n
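# Flow: load yesterday's user_click log into a Hive partition, import the
# city_info and product_info dimension tables from MySQL with sqoop, compute
# the per-area top products with a window function, and export the result
# back to MySQL.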
current_date=$(date +%Y-%m-%d)   # current date
echo "current_date=${current_date}"
pre_date=$(date -d '1 day ago' +%Y-%m-%d)   # yesterday, the day being processed
echo "pre_date=${pre_date}"
hive -e "use default;drop table if exists user_click;create table if not exists user_click(
user_id int,
session_id string,
action_time string,
city_id int,
product_id int
) partitioned by (date string)
row format delimited fields terminated by ',';"
hive -e "use default;load data local inpath '/home/hadoop/data/user_click.txt' overwrite into table user_click partition(date='${pre_date}');"
echo "create user_click table and load data success.."
#hive create city_info and product_info
hive -e "use default;drop table if exists city_info;create table if not exists city_info(
city_id int,
city_name string,
area string
)
row format delimited fields terminated by '\t';"
echo "create hive city_info table"
hive -e "use default;drop table if exists product_info; create table if not exists
product_info(
product_id int,
product_name string,
extend_info string
)
row format delimited fields terminated by '\t';"
echo "create hive product_info table"
#city_info load hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "city_info" \
--mapreduce-job-name city_info \
--delete-target-dir \
--hive-database default \
--hive-table city_info \
--hive-import \
--hive-overwrite \
--fields-terminated-by '\t' \
--split-by city_id \
-m 2
echo "city_info load hive successful ...."
#product_info load hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "product_info" \
--mapreduce-job-name product_info \
--delete-target-dir \
--hive-database default \
--hive-table product_info \
--hive-import \
--hive-overwrite \
--fields-terminated-by '\t' \
--split-by product_id \
-m 2
echo "product_info load hive successful ...."
hive -e "use default;drop table if exists tmp_product_click_basic_info;
create table if not exists tmp_product_click_basic_info
as
select t.city_id,t.product_id,c.city_name,c.area from
(select city_id,product_id from user_click where date='${pre_date}') //${pre_date}这里如果用" "获取不pre_date这个值,最好用''
as t
join
(select city_id,city_name,area from city_info) as c
on t.city_id=c.city_id ;"
echo "hive create tmp_product_click_basic_info table ......"
hive -e "use default;drop table if exists tmp_area_product_click_count;
create table if not exists tmp_area_product_click_count as
select product_id,count(*) click_sum,area from tmp_product_click_basic_info
group by product_id,area ;"
echo "hive create tmp_area_product_click_count table ......"
hive -e "use default;drop table if exists tmp_area_product_click_count_full_info;
create table if not exists tmp_area_product_click_count_full_info as
select t.product_id,t.click_sum,t.area,p.product_name from
tmp_area_product_click_count as t join product_info as p
on t.product_id=p.product_id ;"
eho "hive tmp_area_product_click_count_full_info table ......"
hive -e "use default;drop table if exists area_product_click_count_top5 ;
create table if not exists area_product_click_count_top5
row format delimited fields terminated by '\t' //这里最好指定一下分割符为'\t',不然后面将这个hive表导入到mysql中不知道分割符是什么,会报错
as
select * from
(select '${pre_date}' day,product_id,click_sum,product_name,area, // '${pre_date}' day 这是给该表新加一个字段
row_number() over(partition by area order by click_sum desc) r
from tmp_area_product_click_count_full_info) t
where t.r<=3 ;"
echo "##############################################pre_date=${pre_date}"
### Create the area_product_click_count_top5 result table in the ruozedata database in MySQL (done here from inside the script).
/usr/local/mysql/bin/mysql -uroot -proot --socket=/usr/local/mysql/data/mysql.sock <<EOF
use ruozedata;
create table if not exists area_product_click_count_top5(
day varchar(15),
product_id int(10),
click_sum int(11),
product_name varchar(15),
area varchar(10),
r int(10)
);
EOF
echo "mysql creat area_product_click_count_top5 table ......"
### For idempotency, first delete any existing rows for ${pre_date} from the MySQL result table; the statement is passed with -e so that the shell substitutes ${pre_date}.
/usr/local/mysql/bin/mysql -uroot -proot --socket=/usr/local/mysql/data/mysql.sock -e "use ruozedata;delete from area_product_click_count_top5 where day='${pre_date}';"
### Export the final aggregated Hive table to MySQL.
sqoop export \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--mapreduce-job-name area_product_click_count_top5 \
--table area_product_click_count_top5 \
--export-dir /user/hive/warehouse/area_product_click_count_top5 \
--columns "day,product_id,click_sum,product_name,area,r" \
--fields-terminated-by '\t' \
-m 2
echo "hive load mysql data successful ...."
Then set up a scheduled task:
crontab -e
0 2 * * * /home/hadoop/shell/topN.sh &>/dev/null
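Note b above asks for log output from the key steps, and &>/dev/null discards everything. A variant that keeps a per-day log file could look like this (the log directory is an assumption; % has to be escaped inside crontab):

0 2 * * * /home/hadoop/shell/topN.sh >> /home/hadoop/shell/logs/topN_$(date +\%Y-\%m-\%d).log 2>&1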
Note: prefix temporary tables with tmp, e.g. tmp_area_product_click_count_full_info, and drop all of them once the statistics have completed successfully (a sketch of that cleanup follows below).
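A sketch of that cleanup, which could be appended to the end of the script after the sqoop export succeeds:

hive -e "use default;
drop table if exists tmp_product_click_basic_info;
drop table if exists tmp_area_product_click_count;
drop table if exists tmp_area_product_click_count_full_info;"
echo "drop temporary tables success ...."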