Completing a TopN Statistics Job in Production

Requirement: the TopN most popular products per area.
The Hive tables currently carry no area information.
The area and product information live in MySQL, in the city_info and product_info tables.
user_click.txt is the user behavior log, stored at /home/hadoop/data/user_click.txt; each line is user_id, session_id, action_time, city_id, product_id.

[hadoop@hadoop001 shell]$ head -10 /home/hadoop/data/user_click.txt
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:01:56,1,72
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:52:26,1,68
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:17:03,1,40
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:32:07,1,21
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:26:06,1,63
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:03:11,1,60
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:43:43,1,30
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:09:58,1,96
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:18:45,1,71
95,2bf501a7637549c89cf55342331b15db,2016-05-05 21:42:39,1,8
[hadoop@hadoop001 shell]$ 

	
mysql> select * from city_info ;
+---------+-----------+------+
| city_id | city_name | area |
+---------+-----------+------+
|       1 | BEIJING   | NC   |
|       2 | SHANGHAI  | EC   |
|       3 | NANJING   | EC   |
|       4 | GUANGZHOU | SC   |
|       5 | SANYA     | SC   |
|       6 | WUHAN     | CC   |
|       7 | CHANGSHA  | CC   |
|       8 | XIAN      | NW   |
|       9 | CHENGDU   | SW   |
|      10 | HAERBIN   | NE   |
+---------+-----------+------+
10 rows in set (0.00 sec)

mysql> select * from product_info limit 5 ;
+------------+--------------+----------------------+
| product_id | product_name | extend_info          |
+------------+--------------+----------------------+
|          1 | product1     | {"product_status":1} |
|          2 | product2     | {"product_status":1} |
|          3 | product3     | {"product_status":1} |
|          4 | product4     | {"product_status":1} |
|          5 | product5     | {"product_status":1} |
+------------+--------------+----------------------+
5 rows in set (0.00 sec)

mysql> 

Solution approach:
1. Import the city_info and product_info tables into Hive with Sqoop.
2. Join user_click against city_info and product_info in Hive.
3. Use a window function to compute the TopN within each group (see the sketch below), then export the result to MySQL with Sqoop.
4. Wrap the whole pipeline in one shell script. Worth mentioning: since the city_info/product_info data rarely changes, it can be imported by separate scripts and left out of the daily script (it is included below for completeness).
5. Trigger the script with crontab, starting at 2:00 a.m. every day.
Notes:
a) Always drop each temporary table before re-creating it, pairing drop table if exists with create table if not exists.
b) Key steps must emit log output.
c) The shell script must handle idempotency: re-running it for the same day must not duplicate results.
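
Step 3's per-group TopN boils down to row_number() over a per-area partition. A minimal sketch (table and column names follow the full script below):

select area, product_id, click_sum, r
from
(select area, product_id, click_sum,
row_number() over(partition by area order by click_sum desc) r
from tmp_area_product_click_count_full_info) t
where t.r <= 5;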

The shell script is as follows:

#!/bin/bash

# hive top n

current_date=$(date +%Y-%m-%d)             # today's date
echo "current_date=${current_date}"
pre_date=$(date -d '1 day ago' +%Y-%m-%d)  # yesterday's date
echo "pre_date=${pre_date}"

hive -e "use default;drop table if exists user_click;create table if not exists  user_click(
user_id int,
session_id string,
action_time string,
city_id int,
product_id int
) partitioned by (date string)
row format delimited fields terminated by ',';"
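
One caveat, not in the original script: in newer Hive versions (1.2+ with default settings), date is a reserved keyword, so the DDL above may be rejected unless the partition column is escaped with backticks (backslash-escaped so the shell passes them through literally). The relevant line inside the same hive -e "..." string would become:

partitioned by (\`date\` string)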

hive -e "use default;load data local inpath '/home/hadoop/data/user_click.txt' overwrite into table user_click partition(date='${pre_date}');"

echo "create user_click table and load data success.."

#hive create city_info and product_info 

hive -e "use default;drop table if exists city_info;create table if not exists city_info(
city_id int,
city_name string,
area string
)
row format delimited fields terminated by '\t';"
echo "create hive city_info table"

hive -e "use default;drop table if exists product_info;  create table if not  exists
 product_info(
product_id int,
product_name string,
extend_info string
)
row format delimited fields terminated by '\t';"

echo "create hive product_info table"


# import city_info into Hive
sqoop import \
--connect  jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "city_info" \
--mapreduce-job-name  city_info \
--delete-target-dir \
--hive-database  default \
--hive-table  city_info \
--hive-import \
--hive-overwrite  \
--fields-terminated-by  '\t' \
--split-by city_id \
 -m 2
echo "city_info load hive successful ...."

# import product_info into Hive
sqoop import \
--connect  jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "product_info" \
--mapreduce-job-name  product_info \
--delete-target-dir \
--hive-database  default \
--hive-table  product_info \
--hive-import \
--hive-overwrite  \
--fields-terminated-by  '\t' \
--split-by product_id \
 -m 2
echo "product_info  load hive successful ...."

hive -e "use default;drop  table if exists tmp_product_click_basic_info;
create table if not exists tmp_product_click_basic_info 
as
select  t.city_id,t.product_id,c.city_name,c.area  from 
(select city_id,product_id from user_click  where date='${pre_date}')       //${pre_date}这里如果用" "获取不pre_date这个值,最好用''  
as t
join
(select city_id,city_name,area from city_info) as c
on t.city_id=c.city_id  ;"

echo "hive create tmp_product_click_basic_info  table ......"

hive -e "use default;drop table if exists tmp_area_product_click_count;
create table if not exists tmp_area_product_click_count as 
select  product_id,count(*) click_sum,area  from tmp_product_click_basic_info
group by product_id,area ;"

echo "hive create tmp_area_product_click_count  table ......"


hive -e  "use default;drop table if exists tmp_area_product_click_count_full_info;
create table if not exists tmp_area_product_click_count_full_info as 
select t.product_id,t.click_sum,t.area,p.product_name  from
tmp_area_product_click_count as t join product_info as p
on  t.product_id=p.product_id ;"

echo "hive create tmp_area_product_click_count_full_info  table ......"

hive -e "use default;drop  table if exists area_product_click_count_top5 ;
create table if not exists area_product_click_count_top5 
row format delimited fields terminated by '\t'     //这里最好指定一下分割符为'\t',不然后面将这个hive表导入到mysql中不知道分割符是什么,会报错
as
select * from
(select  '${pre_date}' day,product_id,click_sum,product_name,area,   // '${pre_date}' day  这是给该表新加一个字段
row_number() over(partition by area order by click_sum desc) r  
from tmp_area_product_click_count_full_info) t
 where t.r<=3 ;"

echo "##############################################pre_date=${pre_date}"


### Create the area_product_click_count_top5 table in the MySQL ruozedata database from within the script.

/usr/local/mysql/bin/mysql -uroot -proot --socket=/usr/local/mysql/data/mysql.sock <<EOF
use ruozedata;
create table if not exists area_product_click_count_top5(
day varchar(15),
product_id int(10),
click_sum int(11),
product_name varchar(15),
area varchar(10),
r int(10)
);
EOF

echo "mysql creat area_product_click_count_top5 table ......"

### For idempotency, first delete this run's ${pre_date} rows from the MySQL result table so a re-run does not duplicate data. A mysql -e one-liner is convenient for passing the ${pre_date} variable in (an unquoted heredoc would also expand it, but -e is simpler for a single statement).
/usr/local/mysql/bin/mysql -uroot -proot --socket=/usr/local/mysql/data/mysql.sock -e "use ruozedata;delete from area_product_click_count_top5 where day='${pre_date}'"


### Export the final aggregated Hive table to MySQL.

sqoop export \
 --connect jdbc:mysql://localhost:3306/ruozedata \
 --password root \
 --username root \
 --mapreduce-job-name  area_product_click_count_top5 \
 --table area_product_click_count_top5 \
 --export-dir /user/hive/warehouse/area_product_click_count_top5 \
 --columns "day,product_id,click_sum,product_name,area,r"  \
 --fields-terminated-by  '\t'  \
 -m 2 

echo "hive load mysql data successful ...."

Finally, set up the scheduled job:
crontab -e

0   2    *  *  *    /home/hadoop/shell/topN.sh  &>/dev/null
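
Note b) above asks for log output from key steps, so sending everything to /dev/null throws that trail away. A variant that keeps a dated log file instead (assuming a logs directory under /home/hadoop/shell; % must be escaped as \% inside crontab):

0 2 * * * /home/hadoop/shell/topN.sh >> /home/hadoop/shell/logs/topN_$(date +\%Y-\%m-\%d).log 2>&1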

Note: prefix every temporary table with tmp_, e.g. tmp_area_product_click_count_full_info, and once the statistics have been exported successfully, drop all the temporary tables (see the sketch below).
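
A minimal cleanup step for the end of the script, covering the three temporary tables created above:

hive -e "use default;
drop table if exists tmp_product_click_basic_info;
drop table if exists tmp_area_product_click_count;
drop table if exists tmp_area_product_click_count_full_info;"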
