Hive in Practice

License: CC 4.0 BY-SA

Case Study 1: Data ETL

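1.1 Requirements

Run ETL on the base table of web clickstream logs (following the warehouse model design).
Compute the top 10 referrer domains for each time dimension.

Existing table "t_orgin_weblog":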

+------------------+------------+----------+--+
|     col_name     | data_type  | comment  |
+------------------+------------+----------+--+
| valid            | string     |          |
| remote_addr      | string     |          |
| remote_user      | string     |          |
| time_local       | string     |          |
| request          | string     |          |
| status           | string     |          |
| body_bytes_sent  | string     |          |
| http_referer     | string     |          |
| http_user_agent  | string     |          |
+------------------+------------+----------+--+

1.2 Sample data

| true|1.162.203.134| - | 18/Sep/2013:13:47:35| /images/my.jpg                        | 200| 19939 | "http://www.angularjs.cn/A0d9"                      | "Mozilla/5.0 (Windows   |
 
| true|1.202.186.37 | - | 18/Sep/2013:15:39:11| /wp-content/uploads/2013/08/windjs.png| 200| 34613 | "http://cnodejs.org/topic/521a30d4bee8d3cb1272ac0f" | "Mozilla/5.0 (Macintosh;|

1.3 Implementation steps

1. Extract and transform the raw data

-- Split the referrer URL into host, path, query, and the query parameter id

drop table if exists t_etl_referurl;
create table t_etl_referurl as
select a.*, b.*
from t_orgin_weblog a
lateral view parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;

2. From the output of the previous step, split out the date and time fields to build the ETL detail table "t_etl_detail" (day, tm)

drop table if exists t_etl_detail;
create table t_etl_detail as
-- time_local looks like 18/Sep/2013:13:47:35; substring positions are 1-based
select b.*,
substring(time_local,1,11) as daystr,
substring(time_local,13) as tmstr,
substring(time_local,4,3) as month,
substring(time_local,1,2) as day,
substring(time_local,13,2) as hour
from t_etl_referurl b;
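
To check which characters the substring offsets in step 2 pick out, here is a small Python sketch applied to the sample value 18/Sep/2013:13:47:35 from section 1.2. The helper hive_substring is a hypothetical stand-in that mimics a 1-based substring for start positions of 1 or more; it is not a Hive or library function, and a start of 0 in the query is assumed to behave like 1.

```python
def hive_substring(s, start, length=None):
    """Mimic a 1-based SQL substring for start >= 1 (illustration only)."""
    begin = start - 1  # SQL counts characters from 1, Python from 0
    return s[begin:] if length is None else s[begin:begin + length]


time_local = "18/Sep/2013:13:47:35"  # sample value from section 1.2

# Same offsets as the five derived columns of t_etl_detail
fields = {
    "daystr": hive_substring(time_local, 1, 11),
    "tmstr": hive_substring(time_local, 13),
    "month": hive_substring(time_local, 4, 3),
    "day": hive_substring(time_local, 1, 2),
    "hour": hive_substring(time_local, 13, 2),
}

print(fields)
```

The daystr value produced here is the same string that the partition load in step 3 filters on.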

3. Partition the ETL data (the partitioned table holds the structured form of all the data)

drop table if exists t_etl_detail_prt;
create table t_etl_detail_prt(
valid                   string,
remote_addr            string,
remote_user            string,
time_local               string,
request                 string,
status                  string,
body_bytes_sent         string,
http_referer             string,
http_user_agent         string,
host                   string,
path                   string,
query                  string,
query_id               string,
daystr                 string,
tmstr                  string,
month                  string,
day                    string,
hour                   string)
partitioned by (mm string,dd string);

Load the data

insert into table t_etl_detail_prt partition(mm='Sep',dd='18')
select * from t_etl_detail where daystr='18/Sep/2013';
 
insert into table t_etl_detail_prt partition(mm='Sep',dd='19')
select * from t_etl_detail where daystr='19/Sep/2013';

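Count the visits per referer_host for each time dimension and sort the result. The source table t_display_referer_counts is not defined in this article; it is assumed to carry the columns referer_host, mm, dd, and hh.

create table t_refer_host_visit_top_tmp as
select referer_host, count(*) as counts, mm, dd, hh
from t_display_referer_counts
group by hh, dd, mm, referer_host
order by hh asc, dd asc, mm asc, counts desc;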

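4. Top-N referrers by visit count for each time dimension

Take the top N referer_host values by visit count within each hour of each day (N = 3 here). The ranking must be ordered by counts; ordering by the partition key itself would rank the rows arbitrarily.

select * from (
select referer_host, counts, concat(hh,dd) as hour_day,
row_number() over (partition by concat(hh,dd) order by counts desc) as od
from t_refer_host_visit_top_tmp
) t
where od<=3;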
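
The top-N-per-group pattern can be easier to follow outside SQL. The sketch below groups rows by the hour-and-day key and ranks each group by visit count in descending order, which is what a top-N by visits needs. The rows and the function name top_n_per_group are hypothetical and are not taken from the article's data.

```python
from collections import defaultdict

# Hypothetical rows: (referer_host, counts, hh, dd)
rows = [
    ("cnodejs.org", 12, "13", "18"),
    ("www.angularjs.cn", 30, "13", "18"),
    ("example.com", 7, "13", "18"),
    ("example.org", 3, "13", "18"),
    ("cnodejs.org", 5, "15", "18"),
]


def top_n_per_group(rows, n):
    """Rank hosts by counts (descending) inside each hh+dd group, keep the first n."""
    groups = defaultdict(list)
    for host, counts, hh, dd in rows:
        groups[hh + dd].append((host, counts))  # same key as concat(hh, dd)

    result = []
    for key in sorted(groups):
        ranked = sorted(groups[key], key=lambda r: r[1], reverse=True)
        for od, (host, counts) in enumerate(ranked[:n], start=1):
            result.append((host, counts, key, od))  # od plays the role of row_number()
    return result


top3 = top_n_per_group(rows, 3)
for row in top3:
    print(row)
```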

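Case Study 2: Visit duration statistics

2.1 Requirements

Compute the average daily stay time of visitors from the web logs.

2.2 Implementation steps

1. Telling each user's individual visits apart among a large volume of requests involves fairly complex logic that is hard to express directly in Hive, so an MR program is written to derive the per-visitor access information (see the code for details).

Run the MR program to produce the result: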

[hadoop@hdp-node-01 ~]$ hadoop jar weblog.jar cn.itcast.bigdata.hive.mr.UserStayTime /weblog/input /weblog/stayout
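
The MR code itself is not shown in this article. As a rough idea of the kind of computation involved, the sketch below derives a first request time, a last request time, and a stay duration per remote_addr. It rests on several assumptions: all requests from one address are treated as a single visit (the real job may split them into separate visits), the duration is expressed in milliseconds, and the sample requests and the name stay_times are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # matches time_local, e.g. 18/Sep/2013:13:47:35

# Hypothetical requests: (remote_addr, time_local)
requests = [
    ("1.162.203.134", "18/Sep/2013:13:47:35"),
    ("1.162.203.134", "18/Sep/2013:13:49:05"),
    ("1.202.186.37", "18/Sep/2013:15:39:11"),
]


def stay_times(requests):
    """Return {remote_addr: (first_req_time, last_req_time, stay_long_ms)}."""
    seen = {}
    for addr, time_local in requests:
        ts = datetime.strptime(time_local, TIME_FORMAT)
        first, last = seen.get(addr, (ts, ts))
        seen[addr] = (min(first, ts), max(last, ts))
    return {
        addr: (first, last, int((last - first).total_seconds() * 1000))
        for addr, (first, last) in seen.items()
    }


result = stay_times(requests)
for addr, (first, last, stay_long) in result.items():
    print(addr, first, last, stay_long)
```

An address with a single request ends up with a duration of 0 under this reading.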

2. Load the MR output into a Hive table

drop table if exists t_display_access_info_tmp;
create table t_display_access_info_tmp(remote_addr string,firt_req_time string,last_req_time string,stay_long bigint)
row format delimited fields terminated by '\t';

-- The path must point at the output directory of the MR job (the command above writes to /weblog/stayout)
load data inpath '/weblog/stayout4' into table t_display_access_info_tmp;

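3. Build the visitor access table "t_display_access_info"

Some visits consist of a single request, and the MR program reports a duration of 0 for them, so single-request visits are given a default stay time of 30 seconds (the value 30000 below implies that stay_long is in milliseconds).

drop table if exists t_display_access_info;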
create table t_display_access_info as
select remote_addr,firt_req_time,last_req_time,
case stay_long
when 0 then 30000
else stay_long
end as stay_long
from t_display_access_info_tmp;

4. Compute the average stay time across all users

select avg(stay_long) from t_display_access_info;
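
Steps 3 and 4 amount to replacing zero durations with a default and then averaging. A minimal Python sketch of that logic follows; the input values are hypothetical and are assumed to be in milliseconds.

```python
DEFAULT_STAY_MS = 30000  # 30 seconds, the default used in step 3

# Hypothetical stay_long values in milliseconds
stay_long_tmp = [90000, 0, 45000, 0]


def apply_default(stay_long):
    """Same rule as: case stay_long when 0 then 30000 else stay_long end."""
    return DEFAULT_STAY_MS if stay_long == 0 else stay_long


stay_long = [apply_default(v) for v in stay_long_tmp]
average = sum(stay_long) / len(stay_long)  # counterpart of avg(stay_long)

print(stay_long, average)
```

The default pulls the average up compared with leaving the zeros in place, which is worth keeping in mind when many visits are single-request visits.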

Case Study 3: Cascading sums, a common running-total report

3.1 Requirements

Given the following visitor access count table t_access_times:

Visitor	Month	Visits
A	2015-01	5
A	2015-01	15
B	2015-01	5
A	2015-01	8
B	2015-01	25
A	2015-01	5
A	2015-02	4
A	2015-02	6
B	2015-02	10
B	2015-02	5
……	……	……

Required output report: t_access_times_accumulate

Visitor	Month	Monthly total	Cumulative total
A	2015-01	33	33
A	2015-02	10	43
…….	…….	…….	…….
B	2015-01	30	30
B	2015-02	15	45
…….	…….	…….	…….

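3.2 Implementation steps

This can be done with a single HQL statement.

Approach: for the monthly total, group by month (per user) and sum the visit counts.

For the cumulative total, join the monthly totals to themselves (a Cartesian product) and keep only the pairs where the user is the same on both sides (user A with user A, user B with user B). The result looks like this, with the columns user, month, value on each side:

left	right
A 1 10	A 1 10
A 1 10	A 2 5
A 1 10	A 3 10
A 2 5	A 1 10
A 2 5	A 2 5
A 2 5	A 3 10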

Then group these rows by the left-hand month and sum the right-hand values whose month is <= the left-hand month:

sum (right.money) where right.month <= left.month

group by left.month
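
The idea can be traced in a few lines of Python using the toy values from the listing above (user A with 10, 5, and 10 in months 1, 2, and 3). This is only a sketch of the self-join-then-filter approach; the function name accumulate is hypothetical.

```python
# Monthly totals for one user, using the toy values from the listing above
monthly = [("A", 1, 10), ("A", 2, 5), ("A", 3, 10)]


def accumulate(monthly):
    """Self-join on user, keep right.month <= left.month, sum the right values."""
    pairs = [(left, right) for left in monthly for right in monthly
             if left[0] == right[0]]  # same user on both sides

    totals = {}
    for (user, month, value), (_, r_month, r_value) in pairs:
        if r_month <= month:  # only months up to and including the left-hand month
            monthly_value, running = totals.get((user, month), (value, 0))
            totals[(user, month)] = (monthly_value, running + r_value)

    return [(u, m, v, acc) for (u, m), (v, acc) in sorted(totals.items())]


report = accumulate(monthly)
for row in report:
    print(row)
```

Each output row carries the user, the month, that month's own total, and the running total up to that month.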

Create the table:

create table t_access_times(username string,month string,salary int)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_times;

SQL to check the data and the approach (monthly totals per user):

select username,month,sum(salary) as salary from t_access_times group by username,month;

SQL to check the self-join:

select A.*, B.* from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on A.username=B.username;

select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;

Summary:

create table t_access_times(username string,month string,salary int)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_times;
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
1. Step one: compute each user's monthly total
select username,month,sum(salary) as salary from t_access_times group by username,month;
+-----------+----------+---------+--+
| username | month | salary |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+

2. Step two: join the monthly totals table to itself
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
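a.salary is not a grouping column, so it has to be wrapped in an aggregate function. max(a.salary) works because every joined row in a group carries the same a.salary value, whereas sum(a.salary) would count it once per joined row. The grouping columns are a.username and a.month.
3. Step three: group the result of the previous step by a.username and a.month.
To get the cumulative value, sum every b.salary where b.month <= a.month.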
select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month  -- only right-hand months <= the left-hand month are summed
group by A.username,A.month
order by A.username,A.month;
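
To sanity-check the final query without a Hive cluster, the sketch below runs the same self-join against an in-memory SQLite database loaded with the ten sample rows. SQLite is only a stand-in chosen for illustration, on the assumption that this particular query behaves the same way in both engines; SQL dialects differ in other respects.

```python
import sqlite3

# The ten sample rows of t_access_times.dat listed in this article
rows = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 5),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-01", 5),
    ("A", "2015-02", 4), ("A", "2015-02", 6), ("B", "2015-02", 10),
    ("B", "2015-02", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t_access_times (username text, month text, salary int)")
conn.executemany("insert into t_access_times values (?, ?, ?)", rows)

# Same structure as the final HQL statement: monthly totals joined to themselves
query = """
select A.username, A.month, max(A.salary) as salary, sum(B.salary) as accumulate
from
  (select username, month, sum(salary) as salary
   from t_access_times group by username, month) A
inner join
  (select username, month, sum(salary) as salary
   from t_access_times group by username, month) B
on A.username = B.username
where B.month <= A.month
group by A.username, A.month
order by A.username, A.month
"""

report = conn.execute(query).fetchall()
conn.close()

for row in report:
    print(row)
```

The rows returned match the required report t_access_times_accumulate from section 3.1.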