Hadoop对于数据的离线处理流程及实现Top N 查询

最新推荐文章于 2025-06-08 21:41:07 发布

爆发的~小宇宙

最新推荐文章于 2025-06-08 21:41:07 发布

阅读量2.1k

点赞数 1

CC 4.0 BY-SA版权

分类专栏： Hadoop Hive hive 文章标签： hive hadoop离线处理 hive实现top n row_number

本文链接：https://blog.youkuaiyun.com/yu0_zhang0/article/details/79204128

Hadoop 同时被 3 个专栏收录

44 篇文章

订阅专栏

hive

15 篇文章

订阅专栏

Hive

10 篇文章

订阅专栏

本文介绍如何利用Hive查询各个区域下最受欢迎的Top3产品访问次数，包括MySQL与Hive表的关联查询、离线数据处理流程及Row_Number()函数的应用。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

1.需求：查询各个区域下最受欢迎的Top 3的访问次数

2.数据：三张表在MySQL中，一张在hive中

1.MySQL中的表：

city_info：城市信息表



mysql> select * from city_info;
+---------+-----------+--------+
| city_id | city_name | area   |
+---------+-----------+--------+
|       1 | 北京      | 华北   |
|       2 | 上海      | 华东   |
|       3 | 南京      | 华东   |
|       4 | 广州      | 华南   |
|       5 | 三亚      | 华南   |
|       6 | 武汉      | 华中   |
|       7 | 长沙      | 华中   |
|       8 | 西安      | 西北   |
|       9 | 成都      | 西南   |
|      10 | 哈尔滨    | 东北   |
+---------+-----------+--------+

product_info：产品信息表

创建product_info表
create table 'product_info'(
'product_id' int(11) NOT NULL,
'product_name' varchar(255) NOT NULL,
'extend_info' varchar(255) NOT NULL,
PRIMARY KEY ('product_id')
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

mysql> select * from product_info;    
+------------+--------------+----------------------+
| product_id | product_name | extend_info          |
+------------+--------------+----------------------+
|          1 | product1     | {"product_status":1} |
|          2 | product2     | {"product_status":1} |
|          3 | product3     | {"product_status":1} |
|          4 | product4     | {"product_status":1} |
|          5 | product5     | {"product_status":1} |
|          6 | product6     | {"product_status":1} |
..........（信息太多略）

user_info：用户信息表

创建user_info表
create table 'user_info'(
'user_id' int(11) NOT NULL,
'name' varchar(255) NOT NULL,
'age' int(11) NOT NULL,
'city_id' int(11) NOT NULL,
'sex' char(5) NOT NULL,
PRIMARY KEY ('user_id'),
FOREIGN KEY ('city_id') REFERENCES 'city_info' ('city_id')
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

mysql> select * from user_info;       
+---------+---------+-----+---------+-----+
| user_id | name    | age | city_id | sex |
+---------+---------+-----+---------+-----+
|       1 | name1   |  25 |       5 | M   |
|       2 | name2   |   8 |       9 | M   |
|       3 | name3   |   8 |       9 | M   |
|       4 | name4   |  26 |       3 | M   |
|       5 | name5   |   8 |       6 | F   |
|       6 | name6   |  48 |       2 | M   |
|       7 | name7   |  53 |       6 | M   |
|       8 | name8   |   7 |       1 | M   |
......

2.Hive中的表

user_click

hive> CREATE EXTERNAL TABLE user_click (
    > user_id     int,
    > session_id  string,
    > action_time string,
    > city_id     int,
    > product_id  int
    > )
    > PARTITIONED BY (dt string)
    > row format delimited fields terminated by '\t';
OK

3.Hadoop对于离线数据的处理流程分析

这里写图片描述
1. apache Flume 是一个从可以收集例如日志，事件等数据资源，并将这些数量庞大的数据从各项数据资源中集中起来存储的工具/服务，或者数集中机制。flume具有高可用，分布式，配置工具，其设计的原理也是基于将数据流，如日志数据从各种网站服务器上汇集起来存储到HDFS，HBase等集中存储器中；
2. 通过Sqoop框架将RDBMS中的数据抽取到HDFS上；
3. HDFS上的数据可以通过MapReduce/Hive/Spark进行数据的清晰，处理一些不合规则的数据或者丰富我们的字段，然后把清晰的数据存储在HDFS/HBase中；
4. 然后我们可以通过Hive进行数据的离线处理，其中分为两步，第一，建立外部表存储数据；第二，通过HQL对数据进行筛选。
5. 最后可以通过Sqoop框架把数据抽取到RDBMS及Hive上；当然这里可以做一个Web UI的数据平台，进行展示数据和处理数据。

4.本文主要分析Hive对离线数据进行处理，查询出Top N

需求：各个区域下最受欢迎的Top 3的产品访问次数
分析：

查询字段：
area(user_click表和city_info表关联及把user_click作为主表left join city_info，根据city_id 查询出所在地区) 
product_name（user_click表和product_info表关联及把user_click作为主表left join product_inf,oproduct_id---->product_name）
访问次数：根据产品id出现的次数

注意：(连接（join）group by、order by、where的执行顺序)
1. 先连接from后的数据源(若有join，则先执行on后条件，再连接数据源)。
2. 执行where条件
3. 执行group by
4.执行having
5.执行order by
6.输出结果。

说到这里先举一个简单的例子，理解起来会更容易：

Hive分组取Top N:
学生信息表：

name    sub            score
a       chinese         98
a       english         90
d       chinese         88
c       english         82
c       math            98
b       math            89
b       chinese         79
z       english         90
z       math            89  
z       chinese         80
e       math            99
e       english         87
d       english         90

问题一：统计每个学科的前三名
select * from (select *, row_number() over (partition by sub order by score desc) as od from t ) t where od<=3;

问题二：语文成绩是80分的排名是多少
hselect od from (select *, row_number() over (partition by sub order by score desc) as od from t ) t where sub=‘chinese‘ and score=80;

row_number：排序后，顺序下来，相同项按先后顺序排序，1，2，3，4，5 hive (test)> select *, row_number() over (partition by sub order by score) as od from t;
rank:排序后，遇到数据相同项时序号一致，后面并留空一位，比如，1，2，2，4，4，6 hive (test)> select *, rank() over (partition by sub order by score) as od from t;
dense_rank:在遇到数据相同项时,序号一致，不留空位，如 1，2，2，3,3,4,4,5 hive (test)> select *, dense_rank() over (partition by sub order by score desc) from t;
- 理解上面的例子后我们分析这个题目：

1.查询各个区域下产品的访问次数

select c.area,p.product_name,count(u.product_id)
from user_click u left join city_info c on u.city_id=c.city_id 
left join product_info p on u.product_id=p.product_id
AND u.dt='2016-05-05'
group by c.area,p.product_name;

（select c.area,p.product_name,count(u.product_id)
from user_click u left join city_info c on u.city_id=c.city_id 
left join product_info p on u.product_id=p.product_id
where u.dt='2016-05-05'
group by c.area,p.product_name;）

（select c.area,p.product_name,count(u.product_id) 
        from user_click u, city_info c, product_info p 
        where u.dt='2016-05-05'
        and u.city_id = c.city_id
        and u.product_id = p.product_id
        group by c.area,p.product_name;）

        注意：select后面的非聚合列必须出现在group by中(如上面的col1和col2)。

（left join on and 与 left join on where的区别）
1、 on条件是在生成临时表时使用的条件，它不管on中的条件是否为真，都会返回左边表中的记录。
2、where条件是在临时表生成好后，再对临时表进行过滤的条件。这时已经没有left join的含义（必须返回左边表的记录）了，条件不为真的就全部过滤掉。
多表使用left join只是把主表里的所有数据查询出来，其他表只查询表中的符合条件的某一条记录，所以速度非常快；
而多表使用where内联，是把所有表的数据全查出来，然后进行比对，所以速度非常慢。

2 ROW_NUMBER() （Hive的内置函数使用）

select area,name,ct,
    row_number() over(partition by area order by ct desc) as rn
    from (
    select c.area area,p.product_name name,count(u.product_id) ct
    from user_click u left join city_info c on u.city_id=c.city_id 
    left join product_info p on u.product_id=p.product_id
    where u.dt='2016-05-05'
    group by c.area,p.product_name)n1;

3 查询top 3

select *
    from(
    select area,name,ct,
    row_number() over(partition by area order by ct desc) as rn
    from (
    select c.area area,p.product_name name,count(u.product_id) ct
    from user_click u left join city_info c on u.city_id=c.city_id 
    left join product_info p on u.product_id=p.product_id
    where u.dt='2016-05-05'
    group by c.area,p.product_name)n1)n2
    where rn<4;

4 创建top_3表并把查询结果导入

CREATE EXTERNAL TABLE top(
area string,
product_name string,
view_count int,
rank int
)
PARTITIONED BY (dt string)
row format delimited fields terminated by '\t';


insert into top partition(dt='2016-05-05')
select *
from(
select area,name,ct,
row_number() over(partition by area order by ct desc) as rn
from (
select c.area area,p.product_name name,count(u.product_id) ct
from user_click u left join city_info c on u.city_id=c.city_id 
left join product_info p on u.product_id=p.product_id
where u.dt='2016-05-05'
group by c.area,p.product_name)n1)n2
where rn<4;

结果：

华东    product4        40      1
华东    product96       32      2
华东    product5        31      3
华中    product7        39      1
华中    product26       39      2
华中    product70       38      3
华北    product9        16      1
华北    product40       16      2
华北    product5        13      3
华南    product38       35      1
华南    product98       34      2
华南    product88       34      3
西北    product67       20      1
西北    product56       20      2
西北    product48       19      3
西南    product16       20      1
西南    product60       19      2
西南    product95       19      3