Hive Learning Path (12): Hive SQL Practice with a Movie Ratings Case

weixin_34400525

Case description

Three data files are provided:
1. users.dat, record format: 2::M::56::16::70072

6,040 records in total
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code

2. movies.dat, record format: 2::Jumanji (1995)::Adventure|Children's|Fantasy

3,883 records in total
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres

3. ratings.dat, record format: 1::1193::5::978300760

1,000,209 records in total
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp

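Requirements

  Data requirements:
    (1) Write a shell script to clean the data. (Hive's default delimited row format does not support multi-character field delimiters: it can split on ':' but not on '::'. Creating the tables the ordinary way therefore does not work, and the data needs a simple cleaning pass first.)
    (2) Load the data in a form that Hive can parse.

  Hive requirements:
    (1) Create the tables correctly and load the data (three tables, three data files), then verify the result.

    (2) Find the 10 movies that were rated the most times, with their rating counts (movie title, rating count).

    (3) Find the 10 highest-rated movies among male users and among female users separately (gender, movie title, rating).

    (4) For the movie with movieid = 2116, find the average rating per age group (the data has only 7 distinct age values, so group by those 7) (age group, rating).

    (5) For the woman who likes movies most (the one with the most ratings), take the 10 movies she rated highest and find their average ratings (viewer, movie title, rating).

    (6) Find the year with the most good movies (rating >= 4.0), then the 10 best movies of that year.

    (7) Among movies released in 1997, find the 10 highest-rated Comedy movies.

    (8) For each genre in the dataset, find the 5 highest-rated movies (genre, movie title, average rating).

    (9) Find the highest-rated genre for each year (year, genre, rating).

    (10) Find the highest-rated movie in each region and store the result in HDFS (region, movie title, rating).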
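Data requirement (1) above asks for a shell cleaning step. A minimal sketch of one way to do it follows, assuming a tab is an acceptable replacement delimiter. The sample genre field already uses '|', and titles may contain a single ':', so neither of those is a safe choice. The function name clean_delimiters and the .tsv output name in the usage comment are hypothetical and do not come from the original post.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every "::" field separator as a single tab.
# Reads raw records on stdin and writes cleaned records on stdout.
clean_delimiters() {
    tab=$(printf '\t')          # literal tab, portable across sed implementations
    sed "s/::/${tab}/g"
}

# Demo on the sample ratings record from the case description;
# prints the four fields separated by tabs.
printf '%s\n' '1::1193::5::978300760' | clean_delimiters

# Usage on a real file (output name is an assumption):
#   clean_delimiters < users.dat > users.tsv
```

A table over the cleaned file could then use Hive's ordinary `row format delimited fields terminated by '\t'` clause instead of a custom SerDe.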

Data download

https://files.cnblogs.com/files/qingyunzong/hive%E5%BD%B1%E8%AF%84%E6%A1%88%E4%BE%8B.zip

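Solution

The three tables were previously merged with a MapReduce program, so the merged data only needs to be loaded into the corresponding tables and queried.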

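1. Create the tables correctly, load the data (three tables, three data files), and verify the result

(1) Requirements analysis

Create a database named movie containing three tables: t_user, t_movie, and t_rating.

t_user:userid bigint,sex string,age int,occupation string,zipcode string
t_movie:movieid bigint,moviename string,movietype string
t_rating:userid bigint,movieid bigint,rate double,times string

The raw data uses :: as the field separator, so a SerDe that can parse multi-character delimiters is needed.

RegexSerDe is used here.

It takes two parameters (shown for a three-field record; the number of capture groups must equal the number of columns):
input.regex = "(.*)::(.*)::(.*)"
output.format.string = "%1$s %2$s %3$s"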
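To see how the capture groups in input.regex map onto columns, the same three-group pattern can be tried outside Hive. The sketch below uses sed -E on the sample movies.dat record. The helper name split_three is hypothetical, and the ^ and $ anchors are an assumption meant to mimic a whole-line match. sed's extended regular expressions are not Java's regex engine, so treat this as an approximation of what the SerDe does.

```shell
#!/bin/sh
# Hypothetical helper: apply the three-group pattern to each input line
# and print every capture group on its own line, prefixed by its index.
split_three() {
    sed -E 's/^(.*)::(.*)::(.*)$/1=\1\
2=\2\
3=\3/'
}

# Sample movies.dat record from the case description.
printf '%s\n' "2::Jumanji (1995)::Adventure|Children's|Fantasy" | split_three
# prints:
# 1=2
# 2=Jumanji (1995)
# 3=Adventure|Children's|Fantasy
```

Because .* is greedy, a line with an extra :: would have the surplus absorbed by the first group, so the pattern relies on each record containing exactly the expected number of separators.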

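(2) Create the database

-- Note: DROP DATABASE fails if the database still contains tables;
-- append CASCADE to drop the tables as well.
drop database if exists movie;
create database if not exists movie;
use movie;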

(3) Create the t_user table

-- Five columns, so input.regex has five capture groups.
create table t_user(
    userid bigint,
    sex string,
    age int,
    occupation string,
    zipcode string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;

(4) Create the t_movie table

-- Three columns, so input.regex has three capture groups.
use movie;
create table t_movie(
    movieid bigint,
    moviename string,
    movietype string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;

(5) Create the t_rating table

-- Four columns, so input.regex has four capture groups.
use movie;
create table t_rating(
    userid bigint,
    movieid bigint,
    rate double,
    times string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
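Before loading, it can be worth checking that every line really has the number of ::-separated fields its table expects. RegexSerDe is documented to return NULL columns for rows that do not match input.regex rather than raising an error, so malformed lines are easy to miss. The sketch below counts such lines with awk. The function name count_bad_lines is hypothetical, and the check is an addition for illustration, not a step from the original post.

```shell
#!/bin/sh
# Hypothetical helper: count_bad_lines N
# Reads records on stdin and prints how many lines do NOT have
# exactly N fields when split on "::" (empty lines count as bad).
count_bad_lines() {
    awk -F'::' -v want="$1" 'NF != want { bad++ } END { print bad + 0 }'
}

# Demo: the first line has 4 fields, the second only 3.
printf '%s\n' '1::1193::5::978300760' '1::661::3' | count_bad_lines 4
# prints 1

# Usage on the real files (expected field counts taken from the table definitions):
#   count_bad_lines 5 < users.dat
#   count_bad_lines 3 < movies.dat
#   count_bad_lines 4 < ratings.dat
```

Running wc -l on each file and comparing against the stated totals (6,040, 3,883 and 1,000,209 records) gives a second quick check, which can be repeated with select count(*) after the load.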

(6) Load the data

0: jdbc:hive2://hadoop3:10000> load data local inpath "/home/hadoop/movie/users.dat" into table t_user;
No rows affected (0.928 seconds)
0: jdbc:hive2://hadoop3:10000> load data local inpath "/