数据分析(三)
承接Spark SQL Dataframe API数据处理(二),不过纯sql
1. 寻找热门电影类型(电影标签、演员、导演)
该数据在movies表,表的格式为下图。难点在于要对字符串进行分割。代码需要好好理解一下。
// 这两个参考链接很对理解带代码重要
// Spark并不支持 RECURSIVE 函数
// https://blog.youkuaiyun.com/pjymyself/article/details/81668157
// https://www.bilibili.com/video/BV1644y137iT/?spm_id_from=333.337.search-card.all.click&vd_source=93c837f4a402490d41d605131a2ceb57
select genres,count(*) as genres_num from (
WITH RECURSIVE m(movie_id, name, genres, str) AS (
SELECT movie_id, name, substr(concat(genres,'/'), 1, instr(concat(genres,'/'), '/')-1), substr(concat(genres,'/'), instr(concat(genres,'/'), '/')+1)
FROM douban.movies
UNION ALL
SELECT movie_id, name,substr(str, 1, instr(str, '/')-1), substr(str, instr(str, '/')+1)
FROM m WHERE instr(str, '/')>0
)
SELECT movie_id,name,genres
FROM m
ORDER BY movie_id) as split_view
GROUP BY genres
ORDER BY genres_num desc;
结果图如下图
2.建立两个单独的表,分别存储电影类型与电影标签
分割字符串的逻辑与第一个问题相同
# 垂直分表的逻辑
# 先创建两个表
-- create table douban.genres(
-- id int primary key auto_increment not null,
-- genre varchar(35) not null
-- );
-- create table douban.tags(
-- id int primary key auto_increment not null,
-- tag varchar(35) not null
-- );
# 连接主表
insert into douban.genres(genre)
select distinct genres from (
WITH RECURSIVE m(genres, str) AS (
SELECT substr(concat(genres,'/'), 1, instr(concat(genres,'/'), '/')-1), substr(concat(genres,'/'), instr(concat(genres,'/'), '/')+1)
FROM douban.movies
UNION ALL
SELECT substr(str, 1, instr(str, '/')-1), substr(str, instr(str, '/')+1)
FROM m WHERE instr(str, '/')>0
)
SELECT genres
FROM m
where genres is not null) as split_view
# tags同理
genres表的结果
3. 过滤不合法电影名字
先把字符串分割出来看看
select movie_id, name as full_name,substr(name, 1, instr(name, '-')-1) as pre_name from douban.movies
接下来,只需要把pre_name中有值的替换到full_name就可,如果full_name中存在‘-’,就用pre_name替换。
select movie_id,if(instr(full_name,'-'),pre_name,full_name) as name
from (select movie_id, name as full_name,substr(name, 1, instr(name, '-')-1) as pre_name from douban.movies) as temp;
4.找出每个国家对应的演员数量
数据在person表中
依然需要进行字符串分割,只保留逗号前面的,也需要考虑没有逗号存在的情况。
先处理birthplace列,如果逗号存在就取第一个逗号前面的,不存在就取整体。
SELECT person_id,name,if(instr(birthplace,','),substr(birthplace,1,instr(birthplace,',')-1),birthplace) as country
FROM douban.person
where birthplace is not null;
处理结果图
然后以country进行分组计数,排序。
SELECT country,count(*) as actor_nums from
(SELECT person_id,name,if(instr(birthplace,','),substr(birthplace,1,instr(birthplace,',')-1),birthplace) as country
FROM douban.person
WHERE birthplace is not null) as person_country
GROUP BY country
ORDER BY actor_nums desc;
5.每个用户对电影的平均打分和打分次数,判断用户的打分爱好
数据在ratings表中
SELECT user_md5,avg(rating),count(*) from douban.ratings
GROUP BY user_md5;
ratings表的数据量较大,这句代码的效率偏低,执行时间28s。
6.每部电影的平均打分,判断电影的整体评价
需要联合ratings表和movies表,来获取id与名字的对应。
SELECT r.movie_id,m.name,avg(r.rating)
FROM douban.ratings r inner join douban.movies m on r.movie_id = m.movie_id
GROUP BY movie_id;
7.找出每个用户最早最晚打分时间,判断用户是否为长/短期用户
SELECT user_md5,max(rating_time) as last_time,min(rating_time) as pre_time
FROM douban.ratings
GROUP BY user_md5;
不知道为什么datadiff函数用不起
**8.找出每个用户最喜欢的电影类型(标签、导演、演员)
- 过滤出用户打分 >=3 的ratings行数据
- 统计每个用户点击的电影类型次数
- 根据点击次数排名,取出最高的一项
步骤1的代码简单
SELECT user_md5,movie_id,rating FROM douban.ratings
WHERE rating >= 3;
步骤二需要联合ratings表和movies表
bug:lost connection
SELECT r.user_md5,substr(m.genres,1,2) as genre,count(*) as genre_num FROM douban.ratings as r
INNER JOIN douban.movies as m
ON r.movie_id = m.movie_id and r.rating >= 3 and m.genres is not null
GROUP BY r.user_md5,genre
ORDER BY genre_num;
**9.找出每部电影的topk评论
该数据在comments数据表中
难点在于取出分类后的前几行。
逻辑是通过窗口函数对分组排序后的每一行标记一个递增id,最后过滤出id < k的行就行。
# 窗口函数
select movie_id ,content,votes
from (
select movie_id,content,votes,row_number() over (partition by movie_id order by votes desc) as group_votes_id
from douban.comments
where movie_id is not null and votes is not null) as temp
where temp.group_votes_id < 5;