2021-01-19 Class Quiz 3

屡傻不改

License: CC 4.0 BY-SA

Category column: Class Quizzes

Article link: https://blog.youkuaiyun.com/qianchun22/article/details/112992011

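I. Environment requirements

A development environment with Hadoop, Hive, Spark, and HBase.

II. Submission requirements

1. Submit the source code or the corresponding analysis statements; answers without them receive no points.
2. For tasks that produce analysis results, submit a screenshot of the results together with the code.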

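III. Data description

UserBehavior is a Taobao user-behavior dataset provided by Alibaba. This version contains every action taken by about 5,458 randomly selected users who were active between 2017-09-11 and 2017-12-03. Each row records one action and consists of a user ID, item ID, category ID, behavior type, and timestamp, separated by commas. The Hive tables below name these columns user_id, item_id, category_id, behavior_type, and time.

There are four behavior types: click, purchase, add to cart, and favorite.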
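To make the row layout concrete, here is a minimal Scala sketch that parses one CSV line into a typed record. The names UserBehavior and UserBehaviorParser are hypothetical, the sample values in the usage note are made up, and Long is chosen for the ID fields only as a cautious assumption (the Hive tables below declare them as int).

```scala
import scala.util.Try

// Hypothetical record type; the fields follow the column order described above.
final case class UserBehavior(
  userId: Long,
  itemId: Long,
  categoryId: Long,
  behaviorType: String,
  timestamp: Long // seconds since the Unix epoch
)

object UserBehaviorParser {
  // Returns None for lines that do not have exactly five fields
  // or whose numeric fields cannot be parsed.
  def parse(line: String): Option[UserBehavior] =
    line.split(",", -1) match {
      case Array(u, i, c, b, t) =>
        Try(UserBehavior(u.trim.toLong, i.trim.toLong, c.trim.toLong, b.trim, t.trim.toLong)).toOption
      case _ => None
    }
}
```

For example, UserBehaviorParser.parse("1,100,10,buy,1511544070") yields a record, while a line with a missing field yields None, which makes malformed rows easy to filter out before any counting.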

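IV. Functional requirements

1. Data preparation (10 points)

1) Create the directory /data/userbehavior in HDFS and upload UserBehavior.csv to it. (5 points) The commands below use /app/data/userbehavior instead, and every later step reads from that path.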

[root@hadoop001 software]# hdfs dfs -mkdir -p /app/data/userbehavior

[root@hadoop001 software]# hdfs dfs -put UserBehavior.csv /app/data/userbehavior

2) Use an HDFS command to count how many rows of data the file contains. (5 points)

[root@hadoop001 software]# hdfs dfs -cat /app/data/userbehavior/UserBehavior.csv | wc -l

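2. Data cleaning (40 points)

Create the database exam in Hive. (5 points) The statements below name it exam202010, and the Spark SQL queries in part 4 refer to it by that name.

hive> create database exam202010;
hive> use exam202010;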

Create the external table userbehavior in the exam database and map the HDFS data onto it. (5 points)

hive> create external table userbehavior(
    > user_id int,
    > item_id int,
    > category_id int,
    > behavior_type string,
    > time bigint
    > )
    > row format delimited
    > fields terminated by ','
    > stored as textfile
    > location '/app/data/userbehavior';

Create the namespace exam in HBase (exam202010 in the commands below), and in that namespace create the table userbehavior with one column family, info. (5 points)

hbase(main):047:0> create_namespace 'exam202010'

hbase(main):049:0> create 'exam202010:userbehavior','info'

Create the external table userbehavior_hbase in Hive, map it to the HBase table (5 points), and load the data into HBase (5 points).

hive> create external table userbehavior_hbase(
    > user_id int,
    > item_id int,
    > category_id int,
    > behavior_type string,
    > time bigint
    > )
    > stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > with serdeproperties('hbase.columns.mapping'=':key,info:item_id,info:category_id,info:behavior_type,info:time')
    > tblproperties('hbase.table.name'='exam202010:userbehavior');

hive> insert into userbehavior_hbase select * from userbehavior;
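One thing worth checking after this load: the mapping uses user_id as the HBase row key (`:key`), and HBase keeps one row per key, so several actions by the same user are written to the same row and a default read shows only the most recently written values. The plain-Scala sketch below uses a Map as a stand-in for the HBase table to show the effect. The sample rows are made up, and the composite key format is a hypothetical alternative, not part of the original answer.

```scala
object RowKeySketch {
  // (user_id, item_id, category_id, behavior_type, time): made-up sample rows
  type Row = (Long, Long, Long, String, Long)

  val rows: Seq[Row] = Seq(
    (1L, 100L, 10L, "buy", 1511544070L),
    (1L, 200L, 20L, "buy", 1511561733L),
    (2L, 100L, 10L, "buy", 1511572885L)
  )

  // Keyed by user_id alone, as in the mapping above:
  // a later row with the same key replaces the earlier one.
  val byUserId: Map[String, Row] =
    rows.map(r => r._1.toString -> r).toMap

  // Hypothetical composite key: user_id, time and item_id joined with '_'.
  def compositeKey(r: Row): String = s"${r._1}_${r._5}_${r._2}"

  val byCompositeKey: Map[String, Row] =
    rows.map(r => compositeKey(r) -> r).toMap
}
```

With three input rows, byUserId ends up with two entries while byCompositeKey keeps all three. Comparing a row count of the HBase table with the line count from step 1 is a quick way to see whether this matters for the exam data.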

Create the managed (internal) partitioned table userbehavior_partitioned in the exam database, partitioned by date. Then query userbehavior, format the timestamp as "yyyy-MM-dd HH:mm:ss", and insert the result into userbehavior_partitioned. (15 points)

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;


hive> create table userbehavior_partitioned(
    > user_id int,
    > item_id int,
    > category_id int,
    > behavior_type string,
    > time string
    > )
    > partitioned by (dt string)
    > stored as orc;

hive> insert into userbehavior_partitioned partition(dt)
    > select user_id,item_id,category_id,behavior_type,from_unixtime(time,"yyyy-MM-dd HH:mm:ss") as time,from_unixtime(time,"yyyy-MM-dd") as dt
    > from userbehavior;
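The two from_unixtime calls above turn one epoch value into both the formatted time column and the dt partition value. As a rough illustration of that conversion outside Hive, here is a sketch using java.time. The object name is hypothetical, and Asia/Shanghai is an assumed time zone: Hive formats in the time zone it is configured with, so the partition a row lands in can differ on a cluster set up differently.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object TimeFormatSketch {
  // Assumption: the cluster formats timestamps in China Standard Time.
  val zone: ZoneId = ZoneId.of("Asia/Shanghai")

  private val full = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(zone)
  private val day  = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(zone)

  // Returns (formatted time, partition value dt) for an epoch-seconds timestamp.
  def format(epochSeconds: Long): (String, String) = {
    val instant = Instant.ofEpochSecond(epochSeconds)
    (full.format(instant), day.format(instant))
  }
}
```

Under that assumption, TimeFormatSketch.format(1512230400L) gives ("2017-12-03 00:00:00", "2017-12-03"), and one second earlier falls into the 2017-12-02 partition.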

3. User behavior analysis (20 points)

Using Spark, load UserBehavior.csv from HDFS and use RDDs to complete the following analyses.

scala> val lines = sc.textFile("hdfs://hadoop001:9000/app/data/userbehavior/UserBehavior.csv")

1) Compute the UV value (the number of distinct users who visited Taobao). (10 points)

scala> lines.map(_.split(",")(0)).distinct.count

2) Count the total number of each behavior: click, favorite, add to cart, and purchase. (10 points)

scala> lines.map(_.split(",")).map(x => (x(3),1)).reduceByKey(_+_).collect.foreach(println)
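To check the logic of both RDD jobs without a cluster, the same two computations can be run on a handful of lines with ordinary Scala collections. This is only a sketch: the sample lines are made up, and apart from buy (which the queries in part 4 use) the behavior code shown is a placeholder.

```scala
object BehaviorCountSketch {
  // Made-up sample lines in the layout user_id,item_id,category_id,behavior_type,time
  val lines: Seq[String] = Seq(
    "1,100,10,buy,1511544070",
    "1,200,20,buy,1511561733",
    "2,100,10,buy,1511572885",
    "3,300,30,cart,1511593493"
  )

  val fields: Seq[Array[String]] = lines.map(_.split(","))

  // UV: number of distinct user IDs (on an RDD: map to the first field, then distinct.count).
  val uv: Int = fields.map(_(0)).distinct.size

  // Total per behavior type (on an RDD: map(x => (x(3), 1)).reduceByKey(_ + _)).
  val countsByBehavior: Map[String, Int] =
    fields.groupBy(_(3)).map { case (behavior, rows) => behavior -> rows.size }
}
```

Here uv is 3 because user 1 appears twice, and countsByBehavior holds one total per behavior code.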

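4. Finding valuable users (30 points)

1) Use Spark SQL to compute each user's most recent purchase time. Taking 2017-12-03 as the current date and a one-month window, compute the number of days since each user's last purchase, which falls in the range 0-30 days. Split it into five bands, 0-6, 7-12, 13-18, 19-24, and 25-30 days, scored 4 down to 0 respectively. (15 points)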

scala> val sqlString="""
     |       with temptb as (select user_id,DATEDIFF('2017-12-03',MAX(time)) as ltime from exam202010.userbehavior_partitioned where dt between '2017-11-03' and '2017-12-03' and behavior_type='buy' group by user_id)
     |       select user_id,
     |       (case when ltime between 0 and 6 then 4
     |       when ltime between 7 and 12 then 3
     |       when ltime between 13 and 18 then 2
     |       when ltime between 19 and 24 then 1
     |       when ltime between 25 and 30 then 0
     |       else null end) level
     |       from temptb
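     |       """

scala> spark.sql(sqlString).show()  // assumes spark-shell 2.x, where spark is the predefined SparkSession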
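The CASE expression above is easier to check at its band edges when written as a small function. The sketch below mirrors it in plain Scala; RecencySketch is a hypothetical name, and daysSince is only a stand-in for what DATEDIFF computes on the formatted time strings.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object RecencySketch {
  // Reference date fixed by the task.
  val today: LocalDate = LocalDate.parse("2017-12-03")

  // Whole days between the date part of "yyyy-MM-dd HH:mm:ss" and the reference date.
  def daysSince(lastPurchase: String): Long =
    ChronoUnit.DAYS.between(LocalDate.parse(lastPurchase.take(10)), today)

  // Maps the day gap to a score from 4 down to 0; None mirrors the query's `else null`.
  def score(days: Long): Option[Int] = days match {
    case d if d >= 0 && d <= 6   => Some(4)
    case d if d >= 7 && d <= 12  => Some(3)
    case d if d >= 13 && d <= 18 => Some(2)
    case d if d >= 19 && d <= 24 => Some(1)
    case d if d >= 25 && d <= 30 => Some(0)
    case _                       => None
  }
}
```

Note that the first band covers seven days (0 to 6) while the others cover six, so the bands cannot be collapsed into a single integer division.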

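2) Use Spark SQL to compute each user's purchase frequency. Taking 2017-12-03 as the current date and a one-month window, count each user's purchases. The counts range from 1 to 161; split them into five bands, 1-32, 33-64, 65-96, 97-128, and 129-161, scored 0 to 4 respectively. (15 points)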

scala> val sqlString="""
     |       with temptb as (select user_id,count(item_id) as buycount from exam202010.userbehavior_partitioned where dt between '2017-11-03' and '2017-12-03' and behavior_type='buy' group by user_id)
     |       select user_id,
     |       (case when buycount between 1 and 32 then 0
     |       when buycount between 33 and 64 then 1
     |       when buycount between 65 and 96 then 2
     |       when buycount between 97 and 128 then 3
     |       when buycount between 129 and 161 then 4
     |       else null end) level
     |       from temptb
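     |       """

scala> spark.sql(sqlString).show()  // assumes spark-shell 2.x, where spark is the predefined SparkSession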
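The frequency bands are 32 wide except the last, which absorbs the extra value 161. A minimal sketch of the same mapping as arithmetic, useful for checking the band edges (FrequencySketch is a hypothetical name):

```scala
object FrequencySketch {
  // Maps a purchase count in 1..161 to a score 0..4;
  // None mirrors the query's `else null` for counts outside that range.
  def score(purchases: Int): Option[Int] =
    if (purchases < 1 || purchases > 161) None
    else Some(math.min((purchases - 1) / 32, 4)) // 161 would give 5, so cap at 4
}
```

For instance, 32 and 33 fall on either side of the first boundary and score 0 and 1, while both 129 and 161 score 4.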
