hive窗口函数练习题

有作业问题可私信小刘,欢迎探讨

一、第一套练习

需求:
1、求用户明细并统计每天的用户总数
2、计算从第一天到现在的所有 score 大于80分的用户总数
3、计算每个用户到当前日期分数大于80的天数

test_window.txt数据:

20191020,11111,85
20191020,22222,83
20191020,33333,86
20191021,11111,87
20191021,22222,65
20191021,33333,98
20191022,11111,67
20191022,22222,34
20191022,33333,88
20191023,11111,99
20191023,22222,33

建表:

0: jdbc:hive2://hadoop:11240> create table test_window(logday string,userid string,score int)
. . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . > fields terminated by ',';
0: jdbc:hive2://hadoop:11240> load data local inpath '/home/xiaokang/hivedata/test_window.txt'
. . . . . . . . . . . . . . > into table test_window;
0: jdbc:hive2://hadoop:11240> select * from test_window;
+---------------------+---------------------+--------------------+
| test_window.logday  | test_window.userid  | test_window.score  |
+---------------------+---------------------+--------------------+
| 20191020            | 11111               | 85                 |
| 20191020            | 22222               | 83                 |
| 20191020            | 33333               | 86                 |
| 20191021            | 11111               | 87                 |
| 20191021            | 22222               | 65                 |
| 20191021            | 33333               | 98                 |
| 20191022            | 11111               | 67                 |
| 20191022            | 22222               | 34                 |
| 20191022            | 33333               | 88                 |
| 20191023            | 11111               | 99                 |
| 20191023            | 22222               | 33                 |
+---------------------+---------------------+--------------------+

1、求用户明细并统计每天的用户总数

0: jdbc:hive2://hadoop:11240> select *,count()over(partition by logday)as day_total from test_window;
+---------------------+---------------------+--------------------+------------+
| test_window.logday  | test_window.userid  | test_window.score  | day_total  |
+---------------------+---------------------+--------------------+------------+
| 20191020            | 33333               | 86                 | 3          |
| 20191020            | 22222               | 83                 | 3          |
| 20191020            | 11111               | 85                 | 3          |
| 20191021            | 33333               | 98                 | 3          |
| 20191021            | 22222               | 65                 | 3          |
| 20191021            | 11111               | 87                 | 3          |
| 20191022            | 33333               | 88                 | 3          |
| 20191022            | 22222               | 34                 | 3          |
| 20191022            | 11111               | 67                 | 3          |
| 20191023            | 22222               | 33                 | 2          |
| 20191023            | 11111               | 99                 | 2          |
+---------------------+---------------------+--------------------+------------+

2、计算从第一天到现在的所有 score 大于80分的用户总数

0: jdbc:hive2://hadoop:11240> select *,count()over(order by logday rows between unbounded preceding and current row) as total from test_window where score>80;
+---------------------+---------------------+--------------------+--------+
| test_window.logday  | test_window.userid  | test_window.score  | total  |
+---------------------+---------------------+--------------------+--------+
| 20191020            | 33333               | 86                 | 1      |
| 20191020            | 22222               | 83                 | 2      |
| 20191020            | 11111               | 85                 | 3      |
| 20191021            | 33333               | 98                 | 4      |
| 20191021            | 11111               | 87                 | 5      |
| 20191022            | 33333               | 88                 | 6      |
| 20191023            | 11111               | 99                 | 7      |
+---------------------+---------------------+--------------------+--------+

3、计算每个用户到当前日期分数大于80的天数

0: jdbc:hive2://hadoop:11240> select *,count()over(partition by userid order by logday rows between unbounded preceding and current row) as total
. . . . . . . . . . . . . . > from test_window where score>80 order by logday,userid;
+---------------------+---------------------+--------------------+--------+
| test_window.logday  | test_window.userid  | test_window.score  | total  |
+---------------------+---------------------+--------------------+--------+
| 20191020            | 11111               | 85                 | 1      |
| 20191020            | 22222               | 83                 | 1      |
| 20191020            | 33333               | 86                 | 1      |
| 20191021            | 11111               | 87                 | 2      |
| 20191021            | 33333               | 98                 | 2      |
| 20191022            | 33333               | 88                 | 3      |
| 20191023            | 11111               | 99                 | 3      |
+---------------------+---------------------+--------------------+--------+

二、第二套练习

需求:
1、查询在2017年4月份购买过的顾客及总人数
2、查询顾客的购买明细及月购买总额
3、查询顾客的购买明细及到目前为止每个顾客购买总金额
4、查询顾客上次的购买时间----lag()over()偏移量分析函数的运用

数据:

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

建表:

0: jdbc:hive2://hadoop:11240> create table business(name string,orderdate string,cost int)
. . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . > fields terminated by ',';
0: jdbc:hive2://hadoop:11240> load data local inpath "/home/xiaokang/hivedata/business.txt"
. . . . . . . . . . . . . . > into table business;
0: jdbc:hive2://hadoop:11240> select * from business;
+----------------+---------------------+----------------+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+
| jack           | 2017-01-01          | 10             |
| tony           | 2017-01-02          | 15             |
| jack           | 2017-02-03          | 23             |
| tony           | 2017-01-04          | 29             |
| jack           | 2017-01-05          | 46             |
| jack           | 2017-04-06          | 42             |
| tony           | 2017-01-07          | 50             |
| jack           | 2017-01-08          | 55             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| neil           | 2017-05-10          | 12             |
| mart           | 2017-04-11          | 75             |
| neil           | 2017-06-12          | 80             |
| mart           | 2017-04-13          | 94             |
+----------------+---------------------+----------------+

1、查询在2017年4月份购买过的顾客及总人数

在本例中:

  • over()必须跟在聚合函数(本例中count())后面,over()叫做开窗函数。
  • 开窗的意义在于它开了一个窗口,这个窗口叫做数据集。
  • 开窗的作用范围:仅仅是给前面的聚合函数count()使用的
  • 开窗等于开一部分数据集出来
  • over()中为空,表示对整个数据集开窗
0: jdbc:hive2://hadoop:11240> select name,count(*) over()
. . . . . . . . . . . . . . > from business
. . . . . . . . . . . . . . > where substring(orderdate,1,7)='2017-04'
+-------+-----------------+
| name  | count_window_0  |
+-------+-----------------+
| mart  | 5               |
| mart  | 5               |
| mart  | 5               |
| mart  | 5               |
| jack  | 5               |
+-------+-----------------+

2、查询顾客的购买明细及所有顾客的购买总额
所有人的花费求和

0: jdbc:hive2://hadoop:11240> select *,sum(cost)over() 
. . . . . . . . . . . . . . > from business;
+----------------+---------------------+----------------+---------------+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+
| mart           | 2017-04-13          | 94             | 661           |
| neil           | 2017-06-12          | 80             | 661           |
| mart           | 2017-04-11          | 75             | 661           |
| neil           | 2017-05-10          | 12             | 661           |
| mart           | 2017-04-09          | 68             | 661           |
| mart           | 2017-04-08          | 62             | 661           |
| jack           | 2017-01-08          | 55             | 661           |
| tony           | 2017-01-07          | 50             | 661           |
| jack           | 2017-04-06          | 42             | 661           |
| jack           | 2017-01-05          | 46             | 661           |
| tony           | 2017-01-04          | 29             | 661           |
| jack           | 2017-02-03          | 23             | 661           |
| tony           | 2017-01-02          | 15             | 661           |
| jack           | 2017-01-01          | 10             | 661           |
+----------------+---------------------+----------------+---------------+

3、查询顾客的购买明细以及每位顾客的总花费
按人分组求和

0: jdbc:hive2://hadoop:11240> select *,sum(cost) over(distribute by name)
. . . . . . . . . . . . . . > from business;
+----------------+---------------------+----------------+---------------+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+
| jack           | 2017-01-05          | 46             | 176           |
| jack           | 2017-01-08          | 55             | 176           |
| jack           | 2017-01-01          | 10             | 176           |
| jack           | 2017-04-06          | 42             | 176           |
| jack           | 2017-02-03          | 23             | 176           |
| mart           | 2017-04-13          | 94             | 299           |
| mart           | 2017-04-11          | 75             | 299           |
| mart           | 2017-04-09          | 68             | 299           |
| mart           | 2017-04-08          | 62             | 299           |
| neil           | 2017-05-10          | 12             | 92            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-04          | 29             | 94            |
| tony           | 2017-01-02          | 15             | 94            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+

4、查询顾客的购买明细及到目前为止每个顾客购买总金额
按人分组,按时间排序,花费累加

# 方法一:
0: jdbc:hive2://hadoop:11240> select * ,sum(cost) over(distribute by name sort by orderdate)
. . . . . . . . . . . . . . > from business;
+----------------+---------------------+----------------+---------------+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+
#方法二:
0: jdbc:hive2://hadoop:11240> select *,
. . . . . . . . . . . . . . > sum(cost)
. . . . . . . . . . . . . . > over(partition by name
. . . . . . . . . . . . . . > order by orderdate rows between unbounded preceding and current row)
. . . . . . . . . . . . . . > as total_amount
. . . . . . . . . . . . . . > from business;
+----------------+---------------------+----------------+---------------+
| business.name  | business.orderdate  | business.cost  | total_amount  |
+----------------+---------------------+----------------+---------------+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+

5、查询顾客上次的购买时间----lag()over()偏移量分析函数的运用

0: jdbc:hive2://hadoop:11240> select *,
#如果上次的购买时间为null,将其处理为1970-01-01
. . . . . . . . . . . . . . > lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) last_date
. . . . . . . . . . . . . . > from business;
+----------------+---------------------+----------------+-------------+
| business.name  | business.orderdate  | business.cost  |  last_date  |
+----------------+---------------------+----------------+-------------+
| jack           | 2017-01-01          | 10             | 1970-01-01  |
| jack           | 2017-01-05          | 46             | 2017-01-01  |
| jack           | 2017-01-08          | 55             | 2017-01-05  |
| jack           | 2017-02-03          | 23             | 2017-01-08  |
| jack           | 2017-04-06          | 42             | 2017-02-03  |
| mart           | 2017-04-08          | 62             | 1970-01-01  |
| mart           | 2017-04-09          | 68             | 2017-04-08  |
| mart           | 2017-04-11          | 75             | 2017-04-09  |
| mart           | 2017-04-13          | 94             | 2017-04-11  |
| neil           | 2017-05-10          | 12             | 1970-01-01  |
| neil           | 2017-06-12          | 80             | 2017-05-10  |
| tony           | 2017-01-02          | 15             | 1970-01-01  |
| tony           | 2017-01-04          | 29             | 2017-01-02  |
| tony           | 2017-01-07          | 50             | 2017-01-04  |
+----------------+---------------------+----------------+-------------+

6、查询顾客下一次的购买时间

0: jdbc:hive2://hadoop:11240> select *,
. . . . . . . . . . . . . . > lead(orderdate,1,'9999-99-99') over(partition by name order by orderdate) last_date
. . . . . . . . . . . . . . > from business;
+----------------+---------------------+----------------+-------------+
| business.name  | business.orderdate  | business.cost  |  last_date  |
+----------------+---------------------+----------------+-------------+
| jack           | 2017-01-01          | 10             | 2017-01-05  |
| jack           | 2017-01-05          | 46             | 2017-01-08  |
| jack           | 2017-01-08          | 55             | 2017-02-03  |
| jack           | 2017-02-03          | 23             | 2017-04-06  |
| jack           | 2017-04-06          | 42             | 9999-99-99  |
| mart           | 2017-04-08          | 62             | 2017-04-09  |
| mart           | 2017-04-09          | 68             | 2017-04-11  |
| mart           | 2017-04-11          | 75             | 2017-04-13  |
| mart           | 2017-04-13          | 94             | 9999-99-99  |
| neil           | 2017-05-10          | 12             | 2017-06-12  |
| neil           | 2017-06-12          | 80             | 9999-99-99  |
| tony           | 2017-01-02          | 15             | 2017-01-04  |
| tony           | 2017-01-04          | 29             | 2017-01-07  |
| tony           | 2017-01-07          | 50             | 9999-99-99  |
+----------------+---------------------+----------------+-------------+

7、查询顾客购买明细及月购买总额
先按姓名进行分组,再按月份进行分组

hive> select *,
. . > sum(cost)over(partition by name,substr(orderdate,1,7)) as total
. . > from business;

在这里插入图片描述
8、查询日期前20%的订单信息
先按日期排序,并分成5组

hive> select *,
. . > ntile(5)over(order by orderdate) as sortgroup_num
. . > from business;

在这里插入图片描述

再查询出分组号为1的记录

hive> select * from
. . > (select *,
. . > ntile(5)over(order by orderdate) as sortgroup_num
. . > from business) t
. . > where t.sortgroup_num = 1;

在这里插入图片描述

三、第三套练习

需求:
1、每门学科学生成绩排名(是否并列排名、空位排名三种实现)
2、每门学科成绩排名top n的学生

score.txt

name	subject	score
孙悟空	语文	87
孙悟空	数学	95
孙悟空	英语	68
大海	语文	94
大海	数学	56
大海	英语	84
宋宋	语文	64
宋宋	数学	86
宋宋	英语	84
婷婷	语文	65
婷婷	数学	85
婷婷	英语	78

建表:

0: jdbc:hive2://hadoop:11240> create table score(name string,subject string,score int)
. . . . . . . . . . . . . . > row format delimited fields terminated by "\t";
0: jdbc:hive2://hadoop:11240> load data local inpath '/home/xiaokang/hivedata/score.txt' into table score;
0: jdbc:hive2://hadoop:11240> select * from score;
+-------------+----------------+--------------+
| score.name  | score.subject  | score.score  |
+-------------+----------------+--------------+
| 孙悟空         | 语文             | 87           |
| 孙悟空         | 数学             | 95           |
| 孙悟空         | 英语             | 68           |
| 大海          | 语文             | 94           |
| 大海          | 数学             | 56           |
| 大海          | 英语             | 84           |
| 宋宋          | 语文             | 64           |
| 宋宋          | 数学             | 86           |
| 宋宋          | 英语             | 84           |
| 婷婷          | 语文             | 65           |
| 婷婷          | 数学             | 85           |
| 婷婷          | 英语             | 78           |
+-------------+----------------+--------------+

1、每门学科学生成绩排名(是否并列排名、空位排名三种实现)

  • row_number()按照值排序时产生一个自增编号,不会重复(如:1、2、3、4、5、6)
  • rank() 按照值排序时产生一个自增编号,值相等时会重复,会产生空位(如:1、2、3、3、3、6)
  • dense_rank() 按照值排序时产生一个自增编号,值相等时会重复,不会产生空位(如:1、2、3、3、3、4)
0: jdbc:hive2://hadoop:11240> select *,
. . . . . . . . . . . . . . > row_number()over(partition by subject order by score desc) as row_number_method,
. . . . . . . . . . . . . . > rank()over(partition by subject order by score desc) as rank_method,
. . . . . . . . . . . . . . > dense_rank()over(partition by subject order by score desc) as dense_rank_method
. . . . . . . . . . . . . . > from score;
+-------------+----------------+--------------+--------------------+--------------+--------------------+
| score.name  | score.subject  | score.score  | row_number_method  | rank_method  | dense_rank_method  |
+-------------+----------------+--------------+--------------------+--------------+--------------------+
| 孙悟空         | 数学             | 95           | 1                  | 1            | 1                  |
| 宋宋          | 数学             | 86           | 2                  | 2            | 2                  |
| 婷婷          | 数学             | 85           | 3                  | 3            | 3                  |
| 大海          | 数学             | 56           | 4                  | 4            | 4                  |
| 宋宋          | 英语             | 84           | 1                  | 1            | 1                  |
| 大海          | 英语             | 84           | 2                  | 1            | 1                  |
| 婷婷          | 英语             | 78           | 3                  | 3            | 2                  |
| 孙悟空         | 英语             | 68           | 4                  | 4            | 3                  |
| 大海          | 语文             | 94           | 1                  | 1            | 1                  |
| 孙悟空         | 语文             | 87           | 2                  | 2            | 2                  |
| 婷婷          | 语文             | 65           | 3                  | 3            | 3                  |
| 宋宋          | 语文             | 64           | 4                  | 4            | 4                  |
+-------------+----------------+--------------+--------------------+--------------+--------------------+

2、每门学科成绩排名前三的学生

0: jdbc:hive2://hadoop:11240> select * from (
. . . . . . . . . . . . . . > select *,
. . . . . . . . . . . . . . > row_number() over(partition by subject order by score desc) as rmp
. . . . . . . . . . . . . . > from score
. . . . . . . . . . . . . . > ) as t
. . . . . . . . . . . . . . > where t.rmp<=3;
+---------+------------+----------+--------+
| t.name  | t.subject  | t.score  | t.rmp  |
+---------+------------+----------+--------+
| 孙悟空     | 数学         | 95       | 1      |
| 宋宋      | 数学         | 86       | 2      |
| 婷婷      | 数学         | 85       | 3      |
| 宋宋      | 英语         | 84       | 1      |
| 大海      | 英语         | 84       | 2      |
| 婷婷      | 英语         | 78       | 3      |
| 大海      | 语文         | 94       | 1      |
| 孙悟空     | 语文         | 87       | 2      |
| 婷婷      | 语文         | 65       | 3      |
+---------+------------+----------+--------+
1.上传tar包 2.解压 tar -zxvf hive-1.2.1.tar.gz 3.安装mysql数据库 推荐yum 在线安装 4.配置hive (a)配置HIVE_HOME环境变量 vi conf/hive-env.sh 配置其中的$hadoop_home (b)配置元数据库信息 vi hive-site.xml 添加如下内容: javax.jdo.option.ConnectionURL jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true JDBC connect string for a JDBC metastore javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver Driver class name for a JDBC metastore javax.jdo.option.ConnectionUserName root username to use against metastore database javax.jdo.option.ConnectionPassword hadoop password to use against metastore database 5.安装hive和mysq完成后,将mysql的连接jar包拷贝到$HIVE_HOME/lib目录下 如果出现没有权限的问题,在mysql授权(在安装mysql的机器上执行) mysql -uroot -p #(执行下面的语句 *.*:所有库下的所有表 %:任何IP地址或主机都可以连接) GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION; FLUSH PRIVILEGES; 6. Jline包版本不一致的问题,需要拷贝hive的lib目录中jline.2.12.jar的jar包替换掉hadoop中的 /home/hadoop/app/hadoop-2.6.4/share/hadoop/yarn/lib/jline-0.9.94.jar 启动hive bin/hive ---------------------------------------------------------------------------------------------------- Hive几种使用方式: 1.Hive交互shell bin/hive 2.Hive JDBC服务(参考java jdbc连接mysql) 3.hive启动为一个服务器,来对外提供服务 bin/hiveserver2 nohup bin/hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err & 启动成功后,可以在别的节点上用beeline去连接 bin/beeline -u jdbc:hive2://mini1:10000 -n root 或者 bin/beeline ! connect jdbc:hive2://mini1:10000 4.Hive命令 hive -e ‘sql’ bin/hive -e 'select * from t_test'
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

小刘新鲜事儿

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值