Hive Essentials: Complex Data Types, Rows to Columns, Columns to Rows

This article introduces Hive's complex data types, covering how to use ARRAY, MAP, and STRUCT and how to handle them in queries. It then walks through rows-to-columns and columns-to-rows transformations, with sample data and worked examples.

1.1 Complex Data Types

For complex types, the two things to master are how to store the data (define the columns when creating the table, then load the data) and how to read it back (the common query functions for each type).
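
All three complex types are declared with the same ROW FORMAT DELIMITED clauses, which the examples below use one by one. As a quick orientation, here is a minimal sketch (the table name delimiter_demo is hypothetical) showing how the three delimiter levels nest:

create table delimiter_demo (
  id    int,
  tags  array<string>,
  props map<string,string>
)
row format delimited
fields terminated by ','              -- level 1: between top-level columns
collection items terminated by '#'    -- level 2: between array elements / map entries / struct fields
map keys terminated by ':';           -- level 3: between key and value within one map entry
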
ARRAY<data_type>: a collection type; every element must have the same data type.
Sample data:

[ruoze@hadoop001 data]$ cat hive_array.txt 
pk      beijing,shanghai,tianjin,hangzhou
jepson  changchu,chengdu,wuhan,beijing
[ruoze@hadoop001 data]$ 

Basic usage:

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> create table hive_array(name string, work_locations array<string>)
. . . . . . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . . . . . > fields terminated by '\t'
. . . . . . . . . . . . . . . . . . . . . . > collection items terminated by ',';
No rows affected (0.13 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> load data local inpath '/home/ruoze/data/hive_array.txt' overwrite into table hive_array; 
No rows affected (0.357 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from hive_array limit 3;
+------------------+----------------------------------------------+--+
| hive_array.name  |          hive_array.work_locations           |
+------------------+----------------------------------------------+--+
| pk               | ["beijing","shanghai","tianjin","hangzhou"]  |
| jepson           | ["changchu","chengdu","wuhan","beijing"]     |
+------------------+----------------------------------------------+--+
2 rows selected (0.066 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select name ,work_locations[0] 
. . . . . . . . . . . . . . . . . . . . . . > from hive_array limit 3;
+---------+-----------+--+
|  name   |    _c1    |
+---------+-----------+--+
| pk      | beijing   |
| jepson  | changchu  |
+---------+-----------+--+
2 rows selected (0.078 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select name ,size(work_locations) 
. . . . . . . . . . . . . . . . . . . . . . > from hive_array limit 3;
+---------+------+--+
|  name   | _c1  |
+---------+------+--+
| pk      | 4    |
| jepson  | 4    |
+---------+------+--+
2 rows selected (0.077 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from hive_array where array_contains(work_locations,'beijing') limit 5;
+------------------+----------------------------------------------+--+
| hive_array.name  |          hive_array.work_locations           |
+------------------+----------------------------------------------+--+
| pk               | ["beijing","shanghai","tianjin","hangzhou"]  |
| jepson           | ["changchu","chengdu","wuhan","beijing"]     |
+------------------+----------------------------------------------+--+
2 rows selected (0.236 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>
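
Two more ARRAY behaviors are worth knowing (a sketch against the same hive_array table, not part of the original session): an out-of-range index returns NULL rather than raising an error, and sort_array orders the elements ascending.

select name, work_locations[9] from hive_array;           -- out-of-range index yields NULL
select name, sort_array(work_locations) from hive_array;  -- elements in ascending order
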

MAP<primitive_type, data_type>: key-value pairs; the key must be a primitive type.
Sample data:

[ruoze@hadoop001 data]$ cat hive_map.txt 
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
[ruoze@hadoop001 data]$ 

Basic usage:

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> create table hive_map(
. . . . . . . . . . . . . . . . . . . . . . > id int,
. . . . . . . . . . . . . . . . . . . . . . > name string,
. . . . . . . . . . . . . . . . . . . . . . > members map<string,string>,
. . . . . . . . . . . . . . . . . . . . . . > age int
. . . . . . . . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . . . . . . . > row format delimited 
. . . . . . . . . . . . . . . . . . . . . . > fields terminated by ','
. . . . . . . . . . . . . . . . . . . . . . > collection items terminated by '#'
. . . . . . . . . . . . . . . . . . . . . . > map keys terminated by ':';
No rows affected (0.083 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> load data local inpath '/home/ruoze/data/hive_map.txt' overwrite into table hive_map;
No rows affected (0.222 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from hive_map limit 3;
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| hive_map.id  | hive_map.name  |                        hive_map.members                        | hive_map.age  |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}  | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"}       | 22            |
| 3            | wangwu         | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"}  | 29            |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
3 rows selected (0.073 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,members['father'] as father,members['mother'] as mother,age 
. . . . . . . . . . . . . . . . . . . . . . > from hive_map;
+-----+-----------+--------------+-------------+------+--+
| id  |   name    |    father    |   mother    | age  |
+-----+-----------+--------------+-------------+------+--+
| 1   | zhangsan  | xiaoming     | xiaohuang   | 28   |
| 2   | lisi      | mayun        | huangyi     | 22   |
| 3   | wangwu    | wangjianlin  | ruhua       | 29   |
| 4   | mayun     | mayongzhen   | angelababy  | 26   |
+-----+-----------+--------------+-------------+------+--+
4 rows selected (0.079 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,map_keys(members) as relation, age 
. . . . . . . . . . . . . . . . . . . . . . > from hive_map;
+-----+-----------+--------------------------------+------+--+
| id  |   name    |            relation            | age  |
+-----+-----------+--------------------------------+------+--+
| 1   | zhangsan  | ["father","mother","brother"]  | 28   |
| 2   | lisi      | ["father","mother","brother"]  | 22   |
| 3   | wangwu    | ["father","mother","sister"]   | 29   |
| 4   | mayun     | ["father","mother"]            | 26   |
+-----+-----------+--------------------------------+------+--+
4 rows selected (0.061 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,map_values(members) as names, age 
. . . . . . . . . . . . . . . . . . . . . . > from hive_map;
+-----+-----------+-------------------------------------+------+--+
| id  |   name    |                names                | age  |
+-----+-----------+-------------------------------------+------+--+
| 1   | zhangsan  | ["xiaoming","xiaohuang","xiaoxu"]   | 28   |
| 2   | lisi      | ["mayun","huangyi","guanyu"]        | 22   |
| 3   | wangwu    | ["wangjianlin","ruhua","jingtian"]  | 29   |
| 4   | mayun     | ["mayongzhen","angelababy"]         | 26   |
+-----+-----------+-------------------------------------+------+--+
4 rows selected (0.056 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,size(members) as msize 
. . . . . . . . . . . . . . . . . . . . . . > from hive_map;
+-----+-----------+--------+--+
| id  |   name    | msize  |
+-----+-----------+--------+--+
| 1   | zhangsan  | 3      |
| 2   | lisi      | 3      |
| 3   | wangwu    | 3      |
| 4   | mayun     | 2      |
+-----+-----------+--------+--+
4 rows selected (0.068 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>  # people who have a brother, and the brother's name (approach 1)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,members['brother'] as brother  
. . . . . . . . . . . . . . . . . . . . . . > from hive_map  
. . . . . . . . . . . . . . . . . . . . . . > where array_contains(map_keys(members),'brother') ;
+-----+-----------+---------+--+
| id  |   name    | brother |
+-----+-----------+---------+--+
| 1   | zhangsan  | xiaoxu  |
| 2   | lisi      | guanyu  |
+-----+-----------+---------+--+
2 rows selected (0.07 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>     
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>  # people who have a brother, and the brother's name (direct lookup)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,members['brother'] as brother  
. . . . . . . . . . . . . . . . . . . . . . > from hive_map  
. . . . . . . . . . . . . . . . . . . . . . > where members['brother'] is not null;
+-----+-----------+----------+--+
| id  |   name    | brother  |
+-----+-----------+----------+--+
| 1   | zhangsan  | xiaoxu   |
| 2   | lisi      | guanyu   |
+-----+-----------+----------+--+
2 rows selected (0.056 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> # people who have a brother, and the brother's name (via a subquery)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select id,name,members['brother'] as brother
. . . . . . . . . . . . . . . . . . . . . . > from (
. . . . . . . . . . . . . . . . . . . . . . > select id,name,members from hive_map where members['brother'] is not null
. . . . . . . . . . . . . . . . . . . . . . > ) t;
+-----+-----------+----------+--+
| id  |   name    | brother  |
+-----+-----------+----------+--+
| 1   | zhangsan  | xiaoxu   |
| 2   | lisi      | guanyu   |
+-----+-----------+----------+--+
2 rows selected (0.049 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
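
Maps do not have to come from files; they can also be built inside a query. A minimal sketch using the built-in map() and str_to_map() functions (the literal values are illustrative only, and a FROM-less SELECT requires Hive 0.13+):

select map('father', 'xiaoming', 'mother', 'xiaohuang');          -- alternating key, value arguments
select str_to_map('father:xiaoming#mother:xiaohuang', '#', ':');  -- entry delimiter '#', key-value delimiter ':'
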

STRUCT<col_name : data_type, ...>: the fields inside a struct may have different data types.
Sample data:

[ruoze@hadoop001 data]$ cat hive_struct.txt 
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70
[ruoze@hadoop001 data]$ 

Basic usage:

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> create table hive_struct(
. . . . . . . . . . . . . . . . . . . . . . > ip string,
. . . . . . . . . . . . . . . . . . . . . . > info struct<name:string,age:int>
. . . . . . . . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . . . . . . . > row format delimited
. . . . . . . . . . . . . . . . . . . . . . > fields terminated by '#'
. . . . . . . . . . . . . . . . . . . . . . > collection items terminated by ':';
No rows affected (0.937 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> load data local inpath '/home/ruoze/data/hive_struct.txt' overwrite into table hive_struct;
No rows affected (0.778 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from hive_struct;
+-----------------+-------------------------------+--+
| hive_struct.ip  |       hive_struct.info        |
+-----------------+-------------------------------+--+
| 192.168.1.1     | {"name":"zhangsan","age":40}  |
| 192.168.1.2     | {"name":"lisi","age":50}      |
| 192.168.1.3     | {"name":"wangwu","age":60}    |
| 192.168.1.4     | {"name":"zhaoliu","age":70}   |
+-----------------+-------------------------------+--+
4 rows selected (0.246 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select ip, info.name, info.age 
. . . . . . . . . . . . . . . . . . . . . . > from hive_struct;
+--------------+-----------+------+--+
|      ip      |   name    | age  |
+--------------+-----------+------+--+
| 192.168.1.1  | zhangsan  | 40   |
| 192.168.1.2  | lisi      | 50   |
| 192.168.1.3  | wangwu    | 60   |
| 192.168.1.4  | zhaoliu   | 70   |
+--------------+-----------+------+--+
4 rows selected (0.13 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
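
Structs can likewise be assembled on the fly: struct() auto-names the fields col1, col2, ..., while named_struct() lets you choose the names. A sketch against the same table (the field names who and years are made up):

select struct(info.name, info.age) as s1,
       named_struct('who', info.name, 'years', info.age) as s2
from hive_struct;
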

1.2 Special Types Generated During Queries

Sample data:

[ruoze@hadoop001 data]$ cat ad_list.txt 
ad_101  http://www.google.com   catalog8|catalog1
ad_102  http://www.sohu.com     catalog6|catalog3
ad_103  http://www.baidu.com    catalog7
ad_104  http://www.qq.com       catalog5|catalog1|catalog4|catalog9
ad_105  http://sina.com
[ruoze@hadoop001 data]$
[ruoze@hadoop001 data]$ cat click_log.txt 
11      ad_101  2014-05-01 06:01:12.334+01
22      ad_102  2014-05-01 07:28:12.342+01
33      ad_103  2014-05-01 07:50:12.33+01
11      ad_104  2014-05-01 09:27:12.33+01
22      ad_103  2014-05-01 09:03:12.324+01
33      ad_102  2014-05-02 19:10:12.343+01
11      ad_101  2014-05-02 09:07:12.344+01
35      ad_105  2014-05-03 11:07:12.339+01
22      ad_104  2014-05-03 12:59:12.743+01
77      ad_103  2014-05-03 18:04:12.355+01
99      ad_102  2014-05-04 00:36:39.713+01
33      ad_101  2014-05-04 19:10:12.343+01
11      ad_101  2014-05-05 09:07:12.344+01
35      ad_102  2014-05-05 11:07:12.339+01
22      ad_103  2014-05-05 12:59:12.743+01
77      ad_104  2014-05-05 18:04:12.355+01
99      ad_105  2014-05-05 20:36:39.713+01
[ruoze@hadoop001 data]$ 

Create the tables and load the data:

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> create table ad_list(
. . . . . . . . . . . . . . . . . . . . . . > ad_id string,
. . . . . . . . . . . . . . . . . . . . . . > url string,
. . . . . . . . . . . . . . . . . . . . . . > catalogs string
. . . . . . . . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . . . . . . . > row format delimited 
. . . . . . . . . . . . . . . . . . . . . . > fields terminated by '\t';
No rows affected (0.169 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> load data local inpath '/home/ruoze/data/ad_list.txt' overwrite into table ad_list;
No rows affected (0.135 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from ad_list;
+----------------+------------------------+--------------------------------------+--+
| ad_list.ad_id  |      ad_list.url       |           ad_list.catalogs           |
+----------------+------------------------+--------------------------------------+--+
| ad_101         | http://www.google.com  | catalog8|catalog1                    |
| ad_102         | http://www.sohu.com    | catalog6|catalog3                    |
| ad_103         | http://www.baidu.com   | catalog7                             |
| ad_104         | http://www.qq.com      | catalog5|catalog1|catalog4|catalog9  |
| ad_105         | http://sina.com        | NULL                                 |
+----------------+------------------------+--------------------------------------+--+
5 rows selected (0.18 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> create table click_log (
. . . . . . . . . . . . . . . . . . . . . . > cookie_id string,
. . . . . . . . . . . . . . . . . . . . . . > ad_id string,
. . . . . . . . . . . . . . . . . . . . . . > time string
. . . . . . . . . . . . . . . . . . . . . . > )
. . . . . . . . . . . . . . . . . . . . . . > row format delimited 
. . . . . . . . . . . . . . . . . . . . . . > fields terminated by '\t';
No rows affected (0.08 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> load data local inpath '/home/ruoze/data/click_log.txt' overwrite into table click_log;
No rows affected (0.16 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from click_log limit 5;
+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id  | click_log.ad_id  |       click_log.time        |
+----------------------+------------------+-----------------------------+--+
| 11                   | ad_101           | 2014-05-01 06:01:12.334+01  |
| 22                   | ad_102           | 2014-05-01 07:28:12.342+01  |
| 33                   | ad_103           | 2014-05-01 07:50:12.33+01   |
| 11                   | ad_104           | 2014-05-01 09:27:12.33+01   |
| 22                   | ad_103           | 2014-05-01 09:03:12.324+01  |
+----------------------+------------------+-----------------------------+--+
5 rows selected (0.064 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 

Examples

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>  # number of times each user clicked each ad, plus the ad's catalogs
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select click.cookie_id,click.ad_id,click.amount,ad_list.catalogs
. . . . . . . . . . . . . . . . . . . . . . > from (
. . . . . . . . . . . . . . . . . . . . . . > select cookie_id, ad_id, count(1) amount from click_log group by cookie_id, ad_id
. . . . . . . . . . . . . . . . . . . . . . > ) click join ad_list
. . . . . . . . . . . . . . . . . . . . . . > on ad_list.ad_id = click.ad_id;
2019-12-19 05:07:03     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/ruoze/5353762b-cd3b-415c-9fc7-58e47c9350e2/hive_2019-12-19_17-06-42_607_4705675263615256455-2/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2019-12-19 05:07:03     Uploaded 1 File to: file:/tmp/ruoze/5353762b-cd3b-415c-9fc7-58e47c9350e2/hive_2019-12-19_17-06-42_607_4705675263615256455-2/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (466 bytes)
2019-12-19 05:07:03     End of local task; Time Taken: 0.899 sec.
+------------------+--------------+---------------+--------------------------------------+--+
| click.cookie_id  | click.ad_id  | click.amount  |           ad_list.catalogs           |
+------------------+--------------+---------------+--------------------------------------+--+
| 11               | ad_101       | 3             | catalog8|catalog1                    |
| 11               | ad_104       | 1             | catalog5|catalog1|catalog4|catalog9  |
| 22               | ad_102       | 1             | catalog6|catalog3                    |
| 22               | ad_103       | 2             | catalog7                             |
| 22               | ad_104       | 1             | catalog5|catalog1|catalog4|catalog9  |
| 33               | ad_101       | 1             | catalog8|catalog1                    |
| 33               | ad_102       | 1             | catalog6|catalog3                    |
| 33               | ad_103       | 1             | catalog7                             |
| 35               | ad_102       | 1             | catalog6|catalog3                    |
| 35               | ad_105       | 1             | NULL                                 |
| 77               | ad_103       | 1             | catalog7                             |
| 77               | ad_104       | 1             | catalog5|catalog1|catalog4|catalog9  |
| 99               | ad_102       | 1             | catalog6|catalog3                    |
| 99               | ad_105       | 1             | NULL                                 |
+------------------+--------------+---------------+--------------------------------------+--+
14 rows selected (33.439 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
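
The same join can also be aggregated the other way round, for example total clicks per ad together with its catalogs (a sketch, not part of the original session; ad_105 would appear with NULL catalogs):

select a.ad_id, a.catalogs, count(1) as clicks
from click_log c join ad_list a on a.ad_id = c.ad_id
group by a.ad_id, a.catalogs;
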

1.3 Rows to Columns & Columns to Rows

Rows to columns (collapsing many rows into one column value with collect_set / collect_list / concat_ws)

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select * from click_log ;
+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id  | click_log.ad_id  |       click_log.time        |
+----------------------+------------------+-----------------------------+--+
| 11                   | ad_101           | 2014-05-01 06:01:12.334+01  |
| 22                   | ad_102           | 2014-05-01 07:28:12.342+01  |
| 33                   | ad_103           | 2014-05-01 07:50:12.33+01   |
| 11                   | ad_104           | 2014-05-01 09:27:12.33+01   |
| 22                   | ad_103           | 2014-05-01 09:03:12.324+01  |
| 33                   | ad_102           | 2014-05-02 19:10:12.343+01  |
| 11                   | ad_101           | 2014-05-02 09:07:12.344+01  |
| 35                   | ad_105           | 2014-05-03 11:07:12.339+01  |
| 22                   | ad_104           | 2014-05-03 12:59:12.743+01  |
| 77                   | ad_103           | 2014-05-03 18:04:12.355+01  |
| 99                   | ad_102           | 2014-05-04 00:36:39.713+01  |
| 33                   | ad_101           | 2014-05-04 19:10:12.343+01  |
| 11                   | ad_101           | 2014-05-05 09:07:12.344+01  |
| 35                   | ad_102           | 2014-05-05 11:07:12.339+01  |
| 22                   | ad_103           | 2014-05-05 12:59:12.743+01  |
| 77                   | ad_104           | 2014-05-05 18:04:12.355+01  |
| 99                   | ad_105           | 2014-05-05 20:36:39.713+01  |
+----------------------+------------------+-----------------------------+--+
17 rows selected (0.054 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>    # all ad_ids each user visited, deduplicated
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select  cookie_id,collect_set(ad_id)  
. . . . . . . . . . . . . . . . . . . . . . > from click_log group by cookie_id;
+------------+-------------------------------+--+
| cookie_id  |              _c1              |
+------------+-------------------------------+--+
| 11         | ["ad_101","ad_104"]           |
| 22         | ["ad_102","ad_103","ad_104"]  |
| 33         | ["ad_103","ad_102","ad_101"]  |
| 35         | ["ad_105","ad_102"]           |
| 77         | ["ad_103","ad_104"]           |
| 99         | ["ad_102","ad_105"]           |
+------------+-------------------------------+--+
6 rows selected (19.221 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>   # all ad_ids each user visited, duplicates kept
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb>  select  cookie_id,collect_list(ad_id)  
. . . . . . . . . . . . . . . . . . . . . . >  from click_log group by cookie_id;
+------------+----------------------------------------+--+
| cookie_id  |                  _c1                   |
+------------+----------------------------------------+--+
| 11         | ["ad_101","ad_104","ad_101","ad_101"]  |
| 22         | ["ad_102","ad_103","ad_104","ad_103"]  |
| 33         | ["ad_103","ad_102","ad_101"]           |
| 35         | ["ad_105","ad_102"]                    |
| 77         | ["ad_103","ad_104"]                    |
| 99         | ["ad_102","ad_105"]                    |
+------------+----------------------------------------+--+
6 rows selected (17.263 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select  cookie_id,concat_ws(',',collect_list(ad_id))  as ad_str
. . . . . . . . . . . . . . . . . . . . . . > from click_log group by cookie_id;
+------------+------------------------------+--+
| cookie_id  |            ad_str            |
+------------+------------------------------+--+
| 11         | ad_101,ad_104,ad_101,ad_101  |
| 22         | ad_102,ad_103,ad_104,ad_103  |
| 33         | ad_103,ad_102,ad_101         |
| 35         | ad_105,ad_102                |
| 77         | ad_103,ad_104                |
| 99         | ad_102,ad_105                |
+------------+------------------------------+--+
6 rows selected (19.675 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
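
Note that collect_set and collect_list give no ordering guarantee, so the element order above may vary between runs. Wrapping the collected array in sort_array is a simple way to make the output deterministic (a sketch):

select cookie_id, concat_ws(',', sort_array(collect_set(ad_id))) as ad_str
from click_log
group by cookie_id;
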

Columns to rows

The query below goes the other way: split(catalogs, '\\|') turns the pipe-delimited catalogs string into an array (the pipe is escaped because split takes a regular expression), explode emits one row per element, and the outer keyword keeps rows whose array is NULL (ad_105) instead of dropping them.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView

0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> select ad_id, catalog
. . . . . . . . . . . . . . . . . . . . . . > from ad_list
. . . . . . . . . . . . . . . . . . . . . . > lateral view outer explode(split(catalogs,'\\|')) t as catalog;
+---------+-----------+--+
|  ad_id  |  catalog  |
+---------+-----------+--+
| ad_101  | catalog8  |
| ad_101  | catalog1  |
| ad_102  | catalog6  |
| ad_102  | catalog3  |
| ad_103  | catalog7  |
| ad_104  | catalog5  |
| ad_104  | catalog1  |
| ad_104  | catalog4  |
| ad_104  | catalog9  |
| ad_105  | NULL      |
+---------+-----------+--+
10 rows selected (0.063 seconds)
0: jdbc:hive2://hadoop001:10086/ruoze_hivedb> 
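
If the element's position within the original delimited string matters as well, posexplode (available since Hive 0.13) emits a 0-based index column alongside each value (a sketch):

select ad_id, pos, catalog
from ad_list
lateral view outer posexplode(split(catalogs, '\\|')) t as pos, catalog;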