Hive: Converting a Column to a Row (collect_set())

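This article shows how to use the UDAF collect_set in Hive to deduplicate values and collect them into an array, with worked examples of aggregating and concatenating the column values that fall into the same group.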

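When working with Hive, we often run into this requirement:

Group the rows by id, then deduplicate another field within each group. Take the following data as an example: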

id      pic
1       1.jpg
2       2.jpg
1       1.jpg

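A plain DISTINCT or a GROUP BY on both columns will not do here, because each still returns one row per distinct (id, pic) pair rather than one row per id. Instead we can use the UDAF collect_set(col): for each GROUP BY key it deduplicates the values as a set and returns them as an array.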
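As a minimal sketch of that behavior, the query below runs collect_set over the three sample rows. The inline subquery `t` is only a stand-in for a real table (an assumption for illustration), and a SELECT without a FROM clause needs Hive 0.13 or later.

```sql
-- Inline stand-in for the sample data above (hypothetical; replace with your table).
SELECT id,
       collect_set(pic) AS pics   -- one array of distinct pic values per id
FROM (
    SELECT 1 AS id, '1.jpg' AS pic
    UNION ALL
    SELECT 2 AS id, '2.jpg' AS pic
    UNION ALL
    SELECT 1 AS id, '1.jpg' AS pic
) t
GROUP BY id;
-- id = 1 yields ["1.jpg"] (the duplicate is dropped); id = 2 yields ["2.jpg"].
```

The result has one row per id, which is the shape that DISTINCT alone cannot produce.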

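As another example, we can deduplicate pic and concatenate the remaining values:

SELECT id, CONCAT_WS(',', COLLECT_SET(pic)) FROM tbl GROUP BY id;

Here CONCAT_WS is a UDF and COLLECT_SET is a UDAF. COLLECT_SET deduplicates the pic values within each group and turns them into an array, which the UDF can then consume directly.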
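One practical detail: CONCAT_WS accepts strings or an array of strings, so a non-string column has to be cast before it is collected. The sketch below assumes a hypothetical integer column named `score`; the inline subquery again stands in for a real table.

```sql
-- Hypothetical data with an INT column; names are for illustration only.
SELECT id,
       concat_ws(',', collect_set(CAST(score AS STRING))) AS scores
FROM (
    SELECT 1 AS id, 10 AS score
    UNION ALL
    SELECT 1 AS id, 20 AS score
    UNION ALL
    SELECT 1 AS id, 10 AS score
) t
GROUP BY id;
-- Without the CAST, collect_set would return array<int>,
-- which concat_ws does not accept.
```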

PS: if you do not need deduplication, use COLLECT_LIST instead.
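To make the difference concrete, here is a small side-by-side sketch on the same sample rows. The inline subquery is a hypothetical stand-in for the table, and collect_list assumes Hive 0.13 or later.

```sql
SELECT id,
       concat_ws(',', collect_set(pic))  AS distinct_pics,  -- duplicates removed
       concat_ws(',', collect_list(pic)) AS all_pics        -- duplicates kept
FROM (
    SELECT 1 AS id, '1.jpg' AS pic
    UNION ALL
    SELECT 2 AS id, '2.jpg' AS pic
    UNION ALL
    SELECT 1 AS id, '1.jpg' AS pic
) t
GROUP BY id;
-- For id = 1: distinct_pics is 1.jpg, while all_pics is 1.jpg,1.jpg
```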

Exercise:

I. Problem

How can Hive turn

a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

into:

a       b       1,2,3
c       d       4,5,6

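In other words, within each col1 group (the answer groups by col1 and col2 together), the col3 values are collapsed from a column into a single row.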

II. Data

test.txt
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

III. Answer

1. Create the table

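-- Drop any earlier copy so the script can be re-run.
-- test.txt must be tab-separated to match the '\t' delimiter below.
drop table if exists tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;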

load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;

2. Query

select col1, col2, concat_ws(',', collect_set(col3)) as col3_list
from tmp_jiangzl_test
group by col1, col2;
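
One caveat worth sketching: as far as I know, collect_set makes no promise about the order of the elements it returns, so the concatenated string may not always come out as 1,2,3. If the order matters, one option (an addition here, not part of the original answer) is to wrap the array in sort_array before joining it:

```sql
-- Same table as above; sort_array orders the collected values
-- before they are joined into one string.
select col1,
       col2,
       concat_ws(',', sort_array(collect_set(col3))) as col3_list
from tmp_jiangzl_test
group by col1, col2;
-- col3 is declared as string, so the sort is lexicographic:
-- '10' would sort before '2'. The sample values 1 to 6 are unaffected.
```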