Hive 中的行转列和列转行操作
在Hive中,行转列和列转行是常见的数据转换操作,用于满足不同的数据分析需求。下面我将详细介绍这两种操作及其实现方式。
一、行转列(行转多列)
行转列是指将多行数据按照某个键值聚合,转换为一行中的多列。
常用方法:
-
使用
collect_list和collect_set配合case whenSELECT user_id, MAX(CASE WHEN attribute = 'age' THEN value END) AS age, MAX(CASE WHEN attribute = 'gender' THEN value END) AS gender, MAX(CASE WHEN attribute = 'city' THEN value END) AS city FROM user_attributes GROUP BY user_id; -
使用
map函数SELECT user_id, map('age', age, 'gender', gender, 'city', city) AS user_info FROM users; -
使用
str_to_map函数(适用于字符串转换)SELECT user_id, str_to_map(concat_ws(',', concat_ws(':', 'age', age), concat_ws(':', 'gender', gender), concat_ws(':', 'city', city) )) AS user_info FROM users;
二、列转行(多列转行)
列转行是指将一行中的多列数据转换为多行数据。
常用方法:
-
使用
LATERAL VIEW explodeSELECT user_id, attr_info.key AS attribute, attr_info.value AS value FROM users LATERAL VIEW explode(map( 'age', age, 'gender', gender, 'city', city )) attr_info AS key, value; -
使用
UNION ALLSELECT user_id, 'age' AS attribute, age AS value FROM users UNION ALL SELECT user_id, 'gender' AS attribute, gender AS value FROM users UNION ALL SELECT user_id, 'city' AS attribute, city AS value FROM users; -
使用
stack函数SELECT user_id, stacked_data.attribute, stacked_data.value FROM users LATERAL VIEW stack(3, 'age', age, 'gender', gender, 'city', city ) stacked_data AS attribute, value;
三、特殊场景处理
1. JSON格式数据转换
如果数据是JSON格式,可以使用get_json_object或JSON函数:
-- 行转列
SELECT
user_id,
get_json_object(user_json, '$.age') AS age,
get_json_object(user_json, '$.gender') AS gender
FROM users_with_json;
-- 列转行
SELECT
user_id,
attr.key AS attribute,
attr.value AS value
FROM users_with_json
LATERAL VIEW json_tuple(user_json, 'age', 'gender', 'city') jt AS age, gender, city
LATERAL VIEW explode(map(
'age', jt.age,
'gender', jt.gender,
'city', jt.city
)) attr AS key, value;
2. 数组数据转换
-- 数组列转行
SELECT
user_id,
exploded_tag AS tag
FROM users_with_tags
LATERAL VIEW explode(tags) t AS exploded_tag;
-- 多行转数组(行转列)
SELECT
user_id,
collect_list(tag) AS tags
FROM user_tags
GROUP BY user_id;
四、性能考虑
- 行转列通常比列转行性能更好,因为减少了数据量
- 对于大数据集,
collect_list比collect_set更快,但后者去重 LATERAL VIEW explode会产生多行数据,可能导致数据膨胀
五、实际应用示例
假设有一个订单表,需要统计每个用户的各类商品购买数量:
-- 原始表结构:user_id, product_type, quantity
-- 行转列:每个用户一行,各商品类型作为列
SELECT
user_id,
SUM(CASE WHEN product_type = 'book' THEN quantity ELSE 0 END) AS book_qty,
SUM(CASE WHEN product_type = 'food' THEN quantity ELSE 0 END) AS food_qty,
SUM(CASE WHEN product_type = 'clothes' THEN quantity ELSE 0 END) AS clothes_qty
FROM orders
GROUP BY user_id;
-- 如果需要将上面的结果再转回多行形式
SELECT
user_id,
stacked.product_type,
stacked.quantity
FROM (
SELECT
user_id,
SUM(CASE WHEN product_type = 'book' THEN quantity ELSE 0 END) AS book_qty,
SUM(CASE WHEN product_type = 'food' THEN quantity ELSE 0 END) AS food_qty,
SUM(CASE WHEN product_type = 'clothes' THEN quantity ELSE 0 END) AS clothes_qty
FROM orders
GROUP BY user_id
) t
LATERAL VIEW stack(3,
'book', book_qty,
'food', food_qty,
'clothes', clothes_qty
) stacked AS product_type, quantity;
通过灵活运用这些技巧,可以满足Hive中各种行列转换的需求。
2644

被折叠的 条评论
为什么被折叠?



