An Optimization Analysis Prompted by a Hive SQL Query

License: CC 4.0 BY-SA

Tags: #hive #sql #performance #optimization

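This article walks through a SQL performance-tuning case, focusing on how to choose a strategy when a query needs sorted output. Using a real example, it shows how an `insert overwrite` statement combined with `distribute by` and `sort by` produces locally sorted output and resolves the performance problem. It also compares `order by`, `distribute by`, `sort by`, and `cluster by`, and offers a compromise solution.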

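Here is a detailed analysis of today's task failure. It is a typical performance-tuning case:

1. First, look at the SQL (I have omitted all the fields in the middle of the select to make it easier to read)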

insert overwrite table dw_user_activity partition(pt='$env.date')
select
…
…
from s_user_activity a
where a.pt  ='$env.date'
order by a.deviceid,a.userid,a.time;

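Note the order by at the end. order by means global ordering, and here global ordering means the final sort runs on a single reducer: each reducer writes its own sorted output, so only one reducer can guarantee that the whole result is in order (it is worth thinking this through).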

So even though the following setting was added afterwards

set mapreduce.job.reduces=300;

it had no effect at all on the number of reducers (I re-ran the job and took a screenshot to confirm: the reducer count was still 1).

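So what do we do when we want sorted output? Does sorting mean we have to live with the pain of a single reducer?

Of course there is a compromise, and this is one of the places where querying in Hive differs from Oracle!

In this example our goal is sorted data, but it does not have to be globally sorted. If rows are partitioned by a.deviceid, a.userid across different nodes and kept in order within each node, that is enough. This is local ordering. So the query can be written like this: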

insert overwrite table dw_user_activity partition(pt='$env.date')
select
…
…
from s_user_activity a
where a.pt  ='$env.date'
distribute by a.deviceid,a.userid sort by a.deviceid,a.userid,a.time;

As you can see, I did not set mapreduce.job.reduces, yet Hive automatically allocated 2 reducers, and the job ran successfully.
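When mapreduce.job.reduces is left unset, Hive estimates the reducer count from the input size. A minimal sketch of the settings that usually drive that estimate follows. The values are illustrative assumptions, not taken from the failed job, and defaults differ between Hive versions.

```sql
-- Illustrative values only; tune them for your own cluster.

-- Target amount of input data per reducer (here 128 MB).
-- A smaller value makes Hive pick more reducers.
SET hive.exec.reducers.bytes.per.reducer=134217728;

-- Upper bound on the number of reducers Hive will pick on its own.
SET hive.exec.reducers.max=300;

-- Alternatively, fix the count explicitly. With DISTRIBUTE BY / SORT BY
-- this setting is expected to take effect, unlike in the ORDER BY case.
-- SET mapreduce.job.reduces=300;
```

Letting Hive estimate the count is usually the more robust choice for a daily partition whose size varies; a hard-coded count suits cases where you need a predictable number of output files.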

2. Now let's look at the differences between order by, distribute by, sort by, and cluster by (starting with an excerpted explanation)

    ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
    SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
    DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
    CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.

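To summarize:

Distribute by: shuffles rows to different nodes by hashing the key, which guarantees that rows with the same key value always land on the same node.

Sort by: sorts the rows on each node. With sort by alone and no distribute by, rows with the same key value may end up on different nodes, so the sort is of limited use.

Cluster by = distribute by + sort by on the same columns. Because distribute by assigns rows by hash rather than by range, the result is ordered within each reducer's output, not globally across the output files, despite what the excerpt above suggests.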
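To make the equivalence concrete, here is a small sketch against the article's table. The select list is a hypothetical subset of columns chosen for illustration, since the real field list is elided in the original query.

```sql
-- Hypothetical column subset; '$env.date' is the scheduler placeholder
-- used in the article's own queries.

-- (a) CLUSTER BY: distribute and sort on the same keys.
SELECT deviceid, userid
FROM s_user_activity
WHERE pt = '$env.date'
CLUSTER BY deviceid, userid;

-- (b) The same thing spelled out with DISTRIBUTE BY + SORT BY.
SELECT deviceid, userid
FROM s_user_activity
WHERE pt = '$env.date'
DISTRIBUTE BY deviceid, userid
SORT BY deviceid, userid;
```

The rewritten job in section 1 cannot be shortened to cluster by, because it sorts on one more column (time) than it distributes on. That is exactly the case where the two clauses have to be written separately.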

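3. What about group by?

Group by is first of all an aggregation, normally used together with aggregate functions such as sum and count. It implicitly distributes the rows by the group by key and then aggregates.

If an order by follows it, Hive usually adds a later stage that generates another job for the order by, with a single reducer doing the global sort.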
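A sketch of the group by plus order by case is below. The aggregate and the alias cnt are assumptions made for illustration. Prefixing the query with EXPLAIN prints the stage plan without running it, so you can check on your own Hive version whether a separate sort stage appears.

```sql
-- Hypothetical aggregate: number of activity rows per device for one partition.
-- GROUP BY shuffles rows by deviceid and aggregates them;
-- ORDER BY then sorts the (much smaller) aggregated result globally.
EXPLAIN
SELECT deviceid, count(*) AS cnt
FROM s_user_activity
WHERE pt = '$env.date'
GROUP BY deviceid
ORDER BY cnt DESC;
```

The single-reducer sort is usually less painful here than in section 1, because it only sees one row per group rather than every input row.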

4. Finally done. That was tiring, and I am not sure whether I explained it clearly. Thanks for reading to the end :)