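First, I create a test table in my local standalone Hive instance. It cannot simulate a large data volume, but it is enough to illustrate the point. The test data and queries are as follows:
-- count distinct test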
create table count_distinct_test(id int,name string);
insert into count_distinct_test values(1,'a'),(2,'a'),(3,'a'),(4,'b'),(5,'b'),(6,'c'),(7,'d'),(8,'e'),(9,'f'),(10,'g');
explain
select count(distinct(name)) from count_distinct_test;
explain
select count(1) from (select name from count_distinct_test group by name) x;
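Before comparing execution plans, it may be worth confirming that the two forms are actually equivalent on this data. The following is a minimal sketch that runs both queries without explain; the column alias distinct_names is only an illustrative name and is not part of the original test. With the ten rows inserted above, the names are a, b, c, d, e, f and g, so both statements should return 7.

```sql
-- Sanity check (illustrative): both forms should return the same count.
-- The alias distinct_names is an assumption added for readability.

-- Form 1: count distinct
select count(distinct name) as distinct_names
from count_distinct_test;

-- Form 2: group by in a subquery, then count the resulting groups
select count(1) as distinct_names
from (select name from count_distinct_test group by name) x;
```

Note that count(distinct name) ignores NULL values, while the group by form would count a NULL group as one row, so the two are only interchangeable when name contains no NULLs, as is the case here.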
Next, run the unoptimized count distinct statement and the group by statement separately, and look at the logs printed to the console:
count distinct:
As the log shows, the number of jobs is 1: the whole query runs as a single MapReduce (MR) job.