Distinct-count operations are common in databases. For example, to count the number of students enrolled in each elective course:
select course, count(distinct sid)
from stu_table
group by course;
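To make the semantics concrete, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. The sample rows, the column types, and the helper name `enrollment_counts` are assumptions for illustration only; they are not part of the original schema.

```python
import sqlite3


def enrollment_counts(rows):
    """Return {course: number of distinct student ids} for (sid, course) rows."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the two columns the query touches.
    conn.execute("create table stu_table (sid integer, course text)")
    conn.executemany("insert into stu_table (sid, course) values (?, ?)", rows)
    result = conn.execute(
        "select course, count(distinct sid) from stu_table group by course"
    ).fetchall()
    conn.close()
    return dict(result)


# Hypothetical sample data: student 1 is recorded twice for "math".
sample_rows = [(1, "math"), (1, "math"), (2, "math"), (2, "physics")]
counts = enrollment_counts(sample_rows)
```

The duplicate row for student 1 is counted once, which is what separates `count(distinct sid)` from a plain `count(sid)`.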
Hive
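In big-data settings, one of the most important report metrics is UV (Unique Visitor), the number of distinct users within a given time period. For example, to see how users are distributed across apps over one week, the HiveQL query in Hive is: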
select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app;
Pig
Similarly, in Pig:
-- all users
define DISTINCT_COUNT(A, a) returns dist {
B = foreach $A generate $a;
unique_B = distinct B;
C = group unique_B a