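1. Understand the requirement
Redmine ticket 18161 talks about "impression frequency and click frequency" and asks, for each project, for the 20 cookies with the most impressions and clicks. Despite the wording, this has nothing to do with frequency: the request is simply for the cookies with the highest COUNT, limited to the TOP 20.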
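2. The ordinary approach, which is to produce all of the counts first and pick the top 20 afterwards, wastes time and is inefficient.
Using FLATTEN(TOP(20, 1, data.(cookie, cnt))) AS (cookie, cnt) instead gets the job done with far less effort. TOP(20, 1, ...) keeps the 20 tuples with the largest value in column 1 (columns are 0-based, so column 1 is cnt).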
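To make the selection step concrete, here is a minimal Python sketch (not Pig, and not a claim about how Pig implements TOP internally) of what TOP(20, 1, ...) computes for a single group: the N tuples with the largest value in a given column. The function name `top_n` and the sample data are hypothetical.

```python
import heapq

def top_n(n, column, tuples):
    """Return the n tuples with the largest value in the given column."""
    # heapq.nlargest tracks at most n candidates while scanning,
    # so the bag never has to be fully sorted.
    return heapq.nlargest(n, tuples, key=lambda t: t[column])

# Hypothetical (cookie, cnt) pairs for one (camp, mediaid) group.
bag = [("c1", 5), ("c2", 42), ("c3", 17), ("c4", 8)]
top2 = top_n(2, 1, bag)
print(top2)
```

Note that this sketch returns the tuples in descending order of count, whereas a Pig bag should be treated as unordered.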
3. The code is as follows:
set job.priority 'normal';
set pig.exec.reducers.bytes.per.reducer 419430400;
data = LOAD '/product/log/{14015}/{130701,}' USING PigStorage(',');
data = FOREACH data GENERATE (chararray) REPLACE($10, '_', '') AS cookie, (int) $117 AS camp, (int) $202 AS mediaid,
data = FILTER data BY type < 300 AND (mediaid == 1339 OR mediaid == 1335 OR mediaid == 1361 OR mediaid == 2267);
data = FOREACH data GENERATE camp, mediaid, cookie;
data = GROUP data BY (camp, mediaid, cookie);
data = FOREACH data GENERATE FLATTEN(group) AS (camp, mediaid, cookie), COUNT(data) AS cnt;
data = GROUP data BY (camp, mediaid);
data = FOREACH data GENERATE FLATTEN(group) AS (camp, mediaid), FLATTEN(TOP(20, 1, data.(cookie, cnt))) AS (cookie, cnt);
STORE data INTO '/tmp/pigout/18161_camp_mediaid_view_top20' USING PigStorage(',');
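
For checking the logic on a small sample, the two-stage grouping in the script can be mirrored in plain Python. This is only a sketch under assumptions: the rows are taken to be already parsed and filtered into (camp, mediaid, cookie) tuples, and the helper name `top_cookies` and the sample rows are hypothetical.

```python
import heapq
from collections import Counter, defaultdict

def top_cookies(rows, n=20):
    """rows: iterable of (camp, mediaid, cookie) tuples, already filtered."""
    # Stage 1: count rows per (camp, mediaid, cookie),
    # like the first GROUP followed by COUNT.
    counts = Counter(rows)
    # Stage 2: regroup the counts by (camp, mediaid), like the second GROUP.
    groups = defaultdict(list)
    for (camp, mediaid, cookie), cnt in counts.items():
        groups[(camp, mediaid)].append((cookie, cnt))
    # Keep only the n cookies with the highest count in each group.
    return {key: heapq.nlargest(n, bag, key=lambda t: t[1])
            for key, bag in groups.items()}

# Hypothetical sample rows.
rows = [
    (1, 1339, "a"), (1, 1339, "a"), (1, 1339, "b"),
    (1, 1335, "c"), (2, 1339, "a"),
]
result = top_cookies(rows, n=1)
print(result)
```

Comparing the output of such a sketch with the Pig result for the same small input is one way to confirm that column index 1 really refers to cnt.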