Hadoop Streaming: aggregate

1. Introduction to aggregate
aggregate is a package shipped with Hadoop for common computations and aggregations.
Generally speaking, to implement an application with the Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly a Combine function). However, for many applications related to counting and statistics computing, these functions have very similar characteristics. This package implements those patterns. In particular, it provides a generic mapper class, a reducer class and a combiner class, and a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob, that offers a static function that creates map/reduce jobs.
In Streaming, the Aggregate package is typically used as the reducer to compute aggregate statistics.

2. aggregate class summary

DoubleValueSum: This class implements a value aggregator that sums up a sequence of double values. It can be used for Top K statistics, similar to LongValueSum.
LongValueMax: This class implements a value aggregator that maintains the maximum of a sequence of long values.
LongValueMin: This class implements a value aggregator that maintains the minimum of a sequence of long values.
LongValueSum: This class implements a value aggregator that sums up a sequence of long values.
StringValueMax: This class implements a value aggregator that maintains the biggest of a sequence of strings.
StringValueMin: This class implements a value aggregator that maintains the smallest of a sequence of strings.
UniqValueCount: This class implements a value aggregator that dedupes a sequence of objects.
UserDefinedValueAggregatorDescriptor: This class implements a wrapper for a user-defined value aggregator descriptor.
ValueAggregatorBaseDescriptor: This class implements the common functionalities of the subclasses of the ValueAggregatorDescriptor class.
ValueAggregatorCombiner: This class implements the generic combiner of Aggregate.
ValueAggregatorJob: This is the main class for creating a map/reduce job using the Aggregate framework.
ValueAggregatorJobBase: This abstract class implements some common functionalities of the generic mapper, reducer and combiner classes of Aggregate.
ValueAggregatorMapper: This class implements the generic mapper of Aggregate.
ValueAggregatorReducer: This class implements the generic reducer of Aggregate.
ValueHistogram: This class implements a value aggregator that computes the histogram of a sequence of strings.
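
To make the one-line descriptions above more concrete, here is a minimal C++ sketch of what a few of the built-in aggregators would report for the values of a single key. It is not Hadoop code: applyAggregator is a hypothetical helper written for illustration, and it assumes that StringValueMax and StringValueMin compare strings lexicographically and that UniqValueCount reports the number of distinct values.

```cpp
#include <algorithm>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Hadoop): returns what the named
// aggregator would report for the values collected under one key.
std::string applyAggregator(const std::string& function,
                            const std::vector<std::string>& values) {
    if (values.empty()) throw std::invalid_argument("no values");
    if (function == "LongValueSum") {
        long long sum = 0;
        for (const auto& v : values) sum += std::stoll(v);
        return std::to_string(sum);
    }
    if (function == "LongValueMax" || function == "LongValueMin") {
        long long best = std::stoll(values.front());
        for (const auto& v : values) {
            long long x = std::stoll(v);
            best = (function == "LongValueMax") ? std::max(best, x)
                                                : std::min(best, x);
        }
        return std::to_string(best);
    }
    // Assumption: string aggregators use plain lexicographic ordering.
    if (function == "StringValueMax")
        return *std::max_element(values.begin(), values.end());
    if (function == "StringValueMin")
        return *std::min_element(values.begin(), values.end());
    if (function == "UniqValueCount") {
        // Assumption: the report is the number of distinct values.
        std::set<std::string> distinct(values.begin(), values.end());
        return std::to_string(distinct.size());
    }
    throw std::invalid_argument("unknown function: " + function);
}
```

For example, applyAggregator("LongValueMax", {"15", "20", "16"}) yields "20", while the same values under LongValueSum yield "51".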

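3. Using aggregate in Streaming

Add a control prefix to every record the mapper emits, in this form:
function:key\tvalue
e.g.:
LongValueSum:key\tvalue
In addition, set -reducer aggregate. The reducer then applies the matching function class from aggregate to the values that share a key. For example, with the function set to LongValueSum, the values of each key are summed.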
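
The effect of the record format and the aggregate reducer described above can be imitated locally with a few lines of C++. This is only a sketch of the behavior for LongValueSum, not the Hadoop implementation: sumByKey is a hypothetical helper, and the keys k1 and k2 in the usage note are made up.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of Hadoop): takes mapper output lines of the
// form "LongValueSum:key<TAB>value" and sums the values of each key.
std::map<std::string, long long> sumByKey(const std::vector<std::string>& lines) {
    const std::string prefix = "LongValueSum:";
    std::map<std::string, long long> totals;
    for (const auto& line : lines) {
        std::size_t tab = line.find('\t');
        // Skip records that have no tab or do not start with the prefix.
        if (tab == std::string::npos ||
            line.compare(0, prefix.size(), prefix) != 0) {
            continue;
        }
        // The key sits between the prefix and the tab; the value follows.
        std::string key = line.substr(prefix.size(), tab - prefix.size());
        totals[key] += std::stoll(line.substr(tab + 1));
    }
    return totals;
}
```

Feeding it the lines "LongValueSum:k1\t15", "LongValueSum:k2\t20" and "LongValueSum:k1\t17" gives 32 for k1 and 20 for k2, which is the per-key sum the reducer is expected to write.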

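4. Example 1 (summing values)
Test file test.txt (the mapper reads three whitespace-separated fields per line; only the value column is shown here):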

      15
      17
      18
      19
      19
      19
      19
      20
      15
      15
      16
      16

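Mapper program:

    #include <iostream>
    #include <string>

    using namespace std;

    int main(int argc, char** argv)
    {
        string a, b, c;
        // Each input line holds three whitespace-separated fields.
        while (cin >> a >> b >> c)
        {
            // Emit "LongValueSum:key<TAB>value".
            // Assumed field roles: a is the key, b is the value, c is unused.
            cout << "LongValueSum:" << a << "\t" << b << endl;
        }
        return 0;
    }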

Run:
$ hadoop streaming -input /app/test.txt -output /app/test -mapper ./mapper -reducer aggregate -file mapper -jobconf mapred.reduce.tasks=1 -jobconf mapred.job.name="test"
Output:
      142
      20
      30
      16

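5. Example 2 (the powerful ValueHistogram)
ValueHistogram is the most powerful class in the aggregate package. For each key, it counts how many times each distinct value occurs and reports the following statistics over those counts:
1) number of unique values
2) minimum count
3) median count
4) maximum count
5) average count
6) standard deviation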
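Building on the example above, change mapper.cpp to:

    #include <iostream>
    #include <string>

    using namespace std;

    int main(int argc, char** argv)
    {
        string a, b, c;
        // Each input line holds three whitespace-separated fields.
        while (cin >> a >> b >> c)
        {
            // Emit "ValueHistogram:key<TAB>value".
            // Assumed field roles: a is the key, b is the value, c is unused.
            cout << "ValueHistogram:" << a << "\t" << b << endl;
        }
        return 0;
    }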

Run it with the same command as above.
Result:
                          1.6     1.2
                          1.0     0.0
                          2.0     0.0
                          1.0     0.0
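
The six statistics can be reproduced with a short C++ sketch, which may help when reading the result rows. It is an illustration under assumptions, not the Hadoop implementation: histogramReport and HistogramReport are hypothetical names, the median is taken as the middle element of the sorted counts, and the standard deviation is the population form (dividing by the number of unique values).

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Hypothetical report type mirroring the six statistics listed for ValueHistogram.
struct HistogramReport {
    std::size_t uniqueValues;
    long long minCount, medianCount, maxCount;
    double averageCount, stdDev;
};

// Hypothetical helper: computes the statistics for the values of one key.
HistogramReport histogramReport(const std::vector<std::string>& values) {
    // Count how many times each distinct value occurs.
    std::map<std::string, long long> occurrences;
    for (const auto& v : values) ++occurrences[v];

    std::vector<long long> counts;
    for (const auto& kv : occurrences) counts.push_back(kv.second);
    std::sort(counts.begin(), counts.end());

    HistogramReport r{counts.size(), 0, 0, 0, 0.0, 0.0};
    if (counts.empty()) return r;

    r.minCount = counts.front();
    r.medianCount = counts[counts.size() / 2];  // assumption: middle element
    r.maxCount = counts.back();

    double total = 0.0;
    for (long long c : counts) total += c;
    r.averageCount = total / counts.size();

    double squared = 0.0;
    for (long long c : counts) {
        squared += (c - r.averageCount) * (c - r.averageCount);
    }
    r.stdDev = std::sqrt(squared / counts.size());  // assumption: population form
    return r;
}
```

As a check, a hypothetical key with the values 15, 17, 18, 16 and four occurrences of 19 has counts 1, 1, 1, 1 and 4, so the sketch reports 5 unique values, an average count of 1.6 and a standard deviation of 1.2. Those last two numbers agree with the first result row above.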

Reference:
/docs/api/index.html?org/apache/hadoop/mapred/lib/aggregate/package-summary.htm
