[Notes] Hadoop Tutorial - Reducer

[quote]Reducer reduces a set of intermediate values which share a key to a smaller set of values.[/quote]
[b]Number of Reducers[/b]
The number of reducers can be set with the following method:
JobConf.setNumReduceTasks(int);

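It can also be changed through the mapred.reduce.tasks parameter, whose default value is 1.
The official tutorial recommends one of the following formulas: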
[list]
[*]0.95 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
[*]1.75 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
[/list]
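With 0.95, all of the reduce tasks can launch immediately and start transferring map outputs as the map tasks finish. With 1.75, some reduce tasks have to wait for a second wave: the faster nodes finish their first round of reduces and then launch the remaining ones, which gives better load balancing.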
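As a worked example of the two formulas above, here is a minimal Java sketch. The class name, the method name, the cluster size of 10 nodes and the 2 reduce slots per node are all hypothetical, and rounding down to a whole number of tasks is an assumption (the tutorial does not say how to round). The factor is passed as a percentage so the arithmetic stays in integers.

```java
class ReducerCount {
    // factorPercent is 95 or 175, i.e. the tutorial's 0.95 or 1.75 factor.
    // Assumption: round down to a whole number of reduce tasks.
    static int recommendedReducers(int factorPercent, int nodes, int reduceSlotsPerNode) {
        return factorPercent * nodes * reduceSlotsPerNode / 100;
    }

    public static void main(String[] args) {
        int nodes = 10;             // hypothetical NUMBER_OF_NODES
        int reduceSlotsPerNode = 2; // hypothetical mapred.tasktracker.reduce.tasks.maximum

        // Single wave: every reduce task gets a slot right away.
        System.out.println(recommendedReducers(95, nodes, reduceSlotsPerNode));  // prints 19
        // Two waves: faster nodes pick up the remaining tasks.
        System.out.println(recommendedReducers(175, nodes, reduceSlotsPerNode)); // prints 35
    }
}
```

The resulting number is what would then be passed to JobConf.setNumReduceTasks(int).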
[quote]It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.[/quote]
[b]What Happens in the Reducer Phase[/b]
[quote]Reducer has 3 primary phases: shuffle, sort and reduce.[/quote]
[*][b]Shuffle[/b]
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
[*][b]Sort[/b]
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
[*][b]Secondary Sort[/b]
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.
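Secondary sort is useful when the values of a key have to arrive in sorted order. Sorting all the values of a key directly in memory can cause an out-of-memory (OOM) error. With secondary sort, the key and the value are wrapped into a composite key: JobConf.setOutputKeyComparatorClass(Class) controls the sort order and JobConf.setOutputValueGroupingComparator(Class) controls the grouping. A custom Partitioner is also needed so that all the records of one group are sent to the same reducer. Each reduce task writes its own part file, so when there are many keys spread over many reducers the output files may need to be concatenated afterwards. (My understanding: when a key has only a few values, sort them directly in the reducer; when a key has a very large number of values, use secondary sort.)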
[*][b]Reduce[/b]
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
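The secondary sort described above can be sketched in plain Java without any Hadoop dependency. This is only a simulation under stated assumptions: CompositeKey, SORT_COMPARATOR, GROUPING_COMPARATOR, partition and reduceSide are hypothetical names, and the sort here runs in memory, whereas in a real job the framework does it while merging map outputs. The two comparators stand in for the classes passed to JobConf.setOutputKeyComparatorClass(Class) and JobConf.setOutputValueGroupingComparator(Class).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    // Composite key: the natural key plus the value that should arrive sorted.
    static final class CompositeKey {
        final String naturalKey;
        final int value;
        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    // Stand-in for the sort comparator: natural key first, then value.
    static final Comparator<CompositeKey> SORT_COMPARATOR =
        Comparator.comparing((CompositeKey k) -> k.naturalKey)
                  .thenComparingInt(k -> k.value);

    // Stand-in for the grouping comparator: only the natural key matters.
    static final Comparator<CompositeKey> GROUPING_COMPARATOR =
        Comparator.comparing((CompositeKey k) -> k.naturalKey);

    // Stand-in for a custom Partitioner: hash the natural key only, so every
    // record of a group is assigned to the same reducer index.
    static int partition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulates what a single reducer sees: records sorted by composite key,
    // then cut into groups wherever the grouping comparator reports a change.
    static Map<String, List<Integer>> reduceSide(List<CompositeKey> records) {
        List<CompositeKey> sorted = new ArrayList<>(records);
        sorted.sort(SORT_COMPARATOR);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        for (CompositeKey current : sorted) {
            if (previous == null || GROUPING_COMPARATOR.compare(previous, current) != 0) {
                groups.put(current.naturalKey, new ArrayList<>());
            }
            groups.get(current.naturalKey).add(current.value);
            previous = current;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> records = Arrays.asList(
            new CompositeKey("b", 3), new CompositeKey("a", 9),
            new CompositeKey("b", 1), new CompositeKey("a", 2));
        System.out.println(reduceSide(records)); // prints {a=[2, 9], b=[1, 3]}
    }
}
```

Each group reaches the simulated reduce step with its values already in ascending order, so the reducer never has to buffer and sort them itself.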