mapreduce-shuffling

This article walks through Hadoop MapReduce's shuffle workflow, covering spill-file handling, partitioning, sorting, merging, and the Reduce-side phases, and shows how to enable compression of spilled map output to save disk and network I/O.


map->reduce

The process between map and reduce is called shuffling. The official diagram describes it as follows (though the depiction is not entirely accurate):

(Figure: the official MapReduce shuffle diagram)

MapTask

Each map task has a circular in-memory buffer that stores its output, 100 MB by default (configurable via MRJobConfig.IO_SORT_MB).
Once the buffer reaches the spill threshold (MRJobConfig.MAP_SORT_SPILL_PERCENT, default 0.80), a background thread starts spilling the contents to disk, writing the buffer to the directory specified by MRJobConfig.JOB_LOCAL_DIR.
The value of MRJobConfig.JOB_LOCAL_DIR is mapreduce.job.local.dir. Searching for local.dir in mapred-default.xml under the org.apache.hadoop.mapreduce package (inside hadoop-mapreduce-client-core-2.7.1.jar) turns up this property:

```xml
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>
```

OK, now search for hadoop.tmp.dir in core-default.xml inside hadoop-common-2.7.1.jar:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
```

Putting the two together, the spill path resolves to /tmp/hadoop-${user.name}/mapred/local.
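As a quick sketch of how these spill-related knobs can be tuned from driver code (the constants are the ones named above; the values here are purely illustrative, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;

Configuration conf = new Configuration();
// mapreduce.task.io.sort.mb: size of the circular sort buffer (default 100 MB).
conf.setInt(MRJobConfig.IO_SORT_MB, 200);
// mapreduce.map.sort.spill.percent: buffer usage at which spilling starts (default 0.80).
conf.setFloat(MRJobConfig.MAP_SORT_SPILL_PERCENT, 0.9f);
```

A larger buffer means fewer spill files, and therefore fewer merge passes, at the cost of map-task heap.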

Before spilling, the output is first partitioned, and each partition is sorted by key; if a combiner is configured, it runs on each partition right after the sort.

If there are at least three spill files (JobContext.MAP_COMBINE_MIN_SPILLS, default 3), the combiner runs again while the spills are merged into the final output file.

The corresponding source in MapTask.MapOutputBuffer:

```java
// During the merge of spill files: rerun the combiner only when one is
// configured and there are enough spills to make it worthwhile.
if (combinerRunner == null || numSpills < minSpillsForCombine) {
    Merger.writeFile(kvIter, writer, reporter, job);
} else {
    combineCollector.setWriter(writer);
    combinerRunner.combine(kvIter, combineCollector);
}
```
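As a usage sketch, wiring a combiner into a job is a one-liner. Here IntSumReducer stands in for any Reducer whose operation is associative and commutative (e.g. the word-count summing reducer), so that running it zero, one, or several times does not change the result:

```java
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(conf, "wordcount");
// The combiner may run after each in-partition sort, and again at merge
// time once the spill count reaches min spills for combine.
job.setCombinerClass(IntSumReducer.class);
```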

Note: when map output is spilled to disk, compression can be enabled to save disk and network I/O: set MRJobConfig.MAP_OUTPUT_COMPRESS to true and MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC to the desired codec class. For example:

```java
conf.set(MRJobConfig.MAP_OUTPUT_COMPRESS, "true");
conf.set(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, "org.apache.hadoop.io.compress.DefaultCodec");
```
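DefaultCodec (zlib) is shown above for illustration; on real clusters a lighter-weight codec such as org.apache.hadoop.io.compress.SnappyCodec or Lz4Codec is a common choice for intermediate output, since map output is written once and read once, so decompression speed matters more than compression ratio.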

ReduceTask

A ReduceTask has to fetch data from every MapTask; its overall flow breaks down into five phases (the configuration sketch after this list shows the matching knobs):
  1. Shuffle
    The ReduceTask copies map output from the MapTasks over the network; fetched data that exceeds an in-memory threshold is written to disk.
  2. Merge
    The ReduceTask runs two background threads that merge the in-memory and on-disk data.
  3. Sort
    The MapTask outputs are merge-sorted into a single sorted stream.
  4. Reduce
    The user-defined reduce function is applied.
  5. Write
    The reduce output is written to HDFS.
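For reference, a hedged sketch of the reduce-side shuffle knobs that correspond to these phases (property names and defaults taken from mapred-default.xml in Hadoop 2.7.x; the values shown are the documented defaults, purely for illustration):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Number of parallel fetcher threads copying map output (Shuffle phase).
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
// Fraction of reducer heap used to buffer fetched map output before
// spilling to disk (Shuffle phase).
conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
// Buffer usage at which the in-memory merge to disk starts (Merge phase).
conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
// Number of streams merged at once during the final sort (Sort phase).
conf.setInt("mapreduce.task.io.sort.factor", 10);
```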

Source code analysis
