6.824 paper MapReduce: Simplified Data Processing on Large Clusters


License: CC 4.0 BY-SA

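This post walks through how MapReduce is implemented: the execution overview, the master's data structures, fault tolerance, and implementation details. Map invocations are distributed by partitioning the input data and run in parallel on multiple machines; Reduce invocations are distributed by a partitioning function. Fault tolerance covers both worker failure and master failure, so that data processing stays reliable. The post also discusses optimizations such as locality, task granularity, and backup tasks.

The parts of the paper that explain how it works are mainly Sections 3 and 4: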

3 Implementation

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

map:

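Split the input into M pieces (split on what basis?) and call the map function on each piece. The M splits become M map tasks, not necessarily on M distinct machines, and they run in parallel on different machines.

reduce:

Split the map output, i.e. the intermediate key/value pairs, into R pieces with a partitioning function (e.g. hash(key) mod R), then call reduce on each piece.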
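The partitioning step can be sketched as follows. This is a minimal illustration, not the paper's code: the name `partition` and the use of CRC32 are assumptions (the paper only gives hash(key) mod R as an example). A stable hash is used because Python's built-in `hash()` is randomized per process for strings, and every worker has to agree on where a key goes.

```python
import zlib

def partition(key: str, R: int) -> int:
    """Return the reduce partition (0..R-1) for an intermediate key."""
    # crc32 is deterministic across processes and machines, unlike hash(str).
    return zlib.crc32(key.encode("utf-8")) % R

R = 3
keys = ["apple", "banana", "apple", "cherry"]
assignments = [partition(k, R) for k in keys]
```

Every occurrence of the same key lands in the same partition, which is what lets a single reduce task see all values for that key.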

The figure (Figure 1 in the paper):

Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):

The flow of one MR run:

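1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

1. MR splits the input into M pieces (the piece size can be set by the user), then runs many copies of the program on the cluster (is "the program" the user's application together with the MR library?). Roughly, each machine in the cluster runs one instance of the program.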
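On the earlier question of what the split is based on: the paper describes it as splitting by size. A minimal sketch, assuming a fixed split size in bytes (the function name `compute_splits` and the 200 MB file are hypothetical, and a real implementation would probably also have to respect record boundaries, which this ignores):

```python
def compute_splits(file_size: int, split_size: int = 64 * 1024 * 1024):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

splits = compute_splits(200 * 1024 * 1024)  # a hypothetical 200 MB input file
M = len(splits)                             # one map task per split
```

With a 64 MB split size this input yields four splits, so M is derived from the input size rather than from the number of machines.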

2. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

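2. One of these instances is special: the master. The rest are workers, and the master assigns tasks to them. (So the master and the workers actually run the same program, and only their roles, and therefore their behavior, differ?)

There are M map tasks and R reduce tasks in total. The master picks idle workers and assigns each one a map task or a reduce task.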

3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

3. A worker that is assigned a map task first reads its input from the corresponding input split, parses key/value pairs out of it, and calls the map function on them. The resulting intermediate key/value pairs are buffered in memory (?)

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

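4. Periodically, the buffered key/value pairs are written to local disk, split into R pieces by the partition function. Their paths on the local disk are passed to the master, and the master is responsible for passing these paths on to the corresponding reduce workers.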
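Steps 3 and 4 can be sketched as a single map task. This is an illustration under stated assumptions, not the paper's implementation: word count stands in for the user's Map function, the `mr-<map>-<region>` file names and JSON-lines encoding are hypothetical, CRC32 stands in for the hash, and the buffer is flushed once at the end rather than periodically.

```python
import json
import os
import tempfile
import zlib

def word_count_map(filename, contents):
    """User-defined Map: emit (word, 1) for every word in the split."""
    return [(word, 1) for word in contents.split()]

def run_map_task(map_id, filename, contents, R, out_dir):
    """Run one map task and return the paths of its R region files."""
    # Step 3: call the user's Map function and buffer the pairs in memory,
    # already grouped by the region the partition function picks.
    buffers = [[] for _ in range(R)]
    for key, value in word_count_map(filename, contents):
        buffers[zlib.crc32(key.encode("utf-8")) % R].append((key, value))
    # Step 4: write each region to local disk; the paths are what the
    # worker would report back to the master.
    paths = []
    for r, pairs in enumerate(buffers):
        path = os.path.join(out_dir, f"mr-{map_id}-{r}")
        with open(path, "w", encoding="utf-8") as f:
            for key, value in pairs:
                f.write(json.dumps([key, value]) + "\n")
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
paths = run_map_task(0, "split-0", "a b a c", R=2, out_dir=out_dir)
```

Each map task always produces exactly R region files, even if some are empty, so reduce task r knows to fetch region r from every map task.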

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

5. Once a reduce worker learns the locations of its input data from the master, it uses RPC to read that data from the local disks of the map workers. After it has read all of its input, it first sorts by key so that intermediate pairs with the same key end up together. The sort is needed because the partition function maps several different keys to the same reduce worker.

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

6. The reduce worker iterates over the sorted intermediate data and, for each unique key, calls the reduce function once with that key and all of its values. The output of the reduce function is appended to "a final output file for this reduce partition" (?)
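Steps 5 and 6 can be sketched like this, assuming the intermediate pairs have already been fetched into memory. Word count again stands in for the user's Reduce function, the names are hypothetical, and the output is returned as a list of lines instead of being appended to a file. The in-memory `sorted` call is a simplification; the paper uses an external sort when the data does not fit in memory.

```python
from itertools import groupby
from operator import itemgetter

def word_count_reduce(key, values):
    """User-defined Reduce: sum the counts for one word."""
    return str(sum(values))

def run_reduce_task(pairs):
    """Sort fetched (key, value) pairs, group by key, and reduce each group."""
    # Step 5: sort so that all occurrences of a key are adjacent.
    pairs = sorted(pairs, key=itemgetter(0))
    output_lines = []
    # Step 6: one Reduce call per unique key, output appended in key order.
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        output_lines.append(f"{key} {word_count_reduce(key, values)}")
    return output_lines

# Pairs for one partition, as they might arrive from two different map workers.
fetched = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]
output = run_reduce_task(fetched)
```

`groupby` only merges adjacent equal keys, which makes the role of the sort concrete: without it, "a" and "b" would each be reduced several times.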

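7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

7. Once all map and reduce tasks have completed, this MR run is finished and the call returns.

After an MR run completes successfully, the output is in R output files (note that there are R reduce tasks, one file each). Usually users do not merge these R files themselves; instead they pass them as input to another MR job.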

3.2 Master Data Structures

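The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).

The master maintains several data structures. For each map and reduce task it stores the task's state (idle, in-progress, or completed; idle presumably means not yet started?) and the identity of the corresponding worker (if the task is non-idle; what about completed tasks? Read literally, "non-idle" covers both in-progress and completed).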
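The per-task bookkeeping can be sketched as below. This is a minimal sketch under assumptions: the names `TaskState`, `Task`, `assign`, and `complete` are hypothetical, and the "first idle task" policy is only the simplest choice, not something the paper prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                       # "map" or "reduce"
    task_id: int
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None    # identity of the worker; None while idle

def assign(tasks, worker):
    """Give the first idle task to worker and return it, or None if none is idle."""
    for task in tasks:
        if task.state is TaskState.IDLE:
            task.state = TaskState.IN_PROGRESS
            task.worker = worker
            return task
    return None

def complete(task):
    """Mark a task completed; the worker identity is kept, since it is non-idle."""
    task.state = TaskState.COMPLETED

M, R = 2, 1
tasks = [Task("map", i) for i in range(M)] + [Task("reduce", i) for i in range(R)]
first = assign(tasks, "worker-A")
complete(first)
```

Here an idle task is simply one that no worker has been given yet, and a completed task still remembers which worker ran it.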

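The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

Intermediate file locations travel from map tasks to reduce tasks via the master. For each completed map task, the master stores the locations and sizes of the R intermediate file regions that task produced (one region per reduce partition; for example, with R = 3 a map task produces three regions, and the master records where each one is and how large it is). As map tasks complete, the master receives updates to this location and size information. The information is pushed incrementally to the workers that are running reduce tasks.

↑ Worth coming back to this: what would it look like in a concrete implementation?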
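One possible shape for this bookkeeping, as a hedged sketch rather than the paper's design: the class and field names are hypothetical, sizes are assumed to be in bytes, locations are made-up "worker:file" strings, and the "push" is simulated by appending to a per-reduce-task list where a real master would send a message to the reduce worker.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    location: str   # where a reduce worker can fetch this region from
    size: int       # size of the region, assumed here to be in bytes

@dataclass
class Master:
    R: int
    # map task id -> its R regions, recorded when that map task completes
    map_outputs: dict = field(default_factory=dict)
    # reduce partition -> locations pushed so far to that reduce task
    pushed: dict = field(default_factory=dict)

    def map_completed(self, map_id, regions):
        """Record a finished map task and push region r to reduce task r."""
        if len(regions) != self.R:
            raise ValueError("a map task must report exactly R regions")
        self.map_outputs[map_id] = regions
        for r, region in enumerate(regions):
            self.pushed.setdefault(r, []).append(region.location)

master = Master(R=2)
master.map_completed(0, [Region("worker-A:mr-0-0", 120), Region("worker-A:mr-0-1", 80)])
master.map_completed(1, [Region("worker-B:mr-1-0", 0), Region("worker-B:mr-1-1", 300)])
```

Each reduce task's list grows as map tasks finish, which is the incremental part: a reduce worker can start fetching early instead of waiting for every map task to complete.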