6.824 paper MapReduce: Simplified Data Processing on Large Clusters

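These notes walk through how MapReduce is implemented: the execution overview, the master's data structures, fault tolerance, and implementation details. Map invocations are distributed by partitioning the input data and run in parallel on many machines; reduce invocations are distributed by a partitioning function. Fault tolerance covers both worker failure and master failure, so that the computation remains reliable. The paper also discusses optimizations such as locality, task granularity, and backup tasks.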


The paper's explanation of how MapReduce works is mainly in Sections 3 and 4:

3 Implementation

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.

map:

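The input is divided into M splits (by what criterion? Step 1 below says by size), and the splits are handed to different machines, each of which calls the map function on its split. The map calls on different machines run in parallel.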

reduce:

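The output of map, that is, the intermediate key/value pairs, is divided into R pieces by a partition function (for example, hash(key) mod R), and reduce is then called on each piece.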
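As a rough illustration of the two partitioning steps above, the sketch below cuts an input into fixed-size splits and assigns intermediate keys to one of R pieces. The names `make_splits` and `partition` are hypothetical, splitting by raw byte ranges ignores record boundaries, and CRC32 stands in for the unspecified `hash` (Python's built-in `hash()` on strings is salted per process, so it would not give every worker the same answer).

```python
import zlib
from typing import List, Tuple

def make_splits(file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Cut a file of file_size bytes into (offset, length) splits."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [(off, min(split_size, file_size - off))
            for off in range(0, file_size, split_size)]

def partition(key: str, R: int) -> int:
    """hash(key) mod R, with CRC32 as a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % R

# A 150-byte input with 64-byte splits gives M = 3 map tasks.
splits = make_splits(file_size=150, split_size=64)
print(splits)  # [(0, 64), (64, 64), (128, 22)]

# Every occurrence of a key lands in the same one of the R pieces.
print(partition("apple", 4) == partition("apple", 4))  # True
```

Because the partition depends only on the key, all pairs with the same key reach the same reduce task no matter which map task produced them.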

 

Figure 1 of the paper (execution overview):

Figure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the list below):

The flow of one MapReduce run is:

1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

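1. MapReduce splits the input into M pieces (typically 16 to 64 MB each, adjustable by the user), then starts many copies of the program on the cluster. (Is "the program" the user's application together with the MapReduce library?) Roughly, each machine in the cluster runs one instance of the program.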

2. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

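2. One of these instances is special: it is the master. The rest are workers, and the master assigns work to them. (So the master and the workers actually run the same program, and only their roles, and therefore their behavior, differ?)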

There are M map tasks and R reduce tasks in total. The master picks idle workers and assigns each one a map task or a reduce task.

3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

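3. A worker that is assigned a map task first reads its input from the corresponding input split, parses key/value pairs out of it, and passes each pair to the map function. The intermediate key/value pairs that map outputs are buffered in memory.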

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

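4. Periodically, the buffered pairs are written to local disk, divided into R regions by the partition function. The locations of these pairs on local disk are reported to the master, which is responsible for forwarding them to the corresponding reduce workers.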

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

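5. Once a reduce worker learns the locations of its input from the master, it uses RPC to read the data from the local disks of the map workers. After it has read all of its input, it sorts the data by key so that intermediate pairs with the same key are grouped together. The sort is needed because the partition function maps many different keys to the same reduce worker.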

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

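6. The reduce worker iterates over the sorted intermediate data and, for each distinct key, calls the reduce function once with that key and all of its values. The output of reduce is appended to "a final output file for this reduce partition" (one output file per reduce task?).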

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

7. After all map and reduce tasks have completed, the MapReduce run is over and the call returns to the user code.

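After a MapReduce run completes successfully, its output is in R output files (one per reduce task). Users usually do not merge these R files directly; they often pass them as the input to another MapReduce run.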
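The seven steps above can be condensed into a single-process word-count sketch. This is only a toy model under stated assumptions: there is no master, no RPC, and no disk, in-memory lists stand in for input splits and intermediate files, and `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce` are hypothetical names rather than the paper's interface.

```python
import zlib
from collections import defaultdict
from itertools import groupby

def map_fn(_name, text):
    # User-defined Map: emit (word, 1) for every word in the split.
    return [(word, 1) for word in text.split()]

def reduce_fn(_key, values):
    # User-defined Reduce: sum the counts collected for one word.
    return sum(values)

def partition(key, R):
    # hash(key) mod R, with CRC32 as a process-independent hash (assumption).
    return zlib.crc32(key.encode("utf-8")) % R

def run_mapreduce(splits, R):
    # Steps 3-4: each map task buckets its output into R regions.
    regions = []  # regions[m][r] is the list of pairs map task m made for reduce r
    for m, text in enumerate(splits):
        buckets = defaultdict(list)
        for key, value in map_fn(f"split-{m}", text):
            buckets[partition(key, R)].append((key, value))
        regions.append(buckets)

    # Steps 5-6: each reduce task gathers its region from every map task,
    # sorts by key, and calls Reduce once per distinct key.
    outputs = []
    for r in range(R):
        pairs = [kv for buckets in regions for kv in buckets.get(r, [])]
        pairs.sort(key=lambda kv: kv[0])
        out = {}
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            out[key] = reduce_fn(key, [v for _, v in group])
        outputs.append(out)
    return outputs  # one dict per reduce partition, standing in for R output files

outputs = run_mapreduce(["a b a", "b c", "a c c"], R=2)
merged = {k: v for out in outputs for k, v in out.items()}
print(merged == {"a": 3, "b": 2, "c": 3})  # True
```

Note that `groupby` only groups adjacent items, which is exactly why the sort in step 5 has to come first.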

3.2 Master Data Structures

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).

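The master maintains several data structures. For each map and reduce task it stores the task's state (idle, in-progress, or completed; presumably idle means not yet started?) and the identity of the worker running it, for non-idle tasks. (What about completed tasks? "Non-idle" covers both in-progress and completed, so the worker is recorded for them too.)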
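One way to picture this bookkeeping is a small task table, sketched below. The class and method names (`Task`, `Master`, `assign`, `complete`) are hypothetical, and the policy of handing out the first idle task in insertion order is an assumption made for the example, not something the paper specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    index: int
    state: State = State.IDLE
    worker: Optional[str] = None   # recorded for non-idle tasks only

class Master:
    def __init__(self, M: int, R: int):
        # One entry per map task and per reduce task.
        self.tasks: Dict[Tuple[str, int], Task] = {}
        for i in range(M):
            self.tasks[("map", i)] = Task("map", i)
        for i in range(R):
            self.tasks[("reduce", i)] = Task("reduce", i)

    def assign(self, worker: str) -> Optional[Task]:
        # Give an idle worker the first idle task (assumed policy).
        for task in self.tasks.values():
            if task.state is State.IDLE:
                task.state = State.IN_PROGRESS
                task.worker = worker
                return task
        return None  # nothing left to hand out

    def complete(self, kind: str, index: int) -> None:
        # The worker identity stays recorded: completed tasks are non-idle.
        self.tasks[(kind, index)].state = State.COMPLETED

master = Master(M=2, R=1)
task = master.assign("worker-1")
print(task.kind, task.index, task.state.value)  # map 0 in-progress
```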

The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

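The locations of the intermediate files travel from map tasks to reduce tasks through the master. For each completed map task, the master stores the locations and sizes of the R intermediate file regions that the task produced. (Here "size" is the amount of data in each region, not a count of files: with R = 3, a map task reports three locations and three sizes, one per region.) The master receives updates to this location and size information as map tasks complete, and it pushes the information incrementally to the workers that have in-progress reduce tasks.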

↑ Come back to this: what should it look like in a concrete implementation?
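As one possible answer to the question above, here is a minimal sketch of how a master might record region locations and push them incrementally. Everything in it is an assumption for illustration: the `LocationTable` class, the `(location, size)` tuple with sizes in bytes, the `host:/path` strings, and the per-reducer `inbox` list that stands in for an RPC to the reduce worker.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Region = Tuple[str, int]  # (location, size in bytes); representation assumed

class LocationTable:
    def __init__(self, R: int):
        self.R = R
        self.by_map: Dict[int, List[Region]] = {}   # completed map task -> R regions
        self.active_reducers: Set[int] = set()      # reduce tasks that are in progress
        self.inbox: Dict[int, List[Tuple[int, Region]]] = defaultdict(list)

    def start_reduce(self, r: int) -> None:
        # A reduce task became in-progress: send everything known so far.
        self.active_reducers.add(r)
        for m, regions in sorted(self.by_map.items()):
            self.inbox[r].append((m, regions[r]))

    def map_completed(self, m: int, regions: List[Region]) -> None:
        if len(regions) != self.R:
            raise ValueError("a map task must report exactly R regions")
        if m in self.by_map:
            return  # ignore a repeated completion message for the same task
        self.by_map[m] = list(regions)
        # Incremental push: each in-progress reducer gets only its own region.
        for r in self.active_reducers:
            self.inbox[r].append((m, regions[r]))

table = LocationTable(R=2)
table.start_reduce(0)
table.map_completed(0, [("w1:/tmp/m0-r0", 10), ("w1:/tmp/m0-r1", 0)])
print(table.inbox[0])  # [(0, ('w1:/tmp/m0-r0', 10))]
```

A reducer that starts late catches up in `start_reduce`, while one that is already running receives each new location as soon as the map task finishes, so neither has to wait for all map tasks.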

3.3 Fault Tolerance

Because MapReduce is mainly used for large-scale computations, it naturally has to handle machine failures gracefully.

Worker Failure

The master pings every worker periodica
