Hadoop: The Definitive Guide, Chapter 2: MapReduce

This article explains how to take advantage of the parallel processing that Hadoop provides: the MapReduce model breaks a complex data-analysis task into parallel phases (map and reduce), making it possible to analyze very large datasets effectively. It covers expressing the query, the data flow, and the details of task execution, from local testing through to running on a cluster of machines.


MapReduce

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written
in various languages; in this chapter, we shall look at the same program expressed in Java, Ruby, Python, and C++. Most important, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets, so let’s start by looking at one.

2.1 Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to
run it on a cluster of machines.

To take advantage of the parallel processing that Hadoop provides, we express our query as a MapReduce job; after some local, small-scale testing, we can run it on a cluster of machines.

2.2 Map and Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

Note: the map function and the reduce function each take key-value pairs as input and produce key-value pairs as output.
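As a rough local illustration (this is plain Python, not the Hadoop API), the two phases and the grouping-by-key step between them can be simulated in a few lines. The word-count map and reduce functions below are the classic example; all function names here are our own:

```python
# Minimal local sketch of the two-phase MapReduce model (not Hadoop's API):
# the programmer supplies a map function and a reduce function, and the
# framework groups the map output by key before the reduce phase runs.
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # Emit (word, 1) for each word in the line; the input key
    # (e.g. a byte offset of the line) is ignored here.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Sum the counts emitted for one word.
    yield (key, sum(values))

def run_job(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input key-value pair.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # "Shuffle": sort the intermediate pairs and group them by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn to each key and its list of values.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

result = run_job([(0, "the quick fox"), (14, "the lazy dog")],
                 map_fn, reduce_fn)
```

The key point the sketch makes concrete: the programmer writes only `map_fn` and `reduce_fn`, choosing the key and value types; the sorting and grouping in the middle belong to the framework.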

 

2.3 Scaling Out 横向扩展

Data Flow 数据流

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

A job is a unit of work that the client wants performed; it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into map tasks and reduce tasks.
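The division of a job into tasks can be sketched locally as well. This is a hypothetical simulation, not Hadoop's behavior or API (names such as `split_input` and `map_task` are ours): the input is cut into fixed-size splits, one map task runs per split, and a single reduce task merges the map outputs:

```python
# Hypothetical local sketch of dividing a job into tasks: Hadoop creates
# one map task per input split, then reduce tasks merge the map outputs.
from collections import defaultdict

def split_input(lines, split_size):
    # Cut the input data into splits of split_size records each.
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_task(split):
    # Each map task produces partial (word, count) pairs for its own split.
    counts = defaultdict(int)
    for line in split:
        for word in line.split():
            counts[word] += 1
    return counts

def reduce_task(partials):
    # A single reduce task merges the outputs of all the map tasks.
    merged = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            merged[word] += n
    return dict(merged)

lines = ["the quick fox", "the lazy dog", "fox and dog"]
splits = split_input(lines, split_size=2)     # two splits -> two map tasks
partials = [map_task(s) for s in splits]      # map tasks could run in parallel
result = reduce_task(partials)
```

Because each map task touches only its own split, the map tasks are independent and can run on different machines, which is where the parallelism in the earlier sections comes from.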

 
