Chapter 2 MapReduce
Analyzing the Data with Unix Tools
Problems
1. Dividing the work into equal-size pieces isn't always easy or obvious.
2. Combining the results from independent processes may require further processing (see the sketch after this list).
3. We are still limited by the processing capacity of a single machine.
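A minimal single-machine sketch of this kind of analysis, assuming a hypothetical input format of plain-text "key value" lines (the function names are made up for illustration):

```python
# Hypothetical serial baseline: one process scans the whole file and keeps the
# maximum value seen for each key (so throughput is capped by that one machine).
def max_per_key(lines):
    maxima = {}
    for line in lines:
        key, value = line.split()
        value = int(value)
        if key not in maxima or value > maxima[key]:
            maxima[key] = value
    return maxima

# If the file were split across independent processes to parallelize the scan,
# each process would produce a partial result, and those partial results would
# still need a further merge step to get the final answer.
def merge(partials):
    combined = {}
    for partial in partials:
        for key, value in partial.items():
            if key not in combined or value > combined[key]:
                combined[key] = value
    return combined
```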
Analyzing the Data with Hadoop
MapReduce
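As a reminder of the programming model (not code from the chapter, with a made-up max-per-key job as the example): the map function turns each input record into (key, value) pairs, the framework sorts and groups the pairs by key, and the reduce function aggregates the values for each key. A toy, single-process simulation:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit (key, value) pairs from one input record ("key value" lines assumed).
    key, value = line.split()
    yield key, int(value)

def reduce_fn(key, values):
    # Aggregate all values seen for one key.
    yield key, max(values)

def run_mapreduce(lines):
    # Toy stand-in for the framework: map, shuffle (sort + group by key), reduce.
    mapped = [pair for line in lines for pair in map_fn(line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results

print(run_mapreduce(["1949 111", "1950 22", "1949 78", "1950 0"]))
# [('1949', 111), ('1950', 22)]
```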
Scaling Out
To scale out, we need to store the data in a distributed filesystem (typically HDFS).
Tasks are scheduled by YARN.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits and creates one map task for each split. For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default.
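A back-of-the-envelope sketch of how split size translates into the number of map tasks; the helper below is purely illustrative and not part of any Hadoop API:

```python
import math

def num_map_tasks(input_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # One map task per input split, e.g. a 1 GiB input with the default
    # 128 MB split size gives 8 map tasks.
    return math.ceil(input_size_bytes / split_size_bytes)

print(num_map_tasks(1024 ** 3))  # 8
```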
Data locality optimization: Hadoop does its best to run each map task on a node where its input data resides in HDFS. Reduce tasks don't have the advantage of data locality, because the input to a single reduce task is normally the output of many map tasks and has to be transferred across the network.
A combiner function can be run on the map output to reduce the amount of data transferred between the map and reduce phases.
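A rough sketch of what a combiner buys you, reusing the toy max-per-key job above: it runs on each map task's local output before the shuffle, so fewer (key, value) pairs cross the network. (In the Java API a combiner is set with Job.setCombinerClass(...), and Hadoop Streaming has a -combiner option; for an aggregate like max it can be the same logic as the reducer.)

```python
def combine_local(map_output):
    # Map-side, per-split aggregation: keep only the local maximum per key so
    # that far fewer pairs are shuffled to the reducers.
    local_max = {}
    for key, value in map_output:
        if key not in local_max or value > local_max[key]:
            local_max[key] = value
    return list(local_max.items())

# Output of one map task before combining (4 pairs) vs. after (2 pairs).
split_output = [("1949", 111), ("1949", 78), ("1950", 22), ("1950", 0)]
print(combine_local(split_output))  # [('1949', 111), ('1950', 22)]
```

The combiner is an optional optimization: Hadoop may run it zero, one, or many times on a map task's output, so it is only safe for functions (like max) whose final result is unchanged when they are applied to partial results.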
Hadoop Streaming
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read from standard input and write to standard output. (It is a good fit for text processing.)
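A minimal streaming-style sketch with a hypothetical script name: the same Python file acts as the mapper (turning raw "key value" records into tab-separated key/value lines) or as the reducer (reading lines already sorted by key and emitting the maximum per key).

```python
#!/usr/bin/env python3
# streaming_max.py (hypothetical) -- used as both mapper and reducer:
#   mapper:  reads raw "key value" lines from stdin, writes "key<TAB>value"
#   reducer: reads "key<TAB>value" lines already sorted by key, writes the
#            maximum value per key
import sys

def run_mapper():
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 2:
            print(f"{parts[0]}\t{parts[1]}")

def run_reducer():
    current_key, current_max = None, None
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        value = int(value)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{current_max}")
            current_key, current_max = key, value
        elif value > current_max:
            current_max = value
    if current_key is not None:
        print(f"{current_key}\t{current_max}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```

You can smoke-test the pipeline with ordinary Unix pipes, e.g. cat input.txt | ./streaming_max.py map | sort | ./streaming_max.py reduce, before submitting it to the cluster with the hadoop-streaming jar (using options such as -input, -output, -mapper, -reducer, and -files; the jar's exact path depends on your installation).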