Chapter 2 MapReduce
Analyzing the Data with Unix Tools
Problems
1. Dividing the work into equal-size pieces isn't always easy or obvious.
2. Combining the results from independent processes may require further processing (see the sketch after this list).
3. You are still limited by the processing capacity of a single machine.
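As a hedged illustration of point 2, here is a minimal Python sketch, with made-up per-chunk results, of what merging the outputs of independent worker processes might look like when each worker computes a per-key maximum over its own chunk of the input:

```python
# Made-up example: each entry is the partial result produced by one
# independent worker, mapping a key (e.g. a year) to the maximum value
# seen in that worker's chunk of the input.
partial_results = [
    {"1949": 78, "1950": 22},
    {"1949": 111, "1950": 0},
]

# A final combining pass is still needed to merge the per-chunk maxima.
merged = {}
for partial in partial_results:
    for key, value in partial.items():
        merged[key] = max(merged.get(key, value), value)

print(merged)  # {'1949': 111, '1950': 22}
```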
Analyzing the Data with Hadoop
MapReduce
Scaling Out
To scale out, we need to store the data in a distributed filesystem (typically HDFS).
Tasks are scheduled by YARN.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default. For example, a 1 GB input file stored as 128 MB blocks yields eight splits, and hence eight map tasks.
Data locality optimization: Hadoop does its best to run each map task on a node where the input data resides in HDFS, so no cluster bandwidth is used for the map input. Reduce tasks don't have the advantage of data locality, because the sorted map outputs usually have to be transferred across the network to the node running the reduce task.
A combiner reduces the amount of data transferred between the map and reduce phases by aggregating each map task's output locally before it is sent across the network (see the sketch below).
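A sketch, written in the Hadoop Streaming style described in the next section, of a script that could serve as both the combiner and the reducer for a word-count style job (the file name sum_reducer.py and the job itself are assumptions, not from the original notes). Because summing is associative and commutative, running it locally on each map task's output before the shuffle only shrinks the data sent over the network without changing the final result:

```python
#!/usr/bin/env python3
# sum_reducer.py (hypothetical): sums the counts of consecutive identical keys.
# Hadoop sorts map output by key before feeding it to this script, both when it
# runs as a combiner on a map task's local output and when it runs as the final
# reducer, so identical keys arrive together.
import sys

current_key = None
current_sum = 0

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key and current_key is not None:
        # Key changed: emit the total for the previous key.
        print(f"{current_key}\t{current_sum}")
        current_sum = 0
    current_key = key
    current_sum += int(value)

if current_key is not None:
    print(f"{current_key}\t{current_sum}")
```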
Hadoop Streaming
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write a MapReduce program in any language that reads from standard input and writes to standard output. (It is a good fit for text processing.)
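A minimal word-count mapper in the same style (mapper.py is an assumed file name; the original notes don't include a worked example). It reads text from standard input and writes tab-separated key/value pairs to standard output, which is the contract Hadoop Streaming expects:

```python
#!/usr/bin/env python3
# mapper.py (hypothetical): emits "<word>\t1" for every word on standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

The job would then typically be submitted with the hadoop-streaming JAR, using -input and -output for the HDFS paths and -mapper, -combiner, and -reducer to point at mapper.py and sum_reducer.py (shipping the scripts to the cluster with -files).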
These notes discuss large-scale data processing with Hadoop MapReduce, covering how work is divided into equal-size pieces, distributed storage of the data, task scheduling, data locality optimization, combining results, and the basic steps of text processing with Hadoop Streaming.