Chapter 2 MapReduce
Analyzing the Data with Unix Tools
Problems
1. Dividing the work into equal-size pieces isn't always easy or obvious.
2. Combining the results from independent processes may require further processing (see the sketch after this list).
3. We are still limited by the processing capacity of a single machine.
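A minimal single-machine sketch of this kind of analysis, assuming a hypothetical input format of plain-text "key value" lines (the function names are made up for illustration):

```python
# Hypothetical serial baseline: one process scans the whole file and keeps the
# maximum value seen for each key (so throughput is capped by that one machine).
def max_per_key(lines):
    maxima = {}
    for line in lines:
        key, value = line.split()
        value = int(value)
        if key not in maxima or value > maxima[key]:
            maxima[key] = value
    return maxima

# If the file were split across independent processes to parallelize the scan,
# each process would produce a partial result, and those partial results would
# still need a further merge step to get the final answer.
def merge(partials):
    combined = {}
    for partial in partials:
        for key, value in partial.items():
            if key not in combined or value > combined[key]:
                combined[key] = value
    return combined
```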
Analyzing the Data with Hadoop
MapReduce
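As a reminder of the programming model (not code from the chapter, with a made-up max-per-key job as the example): the map function turns each input record into (key, value) pairs, the framework sorts and groups the pairs by key, and the reduce function aggregates the values for each key. A toy, single-process simulation:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit (key, value) pairs from one input record ("key value" lines assumed).
    key, value = line.split()
    yield key, int(value)

def reduce_fn(key, values):
    # Aggregate all values seen for one key.
    yield key, max(values)

def run_mapreduce(lines):
    # Toy stand-in for the framework: map, shuffle (sort + group by key), reduce.
    mapped = [pair for line in lines for pair in map_fn(line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results

print(run_mapreduce(["1949 111", "1950 22", "1949 78", "1950 0"]))
# [('1949', 111), ('1950', 22)]
```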
Scaling Out
To scale out, we need to store the data in a distributed filesystem (typically HDFS).
Tasks are scheduled by YARN.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits and creates one map task for each split. For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default.
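A back-of-the-envelope sketch of how split size translates into the number of map tasks; the helper below is purely illustrative and not part of any Hadoop API:

```python
import math

def num_map_tasks(input_size_bytes, split_size_bytes=128 * 1024 * 1024):
    # One map task per input split, e.g. a 1 GiB input with the default
    # 128 MB split size gives 8 map tasks.
    return math.ceil(input_size_bytes / split_size_bytes)

print(num_map_tasks(1024 ** 3))  # 8
```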
Data locality optimization: Hadoop does its best to run each map task on a node where its input data resides in HDFS. Reduce tasks don't have the advantage of data locality, because the input to a single reduce task is normally the output of many map tasks and has to be transferred across the network.
A combiner function can be run on the map output to reduce the amount of data transferred between the map and reduce phases.
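A rough sketch of what a combiner buys you, reusing the toy max-per-key job above: it runs on each map task's local output before the shuffle, so fewer (key, value) pairs cross the network. (In the Java API a combiner is set with Job.setCombinerClass(...), and Hadoop Streaming has a -combiner option; for an aggregate like max it can be the same logic as the reducer.)

```python
def combine_local(map_output):
    # Map-side, per-split aggregation: keep only the local maximum per key so
    # that far fewer pairs are shuffled to the reducers.
    local_max = {}
    for key, value in map_output:
        if key not in local_max or value > local_max[key]:
            local_max[key] = value
    return list(local_max.items())

# Output of one map task before combining (4 pairs) vs. after (2 pairs).
split_output = [("1949", 111), ("1949", 78), ("1950", 22), ("1950", 0)]
print(combine_local(split_output))  # [('1949', 111), ('1950', 22)]
```

The combiner is an optional optimization: Hadoop may run it zero, one, or many times on a map task's output, so it is only safe for functions (like max) whose final result is unchanged when they are applied to partial results.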
Hadoop Streaming
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read from standard input and write to standard output. (It is a good fit for text processing.)
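A minimal streaming-style sketch with a hypothetical script name: the same Python file acts as the mapper (turning raw "key value" records into tab-separated key/value lines) or as the reducer (reading lines already sorted by key and emitting the maximum per key).

```python
#!/usr/bin/env python3
# streaming_max.py (hypothetical) -- used as both mapper and reducer:
#   mapper:  reads raw "key value" lines from stdin, writes "key<TAB>value"
#   reducer: reads "key<TAB>value" lines already sorted by key, writes the
#            maximum value per key
import sys

def run_mapper():
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 2:
            print(f"{parts[0]}\t{parts[1]}")

def run_reducer():
    current_key, current_max = None, None
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        value = int(value)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{current_max}")
            current_key, current_max = key, value
        elif value > current_max:
            current_max = value
    if current_key is not None:
        print(f"{current_key}\t{current_max}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```

You can smoke-test the pipeline with ordinary Unix pipes, e.g. cat input.txt | ./streaming_max.py map | sort | ./streaming_max.py reduce, before submitting it to the cluster with the hadoop-streaming jar (using options such as -input, -output, -mapper, -reducer, and -files; the jar's exact path depends on your installation).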