------------------------------------------------------------------------------------
简介
------------------------------------------------------------------------------------
(1) 适用于大规模数据并行处理,可扩展到成千上百台机器,每台机器拥有多核。
适用于在一系列机器上处理大量分布式任务。
(2) Hadoop includes a DFS (distributed file system) which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible.
------------------------------------------------------------------------------------
challenges
------------------------------------------------------------------------------------
(5) Hadoop is designed to handle hardware failure and data congestion issues very robustly.
(6) In addition to worrying about these sorts of bugs and challenges, there is also the fact that the compute hardware has finite resources available to it. The major resources include:
a)Processor time
b)Memory
c)Hard drive space
d)Network bandwidth
(7) To be successful, a large-scale distributed system must be able to manage the above mentioned resources efficiently. Furthermore, it must allocate some of these resources toward maintaining the system as a whole, while devoting as much time as possible to the actual core computation.
(8) Synchronization between multiple machines remains the biggest challenge in distributed system design.
------------------------------------------------------------------------------------
Data Distribution
------------------------------------------------------------------------------------
(9) HDFS will split large data files into chunks which are managed by different nodes in the cluster. In addition to this each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
(10) Each process running on a node in the cluster then processes a subset of these records. The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation allows Hadoop to achieve high data locality which in turn results in high performance.
------------------------------------------------------------------------------------
MapReduce: Isolated Processes
------------------------------------------------------------------------------------
(11)Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from one another.
(12)
In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.
Mapping and reducing tasks run on nodes where individual records of data are already present.
(13)Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.
(14)By restricting the communication between nodes, Hadoop makes the distributed system much more reliable.
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
(15) 机器数目少时(10台以内),MPI也许会快些。然而,话费在机器和开发、重开发上精力巨大,可扩展性极差。
(16) 然而Hadoop的可扩展性极强。
转载于:https://blog.51cto.com/amao99/211227