(1) Hadoop简介

最新推荐文章于 2025-09-06 21:58:22 发布

转载最新推荐文章于 2025-09-06 21:58:22 发布 · 102 阅读

0 ·

CC 4.0 BY-SA版权

原文链接：http://blog.51cto.com/amao99/211227

文章标签：

#大数据

本文介绍Hadoop如何通过连接大量低成本计算机实现大规模数据的高效并行处理。它利用分布式文件系统将数据分散存储，并通过MapReduce模型进行计算。文章还探讨了Hadoop在处理硬件故障、数据拥塞及资源管理方面的挑战。

------------------------------------------------------------------------------------

简介

------------------------------------------------------------------------------------

(1) 适用于大规模数据并行处理，可扩展到成千上百台机器，每台机器拥有多核。

适用于在一系列机器上处理大量分布式任务。

(2) Hadoop includes a DFS (distributed file system) which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible.

(3)Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel.

------------------------------------------------------------------------------------

challenges

------------------------------------------------------------------------------------

(4)In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Multiple implementations or versions of client software may speak slightly different protocols from one another. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress. Of course, actually providing such resilience is a major software engineering challenge.

(5) Hadoop is designed to handle hardware failure and data congestion issues very robustly.

(6) In addition to worrying about these sorts of bugs and challenges, there is also the fact that the compute hardware has finite resources available to it. The major resources include:

a)Processor time

b)Memory

c)Hard drive space

d)Network bandwidth

(7) To be successful, a large-scale distributed system must be able to manage the above mentioned resources efficiently. Furthermore, it must allocate some of these resources toward maintaining the system as a whole, while devoting as much time as possible to the actual core computation.

(8) Synchronization between multiple machines remains the biggest challenge in distributed system design.

------------------------------------------------------------------------------------

Data Distribution

------------------------------------------------------------------------------------

(9) HDFS will split large data files into chunks which are managed by different nodes in the cluster. In addition to this each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

(10) Each process running on a node in the cluster then processes a subset of these records. The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation allows Hadoop to achieve high data locality which in turn results in high performance.

------------------------------------------------------------------------------------

MapReduce: Isolated Processes

------------------------------------------------------------------------------------

(11)Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from one another.

(12)

In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.

Mapping and reducing tasks run on nodes where individual records of data are already present.

(13)Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.

(14)By restricting the communication between nodes, Hadoop makes the distributed system much more reliable.

------------------------------------------------------------------------------------

Flat Scalability

------------------------------------------------------------------------------------

(15) 机器数目少时(10台以内)，MPI也许会快些。然而，话费在机器和开发、重开发上精力巨大，可扩展性极差。

(16) 然而Hadoop的可扩展性极强。

转载于:https://blog.51cto.com/amao99/211227