HDFS

最新推荐文章于 2025-09-04 12:00:53 发布

amaowolf

最新推荐文章于 2025-09-04 12:00:53 发布

阅读量563

点赞数

分类专栏： Storage 文章标签： system file mapreduce access parallel server

Storage 专栏收录该内容

19 篇文章

订阅专栏

本文深入探讨了分布式文件系统的概念，从简单易用的NFS出发，对比介绍了其局限性，进而引入了更为强大的HDFS。阐述了HDFS的设计原则、优势特性，包括数据分布、副本机制、同步过程以及NameNode和DataNode的角色，旨在为读者提供全面的分布式文件系统知识。

------------------------------------------------------------------------------------

简介

------------------------------------------------------------------------------------

(1)Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications.

------------------------------------------------------------------------------------

Distributed File System Basics

------------------------------------------------------------------------------------

(2)NFS

While its design is straightforward, it is also very constrained. NFS provides remote access to a single logical volume stored on a single machine. An NFS server makes a portion of its local file system visible to external clients. The clients can then mount this remote file system directly into their own Linux file system, and interact with it as though it were part of the local drive.

One of the primary advantages of this model is its transparency.

缺点：(a)The files in an NFS volume all reside on a single machine

(b) all the clients must go to this machine to retrieve their data. This can overload the server if a large number of clients must be handled

------------------------------------------------------------------------------------

(3)HDFS

(a)HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.

(b)HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
(c)HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.

(d)HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

(4) HDFS三特征：分片(block64MB), 副本(3 replica)，同步

(5) NameNode 存元数据，主要放在主存中， DataNode存数据片

(6) NameNode的可靠性非常重要，通过冗余来完成