HDFS Architecture

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/

Assumptions and Goals
Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.

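For example, through the native FileSystem Java API the only mutation allowed on a closed file is an append; the sketch below is illustrative (the path is hypothetical, and the cluster must permit appends):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/logs/events.log"); // hypothetical existing file

        // New bytes can only land at the end of the file; there is no API
        // to overwrite data at an arbitrary offset.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("another record\n");
        }
    }
}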

Moving Computation is Cheaper than Moving Data

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

[Figure: HDFS Architecture]

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

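These namespace operations map directly onto the FileSystem Java API; a minimal illustrative sketch (all paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/alice/reports"));           // create a directory
        fs.rename(new Path("/user/alice/reports/draft.txt"),  // move or rename a file
                  new Path("/user/alice/reports/final.txt"));
        fs.delete(new Path("/user/alice/old"), true);         // remove a directory recursively
    }
}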

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.

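Both settings can be supplied per file at creation time; a sketch using one of the FileSystem.create overloads (the path and values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithLayout {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/data/big.bin"), // hypothetical path
                true,                      // overwrite if present
                4096,                      // io buffer size in bytes
                (short) 3,                 // replication factor for this file
                256L * 1024 * 1024)) {     // 256 MB blocks instead of the default
            out.writeBytes("payload");
        }
    }
}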

All blocks in a file except the last block are the same size. Since support for variable-length blocks was added to append and hsync, users can start a new block without filling the last block to the configured block size.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

[Figure: Block Replication]

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

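The toy model below restates that placement logic in code. It is not the real BlockPlacementPolicyDefault: Node, placeThree, and pick are illustrative stand-ins that ignore capacity, load, and edge cases such as single-rack clusters.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class PlacementSketch {
    static class Node {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    static List<Node> placeThree(Node writer, List<Node> cluster, Random rnd) {
        List<Node> targets = new ArrayList<>();
        // 1st replica: the writer's own node if it is a DataNode, else random.
        Node first = (writer != null) ? writer : pick(cluster, rnd);
        targets.add(first);
        // 2nd replica: any node on a different (remote) rack.
        Node second = pick(cluster.stream()
                .filter(n -> !n.rack.equals(first.rack))
                .collect(Collectors.toList()), rnd);
        targets.add(second);
        // 3rd replica: a different node on the same remote rack as the 2nd.
        targets.add(pick(cluster.stream()
                .filter(n -> n.rack.equals(second.rack) && n != second)
                .collect(Collectors.toList()), rnd));
        return targets;
    }

    static Node pick(List<Node> nodes, Random rnd) {
        return nodes.get(rnd.nextInt(nodes.size()));
    }
}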

If the replication factor is greater than 3, the placement of the 4th and following replicas are determined randomly while keeping the number of replicas per rack below the upper limit (which is basically (replicas - 1) / racks + 2).

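Expressed as code, the limit works out as follows (illustrative arithmetic only):

// Example: 10 replicas over 3 racks allow at most (10 - 1) / 3 + 2 = 5 per rack.
public class RackLimit {
    static int maxReplicasPerRack(int replicas, int racks) {
        return (replicas - 1) / racks + 2; // integer division
    }
}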

Because the NameNode does not allow DataNodes to have multiple replicas of the same block, the maximum number of replicas created is the total number of DataNodes at that time.

After the support for Storage Types and Storage Policies was added to HDFS, the NameNode takes the policy into account for replica placement in addition to the rack awareness described above. The NameNode chooses nodes based on rack awareness at first, then checks that the candidate node has the storage required by the policy associated with the file. If the candidate node does not have the storage type, the NameNode looks for another node. If enough nodes to place replicas cannot be found in the first pass, the NameNode looks for nodes having fallback storage types in a second pass.

The current, default replica placement policy described here is a work in progress.

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

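Both the percentage and the extra wait are NameNode-side settings. The sketch below only names the relevant keys from hdfs-default.xml with their usual defaults; in practice they belong in the NameNode's hdfs-site.xml rather than in client code:

import org.apache.hadoop.conf.Configuration;

public class SafemodeConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fraction of blocks that must reach their minimum replication
        // before the NameNode may leave Safemode.
        conf.set("dfs.namenode.safemode.threshold-pct", "0.999");
        // Extra wait after the threshold is met, in milliseconds
        // (the "additional 30 seconds" mentioned above).
        conf.set("dfs.namenode.safemode.extension", "30000");
    }
}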

The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the Editlog. During the checkpoint the changes from Editlog are applied to the FsImage. A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.

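A sketch naming the two trigger properties from the paragraph above (values are illustrative; these are NameNode-side settings that normally live in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;

public class CheckpointConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.namenode.checkpoint.period", 3600);  // checkpoint every hour
        conf.setLong("dfs.namenode.checkpoint.txns", 1000000); // or after a million transactions
        // Whichever threshold is reached first triggers the checkpoint.
    }
}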

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the Blockreport.

The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

The time-out to mark DataNodes dead is conservatively long (over 10 minutes by default) in order to avoid replication storms caused by state flapping of DataNodes. For performance-sensitive workloads, users can configure a shorter interval to mark DataNodes as stale and avoid reading from and/or writing to stale nodes.

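A hedged sketch of those knobs (property names as in hdfs-default.xml; these NameNode-side settings are shown through the Configuration API only to name the keys, with illustrative values):

import org.apache.hadoop.conf.Configuration;

public class StaleNodeConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.namenode.stale.datanode.interval", "30000");        // ms without a heartbeat before a node is stale
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);  // read from stale nodes only as a last resort
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true); // do not pick stale nodes as write targets
    }
}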

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.

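The standalone toy below mimics the shape of that scheme: checksum fixed-size chunks on write, then recompute and compare on read. By default HDFS keeps one checksum per 512 bytes (dfs.bytes-per-checksum); this code is only an illustration of the idea, not the client's actual implementation.

import java.util.Arrays;
import java.util.zip.CRC32C;

public class ChunkChecksums {
    // One CRC per fixed-size chunk, computed when the data is written.
    static long[] checksum(byte[] data, int bytesPerChunk) {
        int chunks = (data.length + bytesPerChunk - 1) / bytesPerChunk;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32C crc = new CRC32C();
            int off = i * bytesPerChunk;
            crc.update(data, off, Math.min(bytesPerChunk, data.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, any mismatch marks this copy corrupt, and the client would
    // fall back to a different replica of the block.
    static boolean verify(byte[] data, long[] stored, int bytesPerChunk) {
        return Arrays.equals(checksum(data, bytesPerChunk), stored);
    }
}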

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

Another option to increase resilience against failures is to enable High Availability using multiple NameNodes either with a shared storage on NFS or using a distributed edit log (called Journal). The latter is the recommended approach.

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time.

Data Organization
Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.

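As a concrete illustration of the chopping (the file size is hypothetical):

public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;        // typical HDFS block size
        long fileSize = 300L * 1024 * 1024;         // hypothetical 300 MB file
        long fullBlocks = fileSize / blockSize;     // 2 full 128 MB blocks
        long lastBlockBytes = fileSize % blockSize; // plus one final 44 MB block
        System.out.println(fullBlocks + " full blocks + " + lastBlockBytes + " bytes");
    }
}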

Replication Pipelining

When a client is writing data to an HDFS file with a replication factor of three, the NameNode retrieves a list of DataNodes using a replication target choosing algorithm. This list contains the DataNodes that will host a replica of that block. The client then writes to the first DataNode. The first DataNode starts receiving the data in portions, writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.

Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a FileSystem Java API for applications to use. A C language wrapper for this Java API and a REST API are also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. By using the NFS gateway, HDFS can be mounted as part of the client's local file system.

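A minimal read through the FileSystem Java API mentioned above (the path reuses the /foodir/myfile.txt example from the FS Shell section below; fs.defaultFS comes from the client's configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream file contents to stdout
        }
    }
}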

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Create a directory named /foodir:
$ bin/hadoop fs -mkdir /foodir

Remove a directory named /foodir:
$ bin/hadoop fs -rm -R /foodir

View the contents of a file named /foodir/myfile.txt:
$ bin/hadoop fs -cat /foodir/myfile.txt
DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Put the cluster in Safemode:
$ bin/hdfs dfsadmin -safemode enter

Generate a list of DataNodes:
$ bin/hdfs dfsadmin -report

Recommission or decommission DataNode(s):
$ bin/hdfs dfsadmin -refreshNodes
Space Reclamation
File Deletes and Undeletes

If trash configuration is enabled, files removed by FS Shell are not immediately removed from HDFS. Instead, HDFS moves them to a trash directory (each user has its own trash directory under /user/<username>/.Trash). A file can be restored quickly as long as it remains in trash.

Most recently deleted files are moved to the current trash directory (/user/<username>/.Trash/Current), and in a configurable interval, HDFS creates checkpoints (under /user/<username>/.Trash/<date>) for files in the current trash directory and deletes old checkpoints when they expire. See the expunge command of FS shell about checkpointing of trash.

After the expiry of its life in trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

$ hadoop fs -rm -r delete/test1
Moved: hdfs://localhost:8020/user/hadoop/delete/test1 to trash at: hdfs://localhost:8020/user/hadoop/.Trash/Current

$ hadoop fs -rm -r -skipTrash delete/test2
Deleted delete/test2
Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.

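A sketch of the setReplication call referred to above (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Lowering the factor only schedules excess replicas for deletion;
        // the freed space shows up in the cluster after a delay.
        fs.setReplication(new Path("/data/big.bin"), (short) 2);
    }
}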
