Hadoop Definitive Guide --- Chapter 3. The Hadoop Distributed Filesystem

The Design of HDFS
HDFS is designed to store massive amounts of data on large clusters of commodity machines.
Very large files
"Very large files" means files that are hundreds of megabytes, gigabytes, or terabytes in size; Hadoop clusters today can store petabytes of data.
Streaming data access
HDFS is built around the most efficient data-processing pattern: write once, read many times. It is therefore optimized for streaming data access.
Commodity hardware
Hadoop does not require expensive, high-performance machines; it was designed from the outset to use clusters of commodity hardware to achieve massive storage and fast computation.

HDFS Concepts
Blocks
An ordinary filesystem on a single disk works with a block size of 512 bytes; the HDFS distributed filesystem uses a block size of 64 MB by default.
To list the blocks that make up each file in the filesystem, run: hadoop fsck / -files -blocks
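
The default can also be overridden for a single upload via the generic -D option; a minimal sketch, assuming Hadoop 2.x (where the property is named dfs.blocksize) and illustrative paths:

  • hdfs dfs -D dfs.blocksize=134217728 -put localfile /user/hadoop/bigfile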

NameNodes and DataNodes
An HDFS cluster consists of a single NameNode and multiple DataNodes.
The namenode manages the filesystem namespace: it maintains the filesystem tree and its metadata. This information is stored persistently on the local disk in two files, the namespace image and the edit log. The datanodes handle storage management on the physical nodes where they run.
Because the namenode is a single point, its failure would bring down the entire cluster, so Hadoop provides the following two mechanisms to address the namenode single-point-of-failure problem.
The first is to back up the files that make up the filesystem metadata to local disk. The second is to run a secondary namenode, which periodically merges the namespace image with the edit log and keeps a copy of the merged image that can be used for recovery when the namenode fails.
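
For illustration, the namenode can also be made to persist a fresh namespace image on demand with dfsadmin; a minimal sketch, assuming superuser privileges (saveNamespace requires the namenode to be in safe mode):

  • hdfs dfsadmin -safemode enter
  • hdfs dfsadmin -saveNamespace
  • hdfs dfsadmin -safemode leave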

HDFS Federation
The namenode keeps the mapping between files and blocks in memory, which means that on a very large cluster the namenode's memory becomes the bottleneck to further scaling. Hadoop 2.x allows multiple namenodes to be added, each managing a different portion of the filesystem:
for example, the first might manage the contents under /usr and the second the contents under /share.
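
Each federated namenode exposes its own URI, so a client addresses whichever namenode manages the part of the namespace it needs. A sketch, assuming hypothetical namenodes nn1 and nn2 managing /usr and /share respectively:

  • hdfs dfs -ls hdfs://nn1.example.com/usr
  • hdfs dfs -ls hdfs://nn2.example.com/share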

HDFS High-Availability
Even with a secondary namenode creating periodic checkpoints, the namenode remains a single point of failure: once it fails, an administrator has to start a new namenode, which must do the following startup work:
load the namespace image into memory, replay the edit log, and collect block reports from the datanodes. On a large cluster, starting a new namenode this way typically takes 30 minutes or more.
Hadoop 2.x introduces a pair of namenodes, active and standby. Architecturally, the pair shares the edit log, and the datanodes send their block reports to both namenodes. The standby can usually take over within tens of seconds.
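
Hadoop 2.x ships an hdfs haadmin tool for inspecting and switching the roles of the pair; a sketch, assuming the two namenodes are configured with the logical IDs nn1 and nn2:

  • hdfs haadmin -getServiceState nn1
  • hdfs haadmin -failover nn1 nn2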

The Command-Line Interface

cat

Usage: hdfs dfs -cat URI [URI …]

Copies source paths to stdout.

Example:

  • hdfs dfs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
  • hdfs dfs -cat file:///file3 /user/hadoop/file4

Exit Code:
Returns 0 on success and -1 on error.

chgrp

Usage: hdfs dfs -chgrp [-R] GROUP URI [URI …]

Change group association of files. With -R, make the change recursively through the directory structure. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide.
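
Example (group and path are illustrative):

  • hdfs dfs -chgrp -R hadoop /user/hadoop/dir1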

chmod

Usage: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.
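
Example (mode and path are illustrative):

  • hdfs dfs -chmod -R 755 /user/hadoop/dir1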

chown

Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

Change the owner of files. With -R, make the change recursively through the directory structure. The user must be a super-user. Additional information is in the Permissions Guide.
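
Example (owner, group, and path are illustrative):

  • hdfs dfs -chown -R hadoop:hadoop /user/hadoop/dir1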

copyFromLocal

Usage: hdfs dfs -copyFromLocal <localsrc> URI

Similar to the put command, except that the source is restricted to a local file reference.
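
Example (paths are illustrative):

  • hdfs dfs -copyFromLocal localfile /user/hadoop/hadoopfile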

copyToLocal

Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to the get command, except that the destination is restricted to a local file reference.
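
Example (paths are illustrative):

  • hdfs dfs -copyToLocal /user/hadoop/hadoopfile localfile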

count

Usage: hdfs dfs -count [-q] <paths>

Count the number of directories, files and bytes under the paths that match the specified file pattern. 

The output columns with -count are:

DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME

The output columns with -count -q are:

QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME

Example:

  • hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
  • hdfs dfs -count -q hdfs://nn1.example.com/file1

Exit Code:

Returns 0 on success and -1 on error.

cp

Usage: hdfs dfs -cp URI [URI …] <dest>

Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Example:

  • hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
  • hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

du

Usage: hdfs dfs -du [-s] [-h] URI [URI …]

Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.

Options:

  • The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
  • The -h option will format file sizes in a "human-readable" fashion (e.g., 64.0m instead of 67108864)

Example:

  • hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

Exit Code:

Returns 0 on success and -1 on error.

dus

Usage: hdfs dfs -dus <args>

Displays a summary of file lengths. This is an alternate form of hdfs dfs -du -s.
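
Example (path is illustrative):

  • hdfs dfs -dus /user/hadoop/dir1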

expunge

Usage: hdfs dfs -expunge

Empty the Trash. Refer to the HDFS Architecture Guide for more information on the Trash feature.
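
Example:

  • hdfs dfs -expunge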

get

Usage: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst> 

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.

Example:

  • hdfs dfs -get /user/hadoop/file localfile
  • hdfs dfs -get hdfs://nn.example.com/user/hadoop/file localfile

Exit Code:

Returns 0 on success and -1 on error.

getmerge

Usage: hdfs dfs -getmerge <src> <localdst> [addnl]

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
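
Example (paths are illustrative; addnl is optional):

  • hdfs dfs -getmerge /user/hadoop/dir1 ./merged.txt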

ls

Usage: hdfs dfs -ls <args>

For a file returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns a list of its direct children, as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Example:

  • hdfs dfs -ls /user/hadoop/file1

Exit Code:

Returns 0 on success and -1 on error.

lsr

Usage: hdfs dfs -lsr <args>

Recursive version of ls. Similar to Unix ls -R.
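
Example (path is illustrative):

  • hdfs dfs -lsr /user/hadoop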

mkdir

Usage: hdfs dfs -mkdir <paths> 

Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

  • hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  • hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

moveFromLocal

Usage: hdfs dfs -moveFromLocal <localsrc> <dst>

Similar to the put command, except that the source localsrc is deleted after it's copied.
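
Example (paths are illustrative):

  • hdfs dfs -moveFromLocal localfile /user/hadoop/hadoopfile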

moveToLocal

Usage: hdfs dfs -moveToLocal [-crc] <src> <dst>

Displays a "Not implemented yet" message.

mv

Usage: hdfs dfs -mv URI [URI …] <dest>

Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:

  • hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
  • hdfs dfs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

Exit Code:

Returns 0 on success and -1 on error.

put

Usage: hdfs dfs -put <localsrc> ... <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

  • hdfs dfs -put localfile /user/hadoop/hadoopfile
  • hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
  • hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
  • hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile 
    Reads the input from stdin.

Exit Code:

Returns 0 on success and -1 on error.

rm

Usage: hdfs dfs -rm [-skipTrash] URI [URI …]

Delete files specified as args. Only deletes files. If the -skipTrash option is specified, the trash, if enabled, will be bypassed and the specified file(s) deleted immediately. This can be useful when it is necessary to delete files from an over-quota directory. Refer to rmr for recursive deletes.
Example:

  • hdfs dfs -rm hdfs://nn.example.com/file

Exit Code:

Returns 0 on success and -1 on error.

rmr

Usage: hdfs dfs -rmr [-skipTrash] URI [URI …]

Recursive version of delete. The rmr command recursively deletes the directory and any content under it. If the -skipTrash option is specified, the trash, if enabled, will be bypassed and the specified file(s) deleted immediately. This can be useful when it is necessary to delete files from an over-quota directory.
Example:

  • hdfs dfs -rmr /user/hadoop/dir
  • hdfs dfs -rmr hdfs://nn.example.com/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

setrep

Usage: hdfs dfs -setrep [-R] [-w] <rep> <path>

Changes the replication factor of a file. The -R option recursively changes the replication factor of files within a directory; the -w option waits for the replication to complete.

Example:

  • hdfs dfs -setrep -w 3 -R /user/hadoop/dir1

Exit Code:

Returns 0 on success and -1 on error.

stat

Usage: hdfs dfs -stat URI [URI …]

Returns the stat information on the path.

Example:

  • hdfs dfs -stat path

Exit Code:
Returns 0 on success and -1 on error.

tail

Usage: hdfs dfs -tail [-f] URI

Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.

Example:

  • hdfs dfs -tail pathname

Exit Code: 
Returns 0 on success and -1 on error.

test

Usage: hdfs dfs -test -[ezd] URI

Options: 
  • -e check to see if the file exists. Return 0 if true.
  • -z check to see if the file is zero length. Return 0 if true.
  • -d check to see if the path is a directory. Return 0 if true.

Example:

  • hdfs dfs -test -e filename

text

Usage: hdfs dfs -text <src> 

Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
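
Example (path is illustrative, assuming a SequenceFile rendered via TextRecordInputStream):

  • hdfs dfs -text /user/hadoop/file.seq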

touchz

Usage: hdfs dfs -touchz URI [URI …] 

Create a file of zero length.

Example:

  • hdfs dfs -touchz pathname

Exit Code:
Returns 0 on success and -1 on error.
