Original: https://blog.youkuaiyun.com/duheaven/article/details/17038679
HDFS High Availability Using the Quorum Journal Manager
- Purpose
- Note: Using the Quorum Journal Manager or Conventional Shared Storage
- Background
- Architecture
- Hardware resources
- Deployment
- Automatic Failover
- Automatic Failover FAQ
- HDFS Upgrade/Finalization/Rollback with HA Enabled
Purpose
This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using the Quorum Journal Manager (QJM) feature.
This document assumes that the reader has a general understanding of general components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.
Note: Using the Quorum Journal Manager or Conventional Shared Storage
This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using NFS for shared storage instead of the QJM, please see this alternative guide.
Background
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
- In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
- Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
Architecture
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
Hardware resources
In order to deploy an HA cluster, you should prepare the following:
- NameNode machines - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.
- JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
Deployment
Configuration overview
Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.
Like HDFS Federation, HA clusters reuse the nameservice ID to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called NameNode ID is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the nameservice ID as well as the NameNode ID.
Configuration details
To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.
The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[nameservice ID] will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.
- dfs.nameservices - the logical name for this new nameservice
Choose a logical name for this nameservice, for example “mycluster”, and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.
Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
- dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used “mycluster” as the nameservice ID previously, and you wanted to use “nn1” and “nn2” as the individual IDs of the NameNodes, you would configure this as such:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
Note: Currently, only a maximum of two NameNodes may be configured per nameservice.
- dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each NameNode to listen on
For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode process. Note that this results in two separate configuration options. For example:
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
Note: You may similarly configure the “servicerpc-address” setting if you so desire.
- dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on
Similarly to rpc-address above, set the addresses for both NameNodes’ HTTP servers to listen on. For example:
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
Note: If you have Hadoop’s security features enabled, you should also set the https-address similarly for each NameNode.
- dfs.namenode.shared.edits.dir - the URI which identifies the group of JNs where the NameNodes will write/read edits
This is where one configures the addresses of the JournalNodes which provide the shared edits storage, written to by the Active NameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be of the form: qjournal://host1:port1;host2:port2;host3:port3/journalId. The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though not a requirement, it's a good idea to reuse the nameservice ID for the journal identifier.
For example, if the JournalNodes for this cluster were running on the machines "node1.example.com", "node2.example.com", and "node3.example.com" and the nameservice ID were "mycluster", you would use the following as the value for this setting (the default port for the JournalNode is 8485):
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
- dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode
Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The two implementations which currently ship with Hadoop are the ConfiguredFailoverProxyProvider and the RequestHedgingProxyProvider (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider. For example:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
- dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager. However, to improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example “shell(/bin/true)”.
The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.
sshfence - SSH to the Active NameNode and kill the process
The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
shell - run an arbitrary shell command to fence the Active NameNode
The shell fencing method runs an arbitrary shell command. It may be configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.
The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the ‘_’ character replacing any ‘.’ characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms – for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.
Additionally, the following variables referring to the target node to be fenced are also available:
$target_host - hostname of the node to be fenced
$target_port - IPC port of the node to be fenced
$target_address - the above two, combined as host:port
$target_nameserviceid - the nameservice ID of the NN to be fenced
$target_namenodeid - the namenode ID of the NN to be fenced
These environment variables may also be used as substitutions in the shell command itself. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.
Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (e.g. by forking a subshell to kill its parent in some number of seconds); a sketch of such a script is shown after this configuration list.
- fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given
Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used “mycluster” as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
- dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state
This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array. For example:
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
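To illustrate the shell fencing method described above, here is a minimal sketch of a custom fencing script with a built-in timeout, as suggested in the note on timeouts. The script name (my-fence.sh), the use of ssh with BatchMode, and the 10-second limit are illustrative assumptions; only the $target_host and $target_port variables and the use of fuser come from the behavior described in this document. It would be referenced from dfs.ha.fencing.methods as shell(/path/to/my-fence.sh).

#!/usr/bin/env bash
# my-fence.sh (hypothetical): fence the previously active NameNode over SSH.
# Hadoop exports $target_host and $target_port for the node to be fenced.

# Watchdog: fork a subshell that kills this script after 10 seconds, so a hung
# SSH connection makes the fencing attempt fail (non-zero exit) instead of blocking.
( sleep 10; kill -9 $$ ) &
watchdog=$!

# Kill whatever is listening on the target NameNode's RPC port.
ssh -o BatchMode=yes "$target_host" "fuser -k -n tcp $target_port"
result=$?

kill "$watchdog" 2>/dev/null
exit $result

If this script fails or is killed by the watchdog, the non-zero exit code causes the next fencing method in the list to be attempted, as described above.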
Deployment details
After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command “hadoop-daemon.sh start journalnode” and waiting for the daemon to start on each of the relevant machines.
Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes’ on-disk metadata.
- If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of the NameNodes.
- If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command "hdfs namenode -bootstrapStandby" on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
- If you are converting a non-HA NameNode to be HA, you should run the command "hdfs namenode -initializeSharedEdits", which will initialize the JournalNodes with the edits data from the local NameNode edits directories.
At this point you may start both of your HA NameNodes as you normally would start a NameNode.
You can visit each of the NameNodes’ web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either “standby” or “active”.) Whenever an HA NameNode starts, it is initially in the Standby state.
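Putting the first-time setup steps above together, the sequence for a fresh cluster might look roughly like the following sketch. The host placement follows the earlier examples (node1/node2/node3 as JournalNodes, machine1/machine2 as nn1/nn2); adjust paths and hosts to your installation.

# On each JournalNode host (node1/node2/node3 in the earlier examples):
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start journalnode

# On the first NameNode (nn1 on machine1.example.com): format and start it.
$HADOOP_PREFIX/bin/hdfs namenode -format
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

# On the second NameNode (nn2 on machine2.example.com): copy nn1's metadata, then start.
$HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

# When converting an existing non-HA NameNode instead of formatting, initialize
# the JournalNodes from its local edits directories:
# $HADOOP_PREFIX/bin/hdfs namenode -initializeSharedEdits

# Both NameNodes come up in the Standby state; use the haadmin commands in the
# next section (or automatic failover) to make one of them active.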
Administrative commands
Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the “hdfs haadmin” command. Running this command without any additional arguments will display the following usage information:
Usage: haadmin
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-getAllServiceState]
[-checkHealth <serviceId>]
[-help <command>]
This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run “hdfs haadmin -help <command>”.
- transitionToActive and transitionToStandby - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. These commands do not attempt to perform any fencing, and thus should rarely be used. Instead, one should almost always prefer to use the “hdfs haadmin -failover” subcommand.
- failover - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.
- getServiceState - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing either "standby" or "active" to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby; a minimal sketch of such a check is given after this list.
- getAllServiceState - returns the state of all the NameNodes
Connect to the configured NameNodes to determine the current state, print either “standby” or “active” to STDOUT appropriately.
- checkHealth - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.
Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.
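As noted for getServiceState, monitoring scripts can poll the HA state of the NameNodes. The following is a minimal sketch of such a check, suitable for a cron job; the script name, the NameNode IDs nn1/nn2, and the warning-only behavior are assumptions for illustration.

#!/usr/bin/env bash
# check-ha-state.sh (hypothetical): warn if no NameNode reports "active".
state1=$(hdfs haadmin -getServiceState nn1 2>/dev/null)
state2=$(hdfs haadmin -getServiceState nn2 2>/dev/null)

if [ "$state1" != "active" ] && [ "$state2" != "active" ]; then
  echo "WARNING: no active NameNode (nn1=$state1, nn2=$state2)" >&2
  exit 1
fi
exit 0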
Automatic Failover
Introduction
The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.
Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
- Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
- Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
- Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
- ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special "lock" znode. This lock uses ZooKeeper's support for "ephemeral" nodes; if the session expires, the lock node will be automatically deleted.
- ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.
Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.
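For example, assuming a three-node ensemble on the zk1/zk2/zk3.example.com hosts used later in this document, a quick check with the ZK CLI might look like this (the zkCli.sh location depends on your ZooKeeper installation; on a fresh ensemble you should see at least the built-in /zookeeper node):

$ZK_HOME/bin/zkCli.sh -server zk1.example.com:2181
[zk: zk1.example.com:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: zk1.example.com:2181(CONNECTED) 1] quit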
Before you begin
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
Configuring automatic failover
The configuration of automatic failover requires the addition of two new parameters to your configuration. In your hdfs-site.xml file, add:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
This lists the host-port pairs running the ZooKeeper service.
As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.
There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.
Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
Starting the cluster with start-dfs.sh
Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually start the zkfc daemon on each of the machines that runs a NameNode. You can start the daemon by running:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc
Securing access to ZooKeeper
If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.
In order to secure the information in ZooKeeper, first add the following to your core-site.xml file:
<property>
<name>ha.zookeeper.auth</name>
<value>@/path/to/zk-auth.txt</value>
</property>
<property>
<name>ha.zookeeper.acl</name>
<value>@/path/to/zk-acl.txt</value>
</property>
Please note the ‘@’ character in these values – this specifies that the configurations are not inline, but rather point to a file on disk.
The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:
digest:hdfs-zkfcs:mypassword
…where hdfs-zkfcs is a unique username for ZooKeeper, and mypassword is some unique string used as a password.
Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
Copy and paste the section of this output after the '->' string into the file zk-acl.txt, prefixed by the string "digest:". For example:
digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
In order for these ACLs to take effect, you should then rerun the zkfc -formatZK command as described above.
After doing so, you may verify the ACLs from the ZK CLI as follows:
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces – each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 <pid of NN> to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
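A scripted version of this test might look like the following sketch. It assumes the NameNode IDs nn1/nn2 from the earlier examples, that the kill is run on the host of the active NameNode, and that jps is available there to find the NameNode process ID.

# Find out which NameNode is currently active.
hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
hdfs haadmin -getServiceState nn2

# On the host of the active NameNode, simulate a JVM crash.
kill -9 $(jps | awk '/ NameNode$/ {print $1}')

# Within several seconds (ha.zookeeper.session-timeout.ms, 5 seconds by default)
# the other NameNode should report "active".
sleep 10
hdfs haadmin -getServiceState nn2    # assuming nn1 was the active one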
Automatic Failover FAQ
- Is it important that I start the ZKFC and NameNode daemons in any particular order?
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
- What additional monitoring should I put in place?
You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted to ensure that the system is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, then automatic failover will not function.
- What happens if ZooKeeper goes down?
If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.
- Can I designate one of my NameNodes as primary/preferred?
No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.
- How can I initiate a manual failover when automatic failover is configured?
Even if automatic failover is configured, you may initiate a manual failover using the same hdfs haadmin command. It will perform a coordinated failover.
HDFS Upgrade/Finalization/Rollback with HA Enabled
When moving between versions of HDFS, sometimes the newer software can simply be installed and the cluster restarted. Sometimes, however, upgrading the version of HDFS you’re running may require changing on-disk data. In this case, one must use the HDFS Upgrade/Finalize/Rollback facility after installing the new software. This process is made more complex in an HA environment, since the on-disk metadata that the NN relies upon is by definition distributed, both on the two HA NNs in the pair, and on the JournalNodes in the case that QJM is being used for the shared edits storage. This documentation section describes the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.
To perform an HA upgrade, the operator must do the following:
- Shut down all of the NNs as normal, and install the newer software.
- Start up all of the JNs. Note that it is critical that all the JNs be running when performing the upgrade, rollback, or finalization operations. If any of the JNs are down at the time of running any of these operations, the operation will fail.
- Start one of the NNs with the '-upgrade' flag.
- On start, this NN will not enter the standby state as usual in an HA setup. Rather, this NN will immediately enter the active state, perform an upgrade of its local storage dirs, and also perform an upgrade of the shared edit log.
- At this point the other NN in the HA pair will be out of sync with the upgraded NN. In order to bring it back in sync and once again have a highly available setup, you should re-bootstrap this NameNode by running the NN with the '-bootstrapStandby' flag. It is an error to start this second NN with the '-upgrade' flag.
Note that if at any time you want to restart the NameNodes before finalizing or rolling back the upgrade, you should start the NNs as normal, i.e. without any special startup flag.
To finalize an HA upgrade, the operator will use the 'hdfs dfsadmin -finalizeUpgrade' command while the NNs are running and one of them is active. The active NN at the time this happens will perform the finalization of the shared log, and the NN whose local storage directories contain the previous FS state will delete its local state.
To perform a rollback of an upgrade, both NNs should first be shut down. The operator should run the rollback command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, either NFS or on the JNs. Afterward, this NN should be started and the operator should run '-bootstrapStandby' on the other NN to bring the two NNs in sync with this rolled-back file system state.
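Expressed as commands, the upgrade procedure above might look roughly like the following sketch. The exact way the '-upgrade' flag is passed depends on how you start your NameNode daemons; passing it through hadoop-daemon.sh as shown here is an assumption, not something this document prescribes.

# 1. Shut down both NameNodes as normal and install the new software, then make
#    sure every JournalNode is running (on each JN host):
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start journalnode

# 2. On one NameNode only, start it with the -upgrade flag; it enters the active
#    state and upgrades its local storage directories and the shared edit log.
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode -upgrade

# 3. On the other NameNode, re-bootstrap it (do NOT pass -upgrade) and start it.
$HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

# 4. Once satisfied with the new version, finalize while the NameNodes are running:
$HADOOP_PREFIX/bin/hdfs dfsadmin -finalizeUpgrade

# To roll back instead, shut down both NameNodes and run the rollback on the
# NameNode where the upgrade was started, then re-bootstrap the other one.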