Environment Overview
OS: Ubuntu 10.10 Server 64-bit
Servers:
hadoop-master: 10.6.1.150
- namenode, jobtracker; hbase-master, hbase-thrift;
- secondarynamenode;
- hive-master, hive-metastore;
- zookeeper-server;
- flume-master
- flume-node
- datanode, tasktracker
hadoop-node-1: 10.6.1.151
- datanode, tasktracker; hbase-regionserver;
- zookeeper-server;
- flume-node
hadoop-node-2: 10.6.1.152
- datanode, tasktracker; hbase-regionserver;
- zookeeper-server;
- flume-node
For how this environment was set up, see: Hadoop Cluster in Practice (0): Complete Architecture Design.
Two additional zookeeper-servers will now be added:
zookeeper-single-1: 10.6.1.161
- zookeeper-server;
zookeeper-single-2: 10.6.1.162
- zookeeper-server;
To avoid confusion when working across multiple servers, this article follows these conventions:
Any command that starts with a bare $ and no hostname must be run on every server, unless a trailing // comment says otherwise.
A command prefixed with dongguo@zookeeper-single-1:~$ must also be run on zookeeper-single-2, unless a matching dongguo@zookeeper-single-2:~$ command appears alongside it.
1. Configure /etc/hosts
$ sudo vim /etc/hosts
10.6.1.150 hadoop-master
10.6.1.151 hadoop-node-1
10.6.1.152 hadoop-node-2
10.6.1.153 hadoop-node-3
10.6.1.161 zookeeper-single-1
10.6.1.162 zookeeper-single-2
2. Set the hostnames
dongguo@zookeeper-single-1:~$ sudo vim /etc/hostname
dongguo@zookeeper-single-1:~$ sudo hostname zookeeper-single-1
dongguo@zookeeper-single-2:~$ sudo vim /etc/hostname
dongguo@zookeeper-single-2:~$ sudo hostname zookeeper-single-2
3. Install the Java environment
Add an APT source that provides the matching Java version:
dongguo@zookeeper-single-1:~$ sudo apt-get install python-software-properties
dongguo@zookeeper-single-1:~$ sudo vim /etc/apt/sources.list.d/sun-java-community-team-sun-java6-maverick.list
Install sun-java6-jdk:
dongguo@zookeeper-single-1:~$ sudo add-apt-repository ppa:sun-java-community-team/sun-java6
dongguo@zookeeper-single-1:~$ sudo apt-get update
dongguo@zookeeper-single-1:~$ sudo apt-get install sun-java6-jdk
4. Configure Cloudera's Hadoop APT source
dongguo@zookeeper-single-1:~$ sudo vim /etc/apt/sources.list.d/cloudera.list
dongguo@zookeeper-single-1:~$ sudo apt-get install curl
dongguo@zookeeper-single-1:~$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
dongguo@zookeeper-single-1:~$ sudo apt-get update
5. Install ZooKeeper
dongguo@zookeeper-single-1:~$ sudo apt-get install hadoop-zookeeper-server
6. Configure ZooKeeper
$ sudo vim /etc/zookeeper/zoo.cfg
dataDir=/data/zookeeper
...
server.1=hadoop-master:2888:3888
server.2=hadoop-node-1:2888:3888
server.3=hadoop-node-2:2888:3888
server.4=zookeeper-single-1:2888:3888
server.5=zookeeper-single-2:2888:3888
dongguo@zookeeper-single-1:~$ sudo mkdir -p /data/zookeeper
dongguo@zookeeper-single-1:~$ sudo chown zookeeper:zookeeper /data/zookeeper
Create the myid file; it must contain the number from this host's server.N entry in zoo.cfg (4 on zookeeper-single-1, 5 on zookeeper-single-2):
dongguo@zookeeper-single-1:~$ sudo -u zookeeper vim /data/zookeeper/myid
dongguo@zookeeper-single-2:~$ sudo -u zookeeper vim /data/zookeeper/myid
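The vim steps above can also be sketched non-interactively; the myid content must match each host's server.N line in zoo.cfg (server.4=zookeeper-single-1 → 4, server.5=zookeeper-single-2 → 5). In this sketch a scratch directory stands in for /data/zookeeper so it runs without sudo:

```shell
# Sketch: write myid without an editor. On the real hosts the path is
# /data/zookeeper/myid and the write is done as the zookeeper user;
# a scratch directory is used here instead.
DATA_DIR=$(mktemp -d)
echo 4 > "$DATA_DIR/myid"   # 4 on zookeeper-single-1; 5 on zookeeper-single-2
cat "$DATA_DIR/myid"
```

If myid does not match the server.N entry for that host, the server will fail to join the ensemble, so this is worth double-checking before restarting.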
7. Update the ZooKeeper-related configuration across the cluster
7.1 Update the ZooKeeper settings for HBase
$ sudo vim /etc/hbase/conf/hbase-site.xml //only on hadoop-master, hadoop-node-1 and hadoop-node-2
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
<name>hbase.rootdir</name>
...
<name>hbase.cluster.distributed</name>
...
<name>hbase.zookeeper.quorum</name>
<value>hadoop-master,hadoop-node-1,hadoop-node-2,zookeeper-single-1,zookeeper-single-2</value>
...
7.2 Update the ZooKeeper settings for Flume
$ sudo vim /etc/flume/conf/flume-site.xml //only on hadoop-master, hadoop-node-1 and hadoop-node-2
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
<name>flume.master.servers</name>
<value>hadoop-master</value>
...
<name>flume.master.store</name>
<value>zookeeper</value>
...
<name>flume.master.zk.use.external</name>
...
<name>flume.master.zk.servers</name>
<value>hadoop-master:2181,hadoop-node-1:2181,hadoop-node-2:2181,zookeeper-single-1:2181,zookeeper-single-2:2181</value>
...
7.3 Update the ZooKeeper settings for Hive
dongguo@hadoop-master:~$ sudo vim /etc/hive/conf/hive-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
<name>javax.jdo.option.ConnectionURL</name>
...
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
...
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
...
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
...
<name>datanucleus.autoCreateSchema</name>
...
<name>datanucleus.fixedDatastore</name>
...
<name>hive.aux.jars.path</name>
...
<name>hbase.zookeeper.quorum</name>
<value>hadoop-master,hadoop-node-1,hadoop-node-2,zookeeper-single-1,zookeeper-single-2</value>
...
8. Restart the ZooKeeper service
$ sudo /etc/init.d/hadoop-zookeeper-server restart //on the five hosts that run zookeeper-server
9. Restart all services that depend on ZooKeeper
9.1 Restart the HBase services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hbase-master restart
dongguo@hadoop-node-1:~$ sudo /etc/init.d/hadoop-hbase-regionserver restart
dongguo@hadoop-node-2:~$ sudo /etc/init.d/hadoop-hbase-regionserver restart
9.2 Restart the Flume services
dongguo@hadoop-master:~$ sudo /etc/init.d/flume-master restart
$ sudo /etc/init.d/flume-node restart //only on hadoop-master, hadoop-node-1 and hadoop-node-2
9.3 Restart the Hive services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-server restart
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-metastore restart
10. Check ZooKeeper status
10.1 Verify that the configuration has been synchronized
dongguo@zookeeper-single-1:~$ zookeeper-client
Connecting to localhost:2181
[zk: localhost:2181(CONNECTED) 0] ls /
[flume-cfgs, counters-config_version, flume-chokemap, hbase, zookeeper, flume-nodes]
[zk: localhost:2181(CONNECTED) 1]
As shown, the znodes for all related systems have already been synchronized to the newly added zookeeper-server.
11. ZooKeeper failure drill
The idea of the drill is to simulate zookeeper-server failures and verify that the remaining servers can keep serving when part of the ensemble dies abruptly.
A ZooKeeper ensemble should have an odd number of members (an even count adds no extra fault tolerance); with 5 servers we meet that recommendation.
Servers take one of two roles, leader or follower (clients simply connect to any server); the ensemble elects the leader by vote, and it keeps serving as long as a majority of the servers is alive; otherwise the whole ensemble stops serving.
We will first determine each zookeeper-server's role, then kill randomly chosen servers one after another, checking at each step how the remaining servers' roles change and whether they keep working.
To check a zookeeper-server's role, run:
dongguo@zookeeper-single-2:~$ zookeeper-server status
Using config: /etc/zookeeper/zoo.cfg
Mode: follower
As shown above, zookeeper-single-2 currently has the follower role.
To get the zookeeper-server's process ID, run:
$ ps aux | grep /etc/zookeeper/zoo.cfg | grep -v grep | awk '{print $2}'
Then simply kill that PID with sudo kill -9.
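The two steps above can be combined into one small script. Since killing the real ZooKeeper JVM needs sudo, the sketch below targets a dummy sleep process instead; on a cluster node the grep pattern would be /etc/zookeeper/zoo.cfg and the kill would be run with sudo:

```shell
# Sketch of the kill procedure, exercised against a stand-in process.
sleep 300 &                      # dummy process standing in for the ZooKeeper JVM
PATTERN="sleep 300"              # on a real node: /etc/zookeeper/zoo.cfg
PID=$(ps aux | grep "$PATTERN" | grep -v grep | awk '{print $2}' | head -n 1)
kill -9 "$PID"                   # on a real node: sudo kill -9 "$PID"
wait "$PID" 2>/dev/null || true  # reap the child so the PID is fully gone
```

kill -9 (SIGKILL) cannot be caught, so this models an abrupt crash rather than a clean shutdown, which is exactly what the drill wants.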
Conclusion:
The process and outcome of the drill were as follows:
Simulating failures with arbitrary combinations of servers confirmed that the ZooKeeper ensemble's fault tolerance matches its algorithm:
with a total of 2n+1 servers, up to n failed nodes can be tolerated.
In the actual tests, 3 servers tolerated 1 failure, 5 tolerated 2, and 7 tolerated 3.
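These tolerance figures follow directly from majority quorum: an ensemble of t = 2n+1 servers stays available while any majority (n+1) survives, so it tolerates n = (t-1)/2 failures. A quick sketch:

```shell
# Majority quorum: t servers tolerate floor((t - 1) / 2) failures.
for t in 3 5 7; do
  echo "$t servers tolerate $(( (t - 1) / 2 )) failed node(s)"
done
```

This is also why growing the ensemble from 5 to 6 servers would gain nothing: floor((6-1)/2) is still 2.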
Original article: http://heylinux.com/archives/2063.html