Abstract: Last year I built a BI system whose ETL layer used Hadoop and Hive. I set up a Hadoop cluster on three Dell servers for development and testing.
In the next few posts I will cover the BI architecture design and the problems we hit during development, along with how we solved them. Today: building the cluster!
Runtime Environment
Server List
master 10.0.0.88
slave1 10.0.0.89
slave2 10.0.0.90
Deployment Architecture Diagram
Since we develop in Python, both Hive and HBase should expose a Thrift server. Hive supports Thrift out of the box, while HBase needs a separate Thrift Server instance started alongside it.
In a production architecture, HBase would not share a cluster with Hadoop: running MapReduce jobs on an HBase cluster degrades HBase performance, and ZooKeeper would likewise live on its own nodes. For the same reason, when integrating Hive with HBase, don't try to pull HBase data over to Hadoop for processing; MapReduce performs poorly that way. Because Hive runs a HiveServer2 instance to support concurrent access, ZooKeeper and MySQL are also required: ZooKeeper provides lock management and MySQL holds the metastore, and both are mandatory for concurrent Hive.
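Once everything below is running, each service listens on a well-known port: the HDFS NameNode on 9000 and HiveServer2 on 10000 (both configured later in this post), and the HBase Thrift server on its default 9090. As a quick sanity check of this deployment map, here is a minimal Python sketch (my own convenience script; hosts and ports are assumed to match the configuration below):

import socket

SERVICES = [
    ("HDFS NameNode", "master", 9000),   # fs.default.name in core-site.xml
    ("HiveServer2",   "master", 10000),  # default HiveServer2 port
    ("HBase Thrift",  "master", 9090),   # default HBase Thrift server port
]

for name, host, port in SERVICES:
    try:
        socket.create_connection((host, port), timeout=3).close()
        print("%s at %s:%d is reachable" % (name, host, port))
    except socket.error as e:
        print("%s at %s:%d is NOT reachable: %s" % (name, host, port, e))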
Software List
centos6.4
hadoop-1.2.1
hbase-0.94.12
hive-0.12.0
python-2.7.5
thrift-0.9.1
setuptools-0.6c11
jdk-7u40-linux-x64
Basic Environment Preparation
CentOS Installation
Install using the Basic Server profile.
Add a user
$useradd hadoop
$passwd hadoop
To add the entry below, you need to temporarily change the permissions on the sudoers file; change them back once editing is done.
$vi /etc/sudoers
hadoop ALL=(ALL) ALL
Network Configuration
$vi /etc/sysconfig/network-scripts/ifcfg-eth0
ONBOOT=yes
BOOTPROTO=dhcp
$ vi /etc/sysconfig/network
GATEWAY=###.###.###.###
$sudo /etc/init.d/network restart
or
$/etc/sysconfig/network-scripts/ifup-eth
Note: I bound each host's MAC address on the router, so DHCP always hands out the same IP.
Change the Hostname
$ vi /etc/sysconfig/network
HOSTNAME=#####
Disable the Firewall
This is important; otherwise Hadoop on the slaves will die for no apparent reason.
$sudo service iptables stop
$sudo chkconfig iptables off
$vi /etc/hosts
10.0.0.88 master
10.0.0.89 slave1
10.0.0.90 slave2
Apart from the IP address, every host gets the same configuration. When done, ping each host from every other host.
Passwordless SSH Login
Do all of this as the hadoop user.
$ssh-keygen -t rsa
$cd .ssh
$cp id_rsa.pub authorized_keys
$chmod 600 authorized_keys
Do this on every host, then merge the authorized_keys files from all hosts into a single file and copy it to every host, overwriting the original authorized_keys.
[hadoop@master ~]$ ssh hadoop@slave1
Last login: Tue Feb 18 15:41:59 2014 from 10.0.0.123
Test logins between every pair of hosts; if none of them asks for a password, this step succeeded.
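With three hosts the pairwise checks get tedious, so here is an optional Python sketch (my own convenience script, not part of the original procedure) that verifies passwordless login from the current host; ssh's BatchMode=yes makes the login fail instead of prompting for a password:

import subprocess

HOSTS = ["master", "slave1", "slave2"]

for host in HOSTS:
    # BatchMode=yes: fail immediately rather than prompt for a password
    rc = subprocess.call(["ssh", "-o", "BatchMode=yes", host, "true"])
    print("%s: %s" % (host, "OK" if rc == 0 else "password still required"))

Run it on each of the three hosts to cover every pair.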
Install the Sun JDK
Download the jdk-7u40-linux-x64.tar.gz package first.
$tar -xf jdk-7u40-linux-x64.tar.gz
$sudo cp -r jdk1.7.0_40 /usr/java
$sudo vi /etc/profile
export JAVA_HOME=/usr/java/jdk1.7.0_40
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
$source /etc/profile
Install the JDK on every host, with identical configuration.
Install MySQL (5.1.69)
$sudo yum -y install mysql-server
$sudo chkconfig mysqld on
$sudo service mysqld start
$/usr/bin/mysqladmin -u root password 'new_password'
Install Hadoop
$tar -xf hadoop-1.2.1.tar.gz
$cd hadoop-1.2.1/conf
$vim core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
</property>
</configuration>
$vim hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_40
$vim hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
$ vim mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
$vim masters
master
$vim slaves
slave1
slave2
$scp -r hadoop-1.2.1 hadoop@slave1:/home/hadoop
$scp -r hadoop-1.2.1 hadoop@slave2:/home/hadoop
$cd bin
$./hadoop namenode -format
$./start-all.sh
After startup completes:
$/usr/java/jdk1.7.0_40/bin/jps
On master you should see:
NameNode
SecondaryNameNode
JobTracker
On slave1 and slave2 you should see:
DataNode
TaskTracker
$./hadoop fs -put /home/hadoop/test.py /user/test.py
$./hadoop fs -ls /user
If test.py shows up, the installation succeeded.
Install HBase
$tar -xf hbase-0.94.12.tar.gz
$cd hbase-0.94.12
$cd conf
$vim hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.master</name>
<value>master:60000</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master,slave1,slave2</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/hbase-0.94.12/zookeeper</value>
</property>
</configuration>
$vi hbase-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_40/
export HBASE_MANAGES_ZK=true
$vi regionservers
slave1
slave2
$scp -r hbase-0.94.12 hadoop@slave1:/home/hadoop
$scp -r hbase-0.94.12 hadoop@slave2:/home/hadoop
$cd bin
Start HBase
$./start-hbase.sh
Start Thrift in nonblocking mode, which performs better in real-world use. (A nonblocking Thrift server requires clients to use a framed transport; see the Python example at the end of this post.)
Building and installing Thrift itself takes a few extra steps, covered at the end of this article.
$./hbase-daemon.sh start thrift -nonblocking
On master:
$/usr/java/jdk1.7.0_40/bin/jps
HMaster
HQuorumPeer
Jps
On slave1 and slave2:
$/usr/java/jdk1.7.0_40/bin/jps
HRegionServer
HQuorumPeer
Jps
hbase shell
$cd bin
$./hbase shell
> list     # lists tables, like MySQL's SHOW TABLES
Look up the other shell commands in the official Apache documentation.
Install Hive
Note: we enable HiveServer2 so that concurrent client requests can be handled; the next post is devoted to this.
$tar -xf hive-0.12.0.tar.gz
$cd hive-0.12.0
$cd conf
$vim hive-env.sh
HADOOP_HOME=/home/hadoop/hadoop-1.2.1
export HIVE_CONF_DIR=/home/hadoop/hive-0.12.0/conf
export HIVE_HOME=/home/hadoop/hive-0.12.0
$vim hive-site.xml
<configuration>
<!-- Hive Execution Parameters -->
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/bihive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.server2.async.exec.threads</name>
<value>50</value>
</property>
<property>
<name>hive.server2.async.exec.wait.queue.size</name>
<value>50</value>
</property>
<property>
<name>hive.support.concurrency</name>
<description>Enable Hive's Table Lock Manager Service</description>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>master</value>
</property>
<property>
<name>hive.groupby.skewindata</name>
<value>true</value>
</property>
<property>
<name>hive.multigroupby.singlemr</name>
<value>true</value>
</property>
</configuration>
$vi hive-log4j.properties
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
Accessing MySQL requires the mysql-connector-java-5.1.27-bin.jar package; place it under Hive's lib directory.
$./hive
hive>show tables;
OK
Time taken:2.585 seconds
$hiveserver2 &
$/usr/java/jdk1.7.0_40/bin/jps
RunJar
$netstat -nl|grep 10000
If the port shows up in the LISTEN state, HiveServer2 started successfully.
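Since the whole point here is reaching Hive from Python, a minimal connection sketch follows. It assumes the pyhs2 package (installable with easy_install pyhs2; an assumption on my part, not something set up earlier in this post), which speaks the HiveServer2 Thrift protocol; authMechanism="NOSASL" matches the hive.server2.authentication setting above:

import pyhs2

# NOSASL matches hive.server2.authentication in hive-site.xml above
with pyhs2.connect(host='master',
                   port=10000,
                   authMechanism="NOSASL",
                   user='hadoop',
                   database='default') as conn:
    with conn.cursor() as cur:
        cur.execute("show tables")
        for row in cur.fetch():
            print(row)

The next post covers the production client code in detail.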
Start the HBase Thrift Server
Install gcc and the build dependencies
$su root
$yum -y install gcc
$yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel
$yum install openssl-devel
Install Python
$tar -xf Python-2.7.5.tar.bz2
$cd Python-2.7.5
$./configure --prefix=/usr/local --enable-shared
$make && make altinstall
Install easy_install
$tar -xf setuptools-0.6c11.tar.gz
$cd setuptools-0.6c11
$python2.7 setup.py install
Install Thrift
$tar -xf thrift-0.9.1.tar.gz
$cd thrift-0.9.1
$./configure
$make install
Generate the Thrift client bindings
$thrift --gen py [hbase-root]/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
$easy_install thrift
$cp -r gen-py/hbase/ /usr/local/lib/python2.7/site-packages/
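With the generated hbase package on the Python path, a minimal client looks like the sketch below (getTableNames comes straight from Hbase.thrift). Because the server was started with -nonblocking, i.e. a nonblocking Thrift server, the client has to wrap its socket in a framed transport; 9090 is the Thrift server's default port:

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase  # the gen-py package copied into site-packages above

# A nonblocking Thrift server requires a framed transport on the client side
transport = TTransport.TFramedTransport(TSocket.TSocket('master', 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)

transport.open()
print(client.getTableNames())  # rough equivalent of 'list' in the hbase shell
transport.close()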