Hadoop 2.2.0 x64: Compilation and Cluster Setup

Recently I compiled and set up a Hadoop 2.2.0 environment on a small cluster. I ran into many problems and solved them by searching the Internet and reading the source code. Here I'd like to record my steps to compile Hadoop, set up the environment, and configure it, along with those problems, for anyone who wants to set up Hadoop 2.2.0 or who runs into the same issues.

1. Introducing the cluster

There are 17 nodes:

CPU: Intel i7-3930

RAM: 15 GB

Hard disk: NFS and local SSD (100 GB available)

OS: Linux 2.6.32-358.14.1.el6.x86_64

I use 1 node as the master (serving as the master and namenode) and 12 nodes as slaves (serving as slaves and datanodes).

I install Hadoop on NFS and use the local SSD as HDFS storage for better performance.

2. Compile Hadoop

Download the Hadoop source code

The reason we compile it ourselves is that the official Hadoop release contains some 32-bit (x86) parts, and running that release on a pure x86_64 machine causes failures when the daemons are started over SSH. Those who have already set up SSH public keys but still see SSH failures may consider this.

Here we download the source tarball.

Of course there are FTP, HTTPS, and mirror options; choose whichever you like.
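
For example, a minimal sketch assuming the Apache archive URL (any mirror works just as well):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz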

Then unpack it:

tar -xzf hadoop-2.2.0-src.tar.gz

Other software used to compile

Of course you can use Eclipse to build Hadoop, but here I can only connect through SSH, so Eclipse is not a good choice.

Here is the software I used:

JDK:

If you don't have Java, you can download JDK 7 here:

tar -xzf jdk-7u51-linux-x64.tar.gz

and set environment variables:

vim ~/.profile # if you are the root user of the machine, you can edit /etc/profile instead
Add JAVA_HOME:

export JAVA_HOME=(enter your jdk dir)
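
If java is not already on your PATH, you probably also want to add the JDK's bin directory and check the result (adjust to your own setup):

export PATH=$PATH:$JAVA_HOME/bin
java -version    # should report 1.7.0_51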

maven:

Do not use Maven 3.1.1; it caused errors when I compiled Hadoop.

Here I use Maven 3.0.5:

tar -xzf apache-maven-3.0.5-bin.tar.gz

and set environment variables:

export MAVEN_HOME=(enter your maven dir)
export PATH=$PATH:$MAVEN_HOME/bin
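
You can confirm that the right Maven and JDK are picked up:

mvn -version    # should report Apache Maven 3.0.5 and your JDK 7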

protobuf:

I use protobuf-2.5.0

You need to compile this yourself, and you will need gcc/g++ for that:

tar -xjf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure --prefix=(enter the dir you want to install protobuf)
make
make install

and set environment variables:

export PROTO_HOME=(enter your protobuf dir)
export PATH=$PATH:$PROTO_HOME/bin
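
You can verify that the protoc on your PATH is the one you just built; Hadoop 2.2.0 requires exactly this version:

protoc --version    # should print: libprotoc 2.5.0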

findbugs:

I use findbugs-2.0.3

tar -xzf findbugs-2.0.3.tar.gz

and set environment variables:

export FINDBUGS_HOME=(enter your findbugs dir)
export PATH=$PATH:$FINDBUGS_HOME/bin

Start compiling

Before starting the compile, you need to patch the Hadoop source code; otherwise the build may fail with:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile (default-testCompile) on project hadoop-auth: Compilation failure: Compilation failure:
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[84,13] cannot access org.mortbay.component.AbstractLifeCycle
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] server = new Server(0);
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[94,29] cannot access org.mortbay.component.LifeCycle
[ERROR] class file for org.mortbay.component.LifeCycle not found
[ERROR] server.getConnectors()[0].setHost(host);
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[96,10] cannot find symbol
[ERROR] symbol  : method start()
[ERROR] location: class org.mortbay.jetty.Server
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[102,12] cannot find symbol
[ERROR] symbol  : method stop()
[ERROR] location: class org.mortbay.jetty.Server
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-auth

The solution is described in https://issues.apache.org/jira/browse/HADOOP-10110; you need the patch attached to that issue.
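
A minimal sketch of applying it, assuming you saved the JIRA attachment as HADOOP-10110.patch (the actual file name depends on the attachment) next to the extracted source; -p0 assumes the patch paths start at the source root:

patch -p0 -d hadoop-2.2.0-src < HADOOP-10110.patch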
After applying the patch, start the build:
cd hadoop-2.2.0-src
mvn package -Pdist,native,docs -DskipTests -Dtar 
The machine needs Internet access while compiling, because Maven downloads dependencies during the build.

It takes 30-40 minutes to compile, depending on the machine, and the built package will be found at hadoop-dist/target/hadoop-2.2.0.tar.gz.

3. Install Hadoop

Move the tarball somewhere convenient and unpack it:

mv hadoop-dist/target/hadoop-2.2.0.tar.gz ~/
cd ~
tar -xzf hadoop-2.2.0.tar.gz

and set environment variables:

export HADOOP_HOME=(enter your hadoop dir)
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}/share/hadoop/mapreduce
export HADOOP_COMMON_HOME=${HADOOP_HOME}/share/hadoop/common
export HADOOP_HDFS_HOME=${HADOOP_HOME}/share/hadoop/hdfs
export YARN_HOME=${HADOOP_HOME}/share/hadoop/yarn
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
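
Once the environment variables are loaded, you can sanity-check the build; the path below assumes the layout of the 2.2.0 dist tarball, and the exact library file name may differ slightly:

hadoop version
file $HADOOP_HOME/lib/native/libhadoop.so.1.0.0   # should report an ELF 64-bit x86-64 shared object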

4. Set up a Hadoop user

Maybe you don't need this, or you can create a new account; here we are already logged in as the "hadoop" user.

5. Set up SSH

First we set up RSA key-based login to each node, because Hadoop uses SSH to start each daemon when it starts up. Key-based login avoids typing a password for every connection.

Details about RSA are on Wikipedia.

ssh-keygen -t rsa

Here you can enter a passphrase to protect your private key, but I don't use one.

Then you will have two keys: a public key id_rsa.pub and a private key id_rsa.

Append the public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

We use the private key to log in automatically to any machine whose authorized_keys contains the corresponding public key.

So we should put this public key into every slave node's ~/.ssh/authorized_keys.
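
If your home directory were not on shared storage (mine is on NFS, as noted below), a minimal sketch of distributing the key, assuming a hypothetical slaves.txt listing one slave per line (password login must still work for this first copy):

while read host; do
    ssh-copy-id "$host"
done < slaves.txt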

Then check. On the master machine:

ssh (each slave) # you need to type yes the first time to authorize each slave's host key

Each slave's hostname or IP will be recorded in ~/.ssh/known_hosts.

Because I use NFS, all nodes share the same home directory and therefore the same public and private key.

6. Configure Hadoop

Before starting Hadoop, we need to write the configuration.

All configuration files are in $HADOOP_HOME/etc/hadoop.

We need to edit core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, hadoop-env.sh, yarn-env.sh and slaves.

Here I just show my configuration; I will write another blog post about each file. Each XML snippet below goes inside the <configuration> element of its file.

The SSD is mounted at /tmp, so I use /tmp for the data directories.

core-site.xml

<property>
        <name>fs.default.name</name>
        <value>hdfs://(your namenode hostname or IP):(port number)</value>
        <final>true</final>
</property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/xtq/tmp</value>
</property> 

hdfs-site.xml

<property>
        <name>dfs.namenode.name.dir</name>
        <value>/tmp/xtq/namenode/</value>
</property>
<property>
        <name>dfs.datanode.data.dir</name>
        <value>/tmp/xtq/datanode</value>
</property>
<property>
        <name>dfs.replication</name>
        <value>2</value>
</property>

mapred-site.xml

Some values depend on your machines and jobs; you should tune them to get good performance.
You need to copy mapred-site.xml.template to mapred-site.xml:
cp mapred-site.xml.template mapred-site.xml

<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
<property>
        <name>mapred.job.tracker</name>
        <value>(your namenode hostname or IP):(port number)</value>
</property>
<property>
        <name>mapreduce.cluster.temp.dir</name>
        <value>/tmp/xtq/cluster-temp</value>
        <final>true</final>
</property>
<property>
        <name>mapreduce.cluster.local.dir</name>
        <value>/tmp/xtq/cluster-local</value>
        <final>true</final>
</property>
<property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>512</value>
</property>
<property>
        <name>mapreduce.task.io.sort.factor</name>
        <value>10</value>
</property>
<property>
        <name>mapreduce.tasktracker.map.tasks.maximum</name>
        <value>100</value>
</property>
<property>
        <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
        <value>100</value>
</property>
<property>
        <name>mapreduce.jobtracker.address</name>
        <value>(your namenode hostname or IP):(port number)</value>
</property>
<property>
        <name>mapreduce.job.maps</name>
        <value>96</value>
</property>
<property>
        <name>mapreduce.job.reduces</name>
        <value>96</value>
</property>
<property>
        <name>mapreduce.jobtracker.jobhistory.location</name>
        <value>/tmp/xtq/jobhistory</value>
</property>

yarn-site.xml

Some values depend on your machines and jobs; you should tune them to get good performance.
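A minimal yarn-site.xml sketch that is enough to run MapReduce on YARN; treat it as an illustrative starting point under the same placeholder conventions as above, not a tuned configuration:

<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>(your master/resourcemanager hostname or IP)</value>
</property>
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>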


hadoop-env.sh

Add these variables:

export HADOOP_HOME="(your hadoop dir)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
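
If the daemons complain that JAVA_HOME is not set when they are started over SSH (non-interactive shells do not read ~/.profile), set it explicitly in hadoop-env.sh as well, using your own JDK directory:

export JAVA_HOME=(enter your jdk dir)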

yarn-env.sh

Add these variables:

export HADOOP_HOME="(your hadoop dir)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

slaves

List every slave's user and hostname, one per line, in this format:
user@hostname
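
For example (the user name and hostnames are purely illustrative):

hadoop@slave01
hadoop@slave02
hadoop@slave03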

7. Start Hadoop

Before starting Hadoop, we need to format the namenode:
hdfs namenode -format
You can run the hdfs, hadoop, ... commands from any directory if you added the environment variables as described above; otherwise you have to go to $HADOOP_HOME/bin and run
./hdfs namenode -format

There is a common failure where a datanode can't start because you formatted the namenode several times without cleaning the datanode data. When a datanode starts, it reports its blocks to the namenode, the namenode returns an ID, and the datanode compares it with the ID stored locally; the datanode shuts down if the IDs do not match. So you should delete all data under the directory that dfs.datanode.data.dir points to in hdfs-site.xml and reformat the namenode.
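
A minimal cleanup sketch, assuming the dfs.datanode.data.dir value used above (/tmp/xtq/datanode) and passwordless SSH to the entries in the slaves file:

while read host; do
    ssh "$host" 'rm -rf /tmp/xtq/datanode/*'
done < "$HADOOP_HOME/etc/hadoop/slaves"
hdfs namenode -format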

Then run

start-all.sh
or

start-dfs.sh
start-yarn.sh
These commands are in $HADOOP_HOME/sbin.
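
You can check which daemons are actually running on each node with jps (it ships with the JDK):

jps    # on the master, typically: NameNode, SecondaryNameNode, ResourceManager
       # on a slave: DataNode, NodeManager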

If everything is fine, you can see the Hadoop web UI at http://(master):8088.

8. Test Hadoop

Here we run a simple teragen job to test whether Hadoop works fine.
The teragen class is in ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar.

We can run:
hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen 1000000 /teragen
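
If the job succeeds, you should be able to see its output directory in HDFS:

hadoop fs -ls /teragen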

9. Some useful commands

hdfs dfsadmin -report
This command prints information about every datanode.
hadoop fs -XXX
You can use Linux-like subcommands such as -ls, -rm, -du, ... to manage HDFS files.