Hadoop 2.2.0 x64: Compilation and Cluster Setup

Recently I compiled and set up a Hadoop 2.2.0 environment on a small cluster. I ran into many problems and solved them by searching the Internet and reading the source code. Here I'd like to record my steps to compile Hadoop, set up the environment, and configure it, along with those problems, for anyone who wants to set up Hadoop 2.2.0 or who runs into the same issues.

1. Introducing the cluster

There are 17 nodes:

CPU: Intel i7-3930

RAM: 15 GB

Hard disk: NFS and local SSD (100 GB available)

OS: Linux 2.6.32-358.14.1.el6.x86_64

I use 1 node as the master (serving as the master and namenode) and 12 nodes as slaves (serving as slaves and datanodes).

I install Hadoop on NFS and use the local SSD as HDFS storage for better performance.

2. Compile Hadoop

Download the Hadoop source code

The reason we compile it ourselves is that the official Hadoop release contains some 32-bit (x86) parts, and running that release on a pure x86_64 machine causes failures when the daemons are started over SSH. Those who have already set up SSH public keys but still see SSH failures may consider this.

Here we download the source tarball.

Of course there are FTP, HTTPS, and mirror options; choose whichever you like.
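
For example, a minimal sketch assuming the Apache archive URL (any mirror works just as well):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz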

Then unpack it:

tar -xzf hadoop-2.2.0-src.tar.gz

Other software used to compile

Of course you can use Eclipse to build Hadoop, but here I can only connect through SSH, so Eclipse is not a good choice.

Here is the software I used:

JDK:

If you don't have Java, you can download JDK 7 here:

tar -xzf jdk-7u51-linux-x64.tar.gz

and set environment variables:

vim ~/.profile # if you are the root user of the machine, you can edit /etc/profile instead
Add JAVA_HOME:

export JAVA_HOME=(enter your jdk dir)
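
If java is not already on your PATH, you probably also want to add the JDK's bin directory and check the result (adjust to your own setup):

export PATH=$PATH:$JAVA_HOME/bin
java -version    # should report 1.7.0_51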

maven:

Do not use Maven 3.1.1; it caused errors when I compiled Hadoop.

Here I use Maven 3.0.5:

tar -xzf apache-maven-3.0.5-bin.tar.gz

and set environment variables:

export MAVEN_HOME=(enter your maven dir)
export PATH=$PATH:$MAVEN_HOME/bin
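
You can confirm that the right Maven and JDK are picked up:

mvn -version    # should report Apache Maven 3.0.5 and your JDK 7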

protobuf:

I use protobuf-2.5.0

You need to compile this yourself, and you will need gcc/g++ for that:

tar -xjf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure --prefix=(enter the dir you want to install protobuf)
make
make install

and set environment variables:

export PROTO_HOME=(enter your protobuf dir)
export PATH=$PATH:$PROTO_HOME/bin
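
You can verify that the protoc on your PATH is the one you just built; Hadoop 2.2.0 requires exactly this version:

protoc --version    # should print: libprotoc 2.5.0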

findbugs:

I use findbugs-2.0.3

tar -xzf findbugs-2.0.3.tar.gz

and set environment variables:

export FINDBUGS_HOME=(enter your findbugs dir)
export PATH=$PATH:$FINDBUGS_HOME/bin

Start compiling

Before starting the compile, you need to patch the Hadoop source code; otherwise the build may fail with:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile (default-testCompile) on project hadoop-auth: Compilation failure: Compilation failure:
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[84,13] cannot access org.mortbay.component.AbstractLifeCycle
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] server = new Server(0);
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[94,29] cannot access org.mortbay.component.LifeCycle
[ERROR] class file for org.mortbay.component.LifeCycle not found
[ERROR] server.getConnectors()[0].setHost(host);
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[96,10] cannot find symbol
[ERROR] symbol  : method start()
[ERROR] location: class org.mortbay.jetty.Server
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[102,12] cannot find symbol
[ERROR] symbol  : method stop()
[ERROR] location: class org.mortbay.jetty.Server
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-auth

The solution is described in https://issues.apache.org/jira/browse/HADOOP-10110; you need the patch attached to that issue.
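
A minimal sketch of applying it, assuming you saved the JIRA attachment as HADOOP-10110.patch (the actual file name depends on the attachment) next to the extracted source; -p0 assumes the patch paths start at the source root:

patch -p0 -d hadoop-2.2.0-src < HADOOP-10110.patch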
After applying the patch, start the build:
cd hadoop-2.2.0-src
mvn package -Pdist,native,docs -DskipTests -Dtar 
The machine needs Internet access while compiling, because Maven downloads dependencies during the build.

It takes 30-40 minutes to compile, depending on the machine, and the built package will be found at hadoop-dist/target/hadoop-2.2.0.tar.gz.

3. Install Hadoop

Move the tarball somewhere convenient and unpack it:

mv hadoop-dist/target/hadoop-2.2.0.tar.gz ~/
cd ~
tar -xzf hadoop-2.2.0.tar.gz

and set environment variables:

export HADOOP_HOME=(enter your hadoop dir)
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}/share/hadoop/mapreduce
export HADOOP_COMMON_HOME=${HADOOP_HOME}/share/hadoop/common
export HADOOP_HDFS_HOME=${HADOOP_HOME}/share/hadoop/hdfs
export YARN_HOME=${HADOOP_HOME}/share/hadoop/yarn
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
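
Once the environment variables are loaded, you can sanity-check the build; the path below assumes the layout of the 2.2.0 dist tarball, and the exact library file name may differ slightly:

hadoop version
file $HADOOP_HOME/lib/native/libhadoop.so.1.0.0   # should report an ELF 64-bit x86-64 shared object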

4. Set up a Hadoop user

Maybe you don't need this, or you can create a new account; here we are already logged in as the "hadoop" user.

5. Set up SSH

First we set up RSA key-based login to each node, because Hadoop uses SSH to start each daemon when it starts up. Key-based login avoids typing a password for every connection.

Details about RSA are on Wikipedia.

ssh-keygen -t rsa

Here you can enter a passphrase to protect your private key, but I don't use one.

Then you will have two keys: a public key id_rsa.pub and a private key id_rsa.

Append the public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

We use the private key to log in automatically to any machine whose authorized_keys contains the corresponding public key.

So we should put this public key into every slave node's ~/.ssh/authorized_keys.
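
If your home directory were not on shared storage (mine is on NFS, as noted below), a minimal sketch of distributing the key, assuming a hypothetical slaves.txt listing one slave per line (password login must still work for this first copy):

while read host; do
    ssh-copy-id "$host"
done < slaves.txt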

Then check. On the master machine:

ssh (each slave) # you need to type yes the first time to authorize each slave's host key

Each slave's hostname or IP will be recorded in ~/.ssh/known_hosts.

Because I use NFS, all nodes share the same home directory and therefore the same public and private key.

6. Configure Hadoop

Before starting Hadoop, we need to write the configuration.

All configuration files are in $HADOOP_HOME/etc/hadoop.

We need to edit core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, hadoop-env.sh, yarn-env.sh and slaves.

Here I just show my configuration; I will write another blog post about each file. Each XML snippet below goes inside the <configuration> element of its file.

The SSD is mounted at /tmp, so I use /tmp for the data directories.

core-site.xml

<property>
        <name>fs.default.name</name>
        <value>hdfs://(your namenode hostname or IP):(port number)</value>
        <final>true</final>
</property>
<property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/xtq/tmp</value>
</property> 

hdfs-site.xml

<property>
        <name>dfs.namenode.name.dir</name>
        <value>/tmp/xtq/namenode/</value>
</property>
<property>
        <name>dfs.datanode.data.dir</name>
        <value>/tmp/xtq/datanode</value>
</property>
<property>
        <name>dfs.replication</name>
        <value>2</value>
</property>

mapred-site.xml

Some values depend on your machines and jobs; you should tune them to get good performance.
You need to copy mapred-site.xml.template to mapred-site.xml:
cp mapred-site.xml.template mapred-site.xml

<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
<property>
        <name>mapred.job.tracker</name>
        <value>(your namenode hostname or IP):(port number)</value>
</property>
<property>
        <name>mapreduce.cluster.temp.dir</name>
        <value>/tmp/xtq/cluster-temp</value>
        <final>true</final>
</property>
<property>
        <name>mapreduce.cluster.local.dir</name>
        <value>/tmp/xtq/cluster-local</value>
        <final>true</final>
</property>
<property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>512</value>
</property>
<property>
        <name>mapreduce.task.io.sort.factor</name>
        <value>10</value>
</property>
<property>
        <name>mapreduce.tasktracker.map.tasks.maximum</name>
        <value>100</value>
</property>
<property>
        <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
        <value>100</value>
</property>
<property>
        <name>mapreduce.jobtracker.address</name>
        <value>(your namenode hostname or IP):(port number)</value>
</property>
<property>
        <name>mapreduce.job.maps</name>
        <value>96</value>
</property>
<property>
        <name>mapreduce.job.reduces</name>
        <value>96</value>
</property>
<property>
        <name>mapreduce.jobtracker.jobhistory.location</name>
        <value>/tmp/xtq/jobhistory</value>
</property>

yarn-site.xml

Some values depend on your machines and jobs; you should tune them to get good performance.
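A minimal yarn-site.xml sketch that is enough to run MapReduce on YARN; treat it as an illustrative starting point under the same placeholder conventions as above, not a tuned configuration:

<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>(your master/resourcemanager hostname or IP)</value>
</property>
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>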


hadoop-env.sh

Add these variables:

export HADOOP_HOME="(your hadoop dir)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
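
If the daemons complain that JAVA_HOME is not set when they are started over SSH (non-interactive shells do not read ~/.profile), set it explicitly in hadoop-env.sh as well, using your own JDK directory:

export JAVA_HOME=(enter your jdk dir)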

yarn-env.sh

Add these variables:

export HADOOP_HOME="(your hadoop dir)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

slaves

List every slave's user and hostname, one per line, in this format:
user@hostname
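
For example (the user name and hostnames are purely illustrative):

hadoop@slave01
hadoop@slave02
hadoop@slave03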

7. Start Hadoop

Before starting Hadoop, we need to format the namenode:
hdfs namenode -format
You can run the hdfs, hadoop, ... commands from any directory if you added the environment variables as described above; otherwise you have to go to $HADOOP_HOME/bin and run
./hdfs namenode -format

There is a common failure where a datanode can't start because you formatted the namenode several times without cleaning the datanode data. When a datanode starts, it reports its blocks to the namenode, the namenode returns an ID, and the datanode compares it with the ID stored locally; the datanode shuts down if the IDs do not match. So you should delete all data under the directory that dfs.datanode.data.dir points to in hdfs-site.xml and reformat the namenode.
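
A minimal cleanup sketch, assuming the dfs.datanode.data.dir value used above (/tmp/xtq/datanode) and passwordless SSH to the entries in the slaves file:

while read host; do
    ssh "$host" 'rm -rf /tmp/xtq/datanode/*'
done < "$HADOOP_HOME/etc/hadoop/slaves"
hdfs namenode -format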

Then run

start-all.sh
or

start-dfs.sh
start-yarn.sh
These commands are in $HADOOP_HOME/sbin.
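
You can check which daemons are actually running on each node with jps (it ships with the JDK):

jps    # on the master, typically: NameNode, SecondaryNameNode, ResourceManager
       # on a slave: DataNode, NodeManager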

If everything is fine, you can see the Hadoop web UI at http://(master):8088.

8. Test Hadoop

Here we run a simple teragen job to test whether Hadoop works fine.
The teragen class is in ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar.

We can run:
hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen 1000000 /teragen
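
If the job succeeds, you should be able to see its output directory in HDFS:

hadoop fs -ls /teragen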

9. Some useful commands

hdfs dfsadmin -report
This command prints information about every datanode.
hadoop fs -XXX
You can use Linux-like subcommands such as -ls, -rm, -du, ... to manage HDFS files.