Recently I compiled and set up a Hadoop 2.2.0 environment on a small cluster. I ran into many problems and solved them by searching the Internet and reading the source code. Here I'd like to record my steps to compile Hadoop, set up the environment, and configure it, along with the problems I hit, for those who want to set up Hadoop 2.2.0 or who face the same problems.
1. Introducing the cluster
There are 17 nodes:
CPU: Intel i7-3930
RAM: 15 GB
Hard disk: NFS and a local SSD (100 GB available)
OS: Linux 2.6.32-358.14.1.el6.x86_64
I use 1 node as the master (serving as master and namenode) and 12 nodes as slaves (serving as slaves and datanodes).
I install Hadoop on NFS and use the local SSDs as HDFS storage for better performance.
2. Compile Hadoop
Download the Hadoop source code.
The reason we compile it ourselves is that the release build of Hadoop ships some 32-bit (x86) native parts, and running the release build on a pure x86_64 machine can cause SSH failures at startup. Those who have already set up SSH public keys but still face SSH failures may want to consider this.
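You can check whether a Hadoop build ships 32-bit or 64-bit native libraries with the file command; for example, on an unpacked tarball (the path below assumes the default layout of the Hadoop 2.2.0 distribution):
file hadoop-2.2.0/lib/native/libhadoop.so.1.0.0
# "ELF 32-bit LSB shared object, Intel 80386" means 32-bit; "ELF 64-bit LSB shared object, x86-64" means a 64-bit build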
Here we download the source tarball.
Of course there are also FTP, HTTPS and mirror links; choose whichever you like.
Then run:
tar -xzf hadoop-2.2.0-src.tar.gz
Other software used to compile
Of course you can use Eclipse to build Hadoop, but I can only connect to these machines through SSH, so Eclipse is not a good choice here.
Here is the software:
JDK:
If you don't have Java, you can download JDK 7 here:
tar -xzf jdk-7u51-linux-x64.tar.gz
and set environment variables:
vim ~/.profile   # if you are the root user of the machine, you can edit /etc/profile instead
Add JAVA_HOME:
export JAVA_HOME=(enter your jdk dir)
maven:
Do not use Maven 3.1.1; it caused errors when I compiled Hadoop.
Here I use Maven 3.0.5:
tar -xzf apache-maven-3.0.5-bin.tar.gz
and set environment variables:
export MAVEN_HOME=(enter your maven dir)
export PATH=$PATH:$MAVEN_HOME/bin
protobuf:
I use protobuf-2.5.0
You need to compile this one yourself, so you may need gcc:
tar -xjf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure --prefix=(enter the dir you want to install protobuf)
make
make install
and set environment variables:
export PROTO_HOME=(enter your protobuf dir)
export PATH=$PATH:$PROTO_HOME/bin
findbugs:
I use findbugs-2.0.3
tar -xzf findbugs-2.0.3.tar.gz
and set environment variables:
export FINDBUGS_HOME=(enter your findbugs dir)
export PATH=$PATH:$FINDBUGS_HOME/bin
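Before starting the build, it is worth checking that the shell picks up the versions you just installed; these commands only print version information:
java -version
mvn -version
protoc --version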
Start the compile:
cd hadoop-2.2.0-src
mvn package -Pdist,native,docs -DskipTests -Dtar
The machine needs Internet access while compiling, because Maven downloads its dependencies.
One error I hit during compilation, in the hadoop-auth module:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile (default-testCompile) on project hadoop-auth: Compilation failure: Compilation failure:
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[84,13] cannot access org.mortbay.component.AbstractLifeCycle
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
[ERROR] server = new Server(0);
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[94,29] cannot access org.mortbay.component.LifeCycle
[ERROR] class file for org.mortbay.component.LifeCycle not found
[ERROR] server.getConnectors()[0].setHost(host);
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[96,10] cannot find symbol
[ERROR] symbol  : method start()
[ERROR] location: class org.mortbay.jetty.Server
[ERROR] /home/chuan/trunk/hadoop-common-project/hadoop-auth/src/test/java/org/apache/hadoop/security/authentication/client/AuthenticatorTestCase.java:[102,12] cannot find symbol
[ERROR] symbol  : method stop()
[ERROR] location: class org.mortbay.jetty.Server
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hadoop-auth
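This hadoop-auth failure is a known problem with the Hadoop 2.2.0 source release (tracked upstream as HADOOP-10110). The usual workaround is to add the missing jetty-util test dependency to hadoop-common-project/hadoop-auth/pom.xml, next to the existing org.mortbay.jetty dependency; a sketch of the extra block (double-check it against your pom):
<dependency>
<!-- workaround for HADOOP-10110: the hadoop-auth tests need the Jetty lifecycle classes -->
<groupId>org.mortbay.jetty</groupId>
<artifactId>jetty-util</artifactId>
<scope>test</scope>
</dependency>
After editing the pom you can resume the build with mvn package -Pdist,native,docs -DskipTests -Dtar -rf :hadoop-auth.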
3. Install Hadoop
After a successful build, the distribution tarball is under hadoop-dist/target; move it to where you want to install Hadoop and unpack it:
mv hadoop-dist/target/hadoop-2.2.0.tar.gz ~/
cd ~
tar -xzf hadoop-2.2.0.tar.gz
and set environment variables:
export HADOOP_HOME=(enter your hadoop dir)
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}/share/hadoop/mapreduce
export HADOOP_COMMON_HOME=${HADOOP_HOME}/share/hadoop/common
export HADOOP_HDFS_HOME=${HADOOP_HOME}/share/hadoop/hdfs
export YARN_HOME=${HADOOP_HOME}/share/hadoop/yarn
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
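After sourcing the updated profile, a quick sanity check is to ask Hadoop for its version; this only prints build information:
source ~/.profile
hadoop version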
4. Set up the hadoop user
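If you want a dedicated account for Hadoop on every node, a minimal sketch, assuming you have root access (the user name hadoop is only an example):
sudo useradd -m hadoop    # create the user with a home directory
sudo passwd hadoop        # set its password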
5. Set up SSH
First we need passwordless (RSA key based) login to each node, because Hadoop uses SSH to start its daemons. Using RSA keys avoids typing a password for every login.
Details about RSA: Wikipedia
ssh-keygen -t rsa
Here you can enter a passphrase to protect your private key, but I don't use one.
Then you will have two keys: a public key id_rsa.pub and a private key id_rsa.
Append the public key to authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
The private key lets us log in automatically to any machine whose authorized_keys contains the corresponding public key,
so we need to put this public key into every slave node's ~/.ssh/authorized_keys.
Then check it. On the master machine:
ssh (each slave)   # the first time you have to type "yes" to trust the host before the key is used to log in
The slave's hostname or IP will be recorded in ~/.ssh/known_hosts.
Because my home directory is on NFS, all nodes share the same public and private key, so nothing needs to be copied.
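If your home directory were not shared over NFS, you would have to copy the public key to every slave yourself; a small sketch (the hostnames are hypothetical):
for host in slave01 slave02 slave03; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub $host    # appends the key to the slave's ~/.ssh/authorized_keys
done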
6. Configure Hadoop
Before starting Hadoop we need to write the configuration.
All configuration files live in $HADOOP_HOME/etc/hadoop.
We need to edit core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, hadoop-env.sh, yarn-env.sh and slaves.
Here I just show my configuration; I will write another blog post about what each file means. Note that the XML properties below go inside each file's <configuration> element.
The SSD is mounted on /tmp, so I put all data directories under /tmp.
core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://(your namenode hostname or IP):(port number)</value>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/xtq/tmp</value>
</property>
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/tmp/xtq/namenode/</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/tmp/xtq/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
mapred-site.xml
cp mapred-site.xml.template mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>(your namenode hostname or IP):(port number)</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>/tmp/xtq/cluster-temp</value>
<final>true</final>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/tmp/xtq/cluster-local</value>
<final>true</final>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>10</value>
</property>
<property>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<value>100</value>
</property>
<property>
<name>mapreduce.tasktracker.reduce.tasks.maximum</name>
<value>100</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>(your namenode hostname or IP):(port number)</value>
</property>
<property>
<name>mapreduce.job.maps</name>
<value>96</value>
</property>
<property>
<name>mapreduce.job.reduces</name>
<value>96</value>
</property>
<property>
<name>mapreduce.jobtracker.jobhistory.location</name>
<value>/tmp/xtq/jobhistory</value>
</property>
yarn-site.xml
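Since the MapReduce settings above already live in mapred-site.xml, yarn-site.xml itself mainly needs the shuffle auxiliary service and the ResourceManager location. A minimal sketch for Hadoop 2.2.0 (replace the placeholder with your master's hostname or IP):
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>(your master hostname or IP)</value>
</property>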
hadoop-env.sh
Add these variables:
export HADOOP_HOME="(your hadoop dir)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
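Also make sure JAVA_HOME is set in hadoop-env.sh. The daemons are started over SSH in non-interactive shells that may not read ~/.profile, and without it they exit with an error saying JAVA_HOME is not set:
export JAVA_HOME=(enter your jdk dir)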
yarn-env.sh
Add these variables:
export HADOOP_HOME="(your hadoop dir)"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
slaves
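The slaves file simply lists the machines that should run a datanode and nodemanager, one hostname or IP per line, for example (hypothetical names):
slave01
slave02
slave03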
7. Start Hadoop
First format the namenode:
hdfs namenode -format
You can run the hdfs and hadoop commands from any directory if you added the environment variables as described above; otherwise cd into $HADOOP_HOME/bin and run:
./hdfs namenode -format
Then start the daemons:
start-all.sh
or
start-dfs.sh
start-yarn.sh
These scripts are in $HADOOP_HOME/sbin.
If everything goes well, you can see the YARN web UI at http://(master):8088.
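To check that all daemons started, jps (which ships with the JDK) should show the Hadoop processes on each machine:
jps    # master: NameNode, SecondaryNameNode, ResourceManager; slave: DataNode, NodeManager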
8. Test Hadoop
hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar teragen 1000000 /teragen
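To confirm that the job really produced output, and to put some load on the cluster, you can list the generated data and sort it with the terasort example from the same jar (the /teragen and /terasort paths are just examples):
hadoop fs -ls /teragen
hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar terasort /teragen /terasort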
9. Some useful commands
hdfs dfsadmin -report
This command prints information about every datanode.
hadoop fs -XXX
You can use Linux-like subcommands such as -ls, -rm and -du to manage files in HDFS.
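For example (the paths are just illustrations):
hadoop fs -ls /              # list the HDFS root directory
hadoop fs -du -s /teragen    # total size of the teragen output
hadoop fs -rm -r /teragen    # delete it recursively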