I recently did a fresh install of Hadoop 0.20 on my LAN, and it does differ from 0.19 in a few ways.
The 0.20 distribution no longer ships the hadoop-default.xml configuration file; it has been replaced by three files:
core-site.xml
mapred-site.xml
hdfs-site.xml
All three files are empty by default. In other words, the global default values are now hard-coded, and whatever you write in these files overrides those defaults.
Each configuration option must go into its corresponding file; putting it in the wrong one will not work.
The official English documentation for Hadoop 0.20 explains what goes where (note: the English docs. 0.20 also ships a Chinese translation, but its content is outdated; I read the Chinese docs first and it cost me quite a few detours). See: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
The relevant excerpt:
This section deals with important parameters to be specified in the following:
conf/core-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| fs.default.name | URI of NameNode. | hdfs://hostname/ |
conf/hdfs-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| dfs.name.dir | Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
| dfs.data.dir | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
conf/mapred-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| mapred.job.tracker | Host or IP and port of JobTracker. | host:port pair. |
| mapred.system.dir | Path on the HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. | This is in the default filesystem (HDFS) and must be accessible from both the server and client machines. |
| mapred.local.dir | Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written. | Multiple paths help spread disk i/o. |
| mapred.tasktracker.{map\|reduce}.tasks.maximum | The maximum number of Map/Reduce tasks, which are run simultaneously on a given TaskTracker, individually. | Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware. |
| dfs.hosts/dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable datanodes. |
| mapred.hosts/mapred.hosts.exclude | List of permitted/excluded TaskTrackers. | If necessary, use these files to control the list of allowable TaskTrackers. |
| mapred.queue.names | Comma separated list of queues to which jobs can be submitted. | The Map/Reduce system always supports at least one queue with the name as default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same. |
| mapred.acls.enabled | Specifies whether ACLs are supported for controlling job submission and administration. | If true, ACLs would be checked while submitting and administering jobs. ACLs can be specified using the configuration parameters of the form mapred.queue.queue-name.acl-name, defined below. |
| mapred.queue.queue-name.acl-submit-job | List of users and groups that can submit jobs to the specified queue-name. | The list of users and groups are both comma separated list of names. The two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. |
| mapred.queue.queue-name.acl-administer-job | List of users and groups that can change the priority or kill jobs that have been submitted to the specified queue-name. | The list of users and groups are both comma separated list of names. The two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. Note that an owner of a job can always change the priority or kill his/her own job, irrespective of the ACLs. |
Typically all the above parameters are marked as final to ensure that they cannot be overridden by user-applications.
On a LAN, machines are often named after their users, e.g. John-desktop, but in a distributed system we usually want a naming scheme like master, slave001, slave002 for the machines.
To get that, edit the /etc/hosts file on every machine and add each machine's desired name, e.g.:
192.168.1.10 John-desktop
192.168.1.10 master
192.168.1.11 Peter-desktop
192.168.1.11 slave001
and so on. Hadoop automatically picks up the current machine's name (via hostname), so if the hostname is not a name like master or slave001, network communication will fail.
Below are my configuration files.
Note: I have two machines. Master IP: 192.168.1.10; slave IP: 192.168.1.11.
core-site.xml:
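(The original file contents were not preserved here. The following is a minimal sketch for this two-machine setup based on the table above; the hostname master matches the /etc/hosts entries, but port 9000 is an assumption, not taken from the original post.)

```xml
<?xml version="1.0"?>
<configuration>
  <!-- fs.default.name: URI of the NameNode.
       "master" resolves to 192.168.1.10 via /etc/hosts;
       port 9000 is an assumed choice. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```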
hdfs-site.xml:
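(Again a sketch, not the original file. The local paths below are placeholders; use directories that actually exist on your machines. dfs.replication is a standard HDFS parameter not shown in the excerpt above; with only two machines a value of 2 or less makes sense.)

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the NameNode keeps the namespace and transaction logs.
       A comma-delimited list would replicate the name table for redundancy. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
  </property>
  <!-- Where each DataNode stores its blocks; multiple comma-separated
       paths spread data across devices. -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>
  <!-- Block replication factor; 2 assumed for a two-machine cluster. -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```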
mapred-site.xml:
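(A minimal sketch following the table above; port 9001 is an assumed convention for the JobTracker, not a value from the original post.)

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Host:port of the JobTracker. "master" comes from /etc/hosts;
       port 9001 is an assumed choice. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```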
One more thing:
In conf/hadoop-env.sh, the JAVA_HOME environment variable must point to the JDK path. Even if it is already set in .profile, set it here as well; otherwise you may sometimes get the error "JAVA_HOME is not set".
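For example, the line to uncomment and edit in conf/hadoop-env.sh looks like this (the JDK path below is only an illustration; substitute the actual location on your machines):

```shell
# conf/hadoop-env.sh
# Point JAVA_HOME at the JDK root; the path below is an example,
# not necessarily where your JDK lives.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```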