I recently did a fresh install of Hadoop 0.20 on my LAN, and it does differ from 0.19 in a few ways.
The 0.20 distribution no longer ships the hadoop-default.xml configuration file; it has been replaced by three files:
core-site.xml
mapred-site.xml
hdfs-site.xml
All three files are empty by default. In other words, the global default values are now hard-coded, and whatever you write in these files overrides those defaults.
Each configuration option must go into its corresponding file; putting it in the wrong one will not work.
The official English documentation for Hadoop 0.20 explains what goes where (note: the English docs. 0.20 also ships a Chinese translation, but its content is outdated; I read the Chinese docs first and it cost me quite a few detours). See: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
The relevant excerpt:
This section deals with important parameters to be specified in the following:
conf/core-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| fs.default.name | URI of NameNode. | hdfs://hostname/ |
conf/hdfs-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| dfs.name.dir | Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
| dfs.data.dir | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
conf/mapred-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| mapred.job.tracker | Host or IP and port of JobTracker. | host:port pair. |
| mapred.system.dir | Path on the HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. | This is in the default filesystem (HDFS) and must be accessible from both the server and client machines. |
| mapred.local.dir | Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written. | Multiple paths help spread disk i/o. |
| mapred.tasktracker.{map\|reduce}.tasks.maximum | The maximum number of Map/Reduce tasks, which are run simultaneously on a given TaskTracker, individually. | Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware. |
| dfs.hosts/dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable datanodes. |
| mapred.hosts/mapred.hosts.exclude | List of permitted/excluded TaskTrackers. | If necessary, use these files to control the list of allowable TaskTrackers. |
| mapred.queue.names | Comma separated list of queues to which jobs can be submitted. | The Map/Reduce system always supports at least one queue with the name as default. Hence, this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues. If such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property name mapred.job.queue.name in the job configuration. There could be a separate configuration file for configuring properties of these queues that is managed by the scheduler. Refer to the documentation of the scheduler for information on the same. |
| mapred.acls.enabled | Specifies whether ACLs are supported for controlling job submission and administration. | If true, ACLs would be checked while submitting and administering jobs. ACLs can be specified using the configuration parameters of the form mapred.queue.queue-name.acl-name, defined below. |
| mapred.queue.queue-name.acl-submit-job | List of users and groups that can submit jobs to the specified queue-name. | The list of users and groups are both comma separated list of names. The two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. |
| mapred.queue.queue-name.acl-administer-job | List of users and groups that can change the priority or kill jobs that have been submitted to the specified queue-name. | The list of users and groups are both comma separated list of names. The two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. Note that an owner of a job can always change the priority or kill his/her own job, irrespective of the ACLs. |
Typically all the above parameters are marked as final to ensure that they cannot be overridden by user-applications.
On a LAN, machines are often named after their users, e.g. John-desktop, but in a distributed system we usually want a naming scheme like master, slave001, slave002 for the machines.
To get that, edit the /etc/hosts file on every machine and add each machine's desired name, e.g.:
192.168.1.10 John-desktop
192.168.1.10 master
192.168.1.11 Peter-desktop
192.168.1.11 slave001
and so on. Hadoop automatically picks up the current machine's name (via hostname), so if the hostname is not a name like master or slave001, network communication will fail.
Below are my configuration files.
Note: I have two machines. Master IP: 192.168.1.10; slave IP: 192.168.1.11.
core-site.xml:
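(The original file contents were not preserved here. The following is a minimal sketch for this two-machine setup based on the table above; the hostname master matches the /etc/hosts entries, but port 9000 is an assumption, not taken from the original post.)

```xml
<?xml version="1.0"?>
<configuration>
  <!-- fs.default.name: URI of the NameNode.
       "master" resolves to 192.168.1.10 via /etc/hosts;
       port 9000 is an assumed choice. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```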
hdfs-site.xml:
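(Again a sketch, not the original file. The local paths below are placeholders; use directories that actually exist on your machines. dfs.replication is a standard HDFS parameter not shown in the excerpt above; with only two machines a value of 2 or less makes sense.)

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the NameNode keeps the namespace and transaction logs.
       A comma-delimited list would replicate the name table for redundancy. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
  </property>
  <!-- Where each DataNode stores its blocks; multiple comma-separated
       paths spread data across devices. -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>
  <!-- Block replication factor; 2 assumed for a two-machine cluster. -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```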
mapred-site.xml:
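(A minimal sketch following the table above; port 9001 is an assumed convention for the JobTracker, not a value from the original post.)

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Host:port of the JobTracker. "master" comes from /etc/hosts;
       port 9001 is an assumed choice. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```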
One more thing:
In conf/hadoop-env.sh, the JAVA_HOME environment variable must point to the JDK path. Even if it is already set in .profile, set it here as well; otherwise you may sometimes get the error "JAVA_HOME is not set".
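For example, the line to uncomment and edit in conf/hadoop-env.sh looks like this (the JDK path below is only an illustration; substitute the actual location on your machines):

```shell
# conf/hadoop-env.sh
# Point JAVA_HOME at the JDK root; the path below is an example,
# not necessarily where your JDK lives.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```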