The two core designs of the Hadoop framework are MapReduce and HDFS. The idea of MapReduce was popularized by a Google paper; summed up in one sentence, MapReduce is "decompose the task, then aggregate the results." HDFS is short for Hadoop Distributed File System, and it provides the underlying storage for distributed computation.
[Figure: how MapReduce works]
Hadoop has two core components: HDFS and MapReduce. The services associated with HDFS are NameNode, SecondaryNameNode, and DataNode; the services associated with MapReduce are JobTracker and TaskTracker.
A Hadoop cluster has two roles, master and slave, and the master role is further split into a primary master and a secondary master:
- the primary master runs the NameNode, SecondaryNameNode, and JobTracker services;
- the secondary master runs only the SecondaryNameNode service;
- every slave runs the DataNode and TaskTracker services.
Hadoop can be deployed in one of three modes:
- Local (Standalone) Mode (no cluster)
- Pseudo-Distributed Mode (single-node cluster)
- Fully-Distributed Mode (multi-node cluster)
Hadoop is started by running the following command on the primary master:
$HADOOP_HOME/bin/start-all.sh
During this call, Hadoop starts the following services in order:
- the NameNode service on the primary master;
- the SecondaryNameNode service on the primary master;
- the SecondaryNameNode service on the secondary master;
- the DataNode service on every slave;
- the JobTracker service on the primary master;
- the TaskTracker service on every slave.

The figure shows the three key roles in HDFS: NameNode, DataNode, and Client. The NameNode is the manager of the distributed file system: it maintains the file system namespace, cluster configuration information, and block replication. It keeps the file system metadata in memory, including file information, the blocks that make up each file, and the DataNodes on which each block resides. The DataNode is the basic unit of file storage: it stores blocks in its local file system, keeps the blocks' metadata, and periodically reports all of its blocks to the NameNode. The Client is any application that needs to access files in the distributed file system. Three operations illustrate how they interact.
Writing a file:
- The Client sends a write request to the NameNode.
- Based on the file size and block configuration, the NameNode returns information about the DataNodes it manages.
- The Client splits the file into blocks and, using the DataNode address information, writes them to the DataNodes in order.
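This write path can be exercised from the shell. A minimal sketch, assuming the cluster built later in this walkthrough is running; the `/demo` path is an example of my own choosing:

```shell
cd /usr/local/hadoop
# The client asks the NameNode where to write, splits the file into
# blocks, and streams each block to the DataNodes it was given.
bin/hadoop fs -mkdir /demo
bin/hadoop fs -put /etc/hosts /demo/hosts
# List the result; the file is now stored as replicated blocks.
bin/hadoop fs -ls /demo
```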
Reading a file:
- The Client sends a read request to the NameNode.
- The NameNode returns the DataNodes that store the file.
- The Client reads the file data from those DataNodes.
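The read path looks like this from the shell; a sketch assuming a file `/demo/hosts` (a hypothetical example path) already exists in HDFS:

```shell
cd /usr/local/hadoop
# The client asks the NameNode which DataNodes hold the blocks,
# then reads the data directly from those DataNodes.
bin/hadoop fs -cat /demo/hosts                  # stream to stdout
bin/hadoop fs -get /demo/hosts /tmp/hosts.copy  # copy to the local FS
```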
Block replication:
- The NameNode detects that some blocks are below the minimum replication factor, or that some DataNodes have failed.
- It instructs the DataNodes to replicate blocks among themselves.
- The DataNodes then copy the blocks directly to one another.
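Replication state can be inspected and changed from the shell; a sketch against this cluster, again using the hypothetical file `/demo/hosts`:

```shell
cd /usr/local/hadoop
# Show files, their blocks, and which DataNodes hold each replica:
bin/hadoop fsck / -files -blocks -locations
# Raise the replication factor of one file; the NameNode then schedules
# DataNode-to-DataNode copies until the new target is met (-w waits).
bin/hadoop fs -setrep -w 3 /demo/hosts
```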
A Hadoop deployment has one Master, which runs the NameNode and the JobTracker. The JobTracker's job is to launch, track, and schedule the tasks running on the slaves. There are also multiple Slaves; each slave usually runs a DataNode and a TaskTracker. The TaskTracker executes Map and Reduce tasks against local data according to the application's requirements.
This leads to the most important design principle in distributed computing: moving computation is cheaper than moving data. In distributed processing, the cost of moving data is always higher than the cost of moving computation. In other words, divide-and-conquer work requires the data to be divided and stored as well, so that local tasks process local data and the results are then aggregated; this is what keeps distributed computation efficient.
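This principle is visible in any standard MapReduce job: the JobTracker prefers to schedule each map task on a TaskTracker whose local DataNode already holds the input block. A sketch using the example jar shipped with Hadoop 1.0.1 (the version installed in this walkthrough; the input/output paths are my own choices):

```shell
cd /usr/local/hadoop
bin/hadoop fs -put conf /wc-input    # upload some text files as input
# Map tasks run where the input blocks live; reducers aggregate the counts.
bin/hadoop jar hadoop-examples-1.0.1.jar wordcount /wc-input /wc-output
bin/hadoop fs -cat /wc-output/part-r-00000
```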
chmod a+x jdk-6u21-linux-i586.bin
./jdk-6u21-linux-i586.bin
mv jdk1.6.0_21 /usr/local
cd /usr/local
ln -s jdk1.6.0_21 jdk    # makes future JDK upgrades easier
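Both the shell and Hadoop find the JDK through JAVA_HOME, so it is worth pointing it at the version-independent symlink created above. A sketch of the two places to set it, assuming the file layout used in this walkthrough:

```shell
# Append to /etc/profile (or ~/.bash_profile) so shells can find the JDK:
export JAVA_HOME=/usr/local/jdk
export PATH=$JAVA_HOME/bin:$PATH

# And in conf/hadoop-env.sh under the Hadoop install, set:
# export JAVA_HOME=/usr/local/jdk
```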
[root@qht02 conf]# cat masters
192.168.1.2
[root@qht02 conf]# cat slaves
192.168.1.3
192.168.1.4
192.168.1.5
192.168.1.6
[root@qht02 conf]# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.2:9000</value>
  </property>
</configuration>
[root@qht02 conf]# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
[root@qht02 conf]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.1.2:9001</value>
  </property>
</configuration>
scp -r /usr/local/hadoop root@192.168.1.3:/usr/local
scp -r /usr/local/hadoop root@192.168.1.4:/usr/local
scp -r /usr/local/hadoop root@192.168.1.5:/usr/local
scp -r /usr/local/hadoop root@192.168.1.6:/usr/local
scp -r /usr/local/jdk1.6.0_21/ root@192.168.1.3:/usr/local/
scp -r /usr/local/jdk1.6.0_21/ root@192.168.1.4:/usr/local/
scp -r /usr/local/jdk1.6.0_21/ root@192.168.1.5:/usr/local/
scp -r /usr/local/jdk1.6.0_21/ root@192.168.1.6:/usr/local/
cd /usr/local/
ln -s jdk1.6.0_21 jdk
ssh-keygen -t dsa -P '' -f /root/.ssh/id_dsa
cd /root/.ssh/
mv id_dsa.pub authorized_keys
scp authorized_keys root@192.168.1.3:/root/.ssh
scp authorized_keys root@192.168.1.4:/root/.ssh
scp authorized_keys root@192.168.1.5:/root/.ssh
scp authorized_keys root@192.168.1.6:/root/.ssh
Note: if a target host has no .ssh directory under /root, running `ssh 192.168.1.2` once from that host creates the .ssh directory automatically, after which the scp above works.
Then test each host:

ssh root@192.168.1.3
ssh root@192.168.1.4
ssh root@192.168.1.5
ssh root@192.168.1.6

If each login succeeds without a password prompt, passwordless SSH is set up correctly.
[root@qht02 hadoop]# bin/hadoop namenode -format
12/04/15 19:06:49 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = qht02/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012
************************************************************/
12/04/15 19:06:50 INFO util.GSet: VM type = 32-bit
12/04/15 19:06:50 INFO util.GSet: 2% max memory = 19.33375 MB
12/04/15 19:06:50 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/04/15 19:06:50 INFO util.GSet: recommended=4194304, actual=4194304
12/04/15 19:06:52 INFO namenode.FSNamesystem: fsOwner=root
12/04/15 19:06:52 INFO namenode.FSNamesystem: supergroup=supergroup
12/04/15 19:06:52 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/04/15 19:06:52 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/04/15 19:06:52 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/04/15 19:06:52 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/04/15 19:06:53 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/04/15 19:06:54 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
12/04/15 19:06:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at qht02/127.0.0.1
************************************************************/
The corresponding directories are now created under /tmp:

[root@qht03 name]# pwd
/tmp/hadoop-root/dfs/name
[root@qht03 name]# ls
current image
[root@qht03 name]#
[root@qht02 conf]# cd /usr/local/hadoop/
[root@qht02 hadoop]# bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-namenode-qht02.out
192.168.1.4: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-datanode-qht04.out
192.168.1.3: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-datanode-qht03.out
192.168.1.6: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-datanode-qht06.out
192.168.1.5: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-datanode-qht05.out
192.168.1.2: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-secondarynamenode-qht02.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-jobtracker-qht02.out
192.168.1.4: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-tasktracker-qht04.out
192.168.1.3: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-tasktracker-qht03.out
192.168.1.6: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-tasktracker-qht06.out
192.168.1.5: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-root-tasktracker-qht05.out
[root@qht02 hadoop]#

[root@qht02 conf]# jps
3478 NameNode
3729 JobTracker
4143 Jps
3634 SecondaryNameNode
[root@qht03 local]# jps
3496 TaskTracker
3414 DataNode
3633 Jps
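Besides jps, each daemon serves a status page on its default port (50070 for the NameNode and 50030 for the JobTracker in Hadoop 1.x), which gives another quick health check from any host:

```shell
# NameNode web UI (live/dead DataNodes, capacity):
curl -s http://192.168.1.2:50070/ | head
# JobTracker web UI (running and completed jobs):
curl -s http://192.168.1.2:50030/ | head
```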
[root@qht02 bin]# ./hadoop dfsadmin -report
Configured Capacity: 49972363264 (46.54 GB)
Present Capacity: 32953151518 (30.69 GB)
DFS Remaining: 32953036800 (30.69 GB)
DFS Used: 114718 (112.03 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 4 (4 total, 0 dead)

Name: 192.168.1.5:50010
Decommission Status : Normal
Configured Capacity: 12493090816 (11.64 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 4159823872 (3.87 GB)
DFS Remaining: 8333238272 (7.76 GB)
DFS Used%: 0%
DFS Remaining%: 66.7%
Last contact: Sun Apr 15 19:20:47 CST 2012

Name: 192.168.1.6:50010
Decommission Status : Normal
Configured Capacity: 12493090816 (11.64 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 4565094400 (4.25 GB)
DFS Remaining: 7927967744 (7.38 GB)
DFS Used%: 0%
DFS Remaining%: 63.46%
Last contact: Sun Apr 15 19:20:48 CST 2012

Name: 192.168.1.4:50010
Decommission Status : Normal
Configured Capacity: 12493090816 (11.64 GB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 4146950129 (3.86 GB)
DFS Remaining: 8346112000 (7.77 GB)
DFS Used%: 0%
DFS Remaining%: 66.81%
Last contact: Sun Apr 15 19:20:46 CST 2012

Name: 192.168.1.3:50010
Decommission Status : Normal
Configured Capacity: 12493090816 (11.64 GB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 4147343345 (3.86 GB)
DFS Remaining: 8345718784 (7.77 GB)
DFS Used%: 0%
DFS Remaining%: 66.8%
Last contact: Sun Apr 15 19:20:46 CST 2012