Hadoop Version and Virtual Machine
The Hadoop version used here is 2.9.2 and the JDK version is 1.8.
Hadoop 2.9.2: http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
JDK 1.8: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
The virtual machine runs Linux CentOS 7: https://www.centos.org/download/
The VM installation itself is skipped here; a minimal install is fine, and 1 GB of memory is enough,
though I allocated 4 GB to my VM.
For the network adapter: when building a cluster, all IPs have to be configured by hand, so a static IP is used here as well. Set the adapter to NAT under the custom options, then look up the IP address of the matching virtual NIC (VMnet8) and use it as the VM's gateway.
- Virtual machine configuration
1. Network configuration: in a minimal CentOS install the network interface is disabled by default, so its configuration file has to be edited:
vim /etc/sysconfig/network-scripts/ifcfg-ens33
* Change the BOOTPROTO field to static.
Add the following fields; the IP address must be on the same subnet as the gateway:
IPADDR=192.168.203.11
NETMASK=255.255.255.0
GATEWAY=192.168.203.2
DNS1=192.168.203.2
Reboot for the configuration to take effect.
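Instead of a full reboot, restarting the network service is usually enough on CentOS 7. A quick sketch, assuming the interface is ens33 and the gateway is the 192.168.203.2 configured above:

```shell
# Apply the new static-IP settings without a full reboot (CentOS 7).
systemctl restart network
# Verify the address came up on the interface...
ip addr show ens33
# ...and that the gateway is reachable.
ping -c 3 192.168.203.2
```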
2. Uploading the JDK and Hadoop files: I connected to the VM with Xshell and uploaded them from there; lrzsz must be installed on the VM before Xshell's file transfer works. Create a directory under the home directory to hold the final Hadoop files and the JDK (they can also be installed in separate locations).
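The upload step can be sketched as follows; the JDK archive name is an assumption based on the 1.8.0_191 build shown later, so adjust it to match your actual download:

```shell
# lrzsz provides the rz/sz commands that Xshell's file transfer relies on.
yum install -y lrzsz
# One directory to hold both the JDK and Hadoop, matching the paths used below.
mkdir -p /Hadoop
# After uploading with rz, unpack both archives and give them the short
# names referenced by the environment variables in /etc/profile.
tar -zxf hadoop-2.9.2.tar.gz -C /Hadoop
mv /Hadoop/hadoop-2.9.2 /Hadoop/hadoop
tar -zxf jdk-8u191-linux-x64.tar.gz -C /Hadoop   # archive name is an assumption
mv /Hadoop/jdk1.8.0_191 /Hadoop/jdk
```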
3. Environment variables:
Configure the Java and Hadoop environment variables in /etc/profile:
export JAVA_HOME=/Hadoop/jdk
export HADOOP_HOME=/Hadoop/hadoop
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
After saving and exiting, reboot or run source /etc/profile for the configuration to take effect.
Verify that the configuration works:
[root@hdp-01 Hadoop]# java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) Server VM (build 25.191-b12, mixed mode)
[root@hdp-01 Hadoop]# hadoop version
Hadoop 2.9.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 826afbeae31ca687bc2f8471dc841b66ed2c6704
Compiled by ajisaka on 2018-11-13T12:42Z
Compiled with protoc 2.5.0
From source with checksum 3a9939967262218aa556c684d107985
This command was run using /Hadoop/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar
[root@hdp-01 Hadoop]#
4. Hadoop configuration. The basic files to configure are hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-env.sh, mapred-site.xml, yarn-site.xml, and slaves. Before configuring them, you can give this machine an alias in /etc/hosts:
[root@hdp-01 hadoop]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.203.11 hdp-01
- hadoop-env.sh mainly needs to point at the JDK location:
export JAVA_HOME=/Hadoop/jdk
- core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<!-- Use HDFS as the default file system, running on this machine, i.e. hdp-01:9000 -->
<value>hdfs://hdp-01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<!-- Where Hadoop stores its data -->
<value>/Hadoop/hadoop/data</value>
</property>
</configuration>
- hdfs-site.xml
<configuration>
<property>
<!-- replication is the number of block replicas; one is enough on a single machine -->
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- mapred-site.xml: the file ships as mapred-site.xml.template and must be renamed to mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- slaves lists the DataNode hosts; on a single machine you can either name it explicitly or leave the default localhost.
- yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<!-- Run the ResourceManager on this machine -->
<value>hdp-01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Site specific YARN configuration properties -->
</configuration>
5. NameNode format
- Before formatting the NameNode, the required ports have to be reachable; since this is a single-machine setup, simply disabling the firewall is enough:
systemctl stop firewalld.service
systemctl disable firewalld.service
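If you would rather keep firewalld running, an alternative sketch is to open only the ports this setup uses; the port numbers below are the Hadoop 2.x web-UI defaults plus the 9000 configured in core-site.xml:

```shell
# Open only the Hadoop-related ports instead of disabling firewalld.
firewall-cmd --permanent --add-port=9000/tcp    # HDFS RPC (fs.defaultFS)
firewall-cmd --permanent --add-port=50070/tcp   # NameNode web UI
firewall-cmd --permanent --add-port=8088/tcp    # YARN ResourceManager web UI
firewall-cmd --reload
```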
Then run the command: hadoop namenode -format
Once that succeeds, start HDFS: start-dfs.sh
Then start YARN: start-yarn.sh
[root@hdp-01 ~]# start-dfs.sh
Java HotSpot(TM) Server VM warning: You have loaded library /Hadoop/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/03/20 07:43:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hdp-01]
hdp-01: starting namenode, logging to /Hadoop/hadoop/logs/hadoop-root-namenode-hdp-01.out
hdp-01: starting datanode, logging to /Hadoop/hadoop/logs/hadoop-root-datanode-hdp-01.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /Hadoop/hadoop/logs/hadoop-root-secondarynamenode-hdp-01.out
Java HotSpot(TM) Server VM warning: You have loaded library /Hadoop/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/03/20 07:44:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@hdp-01 ~]# jps
9719 NameNode
10173 Jps
9823 DataNode
10015 SecondaryNameNode
[root@hdp-01 ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /Hadoop/hadoop/logs/yarn-root-resourcemanager-hdp-01.out
hdp-01: starting nodemanager, logging to /Hadoop/hadoop/logs/yarn-root-nodemanager-hdp-01.out
[root@hdp-01 ~]# jps
10342 NodeManager
9719 NameNode
10234 ResourceManager
10379 Jps
9823 DataNode
10015 SecondaryNameNode
Access the node in a browser: the NameNode web UI listens on port 50070 by default (http://hdp-01:50070, or http://192.168.203.11:50070 from the host) and the YARN ResourceManager UI on port 8088.
- Testing the bundled WordCount program
Prepare a file, input.txt, to be read, and write some words into it:
HDFS consists of only one Name Node we call it as Master Node which can track the files, manage the file system and has the meta data and the whole data in it. To be particular Name node contains the details of the No. of blocks, Locations at what data node the data is stored and where the replications are stored and other details. As we have only one Name Node we call it as Single Point Failure. It has Direct connect with the clien
Then upload the file to the cluster:
# Create a directory on the cluster to hold the word file
[root@hdp-01 ~]# hadoop fs -mkdir /Test
Java HotSpot(TM) Server VM warning: You have loaded library /Hadoop/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/03/20 07:52:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@hdp-01 ~]#
# Upload the word file
[root@hdp-01 ~]# hadoop fs -put input.txt /Test/input.txt
Java HotSpot(TM) Server VM warning: You have loaded library /Hadoop/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/03/20 07:53:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@hdp-01 ~]#
At this point the uploaded file can be seen under Utilities in the browser.
The jar hadoop-mapreduce-examples-2.9.2.jar under /Hadoop/hadoop/share/hadoop/mapreduce contains the WordCount MapReduce job. Note that the output path /output must not exist beforehand, otherwise the job aborts with a directory-already-exists error.
[root@hdp-01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.9.2.jar wordcount /Test /output
Java HotSpot(TM) Server VM warning: You have loaded library /Hadoop/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/03/20 08:01:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/20 08:01:22 INFO client.RMProxy: Connecting to ResourceManager at hdp-01/192.168.203.11:8032
19/03/20 08:01:27 INFO input.FileInputFormat: Total input files to process : 1
19/03/20 08:01:28 INFO mapreduce.JobSubmitter: number of splits:1
19/03/20 08:01:29 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/03/20 08:01:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1553082282969_0001
19/03/20 08:01:36 INFO impl.YarnClientImpl: Submitted application application_1553082282969_0001
19/03/20 08:01:36 INFO mapreduce.Job: The url to track the job: http://hdp-01:8088/proxy/application_1553082282969_0001/
19/03/20 08:01:36 INFO mapreduce.Job: Running job: job_1553082282969_0001
19/03/20 08:02:25 INFO mapreduce.Job: Job job_1553082282969_0001 running in uber mode : false
19/03/20 08:02:25 INFO mapreduce.Job: map 0% reduce 0%
19/03/20 08:02:57 INFO mapreduce.Job: map 100% reduce 0%
19/03/20 08:03:09 INFO mapreduce.Job: map 100% reduce 100%
19/03/20 08:03:11 INFO mapreduce.Job: Job job_1553082282969_0001 completed successfully
19/03/20 08:03:12 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=647
FILE: Number of bytes written=397993
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=536
HDFS: Number of bytes written=421
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=28403
Total time spent by all reduces in occupied slots (ms)=10378
Total time spent by all map tasks (ms)=28403
Total time spent by all reduce tasks (ms)=10378
Total vcore-milliseconds taken by all map tasks=28403
Total vcore-milliseconds taken by all reduce tasks=10378
Total megabyte-milliseconds taken by all map tasks=29084672
Total megabyte-milliseconds taken by all reduce tasks=10627072
Map-Reduce Framework
Map input records=1
Map output records=85
Map output bytes=778
Map output materialized bytes=647
Input split bytes=98
Combine input records=85
Combine output records=55
Reduce input groups=55
Reduce shuffle bytes=647
Reduce input records=55
Reduce output records=55
Spilled Records=110
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=3263
CPU time spent (ms)=13280
Physical memory (bytes) snapshot=421761024
Virtual memory (bytes) snapshot=1386520576
Total committed heap usage (bytes)=295960576
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=438
File Output Format Counters
Bytes Written=421
The result can also be observed in the browser.
- Download the output to the local filesystem to view the result:
[root@hdp-01 mapreduce]# hadoop fs -get /output
Java HotSpot(TM) Server VM warning: You have loaded library /Hadoop/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
19/03/20 08:07:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@hdp-01 mapreduce]# ls
hadoop-mapreduce-client-app-2.9.2.jar hadoop-mapreduce-client-hs-2.9.2.jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar jdiff output
hadoop-mapreduce-client-common-2.9.2.jar hadoop-mapreduce-client-hs-plugins-2.9.2.jar hadoop-mapreduce-client-shuffle-2.9.2.jar lib sources
hadoop-mapreduce-client-core-2.9.2.jar hadoop-mapreduce-client-jobclient-2.9.2.jar hadoop-mapreduce-examples-2.9.2.jar lib-examples test.txt
[root@hdp-01 mapreduce]# cd output/
[root@hdp-01 output]# ls
part-r-00000 _SUCCESS
[root@hdp-01 output]# cat part-r-00000
As 1
Direct 1
Failure. 1
HDFS 1
It 1
Locations 1
Master 1
Name 3
No. 1
Node 3
Point 1
Single 1
To 1
and 4
are 1
as 2
at 1
be 1
blocks, 1
call 2
can 1
client 1
connect 1
consists 1
contains 1
data 4
details 1
details. 1
file 1
files, 1
has 2
have 1
in 1
is 1
it 2
it. 1
manage 1
meta 1
node 2
of 3
one 2
only 2
other 1
particular 1
replications 1
stored 2
system 1
the 9
track 1
we 3
what 1
where 1
which 1
whole 1
with 1
[root@hdp-01 output]#
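As a sanity check, the same kind of counts can be reproduced locally with standard shell tools. This is a rough single-machine equivalent of what the WordCount job computes (the sample input here is made up for illustration):

```shell
# Build a tiny sample file, then count word occurrences the way WordCount
# does: split on whitespace, group identical words, count each group.
printf 'Name Node Name\n' > sample.txt
tr -s '[:space:]' '\n' < sample.txt | grep -v '^$' | sort | uniq -c \
  | awk '{print $2"\t"$1}'
```

For the sample above this prints `Name` 2 and `Node` 1, one tab-separated pair per line, matching the format of part-r-00000.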
With that, the single-machine pseudo-distributed Hadoop cluster is set up successfully.