Setting Up and Using a Hadoop Cluster

季马宝宝

Article link: https://blog.youkuaiyun.com/qq_21043585/article/details/121876862

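This article walks through building a fully distributed, 3-node Hadoop cluster on Ubuntu: hostname configuration, passwordless SSH login, Java installation, downloading and configuring Hadoop, using the HDFS file system, developing MapReduce programs, and configuring and running YARN mode. By working through the experiment, readers can learn the basic operations of a Hadoop cluster and the workflow for processing big data with MapReduce.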

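1. Objectives

Build a fully distributed Hadoop cluster with 3 replicas
Get familiar with the MapReduce development environment
Write MapReduce programs to process big data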

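2. Introduction to Hadoop

Hadoop is a distributed-systems infrastructure developed by the Apache Software Foundation. It lets users write distributed programs without knowing the low-level details of distribution, and use the full power of a cluster for high-speed computation and storage.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is meant to be deployed on low-cost machines. It provides high-throughput access to data and suits applications with large data sets. HDFS relaxes some POSIX requirements to enable streaming access to file system data. It was originally built as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project.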

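3. Environment

The cluster runs on VMware virtual machines, each with 2 cores, 4 GB of memory, and a 20 GB disk.

Name	OS	Spec	Storage	Internal IP
master	Ubuntu 16.04	2 cores, 4 GB	20G	192.168.64.128
slave1	Ubuntu 16.04	2 cores, 4 GB	20G	192.168.64.129
slave2	Ubuntu 16.04	2 cores, 4 GB	20G	192.168.64.130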

4. Cluster setup

4.1 Change the hostname

sudo hostnamectl set-hostname master	# run on the master server
sudo hostnamectl set-hostname slave1	# run on the slave1 server
sudo hostnamectl set-hostname slave2	# run on the slave2 server

4.2 Edit the hosts file

Edit /etc/hosts on every machine so that hostnames can be used in place of IP addresses in later steps.

sudo gedit /etc/hosts
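The entries to add follow from the address table in section 3. A minimal sketch, assuming those three addresses still match your virtual machines; `hosts_entries` is a hypothetical helper name, not part of any tool:

```shell
# Print the host entries for the three nodes (addresses taken from the
# environment table; adjust them if your VMs use different IPs).
hosts_entries() {
  cat <<'EOF'
192.168.64.128 master
192.168.64.129 slave1
192.168.64.130 slave2
EOF
}

# Preview the lines. To apply them, append to /etc/hosts on every node:
#   hosts_entries | sudo tee -a /etc/hosts
hosts_entries
```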

Use ping to verify that the machines can reach one another by hostname:

ping slave1
ping slave2
ping master

No packets are lost, and all hosts can ping one another.

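4.3 Configure passwordless SSH login

Generate a key pair on each of the three machines with ssh-keygen and distribute the public key with ssh-copy-id, so that the machines can SSH to one another without a password. Running all three ssh-copy-id commands on every machine is more than strictly required, but master does need passwordless access to every host listed in the slaves file, itself included, because the start scripts launch the daemons over SSH.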

ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub ubuntu@master
ssh-copy-id -i .ssh/id_rsa.pub ubuntu@slave1
ssh-copy-id -i .ssh/id_rsa.pub ubuntu@slave2

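After configuration, use the ssh command to verify that login works without a password, for example ssh slave1.

Use exit to leave the remote session.

All three hosts were verified to log in to one another over SSH without a password.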

4.4 Install Java

Install Java 1.8 with sudo apt install openjdk-8-jdk-headless.

Verify the installation with java -version.

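5. Install Hadoop

Several later steps run into permission problems. As a shortcut, I made the root directory writable by everyone with the command below, so that Hadoop can create its directories under /data. The command is not recursive: it changes only the permissions of / itself. It is unsafe and for lab use only.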

sudo chmod 777 /

5.1 Download Hadoop

Download Hadoop and extract it:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
tar -zxvf hadoop-2.10.1.tar.gz

For convenience, I extracted Hadoop to the Desktop.

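5.2 Set environment variables for Hadoop

Edit ~/.bashrc with gedit ~/.bashrc and add the bin and sbin directories of the Hadoop folder to PATH.

gedit ~/.bashrc  # add the line: export PATH=$PATH:/home/jimazeyu/Desktop/hadoop-2.10.1/bin:/home/jimazeyu/Desktop/hadoop-2.10.1/sbin (running this export directly has the same effect, but only in the current terminal; once it is in .bashrc, every new terminal picks it up)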
source ~/.bashrc
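As a sketch, the lines appended to ~/.bashrc might look like the following. `HADOOP_DIR` is an arbitrary shell variable introduced here for readability (it is not a name Hadoop reads), and the path assumes Hadoop was extracted to the Desktop of the current user:

```shell
# Assumed location: Hadoop extracted to the current user's Desktop.
HADOOP_DIR="$HOME/Desktop/hadoop-2.10.1"
# bin holds the hadoop/hdfs commands, sbin holds the start/stop scripts.
export PATH="$PATH:$HADOOP_DIR/bin:$HADOOP_DIR/sbin"
```

After `source ~/.bashrc`, commands such as `hadoop` and `start-dfs.sh` can be run from any directory.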

5.3 Edit the Hadoop configuration files

Go to the configuration directory and edit the following files:

cd hadoop-2.10.1/etc/hadoop
gedit slaves
gedit core-site.xml
gedit hadoop-env.sh
gedit hdfs-site.xml

slaves

slave1
slave2
master

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/data/hadoop/tmp</value>
	</property>
</configuration>

hadoop-env.sh

Append one line at the end of the file: export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

hdfs-site.xml

dfs.namenode.secondary.http-address: the HTTP server address and port of the secondary NameNode.
dfs.namenode.name.dir: on the master node, the directory where the file system tree (fsImage) is stored.
dfs.datanode.data.dir: on all servers, the directory where data blocks are stored.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>master:50090</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>3</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/data/hadoop/name</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/data/hadoop/data</value>
	</property>
</configuration>
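The two configuration files point at /data/hadoop/tmp, /data/hadoop/name and /data/hadoop/data. One way to make those paths usable without loosening the permissions on / is to create them up front and hand them to the account that runs Hadoop. A sketch only: `prepare_dirs` is a hypothetical helper, and the ownership step in the comment is an assumption about how the cluster user is set up.

```shell
# prepare_dirs ROOT: create the tmp, name and data directories under ROOT.
prepare_dirs() {
  root=$1
  mkdir -p "$root/tmp" "$root/name" "$root/data"
}

# On each node, something like (sudo is needed once, for the parent directory):
#   sudo mkdir -p /data/hadoop && sudo chown "$USER" /data/hadoop
#   prepare_dirs /data/hadoop
```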

5.4 Verify the Hadoop installation

Run hadoop version to check that Hadoop is installed.

5.5 Distribute Hadoop

On master, run the following commands to copy the configured Hadoop to slave1 and slave2.

cd ~/Desktop
scp -r hadoop-2.10.1 jimazeyu@slave1:/home/jimazeyu/Desktop/hadoop-2.10.1
scp -r hadoop-2.10.1 jimazeyu@slave2:/home/jimazeyu/Desktop/hadoop-2.10.1
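With more nodes, the copy can be written as a loop. This is a sketch rather than part of the original procedure: `distribute` and `DRY_RUN` are hypothetical names, and the account and host names are the ones used in this article.

```shell
# distribute DIR TARGET...: copy DIR into the same parent directory on each
# TARGET (user@host). With DRY_RUN=1 the scp commands are only printed.
distribute() {
  dir=$1
  shift
  for target in "$@"; do
    dest="$target:$(dirname "$dir")/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp -r $dir $dest"
    else
      scp -r "$dir" "$dest"
    fi
  done
}

# Preview what would be copied from master to the two slaves.
DRY_RUN=1 distribute /home/jimazeyu/Desktop/hadoop-2.10.1 jimazeyu@slave1 jimazeyu@slave2
```

Dropping `DRY_RUN=1` performs the actual copy, which relies on the passwordless SSH login configured in section 4.3.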

Both slaves now have the configured Hadoop.

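6. Starting the cluster and HDFS

All of the following steps are performed on master.

6.1 Format the HDFS file system

Format HDFS with hdfs namenode -format.

6.2 Start the cluster

Start the cluster with start-dfs.sh.

6.3 Verify

Use jps to list the HDFS daemons. The NameNode and SecondaryNameNode processes run on master. In this 3-replica cluster, master and each slave also run one DataNode process. If all of them are present, the Hadoop cluster is configured correctly.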

master: NameNode, SecondaryNameNode, DataNode

slave1: DataNode

slave2: DataNode
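The same check can be scripted. The helper below is a hypothetical sketch (the name `check_daemons` is not part of Hadoop); it assumes `jps` prints one `<pid> <name>` pair per line and reports every expected daemon that is not listed.

```shell
# check_daemons NAME...: read `jps` output on stdin and print a line for
# each expected daemon that is missing. Returns non-zero if any is missing.
check_daemons() {
  names=$(awk '{print $2}')
  rc=0
  for expected in "$@"; do
    if ! printf '%s\n' "$names" | grep -qx "$expected"; then
      echo "missing: $expected"
      rc=1
    fi
  done
  return "$rc"
}

# On master:      jps | check_daemons NameNode SecondaryNameNode DataNode
# On each slave:  jps | check_daemons DataNode
```

Matching whole names with `grep -x` keeps SecondaryNameNode from being counted as NameNode.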

Run hdfs dfsadmin -report to see the total storage capacity of HDFS. With 20 GB of disk per machine, the total is about 60 GB.
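The arithmetic behind that figure, and what it implies once replication is taken into account, can be sketched as follows. The usable figure is a rough upper bound: it assumes every block is stored dfs.replication times and ignores the space taken by the operating system and other non-HDFS files.

```shell
nodes=3        # DataNodes: master, slave1, slave2
disk_gb=20     # disk assigned to each VM
replication=3  # dfs.replication from hdfs-site.xml

raw_gb=$((nodes * disk_gb))          # capacity summed over all DataNodes
usable_gb=$((raw_gb / replication))  # rough room for distinct data
echo "raw: ${raw_gb}G, usable at ${replication}x replication: about ${usable_gb}G"
```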

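7. Using the HDFS file system

7.1 Reading and writing HDFS from the command line

Create a simple text file with gedit test.txt and put any string in it, for example "hello,i'm jimazeyu"
Store it in the HDFS root directory with hadoop fs -put test.txt /
List the HDFS root directory with hadoop fs -ls /
View the file contents with hadoop fs -cat /test.txt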

8. Reading an HDFS file from a Java program

1. Edit the source file: gedit reader.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class reader {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // Get the file system described by the configuration (HDFS here)
            FileSystem fs = FileSystem.get(conf);
            // Open the HDFS file named by the first command-line argument
            FSDataInputStream in = fs.open(new Path(args[0]));
            // Read the file in 1 KB chunks and write each chunk to the screen
            byte[] buffer = new byte[1024];
            int len = in.read(buffer);
            while (len != -1) {
                System.out.write(buffer, 0, len);
                len = in.read(buffer);
            }
            System.out.flush();
            in.close(); // close the file
            fs.close(); // done with the HDFS file system
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. Compile, package, and run

javac -cp $(hadoop classpath) reader.java
jar cvf reader.jar reader.class
hadoop jar reader.jar reader /test.txt

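Notes:

Run the hadoop jar command in the directory that contains reader.jar.
The program reads an HDFS file: /test.txt in the HDFS root directory.
hadoop jar: tells the Hadoop platform to run a Java program
reader.jar: the program to run is packaged as a jar
reader: the class that contains the main function
/test.txt: the command-line argument of the Java program, that is, the HDFS file to read

After compiling and packaging, the directory contains reader.class and reader.jar.

Running the program prints the contents of /test.txt.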

9. Running MapReduce in local mode

9.1 Create a directory in HDFS for the wordcount input

hadoop fs -mkdir -p /wordcount/input

9.2 Put the data file in the input directory

Create a simple text file with one word per line, then run hadoop fs -put iloveyou.txt /wordcount/input to put it in the input directory.

9.3 Run the MapReduce program

Run the program in local (single-machine) mode on master. From the Hadoop root directory, run:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /wordcount/input /wordcount/output

The result is stored in HDFS. View it with hadoop fs -cat /wordcount/output/part-r-00000
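For a small input it can help to know what to expect before reading the job output. The pipeline below is a local approximation of what the wordcount example computes (whitespace-separated tokens, one `word<TAB>count` line per distinct word, sorted by word). It does not use Hadoop at all, and `local_wordcount` is a hypothetical helper name.

```shell
# local_wordcount: count whitespace-separated words read from stdin and
# print "word<TAB>count" lines in byte order, similar to part-r-00000.
local_wordcount() {
  tr -s ' \t' '\n' | awk 'NF' | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

# Example with inline data; for the real input file use:
#   local_wordcount < iloveyou.txt
printf 'i\nlove\nyou\nlove\n' | local_wordcount
```

If the local counts and the HDFS result disagree on a small file, the input uploaded to /wordcount/input is the first thing to check.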

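10. YARN mode configuration

10.1 Configure YARN and MapReduce

This prepares the three servers to process large data files in parallel. It is needed only once, when the Hadoop cluster is first installed; afterwards, start the cluster with start-all.sh before each use.

Configure YARN: gedit yarn-site.xml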

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

	<!-- Site specific YARN configuration properties -->
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>master</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>master:8032</value>
	</property>
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>master:8030</value>
	</property>
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>master:8031</value>
	</property>
	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>master:8033</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address</name>
		<value>master:8088</value>
	</property>
	<property>
		<name>yarn.log-aggregation.retain-seconds</name>
		<value>-1</value>
	</property>
</configuration>

Configure MapReduce: edit the MapReduce configuration file to switch MapReduce to YARN mode.

cd etc/hadoop/
cp mapred-site.xml.template mapred-site.xml
vim mapred-site.xml

Contents of mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

10.2 Restart the cluster

stop-all.sh
start-all.sh

Use jps to check the new processes: ResourceManager on the master node and NodeManager on the slave nodes.

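11. Running MapReduce jobs in YARN mode

11.1 Run the WordCount job in YARN mode

Clear the wordcount output directory with hadoop fs -rm -r /wordcount/output
Run the MapReduce program: hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /wordcount/input /wordcount/output

The result is the same as in local mode, but the job takes a little longer: the data set is small, so the cost of communication among the three hosts outweighs the benefit of running in parallel.

11.2 Compiling and running a MapReduce program

1. Compile

Put all .java files in one directory
With that directory as the current working directory, run javac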
```
javac -cp $(hadoop classpath) -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java
```
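Notes:
- -cp $(hadoop classpath) sets the compile-time classpath, telling javac where to find the classes imported by the .java files.
- -d . tells javac to place the generated .class files under the current directory; SalesMapper.java, SalesCountryReducer.java and SalesCountryDriver.java are the sources to compile.
- Because the sources contain the statement package SalesCountry;, javac creates a SalesCountry subdirectory under the current directory and puts all .class files there.
Create a Manifest.txt file that names the class containing the main function (still in the First_Hadoop_Program directory)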

Write one line in the file:
```
Main-Class: SalesCountry.SalesCountryDriver
```
Make sure the line ends with a newline.
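One way to create the file so that its only line is always newline-terminated is `printf`. This is a sketch; any editor works as long as the last line ends with a newline.

```shell
# Write the one-line manifest. The trailing \n matters: the jar tool may not
# parse a last line that is not newline-terminated.
printf 'Main-Class: SalesCountry.SalesCountryDriver\n' > Manifest.txt

# Show the file; the attribute must appear on a complete line.
cat Manifest.txt
```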
Create the jar file (still in the same directory)

jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class

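Notes:
- A jar is a compressed archive of a Java program's .class files.
- c creates a new jar; f specifies the jar name, ProductSalePerCountry.jar; m specifies the manifest to use, Manifest.txt.
- SalesCountry/*.class is all the classes of the Java program.
When compilation and packaging are complete, the directory contains ProductSalePerCountry.jar, Manifest.txt, and the SalesCountry subdirectory with the .class files.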

Upload the data file to process, SalesJan2009.csv, to the /inputMapReduce directory in HDFS
```
hadoop fs -mkdir -p /inputMapReduce
hadoop fs -put SalesJan2009.csv /inputMapReduce
```
Go to the directory that contains ProductSalePerCountry.jar (on the local file system of master) and run the program
```
hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales
```

The result is stored in HDFS. List the output directory, then view the part file whose name the listing shows:

hadoop fs -ls /mapreduce_output_sales
hadoop fs -cat /mapreduce_output_sales/part-00000

The program ran successfully, which concludes the experiment.
