In a fully distributed Hadoop deployment, the NameNode and each DataNode run on their own servers, and those servers together form the cluster. In a pseudo-distributed deployment, the NameNode and DataNode all run on a single machine.
If a company's data volume is modest (roughly one to two million rows), or there is no budget for a proper cluster, pseudo-distributed deployment is a workable choice.
Part 1: Installing Hadoop
System and runtime environment: a Linux system with the JDK installed.
Step 1: Download Hadoop and the JDK
You can download them directly on the Linux server, or download them elsewhere and upload them to the server via FTP.
Hadoop download URL: https://dlcdn.apache.org/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
Baidu Netdisk link: https://pan.baidu.com/s/1n8PvXQH1bltM7lEQePrKIg
Extraction code: 77oz
JDK download URL: https://download.oracle.com/otn/java/jdk/8u321-b07/df5ad55fdd604472a86a45a217032c7d/jdk-8u321-linux-x64.tar.gz
Baidu Netdisk link: https://pan.baidu.com/s/1W072PXamWljiotCB2d1CVw
Extraction code: wmiu
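If the server has Internet access, wget is the simplest way to fetch Hadoop (a sketch using the Apache mirror URL above; the Oracle JDK link normally requires a browser login, so the JDK is usually uploaded via FTP instead):
[abc@Hadoop_001 ~]$ wget https://dlcdn.apache.org/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz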
Step 2: Extract the tar.gz archives
//Switch to the root user first
[abc@Hadoop_001 ~]$ su - root
//Create the destination directories for the extraction
[root@Hadoop_001 ~]# mkdir /app
[root@Hadoop_001 ~]# mkdir /app/jdk
[root@Hadoop_001 ~]# mkdir /app/hadoop
//Extract the JDK
[root@Hadoop_001 ~]# tar xvf jdk-8u321-linux-x64.tar.gz -C /app/jdk
[root@Hadoop_001 ~]# ls /app/jdk
jdk1.8.0_321
//Configure the JDK environment variables
[root@Hadoop_001 ~]# vim /etc/profile
export JAVA_HOME=/app/jdk/jdk1.8.0_321
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
//Reload the profile, then verify that the JDK was installed successfully
[root@Hadoop_001 ~]# source /etc/profile
[root@Hadoop_001 ~]# java -version
java version "1.8.0_321"
Java(TM) SE Runtime Environment (build 1.8.0_321-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.321-b07, mixed mode)
//Extract Hadoop
[root@Hadoop_001 ~]# tar xvf hadoop-2.10.1.tar.gz -C /app/hadoop
//Set up passwordless SSH login. This is a pseudo-distributed setup with only one machine, so use the machine's own IP
//Generate the key pair first
[root@Hadoop_001 ~]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
29:33:8b:58:d9:53:bc:e9:21:29:3e:b1:25:9a:35:7d root@Hadoop_001
The key's randomart image is:
+--[ RSA 2048]----+
| |
| . |
| o |
| + o + |
| O @ E |
| B O X . |
| + = . . |
| . |
| |
+-----------------+
//Send the public key to the target server. There is only one server here, so send it to yourself.
[root@Hadoop_001 ~]# ssh-copy-id 192.168.181.21 //use the target server's IP here
The authenticity of host '192.168.181.21 (192.168.181.21)' can't be established.
RSA key fingerprint is a4:4f:00:28:80:20:b4:43:bc:1a:d0:20:7f:d5:5a:6c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.181.21' (RSA) to the list of known hosts.
root@192.168.181.21's password:
Now try logging into the machine, with "ssh '192.168.181.21'", and check in:
.ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
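A quick way to confirm the passwordless login works: run a command over SSH and check that no password prompt appears.
[root@Hadoop_001 ~]# ssh 192.168.181.21 hostname
Hadoop_001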
Step 3: Edit the configuration files
The configuration files are in the etc/hadoop/ folder under the extracted Hadoop directory. Editing core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and slaves is enough.
[root@Hadoop_001 ~]# cd /app/hadoop/hadoop-2.10.1/etc/hadoop
[root@Hadoop_001 hadoop]# ls
capacity-scheduler.xml hadoop-policy.xml kms-log4j.properties ssl-client.xml.example
configuration.xsl hdfs-site.xml kms-site.xml ssl-server.xml.example
container-executor.cfg httpfs-env.sh log4j.properties yarn-env.cmd
core-site.xml httpfs-log4j.properties mapred-env.cmd yarn-env.sh
hadoop-env.cmd httpfs-signature.secret mapred-env.sh yarn-site.xml
hadoop-env.sh httpfs-site.xml mapred-queues.xml.template
hadoop-metrics2.properties kms-acls.xml mapred-site.xml.template
hadoop-metrics.properties kms-env.sh slaves
First edit the environment file hadoop-env.sh.
Change the line "export JAVA_HOME=${JAVA_HOME}" in that file to "export JAVA_HOME=/app/jdk/jdk1.8.0_321/".
[root@Hadoop_001 hadoop]# vim hadoop-env.sh
export JAVA_HOME=/app/jdk/jdk1.8.0_321/
If you have changed the hostname, you need to either add a name resolution entry for it in /etc/hosts or change the hostname back to localhost.
//Add the hostname resolution entry
[root@Hadoop_001 bin]# vim /etc/hosts
127.0.0.1 Hadoop_001
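A quick check that the hostname now resolves locally (it should answer from 127.0.0.1):
[root@Hadoop_001 bin]# ping -c 1 Hadoop_001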
The core-site.xml file
Anything wrapped in "<!-- -->" in this file is a comment; it does not affect the configuration and can be ignored.
Change the empty "<configuration></configuration>" block to the following:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <!-- Hadoop's address and port. The loopback address is used here; change it to match your environment -->
        <value>hdfs://127.0.0.1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <!-- Location for temporary files. The Hadoop extraction directory is used here; change it to match your environment -->
        <value>/app/hadoop/hadoop-2.10.1/tmp</value>
    </property>
</configuration>
The hdfs-site.xml file
Change the empty "<configuration></configuration>" block to the following:
<configuration>
    <property>
        <name>dfs.replication</name>
        <!-- Number of replicas per block. A pseudo-distributed deployment has only one DataNode, so 3 replicas would be pointless; 1 is enough -->
        <value>1</value>
    </property>
</configuration>
The yarn-site.xml file
Change the empty "<configuration></configuration>" block to the following (note the property name is yarn.nodemanager.aux-services, with a hyphen):
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <!-- The loopback address is used here; change it to match your environment -->
        <value>127.0.0.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
The mapred-site.xml file
Only mapred-site.xml.template exists by default (see the ls output above), so first copy it to mapred-site.xml:
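[root@Hadoop_001 hadoop]# cp mapred-site.xml.template mapred-site.xml
Then write the following content into it: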
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <!-- Run MapReduce on YARN. The valid values are local, classic, and yarn; "localhost" is not one of them -->
        <value>yarn</value>
    </property>
</configuration>
The slaves file
[root@Hadoop_001 hadoop]# vim slaves
localhost
This file lists the worker machines.
Pseudo-distributed deployment: there is only one machine, so write localhost or 127.0.0.1 here.
Fully distributed deployment: write the addresses of all the other machines in this file, then copy the extracted JDK and Hadoop, /etc/profile, and the configuration files to every machine (the slaves file needs to be adjusted per machine), as sketched below.
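A minimal sketch of that copy step, assuming a hypothetical second node at 192.168.181.22 on which /app already exists:
//Copy the JDK, Hadoop, and the profile to the other node
[root@Hadoop_001 ~]# scp -r /app/jdk /app/hadoop 192.168.181.22:/app/
[root@Hadoop_001 ~]# scp /etc/profile 192.168.181.22:/etc/profile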
Step 4: Format the file system
All Hadoop commands are in the bin/ and sbin/ folders of the extracted package. (The hadoop namenode -format form still works but is deprecated in 2.x; ./bin/hdfs namenode -format is the current equivalent.) The format output should include a line saying the storage directory has been successfully formatted.
[root@Hadoop_001 hadoop]# cd /app/hadoop/hadoop-2.10.1
[root@Hadoop_001 hadoop-2.10.1]# ./bin/hadoop namenode -format
Step 5: Start Hadoop
The startup script is start-all.sh in the sbin/ folder of the extracted package. Answer "yes" to every prompt during startup.
Hadoop also has start-dfs.sh to start only HDFS and start-yarn.sh to start only YARN; as the deprecation notice below shows, the official recommendation is to start the components with those two scripts rather than start-all.sh.
[root@Hadoop_001 hadoop-2.10.1]# ./sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
22/03/22 06:35:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is a4:4f:00:28:80:20:b4:43:bc:1a:d0:20:7f:d5:5a:6c.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
localhost: starting namenode, logging to /app/hadoop/hadoop-2.10.1/logs/hadoop-root-namenode-Hadoop_001.out
localhost: starting datanode, logging to /app/hadoop/hadoop-2.10.1/logs/hadoop-root-datanode-Hadoop_001.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is a4:4f:00:28:80:20:b4:43:bc:1a:d0:20:7f:d5:5a:6c.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /app/hadoop/hadoop-2.10.1/logs/hadoop-root-secondarynamenode-Hadoop_001.out
22/03/22 06:37:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /app/hadoop/hadoop-2.10.1/logs/yarn-root-resourcemanager-Hadoop_001.out
localhost: starting nodemanager, logging to /app/hadoop/hadoop-2.10.1/logs/yarn-root-nodemanager-Hadoop_001.out
//Verify the startup succeeded
[root@Hadoop_001 hadoop-2.10.1]# jps //Java process listing command. Seeing all 5 daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) means Hadoop started successfully
7536 NameNode
8000 ResourceManager
8435 Jps
7844 SecondaryNameNode
7673 DataNode
8108 NodeManager
You can open the Hadoop (NameNode) web UI in a browser at http://<Hadoop host address>:50070.
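A quick check from the shell that the web UI is up (assuming curl is installed and the default 2.x NameNode UI port 50070):
[root@Hadoop_001 hadoop-2.10.1]# curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:50070
200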
If there are many servers, you only run the start command on one machine; the daemons on the other machines are started as well.
Part 2: Word count with Hadoop
First create a file and write some English words into it.
[root@Hadoop_001 hadoop-2.10.1]# touch 123.txt
[root@Hadoop_001 hadoop-2.10.1]# vim 123.txt
Hello World
Hello Earth
Hello Moon
bigdata Hadoop
JAVA java PHP
SQL
SQLserver
SQL Server
python python Go
HDFS operations use the same commands as ordinary Linux file operations; you just prefix them with "<Hadoop directory>/bin/hadoop dfs" or "<Hadoop directory>/bin/hdfs dfs". (The hadoop dfs form still works but is deprecated, as the output below shows; hdfs dfs is the current form.)
//Create a directory "/test/input" to hold the test data and a directory "/test/output" for the results
[root@Hadoop_001 hadoop-2.10.1]# /app/hadoop/hadoop-2.10.1/bin/hadoop dfs -mkdir /test/
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
22/03/23 07:30:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@Hadoop_001 hadoop-2.10.1]# /app/hadoop/hadoop-2.10.1/bin/hadoop dfs -mkdir /test/input/
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
22/03/23 07:30:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@Hadoop_001 hadoop-2.10.1]# /app/hadoop/hadoop-2.10.1/bin/hdfs dfs -mkdir /test/output/
22/03/23 07:31:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
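The transcript above stops after creating the directories. To actually run the word count, upload 123.txt and run the wordcount program from the examples jar that ships with the Hadoop 2.10.1 binary distribution. A minimal sketch (note that the job's output directory, /test/output/result here, must not exist beforehand; MapReduce refuses to overwrite an existing output directory):
//Upload the test file into the input directory
[root@Hadoop_001 hadoop-2.10.1]# ./bin/hdfs dfs -put 123.txt /test/input/
//Run the built-in wordcount example
[root@Hadoop_001 hadoop-2.10.1]# ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /test/input /test/output/result
//View the result; each output line is a word followed by its count
[root@Hadoop_001 hadoop-2.10.1]# ./bin/hdfs dfs -cat /test/output/result/part-r-00000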
Part 3: How Hadoop differs from other clusters
What is a cluster?
A cluster can be understood as a bunch of computers joined together, in contrast to a single node.
A single node is one standalone computer; if it fails, the service goes down.
Why clusters matter:
1. Failover (when one machine fails, another takes over immediately).
2. Load balancing (if one machine supports at most 10,000 concurrent visitors and 30,000 arrive at once, that machine will collapse; with 4 or more machines providing the same service, the 30,000 visitors are spread evenly across them, no machine reaches its limit, and everything keeps running).
High availability: the time the service runs normally divided by the total time. For example, 99.99% availability means that out of 10,000 hours, the service is down for 1 hour.
**A highly available cluster is not necessarily a high-performance cluster.** Individual operations may actually become slower. For example, suppose a company builds a cluster for its inventory system and a product currently has 10,000 units in stock. At almost the same moment, a customer on server A orders 5,000 units and a customer on server B orders 6,000. On a single node, a customer ordering 11,000 units would immediately be told the stock is insufficient and the order capped at 10,000. In the cluster, to keep the data consistent, server B's order has to wait until server A's order finishes; where the single node processes one request, the cluster processes two in sequence, so the cluster ends up slower.