What is Hadoop?
Hadoop is a distributed system infrastructure developed by the Apache Foundation.
It lets users develop distributed programs without knowing the low-level details of distribution, harnessing the power of a cluster for high-speed computation and storage.
Hadoop implements a distributed file system, the Hadoop Distributed File System, HDFS for short.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed in a streaming fashion.
The core of the Hadoop framework is the pair HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides the computation over it.
Which problems does Hadoop address?
- Massive amounts of data need timely analysis and processing
- Massive amounts of data need in-depth analysis and mining
- Data needs to be retained for a long time
Problems in storing massive amounts of data:
- Disk I/O, rather than CPU, becomes the bottleneck
- Network bandwidth is a scarce resource
- Hardware failure becomes a major threat to stability
HDFS uses a master/slave architecture.
Hadoop's three run modes:
1. Standalone (local) mode: no daemons are required; all programs run in a single JVM. Debugging MapReduce programs in this mode is very efficient and convenient, so it is mainly used while learning or during development. (A quick way to check the current mode follows this list.)
2. Pseudo-distributed mode: the Hadoop daemons run on the local machine, simulating a small cluster; in other words, a single machine is configured as a Hadoop cluster. Pseudo-distributed mode is a special case of fully distributed mode.
3. Fully distributed mode: the Hadoop daemons run on a real cluster.
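A quick way to tell which mode an installation is configured for (an optional check, assuming the hadoop symlink created in the sections below) is to query fs.defaultFS: standalone mode reports file:///, while pseudo- and fully distributed modes report an hdfs:// URI.
cd ~/hadoop
bin/hdfs getconf -confKey fs.defaultFS   # file:/// => standalone, hdfs://... => (pseudo-)distributed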
Main HDFS modules
1. NameNode:
Function: the management node of the whole file system. It maintains the file system's directory tree, the metadata of every file and directory, and the block list of each file, and it accepts client requests.
2. DataNode:
Function: the worker node; it stores the actual data blocks and serves read/write requests from clients. (The "standby image" role sometimes listed here belongs to the SecondaryNameNode, which merges the NameNode's fsimage and edit log as a checkpoint helper; it is not a hot standby and not a complete HA solution.) A command-line check of both roles follows.
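Once HDFS is actually running (as in the sections below), the NameNode's view of its DataNodes and the block list it keeps for each file can be inspected from the shell. This is an optional verification sketch, not one of the original steps; it assumes the commands are run from the Hadoop installation directory created later.
bin/hdfs dfsadmin -report                  # lists the live DataNodes with their capacity and usage
bin/hdfs fsck /user/hadoop -files -blocks  # shows, per file, the block list maintained by the NameNode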
I. Hadoop standalone test
1. Install Hadoop and create the hadoop user
[root@server1 ~]# useradd hadoop
[root@server1 ~]# echo redhat | passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
==========================================================
[root@server1 ~]# id hadoop
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
==========================================================
[root@server1 ~]# mv * ~hadoop/
[root@server1 ~]# su - hadoop
[hadoop@server1 ~]$ ls
hadoop-2.7.3.tar.gz jdk-7u79-linux-x64.tar.gz
hbase-1.2.4-bin.tar.gz zookeeper-3.4.9.tar.gz
[hadoop@server1 ~]$ tar zxf hadoop-2.7.3.tar.gz
[hadoop@server1 ~]$ tar zxf jdk-7u79-linux-x64.tar.gz
[hadoop@server1 ~]$ ln -s jdk1.7.0_79/ java
[hadoop@server1 ~]$ ln -s hadoop-2.7.3 hadoop
[hadoop@server1 ~]$ ls
hadoop hadoop-2.7.3.tar.gz java jdk-7u79-linux-x64.tar.gz
hadoop-2.7.3 hbase-1.2.4-bin.tar.gz jdk1.7.0_79 zookeeper-3.4.9.tar.gz
2. Configure environment variables
[hadoop@server1 ~]$ vim hadoop/etc/hadoop/hadoop-env.sh
========================================================
25 export JAVA_HOME=/home/hadoop/java
[hadoop@server1 ~]$ vim .bash_profile
=======================================
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/java/bin/
[hadoop@server1 ~]$ source .bash_profile
[hadoop@server1 ~]$ jps
2280 Jps
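Before running the example job it may be worth confirming that the shell really picks up the Java and Hadoop that were just unpacked. An optional check, using the paths set up above:
which java                    # should resolve to /home/hadoop/java/bin/java
java -version                 # should report 1.7.0_79
~/hadoop/bin/hadoop version   # should report Hadoop 2.7.3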
3. Test
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ mkdir input
[hadoop@server1 hadoop]$ cp etc/hadoop/*.xml input/
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
===================================================
[hadoop@server1 hadoop]$ ls input/
capacity-scheduler.xml hadoop-policy.xml httpfs-site.xml kms-site.xml
core-site.xml hdfs-site.xml kms-acls.xml yarn-site.xml
[hadoop@server1 hadoop]$ ls output/
part-r-00000 _SUCCESS
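In standalone mode the output directory lives on the local file system, so the matches produced by the grep example can be read directly (a quick check, using the output path from the run above):
cat output/*     # prints each matched string together with its count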
II. Pseudo-distributed mode
1. Edit the configuration files
[hadoop@server1 hadoop]$ vim etc/hadoop/core-site.xml
======================================================
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://172.25.70.1:9000</value>
</property>
</configuration>
[hadoop@server1 hadoop]$ vim etc/hadoop/hdfs-site.xml
========================================
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
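After editing hdfs-site.xml, the value HDFS will actually use can be double-checked with getconf (an optional check; it reads the configuration on the local node):
bin/hdfs getconf -confKey dfs.replication   # should print 1 for this pseudo-distributed setup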
2. Passwordless SSH
[hadoop@server1 hadoop]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
...
===============================================================
[hadoop@server1 ~]$ ssh-copy-id 172.25.70.1
[hadoop@server1 ~]$ ssh-copy-id localhost
[hadoop@server1 ~]$ ssh-copy-id server1
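start-dfs.sh logs in to every node over ssh, so it is worth verifying that the copied key really gives a passwordless login before continuing (an optional check):
ssh localhost hostname        # should print server1 without prompting for a password
ssh server1 true && echo ok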
3. Format the NameNode and start the services
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ bin/hdfs namenode -format
===================================================
...
19/05/24 02:12:16 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 353 bytes saved in 0 seconds.
19/05/24 02:12:16 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
19/05/24 02:12:16 INFO util.ExitUtil: Exiting with status 0
19/05/24 02:12:16 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at server1/172.25.70.1
************************************************************/
=======================================================================
[hadoop@server1 hadoop]$ sbin/start-dfs.sh
============================================
Starting namenodes on [server1]
server1: starting namenode, logging to /home/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-namenode-server1.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-datanode-server1.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 8f:49:b9:37:74:65:26:03:4e:73:fc:44:2d:8b:6e:83.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-secondarynamenode-server1.out
[hadoop@server1 hadoop]$ jps
3064 DataNode
3389 Jps
3242 SecondaryNameNode
2962 NameNode
Open http://172.25.70.1:50070/ in a browser (the NameNode web UI).
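If no browser is at hand, the web UI can also be probed from the shell; 50070 is the default NameNode HTTP port in Hadoop 2.x (an optional check):
curl -s -o /dev/null -w '%{http_code}\n' http://172.25.70.1:50070/   # 200 means the UI is reachable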
4. Test: create directories and upload
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server1 hadoop]$ dfs -ls
-bash: dfs: command not found
[hadoop@server1 hadoop]$ bin/hdfs dfs -ls
[hadoop@server1 hadoop]$ bin/hdfs dfs -put input
[hadoop@server1 hadoop]$ bin/hdfs dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2019-05-24 02:20 input
[hadoop@server1 hadoop]$
[hadoop@server1 hadoop]$ rm -fr input
[hadoop@server1 hadoop]$ rm -fr output
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output
==============================================================================
...
Merged Map outputs=8
GC time elapsed (ms)=618
Total committed heap usage (bytes)=1378369536
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=26007
File Output Format Counters
Bytes Written=9984
==============================================================================
[hadoop@server1 hadoop]$ bin/hdfs dfs -cat output/*
===================================================
"*" 18
"AS 8
"License"); 8
"alice,bob 18
"kerberos". 1
"simple" 1
'HTTP/' 1
'none' 1
'random' 1
'sasl' 1
'string' 1
'zookeeper' 2
...
=======================================================
[hadoop@server1 hadoop]$ bin/hdfs dfs -get output
[hadoop@server1 hadoop]$ ls output/
======================================
part-r-00000 _SUCCESS
[hadoop@server1 hadoop]$
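Note that MapReduce refuses to start if the output directory already exists in HDFS, so before re-running wordcount the previous result has to be removed first (a small cleanup sketch using the paths from above):
bin/hdfs dfs -rm -r output   # remove the HDFS output directory before re-running the job
rm -fr output                # remove the local copy fetched with -get, if it is in the way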
Check the result in the web UI.
III. Fully distributed mode
1. Reset the environment
[hadoop@server1 hadoop]$ sbin/stop-dfs.sh
Stopping namenodes on [server1]
server1: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
[hadoop@server1 hadoop]$ cd /tmp/
[hadoop@server1 tmp]$ ls
hadoop-hadoop Jetty_0_0_0_0_50090_secondary____y6aanv
hsperfdata_hadoop Jetty_localhost_38281_datanode____.rmss8j
Jetty_0_0_0_0_50070_hdfs____w2cu08
[hadoop@server1 tmp]$ rm -fr *
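The NameNode and DataNode data ended up under /tmp because hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}. If the data should survive a /tmp cleanup, a property can be added to core-site.xml; this is an optional tweak that is not part of the original run, and the /home/hadoop/data path is only an example:
mkdir -p /home/hadoop/data
# then add, inside <configuration> ... </configuration> of etc/hadoop/core-site.xml:
#   <property>
#     <name>hadoop.tmp.dir</name>
#     <value>/home/hadoop/data</value>
#   </property>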
2. Set up the new server2 and server3 nodes
- On server2 and server3:
[root@server2 ~]# useradd hadoop
[root@server2 ~]# echo redhat | passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
[root@server2 ~]# id hadoop
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
======================================================
[root@server3 ~]# useradd hadoop
[root@server3 ~]# echo redhat | passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
[root@server3 ~]# id hadoop
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
- On server1 through server3:
[root@server1 ~]# yum install -y nfs-utils
[root@server2 ~]# yum install -y nfs-utils
[root@server3 ~]# yum install -y nfs-utils
[root@server1 ~]# systemctl start rpcbind
[root@server2 ~]# systemctl start rpcbind
[root@server3 ~]# systemctl start rpcbind
3. Start and configure the NFS service on server1
[root@server1 ~]# systemctl start nfs-server
[root@server1 ~]# vim /etc/exports
===================================
/home/hadoop *(rw,anonuid=1000,anongid=1000)
============================================
[root@server1 ~]# exportfs -r
[root@server1 ~]# exportfs -rv
exporting *:/home/hadoop
[root@server1 ~]# showmount -e
Export list for server1:
/home/hadoop *
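Before mounting, the export can also be checked from the client side (an optional verification; it assumes rpcbind is already running on server2 as set up above):
showmount -e 172.25.70.1   # should list: /home/hadoop *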
4. Mount on server2 and server3
- On server2 and server3:
[root@server2 ~]# mount 172.25.70.1:/home/hadoop /home/hadoop
[root@server2 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 18855936 1097944 17757992 6% /
devtmpfs 239256 0 239256 0% /dev
tmpfs 250228 0 250228 0% /dev/shm
tmpfs 250228 8584 241644 4% /run
tmpfs 250228 0 250228 0% /sys/fs/cgroup
/dev/sda1 1038336 141504 896832 14% /boot
tmpfs 50048 0 50048 0% /run/user/0
172.25.70.1:/home/hadoop 18855936 2232320 16623616 12% /home/hadoop
[root@server2 ~]#
================================================================================
[root@server3 ~]# mount 172.25.70.1:/home/hadoop /home/hadoop
[root@server3 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 18855936 1097932 17758004 6% /
devtmpfs 239256 0 239256 0% /dev
tmpfs 250228 0 250228 0% /dev/shm
tmpfs 250228 8584 241644 4% /run
tmpfs 250228 0 250228 0% /sys/fs/cgroup
/dev/sda1 1038336 141504 896832 14% /boot
tmpfs 50048 0 50048 0% /run/user/0
172.25.70.1:/home/hadoop 18855936 2232320 16623616 12% /home/hadoop
[root@server3 ~]#
5. Edit the configuration again. Raise the replication factor and list the DataNodes; note that Hadoop 2.7.3 reads its DataNode list from etc/hadoop/slaves (the workers file name used below is the Hadoop 3.x convention), so with this version the same entries should also go into etc/hadoop/slaves.
[hadoop@server1 ~]$ vim hadoop/etc/hadoop/hdfs-site.xml
=======================================
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
===============================================
[hadoop@server1 ~]$ vim hadoop/etc/hadoop/workers
==================================================
172.25.70.2
172.25.70.3
=================================================
[root@server2 ~]# su - hadoop
[hadoop@server2 ~]$ cat hadoop/etc/hadoop/workers
172.25.70.2
172.25.70.3
[hadoop@server2 ~]$
=================================================
[hadoop@server3 ~]$ cat hadoop/etc/hadoop/workers
172.25.70.2
172.25.70.3
[hadoop@server3 ~]$
6. Reformat and restart the services
[hadoop@server1 hadoop]$ bin/hdfs namenode -format
[hadoop@server1 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [server1]
server1: starting namenode, logging to /home/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-namenode-server1.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-datanode-server1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.7.3/logs/hadoop-hadoop-secondarynamenode-server1.out
=================================================================
[hadoop@server2 ~]$ jps
12032 Jps
[hadoop@server2 ~]$
=========================
[hadoop@server3 ~]$ jps
12153 Jps
[hadoop@server3 ~]$
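Whether the DataNodes on server2 and server3 actually registered with the NameNode can be confirmed from server1 (an optional verification step that is not part of the original transcript):
bin/hdfs dfsadmin -report   # "Live datanodes (2)" means both workers have joined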
7. Test
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server1 hadoop]$ ls
bin include libexec logs output README.txt share
etc lib LICENSE.txt NOTICE.txt part-r-00000 sbin _SUCCESS
[hadoop@server1 hadoop]$ bin/hdfs dfs -put etc/hadoop input
8. Upload a large file
[hadoop@server1 hadoop]$ pwd
/home/hadoop/hadoop
[hadoop@server1 hadoop]$ dd if=/dev/zero of=bigfile bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 5.85916 s, 89.5 MB/s
[hadoop@server1 hadoop]$ bin/hdfs dfs -put bigfile
[hadoop@server1 hadoop]$
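With dfs.replication set to 2 and the default 128 MB block size, the 500 MB file should be split into 4 blocks, each stored on both DataNodes. Where each block landed can be checked with fsck (an optional check; the path assumes the default /user/hadoop home directory used above):
bin/hdfs dfs -ls -h bigfile
bin/hdfs fsck /user/hadoop/bigfile -files -blocks -locations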