If you are interested in big data, this article walks you through preparing and installing a big data environment. Every beginning is hard, so please read the tools-and-environment part patiently; it introduces the tool components this article uses, and at the end a complete set of installation packages and code resources is provided for your study.
Why big data feels so hard
The three roadblocks
Why does big data seem out of reach? The author sums it up in a few points:
- Complex environment setup: there are many components to prepare, and building an environment costs a lot of time and effort
- A sprawling ecosystem: it is hard to know what to pick and how a concrete project should use it
- Lack of hands-on analysis: most of what you find on code-hosting sites jumps straight into code, with little scenario analysis
Quitting right after getting started
Most of what people learn from official sites is principles and features; the more considerate ones attach a few simple examples, but tying all of that knowledge tightly together is hard. So people quit right after getting started, which is one reason the bar for advanced technologies is so high.
Giving up is easy, but sticking with it is definitely cool!
Not reading enough books
Books on big data nicely fill the gaps in the official introductions; only people who have actually practiced can distill these things into experience. There are plenty of experts in big data, but reading only the official sites is not enough. On one hand, the projects iterate so fast that we never settle down to study a stable release; on the other hand, the official material reads like marketing, full of how great the features and principles are but short on how to practice in depth. So we always feel that material is scarce, when really our eyes have been fixed on the official sites and on authority. Flip through a few big data books and you will always gain something, certainly more than opening the official site and seeing sales copy every time.
Reading refreshes your understanding; reading too little is where all the difficulty begins!
Basic environment preparation
VMware Workstation Pro 17
Notes on downloading
VMware Workstation Pro 17: some caveats are covered in the article below
VMware Workstation Pro 17 available for personal use - CSDN blog
Install the personal edition
During installation just choose personal use; no license key is required:
Personal edition cloud-drive download
For readers who cannot download it conveniently, the tool is shared via the WeChat subscription account 软件服务技术推进官: send the keyword VMware to get the VM tool.
Note: the shared folder also contains VMware Workstation 10, which is no longer recommended (it has performance issues).
CentOS Stream 9
Official download page:
Download: https://www.centos.org/download/
This installation uses CentOS Stream 9; according to the official site, CentOS Linux 7 and CentOS Stream 8 have reached end of life:
ISO download link. Since the image is around 10 GB it is not included in the shared resources; download it yourself: https://mirrors.tuna.tsinghua.edu.cn/centos-stream/9-stream/BaseOS/x86_64/iso/CentOS-Stream-9-latest-x86_64-dvd1.iso
Install one CentOS Stream 9 instance as the base virtual machine:
Enable SSH remote client login
Allow remote login on the system
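If root login over SSH is refused, a minimal sketch of the usual fix (assuming the default /etc/ssh/sshd_config location) is:
# allow root to log in over ssh: set PermitRootLogin yes in the config
vi /etc/ssh/sshd_config
# make sure sshd is enabled at boot and restart it so the change takes effect
systemctl enable --now sshd
systemctl restart sshd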
Verify login with an Xshell SSH client
Stop and disable the firewall
The example instance did not actually run the commands below; decide for yourself whether to disable the firewall:
# stop the firewall
systemctl stop firewalld.service
# disable the firewall at boot
systemctl disable firewalld.service
Note: in a production environment the firewall must not be turned off; configure firewall rules instead.
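If the firewall stays on, the required ports can be opened with firewall-cmd instead. A minimal sketch (the actual port list depends on which components you run):
# open the HDFS NameNode RPC and web UI ports used later in this article, then reload
firewall-cmd --permanent --add-port=9000/tcp
firewall-cmd --permanent --add-port=50070/tcp
firewall-cmd --reload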
Enable or disable SELinux
SELinux is the Linux kernel's security subsystem; it hardens the system through strict access control. In general it is recommended to keep SELinux enabled to restrict process privileges and prevent malicious programs from attacking the system through privilege escalation. However, because of its strict access control, some applications or services may fail to start, so in specific situations (development, debugging, and so on) it may be temporarily disabled.
Check the SELinux configuration:
[root@localhost ~]# cat /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
# See also:
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/using_selinux/changing-selinux-states-and-modes_using-selinux#changing-selinux-modes-at-boot-time_changing-selinux-states-and-modes
#
# NOTE: Up to RHEL 8 release included, SELINUX=disabled would also
# fully disable SELinux during boot. If you need a system with SELinux
# fully disabled instead of SELinux running with no policy loaded, you
# need to pass selinux=0 to the kernel command line. You can use grubby
# to persistently set the bootloader to boot with selinux=0:
#
# grubby --update-kernel ALL --args selinux=0
#
# To revert back to SELinux enabled:
#
# grubby --update-kernel ALL --remove-args selinux
#
SELINUX=enforcing
# SELINUXTYPE= can take one of these three values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
SELinux has three states:
- enforcing (policy is enforced)
- permissive (not enforced, but warnings are printed)
- disabled (turned off)
To disable it, change SELINUX=enforcing to SELINUX=disabled; here it is recommended to disable it for now.
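A minimal sketch of doing this from the command line (the config change takes full effect after a reboot):
# switch to permissive mode for the current session
setenforce 0
# persist the change in the config file
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# confirm the current mode
getenforce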
Base virtual machine instance download
This VM instance is very simple. If you need it, follow the WeChat subscription account 软件服务技术推进官 and send centos-stream-9 to get the VM instance (for the account and password see the bundled readme.txt).
Note: this VM instance is fairly clean, so you can build whatever environment you need on top of it.
Copying files between machines
Use the scp command.
Copy a file
scp local_file remote_username@remote_ip:remote_folder
or
scp local_file remote_username@remote_ip:remote_file
or
scp local_file remote_ip:remote_folder
or
scp local_file remote_ip:remote_file
Copy a directory
scp -r local_folder remote_username@remote_ip:remote_folder
or
scp -r local_folder remote_ip:remote_folder
RHEL subscription registration
Problem: yum cannot install rpm packages (you will hit this later when installing MySQL):
This system is not registered with an entitlement server. You can use "rhc" or "subscription-manager" to register.
Use subscription-manager register --help to see the registration command.
The subscription must be activated in the Hybrid Cloud Console.
Click Subscriptions active to activate:
[root@hadoop0 yum.repos.d]# subscription-manager register --username xxxxxxx@163.com --password xxxxxxxxxxxxx --auto-attach
Registering to: subscription.rhsm.redhat.com:443/subscription
The system has been registered with ID: 1b792ee7-71ba-489a-9862-2ab9a90ab235
The registered system name is: hadoop0
Ignoring the request to auto-attach. Attaching subscriptions is disabled for organization "18489547" because Simple Content Access (SCA) is enabled.
[root@hadoop0 yum.repos.d]#
Note: a Red Hat account is required; the registration above uses that account and password for activation.
Hadoop component installation
Hadoop cluster instance planning
One master and two workers, three instances
Plan three virtual machine server nodes as one master and two workers.
This can be done with the VM manager's clone feature.
Note: the instances have the firewall stopped and disabled at boot; how to do that is covered in the firewall section above.
Set the instance IP addresses
The IP addresses of the three nodes are:
- hadoop0: 192.168.158.100
- hadoop1: 192.168.158.101
- hadoop2: 192.168.158.102
The following uses hadoop0: 192.168.158.100 as the example; hadoop0 is the master node.
Open the instance's Activities > Settings:
In Settings, open the wired network settings:
Switch to IPv4, choose Manual, and configure Address, Netmask, and Gateway:
Note: the Netmask and Gateway can be looked up with the ifconfig command on the corresponding interface (ens33) as a reference. After the change, reboot the VM; after the reboot you can verify that the change took effect.
Verify the IP address by querying it:
Verify the IP address over SSH:
Do the same for the other two nodes.
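If you prefer the command line over the GUI, nmcli can do the same thing; a sketch for hadoop0, assuming the connection name is ens33 and the gateway is 192.168.158.2 (check yours with nmcli connection show and ip route):
nmcli connection modify ens33 ipv4.method manual ipv4.addresses 192.168.158.100/24 ipv4.gateway 192.168.158.2 ipv4.dns 8.8.8.8
nmcli connection up ens33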
Configure a static IP with outbound internet access
Enable outbound access for the three nodes; see my other article for the configuration:
CentOS Stream 9: static IP with outbound internet access - CSDN blog
Verify that SSH access works
192.168.158.100
192.168.158.101
192.168.158.102
Set the instance hostnames
Use the following command to change the hostname on the three nodes:
hostnamectl set-hostname ${realname}
Run the command on each node's SSH session:
# 192.168.158.100
hostnamectl set-hostname hadoop0
# 192.168.158.101
hostnamectl set-hostname hadoop1
# 192.168.158.102
hostnamectl set-hostname hadoop2
Verify the hostnames:
192.168.158.100
192.168.158.101
192.168.158.102
Shared hostname resolution across instances
On hadoop0, edit /etc/hosts with vi and append the following at the end of the file:
192.168.158.100 hadoop0
192.168.158.101 hadoop1
192.168.158.102 hadoop2
Send the file to hadoop1 and hadoop2 with scp:
scp /etc/hosts hadoop1:/etc/
scp /etc/hosts hadoop2:/etc/
At this point all three server instances are ready.
Enable passwordless SSH access
Hadoop components log in to each other over SSH. To avoid typing passwords, set up passwordless SSH login. It has to be configured on each instance, so run the following commands on all three virtual machines:
# go to the key directory
cd /root/.ssh
# remove old keys
rm -rf *
# generate a key pair (press Enter several times)
ssh-keygen -t rsa
# append the generated public key id_rsa.pub to authorized_keys
cat id_rsa.pub >>authorized_keys
Copy hadoop1's SSH public key to hadoop0 by running the following on hadoop1:
ssh-copy-id -i /root/.ssh/id_rsa.pub hadoop0
Copy hadoop2's SSH public key to hadoop0 by running the following on hadoop2:
ssh-copy-id -i /root/.ssh/id_rsa.pub hadoop0
Check the keys on hadoop0 with the following command:
cat /root/.ssh/authorized_keys
On hadoop0 you should now see three SSH public key entries:
Note: hadoop0 now has the most complete set of keys, so it needs to be copied to hadoop1 and hadoop2.
On hadoop0, use scp to copy it to hadoop1 and hadoop2:
scp /root/.ssh/authorized_keys hadoop1:/root/.ssh
scp /root/.ssh/authorized_keys hadoop2:/root/.ssh
Note: at this point all three machines can SSH to each other without a password (if a password is still requested, enter it once; it will not be asked for again).
Related references on passwordless SSH:
centos7: passwordless SSH setup and common errors - 平复心态 - cnblogs
Basics: id_rsa, id_rsa.pub and authorized_keys - CSDN blog
Note: the setup above only takes immediate effect towards hadoop0; the first access to hadoop1 and hadoop2 still asks for a password, after which it is no longer needed. You can run the following once on hadoop1 and hadoop2:
ssh ${host}
Enable passwordless SSH to the local machine
For the local machine use localhost; just SSH to it:
ssh localhost
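To confirm the whole mesh works, a quick check (a sketch) is to loop over the three hosts from each node; every hop should print the hostname without asking for a password:
for h in hadoop0 hadoop1 hadoop2; do ssh $h hostname; done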
JDK installation
The JDK is needed on all three instances; you can install it on one and copy it to the other two.
Download and unpack JDK 8
Download from the official site: Java Downloads | Oracle
Upload the JDK package to /usr/local with an FTP tool:
cd /usr/local
tar -zxvf jdk-8u421-linux-x64.tar.gz
mv jdk1.8.0_421 jdk
For now JDK 8 is used. More JDK versions are available from the WeChat subscription account 软件服务技术推进官 with the keyword jdk.
Set environment variables
Edit /etc/profile
##Java home
export JAVA_HOME=/usr/local/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
Apply the environment configuration
source /etc/profile
The command above takes effect immediately.
Verify the Java environment
Check that the Java environment is in effect:
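For example (the exact version string will differ):
java -version
echo $JAVA_HOME   # should print /usr/local/jdk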
Hadoop cluster installation
According to the plan, Hadoop is installed on all three instances; it can be distributed to the other machines with scp. Use hadoop0 as the prototype for the environment configuration.
Download and unpack Hadoop
Download the package manually:
Download link: https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
Or fetch it with wget into the target directory:
cd /usr/local
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar -zxvf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 hadoop
In the end we need the directory /usr/local/hadoop.
Set environment variables
Building on the JDK variables, edit /etc/profile and set the users of the Hadoop daemons to root:
# Hadoop home
export HADOOP_HOME=/usr/local/hadoop
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$PATH
Apply the environment configuration
source /etc/profile
The command above takes effect immediately.
Verify that the environment is set
Test with the hdfs command:
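For example, both of the following should print the 3.4.0 version banner if PATH is set correctly:
hadoop version
hdfs version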
Hadoop's five core configuration files
Configure hadoop-env.sh
This file sets the JDK path Hadoop uses; hadoop-env.sh is located in /usr/local/hadoop/etc/hadoop:
# find the line: JAVA_HOME=/usr/java/testing hdfs dfs -ls
JAVA_HOME=/usr/local/jdk
Note: the commented-out line must be enabled (remove the leading #).
Configure core-site.xml
Configure the HDFS default endpoint and Hadoop's temporary storage path; core-site.xml is located in /usr/local/hadoop/etc/hadoop:
<!-- address of the HDFS master (NameNode) -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop0:9000</value>
</property>
<!-- directory where Hadoop stores files generated at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<!-- the two settings below fix Hive connection problems -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
Configure yarn-site.xml
yarn-site.xml is located in /usr/local/hadoop/etc/hadoop:
<!-- cluster master -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop0</value>
</property>
<!-- auxiliary service that runs on the NodeManager -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
In addition, whitelist the environment variables that containers inherit, and optionally turn off the virtual-memory check:
<!-- environment variable inheritance -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- disable the memory check -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Edit mapred-site.xml
mapred-site.xml is located in /usr/local/hadoop/etc/hadoop:
mapreduce.framework.name takes one of local, classic, or yarn;
if it is not yarn, the YARN cluster will not be used for resource allocation.
<!-- run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Configure hdfs-site.xml
hdfs-site.xml is located in /usr/local/hadoop/etc/hadoop:
<!-- HDFS web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop0:50070</value>
</property>
<!-- number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- whether HDFS permission checking is enabled -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<!-- block size, default 128 MB -->
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
Edit the workers file
workers is located in /usr/local/hadoop/etc/hadoop; add the worker nodes, the first one being the master:
hadoop0
hadoop1
hadoop2
Note: the default value localhost must be removed. All six files above (the five configuration files plus workers) matter; do not miss any of them.
Verify Hadoop
Format HDFS
Before verifying Hadoop, the format step has to be run.
Run the format on hadoop0:
cd /usr/local/hadoop
hadoop namenode -format
The format log output looks like this:
STARTUP_MSG: build = git@github.com:apache/hadoop.git -r bd8b77f398f626bb7791783192ee7a5dfaeec760; compiled by 'root' on 2024-03-04T06:35Z
STARTUP_MSG: java = 1.8.0_421
************************************************************/
2024-09-19 21:16:25,694 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2024-09-19 21:16:26,036 INFO namenode.NameNode: createNameNode [-format]
2024-09-19 21:16:28,664 INFO namenode.NameNode: Formatting using clusterid: CID-75ba7cfd-697f-4916-aa76-3a1992c8c183
2024-09-19 21:16:28,848 INFO namenode.FSEditLog: Edit logging is async:true
2024-09-19 21:16:28,975 INFO namenode.FSNamesystem: KeyProvider: null
2024-09-19 21:16:28,984 INFO namenode.FSNamesystem: fsLock is fair: true
2024-09-19 21:16:28,985 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2024-09-19 21:16:29,033 INFO namenode.FSNamesystem: fsOwner = root (auth:SIMPLE)
2024-09-19 21:16:29,034 INFO namenode.FSNamesystem: supergroup = supergroup
2024-09-19 21:16:29,034 INFO namenode.FSNamesystem: isPermissionEnabled = false
2024-09-19 21:16:29,035 INFO namenode.FSNamesystem: isStoragePolicyEnabled = true
2024-09-19 21:16:29,035 INFO namenode.FSNamesystem: HA Enabled: false
2024-09-19 21:16:29,238 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2024-09-19 21:16:29,647 INFO blockmanagement.DatanodeManager: Slow peers collection thread shutdown
2024-09-19 21:16:29,719 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit : configured=1000, counted=60, effected=1000
2024-09-19 21:16:29,724 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2024-09-19 21:16:29,737 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2024-09-19 21:16:29,738 INFO blockmanagement.BlockManager: The block deletion will start around 2024 Sep 19 21:16:29
2024-09-19 21:16:29,745 INFO util.GSet: Computing capacity for map BlocksMap
2024-09-19 21:16:29,747 INFO util.GSet: VM type = 64-bit
2024-09-19 21:16:29,764 INFO util.GSet: 2.0% max memory 177.9 MB = 3.6 MB
2024-09-19 21:16:29,765 INFO util.GSet: capacity = 2^19 = 524288 entries
2024-09-19 21:16:29,806 INFO blockmanagement.BlockManager: Storage policy satisfier is disabled
2024-09-19 21:16:29,809 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2024-09-19 21:16:29,832 INFO blockmanagement.BlockManagerSafeMode: Using 1000 as SafeModeMonitor Interval
2024-09-19 21:16:29,834 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.999
2024-09-19 21:16:29,835 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2024-09-19 21:16:29,836 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2024-09-19 21:16:29,837 INFO blockmanagement.BlockManager: defaultReplication = 3
2024-09-19 21:16:29,838 INFO blockmanagement.BlockManager: maxReplication = 512
2024-09-19 21:16:29,839 INFO blockmanagement.BlockManager: minReplication = 1
2024-09-19 21:16:29,839 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
2024-09-19 21:16:29,840 INFO blockmanagement.BlockManager: redundancyRecheckInterval = 3000ms
2024-09-19 21:16:29,840 INFO blockmanagement.BlockManager: encryptDataTransfer = false
2024-09-19 21:16:29,840 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2024-09-19 21:16:29,981 INFO namenode.FSDirectory: GLOBAL serial map: bits=29 maxEntries=536870911
2024-09-19 21:16:29,982 INFO namenode.FSDirectory: USER serial map: bits=24 maxEntries=16777215
2024-09-19 21:16:29,983 INFO namenode.FSDirectory: GROUP serial map: bits=24 maxEntries=16777215
2024-09-19 21:16:29,983 INFO namenode.FSDirectory: XATTR serial map: bits=24 maxEntries=16777215
2024-09-19 21:16:30,044 INFO util.GSet: Computing capacity for map INodeMap
2024-09-19 21:16:30,046 INFO util.GSet: VM type = 64-bit
2024-09-19 21:16:30,055 INFO util.GSet: 1.0% max memory 177.9 MB = 1.8 MB
2024-09-19 21:16:30,055 INFO util.GSet: capacity = 2^18 = 262144 entries
2024-09-19 21:16:30,058 INFO namenode.FSDirectory: ACLs enabled? true
2024-09-19 21:16:30,059 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2024-09-19 21:16:30,060 INFO namenode.FSDirectory: XAttrs enabled? true
2024-09-19 21:16:30,062 INFO namenode.NameNode: Caching file names occurring more than 10 times
2024-09-19 21:16:30,090 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotFSLimit: 65536, maxSnapshotLimit: 65536
2024-09-19 21:16:30,097 INFO snapshot.SnapshotManager: dfs.namenode.snapshot.deletion.ordered = false
2024-09-19 21:16:30,110 INFO snapshot.SnapshotManager: SkipList is disabled
2024-09-19 21:16:30,141 INFO util.GSet: Computing capacity for map cachedBlocks
2024-09-19 21:16:30,143 INFO util.GSet: VM type = 64-bit
2024-09-19 21:16:30,145 INFO util.GSet: 0.25% max memory 177.9 MB = 455.4 KB
2024-09-19 21:16:30,147 INFO util.GSet: capacity = 2^16 = 65536 entries
2024-09-19 21:16:30,211 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2024-09-19 21:16:30,212 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2024-09-19 21:16:30,213 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2024-09-19 21:16:30,231 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2024-09-19 21:16:30,232 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2024-09-19 21:16:30,249 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2024-09-19 21:16:30,250 INFO util.GSet: VM type = 64-bit
2024-09-19 21:16:30,251 INFO util.GSet: 0.029999999329447746% max memory 177.9 MB = 54.6 KB
2024-09-19 21:16:30,253 INFO util.GSet: capacity = 2^13 = 8192 entries
2024-09-19 21:16:30,350 INFO namenode.FSImage: Allocated new BlockPoolId: BP-45774143-192.168.158.100-1726751790335
2024-09-19 21:16:30,392 INFO common.Storage: Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.
2024-09-19 21:16:30,530 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2024-09-19 21:16:30,844 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 399 bytes saved in 0 seconds .
2024-09-19 21:16:30,926 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2024-09-19 21:16:30,951 INFO blockmanagement.DatanodeManager: Slow peers collection thread shutdown
2024-09-19 21:16:30,961 INFO namenode.FSNamesystem: Stopping services started for active state
2024-09-19 21:16:30,963 INFO namenode.FSNamesystem: Stopping services started for standby state
2024-09-19 21:16:31,002 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2024-09-19 21:16:31,003 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop0/192.168.158.100
************************************************************/
Starting and stopping Hadoop
Start and stop from the master node:
cd /usr/local/hadoop/sbin
# start
./start-all.sh
# stop
./stop-all.sh
The executable .sh scripts live under sbin.
Output of the start command:
Output of the stop command:
Checking the Hadoop processes
Use the jps command on each of the three nodes:
master hadoop0:
hadoop1:
hadoop2:
The output above matches expectations.
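With the workers file above, the expected process layout is roughly the following (a sketch; process IDs will differ, and the SecondaryNameNode placement depends on defaults):
jps
# hadoop0 (master, also a worker): NameNode, SecondaryNameNode, ResourceManager, DataNode, NodeManager
# hadoop1 / hadoop2: DataNode, NodeManager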
Browsing HDFS files
Use the following command:
hadoop fs -ls /
Nothing is listed yet because no files exist, but HDFS is now ready for uploads and downloads.
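A quick smoke test (a sketch, using an arbitrary local file):
# create a directory, upload a file, then list it
hadoop fs -mkdir /test
hadoop fs -put /etc/hosts /test/
hadoop fs -ls /test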
Accessing Hadoop in a browser
Open http://hadoop0:50070/ in a browser.
We configured the web address hadoop0:50070 in hdfs-site.xml, so it can now be opened in a browser.
There are 3 live nodes:
Their details are as follows:
At this point the Hadoop cluster is up.
How to scale out?
From the cluster installation we know that adding a node only requires copying the JDK and Hadoop directories to the new machine, setting the environment variables and making them take effect, and adding the new node's hostname (hadoop3, hadoop4, ... hadoopX); that is how the cluster is expanded. See the sketch below.
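A rough sketch for a hypothetical new node hadoop3, following the layout used above (host name and paths are assumptions):
# on hadoop0: push the prepared directories and configuration to the new node
scp -r /usr/local/jdk /usr/local/hadoop hadoop3:/usr/local
scp /etc/profile /etc/hosts hadoop3:/etc/
# on hadoop0: register the new worker
echo hadoop3 >> /usr/local/hadoop/etc/hadoop/workers
# on hadoop3: apply the environment and start the worker daemons
source /etc/profile
hdfs --daemon start datanode
yarn --daemon start nodemanager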
Detailed steps will be added when this comes up in actual practice.
HBase cluster installation
Use Apache HBase® when you need random, real-time read/write access to big data. The project's goal is to host very large tables (billions of rows by millions of columns) on clusters of commodity hardware. Apache HBase® is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data, by Chang et al. Just as Bigtable builds on the distributed storage provided by the Google File System, Apache HBase® provides Bigtable-like capabilities on top of Hadoop and HDFS.
HBase installation assumes Hadoop is installed and running. As before, HBase is installed on hadoop0 and then copied to hadoop1 and hadoop2. The steps are simpler than Hadoop's, but the key ones remain: download and unpack the package, configure environment variables, configure the HBase parameters, and so on.
Download and unpack HBase
Official download page: Apache HBase – Apache HBase Downloads
This article uses 2.6.0; download and unpack it from the command line:
cd /usr/local
wget https://dlcdn.apache.org/hbase/2.6.0/hbase-2.6.0-bin.tar.gz
tar -zxvf hbase-2.6.0-bin.tar.gz
mv hbase-2.6.0 hbase
Configure environment variables
Edit /etc/profile and add HBASE_HOME to the existing configuration:
# HBase
export HBASE_HOME=/usr/local/hbase
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$PATH
Apply the environment configuration
source /etc/profile
The command above takes effect immediately.
Configure the HBase parameters
Configure hbase-env.sh
hbase-env.sh is located in /usr/local/hbase/conf:
export JAVA_HOME=/usr/local/jdk/
export HBASE_MANAGES_ZK=true
After editing, copy the file into /usr/local/hbase/bin as well.
Configure hbase-site.xml
hbase-site.xml is located in /usr/local/hbase/conf:
<!-- ****** core setting, required; the port must match the HDFS defaultFS port ******** -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop0:9000/hbase</value>
</property>
<!-- HBase run mode: false is standalone, true is distributed; if false, HBase and ZooKeeper run in the same JVM -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!-- ZooKeeper nodes; the count must be odd -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop0,hadoop1,hadoop2</value>
</property>
<!-- ZooKeeper data directory -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper</value>
</property>
<!-- HBase web UI port, default 16010 -->
<property>
<name>hbase.master.info.port</name>
<value>16010</value>
</property>
After editing, copy the file into /usr/local/hbase/bin as well.
Configure regionservers
regionservers is located in /usr/local/hbase/conf; change its contents to the following nodes:
hadoop0
hadoop1
hadoop2
Note: the default value localhost must be removed.
Copy the configuration to the other nodes
Copy the hbase directory to hadoop1 and hadoop2.
# copy the hbase directory
scp -r /usr/local/hbase hadoop1:/usr/local
scp -r /usr/local/hbase hadoop2:/usr/local
# copy the environment variable file
scp /etc/profile hadoop1:/etc/
scp /etc/profile hadoop2:/etc/
After the commands above, make the environment variables take effect on hadoop1 and hadoop2:
source /etc/profile
The command above takes effect immediately.
Verify HBase
The Hadoop cluster must be started first.
Starting and stopping HBase
Start HBase on the master node hadoop0: go into HBASE_HOME/bin and run ./start-hbase.sh
The corresponding stop command is ./stop-hbase.sh.
Starting the shell
Enter the shell with the hbase shell command:
hbase shell
List all tables with the list command:
hbase> list
If list returns an error, HDFS safe mode may need to be left.
Command to leave HDFS safe mode:
# leave hdfs safe mode
hdfs dfsadmin -safemode leave
Note: after leaving safe mode, restart the HBase cluster.
To exit the shell, type exit.
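Beyond list, a short smoke test in the shell looks like this (a sketch; the table and column family names are arbitrary):
hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> scan 'test'
hbase> disable 'test'
hbase> drop 'test'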
Web access in a browser
Open http://hadoop0:16010/ in a browser.
You can see the information of the three servers in the cluster.
The /hbase directory is created automatically
After installation HBase automatically creates the /hbase directory in HDFS.
Check the service processes
With jps you should see three H-prefixed processes on hadoop0, while the other two nodes each have two H-prefixed processes.
hadoop0
hadoop1
hadoop2
The checks above verify whether the cluster was installed successfully.
Hive cluster installation
Apache Hive is a distributed, fault-tolerant data warehouse system that supports analytics at scale. The Hive Metastore (HMS) provides a central repository of metadata that can easily be analyzed to make informed, data-driven decisions, which makes it a key component of many data lake architectures. Hive is built on top of Apache Hadoop and supports storage such as S3, ADLS, and GS as well as HDFS. Hive lets users read, write, and manage petabytes of data using SQL.
Hive installation assumes Hadoop is installed and running. Installing Hive is much simpler: download and unpack the package and configure environment variables. For this cluster, Hive only needs to be installed on hadoop0.
Download and unpack Hive
Official download: Apache Download Mirrors
We use Hive 4.0.0 as the example:
Download link:
Index of /hive/hive-4.0.0
Command line steps:
cd /usr/local
wget https://dlcdn.apache.org/hive/hive-4.0.0/apache-hive-4.0.0-bin.tar.gz
tar -zxvf apache-hive-4.0.0-bin.tar.gz
mv apache-hive-4.0.0-bin hive
Configure Hive environment variables
Edit /etc/profile
# Hive
export HIVE_HOME=/usr/local/hive
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$PATH
Apply the environment configuration
source /etc/profile
The command above takes effect immediately.
Entering the beeline command line
Typing the hive command drops you into the beeline command line:
MySQL installation
By default Hive uses the Derby database as the engine that stores Hive's metadata, but Derby only allows one open session at a time and does not support concurrent multi-user access. So install MySQL and switch Hive's metadata store to MySQL. MySQL is installed online, so the machine needs working DNS.
Download the MySQL 8 repository package
cd /usr/local
wget https://dev.mysql.com/get/mysql80-community-release-el9-3.noarch.rpm
Install the rpm package
rpm -ivh mysql80-community-release-el9-3.noarch.rpm
Check whether the MySQL repository was installed successfully
yum repolist enabled | grep "mysql.*-community.*"
Install MySQL
dnf install mysql-community-server -y --nogpgcheck
Start MySQL and enable it at boot
Run the following commands:
systemctl enable --now mysqld.service # enable at boot and start MySQL immediately
systemctl status mysqld.service # check MySQL status
Change the local root login password
1) Look up the temporary MySQL password
grep 'temporary password' /var/log/mysqld.log
2) Connect to MySQL with the temporary password found in the log above
mysql -uroot -p
Log in with the default password found above, then change it:
# the password has to be reasonably complex or it will fail the validation policy
ALTER USER 'root'@'localhost' IDENTIFIED BY 'Boonya@2024';
Enable remote access (run against the mysql system database, then flush privileges):
use mysql;
update user set host='%' where user='root';
flush privileges;
Create the MySQL hive user
Syntax: CREATE USER 'username'@'host' IDENTIFIED BY 'password'
CREATE USER 'hive'@'%' IDENTIFIED BY 'Boonya@2024';
create user 'boonya'@'%' identified by 'Boonya@2024';
create user 'test'@'%' identified by 'Boonya@2024';
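The hive user also needs privileges on the metastore database that hive-site.xml will point at; a minimal sketch, run in the mysql client as root (granting on *.* is the blunt option; narrow it down if you prefer):
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%';
FLUSH PRIVILEGES;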
Reference: detailed steps for installing MySQL on CentOS Stream 9 - 小白典 - cnblogs
Configuring Hive to use MySQL
There are two configuration templates under /usr/local/hive/conf that need to be copied and configured:
cd /usr/local/hive/conf
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
Configure hive-config.sh
The file is located in /usr/local/hive/bin; edit hive-config.sh:
vi /usr/local/hive/bin/hive-config.sh
Append the environment variables at the end:
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
Copy the MySQL JDBC driver
Download the matching MySQL driver: https://mvnrepository.com/artifact/mysql/mysql-connector-java
Then put it into /usr/local/hive/lib:
cp mysql-connector-java-8.0.16.redhat-00001.jar /usr/local/hive/lib/
Create the Hive temporary directory
mkdir /usr/local/hive/tmp
Configure the database connection in hive-site.xml
Find ConnectionURL, ConnectionDriverName, ConnectionUserName, and ConnectionPassword and set them to the MySQL JDBC URL, driver, account, and password.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop0:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<!-- driver class name: MySQL 8 uses com.mysql.cj.jdbc.Driver; earlier versions used com.mysql.jdbc.Driver -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Boonya@2024</value>
<description>password to use against metastore database</description>
</property>
<!-- ------- required Hive connection settings; these fix JDBC connection problems ------------------ -->
<!-- address of the metastore to connect to -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop0:9083</value>
<description>URI for client to connect to metastore server</description>
</property>
<!-- host that hiveserver2 binds to -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop0</value>
</property>
<!-- port that hiveserver2 listens on -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<!-- hiveserver2 HA parameter; without it, opening a Tez session can prevent hiveserver2 from starting -->
<property>
<name>hive.server2.active.passive.ha.enable</name>
<value>true</value>
</property>
<!-- fixes the "Error initializing notification event poll" problem -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
Configure directory settings in hive-site.xml
Find ${system:java.io.tmpdir} and replace all occurrences (4 in total) with /usr/local/hive/tmp.
Find ${system:user.name} and replace all occurrences (3 in total) with root.
Note: earlier Hive versions did not have these settings.
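The batch replacement can be done with sed; a sketch, assuming hive-site.xml is in /usr/local/hive/conf:
cd /usr/local/hive/conf
sed -i 's#${system:java.io.tmpdir}#/usr/local/hive/tmp#g' hive-site.xml
sed -i 's#${system:user.name}#root#g' hive-site.xml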
Initialize the Hive schema
Initialize the Hive database:
schematool -dbType mysql -initSchema
Note: earlier Hive versions did not require the schematool initialization step.
Starting the Hive services and verifying access
Open two terminals and start the metastore and hiveserver2 separately (or run them in the background as sketched below):
# port 9083 (the Hive Metastore is Hive's metadata service; make sure it is running and that its address is configured correctly for Beeline)
hive --service metastore
# port 10000 (HiveServer2 is Hive's query service; make sure it is running and that its address is configured correctly for Beeline)
hive --service hiveserver2
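To keep both services running without holding two terminals open, they can also be started in the background; a sketch, with arbitrary log file locations:
nohup hive --service metastore > /usr/local/hive/metastore.log 2>&1 &
nohup hive --service hiveserver2 > /usr/local/hive/hiveserver2.log 2>&1 &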
Verify by entering Hive:
hive
hive> show databases;
If there is no error, the cluster setup is complete. You will probably still hit issues such as not being able to connect to Hive; see the Hive troubleshooting section below.
Output of the verification session:
[root@hadoop0 ~]# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Beeline version 4.0.0 by Apache Hive
beeline> show databases;
No current connection
beeline> show tables;
No current connection
beeline> !connect jdbc:hive2://hadoop0:10000
Connecting to jdbc:hive2://hadoop0:10000
Enter username for jdbc:hive2://hadoop0:10000: hive
Enter password for jdbc:hive2://hadoop0:10000: ***********
Connected to: Apache Hive (version 4.0.0)
Driver: Hive JDBC (version 4.0.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop0:10000> show tables;
INFO : Compiling command(queryId=root_20241030223228_238fdf76-8a00-4593-b1ba-77b5a296a644): show tables
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=root_20241030223228_238fdf76-8a00-4593-b1ba-77b5a296a644); Time taken: 12.536 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030223228_238fdf76-8a00-4593-b1ba-77b5a296a644): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030223228_238fdf76-8a00-4593-b1ba-77b5a296a644); Time taken: 3.105 seconds
+-----------+
| tab_name |
+-----------+
+-----------+
No rows selected (20.096 seconds)
0: jdbc:hive2://hadoop0:10000> show databases;
INFO : Compiling command(queryId=root_20241030223255_c3ed7a41-5f16-40cc-a390-2cf5edb1283a): show databases
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=root_20241030223255_c3ed7a41-5f16-40cc-a390-2cf5edb1283a); Time taken: 0.28 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030223255_c3ed7a41-5f16-40cc-a390-2cf5edb1283a): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030223255_c3ed7a41-5f16-40cc-a390-2cf5edb1283a); Time taken: 0.972 seconds
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (2.003 seconds)
0: jdbc:hive2://hadoop0:10000> create database test;
INFO : Compiling command(queryId=root_20241030223729_629d6d72-e6f5-4725-9e8b-4d5b71054337): create database test
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=root_20241030223729_629d6d72-e6f5-4725-9e8b-4d5b71054337); Time taken: 0.308 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030223729_629d6d72-e6f5-4725-9e8b-4d5b71054337): create database test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030223729_629d6d72-e6f5-4725-9e8b-4d5b71054337); Time taken: 4.287 seconds
No rows affected (5.252 seconds)
0: jdbc:hive2://hadoop0:10000> show databases;
INFO : Compiling command(queryId=root_20241030223738_75a94fc5-951c-4100-beff-10bb4f54db25): show databases
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=root_20241030223738_75a94fc5-951c-4100-beff-10bb4f54db25); Time taken: 0.656 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030223738_75a94fc5-951c-4100-beff-10bb4f54db25): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030223738_75a94fc5-951c-4100-beff-10bb4f54db25); Time taken: 0.164 seconds
+----------------+
| database_name |
+----------------+
| default |
| test |
+----------------+
2 rows selected (1.087 seconds)
0: jdbc:hive2://hadoop0:10000> use test;
INFO : Compiling command(queryId=root_20241030223806_c9639ade-6800-4cc4-b000-e17f67e2152b): use test
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=root_20241030223806_c9639ade-6800-4cc4-b000-e17f67e2152b); Time taken: 0.105 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030223806_c9639ade-6800-4cc4-b000-e17f67e2152b): use test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030223806_c9639ade-6800-4cc4-b000-e17f67e2152b); Time taken: 0.06 seconds
No rows affected (0.277 seconds)
0: jdbc:hive2://hadoop0:10000> create table sqoop_students (
. . . . . . . . . . . . . . .> id int primary key DISABLE NOVALIDATE RELY,
. . . . . . . . . . . . . . .> name varchar(20),
. . . . . . . . . . . . . . .> sex varchar(20),
. . . . . . . . . . . . . . .> age int,
. . . . . . . . . . . . . . .> department varchar(20)
. . . . . . . . . . . . . . .> );
INFO : Compiling command(queryId=root_20241030224500_382d7aa3-0716-40c3-a892-8501ac854863): create table sqoop_students (
id int primary key DISABLE NOVALIDATE RELY,
name varchar(20),
sex varchar(20),
age int,
department varchar(20)
)
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=root_20241030224500_382d7aa3-0716-40c3-a892-8501ac854863); Time taken: 0.674 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030224500_382d7aa3-0716-40c3-a892-8501ac854863): create table sqoop_students (
id int primary key DISABLE NOVALIDATE RELY,
name varchar(20),
sex varchar(20),
age int,
department varchar(20)
)
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030224500_382d7aa3-0716-40c3-a892-8501ac854863); Time taken: 2.751 seconds
No rows affected (4.499 seconds)
0: jdbc:hive2://hadoop0:10000> show tables;
INFO : Compiling command(queryId=root_20241030224512_83456b37-d200-497e-b717-f4420134e449): show tables
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=root_20241030224512_83456b37-d200-497e-b717-f4420134e449); Time taken: 0.31 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=root_20241030224512_83456b37-d200-497e-b717-f4420134e449): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=root_20241030224512_83456b37-d200-497e-b717-f4420134e449); Time taken: 0.458 seconds
+-----------------+
| tab_name |
+-----------------+
| sqoop_students |
+-----------------+
1 row selected (1.374 seconds)
0: jdbc:hive2://hadoop0:10000>
Note: CREATE TABLE statements must follow Hive's DDL syntax.
Hive troubleshooting
No current connection
Because the author skipped starting the metastore and hiveserver2 from the previous step, the database could not be connected for a long time. After working through it, there are three ways to connect:
#1. first: use the beeline command
beeline -u jdbc:hive2://hadoop0:10000 --verbose=true
#2. second: type hive, then connect from inside the shell
!connect jdbc:hive2://hadoop0:10000
#3. third: log in through beeline with a user
beeline -u jdbc:hive2://hadoop0:10000 -n hive -pBoonya@2024
The hadoop0 host and port above are the hiveserver2 we started; the relevant settings are already in hive-site.xml. Reference: (latest) Hive 4.0.0 + Hadoop 3.3.4 cluster installation (pitfall-free), big data learning series (1) - CSDN blog
Spark cluster installation
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, the pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Apache Spark™ - Unified Engine for large-scale data analytics
Spark installation assumes the Hadoop services are installed and started. The Spark cluster only needs to be set up on hadoop0; the other nodes just receive copies of the relevant files. Since Spark depends on Scala, install Scala first.
Scala installation
Download and unpack
Official site: https://www.scala-lang.org/
Download the LTS version: Release 3.3.3 · scala/scala3 · GitHub
This uses 3.3.3 as the example:
cd /usr/local
wget https://github.com/scala/scala3/releases/download/3.3.3/scala3-3.3.3.tar.gz
tar -zxvf scala3-3.3.3.tar.gz
mv scala3-3.3.3 scala
Set environment variables
# scala
export SCALA_HOME=/usr/local/scala
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$SCALA_HOME/bin:$PATH
Apply the environment configuration
source /etc/profile
Spark installation
Official site: Apache Spark™ - Unified Engine for large-scale data analytics
Download and unpack
Download link: https://dlcdn.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
Command line steps:
cd /usr/local
wget https://dlcdn.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
tar -zxvf spark-3.5.2-bin-hadoop3.tgz
mv spark-3.5.2-bin-hadoop3 spark
Set environment variables
# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
Apply the environment configuration
source /etc/profile
Configure Spark
Go into /usr/local/spark/conf and create the workers, spark-env.sh, and spark-defaults.conf files from workers.template, spark-env.sh.template, and spark-defaults.conf.template.
Note: in recent versions slaves.template has been renamed to workers.template.
cd /usr/local/spark/conf
cp workers.template workers
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
Configure workers
Append the worker node hostnames to this file:
hadoop1
hadoop2
Configure spark-env.sh
Append the following to spark-env.sh:
export JAVA_HOME=/usr/local/jdk
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_HOST=hadoop0
export SPARK_PID_DIR=/usr/local/spark/data/pid
export SPARK_LOCAL_DIRS=/usr/local/spark/data/spark_shuffle
export SPARK_EXECUTOR_MEMORY=500M
export SPARK_WORKER_MEMORY=4G
Configure spark-defaults.conf
Append the following to spark-defaults.conf:
spark.master spark://hadoop0:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop0:9000/eventLog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 1g
Note that the /eventLog directory has to be created in HDFS manually:
hadoop fs -mkdir /eventLog
Configure /root/.bashrc
Edit /root/.bashrc and add the JAVA_HOME configuration:
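The screenshot of the file is omitted here; the lines to append are essentially the same JAVA_HOME settings as in /etc/profile (a sketch):
export JAVA_HOME=/usr/local/jdk
export PATH=$JAVA_HOME/bin:$PATH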
Copy the Spark files to the worker nodes
Again use scp to copy the files to hadoop1 and hadoop2:
scp -r /usr/local/scala hadoop1:/usr/local
scp -r /usr/local/scala hadoop2:/usr/local
scp -r /usr/local/spark hadoop1:/usr/local
scp -r /usr/local/spark hadoop2:/usr/local
scp /root/.bashrc hadoop1:/root/
scp /root/.bashrc hadoop2:/root/
scp /etc/profile hadoop1:/etc/
scp /etc/profile hadoop2:/etc/
Apply the environment configuration
source /etc/profile
Verify Spark
Start Hadoop before verifying Spark; here Spark runs as a component alongside Hadoop.
Start Spark
Go into /usr/local/spark/sbin:
cd /usr/local/spark/sbin
./start-all.sh
Or run it as a single command:
/usr/local/spark/sbin/start-all.sh
The startup log is as follows:
A Master process is created on hadoop0 and Worker processes on hadoop1 and hadoop2.
hadoop0
hadoop1
hadoop2
Run the sample program
The example launcher is /usr/local/spark/bin/run-example; let's run a Pi calculation:
/usr/local/spark/bin/run-example SparkPi
Console output:
[root@hadoop0 ~]# /usr/local/spark/bin/run-example SparkPi
24/09/20 15:24:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/09/20 15:24:44 INFO SparkContext: Running Spark version 3.5.2
24/09/20 15:24:44 INFO SparkContext: OS info Linux, 5.14.0-505.el9.x86_64, amd64
24/09/20 15:24:44 INFO SparkContext: Java version 1.8.0_421
24/09/20 15:24:44 INFO ResourceUtils: ==============================================================
24/09/20 15:24:44 INFO ResourceUtils: No custom resources configured for spark.driver.
24/09/20 15:24:44 INFO ResourceUtils: ==============================================================
24/09/20 15:24:44 INFO SparkContext: Submitted application: Spark Pi
24/09/20 15:24:44 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(memory -> name: memory, amount: 500, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/09/20 15:24:44 INFO ResourceProfile: Limiting resource is cpu
24/09/20 15:24:44 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/09/20 15:24:44 INFO SecurityManager: Changing view acls to: root
24/09/20 15:24:44 INFO SecurityManager: Changing modify acls to: root
24/09/20 15:24:44 INFO SecurityManager: Changing view acls groups to:
24/09/20 15:24:44 INFO SecurityManager: Changing modify acls groups to:
24/09/20 15:24:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: root; groups with view permissions: EMPTY; users with modify permissions: root; groups with modify permissions: EMPTY
24/09/20 15:24:44 INFO Utils: Successfully started service 'sparkDriver' on port 39635.
24/09/20 15:24:44 INFO SparkEnv: Registering MapOutputTracker
24/09/20 15:24:45 INFO SparkEnv: Registering BlockManagerMaster
24/09/20 15:24:45 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/09/20 15:24:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/09/20 15:24:45 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
24/09/20 15:24:45 INFO DiskBlockManager: Created local directory at /usr/local/spark/data/spark_shuffle/blockmgr-cb2c39b8-d13d-496d-b41c-3fa90ffcc888
24/09/20 15:24:45 INFO MemoryStore: MemoryStore started with capacity 413.9 MiB
24/09/20 15:24:45 INFO SparkEnv: Registering OutputCommitCoordinator
24/09/20 15:24:45 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
24/09/20 15:24:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/09/20 15:24:45 INFO SparkContext: Added JAR file:///usr/local/spark/examples/jars/spark-examples_2.12-3.5.2.jar at spark://hadoop0:39635/jars/spark-examples_2.12-3.5.2.jar with timestamp 1726817084030
24/09/20 15:24:45 INFO SparkContext: Added JAR file:///usr/local/spark/examples/jars/scopt_2.12-3.7.1.jar at spark://hadoop0:39635/jars/scopt_2.12-3.7.1.jar with timestamp 1726817084030
24/09/20 15:24:45 WARN SparkContext: The JAR file:/usr/local/spark/examples/jars/spark-examples_2.12-3.5.2.jar at spark://hadoop0:39635/jars/spark-examples_2.12-3.5.2.jar has been added already. Overwriting of added jar is not supported in the current version.
24/09/20 15:24:46 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://hadoop0:7077...
24/09/20 15:24:46 INFO TransportClientFactory: Successfully created connection to hadoop0/192.168.158.100:7077 after 108 ms (0 ms spent in bootstraps)
24/09/20 15:24:47 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20240920152447-0000
24/09/20 15:24:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41663.
24/09/20 15:24:47 INFO NettyBlockTransferService: Server created on hadoop0:41663
24/09/20 15:24:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/09/20 15:24:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop0, 41663, None)
24/09/20 15:24:47 INFO BlockManagerMasterEndpoint: Registering block manager hadoop0:41663 with 413.9 MiB RAM, BlockManagerId(driver, hadoop0, 41663, None)
24/09/20 15:24:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop0, 41663, None)
24/09/20 15:24:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop0, 41663, None)
24/09/20 15:24:47 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20240920152447-0000/0 on worker-20240920151607-192.168.158.101-33765 (192.168.158.101:33765) with 1 core(s)
24/09/20 15:24:47 INFO StandaloneSchedulerBackend: Granted executor ID app-20240920152447-0000/0 on hostPort 192.168.158.101:33765 with 1 core(s), 500.0 MiB RAM
24/09/20 15:24:47 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20240920152447-0000/1 on worker-20240920151607-192.168.158.102-37905 (192.168.158.102:37905) with 1 core(s)
24/09/20 15:24:47 INFO StandaloneSchedulerBackend: Granted executor ID app-20240920152447-0000/1 on hostPort 192.168.158.102:37905 with 1 core(s), 500.0 MiB RAM
24/09/20 15:24:48 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20240920152447-0000/0 is now RUNNING
24/09/20 15:24:48 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20240920152447-0000/1 is now RUNNING
24/09/20 15:24:49 INFO SingleEventLogFileWriter: Logging events to hdfs://hadoop0:9000/eventLog/app-20240920152447-0000.inprogress
24/09/20 15:24:53 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
24/09/20 15:24:54 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.158.102:40686) with ID 1, ResourceProfileId 0
24/09/20 15:24:54 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.158.101:52264) with ID 0, ResourceProfileId 0
24/09/20 15:24:55 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.158.102:35887 with 110.0 MiB RAM, BlockManagerId(1, 192.168.158.102, 35887, None)
24/09/20 15:24:55 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.158.101:34089 with 110.0 MiB RAM, BlockManagerId(0, 192.168.158.101, 34089, None)
24/09/20 15:24:55 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
24/09/20 15:24:55 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
24/09/20 15:24:55 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
24/09/20 15:24:55 INFO DAGScheduler: Parents of final stage: List()
24/09/20 15:24:55 INFO DAGScheduler: Missing parents: List()
24/09/20 15:24:55 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
24/09/20 15:24:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.0 KiB, free 413.9 MiB)
24/09/20 15:24:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.3 KiB, free 413.9 MiB)
24/09/20 15:24:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop0:41663 (size: 2.3 KiB, free: 413.9 MiB)
24/09/20 15:24:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1585
24/09/20 15:24:57 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
24/09/20 15:24:57 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks resource profile 0
24/09/20 15:24:57 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (192.168.158.102, executor 1, partition 0, PROCESS_LOCAL, 9167 bytes)
24/09/20 15:24:57 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (192.168.158.101, executor 0, partition 1, PROCESS_LOCAL, 9167 bytes)
24/09/20 15:24:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.158.102:35887 (size: 2.3 KiB, free: 110.0 MiB)
24/09/20 15:24:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.158.101:34089 (size: 2.3 KiB, free: 110.0 MiB)
24/09/20 15:24:59 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2352 ms on 192.168.158.102 (executor 1) (1/2)
24/09/20 15:24:59 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2547 ms on 192.168.158.101 (executor 0) (2/2)
24/09/20 15:24:59 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 4.146 s
24/09/20 15:24:59 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
24/09/20 15:24:59 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
24/09/20 15:24:59 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
24/09/20 15:24:59 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.337072 s
Pi is roughly 3.1383756918784593
24/09/20 15:24:59 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/09/20 15:24:59 INFO SparkUI: Stopped Spark web UI at http://hadoop0:4040
24/09/20 15:24:59 INFO StandaloneSchedulerBackend: Shutting down all executors
24/09/20 15:24:59 INFO StandaloneSchedulerBackend$StandaloneDriverEndpoint: Asking each executor to shut down
24/09/20 15:25:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/09/20 15:25:01 INFO MemoryStore: MemoryStore cleared
24/09/20 15:25:01 INFO BlockManager: BlockManager stopped
24/09/20 15:25:01 INFO BlockManagerMaster: BlockManagerMaster stopped
24/09/20 15:25:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/09/20 15:25:02 INFO SparkContext: Successfully stopped SparkContext
24/09/20 15:25:02 INFO ShutdownHookManager: Shutdown hook called
24/09/20 15:25:02 INFO ShutdownHookManager: Deleting directory /usr/local/spark/data/spark_shuffle/spark-35d0c193-af28-46ae-a5a9-52f7b54e5012
24/09/20 15:25:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-381e4650-7f49-4aef-b949-a5996b06a4ea
[root@hadoop0 ~]#
Spark shell commands
Start it with spark-shell:
spark-shell
This opens a Scala REPL:
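spark-shell picks up spark.master from spark-defaults.conf; the master URL can also be passed explicitly (a sketch):
spark-shell --master spark://hadoop0:7077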
Check Spark's running state
Task scheduling can be inspected at hadoop0:4040
Spark node information can be inspected at hadoop0:8080
Quick-start commands for the clusters
Start the Hadoop cluster
/usr/local/hadoop/sbin/start-all.sh
Start the HBase cluster
/usr/local/hbase/bin/start-hbase.sh
Start the Spark cluster
/usr/local/spark/sbin/start-all.sh
Big data project
The bigdata project
The project draws its motivation from the book 《大数据Hadoop3.x分布式处理实战》, which is well worth reading. You cannot talk about big data without Hadoop, and its ecosystem is very rich, which also makes learning harder. To keep learning big data you have to keep digging deeper and exploring unfamiliar areas; that is how you gain more and build a core competitive skill for your career. I will keep adding big data processing experience to this project so that more people can avoid detours.
GitHub - boonya-bigdata/bigdata: Based Hadoop ecosystem components in bigdata fields. https://github.com/boonya-bigdata/bigdata
The project currently has two core modules, Hadoop and Spark; since Spark does not depend on Hadoop and can do in-memory, streaming-style task processing on its own, these two are treated as the core modules for now.
In bigdata-hadoop, the bigdata-hadoop-3.x module currently contains the source code that ships with 《大数据Hadoop3.x分布式处理实战》.
The goal of bigdata is to learn and master big data processing in depth, so that big data is no longer mysterious and out of reach.
Getting the configuration resources
To get the full set of configuration files for the big data cluster environment, leave a comment.
Getting the configured cluster instances
Feel free to leave a comment, and let's exchange ideas!