References
- CentOS7下超详细搭建完全分布式集群——hadoop2.7.7
https://blog.youkuaiyun.com/zht245648124/article/details/88093071
- Intellij IDEA编写Spark应用程序超详细步骤(IDEA+Maven+Scala)
https://blog.youkuaiyun.com/Wing_kin666/article/details/111246201
- spark-在IDEA中搭建scala编程环境
https://blog.youkuaiyun.com/weixin_52831324/article/details/127038011
Versions
- VMware 16
- 3 virtual machines, NAT mode:
  - CentOS 7 (hadoopmaster, hadoopnode2, hadoopnode3)
  - This builds a fully distributed 3-node cluster: 1 NameNode and 2 DataNodes
- java 1.8.0_221
- hadoop 2.7.7
- winutils
- zookeeper 3.4.14
- mysql 5.7.25
- apache-hive 2.3.4
- scala 2.11.8
- spark 2.4.3
- mysql-connector-java 5.1.47
- idea 2020.1.4
WINDOWS
To use IDEA you first need to set up a Hadoop environment on Windows.
1. Java installation
- Run the installer jdk-8u221-windows-x64.exe
- Choose a custom install directory during installation and note it down: E:\mysoft\Java\jdk1.8.0_221
- Add the system variable JAVA_HOME = E:\mysoft\Java\jdk1.8.0_221
- Add the system variable CLASSPATH = %JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar;
- Edit the PATH system variable and append %JAVA_HOME%\bin;%JAVA_HOME%\jre\bin;
- Test:
java -version
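- If JAVA_HOME and PATH are set correctly, the output should look roughly like the sketch below (the exact build number may differ):
# java version "1.8.0_221"
# Java(TM) SE Runtime Environment (build 1.8.0_221-bxx)
# Java HotSpot(TM) 64-Bit Server VM (build 25.221-bxx, mixed mode)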
2. Hadoop
- Extract hadoop-2.7.7.tar.gz
- Replace hadoop/bin with winutils\hadoop2.7.7\bin
- Copy winutils.exe and hadoop.dll from hadoop/bin into C:\Windows\System32
- Add the system variable HADOOP_HOME = E:\mysoft\hadoop-2.7.7
- Edit the PATH system variable and append %HADOOP_HOME%\bin
- (Skipped) Run cmd as administrator, switch to drive E and edit hadoop-env.cmd:
# cd /d e:\
# cd E:\mysoft\hadoop-2.7.7\etc\hadoop
# notepad hadoop-env.cmd
# add the line:
# set JAVA_HOME=E:\mysoft\Java\jdk1.8.0_221
LINUX
Configure, in order: java, zookeeper, hadoop, mysql, hive, spark, scala.
1. Disable the firewall
hadoopmaster, hadoopnode2, hadoopnode3
systemctl status firewalld.service
systemctl stop firewalld.service
systemctl disable firewalld.service
# disable the SELinux security mechanism:
vi /etc/sysconfig/selinux
# change the line to SELINUX=disabled
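- Editing /etc/sysconfig/selinux only takes effect after a reboot. A quick way to disable SELinux for the current session and confirm the state (a sketch using standard CentOS 7 commands):
setenforce 0      # switch to permissive mode immediately, no reboot needed
getenforce        # prints Permissive (or Disabled after a reboot)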
2. Set the hostnames
hadoopmaster, hadoopnode2, hadoopnode3
# hadoopmaster
hostnamectl set-hostname hadoopmaster
vi /etc/sysconfig/network
# add or modify
NETWORKING=yes
HOSTNAME=hadoopmaster
# hadoopnode2
hostnamectl set-hostname hadoopnode2
vi /etc/sysconfig/network
# add or modify
NETWORKING=yes
HOSTNAME=hadoopnode2
# hadoopnode3
hostnamectl set-hostname hadoopnode3
vi /etc/sysconfig/network
# add or modify
NETWORKING=yes
HOSTNAME=hadoopnode3
- On all three machines edit /etc/hosts. This file maps hostnames to IP addresses; when a hostname is accessed, the system looks it up in hosts first.
- Add the following lines on all three machines, then save and exit. Use your actual IPs and hostnames, with a space between the IP and the hostname.
vi /etc/hosts
192.168.xx.xxx hadoopnode2
192.168.xx.xxx hadoopnode3
192.168.xx.xxx hadoopmaster
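- A quick sanity check that name resolution works from each machine (a sketch; hostnames as configured above):
ping -c 3 hadoopmaster
ping -c 3 hadoopnode2
ping -c 3 hadoopnode3
# each ping should resolve to the IP you put in /etc/hosts and get replies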
3. NTP time synchronization
- Check that the ntp packages are installed
- hadoopmaster, hadoopnode2, hadoopnode3
rpm -qa | grep ntp
# output should include:
# ntpdate-4.2.6p5-28.el7.centos.x86_64
# ntp-4.2.6p5-28.el7.centos.x86_64
- Configuration
- hadoopmaster
# reference: https://www.cnblogs.com/gzgBlog/p/14636108.html
# "server" sets the ntp server address; using the local clock here makes this host act as the time server
# "fudge" sets the stratum of the time server (0-15): 0 is the top level, 10 is commonly used when serving time to a LAN
echo "server 127.127.1.0
fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf
- Set the time zone
- hadoopmaster
tzselect
5
9
1
1
# you will see the line: TZ='Asia/Shanghai'; export TZ
# copy that line into /etc/profile and source it
vi /etc/profile
TZ='Asia/Shanghai'; export TZ
source /etc/profile
date
- Start the service
- hadoopmaster
# master
/bin/systemctl restart ntpd.service
- Add a cron job
- hadoopnode2, hadoopnode3
ntpdate hadoopmaster
crontab -e
# add one of the following lines:
*/30 8-17 * * * /usr/sbin/ntpdate hadoopmaster    # sync every 30 minutes between 08:00 and 17:00
# */10 * * * * /usr/sbin/ntpdate hadoopmaster     # sync every 10 minutes
# list the current cron jobs
# crontab -l
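- To confirm that synchronization actually works, a quick manual check (a sketch; assumes ntpd is running on hadoopmaster):
# on hadoopmaster: list the peers ntpd is using (the local clock 127.127.1.0 should appear)
ntpq -p
# on hadoopnode2 / hadoopnode3: query the master without setting the clock
ntpdate -q hadoopmaster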
4. Passwordless SSH login
- Create a key pair with an empty passphrase
- hadoopmaster
# ssh-keygen defaults to the rsa key type
# -t sets the key type (dsa here), -P sets the passphrase ('' means none), -f sets where the generated key is saved
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# ssh-copy-id appends the content of id_dsa.pub to .ssh/authorized_keys of the root user on the remote host
ssh-copy-id localhost
ssh-copy-id hadoopnode2
ssh-copy-id hadoopnode3
- Set permissions on the key file
- hadoopmaster, hadoopnode2, hadoopnode3
chmod 600 ~/.ssh/authorized_keys
- Test passwordless login; if prompted, type yes and press Enter
# master
ssh hadoopmaster
ssh hadoopnode2
ssh hadoopnode3
5. Java installation
5.1. Remove OpenJDK
hadoopmaster, hadoopnode2, hadoopnode3
java -version
- List the installed OpenJDK packages
rpm -qa | grep java
- Remove the bundled OpenJDK and related java packages
rpm -e --nodeps java-1.8.0-openjdk-1.8.0.262.b10-1.el7.x86_64
rpm -e --nodeps java-1.7.0-openjdk-1.7.0.261-2.6.22.2.el7_8.x86_64
rpm -e --nodeps java-1.7.0-openjdk-headless-1.7.0.261-2.6.22.2.el7_8.x86_64
rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.262.b10-1.el7.x86_64
5.2. Install
hadoopmaster
- The installation packages are stored in /usr/package277/; extract the JDK archive into /usr/java
mkdir /usr/java
tar -zxvf /usr/package277/jdk-8u221-linux-x64.tar.gz -C /usr/java/
- In /etc/profile set the JAVA_HOME environment variable and add the JDK bin directory to PATH, then source the file and check the JDK version
vim /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_221
export CLASSPATH=$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
- Verify
java -version
5.3. Copy to the other nodes
hadoopmaster
scp -r /usr/java/ root@hadoopnode2:/usr/java/
scp -r /usr/java/ root@hadoopnode3:/usr/java/
6. Zookeeper installation
6.1. Install
- Extract the Zookeeper archive into /usr/zookeeper (the package is stored in /usr/package277/)
- hadoopmaster
# master
mkdir /usr/zookeeper
tar -zxvf /usr/package277/zookeeper-3.4.14.tar.gz -C /usr/zookeeper/
- Set the ZOOKEEPER_HOME environment variable and add Zookeeper's bin directory to PATH, then source the file
- hadoopmaster, hadoopnode2, hadoopnode3
vi /etc/profile
#zookeeper
export ZOOKEEPER_HOME=/usr/zookeeper/zookeeper-3.4.14
export PATH=$PATH:$ZOOKEEPER_HOME/bin
source /etc/profile
- Zookeeper ships with the sample configuration file conf/zoo_sample.cfg under the install path; rename it to zoo.cfg
- hadoopmaster
cd /usr/zookeeper/zookeeper-3.4.14/conf
mv zoo_sample.cfg zoo.cfg
- Set the data directory (dataDir) to /usr/zookeeper/zookeeper-3.4.14/zkdata
- Set the log directory (dataLogDir) to /usr/zookeeper/zookeeper-3.4.14/zkdatalog
- Configure the server list
- Create the data and log directories
- hadoopmaster
mkdir /usr/zookeeper/zookeeper-3.4.14/zkdata /usr/zookeeper/zookeeper-3.4.14/zkdatalog
cd /usr/zookeeper/zookeeper-3.4.14/conf
vi zoo.cfg
# data directory
dataDir=/usr/zookeeper/zookeeper-3.4.14/zkdata
# log directory
dataLogDir=/usr/zookeeper/zookeeper-3.4.14/zkdatalog
# server list
server.1=hadoopmaster:2888:3888
server.2=hadoopnode2:2888:3888
server.3=hadoopnode3:2888:3888
6.2. Copy to the other nodes
hadoopmaster
scp -r /usr/zookeeper/ root@hadoopnode2:/usr/zookeeper/
scp -r /usr/zookeeper/ root@hadoopnode3:/usr/zookeeper/
6.3. myid
- In the data directory create a myid file containing each host's server id
# master
cd /usr/zookeeper/zookeeper-3.4.14/zkdata
echo "1" >> myid
# node2
cd /usr/zookeeper/zookeeper-3.4.14/zkdata
echo "2" >> myid
# node3
cd /usr/zookeeper/zookeeper-3.4.14/zkdata
echo "3" >> myid
6.4. Start the service
hadoopmaster, hadoopnode2, hadoopnode3
- Check that the QuorumPeerMain process exists
zkServer.sh start
jps
- Check that each node has the expected role (leader/follower)
zkServer.sh status
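- On a healthy three-node ensemble the status output should look roughly like the sketch below (which node is elected leader can vary):
# ZooKeeper JMX enabled by default
# Using config: /usr/zookeeper/zookeeper-3.4.14/bin/../conf/zoo.cfg
# Mode: follower     (exactly one of the three nodes reports "Mode: leader")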
7. Hadoop installation
7.1. Install
- The installation packages are stored in /usr/package277/; extract the Hadoop archive into /usr/hadoop
- hadoopmaster
mkdir /usr/hadoop
tar -zxvf /usr/package277/hadoop-2.7.7.tar.gz -C /usr/hadoop/
- Configure environment variables
- hadoopmaster, hadoopnode2, hadoopnode3
vim /etc/profile
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
- Configure hadoop-env.sh: comment out the original JAVA_HOME line and add the real one
- hadoopmaster
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_221
- In any directory type hado and press Tab; if it auto-completes to hadoop the environment variables are correct, otherwise recheck them
- Configure yarn-env.sh: add JAVA_HOME
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/yarn-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_221
- Create the three directories that will be used later
mkdir -p /usr/local/hadoop/tmp
mkdir -p /usr/local/hadoop/hdfs/name
mkdir /usr/local/hadoop/hdfs/data
- Configure core-site.xml: add the following inside the configuration tag, then save and exit
- Note that the configured address hdfs://hadoopmaster:9000 cannot be opened in a browser; replace hadoopmaster with your own master hostname if it differs
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://hadoopmaster:9000</value>
<description>Default name of HDFS</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopmaster:9000</value>
<description>URI of HDFS</description>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>Local Hadoop temp directory on each node</description>
</property>
- Configure hdfs-site.xml: add the following inside the configuration tag, then save and exit
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hdfs/name</value>
<description>Where the namenode stores HDFS namespace metadata</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hdfs/data</value>
<description>Physical storage location of data blocks on the datanodes</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Replication factor, default 3; should not exceed the number of datanodes</description>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value>
</property>
- Copy mapred-site.xml.template in the same directory and rename the copy to mapred-site.xml
- Set the compute framework so that MapReduce runs on YARN
cd /usr/hadoop/hadoop-2.7.7/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Use the YARN framework for MapReduce</description>
</property>
- Edit yarn-site.xml: add the following inside the configuration tag, then save and exit
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/yarn-site.xml
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoopmaster</value>
<description>Hostname of the resourcemanager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>
Auxiliary service run by the NodeManager.
It must be set to mapreduce_shuffle for MapReduce programs to run.
</description>
</property>
- Edit slaves
vi /usr/hadoop/hadoop-2.7.7/etc/hadoop/slaves
hadoopnode2
hadoopnode3
7.2. Copy to the other nodes
hadoopnode2, hadoopnode3
- If the node machines do not yet have java and hadoop, run the following on the master to copy them over
# master
scp -r /usr/hadoop/ root@hadoopnode2:/usr/hadoop/
scp -r /usr/hadoop/ root@hadoopnode3:/usr/hadoop/
7.3. Format the NameNode
- On the master, in any directory run hdfs namenode -format to format the namenode. This only needs to be done the first time; it may need to be repeated after certain configuration changes
- hadoopmaster
hdfs namenode -format
7.4. Start
- On the master, go into Hadoop's sbin directory and run ./start-all.sh to start Hadoop; type yes and press Enter if prompted
- hadoopmaster
/usr/hadoop/hadoop-2.7.7/sbin/start-all.sh
yes
7.4.1. jps
- Run jps to list the current Java processes. The command has been part of the JDK since 1.5 and prints the PID and main class name of each Java process. On the NameNode, three processes besides Jps indicate a successful start
- hadoopmaster, hadoopnode2, hadoopnode3
jps
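- As a rough guide (a sketch, assuming the configuration above where hadoopmaster runs the NameNode and ResourceManager), the process lists should look like this; PIDs will differ:
# hadoopmaster:
# xxxx NameNode
# xxxx SecondaryNameNode
# xxxx ResourceManager
# xxxx QuorumPeerMain   (from ZooKeeper, if it is still running)
# xxxx Jps
# hadoopnode2 / hadoopnode3:
# xxxx DataNode
# xxxx NodeManager
# xxxx QuorumPeerMain
# xxxx Jps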
7.4.2. Web UI
hadoopmaster, hadoopnode2, hadoopnode3
- In a browser, ports 8088 and 50070 on the NameNode show Hadoop's running status
192.168.xx.xxx:8088
192.168.xx.xxx:50070
7.4.3. Command line
- Check the cluster status
- hadoopmaster
hdfs dfsadmin -report
7.5. Stop
hadoopmaster
cd /usr/hadoop/hadoop-2.7.7/sbin
./stop-all.sh
7.6. Run wordcount
- With the cluster running (start it again with ./start-all.sh from Hadoop's sbin directory if you stopped it in the previous step), run the wordcount test. First create file1.txt and file2.txt under /usr/local/hadoop/file
- hadoopmaster
mkdir -p /usr/local/hadoop/file
cd /usr/local/hadoop/file
echo "hello World
hello java
hello c
" >file1.txt
echo "hello merlin
hello fei
hello python
hello world
hello math
" >file2.txt
hadoop fs -mkdir /input
hadoop fs -put /usr/local/hadoop/file/file*.txt /input
hadoop fs -ls /input
hadoop fs -cat /input/file1.txt
# output:
# hello World
# hello java
# hello c
hadoop jar /usr/hadoop/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output
hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000
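- For the two input files above the counts can be worked out by hand, so the output should contain roughly the following (a sketch; "World" and "world" are counted separately because the words are not lowercased):
# World   1
# c       1
# fei     1
# hello   8
# java    1
# math    1
# merlin  1
# python  1
# world   1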
8. Create a maven+mapreduce project
8.1. Install the plugin
- Big Data Tools
8.2. Create the project
- Create a new maven project
- Add log4j.properties to the resources directory (src/main/resources)
log4j.rootLogger=debug,stdout,R
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.File=mapreduce_test.log
log4j.appender.R.MaxFileSize=1MB
log4j.appender.R.MaxBackupIndex=1
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%p %t %c - %m%n
log4j.logger.com.codefutures=DEBUG
- Edit pom.xml and add the dependencies and build plugins below
<!-- Hadoop dependencies -->
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.7.7</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.4</version>
<configuration>
<archive>
<manifest>
<!-- put third-party jars on the classpath in the manifest -->
<addClasspath>true</addClasspath>
<!-- classpath prefix in the generated manifest; third-party jars go under lib/, so the prefix is lib/ -->
<classpathPrefix>lib/</classpathPrefix>
<!-- main class to run -->
<mainClass>org.example.WordCountDriver_v1</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
- Wait for the dependencies to download
8.3. Edit the Windows hosts file
cd C:\Windows\System32\drivers\etc
notepad hosts
192.168.xx.xxx hadoopmaster
8.4. Connect IDEA to HDFS
- Open Big Data Tools on the right-hand side of IDEA and create a new connection
- Restart the computer
- Without restarting, Test connection reports an error; after restarting, open the VMs and start the cluster again
- Success: the connection test passes
8.5. wordcount
8.5.1. Data
- As before, the /input directory and the data files must exist
- hadoopmaster
mkdir -p /usr/local/hadoop/file
cd /usr/local/hadoop/file
echo "hello World
hello java
hello c
" >file1.txt
echo "hello merlin
hello fei
hello python
hello world
hello math
" >file2.txt
hadoop fs -mkdir /input
hadoop fs -put /usr/local/hadoop/file/file*.txt /input
hadoop fs -ls /input
hadoop fs -cat /input/file1.txt
# hello World
# hello java
# hello c
8.5.2. Run modes
- Before running, update the main class in pom.xml
- To get the main class, right-click the class name, choose Copy Reference, and paste it into pom.xml
8.5.2.1. Local run
WordCountDriver_v1
package org.example;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver_v1 {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local");
// conf.set("mapreduce.framework.name", "yarn");
// yarn模式
conf.set("fs.defaultFS", "hdfs://hadoopmaster:9000");
Job job = Job.getInstance(conf, WordCountDriver_v1.class.getSimpleName());
job.setJarByClass(WordCountDriver_v1.class);
job.setMapperClass(WordcountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
Path inPath=new Path("/input");
Path outpath=new Path("/output");
FileSystem fs= FileSystem.get(conf);
if(fs.exists(outpath)){
fs.delete(outpath,true);
}
FileInputFormat.setInputPaths(job,inPath);
FileOutputFormat.setOutputPath(job, outpath);
boolean waitForCompletion = job.waitForCompletion(true);
System.exit(waitForCompletion ? 0 : 1);
}
}
WordCountReducer
package org.example;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
private LongWritable outvalue = new LongWritable();
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (LongWritable value : values) {
count += value.get();
}
outvalue.set(count);
context.write(key, outvalue);
}
}
WordcountMapper
package org.example;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordcountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
private Text outkey = new Text();
private final static LongWritable outvalue = new LongWritable(1);
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");// [I, have, a, dream]
for (String word : words) {
outkey.set(word);
context.write(outkey, outvalue);
}
}
}
- View the result
- hadoopmaster
- It can also be viewed directly in Big Data Tools on the right-hand side of IDEA
hadoop fs -cat /output/part-r-00000
8.5.2.2. Build the jar
- After the program works, package it here
- Order: clean, then package
- Run the jar in the VM
- Upload the jar file from the target folder to the VM
hadoop jar /root/movie/1.jar
- View the result
hadoop fs -cat /output/part-r-00000
9. Hive installation
- Installed only on hadoopnode3
- For an alternative cluster-mode deployment, see question 6 in the "Notes" section at the end
9.1. MySQL installation
- Extract
mkdir /usr/mysql
tar -xvf /usr/package277/mysql-5.7.25-1.el7.x86_64.rpm-bundle.tar -C /usr/mysql
- Install the mysql components in the order common, libs, libs-compat, client, server:
# mysql-community-common (common files for server and client libraries)
# mysql-community-libs (shared libraries for MySQL client applications)
# mysql-community-libs-compat (shared compatibility libraries for earlier MySQL versions)
# mysql-community-client (MySQL client applications and tools)
# mysql-community-server (database server and related tools)
rpm -ivh /usr/mysql/mysql-community-common-5.7.25-1.el7.x86_64.rpm --force --nodeps
rpm -ivh /usr/mysql/mysql-community-libs-5.7.25-1.el7.x86_64.rpm --force --nodeps
rpm -ivh /usr/mysql/mysql-community-libs-compat-5.7.25-1.el7.x86_64.rpm --force --nodeps
rpm -ivh /usr/mysql/mysql-community-client-5.7.25-1.el7.x86_64.rpm --force --nodeps
rpm -ivh /usr/mysql/mysql-community-server-5.7.25-1.el7.x86_64.rpm --force --nodeps
- Log in to mysql
# initialize
/usr/sbin/mysqld --initialize-insecure --user=mysql
/usr/sbin/mysqld --user=mysql &
# first login
mysql -uroot
# set the password to 123456
alter user 'root'@'localhost' identified by '123456';
\q
- Grant the root user remote login privileges
mysql -uroot -p123456
# allow remote login
create user 'root'@'%' identified by '123456';
grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
# reload the privilege tables
flush privileges;
\q
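- To confirm that remote access works, log in from another node (a sketch; assumes mysql was installed on hadoopnode3 as above and that the mysql client is available on the node you test from):
mysql -h hadoopnode3 -uroot -p123456 -e "select version();"
# should print 5.7.25 rather than an access-denied error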
9.2. Install Hive
- Extract the Hive archive into /usr/hive (the package is stored in /usr/package277/)
mkdir /usr/hive
tar -zxvf /usr/package277/apache-hive-2.3.4-bin.tar.gz -C /usr/hive/
- Set the HIVE_HOME environment variable and add Hive's bin directory to PATH, then source the file
vi /etc/profile
export HIVE_HOME=/usr/hive/apache-hive-2.3.4-bin
export PATH=$PATH:$HIVE_HOME/bin
source /etc/profile
- Verify
hive --version
- Configure the Hive runtime environment (hive-env.sh)
cp /usr/hive/apache-hive-2.3.4-bin/conf/hive-env.sh.template /usr/hive/apache-hive-2.3.4-bin/conf/hive-env.sh
vi /usr/hive/apache-hive-2.3.4-bin/conf/hive-env.sh
# Hadoop install path
HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
# directory holding the Hive configuration files
export HIVE_CONF_DIR=/usr/hive/apache-hive-2.3.4-bin/conf
# Hive auxiliary jar path
export HIVE_AUX_JARS_PATH=/usr/hive/apache-hive-2.3.4-bin/lib
9.3. Configure the Hive metastore to use mysql
- Copy the driver: copy the mysql driver jar mysql-connector-java-5.1.47-bin.jar into ${HIVE_HOME}/lib
cp /usr/package277/mysql-connector-java-5.1.47-bin.jar /usr/hive/apache-hive-2.3.4-bin/lib/
- Configure the metastore to use mysql in hive-site.xml
- Parameters follow the official documentation (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration); put the following into hive-site.xml
cd /usr/hive/apache-hive-2.3.4-bin/conf
# create and edit hive-site.xml
vi /usr/hive/apache-hive-2.3.4-bin/conf/hive-site.xml
<configuration>
<!-- connection settings for the metastore database -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hivedb?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<!-- JDBC driver class -->
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- database user name -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<!-- database password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>
- Initialize the metastore schema
cd /usr/hive/apache-hive-2.3.4-bin/bin
./schematool -dbType mysql -initSchema
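- After schematool finishes, you can verify that the metastore tables were created in the hivedb database named in the JDBC URL above (a sketch):
mysql -uroot -p123456 -e "use hivedb; show tables;"
# should list metastore tables such as DBS, TBLS, VERSION, ...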
- Connect to Hive
# start the CLI
hive
# or
hive --service cli
# quit
exit;
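- A minimal smoke test of the Hive CLI (a sketch; the test_db and t1 names are just examples):
hive -e "create database if not exists test_db; show databases;"
hive -e "use test_db; create table if not exists t1(id int); show tables;"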
10. Spark installation
10.1. Install
- Extract the Spark archive to /usr/spark/spark-2.4.3-bin-hadoop2.7 and Scala to /usr/scala (the packages are stored in /usr/package277/)
- hadoopmaster
mkdir /usr/scala
tar -zxvf /usr/package277/scala-2.11.8.tgz -C /usr/scala/
mkdir /usr/spark
tar -zxvf /usr/package277/spark-2.4.3-bin-hadoop2.7.tgz -C /usr/spark/
- In /etc/profile set SPARK_HOME (and SCALA_HOME) and add the bin directories to PATH, then source the file
- hadoopmaster, hadoopnode2, hadoopnode3
vi /etc/profile
#scala
export SCALA_HOME=/usr/scala/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
#spark
export SPARK_HOME=/usr/spark/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
- In spark-env.sh set the master host, the java install path, the worker memory (8g), and the hadoop install and configuration directories
- In the slaves file add the spark worker nodes hadoopnode2 and hadoopnode3
- hadoopmaster
# master
mv /usr/spark/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh.template /usr/spark/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh
cd /usr/spark/spark-2.4.3-bin-hadoop2.7/conf
echo "export SPARK_MASTER_IP=hadoopmaster
export SCALA_HOME=/usr/scala/scala-2.11.8
export SPARK_WORKER_MEMORY=8g
export JAVA_HOME=/usr/java/jdk1.8.0_221
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
export LD_LIBRARY_PATH=\$HADOOP_HOME/lib/native" >> spark-env.sh
# vi /usr/spark/spark-2.4.3-bin-hadoop2.7/conf/spark-env.sh
mv /usr/spark/spark-2.4.3-bin-hadoop2.7/conf/slaves.template /usr/spark/spark-2.4.3-bin-hadoop2.7/conf/slaves
echo "
hadoopnode2
hadoopnode3" >> slaves
# vi /usr/spark/spark-2.4.3-bin-hadoop2.7/conf/slaves
10.2. Copy to the other nodes
hadoopnode2, hadoopnode3 (run from hadoopmaster)
scp -r /usr/spark/ root@hadoopnode2:/usr/spark/
scp -r /usr/spark/ root@hadoopnode3:/usr/spark/
scp -r /usr/scala/ root@hadoopnode2:/usr/scala/
scp -r /usr/scala/ root@hadoopnode3:/usr/scala/
10.3. Start
- Start the cluster and check the processes on each node (Master on the master node, Worker on the worker nodes)
- hadoopmaster
/usr/spark/spark-2.4.3-bin-hadoop2.7/sbin/start-all.sh
- Spark version
- The Scala version bundled with spark-shell can differ from the installed Scala; here only the Spark version matters
/usr/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-shell
- Scala version
scala -version
10.4. wordcount
- Install the Scala plugin: File - Settings - Plugins
- Create a new maven project
- Fill in the project information
- Project directory layout
- Add Scala framework support to the project (two steps in the IDE; done when it reports success)
- Add the dependencies to pom.xml
<properties>
<spark.version>2.4.3</spark.version>
<scala.version>2.11.8</scala.version>
</properties>
<dependencies>
<!-- Scala language dependency -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive-thriftserver_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
</execution>
</executions>
</plugin>
</plugins>
</build>
- Delete the resources directory and rename the java directory to scala
- Create a new package com.spark.wordcount
- Create a new Scala class, choose Object, and name it WordCount
package com.spark.wordcount
import org.apache.spark.{SparkConf, SparkContext} // import the Spark classes
object WordCount {
def main(args: Array[String]) {
val inputFile = "hdfs://hadoopmaster:9000/input"
val conf = new SparkConf().setAppName("WordCount").setMaster("local")
val sc = new SparkContext(conf)
val textFile = sc.textFile(inputFile)
val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCount.foreach(println)
}
}
10.4.1. Local mode
- Open hadoopmaster:8080 or hadoopmaster:8081
- The page shows the Spark master at hadoopmaster:7077
- Run once, stop the program, then edit the run configuration and add
-Dspark.master=spark://hadoopmaster:7077
- The result can be seen in IDEA
10.4.2. Build the jar
- First right-click WordCount.scala and run it once; after it succeeds and the red INFO messages appear, wait a few seconds and stop it. Skipping this step leads to errors when running on the server.
- Open the Maven panel on the right and double-click package under Lifecycle to build the jar
- Run it in the VM
/usr/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit --class com.spark.wordcount.WordCount /root/Desktop/MyScala-1.0-SNAPSHOT.jar
- /usr/spark/spark-2.4.3-bin-hadoop2.7/bin/spark-submit: the spark-submit script from the Spark install directory
- --class com.spark.wordcount.WordCount: the fully qualified name of WordCount; right-click the class name and choose Copy Reference to get it
- /root/Desktop/MyScala-1.0-SNAPSHOT.jar: the location of the jar on the server
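- Because the job reads the same /input files used earlier, the console output printed by foreach(println) should include tuples like the following, in no particular order and mixed in with Spark's INFO logging (a sketch):
# (hello,8)
# (World,1)
# (java,1)
# (world,1)
# ...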
Notes
1. The VM and the host cannot ping each other
- In the Windows firewall advanced settings, enable the relevant inbound rule (typically the ICMPv4 echo-request rule shown in the original screenshot)
2. ipc.Client: Retrying connect to server
22/12/14 11:59:20 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- Port 8032 is the YARN ResourceManager; comment out the framework (daemon) setting in the driver program so the job does not try to submit to YARN on localhost
3. Delete the following before reformatting HDFS
# master
rm -rf /usr/local/hadoop/hdfs/name/current
# node
rm -rf /usr/local/hadoop/hdfs/data/current
hdfs namenode -format
4. MySQL initial password
systemctl disable mysqld
systemctl start mysqld
# get the temporary password
grep "temporary password" /var/log/mysqld.log
mysql -uroot -p
E&po>2BCjekL
5. Enable the MySQL password-validation plugin
ERROR 1193 (HY000): Unknown system variable 'validate_password_policy'
vi /etc/my.cnf
# add the following under the [mysqld] section:
plugin-load-add=validate_password.so
validate-password=FORCE_PLUS_PERMANENT
systemctl restart mysqld.service
- plugin-load-add=validate_password.so: loads the plugin; the option must be given every time the server starts
- validate-password=FORCE_PLUS_PERMANENT: controls how the validate-password plugin is activated at server startup
- Set the password policy to low: set global validate_password_policy=???;
- Set the minimum password length: set global validate_password_length=???;
- Change the local password: alter user 'root'@'localhost' identified by '???';
mysql -uroot -p123456
SHOW VARIABLES LIKE 'validate_password%';
# relax the password policy
set global validate_password_policy=0;
set global validate_password_length=4;
flush privileges;
6. Hive installation, cluster mode
- node2: mysql database (for mysql installation see 9.1)
- node1: metastore server
- master: client
- (This note follows a tutorial that names the hosts master, slave1 and slave2; substitute the hostnames of your own cluster accordingly.)
- If alter user 'root'@'localhost' identified by '???'; fails, run the following first:
- node2
mysql -uroot -p
set global validate_password_policy=0;
set global validate_password_length=4;
flush privileges;
\q
- Extract the Hive archive into /usr/hive (the package is stored in /usr/package277/)
- master, slave1
mkdir /usr/hive
tar -zxvf /usr/package277/apache-hive-2.3.4-bin.tar.gz -C /usr/hive/
- Set the HIVE_HOME environment variable and add Hive's bin directory to PATH, then source the file
- master, slave1
vi /etc/profile
export HIVE_HOME=/usr/hive/apache-hive-2.3.4-bin
export PATH=$PATH:$HIVE_HOME/bin
source /etc/profile
- In hive-env.sh configure the Hadoop install path HADOOP_HOME
- In hive-env.sh configure the Hive configuration directory HIVE_CONF_DIR
- In hive-env.sh configure the Hive auxiliary jar path HIVE_AUX_JARS_PATH
- master, slave1
echo "export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export HIVE_CONF_DIR=/usr/hive/apache-hive-2.3.4-bin/conf
export HIVE_AUX_JARS_PATH=/usr/hive/apache-hive-2.3.4-bin/lib" >> /usr/hive/apache-hive-2.3.4-bin/conf/hive-env.sh
- The client needs to talk to Hadoop; to avoid a jline version conflict, copy Hive's lib/jline-2.12.jar into Hadoop, keeping the higher version: copy $HIVE_HOME/lib/jline-2.12.jar to $HADOOP_HOME/share/hadoop/yarn/lib/
- master, slave1
cp /usr/hive/apache-hive-2.3.4-bin/lib/jline-2.12.jar /usr/hadoop/hadoop-2.7.7/share/hadoop/yarn/lib/
- Copy the JDBC driver into Hive's lib directory (the jar is stored in /usr/package277/)
- slave1
cp /usr/package277/mysql-connector-java-5.1.47-bin.jar /usr/hive/apache-hive-2.3.4-bin/lib/
- Set the metastore warehouse location to /user/hive_remote/warehouse
- Configure the database connection to use MySQL
- Configure the JDBC connection URL with the hostname and default port 3306, database hive (created automatically if it does not exist), and ssl disabled
- Configure the database user
- Configure the database password
- slave1
vi /usr/hive/apache-hive-2.3.4-bin/conf/hive-site.xml
<configuration>
<!-- where Hive stores its data -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive_remote/warehouse</value>
</property>
<!-- JDBC connection URL -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<!-- JDBC driver, i.e. the MySQL driver -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- MySQL user name -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<!-- MySQL password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateALL</name>
<value>true</value>
</property>
</configuration>
- Set the metastore warehouse location to /user/hive_remote/warehouse
- Disable local metastore mode
- Point the client at the metastore service on slave1, port 9083
- master
vi /usr/hive/apache-hive-2.3.4-bin/conf/hive-site.xml
<configuration>
<!-- where Hive stores its data -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive_remote/warehouse</value>
</property>
<!-- use a local metastore service; default is true -->
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<!-- metastore service to connect to -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://slave1:9083</value>
</property>
</configuration>
- On the server side, initialize the database and start the metastore service
- slave1
cd /usr/hive/apache-hive-2.3.4-bin/bin
./schematool -dbType mysql -initSchema
cd /usr/hive/apache-hive-2.3.4-bin/
bin/hive --service metastore
- On the client, open hive and create the hive database
- master
cd /usr/hive/apache-hive-2.3.4-bin/
bin/hive
create database hive;
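- A quick check, from the master client, that tables created through the remote metastore end up in the warehouse path configured above (a sketch; the test_remote table name is just an example):
hive -e "use hive; create table if not exists test_remote(id int);"
hadoop fs -ls /user/hive_remote/warehouse/hive.db
# the test_remote directory should appear under the hive.db database directory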
7. NativeCodeLoader
Spark warns: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform
Append the following line to spark/conf/spark-env.sh in the Spark directory:
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native