Setting Up a Spark + Hive Environment

This article describes in detail how to install and configure Spark and Hive on UbuntuKylin 16.04 LTS and integrate the two, covering the key steps: installing and configuring Spark, installing and configuring MySQL, and configuring Hive to synchronize its metastore with MySQL.

Environment: UbuntuKylin 16.04 LTS

Passwordless SSH login and the IP/hostname mapping are assumed to be configured already; they are not covered here.

Java, Scala, and Hadoop are also required; their installation is not covered here. The versions used are java version "1.8.0_45", Scala code runner version 2.10.5, and hadoop-2.6.0.

Spark and Hive have version compatibility constraints. The compatibility matrix is documented at:

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started

Add the following to /etc/profile:

export JAVA_HOME=/opt/jdk1.8.0_45
export SCALA_HOME=/opt/scala-2.10.5
export HADOOP_HOME=/home/zhuhaichuan/hadoop-2.6.0
export SPARK_HOME=/home/zhuhaichuan/spark-1.6.0-bin-hadoop2.6
export HIVE_HOME=/home/zhuhaichuan/hive-2.1.1
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
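
To make these variables take effect in the current shell, reload the profile:

    source /etc/profile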

 

1. Installing Spark

Version: spark-1.6.0-bin-hadoop2.6.tgz

Extract the tarball to the target directory, here /home/zhuhaichuan/spark-1.6.0-bin-hadoop2.6; the tar command is shown below.
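
A single tar command does this (assuming the tarball sits in the current directory):

    tar -xzf spark-1.6.0-bin-hadoop2.6.tgz -C /home/zhuhaichuan/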

Two files need to be configured: slaves and spark-env.sh.

The slaves file lists the worker nodes of the Spark cluster. Since this is a pseudo-distributed deployment, the slaves file contains only the local machine's IP or hostname.

spark-env.sh needs the paths of JAVA_HOME, SCALA_HOME, and HADOOP_HOME added; both files are sketched below.
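
A minimal sketch of the two files, reusing the paths from /etc/profile and assuming the hostname zhumaster that appears later in hive-site.xml:

    # conf/slaves: one worker per line; pseudo-distributed, so just this host
    zhumaster

    # conf/spark-env.sh
    export JAVA_HOME=/opt/jdk1.8.0_45
    export SCALA_HOME=/opt/scala-2.10.5
    export HADOOP_HOME=/home/zhuhaichuan/hadoop-2.6.0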

At this point the pseudo-distributed Spark installation is complete.

Start Spark and the Thrift server:

…/sbin/start-all.sh

…/sbin/start-thriftserver.sh
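
A quick way to confirm the daemons are up is jps from the JDK; the PIDs will differ, and the Thrift server typically shows up as a SparkSubmit process:

    jps
    # expected, among others:
    # 2345 Master
    # 2346 Worker
    # 2412 SparkSubmit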

2. Installing MySQL

Install MySQL on Ubuntu with the following commands:

sudo apt-get install mysql-server
sudo apt install mysql-client
sudo apt install libmysqlclient-dev

After installation, check that MySQL is running:

sudo netstat -tap | grep mysql

Log in to the MySQL server with:

mysql -uroot -p<your password>

Now allow remote access to MySQL. First edit /etc/mysql/mysql.conf.d/mysqld.cnf:

sudo vi /etc/mysql/mysql.conf.d/mysqld.cnf

Comment out the line bind-address = 127.0.0.1 so that MySQL listens on all interfaces.
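
In mysqld.cnf this is a single commented-out line:

    # bind-address = 127.0.0.1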

Save and exit, then log back into the MySQL server.

Allow remote connections for the user; run the following as root:

create user 'hive' identified by 'hive';  -- create the hive user for connections, password hive
grant all privileges on *.* to 'hive'@'%' identified by 'hive' with grant option;  -- '%' matches any IP: user hive with password hive may connect remotely from any address
flush privileges;  -- reload the privilege tables
set global binlog_format='MIXED';  -- set the binlog format; must be executed, otherwise errors occur later

Then run exit to leave the MySQL prompt and restart MySQL with the following command:

exit;

service mysql restart   # restart the service

Test the connection:

    mysql -uhive -phive   # if you can log in, the setup succeeded

    create database hive;   -- database used by the Hive metastore
    alter database hive character set latin1;

Switching the database to latin1 avoids the common "Specified key was too long" index-length errors when the metastore schema is created later.

3. Configuring Hive

Hive version: apache-hive-2.1.1-bin.tar.gz
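
As with Spark, extraction is one tar command; the paths in this article use hive-2.1.1, so the extracted apache-hive-2.1.1-bin directory is assumed to be renamed to match the HIVE_HOME set earlier:

    tar -xzf apache-hive-2.1.1-bin.tar.gz -C /home/zhuhaichuan/
    mv /home/zhuhaichuan/apache-hive-2.1.1-bin /home/zhuhaichuan/hive-2.1.1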

Configure hive-env.sh:

Add entries for HADOOP_HOME, HIVE_CONF_DIR, and HIVE_AUX_JARS_PATH:

export HADOOP_HOME=/home/zhuhaichuan/hadoop-2.6.0
export HIVE_CONF_DIR=/home/zhuhaichuan/hive-2.1.1/conf
export HIVE_AUX_JARS_PATH=/home/zhuhaichuan/hive-2.1.1/lib

Configure hive-site.xml:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/data/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.1.144:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://zhumaster:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>
      Expects one of [mr, tez, spark].
      Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
      remains the default engine for historical reasons, it is itself a historical engine
      and is deprecated in Hive 2 line. It may be removed without further warning.
    </description>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>zhumaster</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>hdfs://zhumaster:8020/tmp/hive</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/zhuhaichuan/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in metastore is compatible with one from Hive jars. Also disable automatic
            schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
            proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
    <description>
      Setting this property to true will have HiveServer2 execute
      Hive operations as the user making the calls to it.
    </description>
  </property>

Configuration complete.

Hive uses MySQL to store its metastore data, so copy the MySQL JDBC driver jar (mysql-connector-java-5.1.40-bin.jar) into the hive-2.1.1/lib directory:

cp …/mysql-connector-java-5.1.40-bin.jar …/hive-2.1.1/lib/

Synchronize the Hive metastore schema with MySQL: …/hive-2.1.1/bin/schematool -dbType mysql -initSchema

Note that hive-default.xml.template must first be copied and renamed to hive-site.xml; hive-site.xml has to exist before running schematool -dbType mysql -initSchema, which initializes the metastore schema in the MySQL hive database created earlier.
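
To confirm the schema landed in MySQL, one can list the metastore tables (a quick sanity check; the exact table names depend on the Hive version):

    mysql -uhive -phive -e 'use hive; show tables;'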

Run the Hive metastore service in the background: hive --service metastore > …/hivemetastore.log 2>&1 &

Run HiveServer2 in the background: hive --service hiveserver2 > …/hiveserver2.log 2>&1 &
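
Both services can be verified the same way MySQL was checked above; the metastore listens on port 9083 (matching hive.metastore.uris) and HiveServer2 on 10000 by default:

    sudo netstat -tap | grep -E '9083|10000'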

Copy spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar into Hive's lib directory (hive-2.1.1/lib/), and copy Hive's hive-site.xml into spark-1.6.0-bin-hadoop2.6/conf/, as sketched below.
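
Concretely, using the install locations from this article:

    cp /home/zhuhaichuan/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar /home/zhuhaichuan/hive-2.1.1/lib/
    cp /home/zhuhaichuan/hive-2.1.1/conf/hive-site.xml /home/zhuhaichuan/spark-1.6.0-bin-hadoop2.6/conf/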

When writing programs in IntelliJ IDEA, hive-site.xml needs to be placed in the project's src directory.


When developing Spark + Hive applications in IntelliJ IDEA, the Thrift server must be running; Spark's start-all.sh command starts the Spark master and worker nodes. A minimal program sketch follows.
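
The following Scala sketch (the object name HiveOnSparkDemo and the master URL spark://zhumaster:7077, the standalone default port, are assumptions) connects to the cluster and runs a query through Hive using the Spark 1.6 HiveContext API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveOnSparkDemo {
      def main(args: Array[String]): Unit = {
        // Master URL is an assumption: zhumaster with the standalone default port 7077.
        val conf = new SparkConf()
          .setAppName("HiveOnSparkDemo")
          .setMaster("spark://zhumaster:7077")
        val sc = new SparkContext(conf)

        // HiveContext picks up hive-site.xml from the classpath,
        // which is why the file goes into the project's src directory.
        val hive = new HiveContext(sc)
        hive.sql("show databases").show()

        sc.stop()
      }
    }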
