References
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Background
Every time a HiveQL statement runs, Hive prints this warning: [WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.] In other words, Hive-on-MR is deprecated in Hive 2 and may disappear in a future release, so here we replace MR with Spark as the execution engine (an earlier post covered Hive on Tez). After checking version compatibility, I use Hive 2.3.4 with Spark 2.0.0.
Installing Spark
Note that the Spark build must not include the Hive jars, so Spark has to be rebuilt from source.
Prepare the Spark source package [spark-2.0.0.tgz]
Details omitted.
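As a rough sketch, assuming the source tarball is fetched from the Apache archive with wget:
# Assumption: any mirror that carries spark-2.0.0.tgz works equally well.
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0.tgz
tar -zxvf spark-2.0.0.tgz
cd spark-2.0.0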
From Spark 2.0.0 onwards, build with the following command:
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
(Screenshot of the successful Spark 2.0.0 build omitted.)
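After a successful build the distribution tarball sits in the source root. A sketch of unpacking it, assuming make-distribution.sh uses its default spark-<version>-bin-<name>.tgz naming and /opt as a hypothetical install location:
# Assumption: the tarball name and the /opt install prefix are illustrative, not taken from the original text.
tar -zxvf spark-2.0.0-bin-hadoop2-without-hive.tgz -C /opt/
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop2-without-hive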
Since Hive 2.2.0, the following three jars have to be located in Spark's jars directory and copied into Hive's lib directory (copy commands are sketched after the list):
scala-library-2.11.8.jar
spark-core_2.11-2.0.0.jar
spark-network-common_2.11-2.0.0.jar
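A sketch of the copy, assuming the SPARK_HOME and HIVE_HOME environment variables point at the two installations:
# Assumption: SPARK_HOME and HIVE_HOME are set; jar versions match the Spark 2.0.0 / Scala 2.11 build above.
cp $SPARK_HOME/jars/scala-library-2.11.8.jar            $HIVE_HOME/lib/
cp $SPARK_HOME/jars/spark-core_2.11-2.0.0.jar           $HIVE_HOME/lib/
cp $SPARK_HOME/jars/spark-network-common_2.11-2.0.0.jar $HIVE_HOME/lib/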
YARN configuration
Switch the YARN scheduler to the Fair Scheduler by adding the following to yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
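The ResourceManager has to be restarted for the scheduler change to take effect; a sketch assuming the standard Hadoop sbin scripts and a HADOOP_HOME variable:
# Assumption: HADOOP_HOME points at the Hadoop installation.
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh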
Hive configuration
Set the Hive execution engine to Spark:
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
Add the configuration Hive needs to launch Spark applications (in hive-site.xml or spark-defaults.conf):
<property>
  <name>spark.master</name>
  <value>spark://hadoop:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.eventLog.dir</name>
  <value>hdfs://hadoop:9000/tmp/log/SparkOnYarn</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
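The event-log directory configured above has to exist in HDFS beforehand, otherwise the Spark application may fail to start; a sketch of creating it, assuming the hdfs CLI is on the PATH:
# Assumption: the path mirrors the spark.eventLog.dir value above.
hdfs dfs -mkdir -p /tmp/log/SparkOnYarn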
Since Hive 2.2.0, all of the jars under Spark's jars directory also need to be uploaded to HDFS (for example hdfs://xxx:8020/spark-jars), and the following property added to hive-site.xml (upload commands are sketched after the property):
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://hadoop:8020/spark-jars/*</value>
</property>
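A sketch of the upload, assuming SPARK_HOME is set and the target directory matches the spark.yarn.jars value above:
# Assumption: relative HDFS paths resolve against the cluster's default file system.
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put $SPARK_HOME/jars/* /spark-jars/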
Finally, some tuning for the Spark executors and driver (note that spark.executor.memory was already set to 512m above; keep a single value rather than defining the property twice):
<property>
  <name>spark.executor.memory</name>
  <value>2048M</value>
</property>
<property>
  <name>spark.executor.cores</name>
  <value>4</value>
</property>
<property>
  <name>spark.yarn.executor.memoryOverhead</name>
  <value>500M</value>
</property>
<property>
  <name>spark.executor.instances</name>
  <value>1</value>
</property>
<property>
  <name>spark.driver.memory</name>
  <value>1024M</value>
</property>
<property>
  <name>spark.yarn.driver.memoryOverhead</name>
  <value>400M</value>
</property>
Spark configuration
Edit spark-env.sh:
export JAVA_HOME=/opt/java
export SCALA_HOME=/opt/scala
export SPARK_MASTER_IP=hadoop
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=2g
export HADOOP_CONF_DIR=/home/hadoop/hadoop
Edit conf/slaves:
hadoop
Starting Spark
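A sketch of bringing up the standalone cluster, assuming SPARK_HOME points at the rebuilt distribution:
# Assumption: starts the standalone master plus the workers listed in conf/slaves.
$SPARK_HOME/sbin/start-all.sh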
Error 1
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Copy slf4j-api-1.7.5.jar, slf4j-log4j12-1.7.5.jar, and commons-logging-1.1.3.jar from hadoop/share/hadoop/common/lib into the spark/lib directory, as sketched below.
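A sketch of the copy, assuming HADOOP_HOME and SPARK_HOME are set; the destination follows the text above (in a Spark 2.x layout the runtime classpath directory is jars/ rather than lib/):
# Assumption: HADOOP_HOME and SPARK_HOME are set.
mkdir -p $SPARK_HOME/lib
cp $HADOOP_HOME/share/hadoop/common/lib/slf4j-api-1.7.5.jar       $SPARK_HOME/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar   $SPARK_HOME/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/commons-logging-1.1.3.jar $SPARK_HOME/lib/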
Error 2
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.master.Master$.main(Master.scala:1006)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more
Add export SPARK_DIST_CLASSPATH=$(hadoop classpath) to spark-env.sh, then start Spark again.
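The same fix as a one-liner, assuming SPARK_HOME is set and the hadoop binary is on the PATH:
# Assumption: single quotes keep $(hadoop classpath) literal so it is evaluated each time spark-env.sh is sourced.
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> $SPARK_HOME/conf/spark-env.sh
$SPARK_HOME/sbin/start-all.sh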
Error when running a Hive job
FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
Copy hive-site.xml into Spark's conf directory and try again.
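A sketch of the copy, assuming HIVE_HOME and SPARK_HOME are set:
# Assumption: this makes Hive's client configuration visible to the Spark side.
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/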
In the Spark UI you can now see that the application has become Hive on Spark.
This setup is not quite the same as the earlier Hive on Tez one; we can also bring Spark on YARN into the mix.
Hive on Spark on YARN
Since this mode does not depend on the standalone Spark cluster, a little extra configuration is enough: HADOOP_CONF_DIR is already exported in spark-env.sh, so it mainly comes down to switching spark.master to yarn. Let's keep going.
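One way to try the YARN mode without editing hive-site.xml is to override spark.master for a single session; a rough sketch, where the table name src is just a placeholder:
# Assumption: the engine and master are overridden only for this session; src is a hypothetical table.
hive -e "set hive.execution.engine=spark; set spark.master=yarn; select count(*) from src;"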
On YARN you can see that the application has likewise become Hive on Spark, so the job is now submitted to YARN without going through the standalone Spark cluster.
Summary
The whole process has quite a few pitfalls. I put it aside for a month or two partway through and only got it working when I picked it up again later. Not easy!