When Spark is installed, the default Spark assembly does not include Hive support. The Spark documentation states: "Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly." To run Spark SQL on Hive, you need to build the source code of the same Spark version you are using, bundle the Hive dependencies into a new assembly, and then add the resulting jars to the existing Spark installation.
1. First, rebuild the Spark source matching the version you are using
This article uses Hadoop 2.3.0-cdh5.1.2 and Spark 1.0.2.
The build here is done with sbt; Maven can be used as well.
The build process is as follows:
Edit the spark-1.0.2/project/SparkBuild.scala file as follows:
val DEFAULT_HADOOP_VERSION = "2.3.0-cdh5.1.2"
val DEFAULT_YARN = true
val DEFAULT_HIVE = true
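Note that hard-coding these values may not be strictly necessary: the Spark 1.0.x sbt build also honors the environment variables SPARK_HADOOP_VERSION, SPARK_YARN, and SPARK_HIVE, which take precedence over the defaults. The lines below are only a rough sketch of that override logic, not the actual SparkBuild.scala source; the names hadoopVersion, yarnEnabled, and hiveEnabled are illustrative.
import scala.util.Properties
// Environment variables, when set, override the DEFAULT_* values above (illustrative sketch).
val hadoopVersion = Properties.envOrElse("SPARK_HADOOP_VERSION", DEFAULT_HADOOP_VERSION)
val yarnEnabled   = Properties.envOrNone("SPARK_YARN").map(_.toBoolean).getOrElse(DEFAULT_YARN)
val hiveEnabled   = Properties.envOrNone("SPARK_HIVE").map(_.toBoolean).getOrElse(DEFAULT_HIVE)
If that override path works in your build, exporting SPARK_HADOOP_VERSION=2.3.0-cdh5.1.2, SPARK_YARN=true, and SPARK_HIVE=true before running the sbt assembly build should have the same effect as editing the file.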
Run the command: sbt/bin/sbt spark1.0.2/assembly
Wait for the build to finish; it takes a fairly long time.
When the build is done, check the spark-1.0.2/assembly/target/scala-2.10 directory for the newly generated assembly jar; in this article it is spark-assembly-1.0.2-hadoop2.3.0-cdh5.1.2.jar.
In addition, the spark-1.0.2/lib_managed/jars directory of the source tree also contains the dependency jars.
2. Configure the dependency jars for Spark SQL on Hive
First, launch spark-shell to see which jars are missing.
./spark-shell \
--master yarn-client \
--driver-class-path $(echo /opt/cloudera/parcels/CDH/lib/hadoop-yarn/*.jar |sed 's/ /:/g'):/opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop-hdfs/hadoop-hdfs-2.3.0-cdh5.1.2.jar
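Before running any HQL, one quick way to check whether the shell actually has Hive support on its classpath is to look up the HiveContext class by name. This diagnostic is not part of the original walkthrough; it is just a small sketch:
// Inside spark-shell: succeeds if the Hive-enabled assembly is on the classpath,
// otherwise throws ClassNotFoundException (illustrative check, not from the article).
try {
  Class.forName("org.apache.spark.sql.hive.HiveContext")
  println("Hive support classes found on the classpath")
} catch {
  case _: ClassNotFoundException => println("Hive classes are missing from the assembly")
}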
Then try to run HQL by first creating a HiveContext:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
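For reference, once a Hive-enabled assembly is on the classpath, the HiveContext created above can run HQL through its hql method (the Spark 1.0.x API); the table name src below is only a placeholder:
// Run HQL queries and print the results (the table name "src" is a placeholder).
sqlContext.hql("SHOW TABLES").collect().foreach(println)
sqlContext.hql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)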
With the default assembly, however, even creating the HiveContext fails with the following error:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
<console>:12: error: object hive is not a