Spark and Hive Integration

1. Spark and Hive version requirements
    spark-1.4.0
    hive-0.13.0 (note: do not use Hive 0.14 for this integration, or you will run into many problems)


2. Cluster planning (Hadoop in pseudo-distributed mode)

2.1 Hadoop pseudo-distributed layout (configured the same way as a full Hadoop cluster)
NameNode: 192.168.2.20
DataNode: 192.168.2.20
ResourceManager: 192.168.2.20
NodeManager: 192.168.2.20

2.2 Spark cluster layout
Master: 192.168.2.20, 192.168.2.33
Worker: 192.168.2.33


2.3 Hive layout
Hive must be installed on both nodes, 192.168.2.20 and 192.168.2.33.

Note: Hive uses MySQL to store its metadata (the metastore).
[hadoop@mycluster conf]$ more hive-site.xml
<configuration>
<property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://mycluster:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
</property>

<property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>root</value>
</property>

<property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>123</value>
</property>
</configuration>
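
Because Hive runs on both 192.168.2.20 and 192.168.2.33 and both nodes point at the same metastore database, the MySQL account configured above must be reachable from both hosts. A minimal sketch of the grant, assuming MySQL runs on mycluster and reusing the root/123 credentials and "hive" database name from hive-site.xml (tighten the host pattern for anything beyond a test cluster):

# run on the MySQL host; '%' allows connections from any node in the cluster
mysql -u root -p -e "GRANT ALL PRIVILEGES ON hive.* TO 'root'@'%' IDENTIFIED BY '123'; FLUSH PRIVILEGES;"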

3. Spark configuration
3.1 Edit spark-env.sh under $SPARK_HOME/conf
[hadoop@mycluster ~]$ cd $SPARK_HOME
[hadoop@mycluster spark]$ more conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_IP=mycluster
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_76
export SCALA_HOME=/home/hadoop/app/scala
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0
export HIVE_HOME=/home/hadoop/app/hive
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.28.jar

Note: every Spark node must use the same configuration; one way to distribute it is sketched below.
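
A minimal sketch for pushing the file to the second node, assuming passwordless SSH for the hadoop user and that $SPARK_HOME is the same path on both machines (/home/hadoop/app/spark is an assumption based on the shell prompt above):

# copy the edited spark-env.sh from 192.168.2.20 to 192.168.2.33
scp $SPARK_HOME/conf/spark-env.sh hadoop@192.168.2.33:/home/hadoop/app/spark/conf/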

3.2 Copy Hive's hive-site.xml into $SPARK_HOME/conf
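For example (assuming the HIVE_HOME and SPARK_HOME paths set in spark-env.sh above):

cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/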

3.3 Copy the MySQL JDBC driver jar into $SPARK_HOME/lib
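For example (the jar name matches the one referenced by SPARK_CLASSPATH above; adjust it to the driver version you actually have):

cp $HIVE_HOME/lib/mysql-connector-java-5.1.28.jar $SPARK_HOME/lib/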

3.4 Spark prints a lot of INFO-level log output; you can reduce the noise by setting the root log level to WARN in $SPARK_HOME/conf/log4j.properties
[hadoop@mycluster conf]$ more log4j.properties
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO


4. Start spark-sql
4.1 Start the Spark cluster
cd $SPARK_HOME
sbin/start-all.sh
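
To confirm the daemons came up, jps is a quick check on each node (what you should see follows from the layout in section 2.2):

# expect a Master on 192.168.2.20 and a Worker on 192.168.2.33,
# alongside the Hadoop daemons on 192.168.2.20
jps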

4.2 Start spark-sql
bin/spark-sql
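
Run this way, spark-sql starts with the default master (local mode). To run queries on the standalone cluster started above, the master URL can be passed explicitly; spark://mycluster:7077 assumes the standalone default port, and the memory setting is only an example:

bin/spark-sql --master spark://mycluster:7077 --executor-memory 1g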

5. Verify spark-sql
5.1 Query from the Hive CLI

(1) Start the Hive CLI
[hadoop@mycluster hive]$ bin/hive

(2) Switch to the mydb database and list all tables
hive (default)> use mydb;
OK
Time taken: 1.833 seconds
hive (mydb)> show tables;
OK
tab_name
access_log
bucket_table
jfapp_pv
order
stu
stu_index_table
stu_view
t1
Time taken: 0.256 seconds, Fetched: 8 row(s)


(3) Run a count query
hive (mydb)> select count(*) from stu;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1441094211714_0002, Tracking URL = http://mycluster:8088/proxy/application_1441094211714_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0/bin/hadoop job  -kill job_1441094211714_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-09-01 04:05:58,315 Stage-1 map = 0%,  reduce = 0%
2015-09-01 04:06:18,500 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.54 sec
2015-09-01 04:06:28,524 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.78 sec
MapReduce Total cumulative CPU time: 4 seconds 780 msec
Ended Job = job_1441094211714_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.78 sec   HDFS Read: 291 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 780 msec
OK
_c0
9
Time taken:  85.028 seconds, Fetched: 1 row(s)

Note: Hive has to run the count through a MapReduce job, which makes it very slow.

5.2 Use spark-sql exactly as you would Hive

(1) Start spark-sql
[hadoop@mycluster spark]$ bin/spark-sql
(2) Switch to the mydb database and list all tables
spark-sql (default)> use mydb;
OK
result
Time taken: 1.293 seconds
spark-sql (default)> show tables;
tableName       isTemporary
access_log      false
bucket_table    false
jfapp_pv        false
order   false
stu     false
stu_index_table false
stu_view        false
t1      false
Time taken: 0.195 seconds, Fetched 8 row(s)


(3) Run the same count query
spark-sql (default)> select count(*) from stu;
_c0
9
Time taken: 5.894 seconds, Fetched 1 row(s)
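
For scripted runs or quick comparisons, the same query can also be executed non-interactively; the -e option mirrors the Hive CLI's:

bin/spark-sql -e "select count(*) from mydb.stu"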



Comparing the two runs of the same count query, spark-sql is far faster than Hive here: roughly 85 seconds versus about 6 seconds, a speedup of around 14x.
Original post: http://blog.youkuaiyun.com/shenfuli/article/details/48163027