https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hos_oview.html
Configuring the Hive Dependency on a Spark Service
By default, if a Spark service is available, the Hive dependency on the Spark service is configured. To change this configuration, do the following:
- In the Cloudera Manager Admin Console, go to the Hive service.
- Click the Configuration tab.
- Search for the Spark On YARN Service. To configure the Spark service, select the Spark service name. To remove the dependency, select none.
- Click Save Changes.
- Go to the Spark service.
- Add a Spark gateway role to the host running HiveServer2.
- Return to the Home page by clicking the Cloudera Manager logo.
- Click the icon next to any stale services to invoke the cluster restart wizard.
- Click Restart Stale Services.
- Click Restart Now.
- Click Finish.
- In the Hive client, configure the Spark execution engine (see the sketch after this list).
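A minimal sketch of that last step, using the standard hive.execution.engine property (the same property appears in the walkthrough below):

hive> set hive.execution.engine=spark;
hive> set hive.execution.engine=mr;

The first command switches the current session to the Spark engine; the second switches it back to MapReduce. To make the change permanent, set the same property in hive-site.xml (in Cloudera Manager, through the hive-site.xml advanced configuration snippet / safety valve) rather than per session.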
1. Configuring Hive on Spark (the Spark gateway role must be on the same host as HiveServer2)
For releases before 5.7, the official documentation described it this way:
Important: Hive on Spark is included in CDH 5.6 but is not currently supported nor recommended for production use. To try this feature in CDH 5.6, use it in a test environment.
The following steps were performed on CDH 5.13.1:
hive> select count(*) from tb_emp_info;
Query ID = hdfs_20180205153434_df36b886-80b0-4ed8-a9fc-0a2efb0ca071
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1517815091654_0007, Tracking URL = http://bd129106:8088/proxy/application_1517815091654_0007/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job -kill job_1517815091654_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-02-05 15:34:56,439 Stage-1 map = 0%, reduce = 0%
2018-02-05 15:35:02,880 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.04 sec
2018-02-05 15:35:08,214 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.33 sec
MapReduce Total cumulative CPU time: 2 seconds 330 msec
Ended Job = job_1517815091654_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.33 sec HDFS Read: 7529 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 330 msec
OK
10
Time taken: 26.218 seconds, Fetched: 1 row(s)
Hive's default execution engine is MapReduce (mr).
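You can verify which engine the current session uses by printing the property from the Hive shell; a minimal check (the second line shows the expected output on a default installation):

hive> set hive.execution.engine;
hive.execution.engine=mr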
Hardware requirements:
Set the Spark On YARN Service for Hive; it can be either the Spark service or none (before version 5.7 it had to be Spark, and when Spark is selected, Hive must also depend on YARN).
Restart the services and redeploy the Hive client configuration:
Enter the Hive shell:
hive> set hive.execution.engine=spark;
hive> select count(*) from tb_emp_info;
Query ID = hdfs_20180205160909_79be0d1b-e8a3-496a-ae5f-e990b71d4991
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Spark Job = d7b17038-8156-4f1b-b8d6-443c1db77f77
Running with YARN Application = application_1517815091654_0009
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/yarn application -kill application_1517815091654_0009
Query Hive on Spark job[0] stages:
0
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2018-02-05 16:09:49,725 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1
2018-02-05 16:09:51,753 Stage-0_0: 1/1 Finished Stage-1_0: 0/1
2018-02-05 16:09:52,761 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Status: Finished successfully in 10.09 seconds
OK
10
Time taken: 23.477 seconds, Fetched: 1 row(s)
View the result:
2. Configuration changes (parameter tuning)
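As a starting point, the Spark executor and driver resources can be overridden per session from the Hive shell. This is only a sketch with illustrative values; the property names are standard Spark settings, but the values must be sized to your own cluster:

hive> set hive.execution.engine=spark;
hive> set spark.executor.memory=4g;
hive> set spark.executor.cores=2;
hive> set spark.driver.memory=2g;

The same properties can be made permanent in hive-site.xml; note that changes to spark.* settings generally only take effect for Spark sessions started after the change.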