Shark Performance Test

License: CC 4.0 BY-SA

Tags: #spark #shark #hive #hadoop #distributed-computing

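This article tests the performance of Shark 0.91 on Spark 1.0.0 and amplab Hive 0.11. The tests cover count, sum, avg, group by, join, select, and sort operations. The results show Shark running 3 to 5 times faster on simple SQL queries, and it also outperforms Hive on a more complex SQL query, by up to 3 times. Although Shark is officially advertised as 90 times faster, the actual gain depends on many factors.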

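According to the official Shark website, Shark is 90 times faster than Hive when the data is in RAM. That figure looks impressive, but results differ with the test environment, the tuning applied, and the use case, so I decided to measure the performance of Shark 0.91 running on Spark 1.0.0 and amplab Hive 0.11.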

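I. Cluster Environment

Cluster setup was covered in an earlier post; see "Shark Cluster Setup and Configuration".

1 Master (the Master acts only as Master and does not double as a slave)
3 Slaves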

II. Software Environment

Spark 1.0.0 with hadoop0.20.2-cdh3u5
Shark 0.91 + amplab Hive 0.11
compared against:
Apache Hive 0.11

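III. Test Subject

A table is created from a 21G text file, and various queries against that table are timed.
The comparison focuses on two cases: all data cached in memory, and data read from disk.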

[hadoop@wh-8-210 shark]$ hadoop dfs -ls /user/hive/warehouse/log/
Found 1 items
-rw-r--r--   3 hadoop supergroup 22499035249 2014-06-16 18:32 /user/hive/warehouse/log/21gfile
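
As a quick check on the "21G" figure, the listing line above can be parsed and converted to gigabytes. This is a minimal sketch; the field layout (permissions, replication factor, owner, group, size in bytes, date, time, path) is assumed from the usual HDFS listing format, and `parse_ls_line` is a hypothetical helper name.

```python
# The listing line quoted above, copied verbatim.
LS_LINE = ("-rw-r--r--   3 hadoop supergroup 22499035249 "
           "2014-06-16 18:32 /user/hive/warehouse/log/21gfile")


def parse_ls_line(line):
    """Split one HDFS listing line into the fields used below.

    Assumed layout: perms, replication, owner, group, size, date, time, path.
    """
    _perms, replication, _owner, _group, size, _date, _time, path = line.split()
    return {"replication": int(replication), "size_bytes": int(size), "path": path}


info = parse_ls_line(LS_LINE)
size_gib = info["size_bytes"] / 1024 ** 3               # binary gigabytes
size_gb = info["size_bytes"] / 1000 ** 3                # decimal gigabytes
raw_bytes = info["size_bytes"] * info["replication"]    # bytes across all replicas

print(f"{size_gib:.2f} GiB ({size_gb:.2f} GB), replication {info['replication']}")
print(f"raw HDFS footprint: {raw_bytes / 1024 ** 3:.2f} GiB")
```

The file is therefore just under 21 GiB, and with a replication factor of 3 it occupies roughly three times that on the cluster's disks.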

create table log
(
c1 string,
c2 string,
c3 string,
c4 string,
c5 string,
c6 string,
c7 string,
c8 string,
c9 string,
c10 string,
c11 string,
c12 string,
c13 string
) row format delimited fields terminated by '\t' stored as textfile;

load data inpath '/user/hive/warehouse/21gfile' into table log;

Sample data:

[10.1.8.210:7100] shark> select * from log_cached limit 10;
        2014-05-15      101289  13836998753     2       2010-08-23 22:36:50     0       0       2010-06-02 16:55:25     2010-06-02 16:55:25             None    0
        2014-05-15      104497  15936529112     2       2011-01-11 09:58:47     0       0       2011-01-11 09:58:50     2011-01-19 09:58:50     61.172.242.36   2011-01-19 08:59:47      0
        2014-05-15      105470  15000037972     0       2013-07-21 11:35:26     0       0       2013-07-21 11:29:08     2013-07-21 11:29:08             2013-07-21 11:35:26     0
        2014-05-15      111864  13967013250     2       2010-11-28 21:06:56     0       0       2010-11-28 21:06:57     2010-12-06 21:06:57     61.172.242.36   2010-12-06 20:08:11      0
        2014-05-15      112922  13766378550     2       2010-08-23 22:36:50     0       0       2010-03-29 00:08:17     2010-03-29 00:08:17             None    0
        2014-05-15      113685  15882981310     2       2011-04-28 18:24:57     0       0       2011-04-28 17:38:37     2011-04-28 17:38:37     127.0.0.1       None    0
        2014-05-15      116368  15957685805     2       2011-06-27 17:05:55     0       0       2011-06-27 17:06:01     2011-07-05 17:06:01     10.129.20.108   2011-07-05 16:11:05      0
        2014-05-15      136020  13504661323     2       2012-02-11 18:51:17     0       0       2012-02-11 18:51:19     2012-02-19 18:51:19     10.129.20.109   2012-03-03 14:37:05      0
        2014-05-15      137597  15993791204     2       2011-12-07 00:45:03     0       0       2011-12-07 00:44:59     2011-12-15 00:44:59     10.129.20.98    2011-12-14 23:45:40      0
        2014-05-15      155020  13760211160     2       2011-05-25 14:27:24     0       0       2011-05-25 14:02:54     2011-05-25 14:02:54     127.0.0.1       2011-07-28 16:42:21      0
Time taken (including network latency): 0.33 seconds

Cache all 21G of data in memory

Cache the RDD with the following SQL:

CREATE TABLE log_cached TBLPROPERTIES ("shark.cache" = "true") AS SELECT * from log;
Time taken (including network latency): 282.006 seconds
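
The caching time can be turned into an approximate load throughput. The sketch below only divides the file size by the reported time; it assumes the whole file was read exactly once and that the work was spread evenly over the three slaves, neither of which the post confirms.

```python
FILE_SIZE_BYTES = 22_499_035_249   # size from the `hadoop dfs -ls` output
CACHE_SECONDS = 282.006            # "Time taken" reported for the CREATE TABLE ... AS SELECT
NUM_SLAVES = 3                     # worker count from the cluster description


def throughput_mb_per_s(size_bytes, seconds):
    """Average throughput in decimal megabytes per second."""
    return size_bytes / seconds / 1e6


total = throughput_mb_per_s(FILE_SIZE_BYTES, CACHE_SECONDS)
per_slave = total / NUM_SLAVES     # assumes an even split across slaves

print(f"aggregate: {total:.1f} MB/s, per slave: {per_slave:.1f} MB/s")
```

On these numbers the cluster loaded the table at roughly 80 MB/s in aggregate.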

[10.1.8.210:7100] shark> select * from log_cached limit 1; 
        2014-05-15      101289  13836998753     2       2010-08-23 22:36:50     0       0       2010-06-02 16:55:25     2010-06-02 16:55:25             None    0

After caching, there are 336 partitions, as shown in the figure:
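
The partition count is consistent with one partition per HDFS block if the block size is 64 MiB. That block size is an assumption (the post does not state the cluster's setting), so the sketch below is only a consistency check.

```python
import math

FILE_SIZE_BYTES = 22_499_035_249       # size from the `hadoop dfs -ls` output
ASSUMED_BLOCK_BYTES = 64 * 1024 ** 2   # assumption: 64 MiB HDFS block size
OBSERVED_PARTITIONS = 336              # partition count reported after caching

# One partition per block means the count is the file size divided by the
# block size, rounded up to cover the final partial block.
expected_partitions = math.ceil(FILE_SIZE_BYTES / ASSUMED_BLOCK_BYTES)
avg_partition_mib = FILE_SIZE_BYTES / OBSERVED_PARTITIONS / 1024 ** 2

print(f"expected partitions: {expected_partitions}")
print(f"average partition size: {avg_partition_mib:.1f} MiB")
```

Under that assumption the expected count matches the 336 partitions reported, with an average partition size just under 64 MiB.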

IV. Test Cases and Results

Neither Hive nor Shark was tuned in any way for this test, and both ran in the same environment. The results follow:

Test cases:

1. Test count
2. Test sum
3. Test avg
4. Test group by
5. Test join
6. Test select
7. Test sort
8. Test a slightly more complex SQL statement

Test results:

The result charts follow:

Tests of count, sum, avg, and group by

Unit: seconds

Query	Shark (memory)	Shark (disk)	Apache Hive 0.11
count	5.053	26.223	41.255
sum	13.207	33.401	48.204
avg	13.88	33.519	48.159
group by	8.14	29.852	54.705
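
To make the comparison concrete, the timings in the table above can be converted into speedup factors over Hive. This is a small sketch using only the numbers from the table; the dictionary layout and the name `speedup_over_hive` are illustrative.

```python
# Timings in seconds, copied from the table above.
TIMINGS = {
    "count":    {"memory": 5.053,  "disk": 26.223, "hive": 41.255},
    "sum":      {"memory": 13.207, "disk": 33.401, "hive": 48.204},
    "avg":      {"memory": 13.88,  "disk": 33.519, "hive": 48.159},
    "group by": {"memory": 8.14,   "disk": 29.852, "hive": 54.705},
}


def speedup_over_hive(row):
    """Return Hive time divided by Shark time for each Shark mode."""
    return {mode: row["hive"] / row[mode] for mode in ("memory", "disk")}


speedups = {query: speedup_over_hive(row) for query, row in TIMINGS.items()}

for query, s in speedups.items():
    print(f"{query:<9} memory {s['memory']:.2f}x   disk {s['disk']:.2f}x")
```

On these figures, Shark with cached data is between about 3.5 and 8.2 times faster than Hive, and Shark reading from disk is between about 1.4 and 1.8 times faster.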

Tests of join, select, and sort

Unit: seconds

Query	Shark (memory)	Shark (disk)	Apache Hive 0.11
join	194.98	272.36	236.203
select	178.53	161.01	172.762
sort	134.23	161.07	161.789
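
The gaps in this second table are much smaller, and Shark is not always ahead. The sketch below expresses each Shark timing as a percentage faster or slower than Hive and picks the fastest configuration per query; it uses only the numbers from the table, and the names are illustrative.

```python
# Timings in seconds, copied from the table above.
TIMINGS = {
    "join":   {"shark_memory": 194.98, "shark_disk": 272.36, "hive": 236.203},
    "select": {"shark_memory": 178.53, "shark_disk": 161.01, "hive": 172.762},
    "sort":   {"shark_memory": 134.23, "shark_disk": 161.07, "hive": 161.789},
}


def percent_faster_than_hive(row, engine):
    """Positive: the engine finished sooner than Hive; negative: it was slower."""
    return (row["hive"] - row[engine]) / row["hive"] * 100


fastest_by_query = {}
for query, row in TIMINGS.items():
    fastest_by_query[query] = min(row, key=row.get)   # configuration with the lowest time
    mem = percent_faster_than_hive(row, "shark_memory")
    disk = percent_faster_than_hive(row, "shark_disk")
    print(f"{query:<7} memory {mem:+.1f}%  disk {disk:+.1f}%  "
          f"fastest: {fastest_by_query[query]}")
```

In this table Shark on disk is slower than Hive for join, and Shark in memory is slightly slower than Hive for select, so caching alone does not guarantee a win on these heavier queries.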

Complex SQL test

A slightly more complex SQL statement was tested:

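-- Use 200 reduce tasks per job.
set mapred.reduce.tasks=200;
-- a: the maximum c10 for each c4; b: the distinct (c10, c11) pairs; joined on the c10 value.
create table complex as
select a.c4, a.time from
(select c4, max(c10) time from log_cached group by c4) a
join
(select c10 time, c11 from log_cached group by c10, c11) b
on a.time = b.time;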

When this SQL is executed, Hive launches 5 jobs:

Total MapReduce jobs = 5
Launching Job 1 out of 5
Number of reduce tasks not specified. Defaulting to jobconf value of: 200
In order to change the average load for a reducer (in by…
Ended Job = job_201406131753_0033
Moving data to: hdfs://10.1.8.210:9000/user/hive/warehouse/complex
Table default.complex stats: [num_partitions: 0, num_files: 200, num_rows: 0, total_size: 7769834233, raw_data_size: 0]
242791443 Rows loaded to hdfs://10.1.8.210:9000/tmp/hive-hadoop/hive_2014-06-20_17-05-17_199_1498492429758626519/-ext-10000
MapReduce Jobs Launched:
Job 0: Map: 84  Reduce: 200  Cumulative CPU: 3709.42 sec   HDFS Read: 22542324372 HDFS Write: 8287517376 SUCCESS
Job 1: Map: 84  Reduce: 200  Cumulative CPU: 3015.82 sec   HDFS Read: 22542324372 HDFS Write: 3528775299 SUCCESS
Job 2: Map: 43  Reduce: 200  Cumulative CPU: 4442.77 sec   HDFS Read: 11816414510 HDFS Write: 7769834233 SUCCESS
Total MapReduce CPU Time Spent: 0 days 3 hours 6 minutes 8 seconds 10 msec
OK
Time taken: 964.809 seconds
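
The job summary above can be cross-checked by summing the per-job CPU times and comparing them with the wall-clock time. This is a minimal sketch; the summary lines are trimmed to the CPU field, and the CPU-to-wall-clock ratio is only a rough indicator of how much work ran in parallel.

```python
import re

# Per-job lines from the Hive output above, trimmed to the CPU field.
JOB_SUMMARY = """\
Job 0: Map: 84  Reduce: 200  Cumulative CPU: 3709.42 sec
Job 1: Map: 84  Reduce: 200  Cumulative CPU: 3015.82 sec
Job 2: Map: 43  Reduce: 200  Cumulative CPU: 4442.77 sec
"""
WALL_CLOCK_SECONDS = 964.809   # "Time taken" reported by Hive

cpu_seconds = [float(x) for x in re.findall(r"Cumulative CPU: ([\d.]+) sec", JOB_SUMMARY)]
total_cpu = sum(cpu_seconds)

# Break the total into hours, minutes, and seconds to compare with Hive's own summary line.
hours, remainder = divmod(total_cpu, 3600)
minutes, seconds = divmod(remainder, 60)
cpu_to_wall = total_cpu / WALL_CLOCK_SECONDS

print(f"total CPU: {int(hours)} h {int(minutes)} min {seconds:.2f} s")
print(f"CPU time / wall-clock time: {cpu_to_wall:.1f}")
```

The sum agrees with the "3 hours 6 minutes 8 seconds 10 msec" line that Hive prints, and the total CPU time is more than eleven times the elapsed time.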

Unit: seconds