HAWQ vs. Hive Query Performance Comparison

This article simulates a page-click application scenario to compare the query performance of Hive, HAWQ external tables, and HAWQ internal tables. The experiment shows that HAWQ internal tables are 4 to 50 times faster than Hive on Tez.


Contents

I. Purpose of the Experiment
II. Hardware Environment
III. Software Environment
IV. Data Model
    1. Table structure
    2. Row counts
V. Creating Tables and Generating Data
    1. Create the Hive database and tables
    2. Generate Hive table data with a Java program
    3. Analyze the Hive tables
    4. Create the HAWQ external tables
    5. Create the HAWQ internal tables
    6. Load the HAWQ internal tables
    7. Analyze the HAWQ internal tables
VI. Running the Queries
    1. Most-visited directories on support.sas.com in a given week
    2. Pages visited from www.google.com, by month
    3. Search-string counts on support.sas.com for a given year
    4. Number of visitors per page using the Safari browser
    5. Pages on support.sas.com viewed for more than 10 seconds in a given week
VII. Test Results


I. Purpose of the Experiment

This experiment simulates a typical application scenario with a realistic data volume to measure and compare the query performance of HAWQ internal tables, HAWQ external tables, and Hive.

II. Hardware Environment

1. A Hadoop cluster of four VMware virtual machines.
2. Each machine is configured as follows:
(1) 100 GB 15K RPM SAS disk
(2) Two dual-core Intel(R) Xeon(R) E5-2620 v2 CPUs @ 2.10GHz
(3) 8 GB RAM, 8 GB swap
(4) 10000 Mb/s virtual NIC

III. Software Environment

1. Linux: CentOS release 6.4, kernel 2.6.32-358.el6.x86_64
2. Ambari: 2.4.1
3. Hadoop: HDP 2.5.0
4. Hive (Hive on Tez): 2.1.0
5. HAWQ: 2.1.1.0
6. HAWQ PXF: 3.1.1

IV. Data Model

1. Table structure

The experiment simulates an application that records page-click data. The data model contains five dimension tables (date, page, browser, referrer, and status) and one page-click fact table. The table structures and their relationships are shown in Figure 1.

Figure 1

2. Row counts

The row count of each table is shown in Table 1.

Table name         Rows
page_click_fact    100,000,000
page_dim           200,000
referrer_dim       1,000,000
browser_dim        20,000
status_code_dim    70
date_dim           366

Table 1

V. Creating Tables and Generating Data

1. Create the Hive database and tables

create database test;
use test;

create table browser_dim(
  browser_sk bigint, 
  browser_nm varchar(100), 
  browser_version_no varchar(100), 
  flash_version_no varchar(100), 
  flash_enabled_flg int, 
  java_version_no varchar(100), 
  platform_desc string, 
  java_enabled_flg int, 
  java_script_enabled_flg int, 
  cookies_enabled_flg int, 
  user_language_cd varchar(100), 
  screen_color_depth_no varchar(100), 
  screen_size_txt string)
row format delimited 
  fields terminated by ',' 
stored as orc; 

create table date_dim(
  cal_dt date, 
  day_in_cal_yr_no int, 
  day_of_week_no int, 
  start_of_month_dt date, 
  start_of_quarter_dt date, 
  start_of_week_dt date, 
  start_of_year_dt date)
row format delimited 
  fields terminated by ',' 
stored as orc;

create table page_dim(
  page_sk bigint, 
  domain_nm varchar(200), 
  reachability_cd string, 
  page_desc string, 
  protocol_nm varchar(20))
row format delimited 
  fields terminated by ',' 
stored as orc;
 
create table referrer_dim(
  referrer_sk bigint, 
  referrer_txt string, 
  referrer_domain_nm varchar(200))
row format delimited 
  fields terminated by ',' 
stored as orc;

create table status_code_dim(
  status_cd varchar(100), 
  client_error_flg int, 
  status_cd_desc string, 
  server_error_flg int)
row format delimited 
  fields terminated by ',' 
stored as orc;

create table page_click_fact(
  visitor_id varchar(100), 
  detail_tm timestamp, 
  page_click_dt date, 
  page_sk bigint, 
  client_session_dt date, 
  previous_page_sk bigint, 
  referrer_sk bigint, 
  next_page_sk bigint, 
  status_cd varchar(100), 
  browser_sk bigint, 
  bytes_received_cnt bigint, 
  bytes_sent_cnt bigint, 
  client_detail_tm timestamp, 
  entry_point_flg int, 
  exit_point_flg int, 
  ip_address varchar(20), 
  query_string_txt string, 
  seconds_spent_on_page_cnt int, 
  sequence_no int, 
  requested_file_txt string)
row format delimited 
  fields terminated by ',' 
stored as orc;

Note: the Hive tables use the ORC storage format.
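
As an optional verification step (not part of the original procedure), Hive can report the storage format actually in effect for a table:

-- Optional check: the InputFormat line in the output should name
-- org.apache.hadoop.hive.ql.io.orc.OrcInputFormat for an ORC table.
describe formatted test.page_click_fact;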

2. Generate Hive table data with a Java program

The HDFS file sizes of the tables after ORC compression are as follows:

2.2 M   /apps/hive/warehouse/test.db/browser_dim
641     /apps/hive/warehouse/test.db/date_dim
4.1 G   /apps/hive/warehouse/test.db/page_click_fact
16.1 M  /apps/hive/warehouse/test.db/page_dim
22.0 M  /apps/hive/warehouse/test.db/referrer_dim
1.1 K   /apps/hive/warehouse/test.db/status_code_dim
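
The Java generator itself is not listed in this article. As a rough sketch of one plausible loading path (the staging table and the HDFS path /tmp/date_dim.csv are hypothetical), generated CSV files can be staged into a text table and then rewritten as ORC:

-- Hypothetical staging table for generated CSV files, shown for date_dim.
create table date_dim_stage(
  cal_dt date, 
  day_in_cal_yr_no int, 
  day_of_week_no int, 
  start_of_month_dt date, 
  start_of_quarter_dt date, 
  start_of_week_dt date, 
  start_of_year_dt date)
row format delimited 
  fields terminated by ',' 
stored as textfile;

-- Move the generated file into the staging table, then rewrite as ORC.
load data inpath '/tmp/date_dim.csv' into table date_dim_stage;
insert into table date_dim select * from date_dim_stage;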

3. Analyze the Hive tables

analyze table date_dim compute statistics;
analyze table browser_dim compute statistics;
analyze table page_dim compute statistics;
analyze table referrer_dim compute statistics;
analyze table status_code_dim compute statistics;
analyze table page_click_fact compute statistics;
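
These statements gather table-level statistics only. If finer estimates are wanted, Hive can additionally collect column-level statistics for its cost-based optimizer; an optional step that was not part of the recorded test:

-- Optional: column statistics for the fact table's filter and join columns.
analyze table page_click_fact compute statistics for columns;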

4. Create the HAWQ external tables

create schema ext;
set search_path=ext;

create external table date_dim(
cal_dt              date,                                    
day_in_cal_yr_no    int4,                                     
day_of_week_no      int4,                                     
start_of_month_dt   date,                                    
start_of_quarter_dt date,                                    
start_of_week_dt    date,                                    
start_of_year_dt    date  
)
location ('pxf://hdp1:51200/test.date_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table browser_dim(
browser_sk              int8,                                  
browser_nm              varchar(100),                            
browser_version_no      varchar(100),                            
flash_version_no        varchar(100),                            
flash_enabled_flg       int,                                     
java_version_no         varchar(100),                            
platform_desc           text,                                  
java_enabled_flg        int,    
java_script_enabled_flg int,    
cookies_enabled_flg     int,                                     
user_language_cd        varchar(100),         
screen_color_depth_no   varchar(100),            
screen_size_txt         text   
)  
location ('pxf://hdp1:51200/test.browser_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table page_dim(
page_sk             int8,                                  
domain_nm           varchar(200),                            
reachability_cd     text,                           
page_desc           text,                                  
protocol_nm         varchar(20) 
) 
location ('pxf://hdp1:51200/test.page_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table referrer_dim(
referrer_sk         int8,                 
referrer_txt        text,                
referrer_domain_nm  varchar(200) 
)
location ('pxf://hdp1:51200/test.referrer_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table status_code_dim(
status_cd           varchar(100),                            
client_error_flg    int4,                                     
status_cd_desc      text,                                  
server_error_flg    int4
) 
location ('pxf://hdp1:51200/test.status_code_dim?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import'); 

create external table page_click_fact(
visitor_id                varchar(100),                            
detail_tm                 timestamp,                               
page_click_dt             date,                                    
page_sk                   int8,                                  
client_session_dt         date, 
previous_page_sk          int8,                                  
referrer_sk               int8,                                  
next_page_sk              int8,                                  
status_cd                 varchar(100),                            
browser_sk                int8,                                  
bytes_received_cnt        int8,                                  
bytes_sent_cnt            int8,                                  
client_detail_tm          timestamp,                               
entry_point_flg           int4,                                     
exit_point_flg            int4,                                     
ip_address                varchar(20),                             
query_string_txt          text,                                  
seconds_spent_on_page_cnt int4,                                     
sequence_no               int4,                                     
requested_file_txt        text    
) 
location ('pxf://hdp1:51200/test.page_click_fact?profile=hiveorc')  
format 'custom' (formatter='pxfwritable_import');

Note: the HAWQ external tables use the PXF protocol and point at the corresponding Hive tables.
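
A quick row count against one external table confirms that PXF can actually reach the Hive data; the expected value below assumes the row counts in Table 1:

-- Should return 100,000,000 if PXF reads the Hive ORC data correctly.
select count(*) from ext.page_click_fact;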

5. Create the HAWQ internal tables

set search_path=public;

create table date_dim(
cal_dt              date,      
day_in_cal_yr_no    int4,
day_of_week_no      int4,
start_of_month_dt   date, 
start_of_quarter_dt date,     
start_of_week_dt    date,         
start_of_year_dt    date) with (compresstype=snappy,appendonly=true); 

create table browser_dim(
browser_sk              int8,                                  
browser_nm              varchar(100),                            
browser_version_no      varchar(100),                            
flash_version_no        varchar(100),                            
flash_enabled_flg       int,                                     
java_version_no         varchar(100),                            
platform_desc           text,                                  
java_enabled_flg        int,    
java_script_enabled_flg int,    
cookies_enabled_flg     int,                                     
user_language_cd        varchar(100),         
screen_color_depth_no   varchar(100),            
screen_size_txt         text   
) with (compresstype=snappy,appendonly=true); 

create table page_dim(
page_sk             int8,                                  
domain_nm           varchar(200),                            
reachability_cd     text,                           
page_desc           text,                                  
protocol_nm         varchar(20) 
) with (compresstype=snappy,appendonly=true); 

create table referrer_dim(
referrer_sk         int8,                 
referrer_txt        text,                
referrer_domain_nm  varchar(200) 
) with (compresstype=snappy,appendonly=true); 

create table status_code_dim(
status_cd           varchar(100),                            
client_error_flg    int4,                                     
status_cd_desc      text,                                  
server_error_flg    int4
) with (compresstype=snappy,appendonly=true); 

create table page_click_fact(
visitor_id                varchar(100),                            
detail_tm                 timestamp,                               
page_click_dt             date,                                    
page_sk                   int8,                                  
client_session_dt         date, 
previous_page_sk          int8,                                  
referrer_sk               int8,                                  
next_page_sk              int8,                                  
status_cd                 varchar(100),                            
browser_sk                int8,                                  
bytes_received_cnt        int8,                                  
bytes_sent_cnt            int8,                                  
client_detail_tm          timestamp,                               
entry_point_flg           int4,                                     
exit_point_flg            int4,                                     
ip_address                varchar(20),                             
query_string_txt          text,                                  
seconds_spent_on_page_cnt int4,                                     
sequence_no               int4,                                     
requested_file_txt        text    
) with (compresstype=snappy,appendonly=true);

Note: the internal table definitions are equivalent to the Hive tables and use snappy-compressed, append-only row storage.
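
Row orientation is the default for append-only tables. HAWQ also supports column-oriented Parquet storage, which can change both the on-disk size and the scan speed; a hypothetical variant of the fact table, not measured in this test:

-- Hypothetical Parquet alternative (table name invented for illustration).
create table page_click_fact_parquet (like page_click_fact)
with (appendonly=true, orientation=parquet, compresstype=snappy);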

6. Load the HAWQ internal tables

insert into date_dim select * from hcatalog.test.date_dim;
insert into browser_dim select * from hcatalog.test.browser_dim; 
insert into page_dim select * from hcatalog.test.page_dim; 
insert into referrer_dim select * from hcatalog.test.referrer_dim; 
insert into status_code_dim select * from hcatalog.test.status_code_dim; 
insert into page_click_fact select * from hcatalog.test.page_click_fact;

Note: the Hive tables are queried directly through the HCatalog integration and the results are inserted into the HAWQ internal tables. The HDFS file sizes of the tables after snappy compression are as follows:

6.2 K   /hawq_data/16385/177422/177677
3.3 M   /hawq_data/16385/177422/177682
23.9 M  /hawq_data/16385/177422/177687
39.3 M  /hawq_data/16385/177422/177707
1.8 K   /hawq_data/16385/177422/177726
7.9 G   /hawq_data/16385/177422/177731
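
As an optional spot check (not part of the original procedure), the loaded row counts can be compared against a direct HCatalog read:

-- Both counts should match if the insert copied every row.
select count(*) from page_click_fact;                 -- internal table
select count(*) from hcatalog.test.page_click_fact;  -- Hive table via HCatalog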

7. Analyze the HAWQ internal tables

analyze date_dim;
analyze browser_dim;
analyze page_dim;
analyze referrer_dim;
analyze status_code_dim;
analyze page_click_fact;
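
Because HAWQ keeps planner statistics in the PostgreSQL system catalog, the collected estimates can be inspected afterwards; an optional check:

-- reltuples should be close to 100,000,000 for the fact table after analyze.
select relname, reltuples from pg_class where relname = 'page_click_fact';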

VI. Running the Queries

The following five queries were executed against the Hive tables, the HAWQ external tables, and the HAWQ internal tables, and the execution times were recorded.

1. Most-visited directories on support.sas.com in a given week

-- Hive query
select top_directory, count(*) as unique_visits     
  from (select distinct visitor_id, substr(requested_file_txt,1,10) top_directory  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm = 'support.sas.com'   
           and flash_enabled_flg=1   
           and weekofyear(detail_tm) = 19   
           and year(detail_tm) = 2017  
       ) directory_summary  
 group by top_directory  
 order by unique_visits;  

-- HAWQ query: equivalent to the Hive query, with extract replacing Hive's weekofyear and year functions.
select top_directory, count(*) as unique_visits     
  from (select distinct visitor_id, substr(requested_file_txt,1,10) top_directory  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm = 'support.sas.com'   
           and flash_enabled_flg=1   
           and extract(week from detail_tm) = 19   
           and extract(year from detail_tm) = 2017  
       ) directory_summary  
 group by top_directory  
 order by unique_visits;
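
When runtimes diverge as sharply as they do in this test, inspecting the plans helps explain why; an optional step not included in the recorded measurements:

-- Prefix any of the test queries with explain analyze to see how HAWQ
-- distributes the scan, join, and aggregation across segments, e.g.:
explain analyze
select count(*) from page_click_fact, page_dim
 where domain_nm = 'support.sas.com';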

2. Pages visited from www.google.com, by month

-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visitors, month  
  from (select distinct domain_nm, requested_file_txt, visitor_id, month(detail_tm) as month  
          from page_click_fact, page_dim, referrer_dim   
         where domain_nm = 'support.sas.com'   
           and referrer_domain_nm = 'www.google.com'  
       ) visits_pp_ph_summary  
 group by domain_nm, requested_file_txt, month  
 order by domain_nm, requested_file_txt, unique_visitors desc, month asc;  

-- HAWQ query: equivalent to the Hive query, with extract replacing Hive's month function.
select domain_nm, requested_file_txt, count(*) as unique_visitors, month  
  from (select distinct domain_nm, requested_file_txt, visitor_id, extract(month from detail_tm) as month  
          from page_click_fact, page_dim, referrer_dim   
         where domain_nm = 'support.sas.com'   
           and referrer_domain_nm = 'www.google.com'  
       ) visits_pp_ph_summary  
 group by domain_nm, requested_file_txt, month  
 order by domain_nm, requested_file_txt, unique_visitors desc, month asc;

3. Search-string counts on support.sas.com for a given year

-- Hive query
select query_string_txt, count(*) as count  
  from page_click_fact, page_dim  
 where query_string_txt <> ''   
   and domain_nm='support.sas.com'   
   and year(detail_tm) = '2017'  
 group by query_string_txt  
 order by count desc;
 
-- HAWQ query: equivalent to the Hive query, with extract replacing Hive's year function.
select query_string_txt, count(*) as count  
  from page_click_fact, page_dim  
 where query_string_txt <> ''   
   and domain_nm='support.sas.com'   
   and extract(year from detail_tm) = '2017'  
 group by query_string_txt  
 order by count desc;

4. Number of visitors per page using the Safari browser

-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visitors  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm='support.sas.com'   
           and browser_nm like '%Safari%'   
           and weekofyear(detail_tm) = 19   
           and year(detail_tm) = 2017  
       ) uv_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visitors desc;  
 
-- HAWQ query: equivalent to the Hive query, with extract replacing Hive's weekofyear and year functions.
select domain_nm, requested_file_txt, count(*) as unique_visitors  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim, browser_dim  
         where domain_nm='support.sas.com'   
           and browser_nm like '%Safari%'   
           and extract(week from detail_tm) = 19   
           and extract(year from detail_tm) = 2017  
       ) uv_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visitors desc;

5. Pages on support.sas.com viewed for more than 10 seconds in a given week

-- Hive query
select domain_nm, requested_file_txt, count(*) as unique_visits  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim  
         where domain_nm='support.sas.com'   
           and weekofyear(detail_tm) = 19   
           and year(detail_tm) = 2017   
           and seconds_spent_on_page_cnt > 10 
       ) visits_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visits desc;  

-- HAWQ query: equivalent to the Hive query, with extract replacing Hive's weekofyear and year functions.
select domain_nm, requested_file_txt, count(*) as unique_visits  
  from (select distinct domain_nm, requested_file_txt, visitor_id  
          from page_click_fact, page_dim  
         where domain_nm='support.sas.com'   
           and extract(week from detail_tm) = 19   
           and extract(year from detail_tm) = 2017   
           and seconds_spent_on_page_cnt > 10 
       ) visits_summary  
 group by domain_nm, requested_file_txt  
 order by unique_visits desc;

VII. Test Results

Table 2 compares the query times of Hive, the HAWQ external tables, and the HAWQ internal tables. Each query was executed three times in each configuration and the times averaged.

Query    Hive (s)    HAWQ external table (s)    HAWQ internal table (s)
1        74.337      304.134                    19.232
2        169.521     150.882                    3.446
3        73.482      101.216                    18.565
4        66.367      359.778                    1.217
5        60.341      118.329                    2.789

Table 2

As the comparison in Figure 2 shows, HAWQ internal tables are much faster than Hive on Tez (4 to 50 times). The same queries run much more slowly against the HAWQ external tables over the Hive data. Analytical queries should therefore be run against HAWQ internal tables whenever possible. When external tables are unavoidable, keep the volume of external data as small as possible to get acceptable performance, and keep the queries simple, avoiding aggregation, grouping, sorting, and other complex operations on external tables.

Figure 2
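
When a Hive table must be queried repeatedly from HAWQ, one workaround consistent with this advice is to materialize the external data as an internal table once and query the copy; a sketch with a hypothetical table name:

-- One-off materialization; later queries hit the snappy-compressed
-- internal copy instead of going through PXF.
create table page_click_fact_local
with (appendonly=true, compresstype=snappy)
as select * from ext.page_click_fact;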
