Choosing a Storage Format for Hive/Impala

Latest recommended article published on 2024-12-20 16:01:15

xjping0794

Reposted from http://blog.youkuaiyun.com/mtj66/article/details/53968991

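1. TEXTFILE
The default format: a table created without an explicit format uses it. Storage layout: row-oriented.
When data is loaded, the data file is copied to HDFS as-is, with no processing. The source file can be viewed directly with hadoop fs -cat.
Disk overhead is high and so is the cost of parsing the data. Hive cannot merge or split compressed text files.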
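2. SEQUENCEFILE
A binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible.
SEQUENCEFILE serializes data into the file as <key,value> pairs. Serialization and deserialization use Hadoop's standard Writable interface. One advantage is that the files are compatible with MapFile in the Hadoop API.
The key is left empty and the value holds the actual data, which avoids the sort in the map phase.
There are three compression options: NONE, RECORD, and BLOCK. RECORD gives a low compression ratio, so BLOCK is generally recommended. Set the following parameters when using it: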
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
create table test2(str STRING) STORED AS SEQUENCEFILE;
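The difference between RECORD and BLOCK compression can be illustrated with a small sketch. This is not the SequenceFile on-disk format: zlib stands in for a Hadoop codec, and the record contents and the block size of 1000 records are assumptions made only for the illustration. It shows why compressing many values together usually beats compressing each value on its own.

```python
import zlib


def record_compressed_size(records):
    """RECORD-style: every value is compressed on its own."""
    return sum(len(zlib.compress(r)) for r in records)


def block_compressed_size(records, block_records=1000):
    """BLOCK-style: values are buffered and compressed together."""
    total = 0
    for i in range(0, len(records), block_records):
        block = b"".join(records[i:i + block_records])
        total += len(zlib.compress(block))
    return total


# Hypothetical sample data: short, repetitive log-like values.
records = [f"user_{i % 50},click,2016-06-15".encode() for i in range(10000)]

raw_size = sum(len(r) for r in records)
record_size = record_compressed_size(records)
block_size = block_compressed_size(records)

print("raw bytes:   ", raw_size)
print("RECORD bytes:", record_size)
print("BLOCK bytes: ", block_size)
```

With short values, per-record compression has little redundancy to work with and pays a fixed overhead per value, while block compression can exploit repetition across records.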
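3. RCFILE
A storage layout that combines row and column storage.
First, it partitions the data horizontally into row groups, so that a whole record stays within one block and reading a record never requires reading multiple blocks.
Second, the data inside each group is stored column by column, which helps compression and fast column access.
In theory this gives high query efficiency (though the Hive project reportedly says the gain is not significant and only about 10% of storage is saved, so it is not very useful and can be skipped).
RCFile combines the fast queries of row storage with the space savings of column storage:
1) All data of a row is located on the same node, so tuple reconstruction is cheap.
2) Columnar storage within a block allows per-column compression and skipping of unneeded columns.
During a query, columns that are not needed are skipped at the I/O level. In practice, though, the map phase still copies the whole data block from the remote node to a local directory, and columns are not skipped outright; the reader scans the header of each row group to locate them.
The header at the HDFS block level does not record in which row group each column starts and ends.
As a result, when all columns are read, RCFile performs worse than SequenceFile.
Reading a record touches as few blocks as possible.
Reading the needed columns only requires reading the header of each row group.
Reading the full data set may show no clear advantage over SequenceFile.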
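The row-group idea described above can be sketched in a few lines. This is an in-memory toy, not RCFile's real binary layout; the function names, the sample rows, and the group size of 2 are all hypothetical. It shows the two properties the text relies on: a record lives entirely inside one group, and a column can be read without touching the other columns.

```python
def to_row_groups(rows, rows_per_group):
    """Split rows horizontally, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # Transpose the chunk: one list per column inside the group.
        columns = [list(col) for col in zip(*chunk)]
        groups.append(columns)
    return groups


def read_column(groups, col_index):
    """Read one column, touching only that column's data in each group."""
    values = []
    for columns in groups:
        values.extend(columns[col_index])
    return values


def read_row(groups, row_index, rows_per_group):
    """Rebuild one record; all of its fields live in a single group."""
    columns = groups[row_index // rows_per_group]
    offset = row_index % rows_per_group
    return tuple(col[offset] for col in columns)


# Hypothetical rows shaped like (advertiser_id, ad_plan_id, cnt).
rows = [("a1", "p1", 10), ("a2", "p2", 20), ("a3", "p3", 30),
        ("a4", "p4", 40), ("a5", "p5", 50)]
groups = to_row_groups(rows, rows_per_group=2)

print(read_column(groups, 2))
print(read_row(groups, 3, rows_per_group=2))
```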
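4. ORC
A newer format introduced by Hive and an upgraded version of RCFILE. Performance is greatly improved, data can be stored compressed, and both compression and column access are fast.
Its compression ratio is about the same as LZO, and compared with a text file the space saving can reach about 80%. Read performance is also very high, which enables efficient queries.
An ORC file contains one or more stripes (groups of row data). Each stripe carries index data with the min/max value of every column. When a query uses <, >, or = comparisons,
stripes whose min/max range cannot contain matching rows are skipped.
For details see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC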
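Skipping stripes by min/max value can be sketched as follows. This is a simplified model, not the ORC reader: the Stripe class, the stripe size of 25 rows, and the sorted sample data are assumptions chosen so that the effect is easy to see.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    values: list
    min_value: int
    max_value: int


def build_stripes(values, stripe_rows):
    """Group values into stripes and record min/max for each one."""
    stripes = []
    for i in range(0, len(values), stripe_rows):
        chunk = values[i:i + stripe_rows]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_greater_than(stripes, threshold):
    """Return matching values and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no row in this stripe can satisfy value > threshold
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


values = list(range(100))  # sorted data keeps stripe ranges disjoint
stripes = build_stripes(values, stripe_rows=25)
matches, stripes_read = scan_greater_than(stripes, 80)

print("stripes total:", len(stripes))
print("stripes read: ", stripes_read)
print("rows matched: ", len(matches))
```

The benefit depends on how the data is ordered: if every stripe spans the full value range, the min/max check cannot rule any stripe out.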
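5. Custom formats
When a user's data file format is not recognized by the current Hive, custom input and output formats can be defined by implementing InputFormat and OutputFormat.
Reference code: .\hive-0.8.1\src\contrib\src\java\org\apache\hadoop\hive\contrib\fileformat\base64
For an introduction to the formats above and their CREATE TABLE statements, see: http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2843318.html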
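Notes:
Only TEXTFILE tables can load raw data files directly, because loading copies the files without converting them. Loading local data with LOAD and pointing an external table at data already sitting in a path both require a TEXTFILE table. Going one step further, compressed files that Hive supports by default (the compression formats Hadoop supports by default) can also only be read directly through a TEXTFILE table.
Other formats cannot do this. Load the data into a TEXTFILE table first and then insert it into the other table. In other words, SequenceFile and RCFile tables cannot load raw data directly: import the data into a textfile table, then move it into the SequenceFile or RCFile table with insert ... select ... from the textfile table.
The source files of SequenceFile and RCFile tables cannot be viewed directly; use select in Hive to look at the data.
RCFile source files can be viewed with hive --service rcfilecat /xxxxxxxxxxxxxxxxxxxxxxxxxxx/000000_0, but the layout differs and the output is messy.
For the compressed file formats Hive supports by default, see http://blog.youkuaiyun.com/longshenlmj/article/details/50550580
The CREATE TABLE statements are shown below. They also change the NULL representation in the ORC table from the default \N to ''.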
Method 1
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT)
partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
STORED AS ORC;
alter table test_orc set serdeproperties('serialization.null.format' = '');
Result:
hive> show create table test_orc;
CREATE TABLE `test_orc`(
  `advertiser_id` string,
  `ad_plan_id` string,
  `cnt` bigint)
PARTITIONED BY (
  `day` string,
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck',
  `hour` tinyint)
ROW FORMAT DELIMITED
  NULL DEFINED AS ''
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'last_modified_by'='pmp_bi',
  'last_modified_time'='1465992624',
  'transient_lastDdlTime'='1465992624')
Method 2
drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT)
partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
with serdeproperties('serialization.null.format' = '')
STORED AS ORC;
Result:
hive> show create table test_orc;
CREATE TABLE `test_orc`(
  `advertiser_id` string,
  `ad_plan_id` string,
  `cnt` bigint)
PARTITIONED BY (
  `day` string,
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck',
  `hour` tinyint)
ROW FORMAT DELIMITED
  NULL DEFINED AS ''
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992726')
Method 3
drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT)
partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT DELIMITED NULL DEFINED AS ''
STORED AS ORC;
Result:
hive> show create table test_orc;
CREATE TABLE `test_orc`(
  `advertiser_id` string,
  `ad_plan_id` string,
  `cnt` bigint)
PARTITIONED BY (
  `day` string,
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck',
  `hour` tinyint)
ROW FORMAT DELIMITED
  NULL DEFINED AS ''
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992916')
Detailed storage comparison. The data below has a single column.
> desc tmp_store ;
OK
col_name data_type comment
c string // a single column whose values contain both Chinese and English text
Time taken: 0.076 seconds, Fetched: 1 row(s)
1. TextInputFormat
CREATE TABLE `tmp_store`(
`c` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
> dfs -du -s -h /user/hive/warehouse/test.db/tmp_store ;
753.9 M 2.2 G /user/hive/warehouse/test.db/tmp_store ;
=========================================================
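Prepare 1 million rows in tmp_store
-- LIKE copies the definition of tmp_store, including its storage format
create table tmp_store_seq like tmp_store;
-- Hive's INSERT ... SELECT takes no AS keyword
insert into table tmp_store_seq select * from tmp_store;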
create table tmp_store_par like tmp_store_seq stored as Parquet;
insert into table tmp_store_par select * from tmp_store_seq;
create table tmp_store_rc like tmp_store_seq stored as RCFile ;
insert into table tmp_store_rc select * from tmp_store_seq;
create table tmp_store_orc like tmp_store_seq stored as ORC ;
insert into table tmp_store_orc select * from tmp_store_seq;
hive (test)> dfs -du -s -h /user/hive/warehouse/test.db/tmp_store ;
125.9 M 377.7 M /user/hive/warehouse/test.db/tmp_store
hive (test)> dfs -du -s -h /user/hive/warehouse/test.db/tmp_store_orc ;
20.1 M 60.3 M /user/hive/warehouse/test.db/tmp_store_orc
hive (test)> dfs -du -s -h /user/hive/warehouse/test.db/tmp_store_rc ;
101.1 M 303.3 M /user/hive/warehouse/test.db/tmp_store_rc
hive (test)> dfs -du -s -h /user/hive/warehouse/test.db/tmp_store_par ;
53.6 M 160.9 M /user/hive/warehouse/test.db/tmp_store_par
hive (test)> dfs -du -s -h /user/hive/warehouse/test.db/tmp_store_seq ;
125.9 M 377.7 M /user/hive/warehouse/test.db/tmp_store_seq
table          storage format  size     replicated size  storage efficiency (relative to textfile)
tmp_store_orc  orc             20.1 M   60.3 M           6.26
tmp_store_par  parquet         53.6 M   160.9 M          2.35 (Spark default output format)
tmp_store_rc   rcfile          101.1 M  303.3 M          1.24
tmp_store_seq  sequencefile    125.9 M  377.7 M          1
tmp_store      textfile        125.9 M  377.7 M          1 (Hive default storage format)
Table 1
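The storage efficiency column in Table 1 appears to be the textfile size divided by each format's size. A minimal sketch that recomputes it from the sizes reported by dfs -du above; the dictionary and function names are hypothetical, and the space-saving figure is an extra derived value, not something the original reports.

```python
# Sizes in MB as reported by dfs -du for each table (first column).
sizes_mb = {
    "orc": 20.1,
    "parquet": 53.6,
    "rcfile": 101.1,
    "sequencefile": 125.9,
    "textfile": 125.9,
}


def storage_efficiency(sizes, baseline="textfile"):
    """Ratio of the baseline size to each format's size."""
    base = sizes[baseline]
    return {fmt: base / size for fmt, size in sizes.items()}


def space_saving(sizes, baseline="textfile"):
    """Fraction of the baseline size that each format saves."""
    base = sizes[baseline]
    return {fmt: 1 - size / base for fmt, size in sizes.items()}


efficiency = storage_efficiency(sizes_mb)
saving = space_saving(sizes_mb)

for fmt in sizes_mb:
    print(f"{fmt:<13} {efficiency[fmt]:.3f}x  saves {saving[fmt]:.1%}")
```

The rcfile ratio comes out near 1.245, which the table lists as 1.24.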
Spark SQL, time to process 947482003 records
table name   storage format  size     replicated size  storage efficiency (relative to textfile)  count, first run (s)  count, second run (s)
gps_log_par  parquet         47.0 G   140.9 G          2.23                                       28                    3
gps_log_orc  orc             15.3 G   45.8 G           6.87                                       45                    25
gps_log      textfile        104.8 G  314.5 G          1 (Hive default storage format)            133                   104
Table 2
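Impala's Parquet support is about as good as it gets: the first count took 6.55 s and the second 2.30 s.
That makes it a reasonable choice for interactive queries, although with any format other than Parquet the efficiency is far worse.
On the textfile table the first few attempts simply hung (142 s). After several tries the result came back, and subsequent queries were quite fast (apparently the time went into acquiring resources).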
Query: select count(*) from gps_log_201608_201610
+-----------+
| count(*) |
+-----------+
| 947482003 |
+-----------+
Fetched 1 row(s) in 142.54s
[10.1.16.40:21000] > select count(*) from gps_log_201608_201610 ;
Query: select count(*) from gps_log_201608_201610
+-----------+
| count(*) |
+-----------+
| 947482003 |
+-----------+
Fetched 1 row(s) in 10.27s
tool      format    operation        first run (s)  second run (s)
sparksql  textfile  count            54             50
sparksql  textfile  max,min          83             not measured
sparksql  textfile  max,min,groupby  150            not measured
sparksql  parquet   count            4              2
sparksql  parquet   max,min          26             12
sparksql  parquet   max,min,groupby  50             49
sparksql  orc       supported
impala    textfile  count            142.53         10.27
impala    textfile  max,min          20.81          18.30
impala    textfile  max,min,groupby  37.27          37.21
impala    parquet   count            2.39           2.29
impala    parquet   max,min          8.87           8.78
impala    parquet   max,min,groupby  36.67          36.67
impala    orc       not supported
Table 3
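To put the Table 3 timings side by side, the following sketch computes how many times faster Impala was than Spark SQL on Parquet for each operation, using the first-run times from the table. The variable and function names are hypothetical; only the numbers come from the measurements above.

```python
# First-run times in seconds on Parquet, taken from Table 3.
sparksql_parquet = {"count": 4.0, "max,min": 26.0, "max,min,groupby": 50.0}
impala_parquet = {"count": 2.39, "max,min": 8.87, "max,min,groupby": 36.67}


def speedups(slower, faster):
    """Per-operation ratio of the slower tool's time to the faster tool's."""
    return {op: slower[op] / faster[op] for op in slower}


ratios = speedups(sparksql_parquet, impala_parquet)

for op, ratio in ratios.items():
    print(f"{op:<16} Impala is {ratio:.2f}x faster than Spark SQL")
```

The ratio varies noticeably by operation, so a single headline factor is best read as an upper bound taken from the max/min query.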
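Summary:
textfile: storage consumption is relatively high, compressed text cannot be split or merged, and query efficiency is the lowest. Data can be stored directly, so loading is the fastest.
sequencefile: storage consumption is the highest (in Table 1 it matches textfile). Compressed files can be split and merged, query efficiency is high, and data has to be loaded by converting from text files.
rcfile: uses less storage than textfile and sequencefile (Table 1 shows ORC and Parquet are smaller still) and queries efficiently. Data has to be loaded by converting from text files, and loading is the slowest: because of the columnar layout, loading is expensive, but compression ratio and query response are good.
parquet: the default output format of Spark SQL. Table 2 shows that Parquet is processed faster. It gives up some storage relative to ORC, but computation is much quicker, which shortens response time and makes interactive queries practical.
orcfile: balances storage efficiency and processing efficiency.
Table 3 also shows that Impala is up to about three times as fast as Spark SQL for interactive queries (roughly 2.9x for max/min on Parquet, less for count and group by). The best combination for interactive queries is Impala + Parquet.
Impala's initialization time should be taken into account as well, although even with initialization included the first Parquet query in Table 3 responded within 3 s, which says a lot.
Compression codecs were not part of this comparison.
With these numbers in hand, it should be possible to draw up an overall plan for the platform's storage formats.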