Hive 存储文件格式

最新推荐文章于 2024-12-23 08:34:03 发布

leveretz

最新推荐文章于 2024-12-23 08:34:03 发布

阅读量560

点赞数

分类专栏： hive 文章标签： hive

原文链接：https://blog.youkuaiyun.com/hereiskxm/article/details/42171325

版权

hive 专栏收录该内容

13 篇文章

订阅专栏

原文地址：https://blog.youkuaiyun.com/yangshaojun1992/article/details/85124287

hive文件存储格式包括以下几类：

1、TEXTFILE

2、SEQUENCEFILE

3、RCFILE

4、ORCFILE(0.11以后出现)

5、PARQUET

其中TEXTFILE为默认格式，建表时不指定默认为这个格式，导入数据时会直接把数据文件拷贝到hdfs上不进行处理；SEQUENCEFILE，RCFILE，ORCFILE,PARQUET格式的表不能直接从本地文件导入数据，数据要先导入到textfile格式的表中，然后再从表中用insert导入SequenceFile,RCFile,ORCFile,PARQUET表中；或者用复制表结构及数据的方式（create table as select * from table ）。

1、textfile

默认格式；

存储方式为行存储；

磁盘开销大数据解析开销大；

但使用这种方式，hive不会对数据进行切分，从而无法对数据进行并行操作。

可结合Gzip、Bzip2使用(系统自动检查，执行查询时自动解压)

create table if not exists textfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as textfile;
插入数据操作：
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
insert overwrite table textfile_table select * from textfile_table;

2、sequencefile

二进制文件,以<key,value>的形式序列化到文件中；
存储方式：行存储；
可分割压缩；
SequenceFile支持三种压缩选择：NONE，RECORD，BLOCK。Record压缩率低，一般建议使用BLOCK压缩；
优势是文件和Hadoop api中的mapfile是相互兼容的。

create table if not exists seqfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as sequencefile;
插入数据操作：
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
SET mapred.output.compression.type=BLOCK;
insert overwrite table seqfile_table select * from textfile_table;

3、rcfile

存储方式：数据按行分块每块按照列存储；
压缩快快速列存取；
读记录尽量涉及到的block最少；
读取需要的列只需要读取每个row group 的头部定义；
读取全量数据的操作性能可能比sequencefile没有明显的优势；

create table if not exists rcfile_table(
site string,
url  string,
pv   bigint,
label string)
row format delimited
fields terminated by '\t'
stored as rcfile;
插入数据操作：
set hive.exec.compress.output=true;  
set mapred.output.compress=true;  
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;  
insert overwrite table rcfile_table select * from textfile_table;

4、orcfile

存储方式：数据按行分块每块按照列存储；

压缩快快速列存取；

效率比rcfile高,是rcfile的改良版本。

5、parquet
类似于orc，相对于orc文件格式，hadoop生态系统中大部分工程都支持parquet文件。

6、再看TEXTFILE、SEQUENCEFILE、RCFILE三种文件的存储情况：

[hadoop@node3 ~]$ hadoop dfs -dus /user/hive/warehouse/*
hdfs://node1:19000/user/hive/warehouse/hbase_table_1    0
hdfs://node1:19000/user/hive/warehouse/hbase_table_2    0
hdfs://node1:19000/user/hive/warehouse/orcfile_table    0
hdfs://node1:19000/user/hive/warehouse/rcfile_table    102638073
hdfs://node1:19000/user/hive/warehouse/seqfile_table   112497695
hdfs://node1:19000/user/hive/warehouse/testfile_table  536799616
hdfs://node1:19000/user/hive/warehouse/textfile_table  107308067
[hadoop@node3 ~]$ hadoop dfs -ls /user/hive/warehouse/*/
-rw-r--r--   2 hadoop supergroup   51328177 2014-03-20 00:42 /user/hive/warehouse/rcfile_table/000000_0
-rw-r--r--   2 hadoop supergroup   51309896 2014-03-20 00:43 /user/hive/warehouse/rcfile_table/000001_0
-rw-r--r--   2 hadoop supergroup   56263711 2014-03-20 01:20 /user/hive/warehouse/seqfile_table/000000_0
-rw-r--r--   2 hadoop supergroup   56233984 2014-03-20 01:21 /user/hive/warehouse/seqfile_table/000001_0
-rw-r--r--   2 hadoop supergroup  536799616 2014-03-19 23:15 /user/hive/warehouse/testfile_table/weibo.txt
-rw-r--r--   2 hadoop supergroup   53659758 2014-03-19 23:24 /user/hive/warehouse/textfile_table/000000_0.gz
-rw-r--r--   2 hadoop supergroup   53648309 2014-03-19 23:26 /user/hive/warehouse/textfile_table/000001_1.gz

总结:
相比TEXTFILE和SEQUENCEFILE，RCFILE由于列式存储方式，数据加载时性能消耗较大，但是具有较好的压缩比和查询响应。数据仓库的特点是一次写入、多次读取，因此，整体来看，RCFILE相比其余两种格式具有较明显的优势。