Hive Chapter 8: Schema Design

Latest recommended article published on 2025-12-30 14:45:00


License: CC 4.0 BY-SA

Tags: #hive #schema


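This article examines optimization strategies for a Hive data warehouse: partition design, unique keys and normalization, making multiple passes over the same data, the case for partitioning every table, bucketing table data storage, adding columns to a table, and compression. It uses examples to show how designing and applying these strategies sensibly improves data-processing efficiency and performance.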

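Schema design

Hive patterns and anti-patterns

1. Table by day: splitting data into one table per day. This is discouraged in relational databases; in Hive, get the same effect with a single table partitioned by day.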

create table supply (id int, part string, quantity int) partitioned by (day int);

alter table supply add partition (day=20120102);
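As a rough illustration of why the partitioned form works well, a query that filters on the partition column only has to read the matching partition directories. The sketch below assumes the supply table above, with day declared as an int partition column; the date range and the quantity filter are made up for the example.

```sql
-- Only the partitions for 20120102 through 20120108 are scanned
-- (hypothetical range; other days' directories are skipped).
SELECT part, quantity
FROM supply
WHERE day >= 20120102 AND day < 20120109
  AND quantity < 4;
```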

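Downsides of partitions:

1. NameNode limitation

Every subdirectory and file that a partition creates is stored in HDFS, and the NameNode keeps the metadata for all of them in memory. The downside is therefore the NameNode's upper limit on the number of filesystem objects (Hadoop has this upper limit on the total number of files; MapR and Amazon S3 don't have this limitation).

2. A job is split into several tasks. Each file corresponds to a separate task, and by default each task runs in its own JVM instance (a process). Too many instances mean a lot of JVM start-up and tear-down overhead, which slows the computation down.

So avoid having too many partitions, and make each file as large as possible.

A good table-by-day design yields similarly sized chunks of data across time intervals, and the interval can be widened where appropriate. Also make sure each file is larger than the filesystem block size. The goal is to make each partition large enough. Another option is to partition the data along multiple dimensions.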
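The two options above could look roughly like the following. This is only a sketch: the table names supply_by_month and weblogs, the yearmonth and state partition columns, and all values are hypothetical, not taken from the original article.

```sql
-- Option 1 (assumed names): widen the interval from one day to one month,
-- so each partition holds more data and fewer directories are created.
CREATE TABLE supply_by_month (id INT, part STRING, quantity INT)
PARTITIONED BY (yearmonth INT);

ALTER TABLE supply_by_month ADD PARTITION (yearmonth=201201);

-- Option 2 (assumed names): partition along two dimensions.
CREATE TABLE weblogs (url STRING, hit_time BIGINT)
PARTITIONED BY (day INT, state STRING);

ALTER TABLE weblogs ADD PARTITION (day=20120102, state='CA');
```

Whether either layout is a good fit depends on the data volume per partition; the target is still files that are large relative to the block size.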

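2. Unique keys and normalization

This is a favorite strategy in relational databases, but Hive has no such concept. Hive can store denormalized data using complex types such as array, map, and struct. This avoids one-to-many join relationships and speeds up I/O. You do, however, pay the penalty of denormalization: data duplication and a greater risk of inconsistent data.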
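A minimal sketch of a denormalized table, assuming a hypothetical employees table (all column names are illustrative). The one-to-many relationships that a normalized schema would place in separate tables are stored inline as collection types.

```sql
-- Hypothetical table: subordinates and deductions live in the row itself
-- instead of in separate child tables joined by key.
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);

-- Elements are read directly, with no join.
SELECT name,
       subordinates[0],
       deductions['Federal Taxes'],
       address.city
FROM employees;
```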

3. Making multiple passes over the same data

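Written as two separate statements, each one scans history in full:

insert overwrite table sales
select * from history where action='purchased';

insert overwrite table credits
select * from history where action='returned';

Hive's multi-insert form scans history once and fills both tables:

from history
insert overwrite table sales select * where action='purchased'
insert overwrite table credits select * where action='returned';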

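4. The case for partitioning every table

To keep a failed job from wiping out data, write into a partition of the intermediate table when inserting, for example table1 partition(day=20120102). The cost is that these intermediate partitions have to be dropped afterwards.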
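One way this could be done is sketched below. The column layout of table1 and the use of supply as its input are assumptions made for illustration; only the table1 partition(day=20120102) part comes from the text above.

```sql
-- Assumed layout for the intermediate table.
CREATE TABLE table1 (id INT, part STRING, quantity INT)
PARTITIONED BY (day INT);

-- Each day's run overwrites only its own partition, so a failed or
-- re-run job for one day does not touch other days' intermediate data.
INSERT OVERWRITE TABLE table1 PARTITION (day=20120102)
SELECT id, part, quantity
FROM supply
WHERE day = 20120102;

-- Once downstream steps have consumed it, drop the intermediate partition.
ALTER TABLE table1 DROP IF EXISTS PARTITION (day=20120102);
```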

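5. Bucketing table data storage

When a table has no obvious partition key, or when you want to reduce the load on the filesystem, you can use bucketing. Its advantage is that the number of files stays fixed as data is added. It also makes sampling easy and helps with some join operations.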
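A minimal bucketing sketch, assuming hypothetical weblog and raw_logs tables and an arbitrary bucket count of 96. The hive.enforce.bucketing setting applies to older Hive releases; in Hive 2.x and later bucketing is always enforced and the property no longer exists, so treat that line as version-dependent.

```sql
-- Hypothetical table: rows are hashed on user_id into a fixed 96 buckets,
-- so the file count per partition does not grow with the data.
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;

-- Older Hive only (assumption about the reader's version).
SET hive.enforce.bucketing = true;

-- raw_logs is an assumed source table with the same columns plus dt.
INSERT OVERWRITE TABLE weblog PARTITION (dt='2012-01-02')
SELECT user_id, url, source_ip
FROM raw_logs
WHERE dt = '2012-01-02';

-- Sampling reads a single bucket rather than the whole table.
SELECT * FROM weblog TABLESAMPLE(BUCKET 1 OUT OF 96 ON user_id);
```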