Hive分区/分桶

最新推荐文章于 2023-10-27 11:34:13 发布

弗瑞得姆

最新推荐文章于 2023-10-27 11:34:13 发布

阅读量193

点赞数

文章标签： hive 大数据

本文链接：https://blog.youkuaiyun.com/aiyin9511/article/details/104389778

版权

本文介绍了Hive中的分区表和分桶表。分区表通过指定分区字段（如国家、日期等）减少全局扫描，提高查询效率。创建分区表后，使用`load data`命令导入数据并进行查询。而分桶表则是根据特定字段将数据分成多个部分，通过哈希函数确定分桶位置，用于优化JOIN查询。创建分桶表时需开启分桶功能，并在导入数据时使用`insert+select`语句。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

分区表：

创建分区表

 	create table t_user_partition(id int, name string) partitioned by (country string)  row format delimited fields terminated by ",";

分区表数据导入

hadoop fs -put 不能导入分区表的数据

  load data local inpath '/root/hivedata/china.txt' into table t_user_partition partition(country ='china');

load data local inpath ‘/root/hivedata/china.txt’ into table t_user_partition partition(country =‘china’);

load data local inpath ‘/root/hivedata/china2.txt’ into table t_user_partition partition(country =‘china’);

load data local inpath ‘/root/hivedata/usa.txt’ into table t_user_partition partition(country =‘usa’);

分区表的查询使用（根据分区标识扫描指定的文件）

   select * from t_user_partition where country='usa';

总结：
- 分区表的出现时为了避免全局扫描表，提高查询效率
- 建表的时候根据数据的属性自己指定分区的字段（标识，比如国家日期省份）
- 分区字段不是表中真实字段虚拟字段（它的值只是分区的标识值）

  create table t_user_partition2(id int, name string) partitioned by (name string)  row format delimited fields terminated by ",";

语法错误

分区的字段不能是表中已经存有的字段否则编译出错

hive 中多分区

create table t_user_partition_duo(id int, name string) partitioned by (country string, province string) row format delimited fields terminated by ",";
加载多分区数据

  load data local inpath '/root/hivedata/china.txt' into table t_user_partition_duo partition(country ='china',province='beijing');

多分区查询

select * from t_user_partition_duo where country=“china”;

 select * from t_user_partition_duo where country="china" and province="shanghai";

hive 分桶表、分簇表

- clustered by (col_name)into xx buckets

解释：根据指定的字段把数据文件分成几个部分

根据谁分---->clustered by (col_name)

分成几个部分—> xx buckets

如何分—>默认的分桶规则。

hashfunc(col_name),

如果col_name是int类型，hashfunc(col_name)= col_name % xx buckets 取模

如果是其他类型String ，hashfunc(col_name)=对col_name取哈希值 % xx buckets 取模

分桶表的创建
```
create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno) 
into 4 buckets
row format delimited
fields terminated by ',';
```
分桶表的数据导入：

hadoop fs -put students.txt /user/hive/warehouse/itcast.db/stu_buck 错误不能满足数据被分开

load data local inpath '/root/hivedata/students.txt' into table stu_buck cluster(Sno); 错误语法就没有
分桶表的真正导入 insert+select

insert into table t1 values(1,allen);普通数据库

insert into table t1 select id,name from t2; 把select查询语句返回的结果作为数据插入的insert表中
- 首先需要开启分桶的功能

set hive.enforce.bucketing = true;

分为几桶需要指定

set mapreduce.job.reduces=4; ---->注意跟建表的时候分为几桶对应
分桶数据导入
- 创建中间临时表

 insert overwrite table stu_buck select * from student cluster by(Sno);

首先创建student

create table student (Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited
fields terminated by ‘,’;
导入student 数据

hadoop fs -put students.txt /user/hive/warehouse/itcast.db/student

- 最终导入分桶表数据

insert overwrite table stu_buck select * from student cluster by(Sno);

分桶表总结
- 分桶表是在文件的层面把数据划分的更加细致
- 分桶表定义需要指定根据那个字段分桶
- 分桶表分为几个桶最后自己设置的时候保持一致
- 分桶表的好处在于提高join查询效率减少笛卡尔（交叉相差）的数量

在这里插入图片描述