Hive: Bucketing Operations

Bucketing operations:

First, create the table:

create table t_stu(
sno int,
sname string,
sex string,
sage int,
sdept string
)
row format delimited 
fields terminated by ','
stored as textfile
;

Bucketed queries

select * from t_stu;


select * from t_stu cluster by (sno);

set mapreduce.job.reduces=4;

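-- cluster by handles both distribution to reducers and sorting; the sort column is the same as the distribution column, always ascending
select * from t_stu cluster by (sno);

-- distribute by names the distribution column and sort by names the sort column; the two may differ, and the sort order can be specified
select * from t_stu distribute by (sno) sort by (sage desc);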
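
As a rough illustration of how the two kinds of query differ, the Python sketch below simulates 4 reducers: each row goes to a reducer chosen from the distribution column, and sorting then happens inside each reducer only. The sample rows and the function name distribute_and_sort are made up for this example, and the rule "reducer = key % 4" is an assumed simplification of the real MapReduce partitioner.

```python
from collections import defaultdict

# Hypothetical sample rows: (sno, sname, sex, sage, sdept)
ROWS = [
    (1, "a", "M", 20, "CS"), (2, "b", "F", 19, "IS"),
    (3, "c", "F", 22, "MA"), (4, "d", "M", 21, "CS"),
    (5, "e", "M", 18, "IS"), (6, "f", "F", 23, "MA"),
    (7, "g", "M", 19, "CS"), (8, "h", "F", 20, "IS"),
]


def distribute_and_sort(rows, num_reducers, dist_key, sort_key, descending=False):
    """Send each row to a reducer by dist_key, then sort within each reducer."""
    partitions = defaultdict(list)
    for row in rows:
        # Assumed simplification: reducer index = key modulo reducer count
        partitions[dist_key(row) % num_reducers].append(row)
    return {
        reducer: sorted(part, key=sort_key, reverse=descending)
        for reducer, part in sorted(partitions.items())
    }


# Like "cluster by (sno)": distribute and sort on the same column, ascending
clustered = distribute_and_sort(ROWS, 4, lambda r: r[0], lambda r: r[0])

# Like "distribute by (sno) sort by (sage desc)": different sort column and order
dist_sorted = distribute_and_sort(ROWS, 4, lambda r: r[0], lambda r: r[3], descending=True)

for reducer in clustered:
    print(reducer,
          [r[0] for r in clustered[reducer]],
          [r[0] for r in dist_sorted[reducer]])
```

In both cases the rows are ordered only within each reducer's output, not across the whole result.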

*************************************************************

Create a bucketed table and load data

create table if not exists buc1(
sno int,
sname string,
sex string,
sage int,
sdept string
)
clustered by (sno) sorted by (sage desc) into 4 buckets
row format delimited
fields terminated by ','
stored as textfile
;

Load the data with load (data loaded this way does not show any bucketing)

load data local inpath '/data/students.txt' into table buc1;

Loading data into a bucketed table:

**************************
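Step 1: create an intermediate (staging) table in Hive and load the data into it.
Step 2: populate the bucketed table by querying the staging table. Bucketing works by hashing the bucketing column and writing each row to the corresponding file, so inserting into a bucketed table necessarily runs a MapReduce job. This is why a bucketed table can, in practice, only be loaded by inserting the result set of a query.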

**************************
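
To make the hashing step concrete, here is a minimal Python sketch of how a row could be mapped to one of 4 buckets. It assumes the classic Hive behaviour in which an int column hashes to its own value and the bucket is (hash & Integer.MAX_VALUE) % number of buckets; newer Hive versions may use a different hash function, so the actual file assignment can differ. The sno values and the function names bucket_for and assign_buckets are hypothetical.

```python
from collections import defaultdict

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def bucket_for(sno, num_buckets):
    """Return the bucket index for an int bucketing column.

    Assumption: the int hashes to itself, and the sign bit is masked off
    before taking the modulo.
    """
    return (sno & INT_MAX) % num_buckets


def assign_buckets(snos, num_buckets):
    """Group sno values by the bucket they would land in."""
    buckets = defaultdict(list)
    for sno in snos:
        buckets[bucket_for(sno, num_buckets)].append(sno)
    return dict(buckets)


# Hypothetical student numbers, not taken from students.txt
SAMPLE_SNOS = [95001, 95002, 95003, 95004, 95005, 95006, 95007, 95008]

for bucket, snos in sorted(assign_buckets(SAMPLE_SNOS, 4).items()):
    print(f"bucket {bucket}: {snos}")
```

A plain load never runs this step, which is consistent with the note that load does not produce a bucketed result.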

First, create the table:

create table if not exists buc2(
sno int,
sname string,
sex string,
sage int,
sdept string
)
clustered by (sno) sorted by (sage desc) into 4 buckets
row format delimited 
fields terminated by ','
stored as textfile
;

Load the data

(Note: the number of reducers should be set to match the number of buckets)
set mapreduce.job.reduces=4;


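-- distribute on the bucketing column and sort to match the table's "sorted by (sage desc)" declaration;
-- cluster by (sno) would sort each bucket by sno instead
insert into table buc2
select * from t_stu
distribute by (sno) sort by (sage desc)
;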