Hive Partitioned Tables, Grouped Tables, and Bucketed Tables

请大佬带带我

Tags: hive hadoop

Article link: https://blog.youkuaiyun.com/weixin_45967421/article/details/109114557

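This article covers the various ways of creating tables in Hive, including managed (internal) tables, external tables, partitioned tables, and bucketed tables, along with their characteristics. It emphasizes the role of partitioned tables in avoiding full table scans and improving query efficiency, and the value of bucketed tables for improving JOIN performance. It also explains how to add and drop partitions and how to insert data into partitioned and bucketed tables, and finally covers how to load data into Hive.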

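The first common way to create a base table (a syntax template; clauses marked optional may be omitted):
create [EXTERNAL] table vv_stat_fact
(
userid string,
stat_date string,
tryvv int,
sucvv int,
ptime float
)
PARTITIONED BY (dt string) -- optional; makes this a partitioned table
clustered by (userid) into 3000 buckets -- optional; bucketing
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- column delimiter; should match the data files (if omitted, Hive falls back to its default delimiter '\001')
STORED AS rcfile -- optional; file storage format, defaults to textfile
location '/testdata/'; -- optional; HDFS storage path for the table. Files already in that path are picked up automatically; defaults to a directory under Hive's warehouse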

Table creation 1: everything uses the default configuration.
CREATE TABLE emp2(
id string,
name string,
job string,
mgr string,
hiredate date,
sal double,
comm double,
deptid string)
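ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
If no storage path is specified, the table is stored on HDFS under the warehouse directory by default, here /user/hive/warehouse/shujia/emp2.
Hive is schema-on-read: the data format is validated only when the data is queried, and is ignored when the data is loaded.
This is why the delimiter must be specified when the table is created.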
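A minimal sketch of what schema-on-read means in practice, assuming the emp2 table above. The local file path is hypothetical and not from the original article.

```sql
-- '/tmp/emp_bad.csv' is a hypothetical local file whose fields may not match the column types
-- The load succeeds regardless of the file content: Hive only copies the file into the table directory
load data local inpath '/tmp/emp_bad.csv' into table emp2;

-- The format is only interpreted here, at query time;
-- a field that cannot be parsed as the declared type (for example text in sal) is returned as NULL
select id, name, sal from emp2;
```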

CREATE TABLE emp(
id string,
name string,
job string,
mgr string,
hiredate date,
sal double,
comm double,
deptid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' -- column delimiter
LOCATION '/hivedata/'; -- HDFS storage path for the table; this can only be ...

Table creation 2:
Specify the storage file format
CREATE TABLE emp_rc(
id string,
name string,
job string,
mgr string,
hiredate date,
sal double,
comm double,
deptid string)
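ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS rcfile;

Note: a table stored as rcfile cannot have plain-text data files loaded into it directly.
It can only be populated with data from another table.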
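As a sketch of populating the rcfile table from another table, assuming the text-format emp table defined earlier already holds data and has the same columns:

```sql
-- Read rows from the text-format table and let Hive rewrite them in rcfile format
insert into table emp_rc select * from emp;

-- Use overwrite instead of into to replace the existing contents of emp_rc
insert overwrite table emp_rc select * from emp;
```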

Table creation 3:
Create a table and fill it from another table:
create table emp_r as <query>
create table emp_r as select job,avg(sal) as s from emp_rc group by job order by s desc;

External table:
CREATE EXTERNAL TABLE emp_ex(
id string,
name string,
job string,
mgr string,
hiredate date,
sal double,
comm double,
deptid string)
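ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Difference between external tables and managed (internal) tables:
When an external table is dropped, the table definition is removed, but the data on HDFS is not deleted.
Dropping a regular (managed) table deletes both the table and its data.
In both cases the metadata stored in MySQL (the metastore) is deleted.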
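A short sketch of how the difference can be observed, assuming the emp_ex table above exists:

```sql
-- The "Table Type" row of the output shows EXTERNAL_TABLE for emp_ex
-- (a managed table shows MANAGED_TABLE)
describe formatted emp_ex;

-- Only the metadata is removed; the files stay in the table's HDFS directory
drop table emp_ex;
```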

Creating a table:
Copy the table structure without loading any data: like
create table emp_l like emp2;

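Partitioned tables: a table must be declared as partitioned when it is created; a table created without partitions cannot be changed into a partitioned table later.
Partitioning avoids full table scans and improves query efficiency. Specifying the partition in the where clause lets Hive read only the matching partitions (partition pruning).
Typical use: fact tables, partitioned by day or by region. Generally use no more than three partition levels.
With multi-level partitioning, the data must be stored in the last-level directory.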

CREATE TABLE emp_p(
id string,
name string,
job string,
mgr string,
hiredate date,
sal double,
comm double,
deptid string)
partitioned by (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Adding and dropping partitions:
alter table test_table add partition (pt=xxxx);
alter table test_table drop if exists partition(…);
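A concrete sketch of the two statements above, applied to the emp_p table defined earlier. The partition value is chosen only for illustration.

```sql
-- Add a partition; if not exists avoids an error when it is already there
alter table emp_p add if not exists partition (dt='2020-10-17');

-- List the partitions the table currently has
show partitions emp_p;

-- Drop the partition again; for a managed table this also deletes the partition's data
alter table emp_p drop if exists partition (dt='2020-10-17');
```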

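When inserting data into a partitioned table, the target partition must be named explicitly.
Omitting it raises an error. If the partition does not exist yet it is created automatically; if it already exists it is used as is.
insert into table emp_p partition(dt='2020-10-16') select * from emp_0;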
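The effect of partition pruning can be sketched as follows, assuming emp_p holds data in the dt='2020-10-16' partition:

```sql
-- The filter on the partition column dt means only the dt=2020-10-16 directory is read
select job, avg(sal) as avg_sal
from emp_p
where dt = '2020-10-16'
group by job;

-- Without a filter on dt, every partition of the table is scanned
select job, avg(sal) as avg_sal
from emp_p
group by job;
```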

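Bucketed tables:
Bucketing divides the files further.
Purpose: improve query efficiency, especially for join queries.
How many buckets should be used? This can generally be judged from an estimate of how many groups the bucketing column will split the data into.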
CREATE TABLE emp_b_p(
id string,
name string,
job string,
mgr string,
hiredate date,
sal double,
comm double,
deptid string)
partitioned by (dt string)
clustered by (job) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
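Before inserting, run set hive.enforce.bucketing = true; to enable bucketing enforcement (in Hive 2.x and later this property was removed and bucketing is always enforced):
insert into table emp_b_p partition(dt='2020-10-16') select * from emp_0;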

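select * from emp_b_p tablesample(bucket 1 out of 2 on job);

The column after on is the column to sample on. For Hive to read only the matching bucket files it must be the table's bucketing column, which is job for emp_b_p (not id).
10 is the dividend (the number of buckets in the table) and 2 is the divisor: 10 / 2 = 5, so data from 5 buckets is read in total.
1 is the bucket to start from, and every 2nd bucket after it is taken, so the data comes from buckets 1, 3, 5, 7 and 9.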
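A few more variations of the sampling clause, as a sketch. It assumes emp_b_p is clustered by (job) into 10 buckets as defined above, and the comments follow the behaviour described in the Hive sampling documentation.

```sql
-- y = 10: each sample bucket is exactly one of the 10 table buckets; this returns the 3rd
select * from emp_b_p tablesample(bucket 3 out of 10 on job);

-- y = 5: each sample bucket covers 10 / 5 = 2 table buckets; this returns the 2nd and the 7th
select * from emp_b_p tablesample(bucket 2 out of 5 on job);

-- Sampling on rand() (or on any column other than the bucketing column)
-- still returns a sample, but Hive has to scan the whole table to produce it
select * from emp_b_p tablesample(bucket 1 out of 10 on rand());
```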

Loading data into Hive:
load data [local] inpath '<absolute path>' into table <table name> [partition(<partition spec>)]

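Example: load a local file into Hive
load data local inpath '/usr/local/soft/data/empldata.csv' into table emp_p partition(dt='2020-10-18');
Example: load data from an HDFS path into Hive
Note: when loading from HDFS, the data is moved into the table's directory in Hive. It is a move, not a copy.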
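A minimal sketch of the HDFS case, using the emp_p table from above. The HDFS paths and the partition value are hypothetical and not from the original article.

```sql
-- '/tmp/hive_input/empldata.csv' is a hypothetical HDFS path.
-- Without the local keyword the path is resolved on HDFS, and the file is moved
-- (not copied) into the partition directory, so it disappears from /tmp/hive_input/
load data inpath '/tmp/hive_input/empldata.csv' into table emp_p partition(dt='2020-10-19');

-- Adding overwrite replaces whatever the target partition already contains
load data inpath '/tmp/hive_input/empldata2.csv' overwrite into table emp_p partition(dt='2020-10-19');
```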