Hive and Partitions

Mr.Sheep-Y

Category: Hive  Tags: Hive partitions

Article link: https://blog.youkuaiyun.com/weixin_41736752/article/details/102463008

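This article covers Hive partitioning in detail: the difference between internal and external tables, ways to load data, Hive's two modes, and the basic partition operations. It focuses on static and dynamic partitions, including how to create, load, view, modify, and drop partitions, and notes what to watch for when using dynamic partitions in strict mode.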

1. Creating tables and adding data

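-- Create an external table
create external table if not exists tablename(
id bigint,name string,age int)
-- partition columns
partitioned by (sex string,class string)
-- row format: fields in the data files are separated by a space
row format delimited fields terminated by ' '
-- storage file format (optional; textfile is the default)
stored as textfile;


-- Create an internal (managed) table
create table if not.....same as above, without the external keyword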

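1.1 What is the difference between internal and external tables?

Dropping an internal (managed) table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left in place.
Changes to an internal table made through Hive are reflected in the metadata directly; when the structure or partitions of an external table are changed outside Hive (for example by adding partition directories on HDFS), the metadata has to be repaired, for instance with msck repair table.

In short: drop an external table and its files are still in HDFS; drop an internal table and its files in HDFS are deleted as well.

Create an external table

create external table if not exists tablename()......

Create an internal table

create table if not exists tablename()......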
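
The difference in drop behavior described above can be checked with a short session. This is a minimal sketch, not a definitive procedure: the table names managed_demo and external_demo and the location path are hypothetical, chosen only for illustration.

```sql
-- Hypothetical demo tables; the location path is an assumption
create table if not exists managed_demo (id bigint, name string);
create external table if not exists external_demo (id bigint, name string)
location '/tmp/hive_demo/external_demo';

-- The "Table Type" field reports MANAGED_TABLE or EXTERNAL_TABLE
describe formatted managed_demo;
describe formatted external_demo;

-- Removes the metadata and the table's data directory
drop table managed_demo;
-- Removes the metadata only; files under /tmp/hive_demo/external_demo stay on HDFS
drop table external_demo;
```

After the second drop, hadoop fs -ls on the external location should still list the data files.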

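1.2 Loading data (including HDFS)

Loading from files

-- Load a file from the local file system
load data local inpath 'file_path/file_name' into table tablename;
-- Load a file that is already on HDFS (the file is moved into the table directory)
load data inpath 'hdfs_file_path/file_name' into table tablename;

Inserting from another table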

insert into table tableone partition(sex='male')
select id,name,age from tabletwo where sex='male';

Uploading to HDFS

# Create a directory
hadoop fs -mkdir /home/hive....
# Upload a file
hadoop fs -put local_file_path hdfs_target_path
# Example:
hadoop fs -put ./hive/one /home/hive
hadoop fs -put ./hive/one /home/hive
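
Files that are put straight into a partition directory on HDFS are not visible to queries until the partition is registered in the metastore. The sketch below shows two ways to do that. It assumes the external table tablename (partitioned by sex and class) from section 1; the table location /home/hive/tablename and the class value 'c1' are hypothetical.

```sql
-- Assumption: files were uploaded with hadoop fs -put into
--   /home/hive/tablename/sex=male/class=c1/

-- Option 1: register one partition explicitly and point it at the directory
alter table tablename add if not exists partition (sex='male', class='c1')
location '/home/hive/tablename/sex=male/class=c1';

-- Option 2: scan the table directory and add any partitions missing from the metastore
msck repair table tablename;

-- Confirm that the partition is now known to Hive
show partitions tablename;
```

Option 2 relies on the directories following the column=value naming convention.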

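2. Hive's two modes

Strict mode and non-strict mode

Non-strict is the default in older releases (Hive 0.x and 1.x); later releases may differ, so check the current value first.

2.1 Viewing and changing the mode

-- Show the current value
set hive.mapred.mode;
-- Switch to strict mode
set hive.mapred.mode=strict;
-- Switch to non-strict mode
set hive.mapred.mode=nonstrict;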

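2.2 Strict mode

Strict mode imposes three restrictions:
1. A query on a partitioned table must filter on a partition column in its where clause.
2. order by must be accompanied by limit.
3. Cartesian products (joins without a join condition) are not allowed.

Note: partition naming rule
A partition is identified as partition_column=partition_value, for example sex=nan.
When Hive is in strict mode, every query on a partitioned table must filter on a partition column in the where clause.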
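
The three restrictions can be illustrated with pairs of rejected and accepted queries. This is a sketch under assumptions: tablename is the partitioned table from section 1, and t1 and t2 are hypothetical non-partitioned tables. Newer Hive releases may expose these checks as separate hive.strict.checks.* properties instead, so treat the set statement as version-dependent.

```sql
set hive.mapred.mode=strict;

-- Rejected: no filter on a partition column of a partitioned table
-- select id, name from tablename;

-- Accepted: the where clause restricts the partition column sex
select id, name from tablename where sex='male';

-- Rejected: order by without limit
-- select id, name from tablename where sex='male' order by id;

-- Accepted: order by together with limit
select id, name from tablename where sex='male' order by id limit 10;

-- Rejected: Cartesian product (join with no join condition)
-- select a.id, b.id from t1 a join t2 b;

-- Accepted: join with an explicit condition
select a.id, b.id from t1 a join t2 b on a.id = b.id;
```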

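3. Partitions

3.1 The concept of partitions

Why create partitions: as a table grows, a Hive select generally scans the whole table (a brute-force scan), spending a lot of time on unnecessary work. Often only a subset of the data is of interest, so the partition concept was introduced at table-creation time to limit the scan to that subset.
(1) Hive partitions versus MySQL partitions: MySQL partitions a table on one of the table's own columns, whereas a Hive partition column is declared separately in partitioned by and is not one of the columns stored in the data files.
(2) How to partition: this depends entirely on the business scenario. Typical choices are id, year, month, day, gender, or age group; ideally the key spreads the data evenly across files. A poor partitioning choice directly slows queries down.

(3) Partition details:
1. A table can have one or more partitions; each partition exists as its own directory under the table's directory.
2. Table and column names are case-insensitive.
3. A partition appears as a column in the table schema and is shown by describe tablename (it is effectively a pseudo-column), but the column holds no actual data in the files; it only identifies the partition.
4. Partitions can be single-level, two-level, three-level, or deeper, depending on how many partition columns are declared.
5. Data can be written with dynamic, static, or mixed partitioning:
Dynamic partition: the partition values are taken from the data at insert time.
Static partition: the partition value is given explicitly in the statement.
Mixed partition: static and dynamic partition columns are combined in one insert.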
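
The points about directories and pseudo-columns can be seen on a concrete table. This is a minimal sketch, assuming the table tablename (partitioned by sex and class) from section 1; the class value 'c1' and the directory names in the comments are hypothetical.

```sql
-- Each partition is a nested directory under the table directory, e.g.
--   .../tablename/sex=male/class=c1/
--   .../tablename/sex=female/class=c1/

-- The partition columns are listed after the data columns,
-- and again under "# Partition Information"
describe tablename;

-- A partition column can be selected and filtered like an ordinary column;
-- its value comes from the directory name, not from the data file
select id, name, sex, class
from tablename
where sex='male' and class='c1';
```

Because the filter names partition columns, Hive only has to read the matching directories rather than the whole table.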

3.2 Basic partition operations

Creating partitions

-- At table creation time
partitioned by (sex string,class string......)
-- Example:
create external table if not exists tablename(
id bigint,name string,age int)
partitioned by (sex string,class string)
row format delimited fields terminated by ' '
stored as textfile;

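The table above has two partition columns, and load, describe, add, and rename generally need a value for each of them, so the statements below specify both sex and class; the class value 'c1' is an illustrative placeholder.

Loading data

load data local inpath 'file_path/file_name' into table tablename partition(sex='male',class='c1');

Listing partitions

show partitions tablename;

Describing one partition

describe formatted tablename partition(sex='male',class='c1');

Adding a partition

alter table tablename add partition(sex='male',class='c1');

Adding several partitions

alter table tablename add partition(sex='male',class='c1') partition(sex='female',class='c1');

Renaming a partition

alter table tablename partition(sex='female',class='c1') rename to partition(sex='nomale',class='c1');

Dropping partitions

-- A partial specification is allowed here: this drops every partition with sex='male'
alter table tablename drop partition(sex='male');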

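3.3 Static and dynamic partitions

3.3.1 Inserting data statically

-- Example (the table has two partition columns; class='c1' is an illustrative placeholder):
load  data inpath '/home/one' into table tablename partition(sex='male',class='c1');

When the partition column is given a fixed value in the statement (sex='male'), as here, the insert is called static.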

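3.3.2 Inserting data dynamically

Use dynamic partitions when the partition values are not known in advance, for example age.

Create a table for dynamic partitioning

create table if not exists tableone(
id bigint,name string,sex string)
partitioned by (age int)
row format delimited fields terminated by ' '
stored as textfile;

Adding partitions one by one as age=1, age=2, ... would be tedious, and the number of partitions is not known in advance.

So use a dynamic insert, with the partition column last in the select list: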

insert into table tableone partition(age)
select id,name,sex,age from tabletwo;

Note: if this error appears

 FAILED: SemanticException [Error 10096]: Dynamic partition strict mode 
 requires at least one static partition column. To turn this off 
 set hive.exec.dynamic.partition.mode=nonstrict

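Cause: in strict dynamic-partition mode, at least one partition column must be given a static value, so a fully dynamic insert is rejected.

Set the dynamic partition mode to non-strict

set hive.exec.dynamic.partition.mode=nonstrict;

Make sure dynamic partitioning is enabled (it is on by default since Hive 0.9.0)

set hive.exec.dynamic.partition=true;

Run the dynamic insert again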

insert into table tableone partition(age)
select id,name,sex,age from tabletwo;
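
Put together, a dynamic-partition session might look like the sketch below. It assumes a target table tableone partitioned by age and a source table tabletwo with columns id, name, sex, and age. The two limit settings are optional; the values shown are, to my knowledge, the documented defaults, so verify them for your version.

```sql
-- Enable dynamic partitioning and allow every partition column to be dynamic
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Optional caps on how many partitions one statement may create
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- The dynamic partition column (age) must come last in the select list
insert into table tableone partition(age)
select id,name,sex,age from tabletwo;

-- One partition per distinct age value should now be listed
show partitions tableone;
```

If the source contains more distinct age values than the caps allow, the insert fails, so raise the limits deliberately rather than by default.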

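Mixed partitions

-- Here tableone is assumed to be partitioned by (sex string, age int) with data columns id and name;
-- the static column (sex) comes first, the dynamic column (age) last
insert into table tableone partition(sex='male',age)
select id,name,age from tabletwo where sex='male';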
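
A self-contained variant of the mixed insert is sketched below. The target table mixed_demo is hypothetical and exists only to make the schema explicit; tabletwo is assumed to have columns id, name, sex, and age.

```sql
-- Hypothetical target table with one static-capable and one dynamic partition column
create table if not exists mixed_demo(
id bigint,name string)
partitioned by (sex string,age int)
row format delimited fields terminated by ' '
stored as textfile;

-- sex is fixed in the statement (static); age is read from the last select column (dynamic)
insert into table mixed_demo partition(sex='male',age)
select id,name,age from tabletwo where sex='male';

-- Partitions appear as sex=male/age=<value>
show partitions mixed_demo;
```

Because one partition column is static, this form is accepted even when hive.exec.dynamic.partition.mode is strict.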