Big Data Basics: HIVE (Part 2): Hive Partitions, Buckets, Views, and More (a Must-Read for Beginners)

Latest recommended article published 2023-12-21 14:23:50

Clozzz


Category: HIVE   Tags: hive, big data

Article link: https://blog.youkuaiyun.com/Clozzz/article/details/106675653

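HIVE Partitions

Partitioning is mainly used to improve query performance.
The values of the partition columns split the table into segments (directories).
In queries, partition columns are used much like regular columns.
At query time, Hive automatically skips the partitions a query does not need, which improves performance.
Partitions come in two kinds: static and dynamic.

HIVE partition operations

Static partitions: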

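-- "fields terminated by" sets how columns (fields) are delimited;
-- "collection items terminated by" and "map keys terminated by" set how collections and maps are delimited
-- (they only matter for array/map columns, which would normally use delimiters different from the field delimiter)
create table mypart(
	userid int,
	username string,
	gender string,
	score int
)
partitioned by (year int,month int)
row format delimited fields terminated by ','
collection items terminated by ','
map keys terminated by ',';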


-- static partition operation (add partitions)
alter table mypart add partition (year=2019,month=3) partition (year=2019,month=4);
-- static partition operation (drop a partition)
alter table mypart drop partition(year=2019,month=4);

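-- load data into a partition
-- insert into: append to the partition
insert into table mypart partition(year=2019,month=3)
select userid,username,gender,score from userinfos where gender='male';
-- insert overwrite: replace the contents of the partition
insert overwrite table mypart partition(year=2019,month=4)
select userid,username,gender,score from userinfos where gender='female';
-- With static partitioning, the partition value comes from the partition(...) clause, not from the data:
-- every row written by the statement is stored under the named partition.
-- Put simply, if a table were partitioned by gender and 'female' rows were inserted into
-- partition(gender='male'), those rows would be read back with gender = 'male'.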
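
The partition pruning described at the start of this section can be checked directly. A minimal sketch, assuming the mypart table above has been created and populated:

```sql
-- list the partitions that currently exist for the table
show partitions mypart;

-- filtering on the partition columns lets Hive read only the year=2019/month=3 directory
select userid, username, score
from mypart
where year = 2019 and month = 3;

-- the query plan shows how Hive intends to scan the table
explain
select count(*) from mypart where year = 2019 and month = 3;
```

Queries that leave out the partition columns in the where clause still work, but they have to scan every partition.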

Dynamic partitions:
Before using dynamic partitioning, enable it with the following properties.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Setting up dynamic partitioning (same table data as above)

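-- Assumes the earlier mypart table has been dropped, since the name is reused.
-- The partition column is declared with its type in "partitioned by" and must not appear in the column list.
create table mypart(
	userid int,
	username string,
	score int
)
partitioned by (gender string)
row format delimited fields terminated by ',';
-- insert data into the table; the partitions are created dynamically
-- (the dynamic partition column must come last in the select list)
insert into table mypart partition(gender)
select userid,username,score,gender from userinfos;
-- Dynamic partitioning does not need the partition values to be named one by one:
-- rows are partitioned by the partition column, and one partition is created per distinct value.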
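
Static and dynamic partition columns can also be mixed in one insert, as long as the static ones come first in the partition clause. A sketch under assumptions: the table mypart_ym, the source table userinfos_dated, and its reg_month column are hypothetical names used only for illustration.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- hypothetical target table with two partition columns
create table if not exists mypart_ym(
	userid int,
	username string,
	gender string,
	score int
)
partitioned by (year int, month int);

-- year is fixed (static); month is taken from the last select column (dynamic)
insert into table mypart_ym partition(year=2019, month)
select userid, username, gender, score, reg_month
from userinfos_dated;
```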

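If the statements above fail with an error on some machines, the likely cause is that the limit on the number of dynamic partitions was exceeded. In that case, raise the limits manually.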

set hive.exec.max.created.files=600000;
set hive.exec.max.dynamic.partitions.pernode=600000;
set hive.exec.max.dynamic.partitions=600000;

Bulk loading data into HIVE

load data local inpath '/opt/mydata.csv' overwrite into table mydemo.customs;
-- Without "local", the file is taken from HDFS; with "local", it is taken from the local Linux filesystem.
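
load data can also target a single partition. A minimal sketch, assuming the year/month-partitioned mypart table from the static partition section and files whose layout already matches the table (load data does not transform the data); the HDFS path /tmp/mydata.csv is a hypothetical example.

```sql
-- copy a file from the local Linux filesystem and append it to one partition
load data local inpath '/opt/mydata.csv'
into table mypart partition (year=2019, month=3);

-- move a file that is already on HDFS into the table, replacing the existing data
load data inpath '/tmp/mydata.csv'
overwrite into table mydemo.customs;
```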

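HIVE Buckets:

Each bucket corresponds to a file in HDFS.
Bucketing makes query processing more efficient.
It makes sampling more efficient.
Rows are assigned to buckets by hashing the bucketing column.
Bucketing is always dynamic; there is no static bucketing.

set hive.enforce.bucketing=true;

Defining buckets

clustered by (userid) into 2 buckets
-- The bucketing column must be an existing column of the table; the number of buckets is best chosen as a power of 2.

Note: a bucketed table must be populated with insert!!!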
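
Put together, the clause and the insert rule above look roughly like this. A sketch, assuming userinfos has the columns used earlier in this article; the table name userinfos_bucketed is hypothetical.

```sql
-- setting from this article; it makes insert honor the declared number of buckets
set hive.enforce.bucketing=true;

-- hypothetical bucketed copy of userinfos
create table if not exists userinfos_bucketed(
	userid int,
	username string,
	gender string,
	score int
)
clustered by (userid) into 2 buckets
row format delimited fields terminated by ',';

-- populate with insert so that rows are hashed on userid into the 2 bucket files
insert overwrite table userinfos_bucketed
select userid, username, gender, score from userinfos;
```

After the insert, the table directory on HDFS should contain one file per bucket.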

Bucket sampling
Random sampling over whole rows

select * from table_name tablesample(bucket 3 out of 32 on rand()) s;

Random sampling on a specified column

select * from table_name tablesample(bucket 3 out of 32 on userid) s;

Random sampling based on block size

select * from table_name tablesample(10 percent) s;
select * from table_name tablesample(1M) s;
select * from table_name tablesample(10 rows) s;
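
Sampling is most efficient when the sampling column is the column the table is bucketed on, because Hive can then read whole bucket files instead of scanning the table. A sketch, assuming a table bucketed on userid into 2 buckets; the name userinfos_bucketed is hypothetical.

```sql
-- take the first of the 2 buckets, i.e. roughly half of the userid values
select * from userinfos_bucketed tablesample(bucket 1 out of 2 on userid) s;

-- compare the sample size with the full table
select count(*) from userinfos_bucketed tablesample(bucket 1 out of 2 on userid) s;
select count(*) from userinfos_bucketed;
```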

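HIVE Views

View overview
A view simplifies the logical structure of a query by hiding subqueries, joins, and functions.
It is a virtual table that selects data from real tables.
Only the definition is saved; no data is stored.
If the underlying table is dropped or changed incompatibly, querying the view fails.
Views are read-only; data cannot be inserted or loaded into them.
Use cases
Exposing only specific columns to users, to protect data privacy.
Scenarios with complex query statements.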

HIVE view operations

-- create a view
create view view_name as select * from userinfos;
-- list views
show tables;     -- from Hive 2.2.0 on, use show views
-- drop a view
drop view view_name;
-- change view properties
alter view view_name set tblproperties('comment' = 'This is a view');
-- change the view definition
alter view view_name as select * from userinfos;
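
For the privacy use case mentioned above, a view can expose only some of the columns. A minimal sketch, assuming userinfos has the columns used earlier in this article; the view name userinfos_public is hypothetical.

```sql
-- expose only the id and name, hiding gender and score
create view if not exists userinfos_public as
select userid, username from userinfos;

-- the view is queried like a table
select * from userinfos_public limit 10;

-- inspect the stored definition, then clean up
describe formatted userinfos_public;
drop view if exists userinfos_public;
```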

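HIVE Lateral Views

A lateral view is usually combined with a table-generating function and joins each input row with the rows the function outputs.
outer keyword: a result row is still generated even when the function's output is empty.

select name,work_place,loc from employee lateral view outer explode(split(null,',')) a as loc;

Multiple levels are supported:

select name,wps,skill,score from employee 
lateral view explode(work_place) work_place_single as wps
lateral view explode(skills_score) sks as skill,score;

Lateral views are typically used to normalize rows or to parse JSON.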
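
For the JSON case, one option is the built-in table-generating function json_tuple. A sketch under assumptions: raw_events is a hypothetical table with a single string column line, each row holding a JSON object with name and age keys.

```sql
-- json_tuple extracts several top-level keys in one pass;
-- the lateral view joins the extracted values back to the source row
select e.line, j.name, j.age
from raw_events e
lateral view json_tuple(e.line, 'name', 'age') j as name, age;
```

json_tuple returns its values as strings, so numeric fields may need a cast before arithmetic.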

Lateral view example
Table data:
[image]

create external table mytest(name string,likes string)
row format delimited fields terminated by ','
location '/mydemo/0610';

Expand the data on "|":

select explode(split(likes,'\\|')) from mytest;

Result:
[image]
Build the lateral view:

select name,myview from mytest lateral view outer explode(split(likes,'\\|')) a as myview;

Result:
[image]
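
Once the rows are expanded, the exploded column can be used like any other column, for example in an aggregation. A small sketch on the same mytest table; the aliases like_item and cnt are only illustrative.

```sql
-- count how many people share each "like"
select myview as like_item, count(*) as cnt
from mytest
lateral view explode(split(likes,'\\|')) a as myview
group by myview
order by cnt desc;
```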