数据仓库Hive入门( 二 )

最新推荐文章于 2024-06-17 10:37:13 发布

原创最新推荐文章于 2024-06-17 10:37:13 发布 · 369 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#hive #big data #hadoop

Hadoop 专栏收录该内容

18 篇文章

订阅专栏

本文深入介绍了Hive数据仓库的基本概念，包括数据类型、DDL操作、内部表与外部表的区别，以及如何创建、管理和查询数据表。涵盖了分隔符设置、表结构定义、分区和分桶表的创建，以及数据加载和导出的方法。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

数据仓库Hive入门( 二 )

hive是将结构化文件映射为一张数据表, 通过sql操作数据的

hive数据类型除了支持sql的类型外还支持java类型且大小写不敏感

DDL

结构化文件存放的位置
内部表
结构化文件存储在hive指定的hdfs文件夹 /user/hive/warehouse
sql执行删除表操作时, 对应的hdfs文件夹也被删除, 数据清空

外部表
sql创建表时指定文件所在位置 create external localtion
sql执行删除表操作时, 对应的hdfs文件夹不会被删除, 数据还在

结构化文件通过分隔符辨别数据字段
默认分隔符
如果不指定,hive默认分隔符为’\001’ linux vi 编辑模式按 ctrl+V ctrl+A

指定分隔符
row format 表明自己指定分隔符
delimited 表明使用hive模拟热的分隔符类进行数据分割 (默认LazySimpleSerDe )
fields terminated by ‘,’ 字段通过’ , ’ 切割

创建表

#-- 语法结构  []为可选项
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

测试数据 user.txt

1,zhangsan,beijing
2,lisi,beijing
3,wangwu,beijing
4,kobe,shanghai
5,james,shanghai
6,living,shanghai
7,jjl,guangzhou
8,eason,guangzhou
9,edison,guangzhou

创建内部表默认分隔符为 ‘\001’

结构化文件需放在指定目录默认为hdfs: /user/hive/warehouse/db_name/tb_name/结构化文件

create table tb_user(id int,name string,city string);

创建外部表
- 外部表的位置为hdfs中的位置 hdfs:/user.txt
- hdfs: /user/hive/warehouse/db_name/ 中不会创建对应表的文件夹

create external table tb_user(id int,name string,city string) 
location '/user.txt'

创建内部表,指定分隔符为 ’ , ’

create table tb_user(id int,name string,city string) row format delimited
fields terminated by ','

创建内部表, 并且字段为复杂字段的, 主要是指定分隔符

#-- 数据 集合字段 
#-- zhangsan	beijing,shanghai,tianjin,hangzhou
#-- wangwu	shanghai,chengdu,wuhan,haerbin
create table tb_user(name string,citys array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';


#-- 数据 map字段
#-- 1,zhangsan,唱歌:非常喜欢-跳舞:喜欢-游泳:一般般  
#-- 2,lisi,打游戏:非常喜欢-篮球:不喜欢
create table tb_user(id int,name string,hobby map<string,string>)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

创建分区表

将上面的user.txt分为三个文件, 这里根据城市划分

 - 分区表是一个查询优化手段, 减少查询时的全局扫描
 - 分区的字段(province) 不能和表中的字段重复(id,name,city)
 - 分区的字段(province)是虚拟字段, 并不在真正的表数据中,结构化文件中, 它的数据来自partition指定
 - 分区的字段(province)可以用在sql中, 优化查询
 - local 指的是hiveserver所在服务器本地 
 - 去掉local的话 path为hdfs的位置,直接将该文件剪切到对应的目录

#-- 建表
create table tb_user(id int,name string,city string)
partitioned by(province string)

#-- 加载文件到数据表  在hive命令行 
load data local inpath '/root/data/beijing.txt' into table tb_user partition(province='beijing')
load data local inpath '/root/data/beijing.txt' into table tb_user partition(province='shanghai')
load data local inpath '/root/data/beijing.txt' into table tb_user partition(province='guangzhou')

#-- 全表查询
select * from tb_user where city = 'beijing'
#-- 分区查询
select * from tb_user where province = 'beijing'

创建分桶(分片分簇)表
- 分桶是指将表中数据按照指定字段分成指定的N个桶
- 分桶的字段必须是表中已有的字段
- 分桶需手动打开并设置桶的个数
- 分桶的好处优化join查询减少笛卡尔积
- 分桶规则默认为hash
  - 分桶字段为数值型 hash(xxx)=xxx xxx % N 取模取到几去哪个桶
  - 分桶字段为字符串类型 hash(xxx)=xxx.hashcode xxx.hashcode % N

#-- 开启分桶功能 默认是关闭的
set hive.enforce.bucketing = true;
#-- 指定分为几桶
set mapreduce.job.reduces=4;

#-- 创建分桶表
create table tb_user_buck(id int,name string,city string)
clustered by(city) 
into 4 buckets
row format delimited
fields terminated by ',';

#-- 创建一张临时表
create table tb_user(id int,name string,city string)
row format delimited
fields terminated by ',';

#-- 给临时表加入数据
hadoop fs -put user.txt /user/hive/warehouse/test.db/user

#-- 查询临时表 分桶插入最终分桶表 
insert overwrite table user_buck
select * from student cluster by(city)

LIKE 允许用户复制现有的表结构，但是不复制数据
- CREATE [EXTERNAL] TABLE [IF NOT EXISTS][db_name.] table_name LIKE existing_table;
STORED AS SEQUENCEFILE|TEXTFILE|RCFILE
如果文件数据是纯文本，可以使用 STORED AS TEXTFILE ( 默认 )
如果数据需要压缩, 使用 STORED AS SEQUENCEFILE(序列化)

修改表

增加分区

ALTER TABLE table_name ADD

PARTITION (dt=‘20170101’) location ‘/user/hadoop/warehouse/table_name/dt=20170101’;

ALTER TABLE table_name ADD

PARTITION (dt=‘2008-08-08’, country=‘us’) location ‘/path/to/us/part080808’

PARTITION (dt=‘2008-08-09’, country=‘us’) location ‘/path/to/us/part080809’;

删除分区

ALTER TABLE table_name DROP IF EXISTS PARTITION (dt=‘2008-08-08’);
ALTER TABLE table_name DROP IF EXISTS PARTITION (dt=‘2008-08-08’, country=‘us’);

修改分区

ALTER TABLE table_name PARTITION (dt=‘2008-08-08’) RENAME TO PARTITION (dt=‘20080808’);

添加列 add代表新增一个字段, 放在所有列后(分区列前) replace 表示替换表中所有字段

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name STRING);

修改列 test_change (a int, b int, c int);

ALTER TABLE test_change CHANGE a a1 INT; //修改 a 字段名 —> (a1 int,b int,c int)

table’s structure is: b int, a1 string, c int

ALTER TABLE test_change CHANGE a a1 STRING AFTER b; // (b int,a1 string,c int)

ALTER TABLE test_change CHANGE b b1 INT FIRST; // (b1 int,a int,c int)

ALTER TABLE table_name RENAME TO new_table_name

显示命令

显示所有库 show databases |schemas;
显示当前库的表 show tables;
显示表分区信息 show partitions table_name;
显示当前版本 hive 支持的所有方法 show functions;
查看表信息 desc formatted table_name;
查看数据库相关信息 describe database database_name;

DML

load
- local
  - local指的是hiveserver服务端所在服务器的文件系统
  - load local 上传本地文件时将文件复制到hdfs hive的对应路径
- filepath
  - 引用类型
    - 文件
      - 如果文件在本地则复制到对应的hive文件夹
      - 如果文件已经在hdfs中 , 则移动到对应的hive文件夹
    - 文件夹
      - 将该目录下的所有文件复制/移动到对应的hive文件夹中
- overwrite
  - 目标表中的内容被删除, 然后filepath中的文件添加到目标表中

#-- 在将数据加载到表中时，Hive 不会进行任何转换。加载操作是将数据文件移动到与 Hive
#-- 表对应的位置的纯复制/移动操作
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

insert

insert主要和select结合使用 , 将查询结果插入到表中
- 确保查询结果列与需要插入数据的表列数目一致
- 如果对应的列数据类型不一致,会自动转换,转换失败的数据为NULL

insert overwrite table stu_buck
select * from student;

多重插入
- 查询一次源表, 插入多个目标表

#-- 语法格式
from source_table
insert overwrite table tablename1 [partition (partcol1=val1,partclo2=val2)]
select_statement1
insert overwrite table tablename2 [partition (partcol1=val1,partclo2=val2)]
select_statement2

#-- 测试表
create table source_table (id int, name string) row format delimited fields terminated by ',';
create table test_insert1 (id int) row format delimited fields terminated by ',';
create table test_insert2 (name string) row format delimited fields terminated by ',';
#-- 插入sql
from source_table                     
insert overwrite table test_insert1 
select id
insert overwrite table test_insert2
select name;

#-- 普通插入：  分别扫描两次 
insert into table test_insert1 select id from source_table;
insert into table test_insert2 select name from source_table;

动态分区插入
- hive当前支持两个分区

在这里插入图片描述

#-- 需求：
#-- 将dynamic_partition_table中的数据按照时间(day)，插入到目标表d_p_t的相应分区中。

set hive.exec.dynamic.partition=true;    #--是否开启动态分区功能，默认false关闭
set hive.exec.dynamic.partition.mode=nonstrict;   #-- 动态分区的模式，默认strict，表示必须指定至少一个分区为静态分区，nonstrict模式表示允许所有的分区字段都可以使用动态分区。

#-- 原始表：
create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ","; 

load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table;
2015-05-10,ip1
2015-05-10,ip2
2015-06-14,ip3
2015-06-14,ip4
2015-06-15,ip1
2015-06-15,ip2
 
#-- #-- 目标表：
create table d_p_t(ip string) partitioned by (month string,day string);

#-- 动态插入：
insert overwrite table d_p_t partition (month,day) 
select ip,substr(day,1,7) as month,day from dynamic_partition_table;

导出表数据

insert overwrite local directory '/root/123456'
select * from t_p;

insert overwrite directory '/aaa/test'
select * from t_p;

select
- cluster by 分桶查询
  - 根据该字段将数据分到不同reducer查询 , 并且在每个桶中根据该字段进行排序(局部排序)
  - 如果 distribute 和 sort 的字段是同一个时，此时，cluster by = distribute by + sort by
- distribute by ( 分 )+ sort by (排序)
  - 根据该字段将数据分到不同reducer查询
  - 需要设置reducetask数量大于1
  - 只保证每个 reducer 的输出有序
- order by
  - 全局排序, 只会有一个reducer
```
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
JOIN table_other ON expr
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
[LIMIT number]
```

join

hive 支持内关联, 左关联, 右关联, 全关联 , 还支持left semi join 和 cross join
hive 支持等值连接（a.id = b.id ）不支持非等值连接 ( a.id>b.id)
join时, 应该将最大的表写在最后

reducer 会缓存 join 序列中除了最后一个表的所有表的记录，再通过最后一个表将
结果序列化到文件系统。这一实现有助于在 reduce 端减少内存的使用量。
Join 发生在 WHERE 子句之前
Join 是不能交换位置的

准备数据
1,a
2,b
3,c
4,d
7,y
8,u

2,bb
3,cc
7,yy
9,pp



建表：
create table a(id int,name string)
row format delimited fields terminated by ',';

create table b(id int,name string)
row format delimited fields terminated by ',';

导入数据：
load data local inpath '/root/hivedata/a.txt' into table a;
load data local inpath '/root/hivedata/b.txt' into table b;


实验：
** inner join
select * from a inner join b on a.id=b.id;

select a.id,a.name from a join b on a.id = b.id;

select a.* from a join b on a.id = b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+--+





**left join   
select * from a left join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
+-------+---------+-------+---------+--+





**right join
select * from a right join b on a.id=b.id;

select * from b right join a on b.id=a.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+




**
select * from a full outer join b on a.id=b.id;
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+


**hive中的特别join
select * from a left semi join b on a.id = b.id;


select a.* from a inner join b on a.id=b.id;


+-------+---------+--+
| a.id  | a.name  |
+-------+---------+--+
| 2     | b       |
| 3     | c       |
| 7     | y       |
+-------+---------+--+
相当于
select a.id,a.name from a where a.id in (select b.id from b); 在hive中效率极低

select a.id,a.name from a join b on (a.id = b.id);

select * from a inner join b on a.id=b.id;


cross join（##慎用）
返回两个表的笛卡尔积结果，不需要指定关联键。
select a.*,b.* from a cross join b;

hive参数配置

三种设定方式

配置文件全局有效
命令行参数对hive启动实例有效
参数声明对hive的连接session有效

范围一次变小, 优先级一次变大

命令行参数说明：
1、 -i 初始化 HQL 文件。
2、 -e 从命令行执行指定的 HQL
3、 -f 执行 HQL 脚本
4、 -v 输出执行的 HQL 语句到控制台
5、 -p <port> connect to Hive Server on port number
6、 -hiveconf x=y Use this to set hive/hadoop configuration variables.
例如：
$HIVE_HOME/bin/hive -e 'select * from tab1 a'
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
$HIVE_HOME/bin/hive -i /home/my/hive-init.sql
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
--hiveconf hive.exec.compress.output=true
--hiveconf mapred.reduce.tasks=32

hive本地模式

hive提供了智能本地模式，根据一些条件判断是否自动切换

SET hive.exec.mode.local.auto=true;
满足下述三个条件智能切换成为本地模式  否则还是集群模式

The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)

The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)

The total number of reduce tasks required is 1 or 0.

hive自定义函数

当内置函数不满足时, 使用自定义函数 UDF 自定义函数本质是临时函数跟连接的会话有关

hive-exec-1.2.1.jar 和 hadoop-common-2.7.4.jar
编写java类继承UDF 重载evaluate方法
打包上传到服务器
添加jar到classpath add JAR /xxx.jar;
创建临时函数与java class关联 create temporary function 自定义函数名 as ‘类全路径’;
使用 : Select tolowercase(name),age from t_test;

hive特殊分隔符

hive默认只支持单字节的分隔符, 如果分隔符是多字符的, 可以使用正则抽取字段

input.regex：输入的正则表达式表示 || 左右两边任意字符被抽取为一个字段
output.format.string：输出的正则表达式 %1 $s$ s 则分别表示表中的第一个字段、第二个字段
使用 RegexSerDe 类时，所有的字段必须为 string
01||zhangsan
02||lisi

create table t_bi_reg(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='(.*)\\|\\|(.*)',
'output.format.string'='%1$s %2$s'
)