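Hive Beginner Tutorial: Basic Operations Explained - 优快云 Blog

Article link: https://blog.youkuaiyun.com/w_mchen/article/details/122146792

This article introduces the basic operations of Hive, including creating databases and tables, loading data, running SQL queries, and exporting data. Through examples it shows how Hive serves as a big-data processing tool that offers convenient data storage and analysis.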

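Creating a database / table

create database if not exists db_name;

create table if not exists table_name (id int, name string); -- a column list is required; these two columns are only an example

Switching databases

use db_name;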

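Dropping / force-dropping

drop database db_name; -- fails if the database still contains tables; in that case force the drop with cascade

drop database db_name cascade;

drop table table_name;

drop table if exists table_name; -- note: drop table has no cascade option; cascade applies only to drop database

truncate table table_name; -- remove all rows from the table

msck repair table table_name; -- recover partitions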

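Viewing databases / tables

show databases;

show tables;

show partitions dept_partition; -- list the partitions of a table

select current_database(); -- the database currently in use

describe database db_name; -- show more descriptive information about the database

desc formatted table_name; -- show the table type

desc database db_name; -- show database details

desc table_name; -- show the table structure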

Inserting data

insert into table table_name values(1,'zhangsan');

insert overwrite table table_name values(2,'lisi');

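Changing the database owner and altering tables:

alter database myhivebook set owner user dayongd;

alter table table_name rename to new_table_name;

alter table table_name set fileformat file_format; -- change the file format

alter table table_name change name new_name string; -- change a column's name or type

alter table table_name add columns(age int,gender string); -- add columns

alter table table_name replace columns(id int,username string); -- drop/replace: all existing columns are replaced by the columns in parentheses

alter table t1 add partition (pt_d = '333333'); -- add a partition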

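Converting between managed (internal) and external tables

alter table table_name set tblproperties('EXTERNAL'='TRUE'); -- managed table to external table; case-sensitive

alter table table_name set tblproperties('EXTERNAL'='FALSE'); -- external table to managed table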
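One way to confirm that a conversion took effect is to look at the table type reported by desc formatted. The following is only a sketch: demo_tbl is a hypothetical table name, and the exact output layout can vary between Hive versions.

```sql
-- demo_tbl is a hypothetical table used only for illustration
create table if not exists demo_tbl (id int, name string);

-- managed -> external; the key and value are written in upper case
alter table demo_tbl set tblproperties('EXTERNAL'='TRUE');

-- the "Table Type" row should now read EXTERNAL_TABLE
desc formatted demo_tbl;

-- external -> managed again; "Table Type" should read MANAGED_TABLE
alter table demo_tbl set tblproperties('EXTERNAL'='FALSE');
desc formatted demo_tbl;
```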

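Managed (internal) tables:

The table is a subdirectory under its database's directory in HDFS.

Dropping the table also deletes its data.

Create-table statement:

create table studentp(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited fields terminated by ',' -- column (field) delimiter
collection items terminated by '-' -- delimiter between array elements and between map entries
map keys terminated by ':' -- delimiter between a map key and its value
lines terminated by '\n'; -- line delimiter

Creating a managed table produces a directory with the same name as the table under the database's directory.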
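To see how the three delimiters work together, it can help to look at one line of input and a query that reads the complex columns. This is a sketch under stated assumptions: studentp_demo and the sample row are made up for illustration, and the table simply repeats the column layout of studentp.

```sql
-- studentp_demo and the sample row below are hypothetical
create table if not exists studentp_demo(
  id int,
  name string,
  likes array<string>,
  address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

-- A matching line in the data file would look like:
--   1,zhangsan,reading-hiking,city:beijing-street:chaoyang
-- ','  separates the four columns
-- '-'  separates the array elements and the map entries
-- ':'  separates each map key from its value

-- array elements are read by index, map values by key
select
  name,
  likes[0]        as first_like,
  size(likes)     as like_count,
  address['city'] as city
from studentp_demo;
```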

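Loading data:

load data [local] inpath 'path_to_file' [overwrite] into table A;

With local: the file is copied from the local Linux filesystem.

Without local: the file is moved (cut) from its HDFS location into Hive's location for the table.

With overwrite: the existing data is replaced.

Without overwrite: the data is appended.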
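The two optional keywords can be combined freely. A short sketch of two common combinations follows; the file paths are placeholders, and A stands for any existing table.

```sql
-- the paths are placeholders; A is an existing table

-- local, no overwrite: copy a file from the local Linux filesystem and append it
load data local inpath '/opt/tmp/student.txt' into table A;

-- no local, with overwrite: move a file already on HDFS and replace the table's data
load data inpath '/tmp/student.txt' overwrite into table A;
```

After the second statement the source file is no longer at its original HDFS path, because it was moved rather than copied.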

External tables

The data is stored at a specified HDFS path.

Hive does not fully manage the data: dropping the table removes only the metadata, not the data.

Mostly used for the ODS layer.

create external table if not exists A(
id string,
name string,
age string)
row format delimited
fields terminated by ','
lines terminated by '\n'
location '/home/hadoop/hive/warehouse/student';

--------------------------------------------------------------

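-- check the table type: desc formatted student;

location 'path where the data is stored'

Upload data into this directory:

load data [local] inpath 'path_to_file' [overwrite] into table A;

or

hdfs dfs -put local_source_path hdfs_target_path (the source is a path on the local filesystem; the target is the table's location)

The data is then automatically mapped to the table.

Use select * from A to view the mapped table.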
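The practical difference from a managed table shows up when the table is dropped. A minimal sketch, assuming a hypothetical table a_ext and a hypothetical location that already contains comma-separated files:

```sql
-- a_ext and the location are hypothetical
create external table if not exists a_ext(
  id string,
  name string,
  age string)
row format delimited
fields terminated by ','
location '/tmp/hive_demo/a_ext';

-- files placed in the location (by load data or hdfs dfs -put) show up as rows
select * from a_ext;

-- dropping an external table removes only the metadata; the files stay on HDFS
drop table a_ext;

-- re-creating the table over the same location makes the same rows visible again
create external table if not exists a_ext(
  id string,
  name string,
  age string)
row format delimited
fields terminated by ','
location '/tmp/hive_demo/a_ext';

select * from a_ext;
```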

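Temporary tables:

A temporary table is a way for an application to automatically manage intermediate data generated during a complex query.

The table is visible only in the current session and is dropped automatically when the session ends.

Its storage is located under /tmp/hive-<user_name> (for security reasons).

If a temporary table is created with the name of an existing table, that name refers to the temporary table for the rest of the session.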

CREATE TEMPORARY TABLE tmp_table_name1 (c1 string);

CREATE TEMPORARY TABLE tmp_table_name2 AS..

CREATE TEMPORARY TABLE tmp_table_name3 LIKE..
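As an illustration of holding intermediate data in a temporary table, here is a small sketch. All table names and rows are hypothetical and are not taken from the statements above.

```sql
-- tmp_people and tmp_adults are hypothetical names; the rows are sample data
create temporary table tmp_people (id string, name string, age string);

insert into tmp_people values ('1','zhangsan','20'), ('2','lisi','15');

-- keep an intermediate result for later steps of the same session
create temporary table tmp_adults as
select id, name from tmp_people where cast(age as int) >= 18;

select * from tmp_adults;

-- no drop statement is needed: both tables disappear when the session ends
```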

————————————————

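Partitioned tables (important): hadoop fs -chmod -R 777 /

Partitioning creates subdirectories under the table's directory according to specific conditions; each subdirectory holds one already-divided portion of the original data.

Queries can filter on partition columns as well as regular columns. Filtering on a partition column lets Hive read only the matching subdirectories, which greatly speeds up the query.

Partitioning comes in two kinds: dynamic and static.

Either way, the partitioned table must be created first (the create statement is the same for both). Creating it as an external table is preferable because it is safer: create external table if not exists A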

create table student2(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';

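Note: the column used for partitioning must not also appear in the table's column list. In the table above, which is partitioned by age, the column list must not contain age.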
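Although the partition column is not in the column list, it can be selected and filtered like any other column. The sketch below uses the student2 table from above; the directory names in the comments illustrate the usual layout, and 10 is just an example value.

```sql
-- each partition value gets its own subdirectory under the table directory, e.g.
--   .../student2/age=10/
--   .../student2/age=20/

-- age is queried like an ordinary column;
-- with this filter only the age=10 subdirectory needs to be read
select id, name, age from student2 where age = 10;

-- list the partitions that currently exist
show partitions student2;
```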

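After the table is created, data is loaded into it so that it can be queried later:

This is where static and dynamic partitioning really differ; put differently, the way the data is loaded determines whether the partitioning is static or dynamic:

Static partitioning: you define the partitions yourself. It is mostly used for incremental tables (tables whose content keeps growing), such as a news table that grows every day. The partitions must be known in advance, and each partition is loaded by hand.

If the raw data files are already split by the partition key, for example sex, with the male rows in one file and the female rows in another,

load each file into its partition by naming the partition explicitly: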

load data local inpath '/opt/tmp/student.txt' into table student2 partition(age=10);

load data local inpath '/opt/tmp/student.txt' into table student2 partition(age=20);

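load data [local] inpath 'path_to_data' [overwrite] into table A partition (partition_column=value);

If the data is one complete large file,

it can be loaded as follows. Note: the selected columns must match the columns defined when the partitioned table was created. Each partition has to be named by hand, which is tedious when there are many partitions.

-- produces first-level subdirectories
insert into table employee_hrpar partition(sex='male') select id,username,bir from ods_users where sex='male';

insert into table employee_hrpar partition(sex='female') select id,username,bir from ods_users where sex='female';

-- produces second-level subdirectories
insert into table empar partition (year=2015,month=11) select name,id,number from employee_hr where year(start_date)=2015 and month(start_date)=11;

Loading data into a table with two partition levels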

load data local inpath '/opt/datas/dept.txt' into table dept_partition2 partition(month='201909', day='13');

Adding partitions

alter table dept_partition add partition(month='201906') ;

alter table dept_partition add partition(month='201905') partition(month='201904');

Dropping partitions

alter table dept_partition drop partition (month='201904');

alter table dept_partition drop partition (month='201905'), partition (month='201906');

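Dynamic partitioning: first use group by or distinct to see which distinct values the partition column holds. It is mostly used for partitions by year, month, or day, and for full loads; the data volume should not be too large.

First set these two options (they last only for the session and must be set again after it is closed):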

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

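The partitioned table is created the same way as above.

Data is added as follows:

Before doing so, group by or distinct can be used to get a picture of the data, which helps with the partitioning below.

There is no need to add partitions by hand one at a time; they are created automatically from the partition key. The columns in the select must match the table's columns followed by the partition columns.

-- partition by one column
insert into table myuser partition(sex) select * from ods_users;

-- partition by two columns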

insert into table empar partition(year,month) select name,id,number,year(start_date),month(start_date) from employee_hr;
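Putting the steps together, a dynamic-partition load might look like the sketch below. The table names ods_users_demo and myuser_demo are hypothetical, and the two max.dynamic.partitions settings are optional limits shown with example values.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- optional limits on how many partitions one statement may create (example values)
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- hypothetical source and target tables
create table if not exists ods_users_demo (id int, username string, sex string);

create table if not exists myuser_demo (id int, username string)
partitioned by (sex string);

-- check in advance how many partitions the insert will create
select sex, count(*) from ods_users_demo group by sex;

-- the partition column comes last in the select list
insert into table myuser_demo partition(sex)
select id, username, sex from ods_users_demo;

-- one partition per distinct value of sex should now be listed
show partitions myuser_demo;
```

Listing the columns explicitly instead of using select * makes it easier to see that the partition column is in the last position.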

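Bucketing data in Hive

1. Buckets correspond to files in HDFS and make query processing more efficient

2. Bucketing makes sampling more efficient

3. Rows are generally assigned to buckets by a hash function of the "bucket column"

1. What bucketing is: similar to partitioning; a column of the table is hashed and each row goes to the bucket given by the hash.

2. Why bucket:

1. Partitioning cannot always split the data the way the user wants: the number of partitions is limited, and Hive blocks too many small partitions.

2. Bucketing fills this gap by distributing the data over a fixed number of buckets, with no data fluctuation.

Bucketing is only dynamic; enable it with

set hive.enforce.bucketing = true;

clustered by (employee_id) into 2 buckets -- clause of the create table statement that defines the buckets

Data must be loaded with INSERT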
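The clustered by clause only makes sense inside a full create table statement, so a complete sketch may be useful. The table names are hypothetical, and the set statement is the one given above; some Hive releases may not need it.

```sql
-- employee_demo and employee_bucketed are hypothetical names
set hive.enforce.bucketing = true;

-- plain source table
create table if not exists employee_demo (employee_id int, name string);

-- target table whose rows are split into 2 buckets by employee_id
create table if not exists employee_bucketed (employee_id int, name string)
clustered by (employee_id) into 2 buckets;

-- insert ... select lets Hive hash employee_id and write each row to its bucket;
-- load data would only move files and would not hash the rows
insert into table employee_bucketed
select employee_id, name from employee_demo;
```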

Random sampling based on the whole row

select * from table_name tablesample(bucket 3 out of 32 on rand()) s;

Random sampling based on a specified column (more efficient with the bucketing column)

select * from table_name tablesample(bucket 3 out of 32 on id) s;
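The meaning of "bucket x out of y" is easier to see with small numbers. In this sketch, table_name and id are the same placeholders as above, and 1 and 4 are example values.

```sql
-- "bucket 1 out of 4 on id" hashes id into 4 groups and returns the first group,
-- i.e. roughly a quarter of the rows; the same rows come back on every run
select * from table_name tablesample(bucket 1 out of 4 on id) s;

-- on rand() assigns rows to the 4 groups at random,
-- so repeated runs generally return different rows
select count(*) from table_name tablesample(bucket 1 out of 4 on rand()) s;
```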
