Programming Hive (Hive编程指南), Part 3

Latest recommended article published 2019-06-01 13:27:19

License: CC 4.0 BY-SA

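This article covers HiveQL data manipulation: loading data into managed tables (either overwriting or appending), inserting data via queries (including dynamic partition inserts and the related properties), creating a table and loading data in a single query, and exporting data. It ends where the HiveQL query chapter begins.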

Chapter 5 HiveQL: Data Manipulation

5.1 Loading data into managed tables

Using the employees table from Chapter 4:

[root@master chapter5]# cat 4.create_employees.sql 
create table employees (
        name    string,
        salary  float,
        subordinates    Array<string>,
        deductions      map<string,float>,
        address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (country string,state string);

Load the data (for a non-partitioned table, omit the partition clause). The inpath directory must not contain any subdirectories.

[root@master chapter5]# cat 5.1california-employees.sql 
load data local inpath '/usr/local/src/test3/hive/Programming_Hive/chapter5/'
overwrite into table employees
partition (country='US',state='CA');

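With overwrite, any data already in the target directory is deleted before the files are uploaded. Without it, the new files are simply added to the target directory; if a file with the same name already exists, the old file is overwritten (this is the behavior the book describes; newer Hive versions may instead rename the incoming file so that both are kept).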
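
As a minimal sketch of the variations described above, the statements below load into a non-partitioned table without overwrite, and then, as an additional variation, load from HDFS instead of the local filesystem. The table employees_flat and both file paths are hypothetical names chosen only for illustration.

```sql
-- Hypothetical non-partitioned table, used only for this illustration.
create table if not exists employees_flat (
        name    string,
        salary  float
);

-- LOCAL: the file is copied from the local filesystem.
-- No partition clause is needed, and without overwrite the file is added to existing data.
load data local inpath '/tmp/employees_flat.txt'
into table employees_flat;

-- Without LOCAL the path refers to HDFS, and the file is moved (not copied)
-- into the table directory. overwrite clears the existing data first.
load data inpath '/tmp/staging/employees_flat.txt'
overwrite into table employees_flat;
```

Because the non-local form moves the file, the source path no longer contains it after the statement finishes.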

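Data in the HDFS directory (if the partition directory does not exist, it is created automatically when the statement runs, and the data is then copied into it). Note that the trailing "| ll" makes the shell print the local working directory, so the listing below shows the local chapter5 files that were loaded rather than the output of hadoop fs -ls:

[root@master chapter5]# hadoop fs -ls /user/hive/warehouse/employees/country=US/state=CA | ll
total 16
-rw-r--r--. 1 root root 226 May 29 13:50 4.create_employees.sql
-rw-r--r--. 1 root root   5 May 29 13:39 5.1california-employees
-rw-r--r--. 1 root root 147 May 29 13:43 5.1california-employees.sql
-rw-r--r--. 1 root root 153 May 29 13:40 employee2.txt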
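
To check the result from inside the Hive CLI rather than the shell, something like the following should work. It assumes the default warehouse location used in the path above.

```sql
-- List the partitions Hive has registered for the table.
show partitions employees;

-- Run an HDFS listing from within the Hive CLI (no "hadoop fs" prefix is needed here).
dfs -ls /user/hive/warehouse/employees/country=US/state=CA;
```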

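5.2 Inserting data into a table from a query

This inserts data from a source table into a target table (the two tables must have the same columns):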

hive> insert overwrite table employees
    > partition (country = 'US',state='OR')
    > select * from staged_employees se
    > where se.cnty = 'US' and se.st='OR';

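If the source table staged_employees is very large and the user needs to run this statement for all 65 states, the table is scanned 65 times. The solution:

Scan the data only once and split it in several ways. The example below creates employees partitions for three states: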

hive> from staged_employees se
    > insert overwrite table employees
    > partition (country = 'US',state='OR')
    > select * where se.cnty = 'US' and se.st = 'OR'
    > insert overwrite table employees
    > partition (country = 'US',state='CA')
    > select * where se.cnty = 'US' and se.st = 'CA'
    > insert overwrite table employees
    > partition (country = 'US',state='IL')
    > select * where se.cnty = 'US' and se.st = 'IL';

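Dynamic partition inserts:

The syntax above still has a problem: inserting into many partitions requires a lot of SQL. Dynamic partition inserts solve this.

Source column values are matched to output partition values by position, not by name.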

insert overwrite table employees
partition (country,state)
select ...,se.cnty,se.st
from staged_employees se;

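--No concrete partition values are given; Hive derives them from the cnty and st values in the source table. If the source table has 100 country/state pairs, the target table ends up with 100 partitions.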
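
Since the select list above is elided, here is a fuller sketch of the positional rule. The layout of staged_employees is an assumption (the original does not show it): it is taken to have the same data columns as employees plus cnty and st. Both partition columns are dynamic, so the two set statements covered under the dynamic partition properties are needed first.

```sql
-- Hypothetical staging table: same data columns as employees, plus cnty and st.
create table if not exists staged_employees (
        name            string,
        salary          float,
        subordinates    array<string>,
        deductions      map<string,float>,
        address         struct<street:string,city:string,state:string,zip:int>,
        cnty            string,
        st              string
);

-- Both partition columns are dynamic, so nonstrict mode is required.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last two select columns feed the partition keys in order:
-- se.cnty (second to last) -> country, se.st (last) -> state.
insert overwrite table employees
partition (country, state)
select se.name, se.salary, se.subordinates, se.deductions, se.address,
       se.cnty, se.st
from staged_employees se;
```

Swapping se.cnty and se.st in the select list would not raise an error. It would write state codes into the country partition key and vice versa, which is why the order matters.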

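Mixing dynamic and static partitions

Static partition keys must come before dynamic partition keys.

--country is static, state is dynamic; only the dynamic partition column (se.st) is appended to the select list
insert overwrite table employees
partition (country = 'US',state)
select ...,se.st
from staged_employees se
where se.cnty = 'US';

--Inserts the source rows whose country is US into employees, creating one partition per state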

Hive dynamic partition properties

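Dynamic partitioning is disabled by default. Once enabled, it runs in strict mode by default, which requires at least one partition column to be static. This helps prevent a design mistake from creating, say, one partition per second (for example, partitioning on a timestamp in seconds when days or hours were intended).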

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=1000;


insert overwrite table employees
partition (country,state)
select ...,se.cnty,se.st
from staged_employees se;
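
Besides the per-node limit set above, two further limits often need raising when a statement creates many partitions. The values below are illustrative rather than recommendations, and the defaults differ between Hive versions.

```sql
-- Upper bound on the total number of dynamic partitions one statement may create.
set hive.exec.max.dynamic.partitions=1000;

-- Upper bound on the total number of files one statement may create.
set hive.exec.max.created.files=100000;
```

If a limit is exceeded, the statement fails rather than silently creating fewer partitions, so these settings act as a safety net against the timestamp-style mistake described earlier.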

5.3 Creating a table and loading data in a single query

create table ca_employees
as select name,salary,address
from employees se
where se.state = 'CA';

5.4 Exporting data

If the data files are already in the format the user needs, simply copying the directory or files is enough:

hadoop fs -cp source_path  target_path

Alternatively, use an insert statement (Hive serializes all fields to strings and writes them to the files):

insert overwrite local directory '/usr/local/src/test3/hive/ca_eployees'
select name,salary,address
from employees se
where se.state = 'CA';

View the exported data locally:

hive> ! cat /usr/local/src/test3/hive/ca_eployees/000000_0;
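
By default the exported fields are separated by Ctrl-A (\001), which is awkward to read with cat. As a sketch, and assuming Hive 0.11.0 or later, a row format clause can be added to the export statement to choose the delimiters. The output directory here is a hypothetical path.

```sql
-- Requires Hive 0.11.0 or later; '/tmp/ca_employees_tsv' is a hypothetical local path.
-- Fields are tab-separated; the members of the address struct are comma-separated.
insert overwrite local directory '/tmp/ca_employees_tsv'
row format delimited
fields terminated by '\t'
collection items terminated by ','
select name,salary,address
from employees se
where se.state = 'CA';
```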

Chapter 6 HiveQL: Queries