Hive: Creating Databases and Tables, and Importing and Exporting Data

Original article. Latest recommended article published 2024-12-12 22:12:57.
License: CC 4.0 BY-SA

Tags: #hive #建表 (table creation)


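This article covers the difference between managed (internal) and external tables in Hive, example CREATE TABLE statements, how to use indexes, transaction support, and ways to import and export data. It also explains how to update and delete data in Hive tables.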

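Creating tables in Hive:
Hive distinguishes between managed (internal) tables and external tables. A managed table keeps its data under the warehouse directory, so data loaded into it is moved to that path. An external table only records where the data lives and leaves the files where they are. Dropping a managed table deletes both the metadata and the data, whereas dropping an external table deletes only the metadata and leaves the data in place. External tables are therefore somewhat safer, and they make data organization more flexible and the source data easier to share.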
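The difference in drop behavior can be sketched as follows. This is a minimal illustration, not part of the original examples: the table name customers_ext and the path /data/shared/customers are hypothetical.

```sql
-- Hypothetical external table over files that already exist in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS customers_ext (
  id STRING,
  name STRING,
  email STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/shared/customers';

-- Reports the table type (external vs. managed) and the recorded location.
DESCRIBE FORMATTED customers_ext;

-- Removes only the metadata; the files under /data/shared/customers stay in place.
DROP TABLE customers_ext;
```

Running the same DROP TABLE against a managed table would also delete its data directory, which is why shared source data is usually exposed through external tables.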

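Create a database:

<span style="font-size:18px;">-- SCHEMA is a synonym for DATABASE, so either statement works; IF NOT EXISTS is optional
hive> CREATE DATABASE IF NOT EXISTS mydb;
hive> CREATE SCHEMA IF NOT EXISTS mydb;
hive> USE mydb;</span>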

Create a table:

<pre name="code" class="plain"><span style="font-size:18px;">-- external table
hive> CREATE EXTERNAL TABLE IF NOT EXISTS tb (...
-- managed (internal) table
hive> CREATE TABLE tb (...</span></pre>

Example CREATE TABLE statements:

<span style="font-size:18px;">hive> create table customers (id string, name string, email string, street_address string, company string) 
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
> with SERDEPROPERTIES ('escape.delim'='\\', 'field.delim'=',', 'serialization.format'=',');</span>

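<span style="font-size:18px;">-- Alternative definition: a partitioned, bucketed, transactional ORC table.
-- It reuses the name customers, so drop the earlier table or pick another name first.
-- hive.enforce.bucketing is needed on Hive 1.x; Hive 2.x always enforces bucketing.
hive> set hive.enforce.bucketing = true;
hive> create table customers (id string, name string, email string, street_address string, company string)
> partitioned by (time string)
> clustered by (id) into 5 buckets stored as orc
> location '/user/bedrock/salescust'
> TBLPROPERTIES ('transactional'='true');</span>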

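Hive can speed up queries by building an index on specified columns. Two index types are available: compact indexes and bitmap indexes. (Index support was removed in Hive 3.0, so this section applies to earlier versions.)
Index data is stored in a separate table, whose name can be given explicitly or left to the default. An index table basically contains these columns:
1. the indexed column(s) of the source table;
2. _bucketname: the path of the file in HDFS;
3. the offsets of the indexed values within that HDFS file.
By recording where each indexed value sits in HDFS, Hive can read exactly the data it needs and avoid a full table scan.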

<span style="font-size:18px;">-- compact index, stored in an explicitly named index table
hive> create index customer_index on table customers(id)
> as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild
> in table customer_index_table;

-- alternative: bitmap index (given its own name, since index names must not clash)
hive> create index customer_bitmap_index on table customers(id)
> as 'org.apache.hadoop.hive.ql.index.bitmap.BitmapIndexHandler' with deferred rebuild stored as rcfile;

-- populate the index data
hive> alter index customer_index on customers rebuild;
-- rebuild the index on a single partition
hive> alter index customer_index on customers partition(columnx='', columny='') rebuild;
hive> show formatted index on customers;
-- inspect the structure of the index table
hive> desc customer_index_table;
hive> select * from customer_index_table limit 10;
-- DROP INDEX takes the index name, not the index table name
hive> drop index customer_index on customers;</span>

<span style="font-size:18px;">For Hive to handle transactional data correctly, the following setting is required in the Hive configuration:</span>

<span style="font-size:18px;">hive> set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;</span>
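In practice the transaction manager is usually configured together with a few related properties. The following is a sketch of a commonly used set, based on my understanding of the Hive transactions documentation. The exact requirements depend on the Hive version, so treat the list as an assumption to verify against your installation.

```sql
-- Client-side (session) settings
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Compaction settings take effect on the metastore service, so they normally
-- go into its hive-site.xml rather than being set per session:
--   hive.compactor.initiator.on   = true
--   hive.compactor.worker.threads = 1
```

Without the compactor running, the delta files written by UPDATE and DELETE accumulate and reads gradually slow down.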

Updating and deleting data in Hive tables (UPDATE and DELETE only work on tables that have the transactional property enabled):

<span style="font-size:18px;">hive> delete from customers where id='1';
hive> truncate table customers;</span>
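The heading mentions updates, so here is a minimal UPDATE sketch. It assumes the transactional customers table defined earlier, and the email value is made up for illustration.

```sql
-- Change one non-key column of a single row.
-- Partition columns and bucketing columns cannot be assigned in SET.
UPDATE customers
SET email = 'new.address@example.com'
WHERE id = '1';

-- Check the result.
SELECT id, email FROM customers WHERE id = '1';
```

Run the UPDATE before the TRUNCATE shown above, since TRUNCATE removes every row from the table.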

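Ways to import data into Hive:

a. From the local file system

<span style="font-size:18px;">-- OVERWRITE is optional; without it the file is added to the table's existing data
hive> load data local inpath '/usr/local/hive/customers.info' overwrite into table customers;</span>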

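Tables stored as SEQUENCEFILE, RCFILE, or ORC cannot be loaded directly from a local text file, because LOAD DATA only copies or moves files and does not convert their format. Load the data into a TEXTFILE table first, then use INSERT ... SELECT to write it into the SequenceFile, RCFile, or ORC table.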
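The two-step load described above might look like this. The table names customers_stage and customers_orc are hypothetical, and the comma delimiter is an assumption about the input file.

```sql
-- Step 1: a plain-text staging table that matches the layout of the input file.
CREATE TABLE customers_stage (
  id STRING, name STRING, email STRING, street_address STRING, company STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/usr/local/hive/customers.info' INTO TABLE customers_stage;

-- Step 2: the ORC target table, filled by a query so Hive rewrites the rows as ORC.
CREATE TABLE customers_orc (
  id STRING, name STRING, email STRING, street_address STRING, company STRING
)
STORED AS ORC;

INSERT INTO TABLE customers_orc SELECT * FROM customers_stage;
```

The staging table can be dropped once the INSERT has finished.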

b. From HDFS

<span style="font-size:18px;">-- the source file is moved (not copied) into the table's directory; OVERWRITE is optional
hive> load data inpath '/hive/customers.info' overwrite into table customers;</span>

c. By querying another table

<span style="font-size:18px;">hive> insert into table customers select * from customers_tmp;</span>
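When the target table is partitioned, the INSERT needs a PARTITION clause. Below is a sketch using a hypothetical table customers_by_day partitioned by dt. It also assumes customers_tmp has a signup_day column, which the original does not state.

```sql
-- Hypothetical partitioned target table.
CREATE TABLE IF NOT EXISTS customers_by_day (
  id STRING, name STRING, email STRING
)
PARTITIONED BY (dt STRING);

-- Static partition: the partition value is fixed in the statement.
INSERT INTO TABLE customers_by_day PARTITION (dt = '2024-01-01')
SELECT id, name, email FROM customers_tmp;

-- Dynamic partition: the partition value comes from the last selected column.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE customers_by_day PARTITION (dt)
SELECT id, name, email, signup_day FROM customers_tmp;
```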

d. At table creation time (CREATE TABLE AS SELECT)

<span style="font-size:18px;">hive> create table customers as select * from customers_tmp;</span>

Copy a table's structure without importing its data:

<span style="font-size:18px;">hive> CREATE TABLE customers LIKE customers_tmp;</span>

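Exporting data:
Export to the local file system (the target is a directory, and its existing contents are overwritten):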

<span style="font-size:18px;">hive> insert overwrite local directory '/usr/local/hive/customers.info' select * from customers;</span>
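By default the exported files use Hive's control-character field delimiter (Ctrl-A), which is awkward to read outside Hive. A possible variant that writes comma-separated text is sketched below. The output directory /tmp/customers_export is hypothetical.

```sql
-- Everything already in the target directory is replaced.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/customers_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT id, name, email FROM customers;
```

The same ROW FORMAT clause can be used when exporting to an HDFS directory by omitting LOCAL.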

Export to HDFS:

<span style="font-size:18px;">hive> insert overwrite directory '/hive/customers.info' select * from customers;</span>