Hive Bulk Data Migration

License: CC 4.0 BY-SA

Test environment
HDP 2.6.2 to HDP 2.5.0
HDFS 2.7.3 to HDFS 2.7.1
Neither cluster has Kerberos or Ranger authorization enabled.
1. Migrating data with Hive export/import
1.1 Export the Hive table data
beeline -u "jdbc:hive2://dc2.xx.com:2181,dc3.xx.com:2181,dc4.xx.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n "hive"
export table customer to '/tmp/export/customer' ;
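For a bulk migration, typing one export statement per table quickly becomes tedious. The following is a minimal sketch, not part of the original test, that prints one export statement per table so the output can be saved to a file and run with beeline -f. The helper name gen_export_sql and the EXPORT_ROOT variable are hypothetical; the staging path simply mirrors the one used above.

```shell
#!/usr/bin/env bash
# Assumed staging directory on the source HDFS (same path as in the example above).
EXPORT_ROOT="${EXPORT_ROOT:-/tmp/export}"

# Hypothetical helper: print one EXPORT statement per table name given.
gen_export_sql() {
  local table
  for table in "$@"; do
    printf "export table %s to '%s/%s';\n" "$table" "$EXPORT_ROOT" "$table"
  done
}

# Only prints the statements. Redirect them to a file and run that file
# with: beeline -u "<jdbc url>" -n hive -f <file>
gen_export_sql customer saleslineitem_orc saleslineitem_parquet
```

Generating the statements instead of executing them directly keeps the step easy to review before anything is written to HDFS.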
1.2 Use distcp to copy the export directory to the target cluster
hadoop distcp hdfs://dc1.xx.com:8020/tmp/export/ hdfs://dc2.xx.com:8020/tmp/export
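One detail worth checking with the command above: as far as I understand distcp, when the target directory already exists and -update is not given, the source directory is copied underneath it (ending up as /tmp/export/export). The sketch below only builds and prints a variant that uses the -update flag, which syncs the contents of the source directory into the target directory. The function name build_distcp_cmd and the two variables are hypothetical; the NameNode addresses are taken from the command above.

```shell
#!/usr/bin/env bash
# NameNode URIs as used in the distcp command above.
SRC_NN="hdfs://dc1.xx.com:8020"
DST_NN="hdfs://dc2.xx.com:8020"

# Hypothetical helper: print a distcp command that syncs the same path
# from the source cluster to the target cluster.
build_distcp_cmd() {
  local path="$1"
  printf 'hadoop distcp -update %s%s %s%s\n' "$SRC_NN" "$path" "$DST_NN" "$path"
}

# Dry run: prints the command instead of executing it.
build_distcp_cmd /tmp/export
```

With -update, files that already exist unchanged on the target are skipped, so the same command can be rerun after further exports.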
1.3 Import the table on the target cluster
beeline -u "jdbc:hive2://dc2.xx.com:2181,amb03.v120.ubuntu:2181,amb02.v120.ubuntu:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n "hive"
-- Import under the original table name:
import from '/tmp/export/customer';
-- Or import under a different table name:
import table ds_customer from '/tmp/export/customer';
1.4 Special cases
1.4.1 Exporting an entire partitioned table
On the source cluster, in beeline:
export table saleslineitem_orc to '/tmp/export/saleslineitem_orc' ;
export table saleslineitem_parquet to '/tmp/export/saleslineitem_parquet' ;
Sync the data
hadoop distcp hdfs://dc1.xx.com:8020/tmp/export/ hdfs://dc2.xx.com:8020/tmp/export
On the target cluster
import from '/tmp/export/saleslineitem_orc';
import from '/tmp/export/saleslineitem_parquet';
Verify
show tables;
show partitions saleslineitem_orc ;
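Listing tables and partitions confirms the metadata arrived, but not that every row did. A possible extra check, sketched here under the assumption that the same count queries are run on both clusters and each result is saved to a file, is to compare row counts. The helper names gen_count_sql and counts_match are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one row-count query per table.
gen_count_sql() {
  local table
  for table in "$@"; do
    printf "select '%s' as tbl, count(*) as cnt from %s;\n" "$table" "$table"
  done
}

# Hypothetical helper: succeed only if two result files hold the same
# lines, ignoring their order.
counts_match() {
  [ "$(sort "$1")" = "$(sort "$2")" ]
}

# Print the queries; run them on each cluster (for example with
# beeline --outputformat=tsv2 --showHeader=false) and save the output.
gen_count_sql saleslineitem_orc saleslineitem_parquet
```

Afterwards, counts_match source_counts.tsv target_counts.tsv returns a non-zero exit status when any table differs between the two clusters.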
1.4.2 Exporting a single partition and importing it into a specified partition
Run on the source cluster
export table saleslineitem_orc partition(dt='20171216') to '/tmp/export/saleslineitem_orc_dt_20171216' ;
Sync the data
hadoop distcp hdfs://dc1.xx.com:8020/tmp/export/ hdfs://dc2.xx.com:8020/tmp/export
Run on the target cluster
alter table saleslineitem_orc drop partition(dt='20171216');
import from '/tmp/export/saleslineitem_orc_dt_20171216';

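Notes:
1. The target partition must not already exist, which is why it is dropped before the import.
2. An exported partition cannot be imported into a different partition. For example, data exported from partition dt='20171216' cannot be loaded into partition 20171215 with import table saleslineitem_orc partition(dt='20171215') from '/tmp/export/saleslineitem_orc_dt_20171216'.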
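If the data really does need to land in a different partition, one possible workaround (untested in the original experiment) is to import the export under a temporary table name and then copy the rows with insert overwrite. The sketch below only prints the statements. The function gen_reimport_sql and the staging table saleslineitem_orc_stage are hypothetical names; the column list comes from the table definition in section 4.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print HiveQL that imports an exported partition into
# a staging table and rewrites it into a different partition of the target table.
# $1 = export directory, $2 = exported partition value, $3 = desired partition value
gen_reimport_sql() {
  local export_dir="$1" src_dt="$2" dst_dt="$3"
  cat <<EOF
import table saleslineitem_orc_stage from '${export_dir}';
insert overwrite table saleslineitem_orc partition(dt='${dst_dt}')
select saleslineitemid, productid, customerid, quantity, extendedamount, transactiondate
from saleslineitem_orc_stage where dt='${src_dt}';
drop table saleslineitem_orc_stage;
EOF
}

# Dry run: prints the statements for review.
gen_reimport_sql /tmp/export/saleslineitem_orc_dt_20171216 20171216 20171215
```

The select lists the non-partition columns explicitly because the staging table carries its own dt column, which must not be written into the data columns of the target partition.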

2. Copy the data to the target cluster with distcp, then create the tables
Omitted here.
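Although the original skips this approach, its general shape can be sketched under some assumptions: copy the table's data directory with distcp, create an external table over the copied directory, and let msck repair table register the partition directories. Everything specific here is hypothetical: the helper names, the external table name saleslineitem_orc_ext, and both paths (the source path assumes the table lives under /apps/hive/warehouse/proc.db). The column list comes from section 4.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distcp command for a table directory.
# $1 = table directory on the source cluster, $2 = directory on the target cluster
gen_copy_cmd() {
  printf 'hadoop distcp -update hdfs://dc1.xx.com:8020%s hdfs://dc2.xx.com:8020%s\n' "$1" "$2"
}

# Hypothetical helper: print HiveQL that creates an external table over the
# copied directory and registers its dt=... partition directories.
gen_external_ddl() {
  local location="$1"
  cat <<EOF
create external table if not exists saleslineitem_orc_ext (
  saleslineitemid int,
  productid int,
  customerid int,
  quantity int,
  extendedamount float,
  transactiondate timestamp
) partitioned by (dt string)
stored as orc
location '${location}';
msck repair table saleslineitem_orc_ext;
EOF
}

# Dry run with assumed paths: prints the shell command, then the HiveQL.
gen_copy_cmd /apps/hive/warehouse/proc.db/saleslineitem_orc /tmp/migrate/saleslineitem_orc
gen_external_ddl /tmp/migrate/saleslineitem_orc
```

Unlike export/import, this route does not carry the table metadata, so the DDL has to be recreated by hand on the target cluster.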
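3. Summary
Hive export/import can be used to migrate Hive data in bulk. This experiment covered text, ORC, and Parquet tables as well as partitioned tables, and it exported and imported between different versions. In theory, migration through Hive export/import is not restricted by version, data format, or table type, so the results suggest that Hive export/import can be used to migrate any Hive data.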

4. Scripts used
create table if not exists proc.saleslineitem (
saleslineitemid int,
productid int,
customerid int,
quantity int,
extendedamount float ,
transactiondate timestamp
)
comment 'saleslineitem information'
row format delimited
fields terminated by '|'
lines terminated by '\n'
stored as textfile;

load data inpath '/tmp/saleslineitem.txt' overwrite into table proc.saleslineitem;

create table if not exists proc.saleslineitem_orc (
saleslineitemid int,
productid int,
customerid int,
quantity int,
extendedamount float ,
transactiondate timestamp
) partitioned by (dt string)
stored as orc;

create table if not exists proc.saleslineitem_parquet (
saleslineitemid int,
productid int,
customerid int,
quantity int,
extendedamount float ,
transactiondate timestamp
) partitioned by (dt string)
stored as parquet;

insert overwrite table proc.saleslineitem_orc partition(dt='20171216') select * from proc.saleslineitem;
insert overwrite table proc.saleslineitem_orc partition(dt='20171217') select * from proc.saleslineitem;
insert overwrite table proc.saleslineitem_orc partition(dt='20171218') select * from proc.saleslineitem;

show partitions saleslineitem_orc;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table proc.saleslineitem_parquet partition(dt) select * from proc.saleslineitem_orc;