Importing CSV data into Hive and repartitioning it with dynamic partitions

db_guy

License: CC 4.0 BY-SA

Categories: Business Intelligence - ETL, Business Intelligence - Data Analysis. Tags: hive, csv, dynamic partitioning, nonstrict, buckets

Article link: https://blog.youkuaiyun.com/db_guy/article/details/78742071

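This article shows how to create tables in Hive and insert data into them using dynamic partitioning. It also compares dynamic partitioning with a bucketed table in terms of how the data is split, execution time, and join query time.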

Creating the table

hive> create database cus;
hive> use cus;
hive> create table telno_md5(
    > phone string,
    > md5 string )
    >  ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

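The definition above splits each line on every comma, which is fine as long as neither field contains commas or quotes. If the CSV files had a header row or quoted fields (an assumption; the post does not describe the files), a variant along these lines might be used instead. The table name `telno_md5_csv` is hypothetical, and OpenCSVSerde reads every column as a string.

```sql
-- Hypothetical variant: only relevant if the files have a header line or quoted fields.
create table telno_md5_csv(
  phone string,
  md5 string )
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
stored as textfile
-- Skip the first line of each file (the header row)
tblproperties ("skip.header.line.count"="1");
```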
Loading the data

hive> load data local inpath '/home/etluser/data/' into table telno_md5;
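Because the path given to `load data local inpath` is a directory, every file in it is copied from the local file system into the table. A quick sanity check after the load, as a minimal sketch, might look like this:

```sql
-- Row count of the staging table
select count(*) from telno_md5;

-- Look at a few rows to confirm the comma delimiter split the fields correctly
select phone, md5 from telno_md5 limit 5;

-- A NULL md5 usually means the source line had no comma
select count(*) from telno_md5 where md5 is null;
```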

Creating the repartitioned table

hive> create table telno_md5_prt(
    > phone string,
    > md5 string )
    > partitioned by (prefix string);

Inserting the data with dynamic partitioning

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=100000;  
hive> set hive.exec.max.dynamic.partitions=100000;  
hive> set hive.exec.max.created.files=1000000000; 

hive> insert into table telno_md5_prt
    > partition (prefix)
    > select phone,md5,substr(md5,1,2) as prefix 
    > from telno_md5;

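In a dynamic partition insert the partition column has to come last in the select list, because Hive matches it by position rather than by its alias. Assuming the md5 column holds lowercase hexadecimal digests (the post does not say so), a two-character prefix yields at most 16 × 16 = 256 partitions. A sketch of checking the result and of a lookup that reads only one partition follows; the digest value is a placeholder.

```sql
-- List the partitions created by the insert
show partitions telno_md5_prt;

-- Row count per partition, to see how evenly the prefixes are spread
select prefix, count(*) as cnt
from telno_md5_prt
group by prefix;

-- Lookup restricted to one partition: filter on prefix as well as md5.
-- The digest below is a placeholder; 'e1' is its first two characters.
select phone
from telno_md5_prt
where prefix = 'e1'
  and md5 = 'e10adc3949ba59abbe56e057f20f883e';
```
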
* For the meaning of these parameters, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts *
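As a rough guide, the settings can be read as follows. The documented defaults for the three limits are 100 partitions per node, 1000 partitions in total, and 100000 created files; check the values for your Hive version. The numbers below are illustrative only, sized for a few hundred partitions rather than the very generous limits used above.

```sql
-- Enable dynamic partition inserts
set hive.exec.dynamic.partition=true;
-- nonstrict: every partition column may be dynamic
-- (strict mode requires at least one static partition column)
set hive.exec.dynamic.partition.mode=nonstrict;
-- Upper bound on partitions created by a single mapper or reducer
set hive.exec.max.dynamic.partitions.pernode=1000;
-- Upper bound on partitions created by the whole statement
set hive.exec.max.dynamic.partitions=1000;
-- Upper bound on HDFS files created by all tasks of the job
set hive.exec.max.created.files=100000;
```

If a limit is exceeded, the job fails, so it is safer to size these from the expected number of distinct partition values than to leave them at the defaults.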

Comparison with a bucketed table

create table telno_md5_bucketed(
phone string,
md5 string )
clustered by(md5) into 1024 buckets;

insert overwrite table telno_md5_bucketed
select phone,md5 from telno_md5;
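On Hive releases before 2.0, the insert only produces the declared number of buckets when bucketing enforcement is switched on; from 2.0 onward the property no longer exists and bucketing is always enforced. A sketch of the extra setting, together with a sampling query that reads a single bucket:

```sql
-- Needed on Hive versions before 2.0 only; omit it on 2.0 and later
set hive.enforce.bucketing=true;

insert overwrite table telno_md5_bucketed
select phone, md5 from telno_md5;

-- Read one bucket out of the 1024
select phone, md5
from telno_md5_bucketed tablesample(bucket 1 out of 1024 on md5)
limit 5;
```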

Comparison of execution results
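To repeat the comparison of data layout, execution time, and join time, statements along these lines could be run against both tables. Hive prints a `Time taken` line after each statement, which gives the timings. The table `md5_list` (a single `md5` column of digests to look up) is hypothetical and not part of the original post.

```sql
-- Data layout: the partitioned table has one directory per prefix,
-- the bucketed table has one file per bucket.
-- The "Location" line in the output gives the HDFS path to inspect.
describe formatted telno_md5_prt;
describe formatted telno_md5_bucketed;

-- Join against the partitioned table (md5_list is hypothetical)
select count(*)
from md5_list l
join telno_md5_prt p on l.md5 = p.md5;

-- Same join against the bucketed table
select count(*)
from md5_list l
join telno_md5_bucketed b on l.md5 = b.md5;
```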