When building a data warehouse on a CDH cluster, we usually use the Sqoop tool bundled with CDH to import business tables from MySQL into Hive. When a MySQL table holds a large amount of data, it should be loaded into a partitioned table and kept up to date with daily incremental syncs. On the very first Sqoop import, however, we normally dump all of the historical data into a single partition, because that is faster than importing the history day by day into the matching date partitions. The drawback is that querying historical data over a time range then becomes very inefficient, so the data has to be repartitioned afterwards. Since we mainly use Spark SQL for complex analytical queries and the Hive query editor in HUE for simple ones, the repartitioning steps below are carried out with these two tools.
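For context, the initial full load described above might look roughly like the following Sqoop invocation; this is only a sketch, and the connection URL, credentials, catch-all partition value, and mapper count are placeholders rather than the values used in this example.

# Sketch of a first-time full import: every historical row lands in one
# catch-all partition of the Hive table (placeholder connection details).
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/biz_db \
  --username etl --password-file /user/etl/.mysql.pwd \
  --table order \
  --hive-import --hive-database ods_db --hive-table order \
  --hive-partition-key record_date --hive-partition-value '1970-01-01' \
  -m 4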
1. Rename the original partitioned table as a backup and create an empty table with an identical schema
Start spark-sql and run show create table order; to obtain the table's DDL. The result is as follows:
spark-sql> use ods_db;
Time taken: 0.863 seconds
spark-sql> show create table order;
CREATE TABLE `order`(`biz_order_no` BIGINT, `biz_type` INT, `order_no` STRING, `city_id` INT, `city_name` STRING, `store_id` INT, `store_name` STRING, `product_no` INT, `product_name` STRING, `origin_price` INT, `fee` INT, `source_fee` INT, `settle_price` INT, `sale_price` INT, `final_settle_amount` INT, `pay_amount` INT, `pay_point` INT, `count_num` INT, `activity_id` INT, `activity_name` STRING, `merchant_no` STRING, `terminal_no` STRING, `partner_id` INT, `partner_name` STRING, `thd_order_no` STRING, `thd_order_sn` STRING, `budget_used_amount` INT, `ship_status` INT, `return_fee` INT, `return_fee_party` STRING, `status` INT, `create_time` TIMESTAMP, `update_time` TIMESTAMP, `activity_type` INT, `receive` INT, `hidden` INT, `point_act_id` STRING, `brand_id` INT, `reserv3` STRING, `relate_act_amount` INT, `other_info` STRING, `wanda_ticket_id` INT, `coupon_code` STRING, `play_time` TIMESTAMP)
COMMENT 'Imported by sqoop on 2019/11/06 17:07:29'
PARTITIONED BY (`record_date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'line.delim' = '\n',
'field.delim' = '',
'serialization.format' = ''
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
'transient_lastDdlTime' = '1573031252'
)
Time taken: 0.05 seconds, Fetched 1 row(s)
Then, in the Hive query editor in HUE, run the following statement to rename the original table:
alter table order rename to bak_order;
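To confirm the rename took effect before recreating the table, an optional sanity check in the same panel (using this example's table names) could be:

-- bak_order should now be listed instead of order
show tables like '*order*';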
Then run the CREATE TABLE statement obtained above in the Hive query editor to recreate the original table, this time empty:
CREATE TABLE `order`(`biz_order_no` BIGINT, `biz_type` INT, `order_no` STRING, `city_id` INT, `city_name` STRING, `store_id` INT, `store_name` STRING, `product_no` INT, `product_name` STRING, `origin_price` INT, `fee` INT, `source_fee` INT, `settle_price` INT, `sale_price` INT, `final_settle_amount` INT, `pay_amount` INT, `pay_point` INT, `count_num` INT, `activity_id` INT, `activity_name` STRING, `merchant_no` STRING, `terminal_no` STRING, `partner_id` INT, `partner_name` STRING, `thd_order_no` STRING, `thd_order_sn` STRING, `budget_used_amount` INT, `ship_status` INT, `return_fee` INT, `return_fee_party` STRING, `status` INT, `create_time` TIMESTAMP, `update_time` TIMESTAMP, `activity_type` INT, `receive` INT, `hidden` INT, `point_act_id` STRING, `brand_id` INT, `reserv3` STRING, `relate_act_amount` INT, `other_info` STRING, `wanda_ticket_id` INT, `coupon_code` STRING, `play_time` TIMESTAMP)
COMMENT 'Imported by sqoop on 2019/11/06 17:07:29'
PARTITIONED BY (`record_date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'line.delim' = '\n',
'field.delim' = '',
'serialization.format' = ''
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
'transient_lastDdlTime' = '1573031252'
)
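Before moving any data, it is worth double-checking that the recreated order table is empty and that the backup still holds everything. A possible check, run in spark-sql or the Hive panel with this example's table names, is:

-- the recreated table should have no partitions yet
show partitions `order`;
-- the backup should still show the single catch-all partition from the initial import
show partitions bak_order;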
2. Insert the data from the backup table into the new table
On the spark-sql command line, insert the data by partition column using Hive's non-strict dynamic partition mode. The command is as follows:
insert into order
select biz_order_no,biz_type,order_no,city_id,city_name,store_id,store_name,product_no,product_name,origin_price,fee,source_fee,settle_price,sale_price,final_settle_amount,pay_amount,pay_point,count_num,activity_id,activity_name,merchant_no,terminal_no,partner_id,partner_name,thd_order_no,thd_order_sn,budget_used_amount,ship_status,return_fee,return_fee_party,status,create_time,update_time,activity_type,receive,hidden,point_act_id,brand_id,reserv3,relate_act_amount,other_info,wanda_ticket_id,coupon_code,play_time,date(update_time) AS RECORD_DATE
from
bak_order;
Apart from the partition column, all other columns must be listed in exactly the same order as in the CREATE TABLE statement. The trailing date(update_time) AS RECORD_DATE before the FROM clause makes each row land dynamically in the partition given by the value of date(update_time). If the statement fails with a strict-mode error, run the following statement first to enable Hive's non-strict dynamic partition mode:
set hive.exec.dynamic.partition.mode=nonstrict;
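Once the insert finishes, it is sensible to verify the repartitioned data before dropping the backup. One way to sanity-check it, assuming the same table names as above, is:

-- the table should now have one partition per day of update_time
show partitions `order`;
-- row counts should match between the backup and the repartitioned table
select count(*) from bak_order;
select count(*) from `order`;
-- only after the counts match should the backup be dropped:
-- drop table bak_order;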