多种HDFS存储格式下的Sqoop全量和增量导入

最新推荐文章于 2022-08-29 21:51:51 发布

原创最新推荐文章于 2022-08-29 21:51:51 发布 · 819 阅读

5 ·

CC 4.0 BY-SA版权

文章标签：

#hdfs #hadoop #sqoop #orc #hive

大数据专栏收录该内容

2 篇文章

订阅专栏

背景环境介绍

CentOS 6.7
MySQL 5.7
Coudera 5.13.0

1.准备数据（MySQL）

1.在mysql中创建一张sqp_test表

drop table if exists sqp_test;
create table sqp_test(
    id bigint PRIMARY KEY AUTO_INCREMENT  COMMENT '主键ID',
    c1 varchar(64),
    c2 text,
    c3 int,
    c4 bigint,
    c5 double,
    c6 decimal(22,6),
    c7 datetime
);

2.插入一些数据，包含各种类型，字符串里有特殊字符。

truncate table sqp_test;

INSERT INTO sqp_test
(c1, c2, c3, c4, c5, c6, c7)
VALUES('中文字符', '魑魅魍魉', 100, 100, 100.2, 100.2, '2019-01-01 12:10:10');
INSERT INTO sqp_test
(c1, c2, c3, c4, c5, c6, c7)
VALUES('null类型', null, null, null, null, null, '2019-01-01 23:59:59');

INSERT INTO sqp_test
(c1, c2, c3, c4, c5, c6, c7)
VALUES('空字符串', '', null, null, null, null, null);

INSERT INTO sqp_test
(c1, c2, c3, c4, c5, c6, c7)
VALUES('特殊字符', '\t|\\|\'|\n|,|\"|@|',
null, null, null, null, '2019-01-02 12:12:12.2345');

INSERT INTO sqp_test
(c1, c2, c3, c4, c5, c6, c7)
VALUES('end', 'ABCDEFG,ABCD',
PI(), PI(), PI(), PI(), '2019-01-02 12:12:12.2345');

2.ORC存储下的Sqoop导入

ORC（Optimized Row Columnar,优化的行列存储），数据读写速度快，支持的数据类型广泛，支持ZLIB, SNAPPY压缩。

2.1全量

创建目标表orc格式存储，SNAPPY方式压缩的目标表

drop table if exists default.sqp_test_orc;
create table default.sqp_test_orc(
    id bigint COMMENT '主键ID',
    c1 string,
    c2 string,
    c3 int,
    c4 bigint,
    c5 double,
    c6 decimal(22,6),
    c7 string
) stored as orc tblproperties ("orc.compress"="SNAPPY");

sqoop导入
1.导入之前先清空目标表

hive -e "truncate table default.sqp_test_orc"

2.执行sqoop导入

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--table  sqp_test \
--split-by id \
--num-mappers 1 \
--hcatalog-database default \
--hcatalog-table sqp_test_orc \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'

** 相关参数说明 **

–num-mappers或-m 启动N个map来并行导入数据，默认是4个，最好不要将数字设置为高于集群的节点数，测试环境资源少仅指定一个
–split-by 指定一个字段进行map任务拆分，在指定建议使用唯一性高的字段。如果不指定该参数会自动去找源表的主键字段，源表没有主键字段会报错。当–num-mappers指定为1的时候，不进行map任务拆分，该选项无效。
–hcatalog-database 目标库名
–hcatalog-table 目标表名
–hcatalog-storage-stanza 指定存储格式，默认：stored as orcfile
**注意：hcatalog模式下–hive-overwrite没有效果，所以要额外的truncate数据程序 **

2.2增量

创建增量分区表

drop table if exists default.sqp_test_orc_add;
create table default.sqp_test_orc_add(
    id bigint COMMENT '主键ID',
    c1 string,
    c2 string,
    c3 int,
    c4 bigint,
    c5 double,
    c6 decimal(22,6),
    c7 string
)
COMMENT 'sqoop增量分区测试表'
partitioned by (etl_date string  COMMENT '数据日期') 
stored as orc tblproperties ("orc.compress"="NONE");

1.导入数据之前先清空目标分区数据

hive -e "alter table default.sqp_test_orc_add DROP IF EXISTS partition(etl_date='20190101')"

2.执行sqoop导入，以query的方式筛选增量筛选条件

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--query "select * from  sqp_test where c7>='2019-01-01' and c7<'2019-01-02' and \$CONDITIONS" \
--split-by id \
--num-mappers 1 \
--hcatalog-database default \
--hcatalog-table sqp_test_orc_add \
--hcatalog-partition-keys etl_date \
--hcatalog-partition-values 20190101

3.跑下一天的数据，以–where的方式指定增量筛选条件

hive -e "alter table default.sqp_test_orc_add DROP IF EXISTS partition(etl_date='20190102')"

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--table  sqp_test \
--where "c7>='2019-01-02' and c7<'2019-01-03'" \
--split-by id \
--num-mappers 1 \
--hcatalog-database default \
--hcatalog-table sqp_test_orc_add \
--hcatalog-partition-keys etl_date \
--hcatalog-partition-values 20190102

3.TextFile下的Sqoop导入

3.1全量

TextFile是hive默认的格式，第一次导入的时候如果hive没有建表，会根据关系型数据库的表结构自动产生一张hive表

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--table  sqp_test \
--null-string '\\N'   \
--null-non-string '\\N'   \
--hive-drop-import-delims \
--split-by id \
--num-mappers 1 \
--hive-import \
--hive-overwrite \
--hive-database default \
--hive-table sqp_test_txt

** 相关参数说明 **

–null-string 字符类型null值的格式，如果没有指定，则字符串null将被使用，\N是可以被hive识别为null值
–null-non-string 非字符类型null的格式，如果没有指定，则字符串null将被使用
–hive-drop-import-delims 在导入数据到hive中时，去掉数据中\n,\r和\01这样的字符
–hive-delims-replacement <arg> 用自定义的字符串替换掉数据中的\n, \r, and \01等字符
–direct 直接导入模式，使用的是关系数据库自带的导入导出工具。对于TextFile来讲虽然可以提升效率，但是与–null-string，–null-non-string和–hive-drop-import-delims等选项冲突。

注意–hive-drop-import-delims选项在导入数据到hive中时，只会去掉数据中\n,\r和\01这样的默认分割符号，所以对目标表来说最好使用默认的分割方式，而不指定–fields-terminated-by和–lines-terminated-by 参数

3.2增量

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--table  sqp_test \
--where "c7>='2019-01-01' and c7<'2019-01-02'" \
--null-string '\\N'   \
--null-non-string '\\N'   \
--hive-drop-import-delims \
--split-by id \
--num-mappers 1 \
--hive-import \
--hive-overwrite \
--hive-database default \
--hive-table sqp_test_txt_add \
--hive-partition-key etl_date \
--hive-partition-value 20190101

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--table  sqp_test \
--where "c7>='2019-01-02' and c7<'2019-01-03'" \
--null-string '\\N'   \
--null-non-string '\\N'   \
--hive-drop-import-delims \
--split-by id \
--num-mappers 1 \
--hive-import \
--hive-overwrite \
--hive-database default \
--hive-table sqp_test_txt_add \
--hive-partition-key etl_date \
--hive-partition-value 20190102

4.Parquet下的Sqoop导入

Parquet是一种劣势存储实现。

4.1全量

sqoop import  \
--connect "jdbc:mysql://quickstart.cloudera:3306/devdb?useUnicode=true&characterEncoding=UTF-8" \
--username devuser \
--password devuser \
--table  sqp_test \
--split-by id \
--num-mappers 1 \
--as-parquetfile \
--hive-import \
--hive-overwrite \
--hive-database default \
--hive-table sqp_test_prq