核心参数
- –check-column
用来指定一些列,这些列在增量导入时用来检查这些数据是否作为增量数据进行导入,和关系型数据库中的自增字段及时间戳类似.
注意:这些被指定的列的类型不能使任意字符类型,如char、varchar等类型都是不可以的,同时–check-column可以去指定多个列 - –incremental
用来指定增量导入的模式,两种模式分别为Append和Lastmodified - –last-value
指定上一次导入中检查列指定字段最大值
场景描述
环境信息:CDH SandBox
验证过程
1. 在Mysql中创建一个customertest表,指定一个时间戳字段。
createtable customertest(id int,name varchar(20),last_mod timestampDEFAULTCURRENT_TIMESTAMPONUPDATECURRENT_TIMESTAMP);
2. 插入数据。
insertinto customertest(id,name) values(1,'neil');
insertinto customertest(id,name) values(2,'jack');
insertinto customertest(id,name) values(3,'martin');
insertinto customertest(id,name) values(4,'tony');
insertinto customertest(id,name) values(5,'eric');
3. 依次执行如下三个脚本:抽取脚本、建表脚本、加载脚本。
sqoop import \
--connect jdbc:mysql://localhost:3306/wht \
--username root \
--password cloudera \
--table customertest \
--as-textfile \
--target-dir /user/data_exchange/customertest/20180623 \
--fields-terminated-by \\t \
--delete-target-dir \
DROP TABLE IF EXISTS OUTPOST.customertest;
CREATE EXTERNAL TABLE OUTPOST.customertest (
id INT COMMENT '',
name VARCHAR(255) COMMENT '',
last_mod TIMESTAMP COMMENT ''
)
COMMENT 'car备注'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES('field.delim'='\t', 'serialization.encoding'='GBK')
STORED AS TEXTFILE;
DROP TABLE IF EXISTS CASTLE.customertest;
CREATE TABLE CASTLE.customertest (
id INT COMMENT '',
name VARCHAR(255) COMMENT '',
last_mod TIMESTAMP COMMENT ''
)
COMMENT 'customertest备注'
PARTITIONED BY (PARTITION_VAR STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS PARQUET;
ALTER TABLE outpost.customertest SET LOCATION '/user/data_exchange/customertest/20180623';
ALTER TABLE castle.customertest DROP IF EXISTS PARTITION(PARTITION_VAR='20180623');
INSERT INTO TABLE castle.customertest PARTITION(PARTITION_VAR='20180623')
SELECT
id,
name,
last_mod
FROM outpost.customertest;
4. 查询结果。

5. 在MySQL中插入新的记录。
6. 依次执行如下两个脚本,增量导入数据。
sqoop import \
--connect jdbc:mysql://localhost:3306/wht \
--username root \
--password cloudera \
--table customertest \
--columns "id,name,last_mod" \
--check-column last_mod \
--incremental lastmodified \
--last-value "2018-06-22 23:59:59" \
--as-textfile \
--target-dir /user/data_exchange/customertest/20180624 \
--fields-terminated-by \\t \
-m 1 \
--append
ALTER TABLE outpost.customertest SET LOCATION '/user/data_exchange/customertest/20180624';
ALTER TABLE castle.customertest DROP IF EXISTS PARTITION(PARTITION_VAR='20180624');
INSERT INTO TABLE castle.customertest PARTITION(PARTITION_VAR='20180624')
SELECT
id,
name,
last_mod
FROM outpost.customertest;
7. 查询结果。