Sqoop导入数据关于--split by 和 --m参数详细解析

最新推荐文章于 2024-08-01 13:30:08 发布

原创最新推荐文章于 2024-08-01 13:30:08 发布 · 3.6k 阅读

3 ·

CC 4.0 BY-SA版权

文章标签：

#hive #sqoop #hadoop

sqoop 专栏收录该内容

2 篇文章

订阅专栏

本文探讨了使用sqoop进行数据导入时，通过调整split-by参数和-m参数来优化数据处理效率的方法。具体分析了不同参数设置下，map任务的划分方式及数据分布情况，指出在split-by字段为int类型且非连续时可能导致数据处理不均衡，提出了通过设置-m为1并使用时间字段作为分区字段的解决方案。

split-by 根据不同的参数类型有不同的切分方法；如int型，Sqoop会取最大和最小split-by字段值，然后根据传入的num-mappers来确定划分几个区域。比如select max(split_by),min(split-by) from得到的max(split-by)和min(split-by)分别为1000和1，而num-mappers（-m）为2的话，则会分成两个区域 (1,500)和(501-1000),同时也会分成2个sql给2个map去进行导入操作，分别为select XXX from table where split-by>=1 and split-by<500和select XXX from table where split-by>=501 and split-by<=1000.最后每个map各自获取各自SQL中的数据进行导入工作。

sqoop import \
--connect jdbc:mysql://hadoop01:3306/feidata \
--username xxx \
--password xxx \
--table emp \
--hive-overwrite \
--delete-target-dir \
--hive-import --hive-database fei_hive \
--hive-table emp_import_partition \
--fields-terminated-by '\t' \
--split-by deptno

如果不设置的话，默认是--m=4

[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00000
20/01/13 16:34:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369	SMITH	CLERK	7902	1980-12-17 00:00:00.0	800.00	null	1
7499	ALLEN	SALESMAN	7698	1981-02-20 00:00:00.0	1600.00	300.00	2
7521	WARD	SALESMAN	7698	1981-02-22 00:00:00.0	1250.00	500.00	3
7876	ADAMS	CLERK	7788	1983-01-12 00:00:00.0	1100.00	null	1
7900	JAMES	CLERK	7698	1981-12-03 00:00:00.0	950.00	null	2
7902	FORD	ANALYST	7566	1981-12-03 00:00:00.0	3000.00	null	3
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00001
20/01/13 16:34:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7566	JONES	MANAGER	7839	1981-04-02 00:00:00.0	2975.00	null	4
7654	MARTIN	SALESMAN	7698	1981-09-28 00:00:00.0	1250.00	1400.00	5
7934	MILLER	CLERK	7782	1982-01-23 00:00:00.0	1300.00	null	4
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00002
20/01/13 16:34:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7698	BLAKE	MANAGER	7839	1981-05-01 00:00:00.0	2850.00	null	6
7782	CLARK	MANAGER	7839	1981-06-09 00:00:00.0	2450.00	null	7
[hadoop@hadoop01 hive]$ hdfs dfs -cat /user/hive/warehouse/fei_hive.db/emp_import_partition/part-m-00003
20/01/13 16:34:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7788	SCOTT	ANALYST	7566	1982-12-09 00:00:00.0	3000.00	null	8
7839	KING	PRESIDENT	null	1981-11-17 00:00:00.0	5000.00	null	9
7844	TURNER	SALESMAN	7698	1981-09-08 00:00:00.0	1500.00	0.00	10

上面划分map区域就是(0,3],(3,5],(5,7],(7,10],

上面遇到的问题就是当split-by是int类型但不是自增长的话，各个map分配的数据是不均衡的，可能会有些map很忙，有些map几乎没有数据处理的情况，数据多的时候，会拉数据的时候数据库出现奔溃的情况

当split-by不是int型时出现如上场景中的问题。目前想到的解决办法是：将-m设置称1，split-by不设置，即只有一个map运行，缺点是不能并行map录入数据，（注意，当-m设置的值大于1时，split-by必须设置字段）数据多的时候，也会在拉数据的时候数据库出现奔溃的情况。

sqoop import \
--connect jdbc:mysql://hadoop01:3306/feidata \
--username xxx \
--password xxx\
--table emp \
--hive-overwrite \
--delete-target-dir \
--hive-import --hive-database fei_hive \
--hive-table emp_import_partition \
--fields-terminated-by '\t' \
--split-by hiredate
--m 1

[hadoop@hadoop01 ~]$ hdfs dfs -cat /user/hive/warehouse/fei_hive3.db/emp_import_partition/part-m-00000
20/01/13 17:07:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369	SMITH	CLERK	7902	1980-12-17 00:00:00.0	800.00	null	20
7499	ALLEN	SALESMAN	7698	1981-02-20 00:00:00.0	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-02-22 00:00:00.0	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-04-02 00:00:00.0	2975.00	null	20
7654	MARTIN	SALESMAN	7698	1981-09-28 00:00:00.0	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-05-01 00:00:00.0	2850.00	null	30
7782	CLARK	MANAGER	7839	1981-06-09 00:00:00.0	2450.00	null	10
7788	SCOTT	ANALYST	7566	1982-12-09 00:00:00.0	3000.00	null	20
7839	KING	PRESIDENT	null	1981-11-17 00:00:00.0	5000.00	null	10
7844	TURNER	SALESMAN	7698	1981-09-08 00:00:00.0	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1983-01-12 00:00:00.0	1100.00	null	20
7900	JAMES	CLERK	7698	1981-12-03 00:00:00.0	950.00	null	30
7902	FORD	ANALYST	7566	1981-12-03 00:00:00.0	3000.00	null	20
7934	MILLER	CLERK	7782	1982-01-23 00:00:00.0	1300.00	null	10