Sqoop的使用

最新推荐文章于 2023-03-02 19:02:08 发布

baigp

最新推荐文章于 2023-03-02 19:02:08 发布

阅读量636

点赞数 1

分类专栏： Sqoop 文章标签： Sqoop 数据库

本文链接：https://blog.youkuaiyun.com/u010476994/article/details/72247554

版权

Sqoop 专栏收录该内容

1 篇文章

订阅专栏

介绍：

Sqoop是Apache旗下的一款开源工具，主要用于 关系型数据库（Oracle、Mysql等）与 非关系型数据库（Hive、HBase等）之间的数据传递，可以将关系型数据库中的数据导到HDFS上，也可以将HDFS上的数据导到关系型数据库中。

版本问题：

Sqoop目前有Sqoop1 与 Sqoop2 两个版本。虽然都是Sqoop的版本，但是Sqoop1和Sqoop2 却完全不兼容。在实际项目中，使用更多的是Sqoop1，所以我们使用Sqoop1版本。

架构与原理：

Sqoop的架构十分简单，可以说是Hadoop生态中架构最为简单的一个框架。其工作原理是：通过sqoop的客户端接入Hadoop，将其sqoop任务解析成MapReduce任务执行。

Sqoop架构图如下：

Sqoop的安装本文不作介绍，具体安装步骤请参考：

http://blog.youkuaiyun.com/u010476994/article/details/72247562

sqoop命令执行方式：

1、直接命令行执行：例如

sqoop list-databases --connect jdbc:mysql://192.168.152.101:3306/ --username root --password 123456

sqoop create-hive-table --connect jdbc:mysql://192.168.152.101:3306/ --username root --password 123456 --table TA_MBL_DVLP_PACKAGE_D --hive-table TA_MBL_DVLP_PACKAGE_DAY_TEST

根据mysql中的表，创建hive表，只有表结构，没有数据。 --table是mysql中的表，--hive-table是hive中的表

2、将命令写入文件，读取文件执行。举例：

新建一个文件option1，在option1中写入以下内容：

   list-databases 
 
   --connect 
 
   jdbc:mysql://192.168.152.101:3306/ 
 
   --username 
 
   root 
 
   --password 
 
   123456

一个参数占一行，保存退出。

执行： sqoop --options-file option1

Sqoop常用命令及参数：

说明：一般我们所说的导入导出，都是以hdfs的角度来说的。即sqoop导入说的是往hdfs中导入数据，sqoop导出是从hdfs往外导出数据。以下内容是官网给出的参数：

1、import

Argument	Description
`--append`	Append data to an existing dataset in HDFS
`--as-avrodatafile`	Imports data to Avro Data Files
`--as-sequencefile`	Imports data to SequenceFiles
`--as-textfile`	Imports data as plain text (default)
`--as-parquetfile`	Imports data to Parquet Files
`--boundary-query <statement>`	Boundary query to use for creating splits
`--columns <col,col,col…>`	Columns to import from table
`--delete-target-dir`	Delete the import target directory if it exists
`--direct`	Use direct connector if exists for the database
`--fetch-size <n>`	Number of entries to read from database at once.
`--inline-lob-limit <n>`	Set the maximum size for an inline LOB
`-m,--num-mappers <n>`	Use n map tasks to import in parallel
`-e,--query <statement>`	Import the results of `statement`.
`--split-by <column-name>`	Column of the table used to split work units. Cannot be used with `--autoreset-to-one-mapper` option.
`--autoreset-to-one-mapper`	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with `--split-by <col>` option.
`--table <table-name>`	Table to read
`--target-dir <dir>`	HDFS destination dir
`--warehouse-dir <dir>`	HDFS parent for table destination
`--where <where clause>`	WHERE clause to use during import
`-z,--compress`	Enable compression
`--compression-codec <c>`	Use Hadoop codec (default gzip)
`--null-string <null-string>`	The string to be written for a null value for string columns
`--null-non-string <null-string>`	The string to be written for a null value for non-string columns

常用参数解释：

(1) 关于文件格式：Sqoop支持三种文件格式，包含一种文本格式和两种二进制格式。二进制格式分别是Avro和SequenceFile。使用--as-avrodatafile或--as-sequencefile以指定具体使用哪种二进制格式。默认是文本格式，也是最常用的格式。

(2) --columns

在执行导入操作时，可以指定将关系型数据库表的哪些字段导入到hdfs中。

(3) --delete-target-dir

如果指定的hdfs目录已经存在，则先进行删除，再执行导入操作。

(4) --m

指定要开启几个map task执行并行导入操作。默认值是1，即没有开启并行功能。一般和 --split-by配合使用，由--split-by指定按照哪个字段进行拆分。

例如： --m 3 --split-by id，即是按id分成三段，并行执行导入操作。具体原理大致如下：

首先根据id（假设为int类型），查询出 max(id) 与 min(id) ，界定id的范围，假如是min(id) = 1，max(id)=15，sqoop执行时，会按1~5、6~10、11~15三段，来并行执行。

sqoop还支持拆分其它类型的字段，如 Date,Text,Float,Integer，Boolean,NText,BigDecimal等等。

在拆分字段时，尽量找值分布均匀的字段，保证并行任务之间的数据量大致相等，以达到最大的执行效率。

(5) --split-by 按照哪个字段进行拆分。

(6) -e 、--query 可以跟一段sql，将sql的执行结果导入到hdfs当中。

在使用该参数时，有一个特别需要注意的地方。在官方文档中，对此进行了说明：

   If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression.  
 
   大意如下：当以并行的方式将查询结果导入时，每一个map task 都需要执行一个查询副本，会将sqoop命令中的约束条件进行分区查询，获取结果。 查询语句中必须包含“$CONDITIONS”， sqoop查询会根据该标识符替换特定的约束条件。 
 
   也就是说， 
  在使用-e、--query参数时，必须包含 where子句 和 "$CONDITIONS"条件。 
 
  另：如果查询语句使用双引号，则 $CONDITIONS 需要使用\进行转义， 即 \$CONDITIONS ; 若使用单引号，则不需要转义。

(7) --table 指定数据来源。

(8) --target-dir 指定数据要导到hdfs下的那个目录下

(9) --where 添加筛选条件

举例：

Mysql --> HDFS

将Mysql中的psn表的三个字段（id、name、age）导入到hdfs的/sqoop/data目录下，筛选条件是 “age = 25”

  sqoop import --connect jdbc:mysql://192.168.152.101:3306/mysql --username root --password 123456 --table psn --columns id,name,age --delete-target-dir --target-dir /sqoop/data -m 3 --split-by id --where "age=25" 

Mysql --> Hive

   sqoop import --connect  jdbc:mysql://192.168.152.101:3306/mysql --username root --password 123456 --as-textfile 
 
   --query  'select id, name, msg from psn where id like "1%" and 
   $CONDITIONS'  --delete-target-dir --target-dir  /sqoop/tmp 
 
   -m  1 --hive-home  /home/hive-1.2.1 
 
   --hive-import  --create-hive-table  --hive-table  t_test

在使用

2、export

Argument	Description
`--columns <col,col,col…>`	Columns to export to table
`--direct`	Use direct export fast path
`--export-dir <dir>`	HDFS source path for the export
`-m,--num-mappers <n>`	Use n map tasks to export in parallel
`--table <table-name>`	Table to populate
`--call <stored-proc-name>`	Stored Procedure to call
`--update-key <col-name>`	Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
`--update-mode <mode>`	Specify how updates are performed when new rows are found with non-matching keys in database.
	Legal values for `mode` include `updateonly` (default) and `allowinsert`.
`--input-null-string <null-string>`	The string to be interpreted as null for string columns
`--input-null-non-string <null-string>`	The string to be interpreted as null for non-string columns
`--staging-table <staging-table-name>`	The table in which data will be staged before being inserted into the destination table.
`--clear-staging-table`	Indicates that any data present in the staging table can be deleted.
`--batch`	Use batch mode for underlying statement execution.

举例：

   sqoop export  --connect jdbc:mysql://192.168.152.101:3306/mysql --username root --password 123456 
 
   -m  1  --columns  id,name,msg  --export-dir  /sqoop/data  --table  h_psn