Sqoop import OR export

Data sources:
1: relational databases, pulled in with Sqoop (SQL to Hadoop)
2: files, collected in real time with Flume

Job scheduling: Oozie

These are important frameworks in the Hadoop ecosystem and need monitoring (ideally through a unified web UI).

1: Download Sqoop

Official documentation: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

Sqoop depends on Hadoop; under the hood every Sqoop job runs as MapReduce.
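
A quick sanity check after installation (a minimal sketch, assuming sqoop is on the PATH and the Hadoop configuration is already in place):

sqoop version        # prints the Sqoop build information
sqoop help import    # lists the arguments of the import tool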

Using Sqoop, with a MySQL database as the example:

1: List the databases in MySQL:
sqoop list-databases \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123
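
Similarly, list-tables shows the tables inside a single database (a minimal sketch reusing the connection settings above):

sqoop list-tables \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123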

2: import (-m, --num-mappers <n>: use n map tasks to import in parallel)

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--table emp \
-m 1
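
Without --target-dir, the data lands under the submitting user's HDFS home directory, in a subdirectory named after the table. A hedged way to inspect the result (the path assumes the job ran as user beifeng):

hdfs dfs -ls /user/beifeng/emp
hdfs dfs -cat /user/beifeng/emp/part-m-00000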

2.1: import (--target-dir <dir>: HDFS plain table destination)
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--table emp \
--target-dir /user/beifeng/sqoop/import_emp \
-m 1

2.2: import (--where <where clause>: WHERE clause to use during import)
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--where "city='sec-bad'" \
--target-dir  /user/beifeng/sqoop/import_emp_add \
--table emp_add \
--m 1

2.3: import (specify the file format of the imported data)
--as-avrodatafile     Imports data to Avro data files
--as-parquetfile      Imports data to Parquet files
--as-sequencefile     Imports data to SequenceFiles
--as-textfile         Imports data as plain text

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--target-dir  /user/beifeng/sqoop/import_emp2 \
--table emp \
--as-parquetfile \
--m 1

2.4: import (import only specific columns)
--columns <col,col,col...>     Columns to import from table
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--target-dir  /user/beifeng/sqoop/import_emp4 \
--table emp \
--m 1 \
--columns name,salary

2.5: In real projects the data to be processed usually needs preliminary cleaning and filtering, which a free-form query can do at import time (--query):
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--query 'select name,salary from emp where $CONDITIONS and salary < 50000' \
--target-dir  /user/beifeng/sqoop/import_emp5 \
--m 1
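
When --query runs with more than one mapper, Sqoop also requires --split-by so it can substitute a different range predicate into $CONDITIONS for each map task. A hedged sketch (splitting on salary and the target directory name are assumptions):

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--query 'select name,salary from emp where $CONDITIONS and salary < 50000' \
--split-by salary \
--target-dir /user/beifeng/sqoop/import_emp5_parallel \
-m 4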

2.6: Compression

-z,--compress                     Enable compression
--compression-codec <codec>       Compression codec to use for import
--fields-terminated-by <char>     Sets the field separator character
--delete-target-dir               Imports data in delete mode (deletes the target directory if it already exists)

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--target-dir  /user/beifeng/sqoop/import_emp6 \
--table emp \
--m 1 \
--compress \
--compression-codec org.apache.hadoop.io.compress.DefaultCodec \
--fields-terminated-by '\t' \
--delete-target-dir
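
With DefaultCodec the output part files carry a .deflate extension; a hedged check of the result:

hdfs dfs -ls /user/beifeng/sqoop/import_emp6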

2.7: import (incremental import)

Incremental import needs a column that identifies new rows; typically the table has an insert-time column such as createtime.
Equivalent query: createtime >= startTime and createtime <= endTime;

sqoop:
Incremental import arguments:
--check-column <column>        Source column to check for incremental change
--incremental <import-type>    Define an incremental import of type 'append' or 'lastmodified'
--last-value <value>        Last imported value in the incremental check column

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--target-dir  /user/beifeng/sqoop/import_emp7 \
--table emp \
--incremental append \
--check-column salary \
--last-value 4 \
--m 1


Log excerpt:
INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX(`salary`) FROM `emp`
INFO tool.ImportTool: Incremental import based on column `salary`
INFO tool.ImportTool: Lower bound value: 4
INFO tool.ImportTool: Upper bound value: 50000
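
Instead of tracking --last-value by hand, a saved Sqoop job records the last imported value after each run and reuses it on the next execution. A minimal sketch (the job name incr_emp is an assumption; by default the metastore does not store the password, so Sqoop may prompt for it when the job runs):

sqoop job --create incr_emp \
-- import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--table emp \
--target-dir /user/beifeng/sqoop/import_emp7 \
--incremental append \
--check-column salary \
--last-value 4 \
-m 1

sqoop job --exec incr_emp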


2.8: import (--direct: bypass JDBC and use the database's native dump tool, mysqldump in the case of MySQL, for faster transfers)

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--target-dir  /user/beifeng/sqoop/import_emp7 \
--delete-target-dir \
--table emp \
--direct \
--m 1


3: export
Import the MySQL data into HDFS with compression, then load it into a Hive table (the data must be compressed first);
Create the table in Hive:
drop table if exists default.hive_emp_zlib;
CREATE TABLE default.hive_emp_zlib (
id string,
name string,
deg string,
salary string,
dept string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

load data inpath '/user/beifeng/sqoop/import_emp6' into table default.hive_emp_zlib;
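
A hedged check that Hive reads the compressed files transparently:

hive -e "select * from default.hive_emp_zlib limit 10;"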


##################### Export data to an RDBMS #########################
HIVE TABLE

Hive table: query the data through HiveServer2 over JDBC
HDFS file: export ---> MySQL/Oracle/DB2

Create the my_user table in MySQL and prepare the data:
create table my_user(
id varchar(10),
name varchar(10),
password varchar(10)
);
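
A hypothetical user.txt for this table; since the export command below does not set --input-fields-terminated-by, the fields use Sqoop's default comma delimiter:

1,tom,123456
2,jack,abcdef
3,mary,qwerty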

hadoop dfs -mkdir /user/beifeng/sqoop/exp/user
hadoop dfs -put user.txt /user/beifeng/sqoop/exp/user

1:export --> MYSQL
--export-dir <dir>                  HDFS source path for the export
--fields-terminated-by <char>      Sets the field separator character

sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--export-dir  /user/beifeng/sqoop/exp/user \
--table my_user \
--m 1
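
A hedged check on the MySQL side (the client prompts for the password):

mysql -u wql -p -e "select * from sqoop.my_user;"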

##################### MySQL import to Hive #########################

Hive arguments:
--create-hive-table                   
Fail if the target hive table exists

--hive-database <database-name>       
Sets the database name to use when importing to hive

--hive-delims-replacement <arg>       
Replace Hive record \0x01 and row delimiters (\n\r) from imported string fields with user-defined string

--hive-drop-import-delims             
Drop Hive record \0x01 and row delimiters (\n\r) from imported string fields

--hive-home <dir>                    
Override $HIVE_HOME

--hive-import
Import tables into Hive (uses Hive's default delimiters if none are set)

--hive-overwrite                            
Overwrite existing data in the Hive table

--hive-partition-key <partition-key>       
Sets the partition key to use when importing to hive

--hive-partition-value <partition-value>    
Sets the partition value to use when importing to hive

--hive-table <table-name>                
Sets the table name to use when importing to hive

--map-column-hive <arg>                     
Override mapping for specific column to hive types.

1: Create the Hive table:

create table default.my_user(
id string,
name string,
password string
)
row format delimited fields terminated by "\t";

2: Run the import

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--target-dir  /user/beifeng/sqoop/import_my_user \
--delete-target-dir \
--table my_user \
--hive-database default \
--hive-import \
--hive-table my_user \
--fields-terminated-by '\t' \
--m 1
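
A hedged check after the Hive import (Sqoop first stages the data under --target-dir and then loads it into the Hive table):

hive -e "select * from default.my_user limit 5;"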

3: Run the export (export the Hive table data to MySQL; the target table my_user2 must already exist, see step 4 below)
--input-fields-terminated-by <char>    Sets the input field separator

sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username wql \
--password root123 \
--export-dir  /user/hive/warehouse/my_user \
--table my_user2 \
--input-fields-terminated-by '\t' \
--m 1

4: Create the my_user2 table in MySQL (the target table for the export above):

create table my_user2(
id varchar(10),
name varchar(10),
password varchar(10)
);

############################ --options-file ###################################

Create a sqoop-import-hdfs.txt file, with each option and each value on its own line (blank lines and lines starting with # are ignored):

import
--connect
jdbc:mysql://localhost:3306/sqoop
--username
wql
--password
root123
--target-dir
/user/beifeng/sqoop/import_my_user
--delete-target-dir
--table
my_user
--hive-database
default
--hive-import
--hive-table
my_user
--fields-terminated-by
'\t'
--m
1

sqoop --options-file /home/wql/app/sqoop/testdata/sqoop-import-hdfs.txt


Example:
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
is equivalent to:
$ sqoop --options-file /users/homer/work/import.txt --table TEST

 
