Contents
1. Component Environment

Name | Version | Notes | Download
---- | ------- | ----- | --------
hadoop | 3.4.0 | official binary release | |
hive | 3.1.3 | built from source | |
mysql | 8.0.31 | | |
datax | 0.0.1-SNAPSHOT | | http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz |
datax-web | 2.1.2 | built from source | github.com |
centos | centos7 | x86 | |
java | 1.8 | | |
2. Installing DataX

2.1. Download and extract DataX

Download http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz and extract it under /cluster:

tar -zxvf datax.tar.gz

Run the bundled test job:

./bin/datax.py job/job.json
This fails with:

File "/cluster/datax/bin/./datax.py", line 114
    print readerRef
    ^^^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

Explanation: the system has both Python 2 and Python 3 installed, with Python 3 as the default. The launcher scripts in datax/bin only support Python 2, so they must be run with python2 explicitly. Later we will back up and replace these three files; datax-web ships Python 3 compatible versions under its doc directory.
- Python (2.x) (to support Python 3, replace the three Python files under datax/bin with the ones under doc/datax-web/datax-python3). Required; it is mainly used to launch the underlying DataX startup script during scheduling. By default DataX is executed as a Java child process; you can customize this to launch it via Python instead.
- Reference: datax-web/doc/datax-web/datax-web-deploy.md at master · WeiYe-Jing/datax-web (github.com)
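The backup-and-replace step described above can be sketched as a small shell helper. The function name and the backup directory `py2-backup` are my own choices; the source directory `doc/datax-web/datax-python3` comes from the datax-web docs, and the paths in the example call follow this guide's layout (adjust them to yours):

```shell
# Back up DataX's Python 2 launcher scripts and swap in the
# Python 3 compatible versions shipped under datax-web's doc directory.
swap_datax_launchers() {
    datax_home=$1        # e.g. /cluster/datax
    dataxweb_src=$2      # e.g. /cluster/datax-web-2.1.2
    py3_dir="$dataxweb_src/doc/datax-web/datax-python3"

    [ -d "$datax_home/bin" ] || { echo "no such dir: $datax_home/bin"; return 1; }
    [ -d "$py3_dir" ]        || { echo "no such dir: $py3_dir"; return 1; }

    # keep the originals so the swap is reversible
    mkdir -p "$datax_home/bin/py2-backup"
    cp "$datax_home"/bin/*.py "$datax_home/bin/py2-backup/"
    cp "$py3_dir"/*.py "$datax_home/bin/"
    echo "replaced launcher scripts in $datax_home/bin"
}

# Example (adjust paths to your install):
# swap_datax_launchers /cluster/datax /cluster/datax-web-2.1.2
```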
Run the test job again, explicitly with Python 2:

python2 bin/datax.py job/job.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2024-10-13 21:04:41.543 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2024-10-13 21:04:41.553 [main] INFO Engine - the machine info =>
	osInfo: Oracle Corporation 1.8 25.40-b25
	jvmInfo: Linux amd64 5.8.13-1.el7.elrepo.x86_64
	cpu num: 8
	totalPhysicalMemory: -0.00G
	freePhysicalMemory: -0.00G
	maxFileDescriptorCount: -1
	currentOpenFileDescriptorCount: -1

	GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2024-10-13 21:04:41.575 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"streamreader",
"parameter":{
"column":[
{
"type":"string",
"value":"DataX"
},
{
"type":"long",
"value":19890604
},
{
"type":"date",
"value":"1989-06-04 00:00:00"
},
{
"type":"bool",
"value":true
},
{
"type":"bytes",
"value":"test"
}
],
"sliceRecordCount":100000
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"encoding":"UTF-8",
"print":false
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"byte":10485760
}
}
}

2024-10-13 21:04:41.599 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2024-10-13 21:04:41.601 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2024-10-13 21:04:41.601 [main] INFO JobContainer - DataX jobContainer starts job.
2024-10-13 21:04:41.604 [main] INFO JobContainer - Set jobId = 0
2024-10-13 21:04:41.623 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2024-10-13 21:04:41.624 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2024-10-13 21:04:41.624 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2024-10-13 21:04:41.624 [job-0] INFO JobContainer - jobContainer starts to do split ...
2024-10-13 21:04:41.625 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 10485760 bytes.
2024-10-13 21:04:41.626 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
2024-10-13 21:04:41.627 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
2024-10-13 21:04:41.649 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2024-10-13 21:04:41.654 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2024-10-13 21:04:41.657 [job-0] INFO JobContainer - Running by standalone Mode.
2024-10-13 21:04:41.666 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2024-10-13 21:04:41.671 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2024-10-13 21:04:41.672 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2024-10-13 21:04:41.685 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2024-10-13 21:04:41.986 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[302]ms
2024-10-13 21:04:41.987 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2024-10-13 21:04:51.677 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.022s | All Task WaitReaderTime 0.040s | Percentage 100.00%
2024-10-13 21:04:51.677 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2024-10-13 21:04:51.678 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2024-10-13 21:04:51.678 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2024-10-13 21:04:51.678 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2024-10-13 21:04:51.680 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /cluster/datax/hook
2024-10-13 21:04:51.682 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s

2024-10-13 21:04:51.682 [job-0] INFO JobContainer - PerfTrace not enable!
2024-10-13 21:04:51.683 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.022s | All Task WaitReaderTime 0.040s | Percentage 100.00%
2024-10-13 21:04:51.684 [job-0] INFO JobContainer -
Job start time            : 2024-10-13 21:04:41
Job end time              : 2024-10-13 21:04:51
Total elapsed time        : 10s
Average throughput        : 253.91KB/s
Record write speed        : 10000rec/s
Total records read        : 100000
Total read/write failures : 0
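As a sanity check, the rates in the summary follow directly from the run's totals (100,000 records and 2,600,000 bytes over the 10-second job):

```shell
# derive the reported throughput and record rate from the run's totals
awk 'BEGIN {
    records = 100000; bytes = 2600000; seconds = 10
    printf "%.2f KB/s\n", bytes / seconds / 1024   # 253.91 KB/s
    printf "%d rec/s\n", records / seconds         # 10000 rec/s
}'
```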
DataX can also generate a job template for a given reader/writer pair, e.g. for HDFS-to-MySQL:

./bin/datax.py -r hdfsreader -w mysqlwriter
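A filled-in job based on that reader/writer pair might look like the sketch below. The parameter names follow the DataX hdfsreader and mysqlwriter plugin docs, but every value (defaultFS, path, JDBC URL, credentials, columns, table) is a placeholder you must replace; compare against the template the command above prints:

```shell
# write a skeleton hdfsreader -> mysqlwriter job (all values are placeholders)
write_hdfs_to_mysql_job() {
    cat > "$1" <<'EOF'
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [{
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "defaultFS": "hdfs://namenode:8020",
          "path": "/warehouse/demo/*",
          "fileType": "text",
          "fieldDelimiter": "\t",
          "column": [{ "index": 0, "type": "string" }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "user",
          "password": "pass",
          "column": ["col1"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://localhost:3306/demo",
            "table": ["demo_table"]
          }]
        }
      }
    }]
  }
}
EOF
}

# Example: write_hdfs_to_mysql_job job/hdfs2mysql.json
```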
3. Installing datax-web

3.0. Download and build the datax-web source

git clone git@github.com:WeiYe-Jing/datax-web.git
mvn -U clean package assembly:assembly -Dmaven.test.skip=true

After a successful build, the package is located at {DataX_source_code_home}/target/datax/datax/, with the following layout:

$ cd {DataX_source_code_home}
$ ls ./target/datax/datax/
bin  conf  job  lib  log  log_perf  plugin  script  tmp
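A quick way to confirm the build output matches the listing above is to check for the expected top-level directories (the helper function itself is illustrative, not part of the build):

```shell
# verify the built package contains the expected top-level directories
check_datax_layout() {
    base=$1
    for d in bin conf job lib plugin script; do
        [ -d "$base/$d" ] || { echo "missing: $d"; return 1; }
    done
    echo "layout ok: $base"
}

# Example: check_datax_layout ./target/datax/datax
```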
3.1. Create the datax-web metadata database in MySQL

mysql -u root -p
password: ******

CREATE DATABASE dataxweb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE dataxweb;
3.2. Install datax-web

3.2.1. Run install.sh to extract and deploy

In the /cluster/datax-web-2.1.2/bin directory, run:

./install.sh

Follow the prompts. When it finishes, the modules are extracted and the metadata database is initialized.
3.2.2. Manually edit the datax-admin configuration

/cluster/datax-web-2.1.2/modules/datax-admin/bin/env.properties:

# environment variables
JAVA_HOME=/java/jdk
WEB_LOG_PATH=/cluster/datax-web-2.1.2/modules/datax-admin/logs
WEB_CONF_PATH=/cluster/datax-web-2.1.2/modules/datax-admin/conf
DATA_PATH=/cluster/datax-web-2.1.2/modules/datax-admin/data
SERVER_PORT=6895
PID_FILE_PATH=/cluster/datax-web-2.1.2/modules/datax-admin/dataxadmin.pid
# mail account
MAIL_USERNAME="example@qq.com"
MAIL_PASSWORD="*********************"
#debug
REMOTE_DEBUG_SWITCH=true
REMOTE_DEBUG_PORT=7223
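Because it is easy to lose a line break when editing this file by hand (e.g. two assignments fused onto one line), a quick check that each expected key sits at the start of its own line can help. The key list mirrors the file above; the helper function is illustrative:

```shell
# check that env.properties defines each expected key on its own line
check_env_properties() {
    f=$1
    for key in JAVA_HOME WEB_LOG_PATH WEB_CONF_PATH DATA_PATH SERVER_PORT PID_FILE_PATH; do
        grep -q "^$key=" "$f" || { echo "missing or mangled: $key"; return 1; }
    done
    echo "env.properties ok"
}

# Example:
# check_env_properties /cluster/datax-web-2.1.2/modules/datax-admin/bin/env.properties
```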
3.2.3. Manually edit the datax-executor configuration
/cluster/datax-web-2.1.2/modules/datax-executor/bi