DataX: an offline synchronization framework for heterogeneous data sources, built around a plugin system. Reader plugins handle data input, writer plugins handle output, and the framework in between can run transformer plugins when the data needs to be reshaped in flight.
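A transformer is declared next to the reader and writer inside a job's content element. A minimal sketch using dx_substr, one of the built-in transformer functions (the column index and substring bounds below are purely illustrative):
"transformer": [{
    "name": "dx_substr",
    "parameter": {
        "columnIndex": 1,
        "paras": ["0", "5"]
    }
}]
This would rewrite column 1 of every record to the 5-character substring starting at offset 0.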
Sqoop only supports synchronization between relational databases and HDFS/Hive; DataX covers a much wider range of sources.
The currently supported data sources are listed at: https://github.com/alibaba/DataX/wiki/DataX-all-data-channels
Usage:
$ tar zxvf datax.tar.gz
$ sudo chmod -R 755 {YOUR_DATAX_HOME}
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py ../job/job.json
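datax.py can also print a starter template for a given reader/writer pair, which is a convenient way to scaffold a job file like the one below:
$ python datax.py -r mongodbreader -w hdfswriter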
Example JSON config (MongoDB → HDFS/Hive):
mongotest.json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": [""],
                        "userName": "",
                        "userPassword": "",
                        "dbName": "",
                        "collectionName": "",
                        "column": [
                            { "name": "cityid", "type": "string" },
                            { "name": "searchstr", "type": "string" },
                            { "name": "pv", "type": "string" }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            { "name": "cityid", "type": "string" },
                            { "name": "searchstr", "type": "string" },
                            { "name": "pv", "type": "int" }
                        ],
                        "defaultFS": "hdfs://*",
                        "fieldDelimiter": "\t",
                        "fileName": "mongotest",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/temp.db/mongotest",
                        "writeMode": "append"
                    }
                }
            }
        ]
    }
}
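Note that hdfswriter appends files under the configured path but, to my knowledge, expects that HDFS directory to already exist. Creating the Hive table first both creates the warehouse directory and makes the synced files queryable, which is why it comes first in the procedure below.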
Sync procedure:
- Create the Hive table temp.mongotest (DDL sketch after this list)
- Run the job from {YOUR_DATAX_HOME}/bin: python datax.py ../mongotest.json
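A sketch of the DDL for the first step, assuming the temp database already exists. The column names and types mirror the hdfswriter column list, the tab delimiter and text format match fieldDelimiter and fileType, and with a default warehouse layout the table's directory is the path configured above:
CREATE TABLE temp.mongotest (
    cityid    string,
    searchstr string,
    pv        int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;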