Pseudo-Distributed Series - Part 4 - DataX Environment Setup and a Hive-to-MySQL Import Test

Introduction

  • DataX is an offline data synchronization tool/platform used widely inside Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, ODPS, and more.
  • As a data synchronization framework, DataX abstracts synchronization between different data sources into Reader plugins, which read from the source, and Writer plugins, which write to the target; in principle the framework can synchronize between arbitrary data source types. The plugin system also forms an ecosystem: every newly integrated data source immediately becomes interoperable with all of the existing ones.

Basic Environment Setup

Requirements

  • Linux
  • JDK (1.8 or above; 1.8 recommended)
  • Python (Python 2.6.x recommended; CentOS 7 ships Python 2.7 by default)
  • Apache Maven 3.x (to compile DataX; only needed if you build from source)
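
A quick way to confirm these prerequisites on the target machine:

# Verify the prerequisites listed above
java -version    # expect 1.8.x
python -V        # expect 2.6/2.7; the stock datax.py is Python 2 code
mvn -v           # only needed when compiling DataX from source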

MySQL Installation

This was covered in the earlier Hadoop installation article in this series; refer back to it.

  • By default, Hive keeps its metadata in the embedded Derby database; for production, storing the Hive metastore in MySQL is more reliable.
  • Reference: https://www.cnblogs.com/luohanguo/p/9045391.html
  • Log in to MySQL: mysql -uroot -p
  • Grant MySQL privileges
    • Grant to one specific host: grant all privileges on *.* to root@'192.168.1.42' identified by 'root' with grant option;
    • Grant to all hosts: grant all privileges on *.* to root@'%' identified by 'root' with grant option;
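
Put together, a minimal grant session looks like this (a sketch: substitute your own password and host, and note that GRANT ... IDENTIFIED BY is MySQL 5.x syntax; on MySQL 8+ create the user first with CREATE USER and grant without IDENTIFIED BY):

# Grant remote access and reload the privilege tables in one shot (MySQL 5.x syntax)
mysql -uroot -p -e "grant all privileges on *.* to root@'%' identified by 'root' with grant option; flush privileges;"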

JDK Installation

This was covered in the earlier Hadoop installation article in this series; refer back to it.

The installation procedure itself is easy to find online.

Configure JAVA_HOME via vim /etc/profile:
export JAVA_HOME=/usr/local/java_1.8.0_121
export JAVA_BIN=$JAVA_HOME/bin
export JAVA_LIB=$JAVA_HOME/lib
export CLASSPATH=.:$JAVA_LIB/tools.jar:$JAVA_LIB/dt.jar
export PATH=$JAVA_BIN:$PATH

Verify: java -version
The output should report a JDK version of 1.8 or above.

DataX Installation and Configuration

Official documentation: https://github.com/alibaba/DataX
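
If you prefer compiling DataX yourself, the README documents a Maven build along these lines:

# Clone and build DataX from source (command as given in the project README)
git clone https://github.com/alibaba/DataX.git
cd DataX
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
# the packaged distribution ends up under target/datax/datax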

Download
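
A prebuilt tarball is linked from the GitHub README; at the time of writing it could be fetched directly (check the README for the current link before relying on this URL):

# Download the prebuilt DataX tarball (link taken from the official README)
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz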

Configuration

  • Once downloaded, simply extract it into a directory:
tar -zxvf datax.tar.gz -C ./
  • Environment variables:
# DataX configuration
export DATAX_HOME=/opt/bigdata/datax/default
export PATH=${DATAX_HOME}/bin:$PATH
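
After appending those lines (e.g. to /etc/profile), reload the profile and run the self-check job that ships inside the tarball:

source /etc/profile
# DataX bundles a stream-to-stream self-test job; a run that ends in a
# success summary confirms the installation works end to end
datax.py ${DATAX_HOME}/job/job.json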

Basic Test

  • Read data from a stream reader and print it to the console. Each channel runs its own reader task, so with channel = 5 and sliceRecordCount = 10 the job emits 5 × 10 = 50 records in total.
Configuration:
vim stream2stream.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
       }
    }
  }
}
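
As a side note, such a JSON skeleton does not have to be written from scratch: datax.py can print an empty template for any reader/writer pair, which you then fill in:

# Print a job template for the given reader/writer combination
datax.py -r streamreader -w streamwriter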

  • Run
[root@ecs-6531-0002 conf]# datax.py stream2stream.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2019-07-27 20:58:40.474 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-07-27 20:58:40.481 [main] INFO  Engine - the machine info  => 

	osInfo:	Oracle Corporation 1.8 25.121-b13
	jvmInfo:	Linux amd64 3.10.0-862.9.1.el7.x86_64
	cpu num:	16

	totalPhysicalMemory:	-0.00G
	freePhysicalMemory:	-0.00G
	maxFileDescriptorCount:	-1
	currentOpenFileDescriptorCount:	-1

	GC Names	[PS MarkSweep, PS Scavenge]

	MEMORY_NAME                    | allocation_size                | init_size                      
	PS Eden Space                  | 256.00MB                       | 256.00MB                       
	Code Cache                     | 240.00MB                       | 2.44MB                         
	Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
	PS Survivor Space              | 42.50MB                        | 42.50MB                        
	PS Old Gen                     | 683.00MB                       | 683.00MB                       
	Metaspace                      | -0.00MB                        | 0.00MB                         


2019-07-27 20:58:40.497 [main] INFO  Engine - 
{
	"content":[
		{
			"reader":{
				"name":"streamreader",
				"parameter":{
					"column":[
						{
							"type":"long",
							"value":"10"
						},
						{
							"type":"string",
							"value":"hello,你好,世界-DataX"
						}
					],
					"sliceRecordCount":10
				}
			},
			"writer":{
				"name":"streamwriter",
				"parameter":{
					"encoding":"UTF-8",
					"print":true
				}
			}
		}
	],
	"setting":{
		"speed":{
			"channel":5
		}
	}
}

2019-07-27 20:58:40.512 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2019-07-27 20:58:40.514 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2019-07-27 20:58:40.514 [main] INFO  JobContainer - DataX jobContainer starts job.
2019-07-27 20:58:40.515 [main] INFO  JobContainer - Set jobId = 0
2019-07-27 20:58:40.531 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2019-07-27 20:58:40.532 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
2019-07-27 20:58:40.532 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2019-07-27 20:58:40.532 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2019-07-27 20:58:40.532 [job-0] INFO  JobContainer - Job set Channel-Number to 5 channels.
2019-07-27 20:58:40.533 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2019-07-27 20:58:40.533 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2019-07-27 20:58:40.558 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2019-07-27 20:58:40.576 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2019-07-27 20:58:40.584 [job-0] INFO  JobContainer - Running by standalone Mode.
2019-07-27 20:58:40.606 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2019-07-27 20:58:40.616 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2019-07-27 20:58:40.617 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2019-07-27 20:58:40.629 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2019-07-27 20:58:40.632 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2019-07-27 20:58:40.635 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-07-27 20:58:40.651 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-07-27 20:58:40.656 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
10	hello,你好,世界-DataX
2019-07-27 20:58:40.760 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[110]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[105]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[133]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[126]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[128]ms
2019-07-27 20:58:40.761 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2019-07-27 20:58:50.621 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.003s | Percentage 100.00%
2019-07-27 20:58:50.621 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2019-07-27 20:58:50.622 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2019-07-27 20:58:50.622 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
2019-07-27 20:58:50.622 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2019-07-27 20:58:50.623 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/bigdata/datax/default/hook
2019-07-27 20:58:50.625 [job-0] INFO  JobContainer - 
	 [total cpu info] => 
		averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
		-1.00%                         | -1.00%                         | -1.00%
                        

	 [total gc info] => 
		 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
		 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
		 PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             

2019-07-27 20:58:50.628 [job-0] INFO  JobContainer - PerfTrace not enable!
2019-07-27 20:58:50.629 [job-0] INFO  StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.003s | Percentage 100.00%
2019-07-27 20:58:50.634 [job-0] INFO  JobContainer - 
Job start time                  : 2019-07-27 20:58:40
Job end time                    : 2019-07-27 20:58:50
Total elapsed time              :                 10s
Average throughput              :               95B/s
Record write speed              :              5rec/s
Total records read              :                  50
Total read/write failures       :                   0

The block above is the job's closing summary statistics.
  • OK, at this point DataX is up in a basically usable state.
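
Before moving on, it is worth confirming that the plugins the next job relies on are bundled in this distribution; the stock tarball keeps them under plugin/reader and plugin/writer:

# The Hive-to-MySQL job below needs hdfsreader and mysqlwriter
ls ${DATAX_HOME}/plugin/reader | grep hdfsreader
ls ${DATAX_HOME}/plugin/writer | grep mysqlwriter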

Importing a Hive or Spark Table into MySQL with DataX

Requirements for Creating the Hive Table

  • Hive to MySQL really means reading the table's data files straight from HDFS and writing them into MySQL. Because Hive tables default to a non-printable field delimiter (\001), a direct DataX import of such files will throw errors, so create the table with an explicit delimiter:
CREATE TABLE IF NOT EXISTS `${hivevar:target_table}`
(
datekey                                 string  comment 'date'
,project_id                             string  comment ''
,building_id                            string  comment ''
,unit_id                                string  comment ''
,building_name                          string  comment ''
... (many more columns omitted)
,note                                   string  comment ''
) comment ''
partitioned by (dt string)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='\t','serialization.null.format'='')  -- This is the key part: set the field delimiter to tab and the serialized form of NULL values to the empty string. The NULL handling is critical; without it the MySQL import throws conversion exceptions.
STORED AS TEXTFILE; -- textfile storage format; DataX supports other formats too, configurable as needed
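
Since the DDL takes the table name through ${hivevar:target_table}, it can be applied from the shell like this (a sketch; create_dim_building.sql is a hypothetical file holding the DDL above):

# --hivevar binds target_table, which the script reads as ${hivevar:target_table}
hive --hivevar target_table=dim_building -f create_dim_building.sql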
  • Hive-to-MySQL job configuration, run.json. Note that the -- annotations below are explanations only; JSON has no comment syntax, so strip them before actually running the job.
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/user/root/warehouse/real_estate.db/dim_building/dt=20190720/*", -- hdfs目录
                        "defaultFS": "hdfs://ecs-6531-0002.novalocal:9000", --name node地址
                        "column": [ -- 字段索引类型说明 和hive中保持一致
                               {
                                "index": 0,
                                "type": "string"
                               },
                               {
                                "index": 1,
                                "type": "string"
                               },
                               {
                                "index": 2,
                                "type": "string"
                               },
                               {
                                "index": 3,
                                "type": "string"
                               },
                               {
                                "index": 4,
                                "type": "string"
                               },
                               {
                                "index": 5,
                                "type": "string"
                               },
                               {
                                "index": 6,
                                "type": "string"
                               },
                               {
                                "index": 7,
                                "type": "string"
                               },
                               {
                                "index": 8,
                                "type": "string"
                               },
                               {
                                "index": 9,
                                "type": "string"
                               },
                               {
                                "index": 10,
                                "type": "string"
                               },
                               {
                                "index": 11,
                                "type": "string"
                               },
                               {
                                "index": 12,
                                "type": "string"
                               },
                               {
                                "index": 13,
                                "type": "string"
                               },
                               {
                                "index": 14,
                                "type": "string"
                               },
                               {
                                "index": 15,
                                "type": "string"
                               },
                               {
                                "index": 16,
                                "type": "string"
                               },
                               {
                                "index": 17,
                                "type": "string"
                               },
                               {
                                "index": 18,
                                "type": "string"
                               },
                               {
                                "index": 19,
                                "type": "string"
                               },
                               {
                                "index": 20,
                                "type": "string"
                               },
                               {
                                "index": 21,
                                "type": "string"
                               },
                               {
                                "index": 22,
                                "type": "string"
                               },
                               {
                                "index": 23,
                                "type": "string"
                               },
                               {
                                "index": 24,
                                "type": "string"
                               },
                               {
                                "index": 25,
                                "type": "string"
                               },
                               {
                                "index": 26,
                                "type": "string"
                               },
                               {
                                "index": 27,
                                "type": "string"
                               },
                               {
                                "index": 28,
                                "type": "string"
                               },
                               {
                                "index": 29,
                                "type": "string"
                               },
                               {
                                "index": 30,
                                "type": "string"
                               },
                               {
                                "index": 31,
                                "type": "string"
                               },
                               {
                                "index": 32,
                                "type": "string"
                               },
                               {
                                "index": 33,
                                "type": "string"
                               },
                               {
                                "index": 34,
                                "type": "string"
                               },
                               {
                                "index": 35,
                                "type": "string"
                               },
                               {
                                "index": 36,
                                "type": "string"
                               },
                               {
                                "index": 37,
                                "type": "string"
                               }
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t" -- 字段分隔符
                    }

                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "xxx",
                        "column": [ -- mysql字段列表
			    "dt"
                            ,"project_id"
                            ,"building_id"
                            ,"unit_id"
                            ,"building_name"
                            ,"building_unit"
                            ,"unit_name"
                            ,"developer"
                            ,"design_unit"
                            ,"construction_unit"
                            ,"engineering_supervision_unit"
                            ,"project_cost"
                            ,"start_date"
                            ,"end_date"
                            ,"delivery_date"
                            ,"building_structure_type"
                            ,"building_height"
                            ,"total_floor_cnt"
                            ,"onground_floor_cnt"
                            ,"underground_floor_cnt"
                            ,"building_floor_area"
                            ,"building_construction_area"
                            ,"building_onground_area"
                            ,"building_underground_area"
                            ,"usage_area"
                            ,"green_area"
                            ,"floor_area"
                            ,"corridor_area"
                            ,"exterior_wall_area"
                            ,"top_area"
                            ,"base_area"
                            ,"basis_sharing_coefficient"
                            ,"common_area"
                            ,"common_area_sharing_coefficient"
                            ,"elevator_cnt"
                            ,"house_cnt"
                            ,"parking_cnt"
                            ,"note"
                        ],
                        "session": [
                            "set session sql_mode='ANSI'"
                        ],
                        "preSql": [
                            "delete from dim_building where dt='20190720'"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://localhost:13306/real_estate?useUnicode=true&characterEncoding=utf-8", -- mysql地址
                                "table": [ -- 表名
                                    "dim_building"
                                ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
  • Run the import:
datax.py run.json
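
Once the job reports success, a quick spot check against the target table (a sketch reusing the connection details from the writer config above; adjust the credentials to your setup):

# Count the rows that landed in the partition the job just loaded
mysql -uroot -p -h127.0.0.1 -P13306 -e "select count(*) from real_estate.dim_building where dt='20190720'"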