尚硅谷 Big Data Project [E-Commerce Data Warehouse 6.0] - DataX - 6

Installation

Extract the DataX tarball into /opt/module:

tar -zxvf datax.tar.gz -C /opt/module/

Test

Run the bundled sample job to verify the installation:

python /opt/module/datax/bin/datax.py /opt/module/datax/job/job.json
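If the self-check fails with plugin-loading errors, one commonly reported cause (not mentioned in the original notes, so treat this as an assumption and check the matched files before deleting) is the hidden ._* entries bundled in the official tarball under the plugin directories:

find /opt/module/datax/plugin -name "._*" -type f -exec rm -f {} \;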

MySQLToHDFS

Write the job configuration JSON according to the official documentation.
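Before hand-writing the file, the bundled launcher can also print an empty job template for a given reader/writer pair, which is a convenient starting point (the completed configuration used in this section follows below):

python bin/datax.py -r mysqlreader -w hdfswriter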

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2",
							"create_time",
							"operate_time"
                        ],
                        "where": "id>=3",
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"
                                ],
                                "table": [
                                    "base_province"
                                ]
                            }
                        ],
                        "password": "000000",
                        "splitPk": "",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            },
                            {
                                "name": "create_time",
                                "type": "string"
                            },
                            {
                                "name": "operate_time",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

Make sure the target HDFS path exists, save the configuration above as job/base_province.json, and submit the job:

hadoop fs -mkdir /base_province
cd /opt/module/datax
python bin/datax.py job/base_province.json 
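After the job finishes, you can spot-check the output on HDFS; since the writer above is configured with gzip compression, pipe it through zcat (a quick sanity check, not part of the original steps):

hadoop fs -cat /base_province/* | zcat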

Using SQL

Instead of listing table, column, and where separately, MySQLReader can take a custom query via querySql:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2,create_time,operate_time from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            },
                            {
                                "name": "create_time",
                                "type": "string"
                            },
                            {
                                "name": "operate_time",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
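The querySql variant is submitted the same way; assuming the file above is saved as job/base_province_sql.json (the file name here is only illustrative):

python bin/datax.py job/base_province_sql.json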

Passing a date parameter

Runtime parameters are passed on the command line with -p"-Dkey=value" and referenced in the configuration as ${key}, which is how the target path is switched per day:

python bin/datax.py -p"-Ddt=2022-06-08" job/base_province.json

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"
                                ],
                                "querySql": [
                                    "select id,name,region_id,area_code,iso_code,iso_3166_2,create_time,operate_time from base_province where id>=3"
                                ]
                            }
                        ],
                        "password": "000000",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "bigint"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "region_id",
                                "type": "string"
                            },
                            {
                                "name": "area_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_code",
                                "type": "string"
                            },
                            {
                                "name": "iso_3166_2",
                                "type": "string"
                            },
                            {
                                "name": "create_time",
                                "type": "string"
                            },
                            {
                                "name": "operate_time",
                                "type": "string"
                            }
                        ],
                        "compress": "gzip",
                        "defaultFS": "hdfs://hadoop102:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "base_province",
                        "fileType": "text",
                        "path": "/base_province/${dt}",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
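As with the earlier job, the writer expects the target directory to exist, so create the dated path before submitting (the date here follows the example above):

hadoop fs -mkdir -p /base_province/2022-06-08
python bin/datax.py -p"-Ddt=2022-06-08" job/base_province.json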

HDFSToMySQL

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://hadoop102:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "000000",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2",
                            "create_time",
                            "operate_time"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

Create the gmall.test_province table in MySQL:

DROP TABLE IF EXISTS `test_province`;
CREATE TABLE `test_province` (
    `id` BIGINT(20) NOT NULL,
    `name` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    `region_id` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    `area_code` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    `iso_code` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    `iso_3166_2` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    `create_time` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    `operate_time` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

Then submit the job:

python bin/datax.py job/test_province.json
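A quick row count from the shell confirms the import (the credentials follow the example configuration and may differ in your environment):

mysql -uroot -p000000 -e "select count(*) from gmall.test_province;"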

DataX Config Generator

It works like a code generator: it produces the corresponding DataX job JSON automatically.
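The generator shipped with the course is not reproduced here. The sketch below only illustrates the idea: template the mysqlreader/hdfswriter JSON from a table name and a column list. The script name, host names, credentials, and the default string typing of every column are assumptions for illustration, copied from the examples above.

#!/bin/bash
# gen_mysql_to_hdfs_json.sh -- illustrative sketch of a DataX config generator (not the course tool)
# Usage: ./gen_mysql_to_hdfs_json.sh <table> <comma-separated columns>

TABLE=$1
IFS=',' read -ra COLS <<< "$2"

# Build "id","name",... for mysqlreader and {"name":"id","type":"string"},... for hdfswriter.
READER_COLS=""
WRITER_COLS=""
for c in "${COLS[@]}"; do
    READER_COLS+="\"${c}\","
    WRITER_COLS+="{\"name\":\"${c}\",\"type\":\"string\"},"
done
READER_COLS=${READER_COLS%,}   # drop the trailing comma
WRITER_COLS=${WRITER_COLS%,}

cat > "job/${TABLE}.json" <<EOF
{
    "job": {
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "root",
                    "password": "000000",
                    "column": [${READER_COLS}],
                    "connection": [{
                        "jdbcUrl": ["jdbc:mysql://hadoop102:3306/gmall?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"],
                        "table": ["${TABLE}"]
                    }]
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "defaultFS": "hdfs://hadoop102:8020",
                    "path": "/${TABLE}/\${dt}",
                    "fileName": "${TABLE}",
                    "fileType": "text",
                    "fieldDelimiter": "\t",
                    "compress": "gzip",
                    "writeMode": "append",
                    "column": [${WRITER_COLS}]
                }
            }
        }],
        "setting": { "speed": { "channel": 1 } }
    }
}
EOF
echo "generated job/${TABLE}.json"

For example, ./gen_mysql_to_hdfs_json.sh base_province "id,name,region_id" would emit job/base_province.json in the same shape as the hand-written configurations above.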

In the 尚硅谷 数仓6.0 project, user data collection covers two pipelines: log data and business-database data. The main implementation points are summarized below.

Data collection overview

Data collection is the first step in building the warehouse. Its goal is to extract data from the various sources (log files, databases, and so on) and land it in HDFS or Hive so it can be cleaned and analyzed later. The project uses tools such as Flume and Kafka for both real-time and offline collection.

Log data collection

Log data is collected with Flume, a distributed, reliable log-collection framework that gathers logs from multiple sources and delivers them to HDFS:

- Flume agent configuration: the agent watches the log directory and ships new files to the target. For example, in kafka_to_hdfs_db.conf the Flume source watches the local log directory, the sink points to HDFS, and the channel acts as the buffer in between.
- Interceptors: a custom TimestampAndTableNameInterceptor dynamically adds a timestamp field to each event and decides which table the event belongs to based on its content.
- Loading the log files: after Flume lands the text files on HDFS, Hive's LOAD DATA INPATH moves them into the corresponding table's HDFS path, partitioned by date. For example:

LOAD DATA INPATH '/origin_data/gmall/log/topic_log/2022-06-08' INTO TABLE ods_log_inc PARTITION(dt='2022-06-08');

Business database collection

Business data also has to be collected from relational databases such as MySQL, typically with Canal or DataX:

- Canal: Alibaba's open-source MySQL binlog parser; it subscribes to the binlog and syncs changes in real time to Kafka or HBase, which suits scenarios that need up-to-date data.
- DataX: used for offline, batch import and export; it supports migration between many data sources, for example from MySQL into Hive.

Collection scripts and scheduling

Automation matters throughout the collection process:

- Shell scripts start the Flume agents, watch the log directories, and trigger the periodic load jobs.
- Scheduling with Linux crontab or Apache Airflow ensures the daily collection and load tasks run on time, as illustrated after this section.

Learning resources

- Video tutorials: the official 数仓6.0 course videos cover Flume, Kafka, Canal, and how they are integrated in the project.
- Code samples: the provided CaijiInterceprot.zip contains the Flume interceptor source; compiling and packaging it yourself is a good way to understand how interceptors work.
- Documentation: the project's PDF documents and slides describe the overall collection architecture and the key configuration details.
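To make the crontab scheduling mentioned above concrete, a daily entry could look like the following. The script path, name, and time are placeholders for whatever wrapper script the project uses, not the project's actual script:

# hypothetical crontab entry: run a daily MySQL-to-HDFS sync at 00:30 with yesterday's date
# (% is special in crontab and must be escaped as \%)
30 0 * * * /home/atguigu/bin/mysql_to_hdfs.sh all $(date -d '-1 day' +\%F)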