MySQL Applier For Hadoop: Real time data export from MySQL to HDFS

The Hadoop Applier is a tool that provides real-time replication of the binary log from a MySQL database to Hadoop. It writes MySQL change events directly into HDFS, giving applications in the Hadoop ecosystem access to fresh data. Unlike batch import tools such as Apache Sqoop, the Hadoop Applier reads the MySQL binary log and inserts data as soon as changes occur, avoiding extra load on operational systems.

http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-1.html

MySQL replication enables data to be replicated from one MySQL database server (the master) to one or more MySQL database servers (the slaves). However, imagine the range of use cases that could be served if the slave (to which data is replicated) weren't restricted to being a MySQL server, but could be any other database server or platform, with replication events applied in real time!
 
This is what the new Hadoop Applier empowers you to do.
 
An example of such a slave could be a data warehouse system such as Apache Hive, which uses HDFS as a data store. If you have a Hive metastore associated with HDFS (Hadoop Distributed File System), the Hadoop Applier can populate Hive tables in real time. Data is exported from MySQL to text files in HDFS, and therefore into Hive tables. It is as simple as running a 'CREATE TABLE' HiveQL statement on Hive to define a table structure similar to that on MySQL (and yes, you can use any row and column delimiters you want), and then running the Hadoop Applier to start real-time data replication.
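As a sketch of that step, the HiveQL below defines an external table over the HDFS directory the applier writes to. The table name, columns, delimiter, and path here are hypothetical illustrations; they would need to match your actual MySQL schema and applier configuration:

```sql
-- Hypothetical Hive table mirroring a MySQL table employees(id INT, name VARCHAR, salary DOUBLE).
-- The field delimiter and HDFS location must match what the Hadoop Applier is configured to write.
CREATE EXTERNAL TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/employees';
```

Because the table is external and text-backed, Hive picks up new rows as soon as the applier appends them to the underlying files.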

 

The motivation to develop the Hadoop Applier is that currently there is no tool available to perform this real-time transfer. Existing solutions to import data into HDFS include Apache Sqoop, which is well proven and enables batch transfers, but consequently requires periodic re-imports to keep the data up to date. It reads the source MySQL database via a JDBC connector or a fastpath connector and performs a bulk data transfer, which can create an overhead on your operational systems and slow down other queries. And in cases where there are only a few changes relative to the size of the data, Sqoop might still take too long to load it.
 
On the other hand, the Hadoop Applier reads from a binary log and inserts data in real time, applying the events as they happen on the MySQL server; other queries can therefore continue to execute unaffected. No bulk transfers required! The Hadoop Applier takes only the changes and inserts them, which is a lot faster.
 

Hadoop Applier can thus be a solution when you need to rapidly acquire new data from MySQL for real-time processing within Hadoop.

 

Introducing The Applier: 
 
 
The Hadoop Applier replicates events from the MySQL binary log to provide real-time integration of MySQL with Hadoop and related frameworks that work on top of HDFS. There are many use cases for integrating unstructured data stored in Apache Hadoop with structured data from relational databases such as MySQL.
 
 
 
The Hadoop Applier provides real-time connectivity between MySQL and Hadoop/HDFS (Hadoop Distributed File System), which can be used for big data analytics: sentiment analysis, marketing campaign analysis, customer churn modeling, fraud detection, risk modeling, and many more. You can read more about the role of the Hadoop Applier in big data in the blog by Mat Keep. Many widely used systems, such as Apache Hive, use HDFS as a data store.
The diagram below represents the integration:


 

Replication via the Hadoop Applier happens by reading binary log events and writing them into a file in HDFS as soon as they happen on the MySQL master. "Events" describe database changes such as table creation operations or changes to table data.

 

As soon as an INSERT query is fired on the MySQL master, it is passed to the Hadoop Applier. The data is then written into a text file in HDFS. Once the data is in HDFS files, other Hadoop ecosystem platforms and databases can consume it for their own applications.
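To illustrate what lands in HDFS, here is a minimal Python sketch (not part of the Applier itself) that parses one such delimited text-file row back into column values, assuming a comma field delimiter and newline row delimiter — both of which are configurable in practice:

```python
def parse_row(line, field_delim=","):
    """Split one text-file row, as the applier might write it, into column values."""
    return line.rstrip("\n").split(field_delim)

# An INSERT of (1, 'Alice', 50000.0) on the master could appear in the
# HDFS text file as the line below (delimiters are an assumption here).
row = parse_row("1,Alice,50000.0\n")
print(row)  # → ['1', 'Alice', '50000.0']
```

Downstream consumers such as Hive apply the same delimiter convention when they read the files, which is why the delimiters chosen at CREATE TABLE time must match the applier's output.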
 

The Hadoop Applier can be downloaded from http://labs.mysql.com/

 

Prerequisites:
These are the packages you require in order to run the Hadoop Applier on your machine:
 
- Hadoop Applier package from http://labs.mysql.com
- Hadoop 1.0.4 (the version used for the demo in the next post)
- Java version 6 or later (since Hadoop is written in Java)
- libhdfs (comes precompiled with Hadoop distributions, at ${HADOOP_HOME}/libhdfs/libhdfs.so)
- cmake 2.6 or greater
- libmysqlclient 5.6
- gcc 4.6.3
- MySQL Server 5.6
- FindHDFS.cmake (cmake module to locate the libhdfs library while compiling; you can get a copy online)
- FindJNI.cmake (optional; check whether you already have one: `locate FindJNI.cmake`)
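The prerequisites above come together in a build along these lines. This is an illustrative sketch only: the paths, the module directory, and the cmake invocation are assumptions about a typical environment, not canonical build instructions:

```shell
# Hypothetical build sketch for the Hadoop Applier source tree.
# HADOOP_HOME and the cmake module location are placeholders for your environment.
export HADOOP_HOME=/usr/local/hadoop-1.0.4
# Make FindHDFS.cmake / FindJNI.cmake visible to cmake, e.g. by pointing
# CMAKE_MODULE_PATH at the directory where you placed them.
cmake . -DCMAKE_MODULE_PATH=/path/to/cmake/modules
make
make install
```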

 

To use the Hadoop Applier with Hive, you will also need to install Hive, which you can download from the Apache Hive website.

Please use the comments section of this blog to share your opinion on Hadoop Applier, and let us know more about your requirements.

 

When running `start slave` on MySQL fails with "ERROR 1872 (HY000): Replica failed to initialize applier metadata structure from the repository", the following approaches can help:

- **Reset the replica**: run `reset replica` on the slave node, then try starting replication again [^2]:

```sql
reset replica;
start replica;
```

- **Check the database configuration**: the error can stem from a configuration problem; inspect `my.ini`. If the relay log is not configured, add it and restart the database. An example configuration:

```ini
[mysqld]
# Listen on port 3306
port=3306
# MySQL installation directory
basedir=D:\MySQL\mysql-8.3.0-winx64
# Data directory
datadir=D:\MySQL\mysql-8.3.0-winx64\data
# Maximum number of connections
max_connections=200
# Maximum number of failed connection attempts
# (guards against attacks on the database from a given host)
max_connect_errors=10
# Default server character set
character-set-server=utf8
# Default storage engine for new tables
default-storage-engine=INNODB
# Master/replication settings
# Each server must use a distinct server id
server_id=101
log_bin=master-slave-bin
# Relay log configuration
relay_log=edu-mysql-relay-bin
# Whether to write replicated changes to this server's own binlog;
# required if this slave acts as a master for further downstream replication
log-slave-updates=1
# Replication format
binlog-format=row
# Database to replicate
replicate-do-db=test
# Tables to replicate
replicate-wild-do-table=test.test

[mysql]
# Default client character set
default-character-set=utf8

[client]
# Default port for client connections to the server
port=3306
default-character-set=utf8
```

After updating the configuration, restart the MySQL service [^4].