MySQL Applier For Hadoop: Real time data export from MySQL to HDFS

The Hadoop Applier is a tool that provides real-time replication of the binary log from a MySQL database to Hadoop. It writes MySQL change events directly into HDFS, giving applications in the Hadoop ecosystem access to fresh data. Unlike batch import tools such as Apache Sqoop, the Hadoop Applier reads the MySQL binary log and inserts data as soon as changes occur, avoiding extra load on operational systems.

http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-1.html

MySQL replication enables data to be replicated from one MySQL database server (the master) to one or more MySQL database servers (the slaves). However, imagine the range of use cases that could be served if the slave (to which data is replicated) weren't restricted to being a MySQL server, but could be any other database server or platform, with replication events applied in real time!
 
This is what the new Hadoop Applier empowers you to do.
 
An example of such a slave could be a data warehouse system such as Apache Hive, which uses HDFS as a data store. If you have a Hive metastore associated with HDFS (Hadoop Distributed File System), the Hadoop Applier can populate Hive tables in real time. Data is exported from MySQL to text files in HDFS, and therefore into Hive tables. It is as simple as running a 'CREATE TABLE' HiveQL statement on Hive to define a table structure similar to that on MySQL (and yes, you can use any row and column delimiters you want), and then running the Hadoop Applier to start real-time data replication.
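As a sketch of that step, the HiveQL below defines an external table over the HDFS directory the applier writes to. The table name, columns, delimiter, and path here are hypothetical illustrations; they would need to match your actual MySQL schema and applier configuration:

```sql
-- Hypothetical Hive table mirroring a MySQL table employees(id INT, name VARCHAR, salary DOUBLE).
-- The field delimiter and HDFS location must match what the Hadoop Applier is configured to write.
CREATE EXTERNAL TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/employees';
```

Because the table is external and text-backed, Hive picks up new rows as soon as the applier appends them to the underlying files.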

 

The motivation to develop the Hadoop Applier is that currently there is no tool available to perform this real-time transfer. Existing solutions to import data into HDFS include Apache Sqoop, which is well proven and enables batch transfers, but consequently requires periodic re-imports to keep the data up to date. It reads the source MySQL database via a JDBC connector or a fastpath connector and performs a bulk data transfer, which can create an overhead on your operational systems and slow down other queries. And in cases where there are only a few changes relative to the size of the data, Sqoop might still take too long to load it.
 
On the other hand, the Hadoop Applier reads from a binary log and inserts data in real time, applying the events as they happen on the MySQL server; other queries can therefore continue to execute unaffected. No bulk transfers required! The Hadoop Applier takes only the changes and inserts them, which is a lot faster.
 

Hadoop Applier can thus be a solution when you need to rapidly acquire new data from MySQL for real-time processing within Hadoop.

 

Introducing The Applier: 
 
 
The Hadoop Applier replicates events from the MySQL binary log to provide real-time integration of MySQL with Hadoop and related frameworks that work on top of HDFS. There are many use cases for integrating unstructured data stored in Apache Hadoop with structured data from relational databases such as MySQL.
 
 
 
The Hadoop Applier provides real-time connectivity between MySQL and Hadoop/HDFS (Hadoop Distributed File System), which can be used for big data analytics: sentiment analysis, marketing campaign analysis, customer churn modeling, fraud detection, risk modeling, and many more. You can read more about the role of the Hadoop Applier in big data in the blog by Mat Keep. Many widely used systems, such as Apache Hive, use HDFS as a data store.
The diagram below represents the integration:


 

Replication via the Hadoop Applier happens by reading binary log events and writing them into a file in HDFS as soon as they happen on the MySQL master. "Events" describe database changes such as table creation operations or changes to table data.

 

As soon as an INSERT query is fired on the MySQL master, it is passed to the Hadoop Applier. The data is then written into a text file in HDFS. Once the data is in HDFS files, other Hadoop ecosystem platforms and databases can consume it for their own applications.
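To illustrate what lands in HDFS, here is a minimal Python sketch (not part of the Applier itself) that parses one such delimited text-file row back into column values, assuming a comma field delimiter and newline row delimiter — both of which are configurable in practice:

```python
def parse_row(line, field_delim=","):
    """Split one text-file row, as the applier might write it, into column values."""
    return line.rstrip("\n").split(field_delim)

# An INSERT of (1, 'Alice', 50000.0) on the master could appear in the
# HDFS text file as the line below (delimiters are an assumption here).
row = parse_row("1,Alice,50000.0\n")
print(row)  # → ['1', 'Alice', '50000.0']
```

Downstream consumers such as Hive apply the same delimiter convention when they read the files, which is why the delimiters chosen at CREATE TABLE time must match the applier's output.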
 

The Hadoop Applier can be downloaded from http://labs.mysql.com/

 

Prerequisites:
These are the packages you require in order to run the Hadoop Applier on your machine:
 
- Hadoop Applier package from http://labs.mysql.com
- Hadoop 1.0.4 (the version used for the demo in the next post)
- Java version 6 or later (since Hadoop is written in Java)
- libhdfs (comes precompiled with Hadoop distributions, at ${HADOOP_HOME}/libhdfs/libhdfs.so)
- cmake 2.6 or greater
- libmysqlclient 5.6
- gcc 4.6.3
- MySQL Server 5.6
- FindHDFS.cmake (cmake module to locate the libhdfs library while compiling; you can get a copy online)
- FindJNI.cmake (optional; check whether you already have one: `locate FindJNI.cmake`)
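The prerequisites above come together in a build along these lines. This is an illustrative sketch only: the paths, the module directory, and the cmake invocation are assumptions about a typical environment, not canonical build instructions:

```shell
# Hypothetical build sketch for the Hadoop Applier source tree.
# HADOOP_HOME and the cmake module location are placeholders for your environment.
export HADOOP_HOME=/usr/local/hadoop-1.0.4
# Make FindHDFS.cmake / FindJNI.cmake visible to cmake, e.g. by pointing
# CMAKE_MODULE_PATH at the directory where you placed them.
cmake . -DCMAKE_MODULE_PATH=/path/to/cmake/modules
make
make install
```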

 

To use the Hadoop Applier with Hive, you will also need to install Hive, which you can download from the Apache Hive website.

Please use the comments section of this blog to share your opinion on Hadoop Applier, and let us know more about your requirements.

 

When running `start slave` on MySQL fails with "ERROR 1872 (HY000): Replica failed to initialize applier metadata structure from the repository", the following approaches can help:

- **Reset the replica**: run `reset replica` on the slave node, then try starting replication again [^2]:

```sql
reset replica;
start replica;
```

- **Check the database configuration**: the error can stem from a configuration problem; inspect `my.ini`. If the relay log is not configured, add it and restart the database. An example configuration:

```ini
[mysqld]
# Listen on port 3306
port=3306
# MySQL installation directory
basedir=D:\MySQL\mysql-8.3.0-winx64
# Data directory
datadir=D:\MySQL\mysql-8.3.0-winx64\data
# Maximum number of connections
max_connections=200
# Maximum number of failed connection attempts
# (guards against attacks on the database from a given host)
max_connect_errors=10
# Default server character set
character-set-server=utf8
# Default storage engine for new tables
default-storage-engine=INNODB
# Master/replication settings
# Each server must use a distinct server id
server_id=101
log_bin=master-slave-bin
# Relay log configuration
relay_log=edu-mysql-relay-bin
# Whether to write replicated changes to this server's own binlog;
# required if this slave acts as a master for further downstream replication
log-slave-updates=1
# Replication format
binlog-format=row
# Database to replicate
replicate-do-db=test
# Tables to replicate
replicate-wild-do-table=test.test

[mysql]
# Default client character set
default-character-set=utf8

[client]
# Default port for client connections to the server
port=3306
default-character-set=utf8
```

After updating the configuration, restart the MySQL service [^4].