Real-Time Data Movement: The Key to Enabling Live Analytics With Hadoop

May 21, 2014

By Robert Hodges and MC Brown

Today, organizations are struggling to achieve real-time integration between MySQL/MariaDB and Hadoop, and Oracle and Hadoop. Fast decision-making depends on real-time data movement that allows businesses to gather data from multiple locations into Hadoop as well as conventional data warehouses.

Unfortunately, traditional ETL tools use slow data-scraping techniques that put a heavy load on operational systems and cannot meet the low latency required by many businesses. The answer is real-time replication that reads the DBMS log and forwards transactions automatically as soon as they commit. Rapid loading of data into Hadoop enables effective analytics and is becoming increasingly crucial as Hadoop itself raises the bar for real-time query.

However, the fact remains that moving data between databases is a significant challenge. The problem with many of the current techniques is that they rely on a relatively manual, even basic, transfer process. Dump your data out, reload it back again into Hadoop.

The approach taken by many tools is to automate what is a tiresome process, but the reality is that at the core these tools still rely on basic dump and load techniques. Serial loading or the incremental transfer of data from MySQL or Oracle with traditional tools is also prone to problems as it requires the data structure and application to be changed by adding a timestamp or other sequence identifier in order to determine what data has changed between each load process.

Incremental loading works by copying all the changed records from a specific point in time. Running it too frequently is counterproductive, as the 15 minutes to perhaps two hours required to perform the loading introduces further delays in the immediacy of the data. The exact duration depends on actual database activity, but the loading duration can easily exceed the period of the changed data being transferred.

What's Needed - A Heterogeneous Interface for Seamless Transfer

Most databases usually have some kind of stream of changes that identifies all the data changes that have happened on the database. In terms of MySQL, that’s the binary log. What's missing is the ability to leverage that log and replicate into MySQL, Oracle, MongoDB, MariaDB and Vertica, among others. These tools are available; indeed, there is a rich ecosystem built around the MySQL binary log, but the heterogeneous interface would allow seamless transfer of information between RDBMS and analytics environments.

Then there is the issue of being able to read the data itself. Reading the data out from MySQL is lightweight; unlike the dump and load technique, which uses read locks and SELECT to identify and extract data with and without the use of LIMIT clauses. Instead, the master just reads the contents of the binary log. What is really needed is the ability to write that data into Hadoop in real-time from MySQL in a continuous fashion, without database or application changes, and without the latency implied by regular bulk dump and load software.

Dealing with Unique Hadoop Issues

However, the loading and nature of Hadoop also presents some interesting issues. For example, data written into Hadoop is immutable, but for data that is constantly changing, an immutable store is not the obvious destination.

The way to address this challenge is to use a batch loading approach that creates change data in an easily readable format such as CSV. These files contain the data, changes and sequence numbers, and then load that information into Hadoop. By using a straightforward query within Hive, record of changes can be used to create a live version of the data. In fact, Hive provides the final crucial stage of the loading process by providing that live view of the change data, and then simplifying that element by providing the core data. It also ensures that the CSV data is in the right format for Hive to use without requiring any changes.

The process is quite remarkable in terms of speed for direct dumps, but when it comes to change data, the difference is that the information is available in real-time. Developers shouldn’t have to wait for the next bulk transfer and load. Data could be moved seamlessly from MySQL to Hadoop in real time. The key is having the ability to just apply the changes written into the binary log. That doesn’t mean that bulk-loading software cannot be part of the solution. Bulk transfer is still an effective way of provisioning existing information, and then replication of the change information can be built on top of this bulk load to provide real-time data.

Of course, one could fine-tune the intervals of the writes of the CSV change data into Hadoop using the block commit properties. This means you can commit into Hadoop every 10s or 1,000 rows, or change it to commit every 5s and 10,000 rows if your data is critical and busy.

Enabling Hadoop as a Live Source of Analytical Data

Using this approach, organizations can handle GBs of change data and information in a live fashion, both across single table and multi-table/multi-schema datasets. What’s particularly valuable is that if you are using Hadoop as a concentrator for all of your MySQL data so that you can analyze it, you will be able to transfer from multiple MySQL servers into Hadoop in parallel and take advantage of the multi-node Hadoop environment to cope with the incoming data load.

However, existing techniques for transferring data between MySQL and Oracle into Hadoop are either expensive to execute on the master, or provide only periodic versions of the information within target cluster. In order to enable both live and point-in-time views within Hadoop, developers need a live stream of all the changes from the source database. Then, they can perform detailed data analysis on this data using live copies of the RDBMS data. As a result, this enables Hadoop to be used as a live source of analytical data, rather than a passive, temporal store.

About the Authors

Robert Hodges is CEO and MC Brown is director of documentation at Continuent.