Bridging the Big Data Divide with Data Integration

Big data is one of the most significant industry disruptors in IT today. Even in its infancy, it has shown significant ROI and is relevant to a wide cross-section of the industry. Why? Big data turns traditional information architecture on its head, calling into question commonly accepted notions of where and how data should be aggregated, processed, analyzed, and stored. Enter Hadoop, the open source data-crunching platform, and NoSQL databases. Although these technologies are hotter than an Internet IPO, you simply can’t ignore your current investments: the investments in SQL that drive everything from your data warehouse to your ERP, CRM, SCM, HCM, and custom applications.

The Big Disruptor for Data Integration Tools

Big data forces us to rethink how we process greater volumes and varieties of data at higher velocity. But before we look at the impact in more detail, let’s first look at the current state of data integration. Data integration at its core solves the issues of bulk data movement, replication, synchronization, transformation, data quality, and data services. These capabilities serve as a key technology component for moving data between data warehousing, business analytics, master data management, enterprise applications, and custom applications. But now there’s more to be moved. Much, much more. In fact, it’s not just the scale but also the complexity of the new technologies. Data integration tools need to evolve to support integration to and interoperation with technology offerings such as the Hadoop Distributed File System (“HDFS”), MapReduce, and NoSQL. Beyond these technology offerings, there are some fundamental considerations that every data integration tool needs to address:

  • Unified Tooling for Transforming Big Data and Enterprise Data
  • Integrated Big Data Platform
  • Real-Time Business Analytics

Unified Tooling for Transforming Big Data and Enterprise Data

The advantage of big data comes into play when you can correlate it with your existing enterprise data. There’s an implicit product requirement here: consolidating these various architecture principles into a single integration solution. A single solution lets you address not only the complexities of mapping, accessing, and loading big data but also the work of correlating it with your enterprise data, and this correlation may require integrating across mixed application environments.

Let’s look at an example. Say you are a retail organization and want to improve customer engagement by creating a sentiment analysis program. Much of the data for such a program takes the form of social media data, weblogs, and various other unstructured data: big data. But there are important structured elements, too. Sentiment analysis really needs to consider customer data in your CRM applications, e-commerce systems, and web tracking systems. Being able to acquire and correlate both kinds of data can give you a more complete view of your customers: what they purchased and, more importantly, what they are about to purchase. This type of predictive analytics is just one example of the opportunities that exist when you correlate structured and unstructured data sets.
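
To make the retail example concrete, here is a minimal Python sketch of the correlation step. It assumes that each social media post has already been matched to a customer ID. The tiny word lists, the field names (customer_id, text, last_purchase), and the sample records are all hypothetical stand-ins. A real program would use a trained sentiment model and data pulled from the actual CRM and social feeds.

```python
from collections import defaultdict

# Hypothetical sentiment lexicon; a real program would use a trained model.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"broken", "slow", "refund"}

def score(text):
    """Return (positive word count - negative word count) for one post."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def correlate(posts, crm):
    """Sum sentiment per customer, then attach it to the matching CRM record."""
    totals = defaultdict(int)
    for post in posts:
        totals[post["customer_id"]] += score(post["text"])
    # Customers with no posts get a neutral score of 0.
    return [
        {**record, "sentiment": totals.get(record["customer_id"], 0)}
        for record in crm
    ]

# Made-up sample data standing in for social media posts (unstructured)
# and CRM rows (structured).
posts = [
    {"customer_id": 1, "text": "Love the new store, great service!"},
    {"customer_id": 2, "text": "Delivery was slow and the box arrived broken."},
    {"customer_id": 1, "text": "Checkout was fast."},
]
crm = [
    {"customer_id": 1, "name": "Ana", "last_purchase": "running shoes"},
    {"customer_id": 2, "name": "Ben", "last_purchase": "tent"},
    {"customer_id": 3, "name": "Cy", "last_purchase": "kayak"},
]

enriched = correlate(posts, crm)
for row in enriched:
    print(row["name"], row["last_purchase"], row["sentiment"])
```

Each CRM record comes back with a sentiment total next to the purchase history, which is the kind of combined view the example describes. At real volumes the scoring step would run on the big data side, and only the per-customer totals would be joined to the enterprise data.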

Integrated Big Data Platform

Taking all the miscellaneous technologies around big data, which are new to many organizations, and making them work with one another is challenging. Making them work together in a production-grade environment is even more daunting. Integrated software and hardware systems can help an organization radically simplify its big data architecture by combining the necessary hardware and software components to provide fast, cost-efficient access and mapping to NoSQL and HDFS.

Combined hardware and software systems can be optimized for redundancy with mirrored disks, for high availability with hot-swappable power supplies, and for scale by adding new racks with more memory and processing power. Take it one step further and you can use these same systems to build out the elastic capacity that big data demands.

While big data does enable the ‘farming’ of many smaller transactions out to smaller servers, the results still need to be combined and correlated with the data from your enterprise-class applications. The connection between these two worlds needs to be designed as part of the system architecture. One of the challenges that many organizations face today stems from the fact that they are developing two separate system silos: one for big data and one for their standard enterprise data. Unfortunately, sending petabytes of big data back and forth between the two systems over standard Ethernet will result in huge hits to performance. The better approach is to use integrated system architectures that are ‘linked’ by optimized hardware connections such as InfiniBand, which can move data at up to 40 times the speed of standard Ethernet. With this model of an integrated big data platform, an organization can simplify its architecture, streamline operations, and reduce management costs.
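
A rough back-of-the-envelope calculation shows why the interconnect matters. The Python sketch below assumes, purely for illustration, that “standard Ethernet” means a 1 Gb/s link and that the optimized connection runs at 40 Gb/s. It also assumes ideal line rate with no protocol overhead, so real transfers would take longer.

```python
def transfer_days(data_bytes, link_gbps):
    """Days needed to move data_bytes over a link of link_gbps gigabits per second."""
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86_400  # seconds in a day

PETABYTE = 10**15  # decimal petabyte, in bytes

ethernet_days = transfer_days(PETABYTE, 1)    # assumed 1 Gb/s Ethernet link
fast_link_days = transfer_days(PETABYTE, 40)  # assumed 40 Gb/s interconnect

print(f"1 Gb/s:  {ethernet_days:.1f} days")
print(f"40 Gb/s: {fast_link_days:.1f} days")
```

Under these assumptions, a single petabyte takes about three months to cross the slower link and a little over two days to cross the faster one.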

Real-Time Business Analytics

One of the key expectations for big data is that it will yield real-time analytics that improve business insight. What a business user sees on a business analytics dashboard depends on how the data is loaded, transformed, cleansed, and ultimately mastered into many different applications. But the timeliness of the data is a question that still needs to be asked, whether it is big data or traditional enterprise data.

While big data on its own has no means of applying ‘traditional’ change data capture (since NoSQL and HDFS lack the transaction log files that it typically reads), it’s still an important requirement to implement real-time solutions in conjunction with big data. Otherwise, the speed advantage of indexing reams of big data will be undone by the sluggish ETL processing it depends on. Big data can be processed at high volume and high velocity. In fact, this is the entire strategy behind the invention of MapReduce: providing search responses of the highest quality in close to the blink of a human eye. Combine this power with real-time solutions in replication, change data capture, synchronization, and integration with business analytics tooling, and you have the compelling advantages of real-time business analytics.
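
For readers new to MapReduce, the following Python sketch shows the shape of the programming model in a single process: a map step that emits key/value pairs, a shuffle that groups them by key, and a reduce step that collapses each group. The function names and the sample lines are illustrative only. This is not Hadoop's API, and a real job would distribute each phase across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (term, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single count."""
    return key, sum(values)

# Made-up input standing in for search or weblog lines.
lines = ["search hadoop", "hadoop nosql", "search search"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because every map call and every reduce call is independent of the others, the work can be spread over as many servers as are available, which is where the volume and velocity come from.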

You Had Me at Hadoop

Big data has drawn a lot of ‘adoration’ from the industry, but take away the buzz and there’s a simple message. For years, companies have been running their critical business infrastructure and building business insights on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of less structured data (weblogs, social media, email, sensors, photographs, and more) that can be mined for useful information. Companies seeking ways to capitalize on the hidden potential of big data need to consider data integration technologies to help bridge the gap and correlate that data across the enterprise. They need data integration solutions that are complete, are based on emerging technologies (as well as traditional ones), and are poised for success through unified tooling, engineered systems, and real-time analytics.