That comes back to the issue of being able to identify and concentrate resources on the data that really matters—which may be but a small fraction of a fraction of all the data pouring through the enterprise. “Not all data is important, but you must be able to discern between what is important and what is unimportant,” says Fisher. “Big data is not about collecting and storing all the data you can find, but about understanding the context of data.”
The complexities and difficulties of big data integration don’t arise from a lack of trying. Kopp-Hensley cites a recent IBM Institute for Business Value study which found that 57% of responding business leaders said they already have enterprise-level standards, policies, and practices in place to integrate data across the organization. “However, this isn’t to say that all integration is being done correctly or efficiently,” she adds. “Businesses must create a plan and understand the vast scale of their data.” Four out of 10 executives admit they face an inability to create, integrate, and manage their data, she relates.
There is Nothing New About Data Integration Challenges
Big data in its current form—as vast stores of machine-generated and user-generated data—is relatively new for enterprises. However, there’s nothing new about data integration itself, as it has been practiced for decades, remarks Jim Gallo, national director of business analytics for ICC. “Architectural strategies for data integration have been around since the advent of the computer,” he says. “Unfortunately, it’s rare to find organizations that have applied them universally and consistently across the enterprise.”
That’s because data integration is not a quick-and-easy process that can be handled with a few tools, Gallo continues. “Data integration is still fairly labor-intensive, meaning that successes will be measured in years and decades rather than in months or quarters,” he says. “There are no shortcuts—and investments have to be made in data integration technologies.”
Enterprises have been investing time, money, and resources for years in efforts to achieve greater data integration across multiple systems, departments, and channels. For example, organizations have well-entrenched investments in data warehouses and extract, transform, and load (ETL) technologies. The good news is that these will remain viable approaches for years to come for managing the big data explosion.
However, industry observers are divided on how neatly existing and planned data warehouse environments will mesh with emerging big-data-centric platforms, such as Hadoop or NoSQL. Lawrence Schwartz, vice president of marketing for Attunity, for one, says there is a mismatch between what enterprises currently rely on and what lies ahead. “Solutions for data integration such as traditional ETL are no longer keeping pace with the variety of sources and platforms that are being rapidly deployed on premises and in the cloud,” he explains. “While a lot of companies may already have tools in-house to help, these tools are often outdated. Furthermore, they aren’t versatile enough to keep up with the demands of real-time data integration.” The shortcomings of these older solutions also become apparent when attempting to integrate data into the cloud, Schwartz adds.
A New Generation of Databases and Platforms Offers Scalability and Velocity
This new generation of databases and platforms offers the scalability and velocity required for cloud-based big data analytics applications, experts argue. “A key transition in technology for big data integration is from relational database servers to scaled out, NoSQL stores which can support the bandwidth requirements of big data analytics,” says William Bain, CEO and founder of ScaleOut Software.
“NoSQL stores, such as the Hadoop Distributed File System, MongoDB, and Cassandra, enable petabyte and larger datasets to be streamed into clustered analytics engines, such as Hadoop MapReduce, without bottlenecks in disk I/O,” he explains. “Big data analytics platforms also provide highly scalable SQL query for very large datasets using new implementations designed for this architecture; examples include Hive, Impala, HANA, and Shark.” Bain says he sees many of these new-generation solutions supplementing traditional relational database servers for querying very large datasets in non-transactional systems.
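The MapReduce model Bain mentions can be sketched in miniature. The following Python snippet is purely illustrative (it runs on one machine, not on Hadoop) and uses a classic word-count example to show the three steps a clustered engine parallelizes across nodes: map records to key-value pairs, shuffle pairs into groups by key, then reduce each group to a result.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record,
    # as a Hadoop-style word-count mapper would.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle/sort: group emitted values by key. In a real cluster the
    # framework performs this step across the network between nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: collapse each key's list of values to a single count.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data needs integration", "data integration takes time"]
counts = reduce_phase(shuffle(map_phase(records)))
# "data" and "integration" each appear twice across the two records.
```

The scalability Bain describes comes from the fact that the map and reduce steps operate on independent chunks of data, so a framework can run thousands of them in parallel against files striped across a distributed store such as HDFS.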
The Advantages of Relational Databases and Data Warehouses
Traditional relational databases and data warehouses also have their share of proponents. As Jonas Olsson, CEO of Graz, a data warehouse software company, observes, “Big data and Hadoop are certainly getting the vast majority of the attention, but relational databases are hands-down still the largest and fastest growing piece of the enterprise data pie. We expect this growth to continue—companies are having such big problems with integrating the internal structured data that they have not yet started to look at unstructured data in any meaningful way.”