Today, data is critical to every organization and every department within every organization. Yet, all the disparate systems for handling it are creating new challenges. Joe Caserta, founder and president of Caserta, a technology consulting and implementation firm focused on data and analytics strategies and solutions, recently discussed the current state of data integration and what is needed to overcome today's data integration problems.
DBTA: Data integration has been a vexing problem for analytics for a long time. Has it gotten any easier with the advent of big data and more technology choices, or is it actually more difficult?
Joe Caserta: It's more challenging. It's much more challenging.
DBTA: How so?
JC: The big data paradigm solved several problems and it created several. Before "big data," most data came from internal business applications and the data was "known." And, even then, there were data issues and transformations that were needed in order to analyze the data holistically—but it wasn't splitting atoms. I wrote a book 15 years ago on best practices called the Data Warehouse ETL Toolkit, and, for the most part, the problem was solved. What's happening today is that, with most of the data, companies have no control. It's coming from outside sources—either they are buying it, or it's on the cloud from different applications that an organization can't control, and integrating those different data points is incredibly complicated.
DBTA: What else?
JC: Since data has become the backbone of every organization—and every department has its own system—it's not uncommon for the average organization, and not even a super-huge organization, to have several hundred different data sources spanning marketing, finance, products, inventory, and HR. Each department has several systems that they use to run their business. So now, when you want to take all of those systems and integrate them, it's difficult.
DBTA: Companies also merge and acquire other organizations, which may have completely different systems, plus internal users may have their own systems that are not even known to the company, creating what has been called "shadow IT."
JC: Yes. One thing about the cloud is that it makes the speed to market of the new analytics very fast. But the challenge is that IT has been so bogged down with process and governance, and more and more enterprise methods of doing things. Speed is important because it's so competitive out there and it is so critical to be analytics-driven, and the result is that departments are just spinning up their own environments on the cloud. So now there are more datasets sitting out there ungoverned.
A marketing department might want to get a new data source that they had never used before from an external party to be able to analyze the data, build some models with it, and do some campaigns. Right there, they just added three or four different datasets into a DMP [centralized data management platform] environment to do their look-alike models with external data sources. So the campaign management tool, the CRM tool—all of these tools—need to now have data exported out, imported in, and integrated among them. And more models are built, creating yet more datasets. It just never stops. There is just more and more data, and now we have all of these scores for these customers rating their propensity to buy, and we want to put that in production so now we have another dataset to create.
DBTA: The cycle continues.
JC: It's endless. And you know, marketing does it, engineering does it, and the product people do it. And the accounting people do it because they can't get the data out of their financial system—so they create their own datasets. It's become like cowboys and Indians out there.
DBTA: What is the answer?
JC: The solution has been to build a data lake and integrate it in a one-stop shop—and that's great. But now you have 700 systems that you have to bring into the data lake, and how do you do that? If you do one at a time, you'd be retired by the time you finished. My prediction is that we are at the cusp of yet another revolution.
DBTA: What will it involve?
JC: Over the course of time, we have gone from mainframes to client-server, from client-server to the cloud, from transaction systems to dimensional models, from dimensional models to big data and analytics. I think the next wave that we're on the cusp of in terms of data integration is that we need to build, for lack of a better term, "data spiders" that will crawl throughout the enterprise, find all of the disparate data sources, and be able to extract the metadata.
DBTA: How would this approach work?
JC: Think of Google. If you wanted to know the score of the Yankee game last night, you would just go to Google and say, "Give me the score for the Yankee game last night." You'd get two numbers, right? The Yankees and whatever team they played, and you'll know the score. That data is gathered through a spider technology that goes out and finds it, indexes it, and makes it readily available. And I think that is the next wave for internal enterprise data.
We need to be able to have a search bar that allows a user to say, "Give me my sales by region for last month," and it just comes up, with where it came from and all of that behind the scenes. But I think that in order to get that information, we need to automate how we're finding that data, how we're integrating that data, and how we're serving up that data. Right now, it's all manual.
DBTA: So companies will use spiders or bots?
JC: Yes. The spiders will crawl your environment, whether it's on-prem or on the cloud, to find the data and extract all of the metadata about that data: all the tables, all the columns, all the structures, all of the relationships. After we have all of the datasets, all the databases, all of the structures, we will need to use machine learning to try to infer all of the relationships between the systems. There are a couple of companies out there that are starting to do some of these things. That's why I think it's achievable. Tamr is doing some of this. Global IDs does a little of it. There are not many, but there is maybe one other.
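To make the metadata-crawling idea concrete, here is a minimal sketch of what one small piece of such a "spider" might look like. It is not any vendor's implementation. It assumes every source is a SQLite database reachable through Python's standard sqlite3 module, and the source names, table names, and the helpers extract_metadata, crawl, and demo_sources are hypothetical. A real crawler would also have to discover the sources themselves and handle many other kinds of systems.

```python
import sqlite3


def extract_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for one source."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


def crawl(sources):
    """Build a metadata catalog for every source: {source_name: {table: columns}}."""
    return {name: extract_metadata(conn) for name, conn in sources.items()}


def demo_sources():
    """Hypothetical stand-ins for two departmental systems (in-memory databases)."""
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
    finance = sqlite3.connect(":memory:")
    finance.execute(
        "CREATE TABLE invoices (invoice_id INTEGER, cust_id INTEGER, amount REAL)"
    )
    return {"crm": crm, "finance": finance}


catalog = crawl(demo_sources())
for source, tables in catalog.items():
    for table, columns in tables.items():
        print(source, table, columns)
```

The resulting catalog is only the raw material. Inferring how the tables relate across systems is a separate step.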
DBTA: What else is needed?
JC: You need to have machine learning to build very simple pipelines to get the data out of these systems and put it into the data lake. You need to be able to interrogate the data, take that data and put it through some models to infer the relationships. And the other thing that we struggle so much with is figuring out how to join across different systems. We need to be able to interrogate the data automatically, and use machine learning to figure out how this stuff needs to be joined. And the last part is to systematically build a pipeline that carries out the data integration and builds a unified dataset. And then you're done.
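As a rough illustration of what "interrogating the data" to find joins could mean, the sketch below scores pairs of columns from different systems by how much their values overlap. This is a plain Jaccard-similarity heuristic, not machine learning, and it is offered only as an assumption about one simple starting point. The column keys, sample values, the 0.5 threshold, and the function names are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct values that two columns have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_joins(columns, threshold=0.5):
    """columns maps (system, table, column) to a sample of that column's values.

    Returns cross-system column pairs whose value overlap meets the threshold,
    highest score first.
    """
    results = []
    for (key1, vals1), (key2, vals2) in combinations(columns.items(), 2):
        if key1[0] == key2[0]:
            continue  # only compare columns that live in different systems
        score = jaccard(vals1, vals2)
        if score >= threshold:
            results.append((key1, key2, score))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical sampled values from two departmental systems.
sample = {
    ("crm", "customers", "customer_id"): [1, 2, 3, 4],
    ("finance", "invoices", "cust_id"): [2, 3, 4, 5],
    ("finance", "invoices", "amount"): [19.99, 5.5, 42.5, 7.25],
}

joins = candidate_joins(sample)
for left, right, score in joins:
    print(left, "<->", right, round(score, 2))
```

Value overlap alone will produce false positives (two unrelated columns of small integers, for example), which is one reason a learned model with richer signals such as column names, types, and value distributions is attractive here.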
DBTA: What else?
JC: And, I think, basically just index it and put a search engine on top of it. There are a million little details in between but at a high level that has to be the future. There's no other way we can sustain the amount of volume and variety of data that every enterprise is dealing with today. We'll just never finish. There is always a new system to be put in the data lake. It never ends, and if we don't figure out how to automate all of this really very fast, it's just going to be failure after failure.
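The "index it and put a search engine on top of it" step can be pictured with a toy keyword index over dataset and column names. This is a minimal sketch under stated assumptions: the catalog contents and the functions tokenize, build_index, and search are hypothetical, ranking is by simple token matches, and it returns matching datasets rather than answering the question itself.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(catalog):
    """catalog: {dataset_name: [column_name, ...]} -> {token: {dataset_name, ...}}."""
    index = defaultdict(set)
    for dataset, columns in catalog.items():
        for token in tokenize(dataset + " " + " ".join(columns)):
            index[token].add(dataset)
    return index


def search(index, query):
    """Return dataset names ranked by how many distinct query tokens they match."""
    hits = defaultdict(int)
    for token in set(tokenize(query)):
        for dataset in index.get(token, ()):
            hits[dataset] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))


# Hypothetical catalog of unified datasets.
catalog = {
    "sales_by_region": ["region", "month", "sales_amount"],
    "hr_headcount": ["department", "month", "headcount"],
}

index = build_index(catalog)
results = search(index, "sales by region for last month")
print(results)
```

A production system would need far more than this, including synonyms, lineage ("where it came from"), and access control, but the shape is the same: build the index from crawled metadata, then rank datasets against the user's question.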
DBTA: What are the key elements?
JC: There are four pieces. One is bots to find the data, and another is the machine learning to build the pipelines. Then, there is a part that unifies the data, and there's the search on top of that data when you're done. And I think each one of those will represent the different domains where we'll see new products being born to satisfy data integration and analytics needs.
DBTA: How far off in the future do you think this will become a reality?
JC: I think people are already starting and within 3 years it's going to be the new standard. I think it's moving that fast. Something drastic has to change. It's just taking too long to onboard data into the analytics platforms.
This interview was edited and condensed.