Modern Data Architecture: Data Lakes, Clouds, and Analytics

The data lake has become accepted as an important component of a modern data architecture, enabling a wider variety of data to be accessed by users. Yet, challenges persist. Recently, John O’Brien, CEO and principal advisor, Radiant Advisors, talked about the cultural transformation underway as companies increasingly realize the power of data, and the tools and technologies helping them to expand what is possible.

What are the key trends you are seeing?

Within the last year, data lakes have come to the point of acceptance as part of many data strategies. Where I have, surprisingly, found the challenge to be over the last 3-4 months is in the continued existence of multiple definitions of the term “data lake.”


Really?

I am shocked that people are quite confidently defining data lakes differently. Maybe it is like the classic data warehouse problem we had years ago. I saw one keynote at a conference where the presenter defined a data lake as the entire data ecosystem, consisting of the data warehouse, the raw data, the data mart, and the analytics. Recently, I saw someone else describe the data warehouse as part of the data lake. It is interesting to think about and all it does is reinforce our definition of a data lake.

Which is?

We look at it as the enterprise data repository of all available sources of data that drives analytics in the company. The forms of analytics can be self-service by people who want to curate and govern it, and it also includes a data warehouse approach which allows people to source from the data lake and not the system directly, and all the data science and AI applications. Internal systems, external systems, and user data systems have a very low barrier for data ingestion. That was always the big problem in the enterprise, but now that we have everything in a data lake, these other analytics applications or capabilities can be built on it. I am a little concerned that we may need to promote a clearer definition to help folks understand where it fits in.

Where is all of this going?

Data lakes are being accepted, and for us a big trend is that companies that want to modernize their data platforms are including a data lake as part of their strategy.

This includes repositioning their existing data warehouse, and possibly migrating it to the cloud. There are also people who see this as an opportunity to reassess their current data warehouse design. We say yes, introducing a data lake is a great foundation for those things to be built on, whether it is in the cloud or on premises.

As people expand their data lakes, what are the challenges?

The new challenges are the same as the old: data lake management, metadata, the need for a data catalog. The data lake lowered the barrier and allowed everyone to bring everything in but there was all this data and no one could find anything. The next solution is the data catalog, which we have seen take off in popularity in the last year.

How is it used?

With so much data inside the data lakes now, you need to have artificial intelligence and other routines that keep track of everything in the lake and you need to leverage more crowd-sourcing and self-service for the metadata about the data. This is a little more of an automated artificial intelligence and user self-service driven catalog as opposed to the traditional metadata repository and data catalog that falls under the governance umbrella. You first have to understand where everything is and then decide on the approved definitions. Then in that state of metadata, you approach the lineage to say this data came from here and it is certified because of this, and it has been cleansed for use in this way. We see a lot of activity in that area because data lakes rose to popularity quickly. There was all this data, but nobody could find anything or trust it. They might think: Here’s a great data source but who loaded it? Where did it come from? Why did they load it? What was it intended for? I see the data catalog as a combination of AI for automated information population and self-service with people doing the crowd-sourcing of that.
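As an illustration of the questions above (who loaded it, where it came from, what it was intended for), here is a minimal sketch of what a single catalog entry might record. It is not drawn from any particular catalog product; the class, field names, and sample datasets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One dataset in a hypothetical data lake catalog (illustrative only)."""
    name: str
    source: Optional[str] = None      # where did it come from?
    loaded_by: Optional[str] = None   # who loaded it?
    purpose: Optional[str] = None     # why was it loaded, what was it intended for?
    certified: bool = False           # has governance approved it for use?
    user_tags: List[str] = field(default_factory=list)  # crowd-sourced metadata

    def unanswered(self) -> List[str]:
        """Return the provenance questions that still have no answer."""
        questions = {
            "source": self.source,
            "loaded_by": self.loaded_by,
            "purpose": self.purpose,
        }
        return [name for name, value in questions.items() if not value]


# Made-up sample entries: one documented, one dropped into the lake with no context.
catalog = [
    CatalogEntry(
        name="web_clicks_raw",
        source="cdn_logs",
        loaded_by="ingest_job",
        purpose="clickstream analysis",
        certified=True,
    ),
    CatalogEntry(name="vendor_feed_2018"),
]

# Datasets nobody can trust yet, with the questions that remain open.
needs_review = {e.name: e.unanswered() for e in catalog if e.unanswered()}
print(needs_review)
```

In this sketch, automated routines would fill in fields such as `source`, while users would contribute `user_tags` and `purpose`, mirroring the combination of AI population and crowd-sourcing described above.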

What’s next?

As trends go, the data lake has opened the door to the next challenge. The first problem when companies are approaching modernization is having a lot of data coming in from new sources and their need to find a place to put it. The data lake has solved that problem, but now the next one is management of the data lake and that has become the bigger topic.

How is it getting solved?

With data lake management tools, self-service tools, and data catalog tools, the challenge is getting resolved, and now we are going to see what door that opens up next.

Does this mean companies are now more frequently doing analytics at the data lake level, not just at the data warehouse level?

Yes, for us the data lake has an increasing amount of analytics occurring in it, as an environment for discovery of new metrics. In order to run the business, there needs to be a set of metrics and dashboards to track progress, so everything is predefined: the metric, the calculation, the sources of data. The dark data that goes into the data lake is everything that was not predefined by the data warehouse, but it may be used for self-service, for departmental use, or as a component of a data science routine.

We focus on three different forms of analytics. We help companies by defining enterprise analytics as performance management, which is BI dashboards and KPIs, and self-service, which allows people to dig into data, understand it, curate their own data for their needs, or understand something well enough to institutionalize it into the data warehouse for certified use. The third form of analytics is use by the data scientists who are trying to solve a problem and understand how data can help solve it without the issue of selecting the data. It goes from one end of the spectrum, which is heavily biased, with metrics defined, to the other end of the spectrum, which is saying there is no bias and we are just trying to solve the business problem with any dataset that provides that data science capability.

What we have are three different forms of analytics all operating on a common foundation of the data lake. We really try to help companies define what enterprise data analytics looks like in the form of the analytic capabilities that the business requires to be competitively differentiated, understand its customers better, and produce better products. Usually, they need all three forms of analytics but they will adopt and evolve them in their own way.

When you look ahead to 2019, are there any trends you see on the horizon?

I think in the next year we will see more sophistication around data lake management. An exciting area that we are keeping an eye on is the role of the graph databases to apply management to the data. We see some early traction there with the use of the property graph or the knowledge graph to relate everything to the data lake. We are also seeing the use of AI and machine learning in data management, data prep, and governance tools. This allows data to talk to the user instead of the user talking to the data all the time, and is going to be a trend that continues to grow.
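To make the graph idea concrete, here is a minimal sketch of how lineage relationships between datasets in a lake might be held as a graph and queried. It uses a plain dictionary rather than a graph database, and every dataset name and the `upstream` helper are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical lineage edges: each dataset maps to the datasets it was derived from.
derived_from: Dict[str, List[str]] = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_raw", "customers_raw"],
    "churn_model_features": ["customers_raw", "web_clicks_raw"],
}


def upstream(dataset: str) -> Set[str]:
    """Return every dataset that the given dataset depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Which raw sources ultimately feed the dashboard?
print(sorted(upstream("sales_dashboard")))
```

A property graph or knowledge graph would attach further attributes (owner, certification status, definitions) to each node and edge; the traversal idea stays the same.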

What else?

The space for data prep, semantic layer, and collaboration tools is becoming crowded, so I think you will see much more of an emergence of platform tools that include governance, cataloging, and collaboration. The tools are now going to match the way users work, and that is part of what we call the modern analytics lifecycle. The tools will converge, reducing the number of steps users have to take to switch in and out of what they are doing.

And finally, governance is without a doubt going to have a strong stake in the ground. People are working to change the perception of governance from an edict saying, “Thou shalt not do this,” to a collaborative effort and a valuable way to share information. We see the changing face of data governance, and it is part of a cultural change.

The whole modern data-centric approach is a cultural shift and companies are increasingly aware that what is needed is not just another database or another ETL tool. They are realizing that the relative value comes from people having a better grasp of the power of data and what they can do with it and putting that power into their own hands.

Interview has been edited and condensed.


