Video produced by Steve Nathans-Kelly
Florida Blue's Padmesh Kankipati compares essential features of data lakes and data warehouses and the kinds of data that go into them in this clip from his presentation at Data Summit 2019.
Citing data warehousing expert Bill Inmon's definition, Kankipati said a data warehouse can be described as subject-oriented, integrated, time-variant and non-volatile. "So it is domain-specific integrated data. You take different domains and integrate the data and you hold the history of the data. And once you get into the data variables, it's not going to change."
That is how the data warehousing concept has been for decades, said Kankipati, who then cited a definition from another data warehousing expert, Ralph Kimball, explaining that a data warehouse is a copy of transaction data specifically structured for query and analysis. So it is structured data with well-defined structures and a data layout that you can easily query and do analysis on, said Kankipati.
And finally, Kankipati quoted James Dixon from Pentaho, who said that if a data warehouse is a store of bottled water, cleansed, packaged and structured, the data lake is a large body of water in a natural state streaming from multiple sources for the user to dive in.
Commenting on the ways data lakes differ from data warehouses, Kankipati said, "One is data storage, in that a data lake can hold any kind of data. If we are looking at a data warehouse there is a lot of analysis that goes in. What kind of data needs to go into data warehouse? And what kind needs to be excluded? For the data lake you have everything going in, raw data. And the data lakes are built on commodity hardware that is cost-effective and cheap. And the data storage can expand horizontally."
DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.
According to Kankipati, decades ago there was the "SMP architecture, it is very famous, then Teradata introduced MPP, and then Netezza or IBM adopted that. And, nowadays, even the same model but with the different design, we have Hadoop as a big data architecture. The data warehouse can only house structured data, but with the data lake, you can have both structured and unstructured data, and it's schema-less, which means schema-on-read versus schema-on-write when we build a data warehouse, and you need to know how to store the data. With the data lake, the schema is open. Like your schema-on-read, you drop the data and then provide the schema later. And the user types, with the data warehouse, you have users mainly as operational users and business users, and some analysts doing some analytic work."
With the data lake, said Kankipati, in addition to those, an organization can support data scientists doing deep data analytics.
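The schema-on-write versus schema-on-read distinction Kankipati describes can be illustrated with a small Python sketch. It is not taken from the presentation: the sample records, the `claims` table, and the function names are all hypothetical, and an in-memory SQLite database merely stands in for a warehouse. The warehouse-style loader fixes the table layout before loading and turns away records that do not fit, while the lake-style reader keeps every raw line and applies whatever schema the caller supplies at read time.

```python
import json
import sqlite3

# Hypothetical raw events; the field names are illustrative only.
RAW_EVENTS = [
    '{"member_id": 1, "amount": "120.50", "channel": "web"}',
    '{"member_id": 2, "amount": "75.00"}',
    '{"member_id": 3, "note": "free-text comment, no amount"}',
]


def load_schema_on_write(lines):
    """Warehouse style: the table schema is fixed before any data is loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims (member_id INTEGER NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for line in lines:
        record = json.loads(line)
        try:
            # Only the modeled columns are kept; extra fields are dropped.
            params = (record["member_id"], float(record["amount"]))
        except KeyError:
            # A record missing a modeled column cannot be loaded at all.
            rejected.append(record)
            continue
        conn.execute(
            "INSERT INTO claims (member_id, amount) VALUES (?, ?)", params
        )
    return conn, rejected


def read_with_schema(lines, schema):
    """Lake style: raw lines stay as-is; a schema is applied only on read."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of blocking the record.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    conn, rejected = load_schema_on_write(RAW_EVENTS)
    print(conn.execute("SELECT COUNT(*) FROM claims").fetchone()[0], "rows loaded")
    print(len(rejected), "record(s) rejected at load time")

    # Two different schemas applied to the same raw data, decided at read time.
    print(read_with_schema(RAW_EVENTS, {"member_id": int, "amount": float}))
    print(read_with_schema(RAW_EVENTS, {"member_id": int, "note": str}))
```

In this sketch, the free-text record never reaches the warehouse table, yet it remains available in the raw data to any reader that asks for the `note` field. That is the flexibility the schema-on-read approach trades against the up-front guarantees of a modeled table.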
It is possible to have much deeper analytics using the raw data in the data lake, said Kankipati. "And actually, when you sort the raw data, you can quickly build different paths of data processing for data analytics needs. Whereas in data warehouse, you have to go through the lifecycle model for it, the data lifecycle, data modeling lifecycle, even for the small change. And it takes time to make the change. The agility is an advantage. Data lakes can be used for many purposes. And some of the uses we have mentioned here are data staging, and as a raw data repository. And if you use it just for a raw data repository, data staging, data lakes don't have much value. Because you are just having it in front of the data warehouse, just to feed the data. Data lakes can do data transformation, data discovery, and data analytics. And as I mentioned, they do deep data science."
Many presenters have made their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.
To access the full video of the presentation, "Modern Data Platforms & Application Architecture," go to https://datasummit.brightcovegallery.com/detail/videos/data-summit-2019-track-a/video/6040751347001/a202a.-modern-data-platforms-application-architecture?autoStart=true#links