The New Data Lakehouse: One Size Doesn’t Fit All Yet

Vendors are fairly predictable creatures. Any given vendor will tell you that their new tool solves virtually all of your problems; just keep explaining those problems to the vendor's rep until they can frame the right arguments to convince you. The new tool will also likely support the brand-new framework you don't yet understand because it is still being developed, but the tool will be there. Get on board or get left behind. Case in point: the data lakehouse framework has any number of vendors offering tools that an organization supposedly must have to do a data lakehouse properly.

A New Framework

The data lakehouse merges the organization's data lake and data warehouse into one platform, eliminating data redundancy and the time lost moving data from one place to another.

Is this newest data savior really better than all the data saviors of the past? Detractors suggest that a data lake with big data flowing in, while a valuable tool in the chest, is actually of limited use. The limitation results from the most important transactional data still being created and managed in highly structured, often relational, sources. And while these transactions are high volume for the organization, their overall volume may still be fairly modest by big data standards.

“Unstructuring” the data from these operational sources into JSON or another lake-friendly format, only so that users can run their structured queries and structured reports against it, is arguably a solution to a nonexistent problem (except, of course, for the big data/data lake vendors' “problem” of how to expand their customer base and usage).
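The round trip the column describes can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual pipeline; the record and field names are invented for the example:

```python
import json

# Hypothetical: an orders row from a relational source, serialized
# ("unstructured") into JSON for storage in a data lake.
order_row = {"order_id": 1001, "customer_id": 7, "amount": 250.00}
lake_record = json.dumps(order_row)

# To answer an ordinary structured query ("total amount by customer"),
# the JSON must be parsed straight back into the tabular shape it
# started in -- a round trip that adds work without adding capability.
parsed = json.loads(lake_record)
totals = {}
totals[parsed["customer_id"]] = totals.get(parsed["customer_id"], 0) + parsed["amount"]
```

The data ends up exactly where it began, structurally speaking, after two conversions that a query against the original relational source would never have needed.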

Different Needs

One difficulty with having all the data in the data lake, yet not keeping multiple copies of it, ultimately concerns the users. Today, different users relate to the data in different ways. The non-data scientist still has simple needs and perspectives. Simple data structures, such as star schemas, are easily relatable for this class of user. Views, or their equivalent, can translate data lake structures into shapes these users understand, but such views may or may not perform reasonably. In addition, the base structures may not support the necessary tracking of historical changes.
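A minimal sketch of the star-schema translation described above, with invented table and column names; a real implementation would live in SQL views or a semantic layer, but the join-and-flatten idea is the same:

```python
# Hypothetical star schema: a fact table plus a product dimension,
# presented to non-data-scientist users as one flat, relatable rowset.
fact_sales = [
    {"date_key": 20240101, "product_key": 1, "qty": 3},
    {"date_key": 20240102, "product_key": 2, "qty": 5},
]
dim_product = {1: {"name": "Widget"}, 2: {"name": "Gadget"}}

def sales_view():
    """A view-like translation: join fact rows to dimension attributes."""
    for row in fact_sales:
        product = dim_product[row["product_key"]]
        yield {"date_key": row["date_key"],
               "product": product["name"],
               "qty": row["qty"]}
```

Whether such a view performs acceptably at scale is exactly the open question the column raises; the logical translation is easy, the physics is not.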

Data may need to be duplicated to address presentation and change-tracking needs. And if it is duplicated, a data virtualization tool can make the data's physical location a moot point. Is it in the data lake or a relational data warehouse? Who needs to care? If we assume technological advances will one day allow effectively infinite performance, perhaps a day will come when operational data does not even need to be moved from its original source, and both operational and analytical queries can be executed in place.
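The virtualization idea reduces to a thin routing layer. The sketch below is an assumption-laden toy, not any product's API; the stores and catalog are invented to show only the principle that callers query a logical name, never a physical location:

```python
# Hypothetical physical stores: one "warehouse", one "data lake".
warehouse = {"customers": [{"id": 1, "name": "Acme"}]}
data_lake = {"clickstream": [{"id": 1, "page": "/home"}]}

# The virtualization layer: a catalog mapping logical table names
# to whichever physical store actually holds them.
catalog = {"customers": warehouse, "clickstream": data_lake}

def query(table_name):
    """Resolve a logical table to its physical store and return its rows."""
    store = catalog[table_name]
    return store[table_name]
```

If the clickstream table later moved into the warehouse, only the catalog entry would change; every caller of `query` would be none the wiser, which is the "who needs to care?" point in miniature.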

Organizational Analytics Requirements

We may have a bit of a wait until a data lakehouse can smoothly answer the entirety of an organization's analytics needs. Maybe, by that time, yet another framework will be in vogue. Solutions still need to evolve to a point where physical data structures are virtually irrelevant, but that day will come.

Our storage and processing capacities may grow until they are virtually infinite, and all that remains is to logically stitch the pieces together in a metadata repository of some sort, after which all data users will be able to query and obtain their answers quickly and correctly. However, the path to such a future demands a much better and deeper understanding of our data, along with a consistency in implementation that, as a society, we have never encouraged before. Clever workarounds and shortcuts implemented across our data solutions will need to die off for the greater good. The idea that any new tool or framework can simply sidestep this deeper understanding is shortsighted at best.