The Fast-Shifting Data Landscape: Data Lakes and Data Warehouses, Working in Tandem

Page 2 of 3

Data warehouses have always been useful “to answer questions reliably, accurately, and in a predictable, timely manner,” said Joe DosSantos, global head of data management strategy at Qlik. “They also provide an auditable transformation process with clear rules that allow users to get the key top-line information they need, ranging from revenue and KPIs to items related to regulatory reporting.” However, he continued, many business questions are ad hoc or open to interpretation, such as, “What do we think the impact of a storm will be?” and “What color will be the new black in fashion?” Answering these types of questions requires experimentation with a volume and variety of data not often found in a data warehouse, because these systems are built around careful planning for a narrow set of business requirements, DosSantos explained. “Data warehouses are much better-suited to describe what has happened in the past than what’s occurring now or in the future. Once enterprises uncover the initial information, then they can use real-time data to recognize and react to current conditions and potentially alter business decisions.”

Look to data warehouses as “a source for consistent data and repeated processes, such as business reporting and dashboards,” Freivald noted. “The consistency of data warehouses makes them perfect for trend analysis.” In this way, with their different use cases, “data lakes and warehouses complement each other,” he added. “New insights generated with raw data from a lake can inspire new dashboards and reports from the warehouse.”

Raghu Chakravarthi, SVP of R&D at Actian, sees data lakes as well-suited to the fast ingestion and storage of vast amounts of uncurated data from a range of sources. They also handle a variety of data types easily, provide broad connectivity, support access from multiple languages, and enable data discovery, data preparation, model building, and model fitting, he noted.

Data warehouses, conversely, are better suited to storing structured, curated data in order to execute and operationalize dashboards and, eventually, perform predictive scoring, which requires a guaranteed SLA and enables IT to operate the environment cost-effectively, Chakravarthi said.
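The fast, schema-free ingestion described here can be sketched with a minimal, self-contained example. A local directory stands in for object storage, and the file names and records are purely hypothetical:

```python
import csv
import json
from pathlib import Path

# Hypothetical local directory standing in for lake object storage (e.g., S3).
lake = Path("lake/raw/events")
lake.mkdir(parents=True, exist_ok=True)

# Schema-on-read: heterogeneous records land as-is, with no upfront modeling.
clickstream = [{"user": "u1", "page": "/home", "ts": "2024-01-01T00:00:00"}]
(lake / "clickstream.json").write_text(json.dumps(clickstream))

sensor_rows = [["device", "temp_c"], ["d42", "21.5"]]
with (lake / "sensors.csv").open("w", newline="") as f:
    csv.writer(f).writerows(sensor_rows)

# Discovery step: list whatever arrived, regardless of format.
print(sorted(p.name for p in lake.iterdir()))
```

The point of the sketch is that nothing was categorized or filtered before landing; structure is imposed only when the data is later read.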

“The data contained in these data lakes can offer new insights and findings that are quite different from the data usually collected in a data warehouse,” according to Kazmaier. “When combining data from data lakes with more traditional data analytics, users can uncover business-critical insights that are not possible with traditional data warehouses or data lakes alone.”

Data lakes also can act “as a staging area for data being prepared for a data warehouse,” said Kazmaier. “The unique value that data lakes offer is to avoid upfront processing, categorizing, and filtering of data. Instead of having two separate tools, the combination of both is the ultimate solution: automatically feeding the data warehouse with data insights based on data out of the data lake and combining it with more structured data in the warehouse to get insights.”
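The staging pattern described above, where raw data sits in the lake and only a curated, typed slice feeds the warehouse, can be sketched roughly as follows. SQLite stands in for the warehouse and a local JSON file for the lake; all table names and records are illustrative:

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical raw lake file; in practice this would live in object storage.
raw = Path("lake/raw/orders.json")
raw.parent.mkdir(parents=True, exist_ok=True)
raw.write_text(json.dumps([
    {"order_id": 1, "amount": "19.99", "status": "complete"},
    {"order_id": 2, "amount": "oops", "status": "complete"},   # malformed
    {"order_id": 3, "amount": "5.00", "status": "cancelled"},  # filtered out
]))

# The warehouse side: a curated, typed table (SQLite as a stand-in).
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

# Staging transform: filter and type-check only when feeding the warehouse.
for rec in json.loads(raw.read_text()):
    if rec["status"] != "complete":
        continue
    try:
        amount = float(rec["amount"])
    except ValueError:
        continue  # leave bad records in the lake for later inspection
    wh.execute("INSERT INTO orders VALUES (?, ?)", (rec["order_id"], amount))

print(wh.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

Note the division of labor: the lake keeps everything, including the malformed record, while the warehouse receives only rows that pass the curation rules.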

Data lakes “typically have much lower licensing costs per unit of capacity—typically either per-terabyte or per-node pricing,” said Zweben. In addition, he noted, “Data lakes are typically easier to scale to very large sizes.” In combination, these characteristics mean that organizations can afford to be less selective about what they retain. Data that might have been tossed out or relegated to cold storage can instead be kept in the data lake. At the same time, data warehouses will continue to be the primary area of data consolidation in organizations, he continued. “They provide more mature functionality—such as security, lineage, and transactional capabilities.”
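The cost argument can be made concrete with some back-of-the-envelope arithmetic. The per-terabyte monthly prices below are purely hypothetical, not vendor figures, and serve only to show how a tiered layout changes the equation:

```python
# Illustrative only: hypothetical per-terabyte monthly prices, not vendor quotes.
lake_price_per_tb = 25.0        # assumed object-storage-style pricing
warehouse_price_per_tb = 300.0  # assumed warehouse-style pricing

data_tb = 500  # total raw data volume in terabytes

# Option A: keep everything in the warehouse.
all_in_warehouse = data_tb * warehouse_price_per_tb

# Option B: keep the raw bulk in the lake and only a curated 10% slice
# in the warehouse.
tiered = data_tb * lake_price_per_tb + 0.1 * data_tb * warehouse_price_per_tb

print(all_in_warehouse, tiered)  # 150000.0 27500.0
```

Under these assumed prices, the tiered layout costs roughly a fifth as much per month, which is why teams can stop being selective about what they retain in the lake.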

Of course, the rise of cloud computing is changing the equation for both data lakes and data warehouses—“drastically,” said Small. “Public cloud providers continue to push the boundary in terms of richness of offerings and deployment flexibility. No modern data analytics strategy is complete without consideration of where public cloud plays a role.”

The emergence of cloud and serverless computing “poses an interesting conundrum that will impact the way organizations store and access data,” said Kaluba. It makes a data strategy even more important because it addresses “what data is stored in the cloud, the management and protection of data, how the information is going to be used, and by whom.”

Increasingly, tying a data strategy to the cloud has become a necessity for many organizations. Scenarios that involve data exploration, data science, and prediction models benefit from elastic scaling and working with virtually unlimited hardware resources, said Kazmaier. Having a data lake and a data warehouse in the cloud with the ability to connect to other databases enables a simple gateway for all enterprise data. “This allows organizations to benefit from unlimited low-cost storage, highly flexible data lakes, and high-performance data warehouses for powerful, real-time business analytics.”
