The Fast-Shifting Data Landscape: Data Lakes and Data Warehouses, Working in Tandem


But moving their data to the cloud is not an option for all companies, Kazmaier cautioned. As a result, a modern data warehouse solution must be able to work on top of on-premises systems while enhancing them with new data sources. “Three key areas need to be covered when looking at data governance in a cloud data warehouse. Data lineage ensures users know exactly where data is created, stored, maintained, and where it’s going. The use of intelligent replication and virtualization ensures that data is refreshed in real time to create a single source of truth. Finally, roles and permissions ensure that each user has appropriate access levels.”

There are complexities with data-driven cloud engagements as well, Zweben said. For example, he noted, a company might want to build a data infrastructure composed of an OLTP database, an OLAP engine, and a machine learning workbench in the Amazon Web Services (AWS) cloud. Depending on the use case, this would require subscribing to the Amazon S3 storage layer, the Redshift or Snowflake data warehouse solutions, Amazon RDS or Amazon DynamoDB, and one of nine machine learning engines such as Amazon SageMaker, and then integrating all of them using AWS Glue, an ETL tool. This is a complex architecture that is expensive to build and operate, Zweben said, and it requires data movement across platforms, which can result in poor business decisions if insights are drawn from stale data.


Data lakes may add a great deal of flexibility to an enterprise data strategy, but they are supported by fast-breaking technologies that require constant vigilance. “Designing data lakes that are future-proof is challenging, given the rate of new technologies coming to the market,” said Desouza. “For example, GPU-accelerated databases are 10 times faster than in-memory databases that are 10 times faster than parallel columnar databases that are 10 times faster than traditional row-oriented databases. As prices drop, so does the balance among them in a data lake.”

Another example of the changing technology foundation for data lakes is the evolution of massively parallel processing, exemplified by the rise of Spark versus MapReduce, Desouza pointed out. “Nowadays, it is difficult to justify the use of MapReduce; it used to dominate data processing only 5 years ago.”

Governance is also a challenge, with the risk of data lakes becoming catch-all repositories for obsolete or irrelevant data. “Data lakes absorb data from many different places, and they’re rarely cleaned up as people leave the company, reorganize, or abandon projects,” said Freivald. “Someday, a data librarian or curator will be needed to help people find the relatively small amount of data they’re looking for in the vast amount of data in the lake. The lack of governance limits the amount of trust that people can place in the information they get from data lakes.”

However, said Kaluba, with a well-grounded data strategy, data lakes can serve many of the same purposes as data warehouses. “That data strategy should be supported by data governance and data management processes to ensure the data inside of the data lakes is reliable for decisioning processes needed by the organization. Otherwise, data lakes will continue to be the data dumping ground they are today.”


The data warehouse and data lake each solve different business problems and impose their own unique challenges, said Anthony Roach, senior product manager of MarkLogic. “A data lake is a co-located storage solution, not a repository optimized to perform analysis and find business insights. The intention of the data lake is to store data from multiple sources in a single location in its raw form. The data warehouse is traditionally a relational system. By definition, that means the inbound data must be transformed into the warehouse model.” The challenge of the data lake arises when you need to combine data from multiple sources. The data has been co-located, but it has not been harmonized or rationalized, so curation is still required. “The data warehouse, on the other hand, offers a harmonized view of the data, but due to the transformation process, the original data has been altered or lost,” Roach said.
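
The trade-off Roach describes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two source systems, their field names, and the two-column warehouse model are hypothetical and do not come from any product mentioned in this article. The lake keeps each record in its raw form, while the warehouse step transforms records into one model and drops whatever does not fit.

```python
# Hypothetical records from two sources, stored exactly as they arrived.
# This dictionary plays the role of the data lake: co-located, not harmonized.
raw_lake = {
    "crm": [
        {"CustomerName": "Ada Lopez", "Phone": "555-0100", "Notes": "prefers email"},
    ],
    "billing": [
        {"cust": "LOPEZ, ADA", "amount_cents": 12500, "currency": "USD"},
    ],
}


def to_warehouse_row(source, record):
    """Transform one raw record into a fixed, hypothetical warehouse model.

    The model has only two columns (customer, amount_usd), so any other
    source field is dropped during the transformation.
    """
    if source == "crm":
        customer = record["CustomerName"].lower()
        amount_usd = None  # the CRM source carries no amount
    elif source == "billing":
        last, first = [part.strip() for part in record["cust"].split(",")]
        customer = f"{first} {last}".lower()
        amount_usd = record["amount_cents"] / 100
    else:
        raise ValueError(f"no transformation defined for source {source!r}")
    return {"customer": customer, "amount_usd": amount_usd}


# The warehouse holds a harmonized view: both sources now agree on the name.
warehouse = [
    to_warehouse_row(source, record)
    for source, records in raw_lake.items()
    for record in records
]
```

In this sketch the two warehouse rows can be joined on `customer`, which the raw records could not, but the `Phone` and `Notes` fields survive only in the lake.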

The bottom line is that organizations shouldn’t write off data warehouses—as they evolve, they are taking on new roles in digital enterprises. “The future of the data warehouse probably looks something like what Netflix has constructed,” Brey pointed out. “Data is housed in a cloud object store; serialized in efficient binary formats like Parquet, ORC, or Avro; and schema and other metadata is stored in a surrogate system like the Hive metastore. This allows the use of a plurality of data processing and analytics engines.” 
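
The separation Brey describes, with data files in one place and schema in another, can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Netflix or Hive works: a temporary local directory stands in for the cloud object store, a CSV file stands in for a binary format such as Parquet, a JSON file stands in for the metastore, and the table name, columns, and functions are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# Stand-in for a cloud object store: a temporary local directory.
store = Path(tempfile.mkdtemp())

# Data files hold values only (CSV here as a stand-in for Parquet, ORC, or Avro).
data_file = store / "views" / "part-0000.csv"
data_file.parent.mkdir(parents=True)
with data_file.open("w", newline="") as f:
    csv.writer(f).writerows([["t1", "42"], ["t2", "17"], ["t1", "8"]])

# Stand-in for the metastore: schema and location live apart from the data.
catalog = {
    "views": {
        "location": str(data_file.parent),
        "columns": [["title", "str"], ["minutes", "int"]],
    }
}
(store / "catalog.json").write_text(json.dumps(catalog))


def read_table(store_root, name):
    """Look up a table in the catalog, then read and type its data files."""
    meta = json.loads((store_root / "catalog.json").read_text())[name]
    casts = {"str": str, "int": int}
    rows = []
    for part in sorted(Path(meta["location"]).glob("part-*.csv")):
        with part.open(newline="") as f:
            for raw in csv.reader(f):
                rows.append(
                    {col: casts[typ](val) for (col, typ), val in zip(meta["columns"], raw)}
                )
    return rows


# Two independent "engines" that share the same files and the same catalog.
def total_minutes(rows):
    return sum(row["minutes"] for row in rows)


def minutes_by_title(rows):
    totals = {}
    for row in rows:
        totals[row["title"]] = totals.get(row["title"], 0) + row["minutes"]
    return totals
```

Because neither function knows the file layout or column types on its own, a third engine could be added without touching the stored data, which is the point of keeping metadata in a surrogate system.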
