The Balancing Act Between Data Warehouses and Data Lakes

Cloud data lakes benefit from an open and loosely-coupled architecture that minimizes the risk of vendor lock-in as well as the risk of being locked out of future innovation.

However, the many benefits of cloud data lakes are negated if data is duplicated into a data warehouse and then again into cubes, BI extracts, and aggregation tables. Because of this, many organizations are now striving to find the right balance between their data warehouse and data lake investments.

DBTA held a webinar featuring Gabriel Jakobson, senior solution architect at Dremio, and Roy Hasson, senior manager of business development for analytics and data lakes at AWS, who discussed how to find and best implement that balance between the data warehouse and the data lake.

In a pure data warehouse world, BI users are being left behind, Jakobson said. To make a data lake useful for analytics, data has to have meaning, data has to relate to other data, data needs to be queryable via SQL, and SQL needs to run at speed.

Dremio allows users to query Amazon S3 directly with 4x performance and minimal data movement while maintaining control of data, according to Jakobson.

With this solution, data lakes and existing data warehouses can coexist. Dremio provides massively parallel reads from the data lake, plus heavily optimized query pushdowns to existing databases and data warehouses.

According to Jakobson, users can leverage Dremio to put new workloads on the data lake and then gradually migrate existing workloads.

“Don’t boil the ocean; data lakes and data warehouses can coexist, with data and workloads migrating to your data lake over time,” said Jakobson. “The key is to make it transparent to your BI users; they should never experience data loss, degradation of performance or inability to use their existing tools.”

Amazon S3 provides capabilities including:

  • Serverless
  • Near infinite scale
  • Secure – access control and strong encryption
  • Automatic life-cycle management of files
  • Robust data access API
  • Highly performant
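
As an illustration of the life-cycle management item above, the following is a minimal Python sketch and not something shown in the webinar. It builds a rule in the shape S3's lifecycle configuration expects, moving objects to a cheaper storage class and later expiring them. The helper name build_lifecycle_rule, the prefix raw/events/, the day counts, and the bucket name in the comment are all hypothetical; applying the rule would require the boto3 package and AWS credentials, so that call is left commented out.

```python
import json


def build_lifecycle_rule(prefix, transition_days, expire_days,
                         storage_class="STANDARD_IA"):
    """Return one lifecycle rule as a plain dict (hypothetical helper)."""
    if transition_days >= expire_days:
        raise ValueError("objects must transition before they expire")
    return {
        # Rule ID derived from the prefix, e.g. "tier-and-expire-raw-events"
        "ID": f"tier-and-expire-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Move objects to a cheaper storage class after transition_days
        "Transitions": [{"Days": transition_days, "StorageClass": storage_class}],
        # Delete objects after expire_days
        "Expiration": {"Days": expire_days},
    }


# Assumed example values: tier after 30 days, expire after one year.
lifecycle_config = {"Rules": [build_lifecycle_rule("raw/events/", 30, 365)]}

if __name__ == "__main__":
    print(json.dumps(lifecycle_config, indent=2))
    # To apply the rule (assumes boto3 is installed and credentials are set):
    # import boto3
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="example-data-lake-bucket",  # hypothetical bucket name
    #     LifecycleConfiguration=lifecycle_config,
    # )
```

Keeping the rule as plain data makes it easy to review and version alongside other data lake configuration before anything is sent to AWS.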

The key benefits of this data lake storage are that it reduces duplication and is centrally managed, integrated, and cost-effective, Hasson said.

By combining AWS, AWS Glue Data Catalog, and AWS Lake Formation, users get a single-pane-of-glass, self-service experience on a platform that is managed, governed, and ubiquitous, and that offers flexible pricing models.

An archived on-demand replay of this webinar is available here.