
The Role of ETL and Data Prep in the Cloud


You’ve probably heard of ETL, or heard somebody talk about “ee-tee-elling” their data. It’s a technology from the days of big iron for extracting data from many relational databases, transforming it to a corporate standard schema, and loading it into another unified relational database. That target database, the enterprise data warehouse (EDW), was intended to be a “single source of truth” for the enterprise. It included all the data that IT deemed fit (and available) for decision makers to use.

EOL for ETL?

ETL is old stuff. Architectures for data management have changed in the modern era of mobile phones, IoT, AI, and cloud computing. We’ve moved from recording millions of transactions to billions of interactions. Rather than inputting data into formal business processes, we’re capturing signals that can inform business opportunities. The idea of a terabyte-sized “single source of truth” for a large enterprise is as quaint today as the old Yahoo directory of all the “pages on the web.”

Surprisingly, in the face of all that growth and diversity, many organizations have actually improved their ability to centralize data management in recent years, supporting far more diverse and sophisticated use cases. These data-driven organizations are moving data into modern centralized storage structures (data lakes, cloud blob storage) and using new technologies for data preparation to assess and transform data for use. How did this change happen? And does it mean end-of-life (EOL) for ETL?

Data Lake? S3? Potato? Potahto? What’s This All About?

The phrase “data lake” was coined to capture a new approach to data storage. The idea is simple: Establish an organizational convention that all data captured in an organization must be loaded in raw form into a single, large, inexpensive storage system—a “data lake,” “landing zone,” “data commons”—call it what you will.

Over the last decade, some organizations have invested in building an on-premises data lake. As with any new IT process, this gave rise to philosophical and political debates: warehouses versus lakes, control of data resources, cost, and redundancy. You may have seen headlines or PowerPoint presentations decrying the idea of data lakes as “data swamps.” That critique led to constructive outcomes: New open source and commercial products emerged for cataloging data, making data lakes more transparent to manage and use.


Meanwhile, the cloud became a locus of innovation and best practices in data management. In the cloud, simple economics of storage pricing dictate behavior. “Blob storage” systems such as AWS S3, Azure Blob Storage, and Google Cloud Storage are basically inexpensive, infinitely scalable, write-once file stores. They are also the cost-efficient choices for storing data at rest in the cloud. Simple economics—and the lack of legacy deployments in the cloud—led many cloud-centric organizations naturally to land their data in services such as S3. In effect, these organizations developed an S3-based data lake architecture with little to no controversy. In the cloud, data lakes are the economical and expedient thing to do.
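To see how low-friction this “just land it” pattern is, here is a minimal sketch of dropping a raw extract into S3 using boto3. The bucket name, key layout, and source file are hypothetical, and it assumes AWS credentials are already configured in the environment; the point is simply that loading requires no schema and no transformation.

```python
# Land a raw extract in S3 untouched: no schema, no transformation.
# Assumes AWS credentials are configured and the bucket already exists.
import datetime

import boto3

s3 = boto3.client("s3")

# Hypothetical source file and bucket; the key is partitioned by load date
# so downstream jobs can locate and reprocess raw drops later.
load_date = datetime.date.today().isoformat()
s3.upload_file(
    Filename="exports/orders_dump.csv",
    Bucket="acme-raw-data-lake",
    Key=f"landing/orders/load_date={load_date}/orders_dump.csv",
)
```

Because the file is stored as-is, the person loading it makes no decisions on behalf of future consumers; all of that is deferred to preparation time.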

Data warehouses are still a viable value-added target for selected data from the lake. Many cloud data warehouse products such as Snowflake, Google BigQuery, AWS Redshift, Azure Polybase, Databricks, and Cloudera Impala even allow virtual data warehouse structures to be constructed directly over blob storage, without any warehouse loading. In these environments, the data lake doesn’t supplant the data warehouse; it populates it or simply backs it.
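As one concrete illustration of a “virtual” warehouse structure over blob storage, the sketch below uses Google BigQuery’s Python client to define an external table over Parquet files sitting in Cloud Storage, so queries scan the lake files in place rather than loading them. The project, dataset, and bucket names are hypothetical; the other warehouses named above offer analogous external-table mechanisms.

```python
# Define a BigQuery table that reads Parquet files in place from
# Cloud Storage -- no warehouse load step. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="acme-analytics")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://acme-raw-data-lake/landing/orders/*.parquet"]

table = bigquery.Table("acme-analytics.sales.orders_external")
table.external_data_configuration = external_config
client.create_table(table)  # queries now run directly over the lake files
```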

Lesson: Decouple Movement From Preparation (aka “E and L Come Before T”)

One of the key lessons of this data lake shift was to decouple data movement from data preparation. Data movement—roughly, the “E” and “L” of ETL—is an operational task for managing storage costs and access control. Data preparation—roughly, the “T” of ETL—is a content-based, analytics-facing task that requires really understanding both the data and a specific use case being pursued.

This clean separation between movement and prep has many benefits. First, postponing transformation radically reduces friction. The person or process loading the data is not responsible for transforming it to anybody else’s specs at load time. It is hard to overstate how important this is: Requiring data to be transformed before it is loaded is an enormous disincentive for sourcing and sharing data. Second, the ease of loading in a shared repository enables IT to manage all of an organization’s data under a single API and authorization framework. At least at the granularity of files, there is a single locus of control. Third, transformation often loses information as it “boils down” raw data for a specific use case. By contrast, raw untransformed data is fodder for unanticipated new use cases and leaves a record for subsequent auditing and compliance.
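As a sketch of what “T after E and L” can look like in practice, the snippet below reads a raw file that was landed untouched and applies a use-case-specific transformation only at analysis time. The paths and column names are hypothetical, and it assumes pandas with s3fs installed so that pandas can read and write s3:// URLs directly.

```python
# Prepare data for one specific use case, reading the raw landed file.
# Assumes pandas and s3fs are installed; paths and columns are hypothetical.
import pandas as pd

raw = pd.read_csv(
    "s3://acme-raw-data-lake/landing/orders/load_date=2020-06-01/orders_dump.csv"
)

# Use-case-specific "T": clean types and aggregate for a revenue dashboard.
raw["order_ts"] = pd.to_datetime(raw["order_ts"], errors="coerce")
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

daily_revenue = (
    raw.dropna(subset=["order_ts", "amount"])
       .assign(order_date=lambda df: df["order_ts"].dt.date)
       .groupby("order_date", as_index=False)["amount"]
       .sum()
)

# The raw file stays in the lake untouched, available for other,
# unanticipated use cases and for later audit.
daily_revenue.to_parquet("s3://acme-raw-data-lake/curated/daily_revenue.parquet")
```

The transformation lives with the use case, not with the load job, which is exactly the decoupling the section describes.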


