
The Role of ETL and Data Prep in the Cloud


The T Is Changing: Be Prepared and Wrangle On

Once data movement is separated from data preparation, it becomes natural for a dataset to be reused for many different purposes. For example, consider geolocation logs gathered from a phone app. The marketing department might prepare that data in a specific, fine-grained way for an AI algorithm that offers discounts to individual customers based on their predicted travel trajectories. The DevOps department for the application might prepare that data very differently, to analyze app performance based on coarse-grained groups of customer interactions segmented by operating system, network connectivity, and so on. Both of the prepared datasets are highly valuable, but neither is the one “single source of truth for the enterprise.” In modern analytics environments, data preparation is often part of each specific analytics use case and needs to be done by the people who understand the use case best.
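To make that contrast concrete, here is a minimal Python/pandas sketch, under assumed conditions: the column names (user_id, ts, lat, lon, os, latency_ms) and the sample rows are hypothetical stand-ins for a real geolocation log. The same raw data is prepared two different ways, fine-grained per-customer trajectories for marketing and coarse-grained performance aggregates for DevOps.

import pandas as pd

# Hypothetical raw geolocation log from the phone app; column names and values are illustrative.
logs = pd.DataFrame({
    "user_id":    ["u1", "u1", "u2", "u2"],
    "ts":         pd.to_datetime(["2020-06-01 09:00", "2020-06-01 09:05",
                                  "2020-06-01 09:00", "2020-06-01 09:10"]),
    "lat":        [37.77, 37.78, 40.71, 40.72],
    "lon":        [-122.42, -122.41, -74.00, -74.01],
    "os":         ["ios", "ios", "android", "android"],
    "latency_ms": [120, 95, 310, 280],
})

# Marketing's preparation: fine-grained, per-customer location trajectories ordered
# in time, ready for a model that predicts where each customer is headed.
trajectories = (
    logs.sort_values(["user_id", "ts"])
        .groupby("user_id")
        .apply(lambda g: list(zip(g["lat"], g["lon"])))
        .rename("trajectory")
        .reset_index()
)

# DevOps' preparation: coarse-grained performance aggregates segmented by
# operating system, ready for dashboards on app responsiveness.
performance = (
    logs.groupby("os")
        .agg(events=("user_id", "count"), avg_latency_ms=("latency_ms", "mean"))
        .reset_index()
)

print(trajectories)
print(performance)

Neither output table is the "single source of truth"; each is a fit-for-purpose view of the same ingested log.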

Modern data preparation or “data wrangling” platforms such as Trifacta and Google Cloud Dataprep are designed very differently from the transformation interfaces of legacy ETL tools. Data wrangling targets a new class of use cases in which the users benefit from deeply understanding the data content and its usage. These users need to get their eyes on the data, explore it to assess its content and quality, and play with it fluidly in a visual environment to prepare it for use. As a result, the leading modern data wrangling solutions use visualization and AI-driven suggestions that make traditionally technical tasks accessible to data enthusiasts outside of IT. At the same time, they also empower users to operationalize their data preparation in a governed and scalable manner, so it can run on schedule over datasets varying from spreadsheets to big data. As part of that package, data wrangling solutions also provide the transformation facilities of traditional ETL transformation: mapping schemas between relational databases, computing formulas, loading warehouses, and so on. Those activities still exist in the cloud, albeit in new and more flexible architectures.


The New Order

The ETL philosophies and tools of yesteryear are no longer a fit for modern data management. Instead of ee-tee-elling data from relational databases to relational warehouses, a new order of operations has emerged for wrangling data:

  1. Ingestion. The data lake is loaded (“hydrated”) by processes that connect data sources to lake storage and ingest data. In some cases, the data has to be extracted from another store, such as a traditional operational database. In other cases, the data lands directly in the lake via a process such as an application log writing straight to lake storage. This takes the place of traditional “E” and “L,” with a very simple L and a wide variety of E’s suited to a wide variety of data sources.
  2. Exploration. Upon arrival, data needs to be brought to the attention of users who care, and they need to be able to explore the data. They should be able to easily “unbox” new datasets to assess their content and quality, getting eyes on data in a meaningful way without high-friction activities such as writing code or pre-preparing the data simply to visualize it.
  3. Preparation. Once users are able to see their data, they almost inevitably need to manipulate that data to prepare it for use. This can require domain expertise, because each use case may require its data to be prepared differently. As discussed above, it’s important to enable anyone who works with data to be able to achieve this, whether they are analysts, data scientists, engineers, IT professionals, or business users who are data-savvy. As data is prepared, the changes to the data often necessitate further exploration of content and quality, so this step loops back to the previous one frequently.
  4. Operationalization. If a data preparation process turns out to be useful, it is typically useful on an ongoing basis. Most modern data sources are not one-shot files; they are feeds and streams of data from ongoing processes being monitored. Hence, modern platforms for wrangling data do not stop with one-shot preparation. They allow users to operationalize data wrangling pipelines on an ongoing basis and manage their success over time. These operationalized processes also generate metadata and lineage information so that organizational responsibilities around data governance can be met, enabling lifecycle management, auditing and compliance, and other ongoing data management operations.
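As an illustration of these four steps, here is a minimal Python sketch; the file paths, profiling checks, and cleanup rules are hypothetical stand-ins for whatever lake storage, wrangling tool, and scheduler an organization actually uses.

import pandas as pd

RAW_PATH = "lake/raw/app_logs.csv"            # hypothetical lake landing zone
PREPARED_PATH = "lake/prepared/app_logs.csv"  # hypothetical prepared-data zone

def ingest(source_file: str) -> None:
    # 1. Ingestion: land the source data in the lake with minimal transformation.
    pd.read_csv(source_file).to_csv(RAW_PATH, index=False)

def explore() -> pd.DataFrame:
    # 2. Exploration: get eyes on the data -- types, missing values, basic stats.
    df = pd.read_csv(RAW_PATH)
    print(df.dtypes)
    print(df.isna().mean())           # share of missing values per column
    print(df.describe(include="all"))
    return df

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # 3. Preparation: use-case-specific cleanup; these rules are illustrative only.
    out = df.dropna(subset=["user_id"]).copy()
    out["ts"] = pd.to_datetime(out["ts"], errors="coerce")
    return out

def run_pipeline(source_file: str) -> None:
    # 4. Operationalization: one callable unit that a scheduler or orchestrator
    # can run on every new batch, with room to log metadata and lineage.
    ingest(source_file)
    prepare(explore()).to_csv(PREPARED_PATH, index=False)

if __name__ == "__main__":
    run_pipeline("app_logs_2020-06-01.csv")   # hypothetical daily drop

In practice, the exploration and preparation steps would be interactive and iterative inside a wrangling tool; the point of the sketch is only the order of operations and the fact that the whole flow ends up as a repeatable, schedulable unit.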

What’s Ahead

To return to our original question, the basic steps of ETL may not be EOL, but the legacy approaches to the problem no longer fit—they do not provide the agility, scale, and simplicity required in our new data-rich era. Meanwhile, the pace of change in this space isn’t slowing down anytime soon, and the need for good data will only increase. Organizations that adopt the shift from ETL to data wrangling—especially in the cloud—will have built a solid foundation for their data and analytics success over the coming decade. 
