Diving Into Data Lakes With Self-Service Data Prep

The concept of data lakes is a great one, but if not done correctly, this treasure trove of information can quickly turn into a black abyss for data analysts and scientists, let alone business users.

Data lakes are designed to store vast amounts of raw, disparate data from a variety of sources, including structured, semi-structured, and unstructured data (e.g., PDFs, Excel files, JSON, and images), all in one place. This seems like it would be an advantage, right? But when it comes to data analysis, folks typically need only a finite set of specific data points that are purpose-fit for their needs. Finding the data that you need within a data lake can be like fishing: The fish won’t come automatically; you have to work for them. Complicating matters, analysts typically have to rely on IT to retrieve data from a data lake, which can result in delays in decision making.

To sidestep these challenges, many business users and data analysts use only the data that is accessible from their desktops, such as Excel files and local servers, for analytics. They miss out on valuable information housed within data lakes because they can’t easily find or access it. As a result, they are gleaning insights from incomplete information.

IT doesn’t escape unscathed here either. IT professionals are faced with the formidable task of ensuring all data in the lake remains in compliance with governance and security principles. Maintaining data quality and governance when new data is constantly flowing into and being removed from data lakes by numerous users is a tough task. Without security and governance in place, data lakes can quickly turn into murky swamps.

Keeping Users Afloat

The good news is that self-service data preparation (prep) tools are rescuing business users and IT teams from drowning in their data lakes. The technology empowers business users to easily and rapidly search for, find, and access “the right data” for their analysis. They can then combine and blend this data with data from other sources (e.g., their desktop) to get a holistic view. Finally, data prep allows them to quickly cleanse and manipulate datasets for analysis, so they can prep less and analyze more. And, because data prep is a “self-service” technology, this can all be done without having to wait for IT to run a specialized report or provide permissions to access certain datasets.
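As a rough illustration of the search, blend, and cleanse steps described above, here is a minimal Python sketch. It is not how any particular data prep product works: the in-memory catalog, the dataset names, the tags, and the helper functions (search_catalog, cleanse, blend) are all hypothetical, and the sketch cleanses before blending simply to keep the join small.

```python
# Hypothetical in-memory "lake" catalog; all names and records are illustrative.
LAKE_CATALOG = {
    "crm_accounts": {
        "tags": {"sales", "customers"},
        "rows": [
            {"account_id": "A1", "region": " east "},
            {"account_id": "A2", "region": "WEST"},
            {"account_id": "A3", "region": None},
        ],
    },
    "web_logs": {"tags": {"clickstream"}, "rows": []},
}


def search_catalog(catalog, tag):
    """Return the sorted names of datasets carrying the given tag."""
    return sorted(name for name, entry in catalog.items() if tag in entry["tags"])


def cleanse(rows, field):
    """Drop rows missing `field`; trim and lowercase the remaining values."""
    return [
        {**row, field: row[field].strip().lower()}
        for row in rows
        if row.get(field)
    ]


def blend(lake_rows, local_rows, key):
    """Inner-join lake rows with local (desktop) rows on a shared key."""
    local_by_key = {row[key]: row for row in local_rows}
    return [
        {**row, **local_by_key[row[key]]}
        for row in lake_rows
        if row[key] in local_by_key
    ]


# Rows an analyst might keep in a local spreadsheet (made up for the example).
desktop_rows = [
    {"account_id": "A1", "forecast": 1200},
    {"account_id": "A3", "forecast": 900},
]

found = search_catalog(LAKE_CATALOG, "customers")         # search for the right data
clean = cleanse(LAKE_CATALOG[found[0]]["rows"], "region")  # cleanse it
blended = blend(clean, desktop_rows, "account_id")         # blend with desktop data
```

Account A3 drops out during cleansing because its region is missing, so only A1 survives the join with the desktop forecast. A real tool would typically let the user choose how to handle such gaps rather than silently discarding rows.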

While self-service data prep equips business users and data analysts with the speed and agility that they require to access the right data, it also satisfies IT’s need for security and compliance, providing governance capabilities such as data masking, data retention, data lineage, and role-based permissions. With this technology, the needs of business users and those of IT are no longer mutually exclusive.
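To make two of these governance capabilities concrete, the following sketch combines data masking with role-based permissions. The role names, the column policy, and the mask and view_for_role helpers are assumptions made up for illustration; they do not represent any vendor’s API.

```python
import hashlib

# Hypothetical policy: which roles may see which columns unmasked.
UNMASKED_COLUMNS = {
    "analyst": {"region", "order_total"},
    "steward": {"region", "order_total", "email"},
}


def mask(value):
    """Replace a value with a short, stable token derived from SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return "masked:" + digest[:8]


def view_for_role(rows, role):
    """Return copies of rows, masking every column the role may not see."""
    allowed = UNMASKED_COLUMNS.get(role, set())  # unknown roles see nothing in the clear
    return [
        {col: (val if col in allowed else mask(val)) for col, val in row.items()}
        for row in rows
    ]


orders = [{"email": "pat@example.com", "region": "east", "order_total": 42.5}]
analyst_view = view_for_role(orders, "analyst")  # email is masked
steward_view = view_for_role(orders, "steward")  # everything is visible
```

Because the token is derived from the value, the same email always maps to the same token, which keeps masked columns usable for joins and counts. An unsalted, truncated hash is not strong anonymization, though, so a production system would likely use keyed hashing or tokenization instead.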

The Socialization of Data

A big advantage of self-service data prep is that it enables users to not only take from the data lake but contribute back to it.

For example, let’s say a CMO pulls datasets A, B, and C from a data lake and uses the marketing data in them to create curated dataset D. With data prep, other users who are seeking marketing information in the data lake can quickly search by persona to find dataset D. It can be likened to the expedition of Lewis and Clark: They forged a new path but left bread crumbs behind for others to follow. This concept of reusable models, or recipes, is all about creating curated datasets, enriching them with data from outside the lake, and then feeding those datasets back into the lake.
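The dataset D example can be sketched as a small catalog that records each dataset’s persona tag and the sources it was built from. The DatasetEntry and LakeCatalog classes, the "raw" and "marketing" persona labels, and the method names are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    persona: str  # e.g. "marketing"; the tagging scheme is an assumption
    sources: list = field(default_factory=list)  # lineage: datasets this one was built from


class LakeCatalog:
    """Toy catalog standing in for a data lake's metadata layer."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry):
        """Register a dataset so other users can discover it."""
        self._entries[entry.name] = entry

    def find_by_persona(self, persona):
        """Return the sorted names of datasets tagged with the persona."""
        return sorted(e.name for e in self._entries.values() if e.persona == persona)

    def lineage(self, name):
        """Return the names of the datasets a given dataset was derived from."""
        return list(self._entries[name].sources)


catalog = LakeCatalog()
for raw in ("A", "B", "C"):
    catalog.publish(DatasetEntry(name=raw, persona="raw"))

# The CMO curates D from A, B, and C and contributes it back to the lake.
catalog.publish(DatasetEntry(name="D", persona="marketing", sources=["A", "B", "C"]))
```

A colleague looking for marketing data could then call catalog.find_by_persona("marketing") to discover D and catalog.lineage("D") to see which datasets it came from.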

Mastering the Lifecycle

To achieve maximum ROI from information housed in data lakes, business users, data analysts, and data scientists must be able to search for, find, and access the right data for analysis quickly and easily and then contribute curated datasets back into the data lake to help their peers. But this lifecycle isn’t happening within most organizations, causing the value of data lakes to be lost at sea.

As more companies turn to self-service data prep to complement analytics solutions, we expect to see the tide turn, and when it does, companies will be able to successfully tap data lakes to gain actionable insights that enhance decision making, improve operational processes, and deliver business value.

