3 Data Lake Power Tips: Making Hadoop Work in the Enterprise

By Paul S. Barth

Sep 12, 2016

Data lakes are quickly transitioning from interesting idea to priority project. A recent study, “Data Lake Adoption and Maturity,” from Unisphere Research showed that nearly half of respondents have an approved budget or have requested budget to launch a data lake project.

What’s driving this rapid rush to the lake? Simply put, data lakes are uniquely capable of delivering data-as-a-service, even in very large, security-conscious companies, and of giving business users radically faster, easier and broader access to enterprise data.

In other words, data lakes are the IT innovation that will separate today’s companies into market leaders or losers, into competitive killers or cows, into brilliant analytic and data-driven innovators or business-as-usual B (or C) players.

The transformational, next-generation character of data lakes is described by Mary Meeker in her annual Internet Trends report from June 2016. Phase one was about constrained data and how best to leverage the network and software applications to deliver structured data for targeted use cases. In phase two, the explosion of big data, cheap storage, and decentralized systems put the focus on how to use infrastructure to put all data everywhere. Chaos ensued. Phase three – where we are now – puts data at the center of everything and consolidates departmental applications and enterprise-wide analytics with data security and governance. This consolidation brings order to the chaos and gives the entire organization fast, ready access to all the data. From my perspective, data lakes are what make this third phase a reality.

Thinking about a data lake for your company? Here are three tips that we believe are critical to a successful data lake implementation, based on our experience working with customers that have been through this process.

Tip 1: The data lake must be architected for the enterprise

The data lake needs to be architected for the enterprise, which means it needs to offer a critical set of data management capabilities that one would expect from any enterprise technology platform.

An enterprise data lake is not just a Hadoop platform with lots of data. That’s really just a sandbox, and sandboxes – by their very nature – will not mature into enterprise data lakes. Enterprises need the lake to have robust, embedded data management capabilities – that include metadata, security, administration and integration – to automate processes and protect the data.

The data lake requires end-to-end data management, and that starts with data sourcing, including sourcing from legacy systems, mainframes or XML files, whose data may need to be converted or flattened. Many companies also want field-level encryption and obfuscation, so that sensitive values are protected while the rest of the record remains available for analysis.
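As an illustration of field-level obfuscation, here is a minimal Python sketch. It is not taken from any particular product: the field names, the `SECRET_KEY` value and the helper functions are all hypothetical, and it shows keyed hashing and masking rather than reversible encryption. The idea is that a sensitive field can be replaced by a stable token, so it can still be joined or counted, while the remaining fields stay readable.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would fetch this from a key manager.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a keyed hash (HMAC-SHA256).

    The same input always produces the same token, so the field can still
    be used for joins and counts without exposing the original value.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_last4(value: str) -> str:
    """Hide everything except the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def protect_record(record: dict, hashed: set, masked: set) -> dict:
    """Return a copy of the record with the named fields obfuscated."""
    out = {}
    for name, value in record.items():
        if name in hashed:
            out[name] = pseudonymize(value)
        elif name in masked:
            out[name] = mask_last4(value)
        else:
            out[name] = value
    return out


# Hypothetical record layout, for illustration only.
record = {
    "customer_id": "C-1001",
    "ssn": "123-45-6789",
    "account": "9876543210",
    "balance": "250.00",
}
safe = protect_record(record, hashed={"ssn"}, masked={"account"})
```

In this sketch the `balance` and `customer_id` fields pass through untouched, which is the point of working at the field level rather than locking down the whole file.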

The data lake also needs to allow self-service for a wide variety of users to find and understand data in the lake. For this reason, a non-technical, user-friendly GUI is important. These same users need enough information about the quality and status of data in the lake to make intelligent choices about how to use it, and they need to be able to collaborate with one another around the data by, for example, annotating and defining data through crowd-sourced business metadata.

At the same time, it’s important that all these activities happen within the data lake and on the cluster to ensure that each user’s activity around data is fully documented and protected through the lake’s metadata, governance and security measures.

These capabilities are essential for the data lake architected for the enterprise and extend beyond what is available today through Hadoop and its associated open source projects.

Tip 2: The data lake needs to solve Hadoop’s dirty little secret: data quality and formatting when on-boarding enterprise data

Hadoop has a dirty little secret. On its own, it does not provide capabilities to address the types of data quality and data formatting problems that nearly always arise when on-boarding enterprise data into the lake.

And this is a real problem. Dirty data can cause havoc in Hadoop, and end users may be unaware of data quality issues. Many mainframe, legacy and other data sources have quality issues that must be identified and fixed before the data is loaded and made available to users via HCatalog. If those problems aren’t fixed on ingest, the data in the data lake will be incorrect in ways that are hard to detect and harder to fix.

This includes quality issues such as:

Embedded delimiters, where a value contains the delimiter that separates fields, such as commas;
Corrupted records where an operational system may have inadvertently put control characters in values;
Data type mismatches, such as alphabetic characters in a numeric field;
Non-standard representations of numbers and dates;
Headers and trailers that contain control and quality information that need to be processed differently than traditional records;
Multiple record types in a single file;
Mainframe file formats that use different character sets on legacy systems, which Hadoop does not know how to recognize or process.
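Several of the issues listed above can be caught with fairly simple checks at ingest time. The following Python sketch is one possible approach, not a description of how any specific data lake product works. The `SCHEMA` layout, the accepted date formats and the sample rows are invented for illustration, and `cp037` is just one of the EBCDIC code pages that Python ships with; a real mainframe extract may use a different one.

```python
import csv
import io
import re
from datetime import datetime

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# Hypothetical record layout: (field name, expected type).
SCHEMA = [("id", "int"), ("name", "str"), ("amount", "float"), ("opened", "date")]
# Hypothetical set of date representations seen in the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def parse_date(text):
    """Try each known format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def validate_row(row):
    """Return (record, errors) for one parsed row."""
    if len(row) != len(SCHEMA):
        return None, [f"expected {len(SCHEMA)} fields, got {len(row)}"]
    record, errors = {}, []
    for (name, kind), raw in zip(SCHEMA, row):
        if CONTROL_CHARS.search(raw):
            errors.append(f"{name}: control character in value")
            continue
        value = raw.strip()
        try:
            if kind == "int":
                record[name] = int(value)
            elif kind == "float":
                # Accept thousands separators such as "1,250.50".
                record[name] = float(value.replace(",", ""))
            elif kind == "date":
                iso = parse_date(value)
                if iso is None:
                    raise ValueError("unrecognized date")
                record[name] = iso
            else:
                record[name] = value
        except ValueError:
            errors.append(f"{name}: cannot read {value!r} as {kind}")
    return (None if errors else record), errors


def ingest(text):
    """Split input into accepted records and rejected (row_number, errors)."""
    good, bad = [], []
    # csv.reader honours quoting, so a quoted "Smith, John" stays one field.
    for row_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        record, errors = validate_row(row)
        if errors:
            bad.append((row_no, errors))
        else:
            good.append(record)
    return good, bad


# Stand-in for a mainframe extract: bytes in an EBCDIC code page.
mainframe_row = "4,Patel,99,20160912\n".encode("cp037")

sample = (
    '1,"Smith, John","1,250.50",2016-09-12\n'   # embedded delimiters, quoted
    "2,Jones,abc,09/12/2016\n"                  # letters in a numeric field
    "3,Lee\x07,10.0,20160912\n"                 # control character in a value
) + mainframe_row.decode("cp037")               # converted before parsing

good, bad = ingest(sample)
```

Here rows 1 and 4 are accepted with dates normalized to a single representation, while rows 2 and 3 are set aside with a reason attached instead of being loaded silently.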

Since Hadoop lacks the built-in capabilities to address the myriad data cleansing and conversion issues that characterize enterprise data sources, the data lake must bridge this gap when on-boarding enterprise data into the lake to ensure it is profiled and validated.
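Profiling, in its simplest form, means summarizing each field so that users can judge whether a data set is fit for use. The sketch below assumes records are already parsed into dictionaries; the `profile` function and the sample fields are hypothetical and only meant to show the kind of statistics involved.

```python
def profile(records, fields):
    """Summarize each field: row count, missing values, distinct values, min, max."""
    report = {}
    for name in fields:
        values = [r.get(name) for r in records]
        # Treat None and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        report[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


# Hypothetical sample data, for illustration only.
records = [
    {"state": "MA", "amount": 120.0},
    {"state": "MA", "amount": None},
    {"state": "", "amount": 75.5},
    {"state": "NY", "amount": 300.0},
]
report = profile(records, ["state", "amount"])
```

A summary like this, stored alongside the data as metadata, is what lets a business user see at a glance that a quarter of the `state` values are missing before building a report on them.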

Tip 3: The data lake needs to democratize data

Finally, in order to deliver on the core promise of the data lake concept itself – namely to “democratize data” in the enterprise – the data lake needs to include capabilities that make it easy and fast for typical business users to get data from the data lake when they need it, without help from IT.

“Democratize data” means the data is business-ready and usable by rank-and-file business line people, without advanced data science or Hadoop skills. The data has to be within grasp of users in a secure, managed way. But that doesn’t happen automatically just by standing up some large data set in Hadoop.

To make the data accessible to the greatest number of people, it’s essential that the data lake platform itself has a non-technical graphical interface. It also needs to enable business users to crowd-source metadata, by capturing and sharing business definitions and adding profiling information to the data – thereby enabling more collaboration and delivering more value to users.

With so many business users accessing data in the lake, data governance and security considerations take on heightened importance. Not everyone can or should have access to everything.

Delivering on the data lake’s promise to “democratize data” therefore also requires stringent enforcement of existing enterprise data security and governance measures at the system, server, network, Hive and Hadoop levels. This includes compliance with, and transparency and auditability of, existing authentication, authorization, data access controls, encryption, obfuscation and related constraints. Users should also be able to add data access controls in the data lake, using groups, privileges and roles to automate the process.
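To make the idea of groups, privileges and roles concrete, here is a small Python sketch of a role-based access check with an audit trail. It is a simplified model under assumed names (`Role`, `Group`, `AccessPolicy`, the data set names and the users are all hypothetical) and does not represent the access control API of Hive, Hadoop or any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    # Privileges are (dataset, action) pairs, e.g. ("sales.orders", "read").
    privileges: set = field(default_factory=set)


@dataclass
class Group:
    name: str
    roles: list = field(default_factory=list)
    members: set = field(default_factory=set)


class AccessPolicy:
    """Grants access through group membership and records every decision."""

    def __init__(self, groups):
        self.groups = groups
        self.audit_log = []

    def is_allowed(self, user, dataset, action):
        allowed = any(
            (dataset, action) in role.privileges
            for group in self.groups
            if user in group.members
            for role in group.roles
        )
        # Every request is logged, whether granted or denied.
        self.audit_log.append((user, dataset, action, allowed))
        return allowed


# Hypothetical roles, groups and users, for illustration only.
analyst = Role("analyst", {("sales.orders", "read")})
steward = Role("steward", {("sales.orders", "read"), ("sales.orders", "annotate")})

policy = AccessPolicy([
    Group("marketing", roles=[analyst], members={"ana"}),
    Group("data_office", roles=[steward], members={"raj"}),
])

ana_can_read = policy.is_allowed("ana", "sales.orders", "read")
ana_can_annotate = policy.is_allowed("ana", "sales.orders", "annotate")
raj_can_annotate = policy.is_allowed("raj", "sales.orders", "annotate")
```

Assigning privileges to roles and roles to groups means that adding a new user is a one-line membership change, which is what makes the process automatable.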

Finally, in a data lake environment, administration and maintenance of these security measures need to be intuitive and simple, so that even administrators without deep technical skills can independently establish and manage security and governance measures in the lake.

These capabilities make possible a data lake that is secure while providing users with self-service access.

Data Lake Success

What, then, does the successful data lake implementation look like?

Because the lake is built for the enterprise, any authorized end user can shop for data themselves: they select the data they need, optionally customize the selection with a SQL query, and can export the result directly to a target system.

These same users have the ability to take data inputs, filter them, transform data, create derived data sets, cleanse data sets, join data sets, aggregate data sets together and create an output data set that is ready for analysis. This type of end-to-end capability should require no programming. With this combination of functions, an end-user can go into the lake and in minutes create a useful analytical data set for any of the standard reporting, visualization or analytic applications.
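End users should not have to write this themselves, but it can help to see what such a preparation step amounts to underneath. The sketch below uses Python’s built-in `sqlite3` module as a stand-in for the lake’s SQL engine; the tables, columns and values are invented for illustration. It filters out cancelled orders, joins orders to customers, derives an average, and aggregates by region into a new, analysis-ready data set.

```python
import sqlite3

# In-memory database standing in for data sets already in the lake.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         amount REAL, status TEXT);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES
        (10, 1, 100.0, 'complete'),
        (11, 1,  50.0, 'cancelled'),
        (12, 2, 200.0, 'complete'),
        (13, 3,  25.0, 'complete');
""")

# Filter, join, derive and aggregate into an output data set.
conn.execute("""
    CREATE TABLE region_sales AS
    SELECT c.region                 AS region,
           COUNT(*)                 AS order_count,
           SUM(o.amount)            AS revenue,
           SUM(o.amount) / COUNT(*) AS avg_order
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.status = 'complete'
    GROUP BY c.region
""")

rows = conn.execute(
    "SELECT region, order_count, revenue, avg_order "
    "FROM region_sales ORDER BY region"
).fetchall()
```

The resulting `region_sales` table is the kind of output a reporting or visualization tool could consume directly.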

Finally, all of this has to be done with corporate standards and compliance practices baked into the data lake.

Paul S. Barth, Ph.D., is CEO of Podium Data (www.podiumdata.com).