Data Lake Power Tips: Making Hadoop Work in the Enterprise

By Paul S. Barth

Sep 27, 2016

Data lakes are quickly transitioning from interesting idea to priority project. A recent study, “Data Lake Adoption and Maturity,” from Unisphere Research showed that nearly half of respondents have an approved budget or have requested budget to launch a data lake project.

What’s driving this rapid rush to the lake? Simply put, data lakes are uniquely capable of delivering data as a service (even in very large, security-conscious companies) and giving business users radically faster, easier, and expanded access to enterprise data.

In other words, data lakes are the IT innovation that will transform today’s companies into market leaders or losers, into competitive killers or cows, into brilliant analytic and data-driven innovators or business-as-usual B (or C) players.

The transformational next-generation character of data lakes is described by Mary Meeker in her annual Internet Trends report from June 2016. Phase one was about constrained data and how best to leverage the network and software applications to deliver structured data for targeted use cases. In phase two, the explosion of big data, cheap storage, and decentralized systems put the focus on how to use infrastructure to put all data everywhere. Chaos ensued. Phase three, where we are now, puts data at the center of everything and consolidates departmental applications and enterprise-wide analytics with data security and governance. This consolidation brings order to the chaos and gives the entire organization fast, ready access to all the data. From my perspective, data lakes are what makes this third phase a reality.

Thinking about a data lake for your company? Here are three tips that, based on our experience working with customers who have been through this process, we believe are critical to a successful data lake implementation.

Tip 1: The data lake must be architected for the enterprise

The data lake needs to be architected for the enterprise, which means it needs to offer a critical set of data management capabilities that one would expect from any enterprise technology platform.

An enterprise data lake is not just a Hadoop platform with lots of data. That’s really just a sandbox, and sandboxes—by their very nature—will not mature into enterprise data lakes. Enterprises need the lake to have robust, embedded data management capabilities—that include metadata, security, administration, and integration—to automate processes and protect the data.

The data lake requires end-to-end data management, and that starts with data sourcing, including sourcing from legacy, mainframe, or XML files that may need to be converted or flattened. Many companies also want field-level encryption and obfuscation so that sensitive data is protected while still being available for analysis.
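
As a rough illustration of field-level obfuscation, the Python sketch below replaces sensitive values with keyed hashes during ingest. The field names, the key, and the function name are hypothetical, and keyed hashing is only one possible approach: it hides the raw value while keeping equal inputs equal, so analysts can still join and count on the field, but it is not reversible encryption.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would fetch this from a key manager.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical names of the fields considered sensitive.
SENSITIVE_FIELDS = {"ssn", "email"}


def obfuscate_record(record, sensitive=SENSITIVE_FIELDS, key=SECRET_KEY):
    """Return a copy of the record with sensitive fields replaced by keyed hashes."""
    masked = {}
    for field, value in record.items():
        if field in sensitive and value is not None:
            # HMAC-SHA256 gives the same token for the same input and key.
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            masked[field] = value
    return masked
```

Calling obfuscate_record({"ssn": "123-45-6789", "state": "MA"}) leaves the state untouched and swaps the SSN for a 64-character hexadecimal token.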

The data lake also needs to allow self-service for a wide variety of users to find and understand data in the lake. For this reason, a non-technical, user-friendly GUI is important. These same users need enough information about the quality and status of data in the lake to make intelligent choices about how to use it, and they need to be able to collaborate with one another around the data by, for example, annotating and defining data through crowdsourced business metadata.

At the same time, it’s important that all these activities happen within the data lake and on the cluster to ensure that each user’s activity around data is fully documented and protected through the lake’s metadata, governance, and security measures. These capabilities are essential for the data lake architected for the enterprise and extend beyond what is available today through Hadoop and its associated open source projects.

Tip 2: The data lake needs to solve Hadoop’s dirty little secret: data quality and formatting when on-boarding enterprise data

Hadoop has a dirty little secret. On its own, it does not provide capabilities to address the types of data quality and data formatting problems that nearly always arise when on-boarding enterprise data into the lake.

And this is a real problem. Dirty data can cause havoc in Hadoop, and end users may be unaware of data quality issues. Many mainframe, legacy, and other data sources have quality issues that must be identified and fixed before the data is loaded and made available via HCatalog for access by the user. If those problems aren’t fixed on ingest, the data in the data lake will be incorrect in ways that are hard to detect and harder to fix.

This includes quality issues such as these:

Embedded delimiters, where a value contains the delimiter that separates fields, such as commas
Corrupted records where an operational system may have inadvertently put control characters in values
Data type mismatches, such as alphabetic characters in a numeric field
Non-standard representations of numbers and dates
Headers and trailers that contain control and quality information that need to be processed differently than traditional records
Multiple record types in a single file
Mainframe file formats that use different character sets on legacy systems, which Hadoop does not know how to recognize or process
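
A few of the issues above (embedded delimiters, control characters, and data type mismatches) can be caught with simple checks at ingest time. The Python sketch below is a minimal illustration, not a complete validation framework; the three-field schema and the function name are hypothetical.

```python
import csv

# Hypothetical schema: (field name, converter) pairs for a three-column feed.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def check_line(line, schema=SCHEMA, delimiter=","):
    """Return a list of quality problems found in one delimited record."""
    line = line.rstrip("\r\n")

    # Control characters other than tab suggest a corrupted record.
    if any(ord(ch) < 32 and ch != "\t" for ch in line):
        return ["control character in record"]

    # The csv module honors quoted values, so '"Acme, Inc."' stays one field;
    # an unquoted embedded delimiter shows up as an extra field.
    fields = next(csv.reader([line], delimiter=delimiter), [])
    if len(fields) != len(schema):
        return [f"expected {len(schema)} fields, found {len(fields)}"]

    # Data type mismatches, such as letters in a numeric field.
    problems = []
    for (name, convert), value in zip(schema, fields):
        try:
            convert(value.strip())
        except ValueError:
            problems.append(f"{name}: cannot parse {value!r} as {convert.__name__}")
    return problems
```

A clean record such as '1,Acme,10.5' yields an empty list, while '2,Acme, Inc.,10.5' is flagged for having four fields and '3,Acme,ten' for a non-numeric amount. Records that fail could be routed to a quarantine area rather than loaded.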

Since Hadoop lacks the built-in capabilities to address the myriad data cleansing and conversion issues that characterize enterprise data sources, the data lake must bridge this gap when on-boarding enterprise data into the lake to ensure it is profiled and validated.
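
Character set conversion is one of the simpler gaps to illustrate. The Python sketch below decodes EBCDIC text using the standard cp037 codec. It assumes the source system uses code page 037, which may not match a given mainframe, and it handles only text fields; packed-decimal and binary fields need separate treatment. The function name is hypothetical.

```python
def decode_mainframe_text(raw, codec="cp037"):
    """Decode a fixed-width EBCDIC text field into a Python string.

    The default codec is an assumption; other EBCDIC code pages
    (for example cp500) are common on non-US systems.
    """
    # Fixed-width fields are typically padded with trailing spaces.
    return raw.decode(codec).rstrip()
```

The same bytes read as ASCII or UTF-8 would come out as unreadable characters, which is how an unconverted mainframe extract typically looks once it lands in the lake.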

Tip 3: The data lake needs to democratize data

Finally, in order to deliver on the core promise of the data lake concept itself—namely to “democratize data” in the enterprise—the data lake needs to include capabilities that make it easy and fast for typical business users to get data from the data lake when they need it, without help from IT.

“Democratize data” means the data is business-ready and usable by rank-and-file business users, without advanced data science or Hadoop skills. The data has to be within users’ grasp in a secure, managed way. But that doesn’t happen automatically just by standing up a large dataset in Hadoop.

To make the data accessible to the greatest number of people, it’s essential that the data lake platform itself has a non-technical graphical interface. It also needs to enable business users to crowdsource metadata, by capturing and sharing business definitions and adding profiling information to the data—thereby enabling more collaboration and delivering more value to users.

With so many business users accessing data in the lake, data governance and security considerations take on heightened importance. Not everyone can or should have access to everything.

Delivering on the data lake’s promise to “democratize data” therefore also requires stringent enforcement of existing enterprise data security and governance measures at the system, server, network, Hive, and Hadoop levels. This includes compliance, transparency, and auditability with existing authentication, authorization, data access controls, encryption, obfuscation, and related constraints. Users should also be able to add data access controls in the data lake, using groups, privileges, and roles to automate the process.
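
To make the idea of groups, privileges, and roles concrete, here is a minimal Python sketch of a role-based access check. All group names, role names, and datasets are hypothetical, and a real data lake would delegate this decision to the platform’s own security layer rather than to application code.

```python
# Hypothetical roles mapped to the (dataset, action) privileges they grant.
ROLE_PRIVILEGES = {
    "analyst": {("sales", "read")},
    "steward": {("sales", "read"), ("sales", "write")},
}

# Hypothetical groups mapped to the roles their members hold.
GROUP_ROLES = {
    "marketing": {"analyst"},
    "data-office": {"analyst", "steward"},
}


def is_allowed(user_groups, dataset, action):
    """Return True if any of the user's groups holds a role granting the privilege."""
    for group in user_groups:
        for role in GROUP_ROLES.get(group, ()):
            if (dataset, action) in ROLE_PRIVILEGES.get(role, ()):
                return True
    # Default deny: unknown groups, roles, or datasets grant nothing.
    return False
```

Because access is granted to roles and roles are assigned to groups, onboarding a new user means adding that user to a group rather than editing permissions dataset by dataset.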

Finally, in a data lake environment, administration and maintenance of these security measures need to be intuitive and simple, so that even administrators without deep technical skills can independently establish and manage security and governance measures in the lake.

These capabilities make possible a data lake that is secure while providing users with self-service access.

Data Lake Success

What, then, does the successful data lake implementation look like?

Because the lake is built for the enterprise, any authorized end user can shop for data themselves, selecting data that can be customized with a SQL query or exported directly to a target system.

These same users have the ability to take data inputs, filter them, transform data, create derived datasets, cleanse datasets, join datasets, aggregate datasets together, and create an output dataset that is ready for analysis. This type of end-to-end capability should require no programming. With this combination of functions, an end user can go into the lake and in minutes create a useful analytical dataset for any of the standard reporting, visualization, or analytic applications.
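
The end user should never have to write anything like this, but it can help to see what a filter, join, and aggregate amount to behind a graphical interface. The Python sketch below runs the three steps as a single SQL query against an in-memory SQLite database standing in for the lake. The tables, columns, and values are invented purely for illustration.

```python
import sqlite3


def build_regional_totals():
    """Filter, join, and aggregate two toy tables into an analysis-ready result."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical sample data standing in for datasets in the lake.
        conn.executescript("""
            CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                                 amount REAL, status TEXT);
            CREATE TABLE customers (customer_id INTEGER, region TEXT);
            INSERT INTO orders VALUES (1, 10, 100.0, 'shipped');
            INSERT INTO orders VALUES (2, 10, 50.0, 'cancelled');
            INSERT INTO orders VALUES (3, 11, 75.0, 'shipped');
            INSERT INTO customers VALUES (10, 'East');
            INSERT INTO customers VALUES (11, 'West');
        """)
        query = """
            SELECT c.region, SUM(o.amount) AS total  -- aggregate
            FROM orders AS o
            JOIN customers AS c                      -- join
              ON c.customer_id = o.customer_id
            WHERE o.status = 'shipped'               -- filter
            GROUP BY c.region
            ORDER BY c.region
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The result is one row per region with the total shipped amount, the kind of small derived dataset that a reporting or visualization tool can consume directly.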

Finally, all of this has to be done with corporate standards and compliance practices baked into the data lake.