<< back Page 2 of 2

Mapping the Path to the Data Lake

The Journey Is Unique

Earlier, we paralleled the formation of a natural lake in its unique ecosystem with the adoption of the data lake in the enterprise. Revisiting that, if we think of the adoption of the data lake as an evolution of (or a component of) the current state of the enterprise data architecture, then we can overcome the overwhelming “sink or swim” barrier of data lake adoption. Instead, we can break down data lake adoption into a series of maturity milestones within a sequence of transformation-defined stages. Ultimately, the data lake is a destination, a goal, that requires ongoing momentum through the years to reach.

In “The Definitive Guide to the Data Lake,” we outlined a four-stage maturity model for the data lake that follows this evolutionary journey from the first few hesitant paddles (Stage 1) all the way to full-scale rowing (Stage 4 and beyond). These stages support—and encourage—an incremental path to maturity by starting small, managing data lake risks with projects in pilot roles before the data lake earns a spot in the safety zone. As companies gain expertise and a comfort level, they move from a “silos and applications” mindset to a foundational, platform mindset—not the data lake as “a database” or “an application,” but as a true enterprise repository of all data in a flexible and managed environment.

Maturity, then, is the evolution of a big data architecture that balances the adoption of the data lake as a platform while gaining expertise and learning how to mitigate risks. As they move through the stages of maturity, companies learn to recognize Hadoop for its complementary strengths and as a data operating system. A marker of maturity is playing to Hadoop’s strengths and optimizing those to further enable the whole of the modern data architecture.

Today’s early data lake adopters are pioneers who draw confidence from seminal data management principles and common sense. These companies aim to achieve the broad benefits of the data lake, and they see the potential of leveraging its increasing strengths as a data technology within the architecture: looking beyond big data storage and analytics and recognizing the leading role of the data lake in the enterprise data architecture for all operational data needs. After all, this data lake strategy is what drives the evolution of the data lake, not the differentiation of definitions between vendors or individual technologies.

Put the Data Swamp in Perspective

Fear and hesitation are natural human defense mechanisms for good reason: they protect you from the perils of risky, unknown situations. Experienced data management professionals spent decades avoiding silos, reducing data duplication and proliferation, and organizing and governing the context and usage of data in the enterprise. So, it’s no wonder that alarm bells sound when such low data ingestion barriers and affordable scalability are presented to data scientists and analysts who desire unfettered access to all data. While declarations of “data swamps” and “data dumps” are made almost as an autonomic response, it’s that same data management experience (paired with common sense) that enables the creation of proper data governance and processes to balance the risk and the rewards of data discovery and business analysis.

The data lake is at an inflection point between Hadoop being used to solve big data challenges and opportunities by early adopter companies and Hadoop being a part of the overall enterprise data strategy in the modern data architecture. This is where Hadoop meets enterprise data management—when mainstream companies begin adopting it as a data lake. And, this is a good thing: Experienced data management professionals will pioneer the new way of organizing and governing the data lake by extending existing processes to include the data lake.

The data lake has led the way in rediscovering enterprise BI by providing a critical capability that was often a challenge in building enterprise data warehouses: discovering the hidden truth in requirements analysis. Agile BI was good at breaking down this challenge into smaller, faster cycles that led to a “learn as you go” approach, with permission to fail fast and keep going. The data lake goes further. Its low barrier to data entry and its flexible, schema-less storage let business analysts work with all data to discover information context and analytic algorithms, and make it easier to discover relationships and schemas when the data is read. Consider this: Previously, an analyst had to define a table in order to write data into it for analysis. Now, the analyst can simply load data into Hadoop and use tools to discover what the definition should be.
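To make the schema-on-read idea concrete, here is a minimal sketch in Python. It is an illustration only, not a Hadoop tool: the sample records, the field names, and the infer_schema helper are all hypothetical. It loads raw JSON lines with no table defined up front and then reports which fields and value types actually appear in the data.

```python
import json
from collections import defaultdict

# Hypothetical raw records, loaded as-is with no table definition up front.
RAW_LINES = [
    '{"customer_id": 101, "region": "EMEA", "spend": 250.0}',
    '{"customer_id": 102, "region": "APAC"}',
    '{"customer_id": "103", "spend": 99.5, "channel": "web"}',
]


def infer_schema(lines):
    """Return {field: sorted list of type names observed} at read time."""
    observed = defaultdict(set)
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            observed[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in observed.items()}


schema = infer_schema(RAW_LINES)
for field, types in schema.items():
    print(field, types)
```

In this sample, customer_id turns up as both a number and a string, which is exactly the kind of finding that tells an analyst what the governed definition should eventually be.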

The data lake and data discovery are not here to replace the enterprise data warehouse, but to support and enable the EDW where structured context is governed for consistent reporting and analytics required in decision support systems and performance management. The EDW is essential to running the business, achieving established goals, and providing information needed for decisions. The data lake and discovery simply enable insights and discoveries that broaden that context—or, likewise, support the discovery of new contexts and business models to exploit.

Preventing your data lake from becoming a data swamp is a matter of extending data governance and data management principles to include the data lake. New data owners and data stewards will be identified for new internal and external big datasets; new roles for data scientists and analysts will be defined. Governing fast data ingestion, unfettered access (perhaps only for data scientists), and data consumption by systems, applications, and users will require new processes and governance policies. Again, this marks the evolution of data governance, not a revolution.

Swim for Success

Let’s agree that the data lake can be defined as the architectural destination of an organization-unique journey that requires maturity and transformation to reach. As you look to begin your adoption of the data lake, following a few simple critical success factors will equip you for the journey ahead.

Set Stages
As you go about your journey, use a framework such as our four stages of data lake maturity to set recognizable milestones to help see transformations—and transitions—as you continue to evolve the data lake’s role within your enterprise architecture.

Measure Progress
Define data lake success in smaller, achievable measurements. This is critical to keeping the momentum needed to achieve critical mass in the data lake. Defining long-term success in nebulous or overly distant terms (e.g., “in five years”) sets your team up to jump ship early, without the opportunity to celebrate incremental success along the way.

Plan Ahead
The journey to the data lake should be proactive, not reactive. Planning data lake governance and organization is a key step of readiness before embarking on the data lake journey. The alternative is trying to retrofit governance (which may or may not be possible) onto out-of-control, sludgy data lakes already underway. In our research and client interactions, we have found that early adopters who focus on organization and governance upfront are better prepared to accelerate successful data lake adoption. After all, with understanding and clarity, people are more willing and able to move forward. Consider this the equivalent of doing your due diligence and packing well for your journey.

Accept Variation
The jury is still out on the role that the data lake will play in the future of the enterprise architecture, but adoption continues. Many vendors report already supporting customers in the hundreds. While hundreds is a small sample in the scheme of things, it marks the tipping point toward the mainstream. As data lake adoption continues, we are reaching the point at which there is pressure to settle on definitions, and data management standards are being applied by early adopters of Hadoop distributions.

For today’s data lake-curious companies, the decision is ultimately whether you will be an early adopter, a mainstream adopter, or a late adopter. Will you be the wily salmon that swims upstream, will you swim along with the tide, or will you be a pirate in the data lake and do things your own way? While best practices have begun to form and emerge, they are embryonic at best. And, even as they become more established, new data lake practices aren’t absolute rules. Like the pirate code itself, they are more like guidelines anyway.



Subscribe to Big Data Quarterly E-Edition