Mapping the Path to the Data Lake

In ecology, the formation of a lake is a gradual process rife with variation. Some lakes are formed as the result of glacial, tectonic, or volcanic activity, while others are the result of other natural or artificial processes. Most lakes are freshwater, and most have at least one natural outflow to a river or stream; others are constructed artificially for agricultural use or water supply. There is even evidence of extraterrestrial lakes on one of the moons orbiting Saturn. And, while each lake is an important ecosystem in its own right, lakes are extremely varied in terms of origin, occurrence, size, shape, depth, water chemistry, and so forth.

The point here is that just as no two lakes are exactly alike, no lake’s journey follows a one-size-fits-all approach. Each lake, with its own origin story, characteristics, and place in the ecosystem, is unique. It is this uniqueness and differentiation that paves the way for variety and diversity in an ever-dynamic lake ecology. This is a good thing, and a necessary one.

The same can be said for the data lake: There is no single path to the data lake within the data architecture of the organization. Likewise, each data lake is unique, with inputs and decisions from the organization contributing a variety of essential elements in organization, governance, security, and so on. Although the data lake is still an emerging concept, many organizations have already begun their journeys, following the guiding light of established data management and information principles. Some are diving into the data lake; others dredge several data pools and work to deepen the channels to a singular data lake. But, regardless of how they get there, as Hadoop continues to move into mainstream adoption, more and more companies are including the Hadoop platform as a core technology within their modern data architectures and realizing the competitive advantages gained from harvesting the data lake for big data and analytics. Moreover, these early data lake diggers are contributing the implementation experiences and lessons learned that will eventually shape best practices and industry standards. Even so, while brave early adopters have already begun to trench their data lakes, adoption of the data lake strategy itself continues to be hindered by an onslaught of alarmist headlines that reduce the potential of the data lake to a swampy fallacy.

In “The Definitive Guide to the Data Lake,” a special report published jointly by Unisphere Research and Radiant Advisors, and sponsored by Hortonworks, MapR, Teradata, and Voltage Security, we clarified the murkiness of the data lake concept, drawing from fundamental data and information principles. We shared pragmatic insights earned through the experiences of companies within our advisory network that are using good judgment and good principles to move forward with data lake strategies in the absence of any concrete “how to trench a data lake” reference guides. While the “Definitive Guide” draws on those early adopters’ experiences to highlight recognized areas of concern and help navigate them, this article focuses on the root causes of confusion in the industry today that are hindering data lake adoption for companies wary of dredging a data swamp. Our goal is to help companies considering the data lake navigate the tides of uncertainty and find a way to move forward with confidence. In this article, we discuss a few important ideas that data lake adopters should keep top of mind in their journey forward.

You Are Not Alone

First, companies should recognize that they are not alone in the struggle to find a clear and definitive articulation of what the data lake is and how it affects the core enterprise data architecture over the long term. When it comes to data lake definitions, there is definitely more than one fish in the pool.

The data lake is an emerging concept; therefore, its definition is still incubating and awaiting industry consensus. Multiple definitions are something of an inevitability, and confusion follows from the simple truth that many of the definitions floating around the industry today are vague, abstract, or even contradictory as multiple sources (including vendors and marketing) compete to harness the concept. For example, one definition from Strata in 2014 described the data lake as a “centrally managed repository using low cost technologies to land any and all data that might potentially be valuable for analysis ...” Another contributor, also in 2014, defined the data lake as the dream of a place with “data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment.” At Radiant Advisors, and in our “Definitive Guide,” we explore the data lake definition as both the end state of the modern data architecture and a guiding light for data architecture-related decisions to achieve critical mass for the data lake.

Another complexity to keep in mind when evaluating data lake definitions is that they, like the data lake itself, are evolutionary and should be read in a transitory context. Some data lake definitions today capture its role at only a certain maturity stage. The role of Hadoop continues to evolve (driven by business demands), and absolute definitions cannot keep pace with the rate of change. As with many facets of technology, an architectural definition describes a system of technologies and components at one point in time; each definition is a snapshot (or milestone) along the path to maturity, and it changes and progresses over time.

The resulting reality for now, then, is that many companies may get caught in the crossfire of rival or hazy definitions, and this fosters hesitancy that ultimately stunts data lake adoption. However, rather than thinking of the data lake as requiring an end-all, be-all definition, consider that definitions change within the context of maturity. Expect the definition of the data lake to be something that evolves along with the data lake itself. Additionally, think beyond the definition to the information that already exists in well-defined and established data management principles, and use that as the guiding light to the data lake.


