Data Lake Lessons From the Coffee Can

Considering a data lake? Or reconsidering one? You are not alone. The question of how much value a data lake adds to an enterprise data architecture remains evergreen, and the verdict remains divided. Value is not the only point of disagreement; opinions also differ on the definition of a data lake, its purpose, how it is managed, and how it is used.

The challenge is that the definition, purpose, use, management, and value of a data lake all depend on several factors, including whom you ask and what they are trying to accomplish. It is highly contextual. To get past that subjectivity, perhaps an analogy serves better than expert opinion. Your mileage may vary, but there is a household phenomenon that offers some lessons for understanding the data lake: the coffee can.

It is not just any coffee can. It is the coffee can that ends up in the garage or tool space of a home to collect all the stray, old-project hardware someone in the household cannot bear to part with. They may say something like, “We shouldn’t throw these out. We might need them some day.” Or, they may say, “You never know when these might come in handy.” And no matter how valuable these extra pieces and parts may be one day, to this person, they are not worthy of the organization required so that they may one day be easily found again. Oh no—they are simply added to the coffee can where other nuts and bolts from long-ago projects have been brought together to one day serve their not-yet-defined purpose.

Sound familiar to your data world? Have you heard comments such as, “If we can just get all of our data in one place, then we can manage it”? Or, have you heard, “Since storage costs are no longer a concern with our open source storage options, we should collect and store all the data we can now. We can analyze it later”? In fact, some approach the data lake idea with a “store everything” mentality, resulting in data hoarding behavior. It is similar to the coffee can: Don’t throw it out—we might need it someday. But, even the coffee can has rules and limitations. And, when used appropriately, value. And so should your data lake.

First, let’s take a closer look at the rules, limitations, and value of the coffee can.

  1. Rules. The rules of the coffee can are unwritten but understood, perhaps passed down from previous generations of tool-bench-coffee-can owners. The purpose of the coffee can is to collect fastener hardware in case a fastener is needed for a future unexpected project. Screws, nuts, bolts, nails … you name it; if it’s left over and it can be used to fasten something down the road, it goes in the coffee can. But it’s not for tools. Hex keys (Allen wrenches) do not belong, no matter how small they are. Drawer pulls? Nope. Pegboard hooks? Nope.
  2. Limitations. The coffee can is about 7 inches tall and just over 6 inches across. At some point, it will get full. But it can never be filled completely, because you still need room to rummage. The can can’t be exposed to the elements, or many of your fasteners will rust. It needs to be readily accessible, yet out of reach of those who could hurt themselves. (I love the data lake corollary here. But hold tight; we will get to that.) Ideally, the can would have a lid on it to prevent some of these issues, but that is fundamentally against one of the unwritten tool-bench-coffee-can rules. #NoLids
  3. Value. In addition to tool-time pride, the coffee can of collected orphaned hardware delivers value in a couple of ways.
     a. The one-off/ad hoc project. A project that occurs once or requires similar but not identical hardware. These are not large-scale projects and, once complete, are not likely to be repeated. Take, for instance, a trundle drawer pull that loses a screw, leaving the handle hanging loose. It is a perfect occasion to dig through the coffee can for a screw good enough to fix the handle. The fastener specifications don’t have to be precise as long as the drawer handle is ultimately secured. It could be a Phillips head screw, a flat head screw, a carriage bolt, a wood screw, or a machine screw. As long as it meets a few needs (width, thread size, and length), the project is complete, and no other coffee can diving is necessary.
     b. The prep work for a large-scale project. Ultimately, you have big plans for an extensive project, maybe even one that will be ongoing. The decisions you make must be scalable and sustainable, but at the outset you are not exactly sure what you need. You don’t know what type of fastener is best. An example here is the Poke-A-Pumpkin game, the Pinterest project of all Pinterest projects. It’s a fun fall game where players poke through a tissue-covered cup to grab a hidden surprise. The real surprise is how tricky it is to build. We can see that a fastener is needed for each cup, but without instructions (who follows Pinterest instructions, anyway?) the specific width, length, and weight (a surprisingly important factor) are completely unknown. The coffee can gives you the opportunity to dig through its contents and try many types, sizes, and weights to determine what is best before you scale up and out. This testing-ground or sandbox environment allows you to test small and scale large with confidence.

Now, with a better understanding of the tool bench coffee can, let’s consider the data lake.

  1. Rules. Written rules are better in business environments than rules assumed and passed down through tribal knowledge. But the unwritten rules of the coffee can are still better than the no-rules approach typically applied to data lakes. Data lakes cannot be a free-for-all data dumping ground. What goes into the data lake should be predicated on what is expected to come out of it and how the lake is to be used. Sure, you can store data of all types (structured, semi-structured, and unstructured), but do so with purpose.
  2. Limitations. A typical driver for data lake consideration is a reduction in storage costs. While “the more data, the better” sounds wonderful, there are always limits. And the limits tighten as the business use case becomes more specific. Processing, response time, access, and security are typical project needs where a data lake is likely to fall short. As a project becomes more defined, do not be afraid to move from the lake to another environment. Just as with the coffee can, the data lake cannot be exposed to the elements, because exposure puts the contents at risk. And, similar to digging through the coffee can without gloves, digging in the data lake without protection hurts too. Most certainly, the data lake should not be accessible to those who could hurt themselves (or the company).
  3. Value. The data lake serves as a fantastic storage area for data with potential use and purpose. However, it does not deliver value until the data is used. For ad hoc projects, data from the lake may be used in a once-and-done fashion and deliver immediate value. In these instances, even “good enough” fits the bill.

But in extensive projects where good enough does not work, where security, scalability, and repeatability matter, the data lake provides value as a sandbox environment or testing ground, helping you avoid mistakes that would be costly if deployed at scale. Once specific data needs are determined, these projects should be scaled and operationalized in more appropriate, sustainable, and secure data environments.

As you consider your data lake options, consider these lessons from the tool bench coffee can. A controlled, managed environment to store data “that might come in handy one day” might not be such a bad idea after all.


