How Hadoop is Driving New Information Management Strategies

Unless you have been trapped under an elephant, you have heard a lot about big data. About 5 years ago, the term really gained momentum as IT organizations started to embrace the Hadoop big data management stack, which eliminated previous barriers and costs associated with the traditional data warehouse and storage approach.

Today, most IT teams are either contemplating a Hadoop strategy or are well under way with one, in the hopes of making data actionable. The obvious upside to the business is that all the data it could ever use will now be collected and stored. The bad news is that the business often does not know how to get access to that data or, worse, how to make sense of it.

Depending on the maturity of the deployment, business leaders have gone so far as to question whether they will see any real return from their Hadoop investments beyond storage-related cost savings. What these mature organizations really want to know is whether all of the data being collected will actually be usable to drive business decisions and value.

If the enterprise IT landscape consisted only of Web 2.0 companies like Google, Yahoo, Facebook, and Twitter, then Hadoop would be considered an unqualified success for big data projects. The reality is quite different: many Hadoop undertakings inside mainstream enterprises are still largely deemed “science projects” that have not been integrated into production environments.

In addition, the Hadoop distributions from the major vendors are still largely unable to work with each other. As Hadoop encroaches further into mainstream enterprises, incumbent database vendors are responding more aggressively in an effort to protect their primary turf.

All the while, Hadoop faces a tough time demonstrating its applicability and value inside big data projects across the larger enterprise market, which constitutes a much larger commercial opportunity. According to Statista research, the enterprise software market alone is estimated at nearly $300 billion per year and growing. Yet most business applications have no way of tapping into these vast data pools.

In time, these issues will work themselves out, and they should not deter enterprises from adoption, as Hadoop has already proven itself as a business-critical technology.

However, there are significant organizational issues that need to be addressed for Hadoop big data implementations to mature to the next level:

  • IT must expose the data in a self-service model. While IT has done a fine job building out its ingestion processes, it is no longer good enough to simply collect all the data available. IT teams need to supply “pipes” that allow business users to extract the data they need instead of swimming in a pool full of old, unusable data.
  • The business has to be pragmatic about what data it needs to support its analytics. Since the leaders within the lines of business are most familiar with what intelligence they need out of that data, the onus now lies with them to take Hadoop projects to the next stage. The difficulty is the overwhelming volume and variety of data. Sometimes less is more, and once they realize they don’t need ALL of the data, the situation becomes more manageable. Perhaps they just need to take a portion of the data and combine it with third-party sources to make it more contextual for analytics. For instance, many companies collect web traffic statistics, but not all of that data is useful to every department. Perhaps the marketing team wants to overlay web traffic location data with social media content to see if regional campaigns are driving the intended interactions with online communities. The security team, on the other hand, may need a different portion of that data to understand visitor profiles and trends, which later help it detect anomalous behaviors. In both cases, IT should make it possible for the business to organize and shape the data on the fly without waiting for models to be built or for IT to make the data usable.
  • Both IT and the business must agree to the underlying governance requirements in advance. Since the data is not restricted by schemas, models, or conventions, lines of business will have to work in conjunction with IT to prove that they can safely and consistently extract value from the data. This collaboration is critical if Hadoop is to continue enjoying the limelight.
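
To make the marketing example above concrete, here is a minimal Python sketch of overlaying a slice of web traffic data with a third-party feed by region. It is only an illustration: the sample records, the field names, and the `overlay_by_region` helper are all hypothetical, and a real deployment would read from the Hadoop data lake rather than from in-memory lists.

```python
from collections import Counter

# Hypothetical web traffic records, standing in for data pulled from the lake.
web_traffic = [
    {"visitor_id": "v1", "region": "northeast", "page": "/promo"},
    {"visitor_id": "v2", "region": "northeast", "page": "/home"},
    {"visitor_id": "v3", "region": "southwest", "page": "/promo"},
]

# Hypothetical third-party feed: social media mentions tagged by region.
social_mentions = [
    {"region": "northeast", "text": "Loved the new campaign"},
    {"region": "midwest", "text": "Saw the ad today"},
]


def overlay_by_region(traffic, mentions):
    """Count visits and mentions per region and return them side by side."""
    visits = Counter(row["region"] for row in traffic)
    mention_counts = Counter(row["region"] for row in mentions)
    regions = sorted(set(visits) | set(mention_counts))
    # A Counter returns 0 for a missing key, so regions that appear in
    # only one source still get a complete entry.
    return {
        region: {"visits": visits[region], "mentions": mention_counts[region]}
        for region in regions
    }


summary = overlay_by_region(web_traffic, social_mentions)
for region, counts in summary.items():
    print(region, counts)
```

The point of the sketch is that marketing only needs the region field from the traffic data, not every column that IT collected. The security team would apply the same pattern to a different slice, such as per-visitor activity.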

The demand to access the right data at the right time is what’s driving the need for better tools and automation solutions across the entire analytic process. As organizations continue to look for technology that enables the lines of business to better use data in conjunction with IT, new solutions have emerged. Front and center is a new class of adaptive, self-service data preparation solutions, which simplify, automate, and reduce the manual steps of getting the data into a usable form. This is accomplished without risking loss of control over who uses the data, for what analytics, and how users prepare it for their own consumption.

Self-service data preparation toolsets enable analysts within the business to collaborate and dynamically govern the data integration, data quality and enrichment processes at scale from their Hadoop-based data lake store. With the power of machine learning and sophisticated algorithms, business analysts can be proactively guided through a process that helps to aggregate and enrich the data sets, identify patterns and relationships, and find and fix quality issues without the help of IT.

Self-service data preparation solutions can also offer a data library: a secure environment where business analysts and IT can share data sets with the business, and a one-stop shop for all completed and in-process data prep projects. These libraries can include full versioning and tracking of all uploaded data and published AnswerSets, as well as export logs, which track everything leaving the system and the configurations used. All the data sets and AnswerSets are stored and accessed through the library, which sits on top of the Hadoop Distributed File System (HDFS). This allows two deployment options for data persistence: using an existing Hadoop cluster or creating a dedicated one.
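
As a rough picture of what versioning and export logging involve, consider the following toy Python sketch. It is an assumption-laden illustration, not any vendor's implementation: the `DataLibrary` class, its methods, and the sample data are hypothetical, and everything is held in memory rather than on HDFS.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DataLibrary:
    """Toy in-memory library: keeps every version of a data set and logs exports."""

    versions: dict = field(default_factory=dict)    # data set name -> list of versions
    export_log: list = field(default_factory=list)  # one entry per export

    def publish(self, name, rows):
        """Store a new version of a data set and return its 1-based version number."""
        history = self.versions.setdefault(name, [])
        history.append(list(rows))
        return len(history)

    def export(self, name, user, version=None, config=None):
        """Return the requested version (latest by default) and record the export."""
        history = self.versions[name]
        version = version or len(history)
        self.export_log.append({
            "dataset": name,
            "version": version,
            "user": user,
            "config": dict(config or {}),
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return history[version - 1]


library = DataLibrary()
library.publish("web_traffic", [{"region": "northeast"}])
latest = library.publish("web_traffic", [{"region": "northeast"}, {"region": "southwest"}])
rows = library.export("web_traffic", user="analyst_1", config={"format": "csv"})
```

Because earlier versions are never overwritten, an analyst can still export version 1 after version 2 has been published, and every export leaves a record of who took which version and with what configuration.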

As more and more business professionals struggle with IT-centric information management strategies, it is now clear that to make Hadoop implementations an unqualified success, the business needs to step up to prove the value in partnership with IT.

Innovation within the self-service tools and automation market, such as data prep, is critical to driving the process forward. After all, data that simply sits in a mammoth repository gathering dust, like prehistoric relics to be admired, may meet the same fate as its ancient and now extinct ancestors.

About the Author: Rik Tamm-Daniels is the Vice President of Technology and Partnerships at Paxata. Prior to Paxata, Rik was the Co-founder and Vice President of Technology at Attivio, where he was responsible for the development of technical alliances, partner technical enablement during both pre-sales and delivery, technical aspects of go-to-market strategy development, and OEM and service provider partner program management. Rik is a frequent speaker at industry events and is the author of The Key to Smart Big Data: Know Thy Technology. Rik received his BS and MS in Computer Systems Engineering from Boston University.

Image courtesy of Shutterstock.


