Getting Big Return From Big Data

Q&A with Shaun Connolly, Hortonworks’ VP of Corporate Strategy

Founded in 2011, Hortonworks provides the Hortonworks Data Platform, built on Apache Hadoop and architected for the enterprise. With the Hadoop ecosystem expanding rapidly, Hortonworks’ Shaun Connolly recently discussed how Hadoop and related technologies are growing and being used, as well as the factors for a successful deployment.

In your view, how far along is Hadoop deployment in the enterprise?

Shaun Connolly: We are seeing a clear change from a discussion of technology feature/function to a discussion of high-value use cases across every industry, and that is usually indicative of an early majority type of adoption curve. At Hadoop Summit, we discussed two separate data points that support that view. Tom DelVecchio, founder of Enterprise Technology Research, surveys about 700 CIOs on a quarterly basis and his research shows that Hadoop is the top area of expected IT spending across the 25 major sectors he covers. Gartner also recently published numbers on Hadoop adoption at about 26%, with another 11% expected in the coming year.

If a company has not yet deployed Hadoop at this point, why should it consider doing so now?

SC: It really is about the fact that we are in the midst of a sea change in the world of data management. We talk about this concept of a new data paradigm that includes Internet of Things data as well as social, clickstream, geolocation, and other new data types, in addition to the more structured data types that have traditionally powered ERP, CRM, and supply chain systems—the transactional systems in the enterprise. The existing systems aren’t equipped for this new world of data and the tsunami it brings. And so, a new data architecture has emerged in which Hadoop plays a key role, because its cost structure and technical architecture allow it to store and process the full fidelity of data over a much longer period.

We can dovetail that fact with where we are in the adoption curve and with the transformational use cases emerging across all the major industries. The retail sector is able to transition from mass branding to individualized experiences that are real time. The financial sector is able to move from daily risk analysis to real-time trade surveillance that saves millions from a risk and fraud perspective. And healthcare is moving from mass treatment to personalized delivery. When you map those transformational use cases onto the question of why a company should consider deploying Hadoop, fundamental to the answer is that each industry is being impacted by a new data architecture with Hadoop as part of that equation.

What are the barriers to Hadoop adoption that companies are grappling with?

SC: The biggest barrier is figuring out which use cases to start with. At Hadoop Summit we had about 75 presentations from end users sharing use cases, a dramatic uptick from last year. You need a lot of those stories with people saying, “I wasn’t able to do this before, and now I can join, aggregate, and analyze these new data sources with my existing data sources in ways that I have not been able to before.”

Early on, many of the use cases started as purely a cost optimization play; Hadoop is, at times, 10x to 100x cheaper than some of the more traditional data management platforms. But beyond cost, Hadoop unlocks advanced analytics, predictive, and data discovery and exploration use cases that were not possible before. That is one barrier to adoption—the need to demystify those use cases.
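The kind of analysis described here, joining a new semi-structured data source with an existing structured one and aggregating the result, can be sketched in a few lines. The field names and data below are hypothetical, and in practice this would run as a Hive or Spark job over data stored in Hadoop rather than in plain Python:

```python
# Hypothetical sketch: join new clickstream events (semi-structured data)
# against existing CRM records (structured data), then aggregate.
from collections import defaultdict

# "Existing" structured data: customer records keyed by customer id.
crm = {
    "c1": {"segment": "retail"},
    "c2": {"segment": "enterprise"},
    "c3": {"segment": "retail"},
}

# "New" data source: raw clickstream events.
clickstream = [
    {"customer": "c1", "page": "/pricing"},
    {"customer": "c2", "page": "/docs"},
    {"customer": "c1", "page": "/docs"},
    {"customer": "c3", "page": "/pricing"},
]

def views_by_segment(crm, events):
    """Join each event to its CRM record, then count page views per segment."""
    counts = defaultdict(int)
    for event in events:
        record = crm.get(event["customer"])
        if record:  # inner join: drop events with no matching customer
            counts[record["segment"]] += 1
    return dict(counts)

print(views_by_segment(crm, clickstream))  # {'retail': 3, 'enterprise': 1}
```

The value is not the join itself, which any database can do, but that Hadoop's cost structure makes it feasible to keep the raw events at full fidelity and rejoin them in new ways later.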

Second, there is a skills gap. But frankly, I have seen DBAs, application developers, and others ramp up fairly quickly, so I think supply and demand will cure that one as people retool their skills. And last, there are the capabilities of the platform around operations, security, and governance. But that is being addressed at the open source level, and it is probably a distant third right now.

What are the differences between deploying a commercial distribution versus open source Apache Hadoop?

SC: The great thing about open source is that you can download a particular piece of technology, get started, and get familiar with it. Apache Hadoop is one such project. But if you look at the broader open enterprise Hadoop platform space, Apache Hadoop is just one open source project out of, in our platform’s case, more than two dozen Apache projects.

As needs get richer and you want Apache Hive for SQL access, Apache HBase for a NoSQL database, and Apache Kafka and Apache Storm for Internet of Things and real-time stream processing use cases, you are greeted with an integration and certification complexity that a commercial provider such as Hortonworks can remove by building all of this into a consumable, enterprise-grade platform that is still 100% open source. As the choice of components gets richer and more complicated, that tends to drive people to a vendor for support in using those projects as a cohesive whole.

How does a company decide whether to deploy Hadoop in the cloud or on-premise?

SC: That is a classic question—should Hadoop be deployed up there or down here? And my answer has been yes to both. My background is in application platform and cloud technologies, and for a while I have been seeing the convergence of on-premise and cloud use cases. In the case of big data processing in Hadoop, I don’t think it is any different from other sectors. But my answer tends to come from two perspectives. One is the application lifecycle, and the other is the actual nature of the use case.

How do these affect deployment?

SC: The application lifecycle starts with development, test, and initial pilot, and rather than procuring hardware before you even know how much Hadoop you need, so to speak, you can get started in the cloud on demand, and prove out your initial use cases. Even after deployment, we see customers with large on-premise Hadoop clusters also continue to do their dev-test work in the cloud just because it is on-demand and the operational cost is much more efficient for them.

The use case point really gets more into the “data gravity” notion of whether an enterprise has a lot of existing data on-premise or has a strong preference because they are more conservative in how they want their data to be managed and really want their data to be on-premise. That use case will drive Hadoop deployment down here. Over the past few years that we have been in the market, clearly, 85%–90% of our deployments have been on-premise. But with the IoT use cases, we are seeing data being born “up there” in the cloud so I expect that mix to change over the coming years, and we may see companies starting in the cloud and staying in the cloud because that is where the data is born.

Hortonworks is partnering with Microsoft on Azure. Is this the endorsed approach for the cloud?

SC: Hortonworks was founded in July 2011. Our partnership with Microsoft started at the tail end of 2011. In large part it was focused on making sure that Hadoop is available everywhere: not just on Linux but also on Windows for the Windows ecosystem, and not just on-premise but also in the cloud as a first-class service. We wanted to make sure that users have a consistent experience no matter where they choose to deploy. So we have a very strong partnership with Microsoft. Microsoft definitely has a lot of tools and technologies in the business intelligence and machine-learning spaces and, increasingly, Azure services that are relevant for use with Hadoop in the cloud, so there is a richness of capabilities for enterprises to tap into, as well as a portability and compatibility experience. We also have customers that deploy on the other cloud services—Amazon, the Google Cloud Platform, as well as OpenStack cloud platforms from Rackspace and others.

In your view, what are some of the key accompanying technologies that go hand in hand with the core Hadoop technology?

SC: When people say Hadoop, they tend to think of just Apache Hadoop, but in our platform there are more than two dozen other Apache projects that surround Hadoop to make it an enterprise data platform. In the past few years, the YARN component in Apache Hadoop has unlocked it as a broad data operating system. And so you see real-time stream processing, online data serving, NoSQL databases, and interactive SQL solutions that all plug in natively and run on that YARN data operating system component of Hadoop.

There is also Apache Ambari for consistent operations, Apache Ranger for comprehensive security, and a new project called Apache Atlas for trusted data governance. It is a rich ecosystem of enterprise-class capabilities. A lot has happened in the past 5 years which has been fantastic to watch.

In addition, we now have Hortonworks DataFlow (HDF), the result of our announcement in August that we were acquiring Onyara, the lead contributor to the Apache NiFi open source project. Apache NiFi was developed within the NSA over 8 years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program back in 2014. Hortonworks DataFlow, powered by Apache NiFi, is a new support subscription and is complementary to the Hortonworks Data Platform, powered by Apache Hadoop. Through the combined use of the two, data at rest as well as real-time data in motion can now be blended to provide historical and perishable insights for predictive analytics.
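The pattern Connolly describes, blending data at rest with data in motion, can be illustrated conceptually: a baseline computed from stored historical data scores events as they arrive in the stream. This is a hypothetical sketch in plain Python, not the NiFi or Hadoop API; the sensor readings and threshold are invented for illustration:

```python
# Conceptual sketch: a baseline derived from "data at rest" scores
# "data in motion" as it arrives, flagging sharp deviations.
from statistics import mean, stdev

# Data at rest: historical sensor readings stored in the cluster.
historical = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]

baseline = mean(historical)
spread = stdev(historical)

def score_event(reading, threshold=3.0):
    """Flag an in-motion reading that deviates sharply from the at-rest baseline."""
    z = abs(reading - baseline) / spread
    return "alert" if z > threshold else "ok"

# Data in motion: events arriving in real time.
stream = [10.1, 10.4, 14.2, 9.7]
print([score_event(r) for r in stream])  # ['ok', 'ok', 'alert', 'ok']
```

In a real deployment, the historical side would live in the Hadoop cluster and the streaming side would flow through NiFi, Kafka, or Storm; the sketch only shows why the two halves are complementary.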

If you could give someone embarking on a big data project one piece of advice, what would that be?

SC: Everybody makes a big deal out of this “data lake” notion. My big advice is: Don’t look to fill your data lake right away. Start small. The successful customers we see are the ones that focus on advanced analytic applications aligned with specific business units and lines of business. Two of those projects become four, then eight, and you get a successful adoption path. Take an application-led approach. Don’t make filling the cluster your primary objective; let business transformation and interesting applications be the driver.

