Riding Two Major Apache Hadoop Trends: Cloud and Standards


Over the past year, we have witnessed a sea change in how enterprises think about Apache Hadoop and big data. Just 12 months ago, a majority of enterprises were still committed to building their own on-premises Apache Hadoop implementations and were agonizing over which distribution to select, weighing which constellation of applications would run with that particular distribution, since there were no standards for Apache Hadoop. Today, a majority of enterprises are thinking cloud first, not on-premises, and are increasingly relying on ecosystem standards to drive their Apache Hadoop distribution selection.

Apache Hadoop Is Moving to the Cloud—Fast

Historically, Apache Hadoop implementations have been on-premises. IT teams wanted complete control, and enterprises were willing to accept the high implementation costs, the need to hire a full Apache Hadoop operations team, and the burden of integrating, managing, and upgrading all software themselves. Security concerns often made companies hesitate before considering the cloud. However, this is rapidly changing. According to a Gartner special report, “Hybrid Cloud: The Shift From IT Control to IT Coordination,” which surveyed hundreds of enterprises, Apache Hadoop is rapidly moving to the cloud: more than 50% of the enterprises surveyed say that they will use cloud Apache Hadoop or a hybrid cloud environment for Apache Hadoop.

Why cloud now? The shift comes from hard lessons learned running Apache Hadoop on-premises. Big data in the cloud makes sense for four reasons:

First, you can get started more quickly and at far lower cost. On-premises implementations require buying hardware, integrating software, setting up networking, and hiring an Apache Hadoop operations team. With a cloud implementation, you can have multiple terabytes of data and the processing power to match at your fingertips within hours.
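As a rough illustration of how little stands between you and a running cluster, the sketch below provisions a small managed Hadoop cluster with a single API call, using AWS EMR through the boto3 library as one example provider; the instance types, release label, IAM roles, and log bucket are illustrative placeholders, and other clouds offer equivalent APIs.

    # A minimal sketch, assuming an AWS account with EMR permissions.
    # Instance types, release label, roles, and the log bucket are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="quickstart-hadoop",
        ReleaseLabel="emr-6.10.0",                # illustrative EMR release
        Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 4,                   # 1 master + 3 workers
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after startup
        },
        JobFlowRole="EMR_EC2_DefaultRole",        # default roles, created once per account
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://my-logs-bucket/emr/",        # placeholder bucket
        VisibleToAllUsers=True,
    )
    print("Cluster starting:", response["JobFlowId"])

Minutes after a call like this, the cluster is bootstrapping; the equivalent on-premises build-out involves weeks of procurement and integration.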

Second, it is easier to scale an implementation when it is in the cloud. With an on-premises implementation, growing data volumes demand more hardware, more people, and more cash. With the cloud, scaling is as simple as ramping up your account. Some solutions offer “compute bursting,” or elastic scaling, in which you don’t need to order more nodes at all: the solution recognizes the need for more resources and provides them automatically.

As a side note, it’s also important to be able to scale down when needed. Scaling down an on-premises deployment is rarely practical, since the hardware has already been purchased and may be needed again soon. A cloud solution, however, can quickly ramp down when the processing power is not needed, and with elastic scaling the size of your implementation is reduced automatically, so you only pay for what you actually use (see the sketch below).
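As a hedged sketch of what elastic scaling looks like in practice, the snippet below attaches capacity limits to a managed cluster, again using AWS EMR and boto3 as an illustrative provider; the cluster ID and the capacity bounds are placeholders, and the service then adds or removes instances between the floor and the ceiling based on workload.

    # A minimal sketch of elastic scaling, assuming an existing EMR cluster.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.put_managed_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
        ManagedScalingPolicy={
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 3,    # floor when the cluster is idle
                "MaximumCapacityUnits": 50,   # ceiling for peak bursts
            }
        },
    )

With bounds like these in place, bursts are absorbed automatically and idle capacity is released without anyone filing a hardware order.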

Third, cloud vendors ensure that you have the latest production-ready capabilities. This could mean the latest hardware as well as the latest Apache Hadoop ecosystem projects. On-premises implementations can get “frozen in time”: nobody wants to go through the pain of upgrading, or teams are too concerned about disrupting existing data pipelines. Cloud vendors are motivated to ensure that their customers have the best proven software capabilities at their disposal.

And fourth, some cloud vendors provide Apache Hadoop operations support to their customers. This means that enterprises don’t have to hire a large Apache Hadoop team (if they can even find the right people), and data scientists do not have to double as operations people. Operations is the key to ongoing Apache Hadoop success, since operational excellence is what ensures that critical data pipelines continue to run at top performance.

Not All Cloud Vendors Are Alike

Many organizations start with an automated Apache Hadoop cloud, since they can sign up immediately with just a credit card, no sales conversation required. Automated clouds offer immediate gratification and are often the entry point for those new to Apache Hadoop. However, automated providers offer only one type of Apache Hadoop cloud, and enterprises might find that other types of vendors better meet their needs.

After testing automated Apache Hadoop clouds for a while, enterprises often mature to a different cloud offering. Why? Automated cloud vendors often compete on price, so they run on generic or sub-optimal infrastructure that requires data to move around more than it needs to. As a result, performance suffers, and, despite the low per-minute compute pricing, the inefficiency ends up costing quite a bit.

In addition, automated Apache Hadoop clouds don’t provide any operational support; you are fully responsible for operations management. If a job fails, you are on your own to figure out why and to fix it, with no guidance from the vendor. Since these vendors run on generic infrastructure, job failure rates can be high, leaving your operations team busy fixing jobs. Meanwhile, pricing is prone to unpredictable spikes, putting your budget at risk.

Finally, many of these cloud providers do not conform to emerging standards for Apache Hadoop, and they charge you to pull your data out. This leaves companies concerned about being locked into a solution they can’t grow with and that could be expensive to leave.

Another option is bring-your-own Apache Hadoop clouds (BYO clouds). Some major Apache Hadoop distribution providers have announced BYO cloud efforts, and, while these efforts are just emerging, they may provide a more seamless experience in a hybrid environment, since you can run the same distribution on-premises and in the cloud. In addition, a vendor’s membership in ODPi, a nonprofit organization committed to simplification and standardization of the big data ecosystem, signals that its distribution adheres to standards for Apache Hadoop, addressing concerns about lock-in.

Finally, there are fully managed Apache Hadoop cloud vendors, sometimes known as the “concierge cloud,” that provide Apache Hadoop operations services with their offerings. Because the vendor runs operations for the customer, these services can deliver higher performance, helping to increase job success rates, reduce job completion times, lower overall costs, and keep the data science team happier.

Cloud May Not Be Best

While the arguments for leveraging the cloud for Apache Hadoop are compelling, there are still specific use cases in which an on-premises deployment is the better fit. For customers in heavily regulated industries, security remains a common concern. While cloud providers have reassured key financial services and healthcare sectors with proof of security, such as SOC 2 certification, HIPAA compliance, and PCI compliance, it may still be more practical to consider an on-premises solution.

In addition, data privacy laws in regions such as the EU often make a cloud deployment prohibitive. These laws require certain data to always reside within a defined region, and not all cloud providers are equipped to provide such guarantees. As a customer evaluating cloud deployments, be sure to understand all the legal and regulatory requirements at play that could affect your use of the cloud. Experienced service providers and vendors that work in your industry and/or region can help provide guidance on leveraging the cloud for your organization.

The Whole Ecosystem

Cloud vendors help solve some of the challenges of Apache Hadoop, such as the capital costs of getting started, the challenges of keeping up with a rapidly evolving ecosystem, and the heavy burden of managing and operating the solution. Standards for Apache Hadoop, as set by the ODPi, address another key issue—helping enterprises avoid vendor lock-in and ensuring that applications can run on the distribution that you have selected.

The question that many ask here is, “Why are standards important?” Doesn’t the open source nature of the upstream components encourage these projects to be of the highest quality and to continue developing and innovating in a way that benefits the ecosystem as a whole?

It is important to understand that Apache Hadoop isn’t a monolithic project but a collection of associated open source projects, often combined with proprietary glue or components, that together compose a distribution. How these components come together across vendor distributions is far from consistent, and the pace of development and the strategies for versioning and backward compatibility vary widely. Downstream consumers, including application vendors, solution providers, and end users, rarely have the skills or bandwidth to become experts in the inner workings of Apache Hadoop componentry, so depending on a non-standard core opens an area of risk that ripples through product development, data strategy, and project success.

ODPi specifications serve as a tool to make Apache Hadoop easier for organizations to understand and work with. These specifications define a base level of expectations for anyone who works with an Apache Hadoop cluster or distribution, so consumers know exactly what to expect and what is available for them to use. This doesn’t prevent each distribution from innovating; instead, it gives vendors more time to focus on innovation rather than getting stuck in the rut of compatibility and portability issues.
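To make the idea concrete, here is a hypothetical sketch of the kind of portability check a downstream application might run against any compliant cluster; the specific probes (the hadoop version and classpath commands and the HADOOP_CONF_DIR variable) are standard Hadoop conventions used for illustration, and the authoritative requirements come from the ODPi runtime specification itself.

    # A hypothetical smoke check a downstream application might run on an edge node.
    # The probes below are illustrative; consult the ODPi runtime specification
    # for the actual guarantees a compliant distribution must provide.
    import os
    import subprocess

    def probe(cmd):
        """Run a Hadoop CLI command and return its output, or None if it fails."""
        try:
            return subprocess.check_output(cmd, text=True).strip()
        except (OSError, subprocess.CalledProcessError):
            return None

    version = probe(["hadoop", "version"])        # which Hadoop core is installed
    classpath = probe(["hadoop", "classpath"])    # where client libraries live
    conf_dir = os.environ.get("HADOOP_CONF_DIR")  # where cluster configuration lives

    print("hadoop version :", (version or "not found").splitlines()[0])
    print("classpath found:", bool(classpath))
    print("HADOOP_CONF_DIR:", conf_dir or "not set")

When every compliant distribution answers probes like these the same way, application vendors can certify once instead of once per distribution.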

Standards Matter

Whether you are still committed to building your own on-premises Apache Hadoop implementation or are taking a cloud-first approach, a strong and stable ecosystem and standards matter. While the cloud helps solve some of the challenges of Apache Hadoop, such as the capital costs of getting started and the burden of maintaining and running the solution, standards help solve another: future flexibility and a broader market for your solution.


