Don't Run Mission-Critical Applications on Hadoop Without Knowing This

May 8, 2014

By Michele Nemschoff

Apache Hadoop’s computing and storage capabilities are persuading many CIOs to move to this distributed computing architecture. Unfortunately, the unreliability of some Hadoop distributions is overshadowing its remarkable qualities. This leaves many businesses that want to run mission-critical applications with an inflated sense of risk.

Foundational Necessities for Hadoop Distributions for Optimal Dependability

Every software system will eventually have problems. An enterprise-grade Hadoop infrastructure makes minimizing and managing these errors a priority.

When considering a distribution’s dependability, evaluate how it addresses these five foundational necessities:

Removing the Single Point of Failure (SPOF)

The default Hadoop architecture is designed with a single NameNode. This one master NameNode is responsible for holding metadata and locating all of the data on the cluster's DataNodes. This makes the NameNode a SPOF. If the NameNode were to fail, the entire cluster would be unavailable. This is a major red flag for businesses wanting to run mission-critical applications.

Some distributions use secondary NameNodes as a solution to this dependability problem. Unfortunately, secondary NameNodes cannot be relied upon as a failsafe. They only copy metadata from the primary NameNode periodically, so their copy can be out of date when a failure occurs. The best way to avoid a SPOF is to use a Hadoop implementation with a distributed metadata architecture. This removes the NameNode and distributes the metadata in a way that provides high availability (HA) with automatic failover.
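
To make the staleness problem concrete, here is a minimal sketch of a periodic metadata copy falling behind its primary. It is a toy model only: the class names PrimaryMetadata and PeriodicCheckpoint, the paths and the block IDs are hypothetical and do not correspond to any Hadoop API.

```python
class PrimaryMetadata:
    """Toy stand-in for a master node's path-to-blocks metadata."""

    def __init__(self):
        self.entries = {}

    def record(self, path, blocks):
        self.entries[path] = list(blocks)


class PeriodicCheckpoint:
    """Copies the primary's metadata only when checkpoint() is called."""

    def __init__(self, primary):
        self.primary = primary
        self.copy = {}

    def checkpoint(self):
        self.copy = {p: list(b) for p, b in self.primary.entries.items()}

    def missing_paths(self):
        # Paths the primary knows about but the checkpoint does not.
        return sorted(set(self.primary.entries) - set(self.copy))


primary = PrimaryMetadata()
secondary = PeriodicCheckpoint(primary)

primary.record("/logs/a", ["blk_1", "blk_2"])
secondary.checkpoint()                  # the periodic copy happens here
primary.record("/logs/b", ["blk_3"])    # written after the checkpoint

# If the primary failed now, the checkpoint would not know about /logs/b.
lost_if_primary_fails_now = secondary.missing_paths()
print(lost_if_primary_fails_now)
```

Anything recorded between two checkpoints exists only on the primary, which is why a periodic copy alone cannot serve as a failover target.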

Minimizing Moving Parts

Reducing the number of moving parts in any software system increases both performance and dependability. For example, inefficiencies from navigating through several layers in your HBase environment can undermine its reliability. Instead, you should work within an HBase environment that doesn’t require a series of separate layers such as the HBase Master and RegionServer or the Java Virtual Machine.

Minimizing Manual Tasks

Automating functions such as data compactions, administration and pre-splitting will save you time. It will also decrease the possibility of human error.

Ensuring Your Data’s Integrity

Use internal checksums, replication, data protection and disaster recovery features like true point-in-time recovery snapshots and mirroring. All of these things work together to enhance your data’s integrity and to protect your data.
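
As a rough illustration of how checksums and replication work together, the sketch below stores a CRC32 value with each block and falls back to another replica when verification fails. The function names and the dictionary layout are assumptions made for this example, not the storage format of any particular distribution.

```python
import zlib


def store_block(data: bytes) -> dict:
    """Keep a CRC32 checksum next to the block's bytes."""
    return {"data": data, "crc": zlib.crc32(data)}


def is_intact(block: dict) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(block["data"]) == block["crc"]


def read_with_fallback(replicas: list) -> bytes:
    """Return the first replica that passes verification."""
    for block in replicas:
        if is_intact(block):
            return block["data"]
    raise IOError("all replicas failed checksum verification")


original = store_block(b"hello hadoop")
# Simulate silent corruption: the bytes changed but the stored checksum did not.
corrupted = {"data": b"hellO hadoop", "crc": original["crc"]}

recovered = read_with_fallback([corrupted, original])
print(recovered)
```

The checksum detects the damaged copy, and the extra replica is what makes recovery possible. Neither feature protects the data on its own.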

Optimizing Task Workflow

Small ad hoc queries can get stuck behind large jobs that you schedule in advance. Your runtime environment should optimize this process and put the smaller jobs first.
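
One simple way to picture "smaller jobs first" is a priority queue ordered by estimated job size. The sketch below assumes each job arrives with an estimated task count; the job names and numbers are invented, and real schedulers use more elaborate policies than plain shortest-job-first.

```python
import heapq


def run_order(jobs):
    """jobs: list of (name, estimated_tasks) in submission order.

    Returns job names with the smallest estimate first; ties keep
    their submission order.
    """
    heap = [(size, index, name) for index, (name, size) in enumerate(jobs)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


submitted = [("nightly_etl", 5000), ("adhoc_query", 12), ("weekly_report", 800)]
order = run_order(submitted)
print(order)
```

Here the ad hoc query runs first even though it was submitted after the large scheduled job.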

Understanding Where High Availability Must Integrate With Your Hadoop Implementation

High availability (HA) refers to a system that remains continuously operational even when individual components fail.

In order for your Hadoop implementation to handle mission-critical applications, it requires the following five HA capabilities:

HA Should Be Default

When evaluating a Hadoop distribution, check that HA is the default behavior. Taking advantage of HA shouldn’t require any further action on your part.

Distributing the Metadata

As discussed earlier, creating a SPOF by using a single NameNode puts your system’s availability at risk. A distributed metadata architecture removes that risk.

HA in MapReduce

Seamless automation features should handle any failure within MapReduce gracefully. A lag in task completion should never hinder your SLAs. Make sure manual restart requirements aren’t stifling your system’s availability.

Minimized Recovery Time

If your data center suffers a site-wide failure, how long will it take to recover and access your files? If you are comparing Hadoop distributions, this is a great differentiator. Look for a distribution that uses mirroring as a method of disaster recovery. Mirroring increases system availability by ensuring a remote replica is ready to take the place of the primary cluster after a site-wide failure.

Rolling Upgrades

Rolling upgrades allow you to incrementally install system updates without disrupting availability.
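
The idea can be sketched as upgrading one node at a time while checking that enough nodes keep serving. The function, node names, version strings and the min_serving threshold below are hypothetical and only illustrate the principle.

```python
def rolling_upgrade(nodes, new_version, min_serving):
    """Upgrade nodes one at a time.

    nodes: dict of node name -> version, updated in place.
    Returns how many nodes were still serving during each step.
    """
    if len(nodes) - 1 < min_serving:
        raise ValueError("too few nodes to upgrade without losing availability")
    serving_counts = []
    for name in sorted(nodes):
        # Only the node being upgraded is offline during its step.
        serving = [n for n in nodes if n != name]
        serving_counts.append(len(serving))
        nodes[name] = new_version
    return serving_counts


cluster = {"node1": "1.0", "node2": "1.0", "node3": "1.0"}
serving_counts = rolling_upgrade(cluster, "1.1", min_serving=2)
print(cluster, serving_counts)
```

At no point does the number of serving nodes drop below the threshold, which is the property that distinguishes a rolling upgrade from a stop-the-cluster upgrade.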

Protecting Your Data

The methods you use to protect your data should be simple and dependable. Replication and snapshots are the two standard approaches. Most distributions use both, but with major differences in dependability.

Default replication makes three copies of data. The preferred method of replication is entirely automated and copies not just the data but the metadata as well. You should store at least one of the three replicas on a separate rack from the other two.
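
The rack rule can be sketched as a small placement function: put two replicas on one rack and the third on a different rack. The function, rack names and node names are made up for illustration and do not reproduce any distribution's actual placement policy.

```python
def place_replicas(nodes_by_rack, replicas=3):
    """Choose nodes so that one replica lands on a different rack.

    nodes_by_rack: dict of rack name -> list of node names.
    Returns a list of (rack, node) pairs.
    """
    # Prefer the rack with the most nodes for the co-located copies.
    racks = sorted(nodes_by_rack, key=lambda r: (-len(nodes_by_rack[r]), r))
    if len(racks) < 2:
        raise ValueError("need at least two racks")
    local, remote = racks[0], racks[1]
    if len(nodes_by_rack[local]) < replicas - 1 or not nodes_by_rack[remote]:
        raise ValueError("not enough nodes for the requested replicas")
    chosen = [(local, node) for node in nodes_by_rack[local][:replicas - 1]]
    chosen.append((remote, nodes_by_rack[remote][0]))
    return chosen


cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
placement = place_replicas(cluster)
print(placement)
```

With this layout, losing an entire rack still leaves at least one copy of the data reachable.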

As with replication, not all snapshot techniques are created equal. To create a completely accurate snapshot within HDFS, you must close all files at the time you take the snapshot. Point-in-time (PIT) recovery snapshots are a far superior alternative to the HDFS solution. PIT snapshots do not require you to close files that are still receiving appended data. Files are also never duplicated, so additional storage isn’t needed. This high-performance and space-efficient alternative can create a snapshot of a petabyte volume within seconds.
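
A snapshot that shares blocks instead of copying them can be sketched as follows. This is a toy model under stated assumptions: the Volume class and block IDs are hypothetical, and the code only shows why appends after a snapshot neither change the snapshot nor duplicate file data.

```python
class Volume:
    """Toy volume: files are immutable tuples of block IDs."""

    def __init__(self):
        self.files = {}      # path -> tuple of block IDs
        self.snapshots = {}  # snapshot name -> frozen path-to-blocks view

    def append(self, path, block_id):
        # Build a new tuple; existing tuples (and snapshots) are untouched.
        self.files[path] = self.files.get(path, ()) + (block_id,)

    def snapshot(self, name):
        # Copies only the path-to-blocks map, never the blocks themselves.
        self.snapshots[name] = dict(self.files)


vol = Volume()
vol.append("/data/events", "blk_1")
vol.snapshot("before_append")          # file is still open for appends
vol.append("/data/events", "blk_2")

snapshot_view = vol.snapshots["before_append"]["/data/events"]
live_view = vol.files["/data/events"]
print(snapshot_view, live_view)
```

The snapshot keeps pointing at the blocks that existed when it was taken, while the live file continues to grow. Both views reference the same blk_1, so no file data is stored twice.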

In conclusion, making Hadoop ready for your mission-critical applications should require no further effort on your part other than ensuring your Hadoop distribution has incorporated the previously covered capabilities.

About the author

Michele Nemschoff, who is vice president of corporate marketing at MapR Technologies, received an MBA from Stanford University and a BS in Economics from The Wharton School.