Big News with Big Data - Commercial Enhancements to Apache Hadoop Usher in a New Era

Apr 11, 2012

By Jack Norris

Big Data. It's a term used to characterize those applications that have such enormous data sets, they have exceeded the capabilities and capacities of traditional database management systems. These huge data sets - measured in terabytes, petabytes, exabytes, and zettabytes-must instead span large clusters of servers and storage arrays. Search engine companies were the first to face this situation, and the result is an open source solution called Hadoop.

The Past, Present and Future of Hadoop

Hadoop was created in 2004 by Doug Cutting (who named the project after his son's toy elephant). Initiatives from a few years earlier at both Google and Yahoo! contributed key elements, including a distributed file system and the framework necessary for processing the huge, distributed datasets. In 2006, Hadoop became a subproject of Lucene (a popular text search library) at the Apache Software Foundation, and became its own top-level Apache project in 2008.

Essentially, Hadoop provides a way to capture, organize, store, search, share, analyze and visualize disparate data sources (unstructured, semi-structured, etc.) across a large cluster of computers, and is designed to scale up from a handful to thousands of servers, each offering local computation and storage.

While there are numerous elements that are now part of Hadoop, two are fundamental to its operation. The first is the Hadoop Distributed File System (HDFS), which serves as the primary storage system. HDFS replicates and distributes the blocks of source data to the compute nodes throughout a cluster to be analyzed by one or more applications. The second is MapReduce, which creates a software framework and a programming model for writing applications capable of processing the vast amounts of distributed data in parallel on very large clusters.

The open source nature of Apache Hadoop creates an ecosystem that facilitates constant advancements in its capabilities, performance, reliability and ease of use. These enhancements can be made by any individual or organization-a global community of contributors-and are then either contributed to the basic Apache library or made available in a separate (often free) distribution. One of the major enhancements made recently is the ability to use the industry standard Network File System (NFS).

Hadoop and NFS - A Marriage of Convenience

Having a unique file system is sometimes necessary to satisfy the particular needs of a demanding application, or to improve its reliability or performance. Such was the case with big data when, at the time HDFS was created, the easiest way to implement MapReduce functionality was with its own distributed file system capable of handling enormous datasets.

Distributed file systems like HDFS utilize a network of nodes to partition a very large dataset across multiple servers. The inherent complexity of a networked environment makes distributed file systems more difficult to use than simple, disk-based file systems. The management complexity in a larger cluster involves the use of a dedicated HDFS namenode server to host the file system index. One of the fundamental problems with HDFS is its lack of random read/write file access by multiple readers and writers. Like a conventional CD-ROM, HDFS prevents files from being modified once they have been written; files must be written in sequential order; and files cannot be read before they are fully written and closed. While some Hadoop distributions support appending data at the end of an existing file, this is normally only possible by the original application; in other words, there is no way for multiple applications to write data to a single file.

Additional limitations of HDFS also result from its requirement for batch processing of data movement and management. The typical application first logs the data to direct- or network-attached storage. On some predetermined time interval, data is then loaded by a batch process into HDFS. At this point, the analytics are run creating a result set, which must then be unloaded - also by a batch process - for any additional analysis. These batch processes cause a significant time lag between data production by the application and analysis in the Hadoop cluster. Loading data at a higher frequency minimizes the time lag, but results in a larger number of small files that increase management complexity, consume storage resources, and can ultimately exceed the maximum scalability of some Hadoop distributions.

Recent advances have made it possible to use NFS directly in a way that is compatible with MapReduce and other Hadoop applications. With NFS, any remote client can simply mount the cluster, and application servers can then write their data and log files directly into the cluster, rather than writing first to direct- or network-attached storage. This is made possible by a re-architected storage services layer. This new lockless storage service provides fully random read/write storage that can be accessed through the HDFS application programming interface (API) of course, but also through the NFS.

Replacing HDFS with a new storage services layer affords many advantages, particularly because the storage services layer provides NFS support along with 100% compatibility at the API layer for MapReduce and HDFS. Here is a summary of some additional advantages that derive from the convenience associated by this new storage services layer:

Direct access to NFS, with its real-time read/write data flows, simplifies operation and can improve performance dramatically.
Unlike with the write-once HDFS, files can be modified, overwritten and read as required, potentially with multiple concurrent reads and writes on any file.
Graphical file browsers can be used to access and manipulate cluster data. Users can simply browse files, automatically open associated applications, or drag-and-drop files and directories into and out of the cluster.
Files in the cluster can be edited directly by ordinary text editors and/or integrated development environments.
Standard command-line tools and UNIX applications and utilities (such as rsync, grep, sed, tar, sort and tail) can be used directly on data in the cluster. With HDFS, by contrast, the user must either develop custom tools, or copy the data out of the cluster in order to use standard tools.
The use of NFS can reduce the need for log collection tools, such as Flume that often require agents on every application server, by making it possible to either write data directly into the cluster or use standard tools like rsync to synchronize data between local disks and the cluster.
Application binaries, libraries and configuration files can be stored inside the cluster and accessed directly, greatly simplifying operation.

This architectural change drives other advantages, as well. For example, while data compression can be added to Apache Hadoop using third-party tools, it is difficult to install and use, and does not normally utilize the cluster's full capabilities. Data is typically compressed manually before copying it into the cluster and a special MapReduce job is then run to index the compressed data (assuming parallelism is desired in the application). Applications must also be modified so that they can access the compression indices. The re-architected storage services layer provides automatic and transparent compression, resulting in improved performance, simplified development and considerable savings in storage costs.

Another example can be found in those organizations that operate multiple Hadoop clusters to separate different data or applications, or to improve performance. A Hadoop distribution re-architected to expose an NFS interface can provide direct access among multiple clusters, and these clusters can be accessed from both inside and outside of Hadoop. An organization may have separate clusters for development and test, which are also separate from the production cluster(s). With NFS, the files on each cluster's directory are accessible by all others. Files can easily be copied between and among the clusters as desired with a simple copy command.

With NFS support the process is much, much simpler. With the conventional Hadoop configuration shown in Figure 1, there are many intermediate steps where data needs to be written to the cluster. These writes and all subsequent accesses are made via the HDFS API. This requires entire workflows to be rewritten and changed to accommodate the limitations and inefficiencies of HDFS. Contrast this complexity with the simplicity of the distribution supporting NFS access shown in Figure 2. Different steps in the workflow can access data (read and/or write) directly via NFS. This vastly simplifies how new applications can be written and, more importantly, how existing applications and algorithms can easily benefit from the power and scale of Hadoop. A more detailed and animated version of this example can be found on YouTube at http://www.youtube.com/watch?v=Wlf0wSh0Pmk.

Conclusion

For Hadoop to be viable for more applications and larger workloads, it must be easier to provision, use and manage at scale. The integration of NFS access within a Hadoop distribution does just that by making it much easier to move data into and out of Hadoop clusters, and by making it possible to manage very large clusters with a small staff. The combination of NFS support and HDFS compatibility at the API level also provides full investment protection and improved price/performance, making the case for NFS access quite compelling indeed.

About the author:

Jack Norris is vice president of marketing at MapR Technologies.